One for Many: Transfer Learning for Building HVAC Control

Authors:
Shichao Xu
Northwestern University
Evanston, USA
shichaoxu2023@u.northwestern.edu
Yixuan Wang
Northwestern University
Evanston, USA
yixuanwang2024@u.northwestern.edu
Yanzhi Wang
Northeastern University
Boston, USA
yanz.wang@northeastern.edu
Zheng O’Neill
Texas A&M University
College Station, USA
zoneill@tamu.edu
Qi Zhu
Northwestern University
Evanston, USA
qzhu@northwestern.edu
ABSTRACT
The design of building heating, ventilation, and air conditioning (HVAC) systems is critically important, as they account for around half of building energy consumption and directly affect occupant comfort, productivity, and health. Traditional HVAC control methods are typically based on creating explicit physical models of building thermal dynamics, which often require significant effort to develop and make it difficult to achieve sufficient accuracy and efficiency for runtime building control and scalability for field implementations. Recently, deep reinforcement learning (DRL) has emerged as a promising data-driven method that provides good control performance without analyzing physical models at runtime. However, a major challenge to DRL (and many other data-driven learning methods) is the long training time it takes to reach the desired performance. In this work, we present a novel transfer learning based approach to overcome this challenge. Our approach can effectively transfer a DRL-based HVAC controller trained for the source building to a controller for the target building with minimal effort and improved performance, by decomposing the design of the neural network controller into a transferable front-end network that captures building-agnostic behavior and a back-end network that can be efficiently trained for each specific building. We conducted experiments on a variety of transfer scenarios between buildings with different sizes, numbers of thermal zones, materials and layouts, air conditioner types, and ambient weather conditions. The experimental results demonstrate the effectiveness of our approach in significantly reducing the training time, energy cost, and temperature violations.
CCS CONCEPTS
• Computing methodologies → Reinforcement learning; • Computer systems organization → Embedded and cyber-physical systems.
KEYWORDS
Smart Buildings, HVAC control, Data-driven, Deep reinforcement
learning, Transfer learning
1 INTRODUCTION
The building stock accounts for around 40% of the annual energy consumption in the United States, and nearly half of the building energy is consumed by the heating, ventilation, and air conditioning (HVAC) system [26]. On the other hand, average Americans spend approximately 87% of their time indoors [15], where the operation of the HVAC system has a significant impact on their comfort, productivity, and health. Thus, it is critically important to design HVAC control systems that are both energy efficient and able to maintain the desired temperature and indoor air quality for occupants.
In the literature, there is an extensive body of work addressing the control design of building HVAC systems [20, 27, 30, 33]. Most of them use model-based approaches that create simplified physical models to capture building thermal dynamics for efficient HVAC control. For instance, resistor-capacitor (RC) networks are used for modeling building thermal dynamics in [20-22], and linear-quadratic regulator (LQR) or model predictive control (MPC) based approaches are developed accordingly for efficient runtime control. However, creating a simplified yet sufficiently-accurate physical model for runtime HVAC control is often difficult, as building room air temperature is affected in complex ways by a number of factors, including building layout, structure, construction and materials, the surrounding environment (e.g., ambient temperature, humidity, and solar radiation), internal heat generation from occupants, lighting, and appliances, etc. Moreover, it takes significant effort and time to develop explicit physical models, find the right parameters, and update the models over the building lifecycle [28].
The drawbacks of model-based approaches have motivated the development of data-driven HVAC control methods that do not rely on analyzing physical models at runtime but rather directly make decisions based on input data. A number of data-driven methods such as reinforcement learning (RL) have been proposed in the literature, including more traditional methods that leverage classical Q-learning techniques and perform optimization based on a tabular Q value function [2, 16, 25], earlier works that utilize neural networks [4, 7], and more recent deep reinforcement learning (DRL) methods [8, 9, 17, 24, 29, 36, 37]. In particular, the DRL-based methods leverage deep neural networks for estimating the Q values associated with state-action pairs and are able to handle larger state spaces than traditional RL methods [28]. They have emerged as a promising solution that offers good HVAC control performance without analyzing physical models at runtime.
However, there are major challenges in deploying DRL-based methods in practice. Given the complexity of modern buildings, it could take a significant amount of training for DRL models to reach the desired performance. For instance, around 50 to 100 months of data are needed for training the models in [28, 29], and 4000+ months of data are used for more complex models [9, 34]. Even if this could be drastically reduced to a few months or weeks, directly deploying DRL models on operational buildings and taking so long before reaching the desired performance is impractical. The works in [28, 29] thus propose to first use detailed and accurate physical models (e.g., EnergyPlus [5]) for offline simulation-based training before the deployment. While such an approach can speed up the training process, it still requires the development and update of detailed physical models, which as stated above needs significant domain expertise, effort, and time.
To address the challenges in DRL training for HVAC control, we propose a transfer learning based approach in this paper, to utilize existing models (that had been trained for old buildings) in the development of DRL methods for new buildings. This is not a straightforward process, however. Different buildings may have different sizes, numbers of thermal zones, materials and layouts, HVAC equipment, and operate under different ambient weather conditions. As shown later in the experiments, directly transferring models between such different buildings is not effective. In the literature, a few works have explored transfer learning for buildings. In [3], a building temperature and humidity prediction model is learned from supervised learning, transferred to new buildings with further tuning, and utilized in an MPC algorithm. The work in [18] investigates the transfer of Q-learning for building HVAC control under different weather conditions and with different room sizes, but it is limited to single-room buildings. The usage of a Q-table in conventional Q-learning also leads to limited memory for state-action pairs and makes it unsuitable for complex buildings.
Our work addresses the limitations in the literature, and develops for the first time a Deep Q-Network (DQN) based transfer learning approach for multiple-zone buildings. Our approach avoids the development of physical models, significantly reduces the DRL training time via transfer learning, and is able to reduce energy cost while maintaining room temperatures within desired bounds. More specifically, our work makes the following contributions:
• We propose a novel transfer learning approach that decomposes the design of the neural network based HVAC controller into two (sub-)networks. The front-end network captures building-agnostic behavior and can be directly transferred, while the back-end network can be efficiently trained for each specific building in an offline supervised manner by leveraging a small amount of data from existing controllers (e.g., a simple on-off controller).
• Our approach requires little to no further tuning of the transferred DRL model after it is deployed in the new building, thanks to the two-subnetwork design and the offline supervised training of the back-end network. This avoids the initial cold start period where the HVAC control may be unstable and unpredictable.
• We have performed a number of experiments for evaluating the effectiveness of our approach under various scenarios. The results demonstrate that our approach can effectively transfer between buildings with different sizes, numbers of thermal zones, materials and layouts, and HVAC equipment, as well as under different weather conditions in certain cases. Our approach could enable fast deployment of DRL-based HVAC control with little training time after transfer, and reduce building energy cost with minimal violation of temperature constraints.
The rest of the paper is structured as follows. Section 2 provides a more detailed review of related work. Section 3 presents our approach, including the design of the two networks and the corresponding training methods. Section 4 shows the experiments for different transfer scenarios and other related ablation studies. Section 5 concludes the paper.
2 RELATED WORK
Model-based and Data-driven HVAC Control. There is a rich literature in HVAC control design, where the approaches generally fall into two main categories, i.e., model-based and data-driven.
Traditional model-based HVAC control approaches typically build explicit physical models for the controlled buildings and their surrounding environment, and then design control algorithms accordingly [20, 27]. For instance, the work in [19] presents a nonlinear model for the overall cooling system, which includes chillers, cooling towers, and thermal storage tanks, and then develops an MPC-based approach for reducing building energy consumption. The work in [20] models the building thermal dynamics as RC networks, calibrates the model based on historical data, and then presents a tracking LQR approach for HVAC control. Similar simplified models have been utilized in other works [21, 22, 30] for HVAC control and for co-scheduling HVAC operation with other energy demands and power supplies. While being efficient, these simplified models often do not provide sufficient accuracy for effective runtime control, given the complex relation between building room air temperature and various factors of the building itself (e.g., layout, structure, construction and materials), its surrounding environment (e.g., ambient temperature, humidity, solar radiation), and internal operation (e.g., heat generation from occupants, lighting and appliances). More accurate physical models can be built and simulated with tools such as EnergyPlus [5], but those models are typically too complex to be used for runtime control.
Data-driven approaches have thus emerged in recent years due to their advantage of not requiring explicit physical models at runtime. These approaches often leverage various machine learning techniques, in particular reinforcement learning. For instance, in [29, 37], DRL is applied to building HVAC control and an EnergyPlus model is leveraged for simulation-based offline training of DRL. In [8, 36], DRL approaches leveraging actor-critic methods are applied. The works in [9, 24] use data-driven methods to approximate/learn the energy consumption and occupants' satisfaction under different thermal conditions, and then apply DRL to learn an end-to-end HVAC control policy. These DRL-based methods are shown to be effective at reducing energy cost and maintaining desired temperature, and are sufficiently efficient at runtime. However, they often take a long training time to reach the desired performance, needing dozens to hundreds of months of data for training [28, 29] or even longer [9, 34]. Directly deploying them in real buildings for such a long training process is obviously not practical. Leveraging tools such as EnergyPlus for offline simulation-based training can mitigate this issue, but again incurs the need for the expensive and sometimes error-prone process of developing accurate physical models (needed for simulation in this case). These challenges have motivated this work to develop a transfer learning approach for efficient and effective DRL control of HVAC systems.
Transfer Learning for HVAC Control. There are a few works that have explored transfer learning in building HVAC control. In [18], transfer learning of a Q-learning agent is studied; however, only a single room (thermal zone) is considered. The usage of a tabular representation for each state-action pair in traditional Q-learning in fact limits the approach's capability to handle high-dimensional data. In [3], a neural network model for predicting temperature and humidity is learned in a supervised manner and transferred to new buildings for MPC-based control. The approach also focuses on single-zone buildings and requires further tuning after the deployment of the controller.
Different from these earlier works in transfer learning for HVAC control, our approach addresses multi-zone buildings and considers transfer between buildings with different sizes, numbers of thermal zones, layouts and materials, HVAC equipment, and ambient weather conditions. It also requires little to no further tuning after the transfer. This is achieved with a novel DRL controller design with two sub-networks and the corresponding training methods.
Transfer Learning in DRL. Since our approach considers transfer learning for DRL, it is worth noting some of the work on DRL-based transfer learning in other domains [1, 6, 11, 35]. For instance, in [11], the distribution of optimal trajectories across similar robots is matched for transfer learning in robotics. In [1], an environment randomization approach is proposed, where DRL agents trained in simulation with a large number of generated environments can be successfully transferred to their real-world applications. To the best of our knowledge, our work is the first to propose DRL-based transfer learning for multi-zone building HVAC control. It addresses the unique challenges in the building domain, e.g., by designing a novel two-subnetwork controller to avoid the complexity and cost of creating accurate physical models for simulation.
3 OUR APPROACH
We present our transfer learning approach in this section, including the design of the two-subnetwork controller and the training process. Section 3.1 introduces the system model. Section 3.2 provides an overview of our methodology. Section 3.3 presents the design of the building-agnostic front-end (sub-)network, and Section 3.4 explains the design of the building-specific back-end (sub-)network.
3.1 System Model
The goal of our work is to build a transferable HVAC control system that can maintain comfortable room air temperature within desired bounds while reducing the energy cost. We adopt a building model that is similar to the one used in [29], an n-zone building model with a variable air volume (VAV) HVAC system. The system provides conditioned air at a flow rate chosen from m discrete levels. Thus, the entire action space for the n-zone controller can be described as A = {a_1, a_2, ..., a_n}, where a_i (1 ≤ i ≤ n) is chosen from the m VAV levels {f_1, f_2, ..., f_m}. Note that the size of the action space (m^n) increases exponentially with respect to the number of thermal zones n, which presents a significant challenge to DRL control for larger buildings. We address this challenge in the design of our two-subnetwork DRL controller by avoiding setting the size of the neural network action output layer to m^n. This will be explained further later.
The DRL action is determined by the current system state. In our model, the system state includes the current physical time t, the inside state S_in, and the outside environment state S_out. The inside state S_in includes the temperature of each thermal zone, denoted as {T_1, T_2, ..., T_n}. The outside environment state S_out includes the ambient temperature and the solar irradiance (radiation intensity). Similar to [29], to improve DRL performance, S_out not only includes the current values of the ambient temperature and the solar irradiance, but also their weather forecast values for the next three days. Thus, the outside environment state is denoted as S_out = {T^0_out, T^1_out, T^2_out, T^3_out, Sun^0_out, Sun^1_out, Sun^2_out, Sun^3_out}. Our current model does not consider internal heat generation from occupants, a limitation that we plan to address in future work.
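To make the state and action definitions above concrete, the following minimal Python sketch builds the per-zone input I_i and enumerates the joint action space; the numeric values (4 zones, 2 airflow levels) are illustrative placeholders, not prescribed by the model.

```python
import itertools

# A minimal sketch of the Section 3.1 state and action spaces, assuming
# n = 4 thermal zones and m = 2 discrete VAV airflow levels.
n_zones = 4
vav_levels = [0.3, 1.0]   # placeholder values for the m airflow levels f_1..f_m

def build_zone_state(t, zone_temp, ambient_forecast, solar_forecast):
    """Per-zone input I_i: physical time t, the zone temperature, and S_out,
    i.e., ambient temperature and solar irradiance with forecasts for the next
    three days (10 features, matching the front-end input size in Table 1)."""
    return [t, zone_temp] + list(ambient_forecast) + list(solar_forecast)

# A joint action assigns one airflow level to each zone, so the joint action
# space has m**n elements -- the exponential growth our controller design avoids
# exposing as a single output layer.
joint_action_space = list(itertools.product(range(len(vav_levels)), repeat=n_zones))
assert len(joint_action_space) == len(vav_levels) ** n_zones   # m**n = 16 here
```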
3.2 Methodology Overview
We started our work by considering whether it is possible to directly transfer a well-trained DQN model for a single-zone source building to every zone of a target multiple-zone building. However, based on our experiments (shown later in Table 2 of Section 4), such a straightforward approach is not effective at all, leading to significant temperature violations. This is perhaps not surprising. In DQN-based reinforcement learning, a neural network Q maps the input I = {I_1, I_2, ..., I_n}, where I_i is the state for each zone i, to the control action output A. The network Q is optimized based on a reward function that considers energy cost and temperature violation. Through training, Q learns a control strategy that incorporates the consideration of building thermal dynamics, including the building-specific characteristics. Directly applying Q to a new target building, which may have totally different characteristics and dynamics, will not be effective in general.
Thus, our approach designs a novel architecture that includes two sub-networks, with an intermediate state ΔT that indicates a predictive value of the controller's willingness to change the indoor temperature. The front-end network Q maps the inputs I to the intermediate state ΔT. It is trained to capture the building-agnostic part of the control strategy, and is directly transferable. The back-end network then maps ΔT, together with I, to the control action output A. It is trained to capture the building-specific part of the control, and can be viewed as an inverse building network F^{-1}. An overview of our approach is illustrated in Figure 1.
3.3 Front-end Building-agnostic Network
Design and Training
We introduce the design of our front-end network Q and its training in this section. Q is composed of n (sub-)networks itself, where n is the number of building thermal zones. Each zone in the building model has its corresponding sub-network, and all sub-networks share their weights. In each sub-network for thermal zone i, the input layer accepts state I_i. It is followed by L sequentially-connected fully-connected layers (the exact numbers of neurons are presented later in Table 1 of Section 4). Rather than directly giving the control action likelihood vector, the network's output layer reflects a planned temperature change value ΔT_i for each zone.
Figure 1: Overview of our DRL-based transfer learning approach for HVAC control. We design a novel DQN architecture that includes two sub-networks: a front-end network Q captures the building-agnostic part of the control as much as possible, while a back-end network (inverse building network) F^{-1} captures the building-specific behavior. At each control step, the front-end network Q maps the current system state I to an intermediate state ΔT. Then, the back-end network F^{-1} maps ΔT, together with I, to the control action outputs A. During transfer learning from a source building to a target building, the front-end network Q is directly transferable. The back-end network F^{-1} can be trained in a supervised manner, with data collected from an existing controller (e.g., a simple ON-OFF controller). Experiments have shown that around two weeks of data is sufficient for such supervised training of F^{-1}. If it is a brand new building without any existing controller, we can deploy a simple ON-OFF controller for two weeks in a "warm-up" process. During this process, the ON-OFF controller can maintain the temperature within the desired bounds (albeit with higher cost), and collect data that captures the building-specific behavior for training F^{-1}.
More specifically, the output of the last layer is designed as a vector O_{ΔT_i} of length K+2 in one-hot representation: the planned temperature changing range is equally divided into K intervals within a predefined temperature range of [−b, b], and two intervals outside of that range are also considered. The relationship between the planned temperature change value ΔT_i of zone i and the output vector O_{ΔT_i} is as follows:

\[
O_{\Delta T_i} =
\begin{cases}
\langle 1, 0, \cdots, 0 \rangle, & \Delta T_i \le -b, \\
\langle 0, \cdots, 0, 1, 0, \cdots, 0 \rangle, & -b < \Delta T_i < b \ \ (\text{the position of the 1 is given by } \lfloor \Delta T_i / (2b/K) \rfloor), \\
\langle 0, \cdots, 0, 1 \rangle, & \Delta T_i \ge b.
\end{cases}
\tag{1}
\]

Then, for the entire front-end network Q, the combined input is I = {I_1, I_2, ..., I_n}, and the combined output is O_{ΔT} = {O_{ΔT_1}, O_{ΔT_2}, ..., O_{ΔT_n}}.
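As a concrete illustration of Eq. (1), the sketch below encodes a planned temperature change into a one-hot vector of length K+2, using b = 2 and K = 20 as in Table 1; the exact interval indexing is our assumption for illustration, since it is not fully spelled out above.

```python
import numpy as np

def encode_delta_t(delta_t, b=2.0, k=20):
    """One-hot encode a planned temperature change into a vector of length k+2:
    index 0 for delta_t <= -b, indices 1..k for the k equal intervals within
    (-b, b), and index k+1 for delta_t >= b (a sketch of Eq. (1); the offset
    used for the interior indices is assumed)."""
    vec = np.zeros(k + 2)
    if delta_t <= -b:
        vec[0] = 1.0
    elif delta_t >= b:
        vec[k + 1] = 1.0
    else:
        idx = int((delta_t + b) // (2 * b / k)) + 1   # which of the k intervals
        vec[min(idx, k)] = 1.0
    return vec
```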
It is worth noting that if we had designed the front-end network as a standard deep Q-learning model [23], it would take I as the network's input, pass it through several fully-connected layers, and output the selection among an action space that has a size of (K+2)^n (as there are n zones, and each has K+2 possible actions). It would also need an equal number of neurons in the last layer, which is not affordable when the number of zones gets large. Instead, in our design, the last layer of the front-end network Q has its size reduced to (K+2)·n, which can be further reduced to (K+2) with the following weight-sharing technique.
We decide to let the n sub-networks of Q share their weights during training. One benefit of this design is that it enables transferring the front-end network for an n-zone source building to a target m-zone building, where m could be different from n. It also reduces the training load by lowering the number of parameters. Such a design performs well in our experiments.
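To make the weight-sharing design concrete, the following PyTorch sketch shows one way the shared front-end sub-network could be organized, with layer sizes taken from Table 1; the class and argument names are ours and this is an illustrative sketch rather than our production implementation.

```python
import torch
import torch.nn as nn

class FrontEndQ(nn.Module):
    """Sketch of the building-agnostic front-end network: a single sub-network
    whose weights are shared across all zones (sizes follow Table 1)."""
    def __init__(self, per_zone_input=10, k_plus_2=22,
                 hidden=(128, 256, 256, 256, 400)):
        super().__init__()
        layers, prev = [], per_zone_input
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU()]
            prev = h
        layers.append(nn.Linear(prev, k_plus_2))
        self.shared = nn.Sequential(*layers)

    def forward(self, zone_states):
        # zone_states: (batch, n_zones, per_zone_input); the same shared
        # sub-network is applied to every zone, so n can change at transfer time.
        b, n, d = zone_states.shape
        q_values = self.shared(zone_states.reshape(b * n, d))
        return q_values.reshape(b, n, -1)   # (batch, n_zones, K+2)
```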
Our front-end network Q is trained with standard deep Q-learning techniques [23]. Note that while the output action for Q is the planned temperature change vector O_{ΔT}, the training process uses a dynamic reward R_t that depends on the eventual action (i.e., the output of network F^{-1}), which will be introduced later in Section 3.4. Specifically, the training of the front-end network Q follows Algorithm 1 (the hyper-parameters used are listed later in Table 1 of Section 4). First, we initialize Q by following the weight initialization method described in [12] and copy its weights to the target network Q̂ (the target network Q̂ is a technique in deep Q-learning that is used for improving performance). The back-end network F^{-1} is initialized following Algorithm 2 (introduced later in Section 3.4). We also empty the replay buffer and set the exploration rate ε to 1.
At each control instant t during a training epoch, we obtain the current system state S_cur = (t, S_in, S_out) and calculate the current reward R_t. We then collect the learning samples (experience) (S_pre, S_cur, ΔT, A, R) and store them in the replay buffer. In the following learning-related operations, we first sample a data batch M = (S_prime, S_next, a, r) from the replay buffer, and calculate the actual temperature change value ΔT_a from S_prime and S_next. Then, we get the predicted control action from the back-end network F^{-1}, i.e., a_p = F^{-1}(ΔT_a, S_prime). In this way, the cross-entropy loss can be calculated from the true label a and the predicted label a_p. We then use supervised learning to update the back-end network F^{-1} with the Adam optimizer [14] under learning rate lr2.
We follow the same procedure as described in [23] to calculate the target vector v that is used in deep Q-learning. With the target vector v and input state S_prime, we can then train Q using the back-propagation method [10] with mean squared error loss and learning rate lr1. With a period of Δnt, we assign the weights of Q to the target network Q̂. The exploration rate is updated as ε = max{ε_low, ε − Δε}. It is used in the ε-greedy policy to select each planned temperature change value ΔT_i:

\[
\Delta T_i =
\begin{cases}
\arg\max O_{\Delta T_i} & \text{with probability } 1-\epsilon, \\
\text{random}(0 \text{ to } K+1) & \text{with probability } \epsilon.
\end{cases}
\tag{2}
\]

\[
\Delta T = \{\Delta T_1, \Delta T_2, \cdots, \Delta T_n\}. \tag{3}
\]

The control action A is obtained from the back-end network:

\[
A = F^{-1}(\Delta T, S_{cur}). \tag{4}
\]
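The following sketch ties Eqs. (2)-(4) together: ε-greedy selection of the planned temperature-change index per zone, followed by the back-end network to obtain one VAV level per zone. The tensor shapes and the exact interfaces of q_net and f_inv are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def select_planned_changes(q_net, f_inv, zone_states, epsilon, k=20):
    """A sketch of Eqs. (2)-(4): epsilon-greedy selection of the planned
    temperature-change index for each zone, then mapping through the back-end
    (inverse building) network to one VAV level per zone."""
    with torch.no_grad():
        o_delta_t = q_net(zone_states)                 # (1, n_zones, K+2)
        greedy = o_delta_t.argmax(dim=-1)              # argmax per zone, Eq. (2)
        rand_idx = torch.randint(0, k + 2, greedy.shape)
        explore = torch.rand(greedy.shape) < epsilon
        delta_t_idx = torch.where(explore, rand_idx, greedy)    # Eq. (3), one index per zone
        one_hot = F.one_hot(delta_t_idx, num_classes=k + 2).float()
        action_logits = f_inv(one_hot.flatten(start_dim=1))     # Eq. (4), sketch of F^{-1}
    return action_logits.argmax(dim=-1)                # one VAV level per zone
```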
3.4 Back-end Building-specific Network Design and Training
The objective of the back-end network is to map the planned temperature change vector O_{ΔT} (or ΔT), together with the system state I, into the control action A. Consider that during operation, a building environment "maps" the control action and system state to the actual temperature change value. So in a way, the back-end network can be viewed as doing the inverse of what a building environment does, i.e., it can be viewed as an inverse building network F^{-1}.
The network F^{-1} receives the planned temperature change value ΔT and the system state I at its input layer. It is followed by L fully-connected layers (the exact numbers for experimentation are specified in Table 1 of Section 4). It outputs a likelihood control action vector O_A = {v_1, v_2, ..., v_n}, which can be divided into n groups. Group i is a one-hot vector v_i corresponding to the control action for zone i. The length of v_i is m, as there are m possible control actions for each zone as defined earlier. When O_A is provided, the control action A can be easily calculated by applying an argmax operation to each group in O_A, i.e., A = {argmax{v_1}, argmax{v_2}, ..., argmax{v_n}}.
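A minimal PyTorch sketch of such a back-end network is given below, with hidden sizes following the back-end entry in Table 1; whether the system state I is concatenated to the input in addition to the planned temperature-change encodings is left as an assumption here, and the class name is ours.

```python
import torch
import torch.nn as nn

class BackEndFInv(nn.Module):
    """Sketch of the building-specific back-end (inverse building) network F^{-1}.
    Hidden sizes follow the [22*n, 128, 256, 256, 128, m*n] entry in Table 1."""
    def __init__(self, n_zones=4, k_plus_2=22, m_levels=2):
        super().__init__()
        self.n_zones, self.m_levels = n_zones, m_levels
        self.net = nn.Sequential(
            nn.Linear(k_plus_2 * n_zones, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, m_levels * n_zones),
        )

    def forward(self, o_delta_t):
        # o_delta_t: (batch, n_zones * (K+2)), concatenated per-zone encodings.
        logits = self.net(o_delta_t)
        # n groups of m action logits; argmax over the last dim gives one VAV level per zone.
        return logits.reshape(-1, self.n_zones, self.m_levels)
```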
The network F^{-1} is integrated with the reward function R_t:

\[
R_t = w_{cost} \cdot R\_cost_t + w_{vio} \cdot R\_vio_t, \tag{5}
\]

where R_cost_t is the reward of energy cost at time step t and w_cost is the corresponding scaling factor. R_vio_t is the reward of zone temperature violation at time step t and w_vio is its scaling factor. The two rewards are further defined as:

\[
R\_cost_t = -cost(F^{-1}(\Delta T_{t-1}), t-1). \tag{6}
\]

\[
R\_vio_t = -\sum_{i=1}^{n} \big( \max(T^i_t - T_{upper}, 0) + \max(T_{lower} - T^i_t, 0) \big). \tag{7}
\]
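The reward computation of Eqs. (5)-(7) can be sketched as follows, using the weights and comfort bounds from Table 1; the sign convention (cost and violation entering as penalties) reflects our reading of the reconstructed equations.

```python
def reward(energy_cost, zone_temps, t_lower=19.0, t_upper=24.0,
           w_cost=1.0 / 1000, w_vio=1.0 / 1600):
    """Sketch of Eqs. (5)-(7): `energy_cost` is the cost of the previous control
    period and `zone_temps` the current zone temperatures; the weight values and
    sign convention are assumptions based on Table 1 and the surrounding text."""
    r_cost = -energy_cost
    r_vio = -sum(max(t - t_upper, 0.0) + max(t_lower - t, 0.0) for t in zone_temps)
    return w_cost * r_cost + w_vio * r_vio
```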
Algorithm 1 Training of front-end network Q
1: ep: the number of training epochs
2: Δct: the control period
3: t_MAX: the maximum training time of an epoch
4: Δnt: the time interval to update the target network
5: Empty the replay buffer
6: Initialize Q; set the weights of the target network Q̂ = Q; initialize F^{-1} based on Algorithm 2
7: Initialize the current planned temperature change vector ΔT
8: Initialize the previous state S_pre
9: Initialize the exploration rate ε
10: for Epoch = 1 to ep do
11:   for t = 0 to t_MAX, t += Δct do
12:     S_cur ← (t, S_in, S_out)
13:     Calculate reward R
14:     Add experience (S_pre, S_cur, ΔT, A, R) to the replay buffer
15:     for tr = 0 to L_MAX do
16:       Sample a batch M = (S_prime, S_next, a, r)
17:       Calculate the actual temperature change value ΔT_a
18:       Predicted label a_p = F^{-1}(ΔT_a, S_prime)
19:       Set loss L = CrossEntropyLoss(a_p, a)
20:       Update F^{-1} with loss L and learning rate lr2
21:       Target v ← target network Q̂(S_prime)
22:       Train network Q with S_prime and v
23:     end for
24:     if t mod Δnt == 0 then
25:       Update target network Q̂
26:     end if
27:     O_ΔT = Q(S_cur)
28:     Update exploration rate ε
29:     Update each ΔT_i following the ε-greedy policy
30:     ΔT = <ΔT_1, ΔT_2, ..., ΔT_n>
31:     Control action A ← F^{-1}(ΔT, S_cur)
32:     S_pre = S_cur
33:   end for
34: end for
Here, cost(·, ·) is a function that calculates the energy cost within a control period according to the local electricity price that changes over time. ΔT_{t−1} is the planned temperature change value at time t − 1. T^i_t is the zone i temperature at time t. T_upper and T_lower are the comfortable temperature upper and lower bounds, respectively.
As stated before, F^{-1} can be trained in a supervised manner. We could also directly deploy our DRL controller with the transferred front-end network Q and an initially-randomized back-end network F^{-1}; but we have found that leveraging data collected from the existing controller of the target building for offline supervised learning of F^{-1} before deployment can provide significantly better results than starting with a random F^{-1}. This is because the data from the existing controller provides insights into the building-specific behavior, which after all is what F^{-1} is for. In our experiments, we have found that a simple existing controller such as the ON-OFF controller with two weeks of data can already be very effective for helping train F^{-1}. Note that such supervised training of F^{-1} does not require the front-end network Q, which means F^{-1} could be well-trained and ready for use before Q is trained and transferred. In the case that the target building is brand new and there is no existing controller, we can deploy a simple ON-OFF controller for collecting such data in a warm-up process (Figure 1).
Algorithm 2 Training of back-end network F^{-1}
1: ep_F: the number of training epochs
2: Δct: the control period
3: t'_MAX: the maximum data collection time
4: Initialize the previous state S_pre
5: Initialize F^{-1}
6: Empty database M and dataset D
7: for t = 0 to t'_MAX, t += Δct do
8:   S_cur ← (t, S_in, S_out)
9:   Control action A ← run ON-OFF controller on S_cur
10:   S_pre = S_cur
11:   Add sample (S_cur, S_pre, A) to database M
12: end for
13: for each sample u = (S_cur, S_pre, a) in M do
14:   ΔT_a ← calculate the temperature difference in (S_cur, S_pre)
15:   Add sample v = (ΔT_a, S_pre, a) to dataset D
16: end for
17: for each sample u = (S_cur, S_pre, a) in M do
18:   ΔT_a ← the lowest level
19:   a ← the maximum air conditioning level
20:   Add sample v = (ΔT_a, S_pre, a) to dataset D
21: end for
22: for Epoch = 1 to ep_F do
23:   for each training batch of (ΔT_a, S_pre, a) in dataset D do
24:     Network inputs = (ΔT_a, S_pre)
25:     Corresponding labels = (a)
26:     Train network F^{-1}
27:   end for
28: end for
29: Return F^{-1}
Algorithm 3 Running of our proposed approach
1: Δct: the control period
2: t_MAX: the maximum testing time
3: Initialize the weights of Q with the front-end network transferred from the source building (see Figure 1)
4: Initialize the weights of F^{-1} with the weights learned using Algorithm 2
5: for t = 0 to t_MAX, t += Δct do
6:   S_cur ← (t, S_in, S_out)
7:   ΔT ← argmax Q(S_cur)
8:   Control action A ← F^{-1}(ΔT, S_cur)
9: end for
While such an ON-OFF controller typically consumes significantly more energy, it can effectively maintain the room temperature within the desired bounds, which means that the building could already be in use during this period. Once F^{-1} is trained, the DRL controller can replace the ON-OFF controller in operation.
Algorithm 2 shows the detailed process for the training of F^{-1}. Note that the initialization of F^{-1} in this algorithm also follows the weight initialization method described in [12]. We also augment the collected training data to cover the boundary condition: the augmenting data is created by copying all samples from the collected data and setting the temperature change value ΔT to the lowest level (< −b) while setting all control actions to the maximum level.
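A minimal sketch of this boundary augmentation (corresponding to lines 17-21 of Algorithm 2) is shown below; the sample layout and the numeric stand-ins are illustrative only.

```python
def augment_boundary(samples, b=2.0, max_action=1):
    """Sketch of the boundary augmentation in Algorithm 2: every collected
    (delta_t, s_pre, action) sample is kept, and a copy is added with the
    temperature change forced below -b and the control action forced to the
    maximum airflow level (the sample layout is assumed for illustration)."""
    dataset = []
    for delta_t, s_pre, action in samples:
        dataset.append((delta_t, s_pre, action))        # original sample
        dataset.append((-b - 1.0, s_pre, max_action))   # boundary copy
    return dataset
```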
Once the front-end network Q is trained as in Algorithm 1 and the back-end network F^{-1} is trained as in Algorithm 2, our transferred DRL controller is ready to be deployed and can operate as
Parameter | Value | Parameter | Value
Front-end network layers | [10, 128, 256, 256, 256, 400, 22] | Back-end network layers | [22·n, 128, 256, 256, 128, m·n]
b | 2 | K (number of intervals) | 20
lr1 | 0.0003 | lr2 | 0.0001
ep | 150 | ep_F | 15
L_MAX | 1 | ε_low | 0.1
w_cost | 1/1000 | w_vio | 1/1600
T_lower | 19°C | T_upper | 24°C
Δnt | 240 × 15 min | Δct | 15 min
t'_MAX | 2 weeks | t_MAX | 1 month

Table 1: Hyper-parameters used in our experiments.
described in Algorithm 3. Note that we could further fine-tune our DRL controller during operation. This can be done by enabling a fine-tuning procedure that is similar to Algorithm 1. The difference is that instead of initializing the Q-network Q using [12], we copy the transferred Q-network weights from the source building to the target building's front-end network Q and its corresponding target network Q̂. We also set ε = 0, ε_low = 0, and L_MAX to 3 instead of 1. Other operations remain the same as in Algorithm 1.
4 EXPERIMENTAL RESULTS
4.1 Experiment Settings
All experiments are conducted on a server equipped with a 2.10GHz CPU (Intel Xeon(R) Gold 6130), 64GB RAM, and an NVIDIA TITAN RTX GPU card. The learning algorithms are implemented in the PyTorch learning framework. The Adam optimizer [14] is used to optimize both the front-end and back-end networks. The DRL hyper-parameter settings are shown in Table 1. In addition, to accurately evaluate our approach, we leverage the building simulation tool EnergyPlus [5]. Note that EnergyPlus here is only used for evaluation purposes, in place of real buildings. During the practical application of our approach, EnergyPlus is not needed. This is different from some of the approaches in the literature [28, 29], where EnergyPlus is needed for offline training before deployment and hence accurate and expensive physical models have to be developed.
In our experiments, simulation models in EnergyPlus interact with the learning algorithms written in Python through the Building Controls Virtual Test Bed (BCVTB) [31]. We simulate the building models with the weather data obtained from the Typical Meteorological Year 3 database [32], and choose the summer weather data in August (each training epoch contains one month of data). Apart from the weather transferring experiments, all other experiments are based on the weather data collected in Riverside, California, where the ambient weather changes more drastically and thus presents more challenges to the HVAC controller. Different building types are used in our experiments, including one-zone building 1 (abbreviated as 1-zone 1), four-zone building 1 (4-zone 1), four-zone building 2 (4-zone 2), four-zone building 3 (4-zone 3), five-zone building 1 (5-zone 1), and seven-zone building 1 (7-zone 1). These models are visualized in Figure 2. In addition, the conditioned air temperature sent from the VAV HVAC system is set to 10°C.
The symbols used in the result tables are explained as follows. θ_i denotes the temperature violation rate in thermal zone i. Aθ and Mθ represent the average and the maximum temperature violation rate across all zones, respectively. μ_i denotes the maximum temperature violation value for zone i, measured in °C. Aμ and Mμ are the average and maximum temperature violation value across all zones, respectively. EP represents the number of training epochs. The ✓ column denotes whether all the temperature violation rates across all zones are less than 5%. If it is true, it is marked as ✓; otherwise, it is marked as × (which is typically not acceptable for HVAC control).
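For clarity, the following sketch computes these metrics from per-zone temperature series; it reflects our reading of the definitions above (θ as the fraction of control steps outside the comfort bounds, μ as the largest excursion in °C), with names chosen only for illustration.

```python
def violation_metrics(zone_temp_series, t_lower=19.0, t_upper=24.0):
    """Sketch of the table metrics: for each zone, theta is the fraction of
    control steps outside [t_lower, t_upper] and mu is the largest excursion
    in degrees C; the A/M prefixes are the average/maximum over all zones."""
    thetas, mus = [], []
    for temps in zone_temp_series:   # one temperature series per zone
        excursions = [max(t - t_upper, 0.0) + max(t_lower - t, 0.0) for t in temps]
        thetas.append(sum(e > 0 for e in excursions) / len(temps))
        mus.append(max(excursions))
    return (sum(thetas) / len(thetas), max(thetas),   # A_theta, M_theta
            sum(mus) / len(mus), max(mus))            # A_mu, M_mu
```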
Before reporting the main part of our results, we want to show that simply transferring a well-trained DQN model for a single-zone source building to every zone of a target multi-zone building may not yield good results, as discussed in Section 3.2. As shown in Table 2, a DQN model trained for one-zone building 1 works well for itself, but when it is transferred directly to every zone of four-zone building 2, there are significant temperature violations. This shows that a more sophisticated approach such as ours is needed. The following sections will show the results of our approach and its comparison with other methods.
4.2 Transfer from n-zone to n-zone with different materials and layouts
In this section, we conduct experiments on HVAC controller transfer between four-zone buildings that have different materials and layouts. As shown in Figure 2, four-zone building 1 and four-zone building 2 have different structures, and also different wall materials in each zone with different heat capacities. Table 3 first shows the direct training results on four-zone building 1, and the main transferring results are presented in Table 4.
The direct training outcomes of the baselines and our approach are shown in Table 3. The results include ON-OFF control, Deep Q-network (DQN) control as described in [29] (which assigns an individual DQN model to each zone in the building and trains them for 100 epochs, with one month of data for each epoch), DQN (the standard deep Q-learning method with m^n selections in the last layer [13]), and the direct training result of our method without transferring. Moreover, the DQN method is trained with 50, 100, and 150 training epochs (months), respectively, to show the impact of training time. As shown in the table, all learning-based methods demonstrate significant energy cost reduction over ON-OFF control. The standard DQN shows slightly higher cost and violation rate when compared to DQN [29] after 150 epochs. Our approach with Algorithm 1 (i.e., not transferred) achieves the lowest violation rate among all learning-based methods, while providing a low cost.
Table 4 shows the main comparison results of our transfer learning approach and other baselines on four-zone building 2 and four-zone building 3. ON-OFF, DQN [29], and the standard DQN are directly trained on those two buildings. DQN_T is a transfer learning approach that transfers a well-trained standard DQN model from four-zone building 1 to the target building (four-zone building 2 or 3). Our approach transfers our trained four-zone building 1 model (last row in Table 3) to the target building. From Table 4, we can see that for both four-zone building 2 and 3, with 150 training epochs, DQN [29] and the standard DQN provide lower violation rate and cost than ON-OFF control, although the standard DQN cannot meet the temperature violation requirement. The other transfer learning approach DQN_T shows a very high violation rate. In comparison, our approach achieves an extremely low temperature violation rate and a relatively low energy cost without any fine-tuning after transferring (i.e., EP is 0). We may fine-tune the controller for 1 epoch (month) after transferring to further reduce the energy cost (i.e., EP is 1), at the expense of a slightly higher violation rate (but still meeting the requirement). More studies on fine-tuning can be found in Section 4.5. Figure 3 (left) also shows the temperature over time for the target four-zone building 2, and we can see that it is kept well within the bounds.
4.3 Transfer from n-zone to m-zone
We also study the transfer from an n-zone building to an m-zone building. This is a difficult task because the input and output dimensions are different, presenting significant challenges for the DRL network design. Here, we conduct experiments on transferring the HVAC controller for four-zone building 1 to five-zone building 1 and seven-zone building 1, and the results are presented in Table 5. For these cases, the standard DQN and DQN_T cannot provide feasible results as the m^n action space is too large for them, and the violation rate does not go down even after 150 training epochs. DQN [29] also leads to a high violation rate. In comparison, our approach achieves both a low violation rate and a low energy cost. Figure 3 (middle and right) shows the temperature over time (kept well within the bounds) for the two target buildings after using our transfer approach.
4.4 Transfer from n-zone to n-zone with different HVAC equipment
In some cases, the target building may have different HVAC equipment (or a building may have its equipment upgraded). The new HVAC equipment may be more powerful or have a different number of control levels, making the original controller less effective. In such cases, our transfer learning approach provides an effective solution. Here we conduct experiments on transferring our controller for the original HVAC equipment (denoted as AC 1, which has two control levels and is used in all other experiments) to the same building with new HVAC equipment (denoted as AC 2, which has five control levels; and AC 3, which has double the maximum airflow rate and double the air conditioner power compared to AC 1). The experimental results are shown in Table 6. We can see that our approach provides zero violation rate after transferring, and the energy cost can be further reduced with the fine-tuning process.
4.5 Fine-tuning study
Although our method already achieves good performance after transferring without fine-tuning, further training is still worth considering because it may provide an even lower energy cost. We record the change of cost and violation rate when fine-tuning our method transferred from four-zone building 1 to four-zone building 2. The results are shown in Figure 4.
4.6 Discussion
4.6.1 Transfer from n-zone to n-zone with different weather. As presented in [18], a Q-learning controller trained under weather with a larger temperature range and variance is easy to transfer to an environment whose weather has a smaller temperature range and variance, but transfer in the opposite direction is much harder. This conclusion is similar to what we observed for our approach.
Figure 2: Different building models used in our experiments. From left to right, the models are one-zone building 1, four-zone building 1, four-zone building 2, four-zone building 3, five-zone building 1, and seven-zone building 1. Compared to four-zone building 1, four-zone building 2 has a different layout and wall material; four-zone building 3 has a different layout, wall material, and room size; five-zone building 1 has a different number of zones, layout, and wall material; and seven-zone building 1 has a different number of zones, layout, wall material, and room size.
Source building | Target building | θ1 | θ2 | θ3 | θ4 | μ1 | μ2 | μ3 | μ4 | ✓ | Cost
1-zone 1 | 1-zone 1 | 1.62% | - | - | - | 1.11 | - | - | - | ✓ | 248.43
1-zone 1 | 4-zone 2 | 1.88% | 9.43% | 10.19% | 14.07% | 0.44 | 0.97 | 1.04 | 1.17 | × | 308.13

Table 2: This table shows the experiment that transfers a single-zone DQN model (trained on one-zone building 1) to every zone of four-zone building 2. The high violation rate shows that such a straightforward scheme may not yield good results and more sophisticated methods such as ours are needed.
Method | Building | EP | θ1 | θ2 | θ3 | θ4 | μ1 | μ2 | μ3 | μ4 | ✓ | Cost
ON-OFF | 4-zone 1 | 0 | 0.08% | 0.08% | 0.23% | 0.19% | 0.01 | 0.03 | 0.08 | 0.08 | ✓ | 329.56
DQN [29] | 4-zone 1 | 50 | 1.21% | 22.72% | 9.47% | 20.66% | 0.68 | 2.46 | 1.61 | 2.07 | × | 245.08
DQN [29] | 4-zone 1 | 100 | 0.0% | 0.53% | 0.05% | 0.93% | 0.0 | 0.46 | 0.40 | 1.09 | ✓ | 292.91
DQN [29] | 4-zone 1 | 150 | 0.0% | 0.95% | 0.03% | 1.59% | 0.0 | 0.52 | 0.17 | 1.17 | ✓ | 278.32
DQN | 4-zone 1 | 150 | 1.74% | 2.81% | 1.80% | 2.76% | 0.45 | 0.79 | 1.08 | 1.22 | ✓ | 289.09
Ours | 4-zone 1 | 150 | 0.0% | 0.04% | 0.0% | 0.03% | 0.0 | 0.33 | 0.0 | 0.11 | ✓ | 297.42

Table 3: Results of different methods on four-zone building 1. Apart from the ON-OFF control, all others are training results without transferring. The trained model in the last row is used as the transfer model to other buildings in our method.
Method | Building | EP | θ1 | θ2 | θ3 | θ4 | μ1 | μ2 | μ3 | μ4 | ✓ | Cost
ON-OFF | 4-zone 2 | 0 | 0.0% | 0.0% | 0.0% | 0.02% | 0.0 | 0.0 | 0.0 | 0.46 | ✓ | 373.78
DQN [29] | 4-zone 2 | 50 | 0.83% | 49.22% | 46.75% | 60.48% | 0.74 | 2.93 | 3.18 | 3.39 | × | 258.85
DQN [29] | 4-zone 2 | 100 | 0.0% | 1.67% | 1.23% | 3.58% | 0.0 | 0.92 | 0.77 | 1.62 | ✓ | 352.13
DQN [29] | 4-zone 2 | 150 | 0.0% | 2.52% | 1.67% | 4.84% | 0.0 | 1.64 | 1.56 | 1.61 | ✓ | 337.33
DQN | 4-zone 2 | 150 | 1.16% | 2.71% | 2.17% | 6.44% | 0.61 | 1.11 | 0.77 | 1.11 | × | 323.72
DQN_T | 4-zone 2 | 0 | 12.35% | 19.10% | 10.39% | 23.59% | 2.47 | 4.67 | 2.27 | 5.22 | × | 288.73
Ours | 4-zone 2 | 0 | 0.0% | 0.0% | 0.0% | 0.07% | 0.0 | 0.0 | 0.0 | 0.88 | ✓ | 338.45
Ours | 4-zone 2 | 1 | 0.09% | 3.44% | 1.91% | 4.06% | 0.33 | 1.04 | 0.96 | 1.35 | ✓ | 297.03
ON-OFF | 4-zone 3 | 0 | 0.0% | 0.19% | 0.0% | 0.0% | 0.0 | 0.02 | 0.0 | 0.0 | ✓ | 360.74
DQN [29] | 4-zone 3 | 50 | 0.68% | 47.21% | 44.61% | 56.19% | 0.74 | 3.15 | 2.92 | 3.60 | × | 267.29
DQN [29] | 4-zone 3 | 100 | 0.34% | 2.53% | 2.21% | 5.59% | 0.01 | 1.18 | 0.85 | 1.18 | × | 342.08
DQN [29] | 4-zone 3 | 150 | 0.0% | 1.55% | 1.68% | 3.79% | 0.0 | 1.09 | 1.18 | 1.51 | ✓ | 334.89
DQN | 4-zone 3 | 150 | 7.09% | 13.85% | 2.87% | 2.16% | 1.26 | 1.48 | 1.42 | 1.01 | × | 316.93
DQN_T | 4-zone 3 | 0 | 13.31% | 8.11% | 3.18% | 0.66% | 1.25 | 3.48 | 2.27 | 0.69 | × | 294.23
Ours | 4-zone 3 | 0 | 0.0% | 0.28% | 0.0% | 0.0% | 0.0 | 0.37 | 0.0 | 0.0 | ✓ | 340.40
Ours | 4-zone 3 | 1 | 0.23% | 2.74% | 0.04% | 0.13% | 0.34 | 1.73 | 0.12 | 0.31 | ✓ | 331.47

Table 4: Comparison between our approach and other baselines. The top half shows the performance of different controllers on four-zone building 2, including the ON-OFF controller, DQN from [29] trained with different numbers of epochs, the standard deep Q-learning method (DQN) and its version transferred from four-zone building 1 (DQN_T), and our approach transferred from four-zone building 1 (without fine-tuning and with 1 epoch of tuning, respectively). We can see that our method achieves the lowest violation rate and very low energy cost after transferring without any further tuning/training. We may fine-tune our controller with 1 epoch (month) of training and achieve the lowest cost, at the expense of a slightly higher violation rate (but still meeting the requirement). The bottom half shows the similar comparison results for four-zone building 3.
We tested the weather from Riverside, Buffalo, and Los Angeles, which is visualized in Figure 5. The results in Table 7 show that our approach can easily be transferred from weather with a large range and high variance (Riverside) to weather with a small range and low variance (Buffalo and Los Angeles (LA)), but not vice versa. Fortunately, the transfer for a new building is still not affected, because our approach can use building models in the same region, or obtain the weather data of that region and create a simulated model for transferring.
4.6.2 Different settings for ON-OFF control. Our back-end network (inverse building network) is learned from the dataset collected by an ON-OFF controller with a low temperature violation rate. In practice, it is flexible to determine the actual temperature boundaries for ON-OFF control. For instance, the operator may set the temperature
Figure 3: Temperature of four-zone building 2 (left), 5-zone building 1 (middle), and 7-zone building 1 (right) after transfer.
Method | Building | EP | Aθ | Mθ | Aμ | Mμ | ✓ | Cost
ON-OFF | 5-zone 1 | 0 | 0.45% | 2.2% | 0.24 | 1.00 | ✓ | 373.90
DQN [29] | 5-zone 1 | 50 | 38.65% | 65.00% | 2.60 | 3.81 | × | 263.79
DQN [29] | 5-zone 1 | 100 | 4.13% | 11.59% | 4.66 | 1.47 | × | 326.50
DQN [29] | 5-zone 1 | 150 | 2.86% | 10.94% | 0.89 | 1.63 | × | 323.78
Ours | 5-zone 1 | 0 | 0.47% | 2.34% | 0.33 | 1.42 | ✓ | 339.73
Ours | 5-zone 1 | 1 | 2.41% | 4.48% | 1.02 | 1.64 | ✓ | 323.26
ON-OFF | 7-zone 1 | 0 | 0.37% | 2.61% | 0.04 | 0.30 | ✓ | 392.56
DQN [29] | 7-zone 1 | 50 | 28.14% | 54.28% | 2.76 | 3.06 | × | 248.38
DQN [29] | 7-zone 1 | 100 | 5.19% | 18.91% | 1.12 | 1.69 | × | 277.87
DQN [29] | 7-zone 1 | 150 | 4.48% | 18.34% | 1.22 | 1.98 | × | 284.51
Ours | 7-zone 1 | 0 | 0.42% | 2.79% | 0.10 | 0.43 | ✓ | 332.07
Ours | 7-zone 1 | 1 | 0.77% | 1.16% | 0.77 | 1.21 | ✓ | 329.81

Table 5: Comparison of our approach and baselines on five-zone building 1 and seven-zone building 1.
Method | AC | EP | Aθ | Mθ | Aμ | Mμ | ✓ | Cost
ON-OFF | AC 2 | 0 | 0.15% | 0.23% | 0.05 | 0.08 | ✓ | 329.56
DQN [29] | AC 2 | 50 | 20.28% | 35.56% | 1.73 | 2.66 | × | 229.41
DQN [29] | AC 2 | 100 | 1.25% | 2.69% | 0.61 | 1.20 | ✓ | 270.93
DQN [29] | AC 2 | 150 | 1.49% | 2.87% | 0.60 | 1.02 | ✓ | 263.92
Ours | AC 2 | 0 | 0.0% | 0.0% | 0.0 | 0.0 | ✓ | 303.37
Ours | AC 2 | 1 | 2.06% | 4.20% | 0.97 | 1.30 | ✓ | 262.23
ON-OFF | AC 3 | 0 | 0.01% | 0.05% | 0.22 | 0.88 | ✓ | 317.53
DQN [29] | AC 3 | 50 | 2.85% | 3.76% | 1.37 | 1.90 | ✓ | 321.03
DQN [29] | AC 3 | 100 | 0.69% | 1.20% | 0.53 | 0.99 | ✓ | 265.46
DQN [29] | AC 3 | 150 | 0.62% | 1.07% | 0.47 | 0.65 | ✓ | 266.86
Ours | AC 3 | 0 | 0.0% | 0.0% | 0.0 | 0.0 | ✓ | 316.16
Ours | AC 3 | 1 | 0.84% | 1.42% | 0.54 | 0.78 | ✓ | 269.24

Table 6: Comparison under different HVAC equipment.
[Figure 4 plot: energy cost (y-axis, 260-350) versus fine-tuning week (0-6), with weekly average violation rates of 0.02%, 0.74%, 0.02%, 1.53%, 2.76%, 2.41%, and 2.45%.]
Figure 4: Fine-tuning results of our approach for four-zone building 2. Our approach can significantly reduce energy cost after fine-tuning for 3 weeks, while keeping the temperature violation rate at a low level.
Figure 5: Visualization of the different weather conditions. The yellow line is the Buffalo weather, the green line is the LA weather, the blue line is the Riverside weather, and the red lines are the comfortable temperature boundaries.
Building | Source | Target | EP | Aθ | Mθ | ✓ | Cost
4-zone 1 | LA | LA | 150 | 0.68% | 1.71% | ✓ | 82.01
4-zone 1 | Buffalo | Buffalo | 150 | 0.64% | 1.14% | ✓ | 101.79
4-zone 1 | Riverside | Riverside | 150 | 0.02% | 0.04% | ✓ | 297.42
4-zone 1 | Riverside | LA | 0 | 0.0% | 0.0% | ✓ | 105.17
4-zone 1 | Riverside | Buffalo | 0 | 0.0% | 0.0% | ✓ | 134.28
4-zone 1 | LA | Riverside | 0 | 71.77% | 89.34% | × | 158.06
4-zone 1 | Buffalo | Riverside | 0 | 54.92% | 81.89% | × | 180.20

Table 7: Transferring between different weather conditions.
Method | Upper bound (°C) | EP | Aθ | Mθ | Cost
ON-OFF | 23 | 0 | 0.01% | 0.02% | 373.78
ON-OFF | 24 | 0 | 61.45% | 73.69% | 256.46
ON-OFF | 25 | 0 | 98.56% | 99.99% | 208.79
Ours | 23 | 0 | 0.02% | 0.07% | 338.45
Ours | 24 | 0 | 0.02% | 0.07% | 338.08
Ours | 25 | 0 | 0.02% | 0.07% | 338.08

Table 8: Results of testing using different boundary settings.
bound of ON-OFF control to be within the human comfortable temperature boundary (what we use for our method), the same as the human comfortable temperature boundary, or even a little outside of the boundary to save energy cost. Thus, we tested the performance of our method by collecting data under different ON-OFF boundary settings. The results in Table 8 show that with different boundary settings, supervised learning can stably learn the building-specific behaviors.
5 CONCLUSION
In this paper, we present a novel transfer learning approach that decomposes the design of the neural network based HVAC controller into two sub-networks: a building-agnostic front-end network that can be directly transferred, and a building-specific back-end network that can be efficiently trained with offline supervised learning. Our approach successfully transfers the DRL-based building HVAC controller from source buildings to target buildings that can have a different number of thermal zones, different materials and layouts, different HVAC equipment, and even different weather conditions in certain cases.
ACKNOWLEDGMENTS
We gratefully acknowledge the support from Department of Energy
(DOE) award DE-EE0009150 and National Science Foundation (NSF)
award 1834701.
REFERENCES
[1]
Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob Mc-
Grew, Arthur Petron, Alex Paino,Matthias P lappert, Glenn Powell, Raphael Ribas,
et al
.
2019. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113
(2019).
[2]
Enda Barrett and Stephen Linder. 2015. Autonomous HVAC Control, A Reinforce-
ment Learning Approach. Springer.
[3]
Yujiao Chen, Zheming Tong, Yang Zheng, Holly Samuelson, and Leslie Norford.
2020. Transfer learning with deep neural networks for model predictive control
of HVAC and natural ventilation in smart buildings. Journal of Cleaner Production
254 (2020), 119866.
[4]
Giuseppe Tommaso Costanzo, Sandro Iacovella, Frederik Ruelens, Tim Leurs,
and Bert J Claessens. 2016. Experimental analysis of data-driven control for a
building heating system. Sustainable Energy, Grids and Networks 6 (2016), 81–90.
[5]
Drury B. Crawley, Curtis O. Pedersen, Linda K. Lawrie, and Frederick C. Winkel-
mann. 2000. EnergyPlus: Energy Simulation Program. ASHRAE Journal 42
(2000).
[6]
Felipe Leno Da Silva and Anna Helena Reali Costa. 2019. A survey on transfer
learning for multiagent reinforcement learning systems. Journal of Articial
Intelligence Research 64 (2019), 645–703.
[7]
Pedro Fazenda, Kalyan Veeramachaneni, Pedro Lima, and Una-May O’Reilly.
2014. Using reinforcement learning to optimize occupant comfort and energy
usage in HVAC systems. Journal of Ambient Intelligence and Smart Environments
(2014), 675–690.
[8]
Guanyu Gao, Jie Li, and Yonggang Wen. 2019. Energy-ecient thermal com-
fort control in smart buildings via deep reinforcement learning. arXiv preprint
arXiv:1901.04693 (2019).
[9]
Guanyu Gao, Jie Li, and Yonggang Wen. 2020. DeepComfort: Energy-Ecient
Thermal Comfort Control in Buildings via Reinforcement Learning. IEEE Internet
of Things Journal (2020).
[10]
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. 6.5 Back-Propagation
and Other Dierentiation Algorithms. Deep Learning (2016), 200–220.
[11]
Abhishek Gupta, Coline Devin, YuXuan Liu, Pieter Abbeel, and Sergey Levine.
2017. Learning invariant feature spaces to transfer skills with reinforcement
learning. ICLR (2017).
[12]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep
into rectiers: Surpassing human-level performance on imagenet classication.
In Proceedings of the IEEE international conference on computer vision. 1026–1034.
[13]
Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal
Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al
.
2018. Deep
q-learning from demonstrations. In AAAI.
[14]
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic opti-
mization. arXiv preprint arXiv:1412.6980 (2014).
[15]
Neil E Klepeis, William C Nelson, Wayne R Ott, John P Robinson, Andy M
Tsang, Paul Switzer, Joseph V Behar, Stephen C Hern, and William H Engelmann.
2001. The National Human Activity Pattern Survey (NHAPS): a resource for
assessing exposure to environmental pollutants. Journal of Exposure Science &
Environmental Epidemiology 11, 3 (2001), 231–252.
[16]
B. Li and L. Xia. 2015. A multi-grid reinforcement learning method for energy
conservation and comfort of HVAC in buildings. IEEE International Conference
on Automation Science and Engineering (CASE), 444–449.
[17]
Yuanlong Li, Yonggang Wen, Dacheng Tao, and Kyle Guan. 2019. Transforming
cooling optimization for green data center via deep reinforcement learning. IEEE
transactions on cybernetics 50, 5 (2019), 2002–2013.
[18]
Paulo Lissa, Michael Schukat, and Enda Barrett. 2020. Transfer Learning Applied
to Reinforcement Learning-Based HVAC Control. SN Computer Science 1 (2020).
[19]
Y. Ma, F. Borrelli, B. Hencey, B. Coey, S. Bengea, and P. Haves. 2012. Model Pre-
dictive Control for the Operation of Building Cooling Systems. IEEE Transactions
on Control Systems Technology 20, 3 (2012), 796–803.
[20]
Mehdi Maasoumy, Alessandro Pinto, and Alberto Sangiovanni-Vincentelli. 2011.
Model-based hierarchical optimal control design for HVAC systems. In Dynamic
Systems and Control Conference, Vol. 54754. 271–278.
[21]
Mehdi Maasoumy, M Razmara, M Shahbakhti, and A Sangiovanni Vincentelli.
2014. Handling model uncertainty in model predictive control for energy ecient
buildings. Energy and Buildings 77 (2014), 377–392.
[22]
Mehdi Maasoumy, Meysam Razmara, Mahdi Shahbakhti, and Alberto Sangio-
vanni Vincentelli. 2014. Selecting building predictive control based on model
uncertainty. In 2014 American Control Conference. IEEE, 404–411.
[23]
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness,
Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg
Ostrovski, et al
.
2015. Human-level control through deep reinforcement learning.
nature 518, 7540 (2015), 529–533.
[24]
Aviek Naug, Ibrahim Ahmed, and Gautam Biswas. 2019. Online energy manage-
ment in commercial buildings using deep reinforcement learning. In 2019 IEEE
International Conference on Smart Computing (SMARTCOMP). IEEE, 249–257.
[25]
D Nikovski, J Xu, and M Nonaka. 2013. A method for computing optimal set-point
schedules for HVAC systems. In REHVA World Congress CLIMA.
[26] U.S. Department of Energy. 2011. Buildings energy data book.
[27]
Saran Salakij, Na Yu, Samuel Paolucci, and Panos Antsaklis. 2016. Model-Based
Predictive Control for building energy management. I: Energy modeling and
optimal control. Energy and Buildings 133 (2016), 345–358.
[28]
T. Wei, S. Ren, and Q. Zhu. 2019. Deep Reinforcement Learning for Joint Dat-
acenter and HVAC Load Control in Distributed Mixed-Use Buildings. IEEE
Transactions on Sustainable Computing (2019), 1–1.
[29]
Tianshu Wei, Yanzhi Wang, and Qi Zhu. 2017. Deep reinforcement learning for
building HVAC control. In Proceedings of the 54th Annual Design Automation
Conference 2017. 1–6.
[30]
Tianshu Wei, Qi Zhu, and Nanpeng Yu. 2015. Proactive demand participation of
smart buildings in smart grid. IEEE Trans. Comput. 65, 5 (2015), 1392–1406.
[31]
Michael Wetter. 2011. Co-simulation of building energy and control systems
with the Building Controls Virtual Test Bed. Journal of Building Performance
Simulation 4, 3 (2011), 185–203.
[32]
Stephen Wilcox and William Marion. 2008. Users manual for TMY3 data sets.
(2008).
[33] Yu Yang, Seshadhri Srinivasan, Guoqiang Hu, and Costas J Spanos. 2020. Distributed Control of Multi-zone HVAC Systems Considering Indoor Air Quality. arXiv preprint arXiv:2003.08208 (2020).
[34] Liang Yu, Yi Sun, Zhanbo Xu, Chao Shen, Dong Yue, Tao Jiang, and Xiaohong Guan. 2020. Multi-Agent Deep Reinforcement Learning for HVAC Control in Commercial Buildings. IEEE Transactions on Smart Grid (2020).
[35] Yusen Zhan and Matthew E Taylor. 2015. Online transfer learning in reinforcement learning domains. In 2015 AAAI Fall Symposium Series.
[36] Zhiang Zhang, Adrian Chong, Yuqi Pan, Chenlu Zhang, Siliang Lu, and Khee Poh Lam. 2018. A deep reinforcement learning approach to using whole building energy model for HVAC optimal control. In 2018 Building Performance Analysis Conference and SimBuild, Vol. 3. 22–23.
[37] Zhiang Zhang and Khee Poh Lam. 2018. Practical implementation and evaluation of deep reinforcement learning control for a radiant heating system. In Proceedings of the 5th Conference on Systems for Built Environments. 148–157.
The majority of today's power-hungry datacenters are physically co-located with office rooms in mixed-use buildings (MUBs). The heating, ventilation and air conditioning (HVAC) system within each MUB is often shared or partially-shared between datacenter rooms and office zones, for removing the heat generated by computing equipment and maintaining desired room temperature for building tenants. To effectively reduce the total energy cost of MUBs, it is important to leverage the scheduling flexibility in both the HVAC system and the datacenter workload. In this work, we formulate both HVAC control and datacenter workload scheduling as a Markov decision process (MDP), and propose a deep reinforcement learning (DRL) based algorithm for minimizing the total energy cost while maintaining desired room temperature and meeting datacenter workload deadline constraints. Moreover, we also develop a heuristic DRL-based algorithm to enable interactive workload allocation among geographically distributed MUBs for further energy reduction. The experiment results demonstrate that our regular DRL-based algorithm can achieve up to 26.9% cost reduction for a single MUB, when compared with a baseline strategy. Our heuristic DRL-based algorithm can reduce the total energy cost by an additional 5.5%, when intelligently allocating interactive workload for multiple geographically distributed MUBs.