One for Many: Transfer Learning for Building HVAC Control
Shichao Xu
Northwestern University
Evanston, USA
shichaoxu2023@u.northwestern.edu
Yixuan Wang
Northwestern University
Evanston, USA
yixuanwang2024@u.northwestern.edu
Yanzhi Wang
Northeastern University
Boston, USA
yanz.wang@northeastern.edu
Zheng O’Neill
Texas A&M University
College Station, USA
zoneill@tamu.edu
Qi Zhu
Northwestern University
Evanston, USA
qzhu@northwestern.edu
ABSTRACT
The design of building heating, ventilation, and air conditioning (HVAC) systems is critically important, as HVAC accounts for around half of building energy consumption and directly affects occupant comfort, productivity, and health. Traditional HVAC control methods are typically based on creating explicit physical models for building thermal dynamics, which often require significant effort to develop and struggle to achieve sufficient accuracy and efficiency for runtime building control, as well as scalability for field implementations. Recently, deep reinforcement learning (DRL) has emerged as a promising data-driven method that provides good control performance without analyzing physical models at runtime. However, a major challenge to DRL (and many other data-driven learning methods) is the long training time it takes to reach the desired performance. In this work, we present a novel transfer learning based approach to overcome this challenge. Our approach can effectively transfer a DRL-based HVAC controller trained for the source building to a controller for the target building with minimal effort and improved performance, by decomposing the design of the neural network controller into a transferable front-end network that captures building-agnostic behavior and a back-end network that can be efficiently trained for each specific building. We conducted experiments on a variety of transfer scenarios between buildings with different sizes, numbers of thermal zones, materials and layouts, air conditioner types, and ambient weather conditions. The experimental results demonstrate the effectiveness of our approach in significantly reducing the training time, energy cost, and temperature violations.
CCS CONCEPTS
• Computing methodologies → Reinforcement learning; • Computer systems organization → Embedded and cyber-physical systems.
KEYWORDS
Smart Buildings, HVAC control, Data-driven, Deep reinforcement
learning, Transfer learning
1 INTRODUCTION
The building stock accounts for around 40% of the annual energy consumption in the United States, and nearly half of the building energy is consumed by the heating, ventilation, and air conditioning (HVAC) system [26]. On the other hand, average Americans spend approximately 87% of their time indoors [15], where the operation of the HVAC system has a significant impact on their comfort, productivity, and health. Thus, it is critically important to design HVAC control systems that are both energy efficient and able to maintain the desired temperature and indoor air quality for occupants.
In the literature, there is an extensive body of work addressing the control design of building HVAC systems [20, 27, 30, 33].
Most of them use model-based approaches that create simplified physical models to capture building thermal dynamics for efficient HVAC control. For instance, resistor-capacitor (RC) networks are used for modeling building thermal dynamics in [20-22], and linear-quadratic regulator (LQR) or model predictive control (MPC) based approaches are developed accordingly for efficient runtime control. However, creating a simplified yet sufficiently accurate physical model for runtime HVAC control is often difficult, as building room air temperature is affected in complex ways by a number of factors, including building layout, structure, construction and materials, surrounding environment (e.g., ambient temperature, humidity, and solar radiation), internal heat generation from occupants, lighting, and appliances, etc. Moreover, it takes significant effort and time to develop explicit physical models, find the right parameters, and update the models over the building lifecycle [28].
The drawbacks of model-based approaches have motivated the development of data-driven HVAC control methods that do not rely on analyzing physical models at runtime but rather directly make decisions based on input data. A number of data-driven methods such as reinforcement learning (RL) have been proposed in the literature, including more traditional methods that leverage classical Q-learning techniques and perform optimization based on a tabular Q value function [2, 16, 25], earlier works that utilize neural networks [4, 7], and more recent deep reinforcement learning (DRL) methods [8, 9, 17, 24, 29, 36, 37]. In particular, the DRL-based methods leverage deep neural networks for estimating the Q values associated with state-action pairs and are able to handle a larger state space than traditional RL methods [28]. They have emerged as a promising solution that offers good HVAC control performance without analyzing physical models at runtime.
However, there are major challenges in deploying DRL-based methods in practice. Given the complexity of modern buildings, it could take a significant amount of training for DRL models to reach the desired performance. For instance, around 50 to 100 months of data are needed for training the models in [28, 29], and 4000+ months of data are used for more complex models [9, 34]. Even if this could be drastically reduced to a few months or weeks, directly deploying DRL models on operational buildings and taking so long before reaching the desired performance is impractical. The works in [28, 29] thus propose to first use detailed and accurate physical models (e.g., EnergyPlus [5]) for offline simulation-based training before the deployment. While such an approach can speed up the training process, it still requires the development and update of detailed physical models, which as stated above needs significant domain expertise, effort, and time.
arXiv:2008.03625v2 [eess.SY] 20 Oct 2020
To address the challenges in DRL training for HVAC control, we propose a transfer learning based approach in this paper, to utilize existing models (that had been trained for old buildings) in the development of DRL methods for new buildings. This is not a straightforward process, however. Different buildings may have different sizes, numbers of thermal zones, materials and layouts, and HVAC equipment, and may operate under different ambient weather conditions. As shown later in the experiments, directly transferring models between such different buildings is not effective. In the literature, a few works have explored transfer learning for buildings. In [3], a building temperature and humidity prediction model is learned via supervised learning, transferred to new buildings with further tuning, and utilized in an MPC algorithm. The work in [18] investigates the transfer of Q-learning for building HVAC control under different weather conditions and with different room sizes, but it is limited to single-room buildings. The use of a Q-table in conventional Q-learning also limits the memory for state-action pairs and makes it unsuitable for complex buildings.
Our work addresses these limitations in the literature, and develops for the first time a Deep Q-Network (DQN) based transfer learning approach for multiple-zone buildings. Our approach avoids the development of physical models, significantly reduces the DRL training time via transfer learning, and is able to reduce energy cost while maintaining room temperatures within desired bounds. More specifically, our work makes the following contributions:
• We propose a novel transfer learning approach that decomposes the design of the neural network based HVAC controller into two (sub-)networks. The front-end network captures building-agnostic behavior and can be directly transferred, while the back-end network can be efficiently trained for each specific building in an offline supervised manner by leveraging a small amount of data from existing controllers (e.g., a simple on-off controller).
• Our approach requires little to no further tuning of the transferred DRL model after it is deployed in the new building, thanks to the two-subnetwork design and the offline supervised training of the back-end network. This avoids the initial cold start period where the HVAC control may be unstable and unpredictable.
• We have performed a number of experiments for evaluating the effectiveness of our approach under various scenarios. The results demonstrate that our approach can effectively transfer between buildings with different sizes, numbers of thermal zones, materials and layouts, and HVAC equipment, as well as under different weather conditions in certain cases. Our approach could enable fast deployment of DRL-based HVAC control with little training time after transfer, and reduce building energy cost with minimal violation of temperature constraints.
The rest of the paper is structured as follows. Section 2 provides a more detailed review of related work. Section 3 presents our approach, including the design of the two networks and the corresponding training methods. Section 4 shows the experiments for different transfer scenarios and other related ablation studies. Section 5 concludes the paper.
2 RELATED WORK
Model-based and Data-driven HVAC Control. There is a rich literature in HVAC control design, where the approaches generally fall into two main categories, i.e., model-based and data-driven.
Traditional model-based HVAC control approaches typically build explicit physical models for the controlled buildings and their surrounding environment, and then design control algorithms accordingly [20, 27]. For instance, the work in [19] presents a nonlinear model for the overall cooling system, which includes chillers, cooling towers, and thermal storage tanks, and then develops an MPC-based approach for reducing building energy consumption. The work in [20] models the building thermal dynamics as RC networks, calibrates the model based on historical data, and then presents a tracking LQR approach for HVAC control. Similar simplified models have been utilized in other works [21, 22, 30] for HVAC control and for co-scheduling HVAC operation with other energy demands and power supplies. While being efficient, these simplified models often do not provide sufficient accuracy for effective runtime control, given the complex relation between building room air temperature and various factors of the building itself (e.g., layout, structure, construction and materials), its surrounding environment (e.g., ambient temperature, humidity, solar radiation), and internal operation (e.g., heat generation from occupants, lighting and appliances). More accurate physical models can be built and simulated with tools such as EnergyPlus [5], but those models are typically too complex to be used for runtime control.
Data-driven approaches have thus emerged in recent years due to their advantage of not requiring explicit physical models at runtime. These approaches often leverage various machine learning techniques, in particular reinforcement learning. For instance, in [29, 37], DRL is applied to building HVAC control and an EnergyPlus model is leveraged for simulation-based offline training of DRL. In [8, 36], DRL approaches leveraging actor-critic methods are applied. The works in [9, 24] use data-driven methods to approximate/learn the energy consumption and occupants' satisfaction under different thermal conditions, and then apply DRL to learn an end-to-end HVAC control policy. These DRL-based methods are shown to be effective at reducing energy cost and maintaining desired temperature, and are sufficiently efficient at runtime. However, they often take a long training time to reach the desired performance, needing dozens to hundreds of months of data for training [28, 29] or even longer [9, 34]. Directly deploying them in real buildings for such a long training process is obviously not practical. Leveraging tools such as EnergyPlus for offline simulation-based training can mitigate this issue, but again incurs the need for the expensive and sometimes error-prone process of developing accurate physical models (needed for simulation in this case). These challenges have motivated this work to develop a transfer learning approach for efficient and effective DRL control of HVAC systems.
Transfer Learning for HVAC Control. A few works have explored transfer learning in building HVAC control. In [18], transfer learning of a Q-learning agent is studied; however, only a single room (thermal zone) is considered. The use of a tabular entry for each state-action pair in traditional Q-learning in fact limits the approach's capability to handle high-dimensional data. In [3], a neural network model for predicting temperature and humidity is learned in a supervised manner and transferred to new buildings for MPC-based control. The approach also focuses on single-zone buildings and requires further tuning after the deployment of the controller.
Different from these earlier works in transfer learning for HVAC control, our approach addresses multi-zone buildings and considers transfer between buildings with different sizes, numbers of thermal zones, layouts and materials, HVAC equipment, and ambient weather conditions. It also requires little to no further tuning after the transfer. This is achieved with a novel DRL controller design with two sub-networks and the corresponding training methods.
Transfer Learning in DRL. Since our approach considers transfer learning for DRL, it is worth noting some of the work on DRL-based transfer learning in other domains [1, 6, 11, 35]. For instance, in [11], the distribution of optimal trajectories across similar robots is matched for transfer learning in robotics. In [1], an environment randomization approach is proposed, where DRL agents trained in simulation with a large number of generated environments can be successfully transferred to their real-world applications. To the best of our knowledge, our work is the first to propose DRL-based transfer learning for multi-zone building HVAC control. It addresses the unique challenges in the building domain, e.g., designing a novel two-subnetwork controller to avoid the complexity and cost of creating accurate physical models for simulation.
3 OUR APPROACH
We present our transfer learning approach in this section, including the design of the two-subnetwork controller and the training process. Section 3.1 introduces the system model. Section 3.2 provides an overview of our methodology. Section 3.3 presents the design of the building-agnostic front-end (sub-)network, and Section 3.4 explains the design of the building-specific back-end (sub-)network.
3.1 System Model
The goal of our work is to build a transferable HVAC control system that can maintain comfortable room air temperature within desired bounds while reducing the energy cost. We adopt a building model similar to the one used in [29]: an n-zone building model with a variable air volume (VAV) HVAC system. The system provides conditioned air at a flow rate chosen from m discrete levels. Thus, the entire action space for the n-zone controller can be described as A = {a1, a2, ..., an}, where ai (1 ≤ i ≤ n) is chosen from the m VAV levels {f1, f2, ..., fm}. Note that the size of the action space (m^n) increases exponentially with respect to the number of thermal zones n, which presents a significant challenge to DRL control for larger buildings. We address this challenge in the design of our two-subnetwork DRL controller by avoiding setting the size of the neural network action output layer to m^n. This will be explained further later.
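To make the exponential growth concrete, here is a small sketch (not from the paper; the function name is ours) that enumerates the joint action space of an n-zone VAV controller with m flow levels per zone:

```python
# Sketch: the joint action space of an n-zone controller where each zone
# independently picks one of m discrete VAV flow levels. Its size is m**n,
# which is why the paper avoids an output layer of that size.
from itertools import product

def joint_action_space(n, m):
    """Enumerate all joint actions as n-tuples of per-zone level indices."""
    return list(product(range(m), repeat=n))

assert len(joint_action_space(2, 5)) == 5 ** 2
assert len(joint_action_space(4, 5)) == 5 ** 4  # already 625 joint actions
```

Even modest buildings make explicit enumeration infeasible, e.g., 10 zones with 5 levels each already yield about 9.8 million joint actions.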
The DRL action is determined by the current system state. In our model, the system state includes the current physical time t, the inside state S_in, and the outside environment state S_out. The inside state S_in includes the temperature of each thermal zone, denoted as {T1, T2, ..., Tn}. The outside environment state S_out includes the ambient temperature and the solar irradiance (radiation intensity). Similar to [29], to improve DRL performance, S_out includes not only the current values of the ambient temperature T_out and the solar irradiance Sun_out, but also their weather forecast values for the next three days. Thus, the outside environment state is denoted as S_out = {T_out^0, T_out^1, T_out^2, T_out^3, Sun_out^0, Sun_out^1, Sun_out^2, Sun_out^3}. Our current model does not consider internal heat generation from occupants, a limitation that we plan to address in future work.
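A minimal sketch of how a per-zone input could be assembled from the states above (the helper name is ours; the 10-feature layout is our reading, consistent with the front-end input size of 10 listed in Table 1):

```python
# Sketch: per-zone DRL input built from physical time t, the zone's own
# temperature, and the ambient temperature / solar irradiance with their
# 3-day forecasts. The resulting 10 features match the first layer size
# of the front-end network in Table 1.
def zone_state(t, zone_temp, t_out, sun_out):
    assert len(t_out) == 4 and len(sun_out) == 4  # current + 3-day forecast
    return [t, zone_temp] + list(t_out) + list(sun_out)

s = zone_state(8.5, 21.7, [30.1, 29.0, 28.5, 31.2], [800, 750, 600, 820])
assert len(s) == 10
```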
3.2 Methodology Overview
We started our work by considering whether it is possible to directly transfer a well-trained DQN model for a single-zone source building to every zone of a target multiple-zone building. However, based on our experiments (shown later in Table 2 of Section 4), such a straightforward approach is not effective at all, leading to significant temperature violations. This is perhaps not surprising. In DQN-based reinforcement learning, a neural network Q maps the input I = {I1, I2, ..., In}, where Ii is the state for each zone i, to the control action output A. The network Q is optimized based on a reward function that considers energy cost and temperature violation. Through training, Q learns a control strategy that incorporates the consideration of building thermal dynamics, including the building-specific characteristics. Directly applying Q to a new target building, which may have totally different characteristics and dynamics, will not be effective in general.
Thus, our approach designs a novel architecture that includes two sub-networks, with an intermediate state ΔT that indicates a predictive value of the controller's willingness to change the indoor temperature. The front-end network Q maps the inputs I to the intermediate state ΔT. It is trained to capture the building-agnostic part of the control strategy, and is directly transferable. The back-end network then maps ΔT, together with I, to the control action output A. It is trained to capture the building-specific part of the control, and can be viewed as an inverse building network F^-1. An overview of our approach is illustrated in Figure 1.
3.3 Front-end Building-agnostic Network
Design and Training
We introduce the design of our front-end network Q and its training in this section. Q is itself composed of n (sub-)networks, where n is the number of building thermal zones. Each zone in the building model has its corresponding sub-network, and all sub-networks share their weights. In each sub-network for thermal zone i, the input layer accepts state Ii. It is followed by L sequentially-connected fully-connected layers (the exact numbers of neurons are presented later in Table 1 of Section 4). Rather than directly giving the control action likelihood vector, the network's output layer reflects a planned temperature change value ΔTi for each zone.
[Figure 1 appears here: diagrams of the source and target building control models. In each, the system state I passes through a weight-sharing front-end network producing ΔT, then through a back-end network producing the actions A; the front-end network is directly copied from source to target, and each back-end network is trained by supervised learning on data collected from an ON-OFF controller during a warm-up period.]

Figure 1: Overview of our DRL-based transfer learning approach for HVAC control. We design a novel DQN architecture that includes two sub-networks: a front-end network Q captures the building-agnostic part of the control as much as possible, while a back-end network (inverse building network) F^-1 captures the building-specific behavior. At each control step, the front-end network Q maps the current system state I to an intermediate state ΔT. Then, the back-end network F^-1 maps ΔT, together with I, to the control action outputs A. During transfer learning from a source building to a target building, the front-end network Q is directly transferable. The back-end network F^-1 can be trained in a supervised manner, with data collected from an existing controller (e.g., a simple ON-OFF controller). Experiments have shown that around two weeks of data is sufficient for such supervised training of F^-1. If it is a brand new building without any existing controller, we can deploy a simple ON-OFF controller for two weeks in a "warm-up" process. During this process, the ON-OFF controller can maintain the temperature within the desired bounds (albeit with higher cost), and collect data that captures the building-specific behavior for training F^-1.

More specifically, the output of the last layer is designed as a vector O_ΔTi of length h + 2 in one-hot representation: the planned temperature change range is equally divided into h intervals within a predefined temperature range [-b, b], and two additional intervals cover the values outside that range. The relationship between the planned temperature change value ΔTi of zone i and the output vector O_ΔTi is as follows:

O_ΔTi = <1, 0, ..., 0>,            if ΔTi ≤ -b,
O_ΔTi = <0, ..., 0, 1, 0, ..., 0>, if -b < ΔTi < b (the 1 is at the position of the interval containing ΔTi, i.e., at index ⌊(ΔTi + b)/(2b/h)⌋ + 1, counting from zero),
O_ΔTi = <0, ..., 0, 1>,            if ΔTi ≥ b.    (1)

Then, for the entire front-end network Q, the combined input is I = {I1, I2, ..., In}, and the combined output is O_ΔT = {O_ΔT1, O_ΔT2, ..., O_ΔTn}.
It is worth noting that if we had designed the front-end network as a standard deep Q-learning model [23], it would take I as the network's input, pass it through several fully-connected layers, and output the selection among an action space of size (h + 2)^n (as there are n zones, and each has h + 2 possible actions). It would also need an equal number of neurons in the last layer, which is not affordable when the number of zones gets large. Instead, in our design, the last layer of the front-end network Q has its size reduced to (h + 2) * n, which can be further reduced to (h + 2) with the following weight-sharing technique.
We let the n sub-networks of Q share their weights during training. One benefit of this design is that it enables transferring the front-end network of an n-zone source building to a target m-zone building, where m could be different from n. It also reduces the training load by lowering the number of parameters. Such a design performs well in our experiments.
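The weight-sharing idea can be illustrated with a minimal numpy forward pass (an illustrative sketch only: layer sizes follow Table 1, weights are random, and the deep Q-learning training itself is omitted; the paper's implementation uses PyTorch):

```python
# Sketch: one shared set of weights is applied to every zone's 10-dim
# state, so the parameter count is independent of the number of zones n
# and the same front-end transfers between buildings with different n.
import numpy as np

def init_frontend(sizes=(10, 128, 256, 256, 256, 400, 22), seed=0):
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((a, b)) * 0.01, np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def frontend_forward(params, zone_states):
    """zone_states: (n_zones, 10) -> (n_zones, h + 2) Q-values per zone."""
    x = zone_states
    for i, (w, bias) in enumerate(params):
        x = x @ w + bias
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers
    return x

params = init_frontend()
out = frontend_forward(params, np.ones((3, 10)))
assert out.shape == (3, 22)
# identical zone states yield identical outputs: the weights are shared
assert np.allclose(out[0], out[1])
```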
Our front-end network Q is trained with standard deep Q-learning techniques [23]. Note that while the output of Q is the planned temperature change vector O_ΔT, the training process uses a dynamic reward R_t that depends on the eventual action (i.e., the output of network F^-1), which will be introduced later in Section 3.4. Specifically, the training of the front-end network Q follows Algorithm 1 (the hyper-parameters used are listed later in Table 1 of Section 4). First, we initialize Q by following the weight initialization method described in [12] and copy its weights to the target network Q' (the target network Q' is a technique in deep Q-learning used for improving performance). The back-end network F^-1 is initialized following Algorithm 2 (introduced later in Section 3.4). We also empty the replay buffer and set the exploration rate ε to 1.
At each control instant t during a training epoch, we obtain the current system state S_cur = (t, S_in, S_out) and calculate the current reward R_t. We then collect the learning sample (experience) (S_pre, S_cur, ΔT, A, R) and store it in the replay buffer. In the subsequent learning-related operations, we first sample a data batch M = (S_prime, S_next, a, r) from the replay buffer, and calculate the actual temperature change value ΔT_a from S_prime and S_next. Then, we get the predicted control action from the back-end network F^-1, i.e., a_p = F^-1(ΔT_a, S_prime). In this way, the cross-entropy loss can be calculated from the true label a and the predicted label a_p. We then use supervised learning to update the back-end network F^-1 with the Adam optimizer [14] under learning rate lr2.
We follow the same procedure as described in [23] to calculate the target vector v that is used in deep Q-learning. With target vector v and input state S_prime, we can then train Q using the back-propagation method [10] with mean squared error loss and learning rate lr1. With a period of Δnt, we assign the weights of Q to the target network Q'. The exploration rate is updated as ε = max{ε_low, ε - Δε}. It is used in the ε-greedy policy to select each planned temperature change value ΔTi:

ΔTi = argmax O_ΔTi with probability 1 - ε; random(0 to h + 1) with probability ε.    (2)

ΔT = {ΔT1, ΔT2, ..., ΔTn}.    (3)

The control action A is obtained from the back-end network:

A = F^-1(ΔT, S_cur).    (4)
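The per-zone ε-greedy selection of Eq. (2) can be sketched as follows (helper names are ours):

```python
# Sketch of Eq. (2): per zone, pick the argmax of the (h + 2)-way output
# with probability 1 - epsilon, or a uniformly random index otherwise.
import random

def select_delta_t(o_delta_t_rows, epsilon, h=20):
    """o_delta_t_rows: per-zone Q-value vectors of length h + 2."""
    choices = []
    for row in o_delta_t_rows:
        if random.random() < epsilon:
            choices.append(random.randint(0, h + 1))  # explore
        else:
            choices.append(max(range(len(row)), key=row.__getitem__))
    return choices

random.seed(0)
greedy = select_delta_t([[0.1, 0.9, 0.2], [0.3, 0.1, 0.8]], epsilon=0.0, h=1)
assert greedy == [1, 2]  # pure argmax per zone when never exploring
```

During training, ε is annealed from 1 down to ε_low, shifting from exploration to exploitation.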
3.4 Back-end Building-specific Network Design and Training
The objective of the back-end network is to map the planned temperature change vector O_ΔT (or ΔT), together with the system state I, into the control action A. Consider that during operation, a building environment "maps" the control action and system state to the actual temperature change value. So in a way, the back-end network can be viewed as doing the inverse of what a building environment does, i.e., it can be viewed as an inverse building network F^-1.
The network F^-1 receives the planned temperature change value ΔT and the system state I at its input layer. It is followed by L' fully-connected layers (the exact numbers for experimentation are specified in Table 1 of Section 4). It outputs a likelihood control action vector O_A = {v1, v2, ..., vn}, which can be divided into n groups. Group i is a one-hot vector vi corresponding to the control action for zone i. The length of vi is m, as there are m possible control actions for each zone as defined earlier. When O_A is provided, the control action A can be easily calculated by applying the argmax operation to each group in O_A, i.e., A = {argmax{v1}, argmax{v2}, ..., argmax{vn}}.
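The per-group argmax decoding can be sketched as follows (function name is ours):

```python
# Sketch: decode the back-end likelihood output O_A, a flat vector of
# n groups of m entries, into one VAV-level index per zone by taking
# the argmax within each zone's m-way group.
def decode_actions(o_a, n, m):
    """o_a: flat likelihood vector of length n*m -> list of n action ids."""
    assert len(o_a) == n * m
    actions = []
    for i in range(n):
        group = o_a[i * m:(i + 1) * m]
        actions.append(max(range(m), key=group.__getitem__))
    return actions

# two zones, three VAV levels each
assert decode_actions([0.1, 0.7, 0.2, 0.9, 0.05, 0.05], n=2, m=3) == [1, 0]
```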
The network F^-1 is integrated with the reward function R_t:

R_t = w_cost * R_cost_t + w_vio * R_vio_t,    (5)

where R_cost_t is the reward for energy cost at time step t and w_cost is the corresponding scaling factor, and R_vio_t is the reward for zone temperature violation at time step t and w_vio is its scaling factor. The two rewards are further defined as:

R_cost_t = -cost(F^-1(ΔT_{t-1}), t - 1).    (6)

R_vio_t = -Σ_{i=1}^{n} [max(T_t^i - T_upper, 0) + max(T_lower - T_t^i, 0)].    (7)
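A sketch of Eqs. (5)-(7); the energy-cost term depends on the building and its electricity tariff, so it is passed in as a precomputed number here, and the default weights follow Table 1 (w_cost = 1/1000, w_vio = 1/1600):

```python
# Sketch of the reward: a weighted sum of (negated) energy cost and
# (negated) total temperature-bound violation across all zones.
def violation_reward(zone_temps, t_lower=19.0, t_upper=24.0):
    # Eq. (7): degrees above the upper bound plus degrees below the lower
    return -sum(max(t - t_upper, 0.0) + max(t_lower - t, 0.0)
                for t in zone_temps)

def step_reward(energy_cost, zone_temps, w_cost=1 / 1000, w_vio=1 / 1600):
    # Eq. (5) with Eq. (6): R_cost is the negated energy cost
    return w_cost * (-energy_cost) + w_vio * violation_reward(zone_temps)

assert violation_reward([20.0, 23.0]) == 0.0   # all zones in-bounds
assert violation_reward([25.0, 18.0]) == -2.0  # 1 deg over + 1 deg under
assert step_reward(0.0, [20.0]) == 0.0
```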
Algorithm 1 Training of front-end network Q
1: ep: the number of training epochs
2: Δct: the control period
3: t_MAX: the maximum training time of an epoch
4: Δnt: the time interval to update the target network
5: Empty replay buffer
6: Initialize Q; set the weights of target network Q' = Q; initialize F^-1 based on Algorithm 2
7: Initialize the current planned temperature change vector ΔT
8: Initialize previous state S_pre
9: Initialize exploration rate ε
10: for Epoch = 1 to ep do
11:   for t = 0 to t_MAX, t += Δct do
12:     S_cur ← (t, S_in, S_out)
13:     Calculate reward R
14:     Add experience (S_pre, S_cur, ΔT, A, R) to the replay buffer
15:     for tr = 0 to L_MAX do
16:       Sample a batch M = (S_prime, S_next, a, r)
17:       Calculate actual temperature change value ΔT_a
18:       Predicted label a_p = F^-1(ΔT_a, S_prime)
19:       Set loss L = CrossEntropyLoss(a_p, a)
20:       Update F^-1 with loss L and learning rate lr2
21:       Target v ← target network Q'(S_prime)
22:       Train network Q with S_prime and v
23:     end for
24:     if t mod Δnt == 0 then
25:       Update target network Q'
26:     end if
27:     O_ΔT = Q(S_cur)
28:     Update exploration rate ε
29:     Update each ΔTi following the ε-greedy policy
30:     ΔT = <ΔT1, ΔT2, ..., ΔTn>
31:     Control action A ← F^-1(ΔT, S_cur)
32:     S_pre = S_cur
33:   end for
34: end for
Here, cost(·,·) is a function that calculates the energy cost within a control period according to the local electricity price, which changes over time. ΔT_{t-1} is the planned temperature change value at time t - 1. T_t^i is the zone i temperature at time t. T_upper and T_lower are the upper and lower bounds of the comfortable temperature range, respectively.
As stated before, F^-1 can be trained in a supervised manner. We could also directly deploy our DRL controller with the transferred front-end network Q and an initially-randomized back-end network F^-1; but we have found that leveraging data collected from the existing controller of the target building for offline supervised learning of F^-1 before deployment provides significantly better results than starting with a random F^-1. This is because the data from the existing controller provides insights into the building-specific behavior, which after all is what F^-1 is for. In our experiments, we have found that a simple existing controller such as an ON-OFF controller with two weeks of data can already be very effective for helping train F^-1. Note that such supervised training of F^-1 does not require the front-end network Q, which means F^-1 could be well-trained and ready for use before Q is trained and transferred. In the case that the target building is brand new and there is no existing controller, we can deploy a simple ON-OFF controller for collecting such data in a warm-up process (Figure 1). While such
Algorithm 2 Training of back-end network F^-1
1: ep_F: the number of training epochs
2: Δct: the control period
3: t'_MAX: the maximum data collection time
4: Initialize previous state S_pre
5: Initialize F^-1
6: Empty database M and dataset D
7: for t = 0 to t'_MAX, t += Δct do
8:   S_cur ← (t, S_in, S_out)
9:   Control action A ← run ON-OFF controller on S_cur
10:  Add sample (S_cur, S_pre, A) to database M
11:  S_pre = S_cur
12: end for
13: for each sample u = (S_cur, S_pre, a) in M do
14:   ΔT_a ← calculate temperature difference between S_cur and S_pre
15:   Add sample v = (ΔT_a, S_pre, a) to dataset D
16: end for
17: for each sample u = (S_cur, S_pre, a) in M do
18:   ΔT_a ← lowest level
19:   a' ← maximum air-conditioning level
20:   Add sample v = (ΔT_a, S_pre, a') to dataset D
21: end for
22: for Epoch = 1 to ep_F do
23:   for each training batch of (ΔT_a, S_pre, a) in dataset D do
24:     network inputs = (ΔT_a, S_pre)
25:     corresponding labels = (a)
26:     Train network F^-1
27:   end for
28: end for
29: Return F^-1
Algorithm 3 Running of our proposed approach
1: Δct: the control period
2: t_MAX: the maximum testing time
3: Initialize the weights of Q with the front-end network transferred from the source building (see Figure 1)
4: Initialize the weights of F^-1 with the weights learned using Algorithm 2
5: for t = 0 to t_MAX, t += Δct do
6:   S_cur ← (t, S_in, S_out)
7:   ΔT ← argmax Q(S_cur)
8:   Control action A ← F^-1(ΔT, S_cur)
9: end for
ON-OFF controller typically consumes significantly more energy, it can effectively maintain the room temperature within the desired bounds, which means that the building could already be in use during this period. Once F^-1 is trained, the DRL controller can replace the ON-OFF controller in operation.
Algorithm 2 shows the detailed process for training F^-1. Note that the initialization of F^-1 in this algorithm also follows the weight initialization method described in [12]. We also augment the collected training data to enforce the boundary condition. The augmented data is created by copying all samples from the collected data, setting the temperature change value ΔT_a to the lowest level (< -b), and setting all control actions to the maximum level.
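The boundary-condition augmentation in Algorithm 2 (lines 17-21) can be sketched as follows (names are ours):

```python
# Sketch of the data augmentation: every collected sample is duplicated
# with the temperature change forced to the lowest level (< -b) and the
# action forced to the maximum cooling level, teaching F^-1 that the
# largest requested temperature drop maps to the maximum action.
def augment(dataset, lowest_dt, max_action):
    """dataset: list of (delta_t, state, action) tuples."""
    extra = [(lowest_dt, state, max_action) for _, state, _ in dataset]
    return dataset + extra

data = [(-0.5, "s0", 1), (0.2, "s1", 2)]
out = augment(data, lowest_dt=-2.5, max_action=4)
assert len(out) == 2 * len(data)
assert out[2] == (-2.5, "s0", 4)
```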
Once the front-end network Q is trained as in Algorithm 1 and the back-end network F^-1 is trained as in Algorithm 2, our transferred DRL controller is ready to be deployed and can operate as
Parameter Value Parameter Value
Front-end
network layers
[10,128,256,
256,256,400,22]
Back-end
network layers
[22*n,128,256,
256,128,m*n]
𝑏2ℎ20
𝑙𝑟10.0003 𝑒𝑝 150
𝑙𝑟20.0001 𝑒𝑝𝐹15
𝐿𝑀𝐴𝑋 1𝑤𝑐𝑜𝑠𝑡 1
1000
𝑒𝑝 150 𝑤𝑣𝑖𝑜 1
1600
𝑇𝑙𝑜𝑤𝑒 𝑟 19 𝑇𝑢𝑝𝑝𝑒𝑟 24
Δ𝑛𝑡 240*15 min Δ𝑐𝑡 15 min
𝑡′
𝑀𝐴𝑋 2 weeks 𝑡𝑀𝐴𝑋 1 month
𝜖𝑙𝑜 𝑤 0.1
Table 1: Hyper-parameters used in our experiments.
described in Algorithm 3. Note that we could further ne-tune our
DRL controller during the operation. This can be done by enabling a
ne-tuning procedure that is similar to Algorithm 1. The dierence
is that instead of initializing the Q-network
𝑄
using [
12
], we copy
transferred Q-network weights from the source building to the
target building’s front-end network
𝑄
and its corresponding target
network
𝑄′
. And we set
𝜖=
0,
𝜖𝑙𝑜 𝑤 =
0, and
𝐿𝑀𝐴𝑋
to 3instead of
1. Other operations remain the same as in Algorithm 1.
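The fine-tuning overrides described above can be expressed as a small configuration update; the dictionary keys below are illustrative names of our own, not the exact identifiers from Algorithm 1.

```python
# Sketch of the hyper-parameter changes for the optional fine-tuning
# phase: exploration is disabled (epsilon and epsilon_low set to 0)
# and L_MAX is raised from 1 to 3; all other settings are kept.
def finetune_config(base):
    cfg = dict(base)          # do not mutate the training-time config
    cfg.update({
        "epsilon": 0.0,       # no random exploration during fine-tuning
        "epsilon_low": 0.0,
        "L_MAX": 3,           # raised from 1 (cf. Table 1)
    })
    return cfg
```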
4 EXPERIMENTAL RESULTS
4.1 Experiment Settings
All experiments are conducted on a server equipped with a 2.10GHz CPU (Intel Xeon(R) Gold 6130), 64GB RAM, and an NVIDIA TITAN RTX GPU card. The learning algorithms are implemented in the PyTorch learning framework. The Adam optimizer [14] is used to optimize both the front-end and back-end networks. The DRL hyper-parameter settings are shown in Table 1. In addition, to accurately evaluate our approach, we leverage the building simulation tool EnergyPlus [5]. Note that EnergyPlus here is only used for evaluation purposes, in place of real buildings; during practical application of our approach, EnergyPlus is not needed. This is different from some of the approaches in the literature [28, 29], where EnergyPlus is needed for offline training before deployment and hence accurate and expensive physical models have to be developed.
In our experiments, simulation models in EnergyPlus interact with the learning algorithms written in Python through the Building Controls Virtual Test Bed (BCVTB) [31]. We simulate the building models with weather data obtained from the Typical Meteorological Year 3 database [32], and choose the summer weather data in August (each training epoch contains one month of data). Apart from the weather transfer experiments, all other experiments are based on the weather data collected in Riverside, California, where the ambient weather changes more drastically and thus presents more challenges to the HVAC controller. Different building types are used in our experiments, including one-zone building 1 (abbreviated as 1-zone 1), four-zone building 1 (4-zone 1), four-zone building 2 (4-zone 2), four-zone building 3 (4-zone 3), five-zone building 1 (5-zone 1), and seven-zone building 1 (7-zone 1). These models are visualized in Figure 2. In addition, the conditioned air temperature sent from the VAV HVAC system is set to 10 ℃.
The symbols used in the result tables are explained as follows. θ_i denotes the temperature violation rate in thermal zone i. Aθ and Mθ represent the average and the maximum temperature violation rate across all zones, respectively. μ_i denotes the maximum temperature violation value for zone i, measured in ℃. Aμ and Mμ are the average and the maximum temperature violation value across all zones, respectively. EP represents the number of training epochs. The check column denotes whether all temperature violation rates across all zones are less than 5%: if true, it is marked as ✓; otherwise, it is marked as × (which is typically not acceptable for HVAC control).
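As an illustration, the per-zone metrics θ_i and μ_i can be computed from a zone's temperature trace as follows; the function and variable names are ours, and the bounds default to the comfort range [T_lower, T_upper] = [19, 24] from Table 1.

```python
# theta: fraction of control steps outside the comfort bounds;
# mu: worst temperature excursion beyond the bounds, in deg C.
def zone_violation_stats(temps, t_lower=19.0, t_upper=24.0):
    # Excursion magnitude per step (0.0 when within bounds).
    violations = [max(t_lower - t, t - t_upper, 0.0) for t in temps]
    theta = sum(v > 0 for v in violations) / len(temps)  # violation rate
    mu = max(violations)                                 # max violation value
    return theta, mu

# Across zones, A-theta / M-theta are the mean / max of the per-zone
# theta values, and analogously for A-mu / M-mu.
```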
Before reporting the main part of our results, we want to show that simply transferring a well-trained DQN model for a single-zone source building to every zone of a target multi-zone building may not yield good results, as discussed in Section 3.2. As shown in Table 2, a DQN model trained for one-zone building 1 works well for that building, but when it is transferred directly to every zone of four-zone building 2, there are significant temperature violations. This shows that a more sophisticated approach such as ours is needed. The following sections present the results of our approach and its comparison with other methods.
4.2 Transfer from n-zone to n-zone with different materials and layouts
In this section, we conduct experiments on building HVAC controller transfer with four-zone buildings that have different materials and layouts. As shown in Figure 2, four-zone building 1 and four-zone building 2 have different structures, as well as different wall materials with different heat capacities in each zone. Table 3 first shows the direct training results on four-zone building 1, and the main transfer results are presented in Table 4.
The direct training outcomes of the baselines and our approach are shown in Table 3. The results include ON-OFF control, Deep Q-network (DQN) control as described in [29] (which assigns an individual DQN model to each zone in the building and trains them for 100 epochs, with one month of data per epoch), DQN* (the standard deep Q-learning method with m^n selections in the last layer [13]), and the direct training result of our method without transfer. Moreover, the DQN method is trained with 50, 100, and 150 training epochs (months), respectively, to show the impact of training time. As shown in the table, all learning-based methods demonstrate significant energy cost reduction over ON-OFF control. DQN* shows a slightly higher cost and violation rate compared to DQN after 150 epochs. Our approach with Algorithm 1 (i.e., not transferred) achieves the lowest violation rate among all learning-based methods, while providing a low cost.
Table 4 shows the main comparison results of our transfer learning approach and other baselines on four-zone building 2 and four-zone building 3. ON-OFF, DQN, and DQN* are directly trained on those two buildings. DQN*_T is a transfer learning approach that transfers a well-trained DQN* model on four-zone building 1 to the target building (four-zone building 2 or 3). Our approach transfers our trained four-zone building 1 model (last row in Table 3) to the target building. From Table 4, we can see that for both four-zone building 2 and 3, with 150 training epochs, DQN and DQN* provide lower violation rates and costs than ON-OFF control, although DQN* cannot meet the temperature violation requirement, and the other transfer learning approach DQN*_T shows a very high violation rate. In comparison, our approach achieves an extremely low temperature violation rate and a relatively low energy cost without any fine-tuning after transfer (i.e., EP is 0). We may fine-tune the controller for 1 epoch (month) after transfer to further reduce the energy cost (i.e., EP is 1), at the expense of a slightly higher violation rate (but still meeting the requirement). More studies on fine-tuning can be found in Section 4.5. Figure 3 (left) also shows the temperature over time for the target four-zone building 2, and we can see that it is kept well within the bounds.
4.3 Transfer from n-zone to m-zone
We also study the transfer from an n-zone building to an m-zone building. This is a difficult task because the input and output dimensions are different, presenting significant challenges for DRL network design. Here, we conduct experiments on transferring the HVAC controller for four-zone building 1 to five-zone building 1 and seven-zone building 1, and the results are presented in Table 5. For these cases, DQN* and DQN*_T cannot provide feasible results as the m^n action space is too large for them, and the violation rate does not go down even after 150 training epochs. DQN [29] also leads to a high violation rate. In comparison, our approach achieves both a low violation rate and a low energy cost. Figure 3 (middle and right) shows the temperature over time (kept well within the bounds) for the two target buildings after applying our transfer approach.
4.4 Transfer from n-zone to n-zone with different HVAC equipment
In some cases, the target building may have different HVAC equipment (or a building may have its equipment upgraded). The new HVAC equipment may be more powerful or have a different number of control levels, making the original controller less effective. In such cases, our transfer learning approach provides an effective solution. Here we conduct experiments on transferring our controller for the original HVAC equipment (denoted as AC 1, which has two control levels and is used in all other experiments) to the same building with new HVAC equipment (denoted as AC 2, which has five control levels; and AC 3, which has double the maximum airflow rate and double the air conditioner power compared to AC 1). The experimental results are shown in Table 6. We can see that our approach provides a zero violation rate after transfer, and the energy cost can be further reduced with the fine-tuning process.
4.5 Fine-tuning study
Although our method already performs well after transfer without fine-tuning, further training is still worth considering because it may provide an even lower energy cost. We record the change in cost and violation rate when fine-tuning our method transferred from four-zone building 1 to four-zone building 2. The results are shown in Figure 4.
4.6 Discussion
4.6.1 Transfer from n-zone to n-zone with different weather. As presented in [18], a Q-learning controller trained under weather with a larger temperature range and variance can easily be transferred to an environment whose weather has a smaller temperature range and variance, but the opposite direction is much harder. This conclusion is similar to what we observed for our approach.
Figure 2: Different building models used in our experiments. From left to right, the models are one-zone building 1, four-zone building 1, four-zone building 2, four-zone building 3, five-zone building 1, and seven-zone building 1. Compared to four-zone building 1, four-zone building 2 has a different layout and wall material; four-zone building 3 has a different layout, wall material, and room size; five-zone building 1 has a different number of zones, layout, and wall material; and seven-zone building 1 has a different number of zones, layout, wall material, and room size.
Source building  Target building  θ1     θ2     θ3      θ4      μ1    μ2    μ3    μ4    ✓   Cost
1-zone 1         1-zone 1         1.62%  -      -       -       1.11  -     -     -     ✓   248.43
1-zone 1         4-zone 2         1.88%  9.43%  10.19%  14.07%  0.44  0.97  1.04  1.17  ×   308.13
Table 2: This table shows the experiment that transfers a single-zone DQN model (trained on one-zone building 1) to every zone of four-zone building 2. The high violation rate shows that such a straightforward scheme may not yield good results and that more sophisticated methods such as ours are needed.
Method    Building  EP   θ1     θ2      θ3     θ4      μ1    μ2    μ3    μ4    ✓   Cost
ON-OFF    4-zone 1  0    0.08%  0.08%   0.23%  0.19%   0.01  0.03  0.08  0.08  ✓   329.56
DQN [29]  4-zone 1  50   1.21%  22.72%  9.47%  20.66%  0.68  2.46  1.61  2.07  ×   245.08
DQN [29]  4-zone 1  100  0.0%   0.53%   0.05%  0.93%   0.0   0.46  0.40  1.09  ✓   292.91
DQN [29]  4-zone 1  150  0.0%   0.95%   0.03%  1.59%   0.0   0.52  0.17  1.17  ✓   278.32
DQN*      4-zone 1  150  1.74%  2.81%   1.80%  2.76%   0.45  0.79  1.08  1.22  ✓   289.09
Ours      4-zone 1  150  0.0%   0.04%   0.0%   0.03%   0.0   0.33  0.0   0.11  ✓   297.42
Table 3: Results of different methods on four-zone building 1. Apart from ON-OFF control, all others are training results without transfer. The trained model in the last row is used as the transfer model to other buildings in our method.
Method    Building  EP   θ1      θ2      θ3      θ4      μ1    μ2    μ3    μ4    ✓   Cost
ON-OFF    4-zone 2  0    0.0%    0.0%    0.0%    0.02%   0.0   0.0   0.0   0.46  ✓   373.78
DQN [29]  4-zone 2  50   0.83%   49.22%  46.75%  60.48%  0.74  2.93  3.18  3.39  ×   258.85
DQN [29]  4-zone 2  100  0.0%    1.67%   1.23%   3.58%   0.0   0.92  0.77  1.62  ✓   352.13
DQN [29]  4-zone 2  150  0.0%    2.52%   1.67%   4.84%   0.0   1.64  1.56  1.61  ✓   337.33
DQN*      4-zone 2  150  1.16%   2.71%   2.17%   6.44%   0.61  1.11  0.77  1.11  ×   323.72
DQN*_T    4-zone 2  0    12.35%  19.10%  10.39%  23.59%  2.47  4.67  2.27  5.22  ×   288.73
Ours      4-zone 2  0    0.0%    0.0%    0.0%    0.07%   0.0   0.0   0.0   0.88  ✓   338.45
Ours      4-zone 2  1    0.09%   3.44%   1.91%   4.06%   0.33  1.04  0.96  1.35  ✓   297.03
ON-OFF    4-zone 3  0    0.0%    0.19%   0.0%    0.0%    0.0   0.02  0.0   0.0   ✓   360.74
DQN [29]  4-zone 3  50   0.68%   47.21%  44.61%  56.19%  0.74  3.15  2.92  3.60  ×   267.29
DQN [29]  4-zone 3  100  0.34%   2.53%   2.21%   5.59%   0.01  1.18  0.85  1.18  ×   342.08
DQN [29]  4-zone 3  150  0.0%    1.55%   1.68%   3.79%   0.0   1.09  1.18  1.51  ✓   334.89
DQN*      4-zone 3  150  7.09%   13.85%  2.87%   2.16%   1.26  1.48  1.42  1.01  ×   316.93
DQN*_T    4-zone 3  0    13.31%  8.11%   3.18%   0.66%   1.25  3.48  2.27  0.69  ×   294.23
Ours      4-zone 3  0    0.0%    0.28%   0.0%    0.0%    0.0   0.37  0.0   0.0   ✓   340.40
Ours      4-zone 3  1    0.23%   2.74%   0.04%   0.13%   0.34  1.73  0.12  0.31  ✓   331.47
Table 4: Comparison between our approach and other baselines. The top half shows the performance of different controllers on four-zone building 2, including the ON-OFF controller, DQN from [29] trained with different numbers of epochs, the standard deep Q-learning method (DQN*) and its transferred version from four-zone building 1 (DQN*_T), and our approach transferred from four-zone building 1 (without fine-tuning and with 1 epoch of tuning, respectively). We can see that our method achieves the lowest violation rate and a very low energy cost after transfer without any further tuning/training. We may fine-tune our controller with 1 epoch (month) of training and achieve the lowest cost, at the expense of a slightly higher violation rate (but still meeting the requirement). The bottom half shows the similar comparison results for four-zone building 3.
We tested the weather from Riverside, Buffalo, and Los Angeles, shown in Figure 5. The results show that our approach can easily be transferred from large-range, high-variance weather (Riverside) to small-range, low-variance weather (Buffalo and Los Angeles (LA)), but not vice versa. Fortunately, the transfer for a new building is still not affected, because our approach can use building models in the same region, or obtain the weather data of that region and create a simulated model for transfer.
4.6.2 Different settings for ON-OFF control. Our back-end network (inverse building network) is learned from a dataset collected under ON-OFF control with a low temperature violation rate. In practice, it is flexible to determine the actual temperature boundaries for ON-OFF control. For instance, the operator may set the temperature
Figure 3: Temperature of four-zone building 2 (left), 5-zone building 1 (middle), and 7-zone building 1 (right) after transfer.
Method    Building  EP   Aθ      Mθ      Aμ    Mμ    ✓   Cost
ON-OFF    5-zone 1  0    0.45%   2.2%    0.24  1.00  ✓   373.90
DQN [29]  5-zone 1  50   38.65%  65.00%  2.60  3.81  ×   263.79
DQN [29]  5-zone 1  100  4.13%   11.59%  4.66  1.47  ×   326.50
DQN [29]  5-zone 1  150  2.86%   10.94%  0.89  1.63  ×   323.78
Ours      5-zone 1  0    0.47%   2.34%   0.33  1.42  ✓   339.73
Ours      5-zone 1  1    2.41%   4.48%   1.02  1.64  ✓   323.26
ON-OFF    7-zone 1  0    0.37%   2.61%   0.04  0.30  ✓   392.56
DQN [29]  7-zone 1  50   28.14%  54.28%  2.76  3.06  ×   248.38
DQN [29]  7-zone 1  100  5.19%   18.91%  1.12  1.69  ×   277.87
DQN [29]  7-zone 1  150  4.48%   18.34%  1.22  1.98  ×   284.51
Ours      7-zone 1  0    0.42%   2.79%   0.10  0.43  ✓   332.07
Ours      7-zone 1  1    0.77%   1.16%   0.77  1.21  ✓   329.81
Table 5: Comparison of our approach and baselines on five-zone building 1 and seven-zone building 1.
Method    AC    EP   Aθ      Mθ      Aμ    Mμ    ✓   Cost
ON-OFF    AC 2  0    0.15%   0.23%   0.05  0.08  ✓   329.56
DQN [29]  AC 2  50   20.28%  35.56%  1.73  2.66  ×   229.41
DQN [29]  AC 2  100  1.25%   2.69%   0.61  1.20  ✓   270.93
DQN [29]  AC 2  150  1.49%   2.87%   0.60  1.02  ✓   263.92
Ours      AC 2  0    0.0%    0.0%    0.0   0.0   ✓   303.37
Ours      AC 2  1    2.06%   4.20%   0.97  1.30  ✓   262.23
ON-OFF    AC 3  0    0.01%   0.05%   0.22  0.88  ✓   317.53
DQN [29]  AC 3  50   2.85%   3.76%   1.37  1.90  ✓   321.03
DQN [29]  AC 3  100  0.69%   1.20%   0.53  0.99  ✓   265.46
DQN [29]  AC 3  150  0.62%   1.07%   0.47  0.65  ✓   266.86
Ours      AC 3  0    0.0%    0.0%    0.0   0.0   ✓   316.16
Ours      AC 3  1    0.84%   1.42%   0.54  0.78  ✓   269.24
Table 6: Comparison under different HVAC equipment.
[Figure 4: energy cost (y-axis, roughly 260–350) over fine-tuning weeks 0–6, with the average violation rate per week: 0.02%, 0.74%, 0.02%, 1.53%, 2.76%, 2.41%, 2.45%.]
Figure 4: Fine-tuning results of our approach for four-zone building 2. Our approach can significantly reduce energy cost after fine-tuning for 3 weeks, while keeping the temperature violation rate at a low level.
Figure 5: Visualization of the different weather conditions. The yellow line is the Buffalo weather, the green line is the LA weather, the blue line is the Riverside weather, and the red lines are the comfortable temperature boundaries.
Building  Source     Target     EP   Aθ      Mθ      ✓   Cost
4-zone 1  LA         LA         150  0.68%   1.71%   ✓   82.01
4-zone 1  Buffalo    Buffalo    150  0.64%   1.14%   ✓   101.79
4-zone 1  Riverside  Riverside  150  0.02%   0.04%   ✓   297.42
4-zone 1  Riverside  LA         0    0.0%    0.0%    ✓   105.17
4-zone 1  Riverside  Buffalo    0    0.0%    0.0%    ✓   134.28
4-zone 1  LA         Riverside  0    71.77%  89.34%  ×   158.06
4-zone 1  Buffalo    Riverside  0    54.92%  81.89%  ×   180.20
Table 7: Transfer between different weather conditions.
Method  Upper bound  EP  Aθ      Mθ      Cost
ON-OFF  23           0   0.01%   0.02%   373.78
ON-OFF  24           0   61.45%  73.69%  256.46
ON-OFF  25           0   98.56%  99.99%  208.79
Ours    23           0   0.02%   0.07%   338.45
Ours    24           0   0.02%   0.07%   338.08
Ours    25           0   0.02%   0.07%   338.08
Table 8: Results of testing with different boundary settings.
bound of ON-OFF control to be within the human comfort temperature boundary (what we use for our method), the same as the human comfort temperature boundary, or even a little outside the boundary to save energy cost. Thus, we tested the performance of our method by collecting data under different ON-OFF boundary settings. The results in Table 8 show that with different boundary settings, supervised learning can stably learn from building-specific behaviors.
5 CONCLUSION
In this paper, we present a novel transfer learning approach that decomposes the design of the neural-network-based HVAC controller into two sub-networks: a building-agnostic front-end network that can be directly transferred, and a building-specific back-end network that can be efficiently trained with offline supervised learning. Our approach successfully transfers the DRL-based building HVAC controller from source buildings to target buildings that can have a different number of thermal zones, different materials and layouts, different HVAC equipment, and even different weather conditions in certain cases.
ACKNOWLEDGMENTS
We gratefully acknowledge the support from Department of Energy (DOE) award DE-EE0009150 and National Science Foundation (NSF) award 1834701.
REFERENCES
[1] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. 2019. Solving Rubik's Cube with a robot hand. arXiv preprint arXiv:1910.07113 (2019).
[2] Enda Barrett and Stephen Linder. 2015. Autonomous HVAC Control, A Reinforcement Learning Approach. Springer.
[3] Yujiao Chen, Zheming Tong, Yang Zheng, Holly Samuelson, and Leslie Norford. 2020. Transfer learning with deep neural networks for model predictive control of HVAC and natural ventilation in smart buildings. Journal of Cleaner Production 254 (2020), 119866.
[4] Giuseppe Tommaso Costanzo, Sandro Iacovella, Frederik Ruelens, Tim Leurs, and Bert J Claessens. 2016. Experimental analysis of data-driven control for a building heating system. Sustainable Energy, Grids and Networks 6 (2016), 81–90.
[5] Drury B. Crawley, Curtis O. Pedersen, Linda K. Lawrie, and Frederick C. Winkelmann. 2000. EnergyPlus: Energy Simulation Program. ASHRAE Journal 42 (2000).
[6] Felipe Leno Da Silva and Anna Helena Reali Costa. 2019. A survey on transfer learning for multiagent reinforcement learning systems. Journal of Artificial Intelligence Research 64 (2019), 645–703.
[7] Pedro Fazenda, Kalyan Veeramachaneni, Pedro Lima, and Una-May O'Reilly. 2014. Using reinforcement learning to optimize occupant comfort and energy usage in HVAC systems. Journal of Ambient Intelligence and Smart Environments (2014), 675–690.
[8] Guanyu Gao, Jie Li, and Yonggang Wen. 2019. Energy-efficient thermal comfort control in smart buildings via deep reinforcement learning. arXiv preprint arXiv:1901.04693 (2019).
[9] Guanyu Gao, Jie Li, and Yonggang Wen. 2020. DeepComfort: Energy-Efficient Thermal Comfort Control in Buildings via Reinforcement Learning. IEEE Internet of Things Journal (2020).
[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. 6.5 Back-Propagation and Other Differentiation Algorithms. Deep Learning (2016), 200–220.
[11] Abhishek Gupta, Coline Devin, YuXuan Liu, Pieter Abbeel, and Sergey Levine. 2017. Learning invariant feature spaces to transfer skills with reinforcement learning. ICLR (2017).
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision. 1026–1034.
[13] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. 2018. Deep Q-learning from demonstrations. In AAAI.
[14] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[15] Neil E Klepeis, William C Nelson, Wayne R Ott, John P Robinson, Andy M Tsang, Paul Switzer, Joseph V Behar, Stephen C Hern, and William H Engelmann. 2001. The National Human Activity Pattern Survey (NHAPS): a resource for assessing exposure to environmental pollutants. Journal of Exposure Science & Environmental Epidemiology 11, 3 (2001), 231–252.
[16] B. Li and L. Xia. 2015. A multi-grid reinforcement learning method for energy conservation and comfort of HVAC in buildings. IEEE International Conference on Automation Science and Engineering (CASE), 444–449.
[17] Yuanlong Li, Yonggang Wen, Dacheng Tao, and Kyle Guan. 2019. Transforming cooling optimization for green data center via deep reinforcement learning. IEEE Transactions on Cybernetics 50, 5 (2019), 2002–2013.
[18] Paulo Lissa, Michael Schukat, and Enda Barrett. 2020. Transfer Learning Applied to Reinforcement Learning-Based HVAC Control. SN Computer Science 1 (2020).
[19] Y. Ma, F. Borrelli, B. Hencey, B. Coffey, S. Bengea, and P. Haves. 2012. Model Predictive Control for the Operation of Building Cooling Systems. IEEE Transactions on Control Systems Technology 20, 3 (2012), 796–803.
[20] Mehdi Maasoumy, Alessandro Pinto, and Alberto Sangiovanni-Vincentelli. 2011. Model-based hierarchical optimal control design for HVAC systems. In Dynamic Systems and Control Conference, Vol. 54754. 271–278.
[21] Mehdi Maasoumy, M Razmara, M Shahbakhti, and A Sangiovanni Vincentelli. 2014. Handling model uncertainty in model predictive control for energy efficient buildings. Energy and Buildings 77 (2014), 377–392.
[22] Mehdi Maasoumy, Meysam Razmara, Mahdi Shahbakhti, and Alberto Sangiovanni Vincentelli. 2014. Selecting building predictive control based on model uncertainty. In 2014 American Control Conference. IEEE, 404–411.
[23] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
[24] Aviek Naug, Ibrahim Ahmed, and Gautam Biswas. 2019. Online energy management in commercial buildings using deep reinforcement learning. In 2019 IEEE International Conference on Smart Computing (SMARTCOMP). IEEE, 249–257.
[25] D Nikovski, J Xu, and M Nonaka. 2013. A method for computing optimal set-point schedules for HVAC systems. In REHVA World Congress CLIMA.
[26] U.S. Department of Energy. 2011. Buildings energy data book.
[27] Saran Salakij, Na Yu, Samuel Paolucci, and Panos Antsaklis. 2016. Model-Based Predictive Control for building energy management. I: Energy modeling and optimal control. Energy and Buildings 133 (2016), 345–358.
[28] T. Wei, S. Ren, and Q. Zhu. 2019. Deep Reinforcement Learning for Joint Datacenter and HVAC Load Control in Distributed Mixed-Use Buildings. IEEE Transactions on Sustainable Computing (2019), 1–1.
[29] Tianshu Wei, Yanzhi Wang, and Qi Zhu. 2017. Deep reinforcement learning for building HVAC control. In Proceedings of the 54th Annual Design Automation Conference 2017. 1–6.
[30] Tianshu Wei, Qi Zhu, and Nanpeng Yu. 2015. Proactive demand participation of smart buildings in smart grid. IEEE Trans. Comput. 65, 5 (2015), 1392–1406.
[31] Michael Wetter. 2011. Co-simulation of building energy and control systems with the Building Controls Virtual Test Bed. Journal of Building Performance Simulation 4, 3 (2011), 185–203.
[32] Stephen Wilcox and William Marion. 2008. Users manual for TMY3 data sets. (2008).
[33] Yu Yang, Seshadhri Srinivasan, Guoqiang Hu, and Costas J Spanos. 2020. Distributed Control of Multi-zone HVAC Systems Considering Indoor Air Quality. arXiv preprint arXiv:2003.08208 (2020).
[34] Liang Yu, Yi Sun, Zhanbo Xu, Chao Shen, Dong Yue, Tao Jiang, and Xiaohong Guan. 2020. Multi-Agent Deep Reinforcement Learning for HVAC Control in Commercial Buildings. IEEE Transactions on Smart Grid (2020).
[35] Yusen Zhan and Matthew E Taylor. 2015. Online transfer learning in reinforcement learning domains. In 2015 AAAI Fall Symposium Series.
[36] Zhiang Zhang, Adrian Chong, Yuqi Pan, Chenlu Zhang, Siliang Lu, and Khee Poh Lam. 2018. A deep reinforcement learning approach to using whole building energy model for HVAC optimal control. In 2018 Building Performance Analysis Conference and SimBuild, Vol. 3. 22–23.
[37] Zhiang Zhang and Khee Poh Lam. 2018. Practical implementation and evaluation of deep reinforcement learning control for a radiant heating system. In Proceedings of the 5th Conference on Systems for Built Environments. 148–157.