Variational Policy Chaining for Lifelong
Reinforcement Learning
Christopher Doyle, Maxime Guériau and Ivana Dusparic
School of Computer Science and Statistics, Trinity College Dublin, Ireland
Email: doylec49@tcd.ie, maxime.gueriau@scss.tcd.ie, ivana.dusparic@scss.tcd.ie
Abstract—With increasing applications of reinforcement learning to real-life problems, it is becoming essential that agents are able to update their knowledge continually. Lifelong learning approaches aim to enable agents to retain the knowledge they learn and to selectively transfer knowledge to new tasks. Recent techniques for lifelong reinforcement learning have shown great success in getting an agent to generalise over several tasks. However, scalability becomes an issue when agents learn numerous tasks, as each task's information must be remembered. To address this issue, this paper proposes Variational Policy Chaining (VPC), which enables a reinforcement learning agent to generalise effectively and in a scalable manner when presented with continuous task updates, without storing multiple historic experiences. VPC uses the Kullback-Leibler divergence to isolate the most common pieces of knowledge and condenses the important knowledge into a single policy chain. We evaluate VPC in a GridWorld environment and compare it to vanilla policy gradient methods, showing that VPC's ability to reuse knowledge from previously encountered tasks reduces learning time in new tasks by up to 50%.
Index Terms—lifelong learning, continual learning, reinforcement learning
I. INTRODUCTION
Recent advances in Reinforcement Learning (RL) and deep neural networks have seen increasing exploration of classical and deep RL in real-world applications, such as renewable energy management and intelligent transportation systems [1]–[3]. However, for long-term deployment in an environment, it is not sufficient for agents to perform only the task(s) they have been pre-trained to achieve. They must also be able to continuously adapt their knowledge to learn new tasks, as well as remember and reuse knowledge from previously learnt tasks to bootstrap learning in newly encountered ones.
This need gave rise to the field of Lifelong Reinforcement Learning (LLRL), which aims to develop techniques that allow RL agents to generalise over multiple tasks. Current LLRL techniques typically fall into one of two categories: 1) relearning new and old tasks without the model diverging, or 2) selecting the most appropriate policy from a reservoir (or repository) of pre-trained policies. The trade-off is that relearning policies for each new task from scratch is slow, while storing full policies is not scalable to a large number of tasks if storage is limited.
This paper proposes an alternative in the novel method of Variational Policy Chaining (VPC). The general approach used by VPC is to store just a single policy, derived from all previously learnt policies, called the “policy chain”. This policy is designed to be loosely suitable for any task in the given environment, which provides a starting point for learning a new task.
II. RELATED WORK
Continual learning (CL) [4] emerged as a promising way to achieve lifelong learning [5]. This form of online learning allows agents to tackle problems in which they face a series of successive tasks and need to adapt incrementally to optimise the use of previous data. CL arose to address the issue of catastrophic forgetting across various areas within machine learning, including reinforcement learning. Related work in general continual learning primarily relies on Bayesian methods and can be categorized into prior-focused or likelihood-focused approaches [6]. Prior-focused approaches directly reuse part of the intermediate model as a prior when learning the new model, while likelihood-focused approaches estimate the likelihood of the new model on past tasks.
Continual Reinforcement Learning (CRL) was developed to enhance both tabular Q-Learning [7] and Deep Q-Learning [8]. A synaptic model was used in [9], [10] to account for biological complexity, allowing interacting components to evolve at different timescales. While it allows an agent to learn new tasks without forgetting previous ones, the optimal policies for each task take much longer to learn due to the bias imposed from one task on another. Composition approaches [11] are an alternative way to achieve CRL. Here, an agent stores a learned policy for each completed task and can re-use them when encountering new (very similar) tasks. Composition approaches are therefore often susceptible to the ‘winner’s curse’ [12] – the idea that the more suitable a policy is for one task, the harder it is to re-learn another.
III. BACKGROUND
In this section we present the background knowledge that our approach builds on: the basics of RL and the Kullback-Leibler divergence, the method VPC uses to measure the distance between two policies.
A. Reinforcement Learning
Reinforcement learning (RL) is a technique in which an intelligent agent learns to map environment situations (i.e., states) to actions so as to maximize a numerical reward signal it receives from the environment [13]. An RL process is expressed through the following terms:
• $t$: the step in time between actions and observations.
• $S$: the ‘state space’, i.e., the set of all states contained within an environment.
• $A$: the ‘action space’, i.e., the set of all actions that an agent can take.
• $T$: the state transition function, which describes $P(s_{t+1} \mid s_t, a_t)$.
• $r$: the numerical reward signal associated with every state transition.
A learning episode is then defined as the sequence $s_t, a_t, r_t, s_{t+1}, \ldots, s_T$, where $s_t$ is the state at timestep $t$, $a_t$ is the action taken from this state, $r_t$ is the reward obtained from the transition resulting from $a_t$, $s_{t+1}$ is the resulting state entered after $a_t$, and $s_T$ is a terminal state. Optimizing an agent's performance can be seen as maximizing the cumulative expected reward from any given state $s_t$. This implies that each action taken will result in an optimal state transition and accompanying reward. We say, therefore, that an agent has obtained the optimal policy $\pi^*$, where an agent's policy $\pi(s_t)$ is the conditional probability distribution over all possible actions in state $s_t$, $\pi(A \mid s_t)$.
In this work we use a Policy Gradient (PG) [13] method to obtain the optimal policy. PG learns a parameterized policy which selects an action in each state so as to maximize the expected cumulative reward $J$ obtained under policy $\pi$. $J$ is parameterized with a parameter vector $\theta$ and is maximized by following the gradient of the policy's return, $\nabla_\theta J(\theta)$, with respect to the parameters, according to the formula:
$$\nabla_\theta J(\theta) = \sum_t \pi(t;\theta)\, \nabla_\theta \log\big(\pi_\theta(t;\theta)\big)\, r_t \qquad (1)$$
where $r_t$ is the reward received at time step $t$.
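For illustration, a REINFORCE-style estimate of the gradient in Equation (1) for a tabular softmax policy might be implemented as in the following sketch (our own example, with assumed array shapes and learning rate; not the implementation used in this paper):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_update(theta, episode, lr=0.01):
    """One REINFORCE-style update of a tabular softmax policy.

    theta   : array of shape (n_states, n_actions) holding policy logits.
    episode : list of (state, action, reward) tuples collected under pi_theta.
    """
    grad = np.zeros_like(theta)
    for s, a, r in episode:
        pi = softmax(theta[s])
        # gradient of log pi_theta(a|s) with respect to the logits of state s
        dlog = -pi
        dlog[a] += 1.0
        grad[s] += r * dlog          # reward-weighted log-likelihood gradient, as in Eq. (1)
    return theta + lr * grad         # gradient ascent on J(theta)
```

A practical implementation would typically replace the per-step reward with a discounted return; the sketch keeps $r_t$ to mirror Equation (1).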
VPC utilizes the fact that the optimal policy is expressed as a probability distribution over all possible actions in state $s_t$ to compare the difference between policies (further discussed in Section IV) using the Kullback-Leibler divergence.
B. Kullback-Leibler divergence
The Kullback-Leibler (KL) divergence [14] is a method used to measure the similarity between two probability distributions $P(d)$ and $Q(d)$, and it is calculated as:
$$D_{KL}(Q \,\|\, P) = \sum_d Q(d) \log \frac{Q(d)}{P(d)} \qquad (2)$$
KL divergence is not a symmetrical operation, and $D_{KL}(Q\|P) \neq D_{KL}(P\|Q)$. $D_{KL}(Q\|P)$ as defined above is the divergence from Q to P, also known as the exclusive divergence, while the inclusive divergence is given by $D_{KL}(P\|Q)$. A Kullback-Leibler divergence $D_{KL}$ of 0 means that $P(d)$ and $Q(d)$ are identical. In the context of machine learning, $D_{KL}(P\|Q)$ is often called the information gain achieved if $Q(d)$ is used instead of $P(d)$.
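For the discrete action distributions used throughout this paper, Equation (2) can be evaluated directly. The following sketch (our illustration; the small epsilon for numerical stability is an assumption) computes both directions of the divergence:

```python
import numpy as np

def kl_divergence(q, p, eps=1e-12):
    """Exclusive KL divergence D_KL(Q || P) between two discrete distributions."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * np.log((q + eps) / (p + eps))))

q = np.array([0.70, 0.10, 0.10, 0.10])   # e.g. a policy over {north, south, east, west}
p = np.array([0.25, 0.25, 0.25, 0.25])   # uniform policy
print(kl_divergence(q, p))               # exclusive divergence D_KL(Q || P)
print(kl_divergence(p, q))               # inclusive divergence D_KL(P || Q); note the asymmetry
```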
IV. VARIATIONAL POLICY CHAINING
The Variational Policy Chaining (VPC) algorithm aims to provide an RL agent with a sense of intuition about which actions to take, without having to explicitly learn a policy or draw one from a storage repository.
VPC proposes to create a policy which contains the maximum amount of useful information from all learned policies. This policy is named the policy chain, $\pi_c(s, a)$, and it provides an optimal starting point for learning as wide a range of policies as possible. The storage requirements of this solution are equal to those of a single policy, making an agent theoretically capable of performing N tasks from just one stored policy.
To illustrate this, we give an example of VPC application in
the standard RL benchmark environment GridWorld as shown
in Figure 1.
Fig. 1: Ideal GridWorld policy for optimal relearning.
The cells of the grid correspond to the states of the RL
environment. Four actions are possible in each of the cells:
north, south, east, and west, which deterministically cause an
agent to move (i.e., cause a state transition) one cell in the
chosen direction. The agent's goal, as in any RL scenario, is to maximize the long-term reward received, with the reward's location initially unknown to the agent.
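A minimal deterministic GridWorld of this kind can be sketched as below. This is our own illustration: the grid size, step penalty, and goal reward are assumptions, since the paper only specifies the states, the four deterministic actions, and an initially unknown reward location.

```python
class GridWorld:
    """Deterministic grid: cells are states; four actions move one cell."""
    MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

    def __init__(self, size=6, goal=(4, 4), step_reward=-0.01, goal_reward=1.0):
        self.size, self.goal = size, goal
        self.step_reward, self.goal_reward = step_reward, goal_reward
        self.state = (0, 0)

    def reset(self, start=(0, 0)):
        self.state = start
        return self.state

    def step(self, action):
        dr, dc = self.MOVES[action]
        r, c = self.state
        # a move that would leave the grid keeps the agent in place
        self.state = (min(max(r + dr, 0), self.size - 1),
                      min(max(c + dc, 0), self.size - 1))
        done = self.state == self.goal
        return self.state, (self.goal_reward if done else self.step_reward), done
```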
We present two cases in particular that demonstrate the inner
workings of VPC.
Case 1: $\Pi_1(A \mid s_i) \approx \Pi_2(A \mid s_i)$: This scenario is simple to analyse; we have two policies with similar distributions given some state $s_i$. The resulting chain of the two policies should produce a similar distribution. For example, if both policies contain information inferring that the boundary of the GridWorld is bad, then this information should be contained within the resulting policy chain.
Case 2: $\Pi_1(A \mid s_i) \not\approx \Pi_2(A \mid s_i)$: This scenario dives deeper into the capabilities of VPC. The aim of VPC is to eventually generalise to some ideal policy chain. We can define the ideal policy for relearning as the policy that remembers everything that inhibits learning (e.g., boundary locations) and forgets all information that can introduce a bias (e.g., the reward was in the top left last time, therefore it should be there again). We call this the ideal policy chain for relearning, and the equivalent GridWorld policy is shown graphically in Figure 1. However, if two policies have each learned a different reward located at either end of the GridWorld, then their distributions will differ greatly around the centre of the board. The aim, therefore, is that where policies differ, the chain should tend towards a uniform distribution. We achieve this by prioritising the inclusive version of the KL divergence when minimizing the VPC loss function, as discussed in the rest of this section.
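To make the role of the inclusive divergence explicit (a short derivation we add for clarity; it is not spelled out in the paper): minimising a weighted sum of inclusive divergences over a candidate distribution $Q$ has a closed-form solution, namely the weighted mixture of the inputs,
$$\arg\min_{Q} \sum_i w_i\, D_{KL}(P_i \,\|\, Q) \;=\; \arg\max_{Q} \sum_d \Big(\sum_i w_i P_i(d)\Big) \log Q(d) \;=\; \sum_i w_i P_i, \qquad \sum_i w_i = 1.$$
If $P_1$ and $P_2$ concentrate their mass on opposite actions, as do policies trained for rewards at opposite corners, their weighted mixture spreads mass over both, i.e., it is close to uniform over the disputed actions. The exclusive direction, by contrast, is mode-seeking and would tend to commit to one of the two policies.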
A. Policy Chaining and VPC Loss Function
The procedure for training a new policy chain and applying it to a task is presented in Figure 2. An environment state is sampled and used as an input to both the existing policy chain and the next policy to be added to the chain. We then obtain two discrete conditional probability distributions, which are fed into the VPC loss function. Minimizing this loss function returns a new discrete conditional probability distribution designed to mimic the ideal chain. A neural network is trained on these output distributions from the VPC loss function to generalise to the new policy chain.
Fig. 2: Training graph for a variational policy chain. (The figure shows the VPC agent interacting with the environment: the observation $s_t$ is fed to the prior policy, which is typically the previous instance of the policy chain, and to the posterior policy, which is the next to be added to the chain; the Kullback-Leibler-based VPC loss function combines the two, and the neural network policy generalises to the new optimal chain and selects action $a_t$.)
VPC is introduced as the amalgamation of important self-information across a running chain of policy distributions. A new link is created in the ‘chain’ when we extract the most important self-information from a policy and include it in the chain model. The VPC loss function is used to extract the most important self-information from within a policy distribution. Through this function, the policy chain is strengthened by absorbing new information from the next policy. The output of the loss function is $\pi_t$, which is defined as:
“The policy distribution that minimises a given ratio of information lost between the posterior policy (the next policy to be added to the chain) for task $t-1$ and the existing policy chain (prior policy), which was used for the previous task $t-1$.”
$$\pi_t = \arg\min_{\pi_\theta}\; \alpha\, D_{KL}(\pi_\theta \,\|\, \pi_{t-1,\mathrm{posterior}}) + (1-\alpha)\, D_{KL}(\pi_\theta \,\|\, \pi_{t-1,\mathrm{prior}}) \qquad (3)$$
where $D_{KL}$ is the Kullback-Leibler divergence.
To determine how the loss function returns the new chain distribution $\pi_t$, we break down the above equation as follows. Firstly, $D_{KL}(Q\|P)$ calculates the exclusive divergence between the distributions Q and P. The lower the result of this divergence, the more similar Q is to P; the Q that returns the lowest value for this expression is Q = P.
Therefore, to minimise Equation 3, we must find the value of $\pi_t$ that sufficiently minimises both the divergence between itself and the optimal policy for the current task, and the divergence between itself and the current chain distribution. The sum is regularized by the chaining ratio $\alpha$ such that the chain is given priority in this minimisation. We use $\alpha = 1/t$, where $t$ is the task number beginning at 2, to ensure that the posterior policies are fairly weighted against the chain.
This technique is labelled variational due to the parametric nature of $\pi_\theta$. This distribution is varied until we achieve the result that minimises the desired expression. The optimal $\pi_\theta$ is found by applying a gradient descent approach to the loss function until we reach an acceptable minimum.
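As a concrete illustration of Equation (3), the VPC loss for a candidate chain distribution in a single state can be computed as in the sketch below (our own example with assumed distributions; the paper's implementation varies a neural network policy rather than a fixed candidate):

```python
import numpy as np

def kl(q, p, eps=1e-12):
    """D_KL(Q || P) for discrete distributions."""
    return float(np.sum(q * np.log((q + eps) / (p + eps))))

def vpc_loss(pi_theta, pi_posterior, pi_prior, alpha):
    """VPC loss of Eq. (3): weighted divergences from the candidate chain pi_theta
    to the newly learned policy (posterior) and to the existing chain (prior)."""
    return alpha * kl(pi_theta, pi_posterior) + (1 - alpha) * kl(pi_theta, pi_prior)

alpha = 1.0 / 2                                     # chaining ratio for the 2nd task (t = 2)
pi_posterior = np.array([0.80, 0.10, 0.05, 0.05])   # policy just learned for the new task
pi_prior     = np.array([0.05, 0.05, 0.10, 0.80])   # existing policy chain
pi_theta     = np.ones(4) / 4                       # candidate chain distribution
print(vpc_loss(pi_theta, pi_posterior, pi_prior, alpha))
```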
B. Minimising The Loss Function
In its general form, the VPC loss function is given by:
$$L(Q) = \alpha\, D_{KL}(Q_\theta \,\|\, P_1) + (1-\alpha)\, D_{KL}(Q_\theta \,\|\, P_2) \qquad (4)$$
with the optimal Q distribution given by:
$$Q^* = \arg\min_{Q_\theta} L(Q) \qquad (5)$$
To find the $Q^*$ that minimises the loss function $L(Q)$, we use a gradient descent step:
$$Q \leftarrow Q - \gamma\, \nabla_Q L(Q) \qquad (6)$$
where $\gamma$ is the learning rate.
As $D_{KL}(Q\|P) \neq D_{KL}(P\|Q)$, the gradient of the loss function, $\nabla_Q L(Q)$, will return different values depending on whether we use the inclusive or the exclusive divergence, as follows.
For the exclusive $L(Q)$:
$$\nabla_Q L(Q) = \alpha \left[ \sum_d \log\frac{Q_\theta}{P_1} + 1 \right] + (1-\alpha)\left[ \sum_d \log\frac{Q_\theta}{P_2} + 1 \right] \qquad (7)$$
And for the inclusive $L(Q)$:
$$\nabla_Q L(Q) = -(1-\alpha)\left[ \sum_d \frac{P_2}{Q_\theta} \right] - \alpha\left[ \sum_d \frac{P_1}{Q_\theta} \right] \qquad (8)$$
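One minimal way to carry out the descent of Equation (6) with the inclusive gradient of Equation (8) is to parameterise $Q_\theta$ through a softmax over logits, which keeps it a valid distribution throughout. The sketch below is our own illustration, not the authors' code; pushing Equation (8) through the softmax Jacobian yields the simple logit-space gradient $Q - (\alpha P_1 + (1-\alpha)P_2)$.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def minimise_inclusive_vpc_loss(p1, p2, alpha, lr=0.5, steps=500):
    """Gradient descent for Q* = argmin_Q [alpha*D_KL(P1||Q) + (1-alpha)*D_KL(P2||Q)].

    Q is parameterised as softmax(z); the inclusive gradient of Eq. (8), pushed
    through the softmax Jacobian, becomes dL/dz = Q - (alpha*P1 + (1-alpha)*P2).
    """
    p_bar = alpha * p1 + (1 - alpha) * p2
    z = np.zeros_like(p1)                 # start from the uniform distribution
    for _ in range(steps):
        q = softmax(z)
        z = z - lr * (q - p_bar)          # descent step of Eq. (6) in logit space
    return softmax(z)

# two policies that disagree strongly (rewards at opposite corners)
p1 = np.array([0.90, 0.04, 0.03, 0.03])
p2 = np.array([0.03, 0.03, 0.04, 0.90])
print(minimise_inclusive_vpc_loss(p1, p2, alpha=0.5))
# converges to the mixture [0.465, 0.035, 0.035, 0.465]: near-uniform over the disputed actions
```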
Now that we have defined the chaining process and the VPC loss function, the complete chaining algorithm is presented in Algorithm 1.
Algorithm 1 Variational Policy Chaining
  Initialise new or load existing policy chain $\pi_c$
  Initialise neural network policy $\pi_{NN}$ as a random policy
  for task in M do
    Load next policy to be chained, $\pi_t$
    for epoch in N do
      Uniformly choose next input state $s_i$
      Obtain marginal distributions $\pi_t(a_i \mid s_i)$, $\pi_c(a_i \mid s_i)$
      Obtain $\pi_{c+1}(a_i \mid s_i)$ that minimises loss $L(\pi_t, \pi_c)$
      Perform optimisation step on $\pi_{NN}$ towards $\pi_{c+1}(a_i \mid s_i)$
    end for
    Perform update: $\pi_c(a, s) \leftarrow \pi_{NN}$
  end for
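For concreteness, a compact Python sketch of one pass of Algorithm 1 is given below. It is an illustration under our own simplifying assumptions (a tabular logit table in place of the neural network $\pi_{NN}$, and the closed-form mixture as the per-state minimiser of the inclusive VPC loss); it is not the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def chain_policies(chain_logits, new_policy_logits, task_index,
                   epochs=1000, lr=0.1, seed=0):
    """One pass of Algorithm 1: absorb a newly learned policy into the chain.

    chain_logits, new_policy_logits : arrays of shape (n_states, n_actions).
    task_index : t in the chaining ratio alpha = 1/t (t >= 2).
    Returns the logits of the updated chain pi_{c+1}, i.e. the trained pi_NN.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = chain_logits.shape
    alpha = 1.0 / task_index
    nn_logits = rng.normal(scale=0.01, size=(n_states, n_actions))  # pi_NN, near-random policy

    for _ in range(epochs):
        s = rng.integers(n_states)                 # uniformly choose input state s_i
        pi_t = softmax(new_policy_logits[s])       # next policy to be chained, pi_t(a|s_i)
        pi_c = softmax(chain_logits[s])            # current chain, pi_c(a|s_i)
        # per-state distribution minimising the inclusive VPC loss: the alpha-weighted mixture
        target = alpha * pi_t + (1 - alpha) * pi_c
        q = softmax(nn_logits[s])
        nn_logits[s] -= lr * (q - target)          # optimisation step on pi_NN towards pi_{c+1}

    return nn_logits                               # update: pi_c <- pi_NN
```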
If a new task $T_i$ is to be learned, the agent begins with a policy identical to the chain $\pi_{c_{i-1}}$. This policy is then optimized through the PG method to obtain $\pi_i^*$ and, once optimized, is absorbed back into the chain $\pi_{c_{i-1}}$. This absorption is processed via minimizing the VPC loss function to obtain $\pi_{c_i}$. The next new task $T_{i+1}$ then begins with an agent using this obtained chain.
V. EVALUATION
In this section we evaluate VPC’s ability to facilitate life-
long reinforcement learning by applying it in a standard RL
benchmark rectangular grid-world environment.
In order to evaluate VPC’s ability to remember knowledge
across the tasks, the location of the reward must change to
render previous policies inaccurate or useless.
We implemented two agents: an RL (PG) agent with a
uniform random policy and a VPC agent. The aim is to
monitor the ability of each agent to re-learn the optimal policy
after the reward has been changed. The hyper-parameters used
in the implementation and evaluation of VPC are listed in
Table I.
Name           Description                                          Value
α_pg           learning rate for policy gradient                    0.01
γ_pg           reward decay for policy gradient                     0.95
episodes_pg    # episodes of training for policy gradient agents    0.01
α_vpc          chaining ratio                                       1/t
γ_vpc          learning rate for VPC π_NN                           0.001
epochs_vpc     # epochs of training for VPC π_NN                    1000

TABLE I: Hyper-parameters for evaluation of VPC.
We evaluate VPC in two experiments: the first aims to capture the sensitivity of VPC to the winner's curse (introduced in Section II) when learning a second task after optimizing its policy for another (previous) task, while the second aims to evaluate VPC's effectiveness as a CRL technique when learning an n-th task. For both experiments, VPC is compared to a simple policy gradient (PG) agent, as described in Section III.
A. Learning a 2nd task
Winner’s curse [12] can prevent an RL agent from learning
a new task due to the presence of bias towards existing
knowledge. An agent struggles to ‘unlearn’ a policy and hence
is unable to adapt to a new task. To demonstrate VPC’s ability
to avoid winner’s curse, we first train an RL agent with a
reward in location (2,2). We then initialise a VPC agent with
this policy chained with another PG agent policy for reward
location (6,6), and then task the VPC agent with learning a new
optimal policy for a third reward location (4,4). Simulation
results are depicted in Figure 3.
Fig. 3: Terminus states for the PG agent (a) and the VPC agent (b).
As expected, the baseline PG agent was susceptible to
winner’s curse and did not converge to the optimal policy
even after 1,000 training episodes. The VPC agent, however,
successfully re-learned the optimal policy from a continual
learning perspective, remaining unhindered by the new reward
location. Figure 3b depicts the terminus states for each episode
spent re-training the VPC agent equipped with a chained
policy; we can see that the agent reaches the goal after
approximately 500 episodes of training. These results confirm
that VPC is a credible technique for continual reinforcement
learning due to its aversion to winner’s curse, and its success
in learning a new policy on top of a different optimal policy.
B. Learning n-th task
In this experiment, we further explore VPC’s ability to chain
learnt policies when it is learning several (i.e., more than
two) tasks continually. We create a chain out of four policies
in which the rewards are located at opposite corners of the
GridWorld.
This experiment is performed as follows:
1) Forge a high-variance chain from multiple policies with
reward locations (1,1), (1,6), (6,6), (6,1).
2) Task this chain with re-learning a set of reward locations
dissimilar to those of the policies within the chain.
The results obtained for a high-variance VPC agent are
shown in Figure 4.
We observe that VPC is not only able to re-learn every test
successfully, but does so faster than a PG agent. Figures 4a
to 4f show that the VPC agents reach the goal after 150–
300 episodes, i.e. twice as fast as a randomly initialised PG
agent. These results confirm that the higher the variance of
the policy chain, the faster the chain converges to the ideal
policy for re-learning. They also validate the design proposed
in Section IV, where highly different conditional distributions
when chained together will ultimately converge to a uniform
distribution. These results also verify that VPC can correctly prioritise information from multiple policies.
Fig. 4: Terminus states for the high-variance 4-policy chain. Panels: (a) new reward location (1,3); (b) new reward location (3,1); (c) new reward location (4,4); (d) new reward location (6,3); (e) new reward location (3,6), 1,000 training episodes; (f) new reward location (3,6), 2,000 training episodes.
In summary, the first experiment demonstrated VPC's ability to re-learn effectively for the second task without being biased towards the previously optimized policy (i.e., avoiding the winner's curse). The second experiment consisted of four chained policies, demonstrating VPC's ability to effectively inherit knowledge from multiple previously encountered tasks. In both cases, only a single policy (called the policy chain in VPC) was stored throughout the agent's lifetime, ensuring scalability.
VI. CONCLUSION AND FUTURE WORK
Long-term deployment of RL agents in complex and changing environments motivates the need for lifelong learning algorithms that enable agents to re-use previously acquired knowledge (stored policies) and adapt it to solve new tasks. However, storing full policies can be intractable for very large sets of successive tasks, and re-using an optimised policy can introduce a bias in the agent's behaviour if the model diverges. To address these issues, this paper introduces Variational Policy Chaining (VPC), a continual reinforcement learning technique based on Kullback-Leibler divergence that stores a “policy chain” derived from all previously learnt policies. The novelty of our approach is that an RL agent does not need to store every full policy; instead, the policy chain is built from the divergence the learnt policies show from one task to another. This reduces the storage required for previous experience down to a single policy (i.e., the policy chain) and ensures that the agent avoids the ‘winner's curse’, i.e., the situation that arises when a policy is too specific to a single task to be applicable to others.
Future work directions include further evaluation on more complex tasks, e.g., a comparison with state-of-the-art approaches in situations with a small number of tasks, and studying a range of VPC applications. In addition, the specification and impact of the chaining ratio should be investigated, as empirical results suggest that VPC performance increases with the variance of the chain. Capturing how close the chain is to the ideal policy could also improve the effectiveness of VPC, as it would provide an indication of whether additional policies need to be added to the chain, helping the agent converge more quickly to an optimised policy chain.
VII. ACKNOWLEDGEMENTS
This publication is supported in part by research grants from Science Foundation Ireland (SFI) under Grant Numbers 13/RC/2077 and 16/SP/3804, and by the Irish Research Council through the Surpass New Horizons award.
REFERENCES
[1] I. Dusparic, J. Monteil, and V. Cahill, “Towards autonomic urban
traffic control with collaborative multi-policy reinforcement learning,” in
2016 IEEE 19th International Conference on Intelligent Transportation
Systems (ITSC). IEEE, 2016, pp. 2065–2070.
[2] M. Guériau and I. Dusparic, “Samod: Shared autonomous mobility-
on-demand using decentralized reinforcement learning,” in 2018 21st
International Conference on Intelligent Transportation Systems (ITSC).
IEEE, 2018, pp. 1558–1563.
[3] I. Dusparic, A. Taylor, A. Marinescu, F. Golpayegani, and S. Clarke,
“Residential demand response: Experimental evaluation and compari-
son of self-organizing techniques,” Renewable and Sustainable Energy
Reviews, vol. 80, pp. 1528–1536, 2017.
[4] M. B. Ring, “Child: A first step towards continual learning,” Machine
Learning, vol. 28, no. 1, pp. 77–104, 1997.
[5] S. Thrun and L. Pratt, Learning to learn. Springer Science & Business
Media, 2012.
[6] S. Farquhar and Y. Gal, “Towards robust evaluations of continual
learning,” arXiv preprint arXiv:1805.09733, 2018.
[7] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no.
3-4, pp. 279–292, 1992.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wier-
stra, and M. Riedmiller, “Playing atari with deep reinforcement learn-
ing,” arXiv preprint arXiv:1312.5602, 2013.
[9] O. Ostapenko, M. Puscas, T. Klein, P. Jahnichen, and M. Nabi, “Learning
to remember: A synaptic plasticity driven framework for continual
learning,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2019, pp. 11 321–11 329.
[10] C. Kaplanis, M. Shanahan, and C. Clopath, “Continual reinforcement
learning with complex synapses,” in ICML, 2018.
[11] B. van Niekerk, S. F. James, A. C. Earle, and B. Rosman, “Will it
blend? composing value functions in reinforcement learning,” CoRR,
vol. abs/1807.04439, 2018.
[12] R. Fox, A. Pakman, and N. Tishby, “Taming the noise in reinforcement
learning via soft updates,” CoRR, vol. abs/1512.08562, 2015.
[13] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2017.
[14] S. Kullback and R. A. Leibler, “On information and sufficiency,” Annals
of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.