Variational Policy Chaining for Lifelong
Reinforcement Learning
Christopher Doyle, Maxime Guériau and Ivana Dusparic
School of Computer Science and Statistics, Trinity College Dublin, Ireland
Email: doylec49@tcd.ie, maxime.gueriau@scss.tcd.ie, ivana.dusparic@scss.tcd.ie
Abstract—With increasing applications of reinforcement learning to real-life problems, it is becoming essential that agents are able to update their knowledge continually. Lifelong learning approaches aim to enable agents to retain the knowledge they learn and to selectively transfer knowledge to new tasks. Recent techniques for lifelong reinforcement learning have shown great success in getting an agent to generalise over several tasks. However, scalability becomes an issue when agents learn numerous tasks, as each task's information must be remembered. To address this issue, this paper proposes Variational Policy Chaining (VPC), which enables a reinforcement learning agent to generalise effectively and in a scalable manner when presented with continuous task updates, without storing multiple historic experiences. VPC uses the Kullback-Leibler divergence to isolate the most common pieces of knowledge and condenses the important knowledge into a single policy chain. We evaluate VPC in a GridWorld environment and compare it to vanilla policy gradient methods, showing that VPC's ability to reuse knowledge from previously encountered tasks reduces learning time in new tasks by up to 50%.
Index Terms—lifelong learning, continual learning, reinforcement learning
I. INTRODUCTION
Recent advances in Reinforcement Learning (RL) and deep neural networks have seen increasing exploration of classical and deep RL in real-world applications, such as renewable energy management and intelligent transportation systems [1]–[3]. However, for long-term deployment in an environment, it is not sufficient for agents to perform only the task(s) they have been pre-trained to achieve. They must also be able to continuously adapt their knowledge to learn new tasks, as well as remember and reuse knowledge from previously learnt tasks to bootstrap learning in newly encountered ones.
This need gave rise to the field of Lifelong Reinforcement Learning (LLRL), which aims to develop techniques that allow RL agents to generalise over multiple tasks. Current LLRL techniques typically fall into one of two categories: 1) relearning new and old tasks without the model diverging, or 2) selecting the most appropriate policy from a reservoir (or repository) of pre-trained policies. The trade-off is that relearning policies for each new task from scratch is slow, while storing full policies is not scalable to a large number of tasks if storage is limited.
This paper proposes an alternative in the novel method of Variational Policy Chaining (VPC). The general approach used by VPC is to store just a single policy, derived from all previously learnt policies, called the “policy chain”. This policy is designed to be loosely suitable for any task in the given environment, which provides a starting point for learning a new task.
II. RELATED WORK
Continual learning (CL) [4] emerged as a promising way to achieve lifelong learning [5]. This form of online learning allows agents to tackle problems in which they face a series of successive tasks and need to adapt incrementally to optimise the use of previous data. CL arose to address the issue of catastrophic forgetting across various areas within machine learning, including reinforcement learning. Related work in general continual learning primarily relies on Bayesian methods and can be categorized into prior-focused or likelihood-focused approaches [6]. Prior-focused approaches directly reuse part of the intermediate model as a prior when learning the new model, while likelihood-focused approaches estimate the likelihood of the new model on past tasks.
Continual Reinforcement Learning (CRL) was developed to enhance both tabular Q-Learning [7] and Deep Q-Learning [8]. A synaptic model was used in [9], [10] to account for biological complexity, allowing interacting components to evolve at different timescales. While it allows an agent to learn new tasks without forgetting previous ones, the optimal policies for each task take much longer to learn due to the bias imposed from one task on another. Composition approaches [11] are an alternative way to achieve CRL. Here, an agent stores a learned policy for each completed task and can re-use them when encountering new (very similar) tasks. Composition approaches are therefore often susceptible to the ‘winner’s curse’ [12] – the idea that the more suitable a policy is for one task, the harder it is to re-learn another.
III. BACKGROUND
In this section we present the background knowledge that our approach builds on: the basics of RL and the Kullback-Leibler divergence, the method VPC uses to measure the distance between two policies.
A. Reinforcement Learning
Reinforcement learning (RL) is a technique in which an intelligent agent learns to map environment situations (i.e., states) to actions so as to maximize a numerical reward signal it receives from the environment [13]. An RL process is expressed through the following terms:
• $t$: the step in time between actions and observations.
• $S$: the ‘state space’, i.e., the set of all states contained within an environment.
• $A$: the ‘action space’, i.e., the set of all actions that an agent can take.
• $T$: the state transition function, which describes $P(s_{t+1} \mid s_t, a_t)$.
• $r$: the numerical reward signal associated with every state transition.
A learning episode is then defined as the sequence $s_t, a_t, r_t, s_{t+1}, \ldots, s_T$, where $s_t$ is the state at timestep $t$, $a_t$ is the action taken from this state, $r_t$ is the reward obtained from the transition resulting from $a_t$, $s_{t+1}$ is the resulting state entered after $a_t$, and $s_T$ is a terminal state. Optimizing an agent's performance can be seen as maximizing the cumulative expected reward from any given state $s_t$. This implies that each action taken will result in an optimal state transition and accompanying reward. We say, therefore, that an agent has obtained the optimal policy $\pi^*$, where an agent's policy $\pi(s_t)$ is the conditional probability distribution over all possible actions in state $s_t$, $\pi(A \mid s_t)$.
In this work we use a Policy Gradient (PG) [13] method to obtain the optimal policy. PG learns a parameterized policy which selects an action in each state so as to maximize the expected cumulative reward $J$ obtained under policy $\pi$. $J$ is parameterized with a parameter vector $\theta$ and is maximized by following the gradient of the policy's return, $\nabla_\theta J(\theta)$, with respect to the parameters, according to the formula:
$$\nabla_\theta J(\theta) = \sum_t \pi(t;\theta)\, \nabla_\theta \log\big(\pi_\theta(t;\theta)\big)\, r_t \qquad (1)$$
where $r_t$ is the reward received at time step $t$.
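For illustration, a REINFORCE-style estimate of the gradient in Equation (1) for a tabular softmax policy might be implemented as in the following sketch (our own example, with assumed array shapes and learning rate; not the implementation used in this paper):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_update(theta, episode, lr=0.01):
    """One REINFORCE-style update of a tabular softmax policy.

    theta   : array of shape (n_states, n_actions) holding policy logits.
    episode : list of (state, action, reward) tuples collected under pi_theta.
    """
    grad = np.zeros_like(theta)
    for s, a, r in episode:
        pi = softmax(theta[s])
        # gradient of log pi_theta(a|s) with respect to the logits of state s
        dlog = -pi
        dlog[a] += 1.0
        grad[s] += r * dlog          # reward-weighted log-likelihood gradient, as in Eq. (1)
    return theta + lr * grad         # gradient ascent on J(theta)
```

A practical implementation would typically replace the per-step reward with a discounted return; the sketch keeps $r_t$ to mirror Equation (1).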
VPC utilizes the fact that the optimal policy is expressed as a probability distribution over all possible actions in state $s_t$ to compare the difference between policies (further discussed in Section IV) using the Kullback-Leibler divergence.
B. Kullback-Leibler divergence
The Kullback-Leibler (KL) divergence [14] is a method used to measure the similarity between two probability distributions $P(d)$ and $Q(d)$, and it is calculated as:
$$D_{KL}(Q \,\|\, P) = \sum_d Q(d) \log \frac{Q(d)}{P(d)} \qquad (2)$$
KL divergence is not a symmetrical operation, and $D_{KL}(Q\|P) \neq D_{KL}(P\|Q)$. $D_{KL}(Q\|P)$ as defined above is the divergence from Q to P, also known as the exclusive divergence, while the inclusive divergence is given by $D_{KL}(P\|Q)$. A Kullback-Leibler divergence $D_{KL}$ of 0 means that $P(d)$ and $Q(d)$ are identical. In the context of machine learning, $D_{KL}(P\|Q)$ is often called the information gain achieved if $Q(d)$ is used instead of $P(d)$.
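For the discrete action distributions used throughout this paper, Equation (2) can be evaluated directly. The following sketch (our illustration; the small epsilon for numerical stability is an assumption) computes both directions of the divergence:

```python
import numpy as np

def kl_divergence(q, p, eps=1e-12):
    """Exclusive KL divergence D_KL(Q || P) between two discrete distributions."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * np.log((q + eps) / (p + eps))))

q = np.array([0.70, 0.10, 0.10, 0.10])   # e.g. a policy over {north, south, east, west}
p = np.array([0.25, 0.25, 0.25, 0.25])   # uniform policy
print(kl_divergence(q, p))               # exclusive divergence D_KL(Q || P)
print(kl_divergence(p, q))               # inclusive divergence D_KL(P || Q); note the asymmetry
```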
IV. VARIATIONAL POLICY CHAINING
The Variational Policy Chaining (VPC) algorithm aims to provide an RL agent with a sense of intuition about which actions to take, without having to explicitly learn a policy or draw one from a storage repository.
VPC proposes to create a policy which contains the maximum amount of useful information from all learned policies. This policy is named the policy chain, $\pi_c(s, a)$, and it provides an optimal starting point for learning as wide a range of policies as possible. The storage requirements of this solution are equal to those of a single policy, making an agent theoretically capable of performing N tasks from just one stored policy.
To illustrate this, we give an example of VPC application in
the standard RL benchmark environment GridWorld as shown
in Figure 1.
Fig. 1: Ideal GridWorld policy for optimal relearning.
The cells of the grid correspond to the states of the RL
environment. Four actions are possible in each of the cells:
north, south, east, and west, which deterministically cause an
agent to move (i.e., cause a state transition) one cell in the
chosen direction. The agent's goal, as in any RL scenario, is to maximize the long-term reward received, with the reward's location initially unknown to the agent.
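A minimal deterministic GridWorld of this kind can be sketched as below. This is our own illustration: the grid size, step penalty, and goal reward are assumptions, since the paper only specifies the states, the four deterministic actions, and an initially unknown reward location.

```python
class GridWorld:
    """Deterministic grid: cells are states; four actions move one cell."""
    MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

    def __init__(self, size=6, goal=(4, 4), step_reward=-0.01, goal_reward=1.0):
        self.size, self.goal = size, goal
        self.step_reward, self.goal_reward = step_reward, goal_reward
        self.state = (0, 0)

    def reset(self, start=(0, 0)):
        self.state = start
        return self.state

    def step(self, action):
        dr, dc = self.MOVES[action]
        r, c = self.state
        # a move that would leave the grid keeps the agent in place
        self.state = (min(max(r + dr, 0), self.size - 1),
                      min(max(c + dc, 0), self.size - 1))
        done = self.state == self.goal
        return self.state, (self.goal_reward if done else self.step_reward), done
```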
We present two cases in particular that demonstrate the inner
workings of VPC.
Case 1: $\Pi_1(A \mid s_i) \approx \Pi_2(A \mid s_i)$: This scenario is simple to analyse; we have two policies with similar distributions given some state $s_i$. The resulting chain of the two policies should produce a similar distribution. For example, if both policies contain information inferring that the boundary of the GridWorld is bad, then this information should be contained within the resulting policy chain.
Case 2: $\Pi_1(A \mid s_i) \not\approx \Pi_2(A \mid s_i)$: This scenario dives deeper into the capabilities of VPC. The aim of VPC is to eventually generalise to some ideal policy chain. We can define the ideal policy for relearning as the policy that remembers everything that inhibits learning (e.g., boundary locations) and forgets all information that can introduce a bias (e.g., the reward was in the top left last time, therefore it should be there again). We call this the ideal policy chain for relearning, and the equivalent GridWorld policy is shown graphically in Figure 1. However, if two policies have each learned a different reward located at either end of the GridWorld, then their distributions will differ greatly around the centre of the board. The aim, therefore, is that where policies differ, the chain should tend towards a uniform distribution. We achieve this by prioritising the inclusive version of the KL divergence when minimizing the VPC loss function, as discussed in the rest of this section.
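To make the role of the inclusive divergence explicit (a short derivation we add for clarity; it is not spelled out in the paper): minimising a weighted sum of inclusive divergences over a candidate distribution $Q$ has a closed-form solution, namely the weighted mixture of the inputs,
$$\arg\min_{Q} \sum_i w_i\, D_{KL}(P_i \,\|\, Q) \;=\; \arg\max_{Q} \sum_d \Big(\sum_i w_i P_i(d)\Big) \log Q(d) \;=\; \sum_i w_i P_i, \qquad \sum_i w_i = 1.$$
If $P_1$ and $P_2$ concentrate their mass on opposite actions, as do policies trained for rewards at opposite corners, their weighted mixture spreads mass over both, i.e., it is close to uniform over the disputed actions. The exclusive direction, by contrast, is mode-seeking and would tend to commit to one of the two policies.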
A. Policy Chaining and VPC Loss Function
The procedure for training a new policy chain and applying it to a task is presented in Figure 2. An environment state is sampled and used as an input to both the existing policy chain and the next policy to be added to the chain. We then obtain two discrete conditional probability distributions, which are fed into the VPC loss function. Minimizing this loss function returns a new discrete conditional probability distribution designed to mimic the ideal chain. A neural network is trained on these output distributions from the VPC loss function to generalise to the new policy chain.
Fig. 2: Training graph for a variational policy chain. (The figure shows the VPC agent interacting with the environment: the observation $s_t$ is fed to the prior policy, which is typically the previous instance of the policy chain, and to the posterior policy, which is the next to be added to the chain; the Kullback-Leibler-based VPC loss function combines the two, and the neural network policy generalises to the new optimal chain and selects action $a_t$.)
VPC is introduced as the amalgamation of important self-information across a running chain of policy distributions. A new link is created in the ‘chain’ when we extract the most important self-information from a policy and include it in the chain model. The VPC loss function is used to extract the most important self-information from within a policy distribution. Through this function, the policy chain is strengthened by absorbing new information from the next policy. The output of the loss function is $\pi_t$, which is defined as:
“The policy distribution that minimises a given ratio of information lost between the posterior policy (the next policy to be added to the chain) for task $t-1$ and the existing policy chain (prior policy), which was used for the previous task $t-1$.”
$$\pi_t = \arg\min_{\pi_\theta}\; \alpha\, D_{KL}(\pi_\theta \,\|\, \pi_{t-1,\mathrm{posterior}}) + (1-\alpha)\, D_{KL}(\pi_\theta \,\|\, \pi_{t-1,\mathrm{prior}}) \qquad (3)$$
where $D_{KL}$ is the Kullback-Leibler divergence.
To determine how the loss function returns the new chain distribution $\pi_t$, we break down the above equation as follows. Firstly, $D_{KL}(Q\|P)$ calculates the exclusive divergence between the distributions Q and P. The lower the result of this divergence, the more similar Q is to P; the Q that returns the lowest value for this expression is Q = P.
Therefore, to minimise Equation 3, we must find the value of $\pi_t$ that sufficiently minimises both the divergence between itself and the optimal policy for the current task, and the divergence between itself and the current chain distribution. The sum is regularized by the chaining ratio $\alpha$ such that the chain is given priority in this minimisation. We use $\alpha = 1/t$, where $t$ is the task number beginning at 2, to ensure that the posterior policies are fairly weighted against the chain.
This technique is labelled variational due to the parametric nature of $\pi_\theta$. This distribution is varied until we achieve the result that minimises the desired expression. The optimal $\pi_\theta$ is found by applying a gradient descent approach to the loss function until we reach an acceptable minimum.
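As a concrete illustration of Equation (3), the VPC loss for a candidate chain distribution in a single state can be computed as in the sketch below (our own example with assumed distributions; the paper's implementation varies a neural network policy rather than a fixed candidate):

```python
import numpy as np

def kl(q, p, eps=1e-12):
    """D_KL(Q || P) for discrete distributions."""
    return float(np.sum(q * np.log((q + eps) / (p + eps))))

def vpc_loss(pi_theta, pi_posterior, pi_prior, alpha):
    """VPC loss of Eq. (3): weighted divergences from the candidate chain pi_theta
    to the newly learned policy (posterior) and to the existing chain (prior)."""
    return alpha * kl(pi_theta, pi_posterior) + (1 - alpha) * kl(pi_theta, pi_prior)

alpha = 1.0 / 2                                     # chaining ratio for the 2nd task (t = 2)
pi_posterior = np.array([0.80, 0.10, 0.05, 0.05])   # policy just learned for the new task
pi_prior     = np.array([0.05, 0.05, 0.10, 0.80])   # existing policy chain
pi_theta     = np.ones(4) / 4                       # candidate chain distribution
print(vpc_loss(pi_theta, pi_posterior, pi_prior, alpha))
```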
B. Minimising The Loss Function
In its general form, the VPC loss function is given by:
$$L(Q) = \alpha\, D_{KL}(Q_\theta \,\|\, P_1) + (1-\alpha)\, D_{KL}(Q_\theta \,\|\, P_2) \qquad (4)$$
with the optimal Q distribution given by:
$$Q^* = \arg\min_{Q_\theta} L(Q) \qquad (5)$$
To find the $Q^*$ that minimises the loss function $L(Q)$, we use a gradient descent step:
$$Q \leftarrow Q - \gamma\, \nabla_Q L(Q) \qquad (6)$$
where $\gamma$ is the learning rate.
As $D_{KL}(Q\|P) \neq D_{KL}(P\|Q)$, the gradient of the loss function, $\nabla_Q L(Q)$, will return different values depending on whether we use the inclusive or the exclusive divergence, as follows.
For the exclusive $L(Q)$:
$$\nabla_Q L(Q) = \alpha \left[ \sum_d \log\frac{Q_\theta}{P_1} + 1 \right] + (1-\alpha)\left[ \sum_d \log\frac{Q_\theta}{P_2} + 1 \right] \qquad (7)$$
And for the inclusive $L(Q)$:
$$\nabla_Q L(Q) = -(1-\alpha)\left[ \sum_d \frac{P_2}{Q_\theta} \right] - \alpha\left[ \sum_d \frac{P_1}{Q_\theta} \right] \qquad (8)$$
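One minimal way to carry out the descent of Equation (6) with the inclusive gradient of Equation (8) is to parameterise $Q_\theta$ through a softmax over logits, which keeps it a valid distribution throughout. The sketch below is our own illustration, not the authors' code; pushing Equation (8) through the softmax Jacobian yields the simple logit-space gradient $Q - (\alpha P_1 + (1-\alpha)P_2)$.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def minimise_inclusive_vpc_loss(p1, p2, alpha, lr=0.5, steps=500):
    """Gradient descent for Q* = argmin_Q [alpha*D_KL(P1||Q) + (1-alpha)*D_KL(P2||Q)].

    Q is parameterised as softmax(z); the inclusive gradient of Eq. (8), pushed
    through the softmax Jacobian, becomes dL/dz = Q - (alpha*P1 + (1-alpha)*P2).
    """
    p_bar = alpha * p1 + (1 - alpha) * p2
    z = np.zeros_like(p1)                 # start from the uniform distribution
    for _ in range(steps):
        q = softmax(z)
        z = z - lr * (q - p_bar)          # descent step of Eq. (6) in logit space
    return softmax(z)

# two policies that disagree strongly (rewards at opposite corners)
p1 = np.array([0.90, 0.04, 0.03, 0.03])
p2 = np.array([0.03, 0.03, 0.04, 0.90])
print(minimise_inclusive_vpc_loss(p1, p2, alpha=0.5))
# converges to the mixture [0.465, 0.035, 0.035, 0.465]: near-uniform over the disputed actions
```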
Now that we have defined the chaining process and the VPC loss function, the complete chaining algorithm is presented in Algorithm 1.
Algorithm 1 Variational Policy Chaining
  Initialise new or load existing policy chain $\pi_c$
  Initialise neural network policy $\pi_{NN}$ as a random policy
  for task in M do
    Load next policy to be chained, $\pi_t$
    for epoch in N do
      Uniformly choose next input state $s_i$
      Obtain marginal distributions $\pi_t(a_i \mid s_i)$, $\pi_c(a_i \mid s_i)$
      Obtain $\pi_{c+1}(a_i \mid s_i)$ that minimises loss $L(\pi_t, \pi_c)$
      Perform optimisation step on $\pi_{NN}$ towards $\pi_{c+1}(a_i \mid s_i)$
    end for
    Perform update: $\pi_c(a, s) \leftarrow \pi_{NN}$
  end for
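For concreteness, a compact Python sketch of one pass of Algorithm 1 is given below. It is an illustration under our own simplifying assumptions (a tabular logit table in place of the neural network $\pi_{NN}$, and the closed-form mixture as the per-state minimiser of the inclusive VPC loss); it is not the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def chain_policies(chain_logits, new_policy_logits, task_index,
                   epochs=1000, lr=0.1, seed=0):
    """One pass of Algorithm 1: absorb a newly learned policy into the chain.

    chain_logits, new_policy_logits : arrays of shape (n_states, n_actions).
    task_index : t in the chaining ratio alpha = 1/t (t >= 2).
    Returns the logits of the updated chain pi_{c+1}, i.e. the trained pi_NN.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = chain_logits.shape
    alpha = 1.0 / task_index
    nn_logits = rng.normal(scale=0.01, size=(n_states, n_actions))  # pi_NN, near-random policy

    for _ in range(epochs):
        s = rng.integers(n_states)                 # uniformly choose input state s_i
        pi_t = softmax(new_policy_logits[s])       # next policy to be chained, pi_t(a|s_i)
        pi_c = softmax(chain_logits[s])            # current chain, pi_c(a|s_i)
        # per-state distribution minimising the inclusive VPC loss: the alpha-weighted mixture
        target = alpha * pi_t + (1 - alpha) * pi_c
        q = softmax(nn_logits[s])
        nn_logits[s] -= lr * (q - target)          # optimisation step on pi_NN towards pi_{c+1}

    return nn_logits                               # update: pi_c <- pi_NN
```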
If a new task $T_i$ is to be learned, the agent begins with a policy identical to the chain $\pi_{c_{i-1}}$. This policy is then optimized through the PG method to obtain $\pi_i^*$ and, once optimized, is absorbed back into the chain $\pi_{c_{i-1}}$. This absorption is processed via minimizing the VPC loss function to obtain $\pi_{c_i}$. The next new task $T_{i+1}$ then begins with an agent using this obtained chain.
V. EVALUATION
In this section we evaluate VPC’s ability to facilitate life-
long reinforcement learning by applying it in a standard RL
benchmark rectangular grid-world environment.
In order to evaluate VPC’s ability to remember knowledge
across the tasks, the location of the reward must change to
render previous policies inaccurate or useless.
We implemented two agents: an RL (PG) agent with a
uniform random policy and a VPC agent. The aim is to
monitor the ability of each agent to re-learn the optimal policy
after the reward has been changed. The hyper-parameters used
in the implementation and evaluation of VPC are listed in
Table I.
Name           Description                                          Value
α_pg           learning rate for policy gradient                    0.01
γ_pg           reward decay for policy gradient                     0.95
episodes_pg    # episodes of training for policy gradient agents    0.01
α_vpc          chaining ratio                                       1/t
γ_vpc          learning rate for VPC π_NN                           0.001
epochs_vpc     # epochs of training for VPC π_NN                    1000

TABLE I: Hyper-parameters for evaluation of VPC.
We evaluate VPC in two experiments: the first aims to capture the sensitivity of VPC to the winner's curse (introduced in Section II) when learning a second task after optimizing its policy for another (previous) task, while the second aims to evaluate VPC's effectiveness as a CRL technique when learning an n-th task. For both experiments, VPC is compared to a simple policy gradient (PG) agent, as described in Section III.
A. Learning a 2nd task
Winner’s curse [12] can prevent an RL agent from learning
a new task due to the presence of bias towards existing
knowledge. An agent struggles to ‘unlearn’ a policy and hence
is unable to adapt to a new task. To demonstrate VPC’s ability
to avoid winner’s curse, we first train an RL agent with a
reward in location (2,2). We then initialise a VPC agent with
this policy chained with another PG agent policy for reward
location (6,6), and then task the VPC agent with learning a new
optimal policy for a third reward location (4,4). Simulation
results are depicted in Figure 3.
Fig. 3: Terminus states for the PG agent (a) and the VPC agent (b).
As expected, the baseline PG agent was susceptible to
winner’s curse and did not converge to the optimal policy
even after 1,000 training episodes. The VPC agent, however,
successfully re-learned the optimal policy from a continual
learning perspective, remaining unhindered by the new reward
location. Figure 3b depicts the terminus states for each episode
spent re-training the VPC agent equipped with a chained
policy; we can see that the agent reaches the goal after
approximately 500 episodes of training. These results confirm
that VPC is a credible technique for continual reinforcement
learning due to its aversion to winner’s curse, and its success
in learning a new policy on top of a different optimal policy.
B. Learning n-th task
In this experiment, we further explore VPC’s ability to chain
learnt policies when it is learning several (i.e., more than
two) tasks continually. We create a chain out of four policies
in which the rewards are located at opposite corners of the
GridWorld.
This experiment is performed as follows:
1) Forge a high-variance chain from multiple policies with
reward locations (1,1), (1,6), (6,6), (6,1).
2) Task this chain with re-learning a set of reward locations
dissimilar to those of the policies within the chain.
The results obtained for a high-variance VPC agent are
shown in Figure 4.
We observe that VPC is not only able to re-learn every test
successfully, but does so faster than a PG agent. Figures 4a
to 4f show that the VPC agents reach the goal after 150–
300 episodes, i.e. twice as fast as a randomly initialised PG
agent. These results confirm that the higher the variance of
the policy chain, the faster the chain converges to the ideal
policy for re-learning. They also validate the design proposed
in Section IV, where highly different conditional distributions
when chained together will ultimately converge to a uniform
distribution. These results also verify that VPC can correctly prioritise information from multiple policies.
Fig. 4: Terminus states for the high-variance 4-policy chain. Panels: (a) new reward location (1,3); (b) new reward location (3,1); (c) new reward location (4,4); (d) new reward location (6,3); (e) new reward location (3,6), 1,000 training episodes; (f) new reward location (3,6), 2,000 training episodes.
In summary, the first experiment demonstrated VPC's ability to re-learn effectively for the second task without being biased towards the previously optimized policy (i.e., avoiding the winner's curse). The second experiment consisted of four chained policies, demonstrating VPC's ability to effectively inherit knowledge from multiple previously encountered tasks. In both cases, only a single policy (called the policy chain in VPC) was stored throughout the agent's lifetime, ensuring scalability.
VI. CONCLUSION AND FUTURE WORK
Long-term deployment of RL agents in complex and changing environments motivates the need for lifelong learning algorithms that enable agents to re-use previously acquired knowledge (stored policies) and adapt it to solve new tasks. However, storing full policies can be intractable for very large sets of successive tasks, and re-using an optimised policy can introduce a bias in the agent's behaviour if the model diverges. To address these issues, this paper introduces Variational Policy Chaining (VPC), a continual reinforcement learning technique based on Kullback-Leibler divergence that stores a “policy chain” derived from all previously learnt policies. The novelty of our approach is that an RL agent does not need to store every full policy; instead, the policy chain is built from the divergence the learnt policies show from one task to another. This reduces the storage required for previous experience down to a single policy (i.e., the policy chain) and ensures that the agent avoids the ‘winner's curse’, i.e., the situation that arises when a policy is too specific to a single task to be applicable to others.
Future work directions include further evaluation on more complex tasks, e.g., a comparison with state-of-the-art approaches in situations with a small number of tasks, and studying a range of VPC applications. In addition, the specification and impact of the chaining ratio should be investigated, as empirical results suggest that VPC performance increases with the variance of the chain. Capturing how close the chain is to the ideal policy could also improve the effectiveness of VPC, as it would provide an indication of whether additional policies need to be added to the chain, helping the agent converge more quickly to an optimised policy chain.
VII. ACKNOWLEDGEMENTS
This publication is supported in part by research grants from Science Foundation Ireland (SFI) under Grant Numbers 13/RC/2077 and 16/SP/3804, and by the Irish Research Council through the Surpass New Horizons award.
REFERENCES
[1] I. Dusparic, J. Monteil, and V. Cahill, “Towards autonomic urban
traffic control with collaborative multi-policy reinforcement learning,” in
2016 IEEE 19th International Conference on Intelligent Transportation
Systems (ITSC). IEEE, 2016, pp. 2065–2070.
[2] M. Guériau and I. Dusparic, “Samod: Shared autonomous mobility-
on-demand using decentralized reinforcement learning,” in 2018 21st
International Conference on Intelligent Transportation Systems (ITSC).
IEEE, 2018, pp. 1558–1563.
[3] I. Dusparic, A. Taylor, A. Marinescu, F. Golpayegani, and S. Clarke,
“Residential demand response: Experimental evaluation and compari-
son of self-organizing techniques,” Renewable and Sustainable Energy
Reviews, vol. 80, pp. 1528–1536, 2017.
[4] M. B. Ring, “Child: A first step towards continual learning,” Machine
Learning, vol. 28, no. 1, pp. 77–104, 1997.
[5] S. Thrun and L. Pratt, Learning to learn. Springer Science & Business
Media, 2012.
[6] S. Farquhar and Y. Gal, “Towards robust evaluations of continual
learning,” arXiv preprint arXiv:1805.09733, 2018.
[7] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no.
3-4, pp. 279–292, 1992.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wier-
stra, and M. Riedmiller, “Playing atari with deep reinforcement learn-
ing,” arXiv preprint arXiv:1312.5602, 2013.
[9] O. Ostapenko, M. Puscas, T. Klein, P. Jahnichen, and M. Nabi, “Learning
to remember: A synaptic plasticity driven framework for continual
learning,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2019, pp. 11 321–11 329.
[10] C. Kaplanis, M. Shanahan, and C. Clopath, “Continual reinforcement
learning with complex synapses,” in ICML, 2018.
[11] B. van Niekerk, S. F. James, A. C. Earle, and B. Rosman, “Will it
blend? composing value functions in reinforcement learning,” CoRR,
vol. abs/1807.04439, 2018.
[12] R. Fox, A. Pakman, and N. Tishby, “Taming the noise in reinforcement
learning via soft updates,” CoRR, vol. abs/1512.08562, 2015.
[13] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2017.
[14] S. Kullback and R. A. Leibler, “On information and sufficiency,” Annals
of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.