Reinforcement Learning and Markov Decision Processes
Martijn van Otterlo and Marco Wiering
Abstract Situated in between supervised learning and unsupervised learning, the paradigm of reinforcement learning deals with learning in sequential decision making problems in which there is limited feedback. This text introduces the intuitions and concepts behind Markov decision processes and two classes of algorithms for computing optimal behaviors: reinforcement learning and dynamic programming. First, the formal framework of Markov decision processes is defined, accompanied by the definition of value functions and policies. The main part of this text deals with introducing foundational classes of algorithms for learning optimal behaviors, based on various definitions of optimality with respect to the goal of learning sequential decisions. Additionally, it surveys efficient extensions of the foundational algorithms, differing mainly in the way feedback given by the environment is used to speed up learning, and in the way they concentrate on relevant parts of the problem. For both model-based and model-free settings these efficient extensions have proven useful in scaling up to larger problems.
This text was assembled from the initial chapters of The Logic of Adaptive Behavior by Martijn van Otterlo (2008), later published at IOS Press (2009). The purpose of this chapter is to show i) which kinds of algorithms, concepts, techniques etc. one can assume when writing chapters on specific topics (i.e. concepts such as MDPs and Q-learning can be assumed, and do not have to be explained again), and ii) the type of notation that is expected (e.g. the AI-style of RL, as opposed to the Bertsekas and Tsitsiklis (1996) OR-style). The specifics and length of this chapter are – of course – subject to change.
Martijn van Otterlo
Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Heverlee, Belgium, e-mail:
martijn.vanotterlo@cs.kuleuven.be
Marco Wiering
Department of Artificial Intelligence, University of Groningen, The Netherlands, e-mail: m.a.wiering@rug.nl
1 Introduction
Markov Decision Processes (MDPs) [Puterman(1994)] are an intuitive and fundamental formalism for decision-theoretic planning (DTP) [Boutilier et al(1999), Boutilier(1999)], reinforcement learning (RL) [Bertsekas and Tsitsiklis(1996), Sutton and Barto(1998), Kaelbling et al(1996)] and other learning problems in stochastic domains. In this model, an environment is modelled as a set of states, and actions can be performed to control the system's state. The goal is to control the system in such a way that some performance criterion is maximized. Many problems, such as (stochastic) planning problems, learning robot control and game playing, have successfully been modelled in terms of an MDP. In fact, MDPs have become the de facto standard formalism for learning sequential decision making.
DTP [Boutilier et al(1999)], i.e. planning using decision-theoretic notions to represent uncertainty and plan quality, is an important extension of the AI planning paradigm, adding the ability to deal with uncertainty in action effects and the ability to deal with less-defined goals. Furthermore, it adds a significant dimension in that it considers situations in which factors such as resource consumption and uncertainty demand solutions of varying quality, for example in real-time decision situations. There are many connections between AI planning, research done in the field of operations research [Winston(1991)] and control theory [Bertsekas(1995)], as most work in these fields on sequential decision making can be viewed as instances of MDPs. The notion of a plan in AI planning, i.e. a series of actions from a start state to a goal state, is extended to the notion of a policy, which is a mapping from all states to an (optimal) action, based on decision-theoretic measures of optimality with respect to some goal to be optimized.
As an example, consider a typical planning domain involving boxes to be moved around, where the goal is to move some particular boxes to a designated area. This type of problem can be solved using AI planning techniques. Consider now a slightly more realistic extension in which some of the actions can fail, or have uncertain side-effects that can depend on factors beyond the operator's control, and where the goal is specified by giving credit for how many boxes are put in the right place. In this type of environment, the notion of a plan is less suitable, because a sequence of actions can have many different outcomes, depending on the effects of the operators used in the plan. Instead, the methods in this chapter are concerned with policies that map states onto actions in such a way that the expected outcome of the operators will have the intended effects. The expectation over actions is based on a decision-theoretic expectation with respect to their probabilistic outcomes and the credits associated with the problem goals. The MDP framework allows for online solutions that learn optimal policies gradually through simulated trials, and additionally, it allows for approximate solutions with respect to resources such as computation time. Finally, the model allows for numeric, decision-theoretic measurement of the quality of policies and of learning performance.
environment: You are in state 65. You have 4 possible actions.
agent: I'll take action 2.
environment: You have received a reward of 7 units. You are now in state 15. You have 2 possible actions.
agent: I'll take action 1.
environment: You have received a reward of 4 units. You are now in state 65. You have 4 possible actions.
agent: I'll take action 2.
environment: You have received a reward of 5 units. You are now in state 44. You have 5 possible actions.
...
Fig. 1 Example of interaction between an agent and its environment, from a RL perspective.
For example, policies can be ordered by how much credit they receive, or by how much computation is needed for a particular performance.
This chapter will cover the broad spectrum of methods that have been developed in the literature to compute good or optimal policies for problems modelled as an MDP. The term RL is associated with the more difficult setting in which no (prior) knowledge about the MDP is presented. The task of the algorithm is then to interact, or experiment with the environment (i.e. the MDP), in order to gain knowledge about how to optimize its behavior, being guided by the evaluative feedback (rewards). The model-based setting, in which the full transition dynamics and reward distributions are known, is usually characterized by the use of dynamic programming (DP) techniques. However, we will see that the underlying basis is very similar, and that mixed forms occur.
2 Learning Sequential Decision Making
RL is a general class of algorithms in the field of machine learning that aims at
allowing an agent to learn how to behave in an environment, where the only feed-
back consists of a scalar reward signal. RL should not be seen as characterized by a
particular class of learning methods, but rather as a learning problem or a paradigm.
The goal of the agent is to perform actions that maximize the reward signal in the
long run.
The distinction between the agent and the environment might not always be the most intuitive one. We will draw a boundary based on control [Sutton and Barto(1998)]: everything the agent cannot control is considered part of the environment. For example, although the motors of a robot agent might be considered part of the agent, their exact functioning in the environment is beyond the agent's control. It can give commands to gear up or down, but their physical realization can be influenced by many things.
An example of interaction with the environment is given in Figure 1. It shows how the interaction between an agent and the environment can take place. The agent can choose an action in each state, and the perceptions the agent gets from the environment are the environment's state after each action plus the scalar reward signal at each step. Here a discrete model is used in which there are distinct numbers for each state and action. The way the interaction is depicted is highly general in the sense that one just talks about states and actions as discrete symbols. In the rest of this book we will be more concerned with interactions in which states and actions have more structure, such that a state can be something like "there are two blue boxes and one white one and you are standing next to a blue box". However, this figure clearly shows the mechanism of sequential decision making.
There are several important aspects in learning sequential decision making which
we will describe in this section, after which we will describe formalizations in the
next sections.
Approaching Sequential Decision Making.
There are several classes of algorithms that deal with the problem of sequential
decision making. In this book we deal specifically with the topic of learning, but
some other options exist.
The first solution is the programming solution. An intelligent system for sequen-
tial decision making can – in principle – be programmed to handle all situations. For
each possible state an appropriate or optimal action can be specified a priori. How-
ever, this puts a heavy burden on the designer or programmer of the system. All sit-
uations should be foreseen in the design phase and programmed into the agent. This
is a tedious and almost impossible task for most interesting problems, and it only
works for problems which can be modelled completely. In most realistic problems
this is not possible due to the sheer size of the problem, or the intrinsic uncertainty
in the system. A simple example is robot control in which factors such as lighting or
temperature can have a large, and unforeseen, influence on the behavior of camera
and motor systems. Furthermore, in situations where the problem changes, for ex-
ample due to new elements in the description of the problem or changing dynamic
of the system, a programmed solution will no longer work. Programmed solutions
are brittle in that they will only work for completely known, static problems with
fixed probability distributions.
A second solution uses search and planning for sequential decision making. The successful chess program Deep Blue [Schaeffer and Plaat(1997)] was able to defeat the human world champion Garry Kasparov by smart, brute force search algorithms that used a model of the dynamics of chess, tuned to Kasparov's style of playing. When the dynamics of the system are known, one can search or plan from the current state to a desirable goal state. However, when there is uncertainty about the action outcomes, standard search and planning algorithms do not apply. Admissible heuristics can solve some problems concerning the reward-based nature of sequential decision making, but the probabilistic effects of actions pose a difficult problem. Probabilistic planning algorithms exist, but their performance is not as good as that of their deterministic counterparts. An additional problem is that planning and
search focus on specific start and goal states. In contrast, we are looking for policies
which are defined for all states, and are defined with respect to rewards.
The third solution is learning, and this will be the main topic of this book. Learn-
ing has several advantages in sequential decision making. First, it relieves the de-
signer of the system from the difficult task of deciding upon everything in the design
phase. Second, it can cope with uncertainty, goals specified in terms of reward mea-
sures, and with changing situations. Third, it is aimed at solving the problem for
every state, as opposed to a mere plan from one state to another. Additionally, al-
though a model of the environment can be used or learned, it is not necessary in
order to compute optimal policies, as exemplified by RL methods. Everything can be learned from interaction with the environment.
Online versus Off-line Learning.
One important aspect in the learning task we consider in this book is the distinction
between online and off-line learning. The difference between these two types is
influenced by factors such as whether one wants to control a real-world entity – such
as a robot playing robot soccer or a machine in a factory – or whether all necessary
information is available. Online learning performs learning directly on the problem
instance. Off-line learning uses a simulator of the environment as a cheap way to
get many training examples for safe and fast learning.
Learning the controller directly on the real task is often not possible. For ex-
ample, the learning algorithms in this chapter sometimes need millions of training
instances which can be too time-consuming to collect. Instead, a simulator is much
faster, and in addition it can be used to provide arbitrary training situations, includ-
ing situations that rarely happen in the real system. Furthermore, it provides a "safe"
training situation in which the agent can explore and make mistakes. Obtaining neg-
ative feedback in the real task in order to learn to avoid these situations, might entail
destroying the machine that is controlled, which is unacceptable. Often one uses a
simulation to obtain a reasonable policy for a given problem, after which some parts
of the behavior are fine-tuned on the real task. For example, a simulation might pro-
vide the means for learning a reasonable robot controller, but some physical factors
concerning variance in motor and perception systems of the robot might make addi-
tional fine-tuning necessary. A simulation is just a model of the real problem, such
that small differences between the two are natural, and learning might make up for
that difference. Many problems in the literature, however, are simulations of games
and optimization problems, such that the distinction disappears.
Credit Assignment.
An important aspect of sequential decision making is the fact that whether an action is "good" or "bad" cannot be decided upon right away. The appropriateness of actions is completely determined by the goal the agent is trying to pursue.
The real problem is that the effect of actions with respect to the goal can be much
delayed. For example, the opening moves in chess have a large influence on win-
ning the game. However, between the first opening moves and receiving a reward
for winning the game, a couple of tens of moves might have been played. Decid-
ing how to give credit to the first moves – which did not get the immediate reward
for winning – is a difficult problem called the temporal credit assignment problem.
Each move in a winning chess game contributes more or less to the success of the
last move, although some moves along this path can be less optimal or even bad. A
related problem is the structural credit assignment problem, in which the problem is
to distribute feedback over the structure representing the agent’s policy. For exam-
ple, the policy can be represented by a structure containing parameters (e.g. a neural
network). Deciding which parameters have to be updated forms the structural credit
assignment problem.
The Exploration-Exploitation Trade-off.
If we know a complete model of the dynamics of the problem, there exist methods (e.g. DP) that can compute optimal policies from this model. However, in the more general case where we do not have access to this knowledge (e.g. RL), it becomes necessary to interact with the environment to learn a correct policy by trial-and-error. The agent has to explore the environment by performing actions and perceiving their consequences (i.e. the effects on the environment and the obtained rewards). The only feedback the agent gets consists of rewards; it does not get told what the right action would have been. At some point in time, it will have a policy with a particular performance. In order to see whether there are possible improvements to this policy, it sometimes has to try out various actions to see their results. This might result in worse performance, because these actions might be worse than the current policy. However, without trying them, it might never find possible improvements. In addition, if the world is not stationary, the agent has to explore to keep its policy up-to-date. So, in order to learn it has to explore, but in order to perform well it should exploit what it already knows. Balancing these two things is called the exploration-exploitation problem.
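A common way to balance the two is ε-greedy action selection: with a small probability ε the agent tries a random action (exploration), and otherwise it picks the action with the highest current value estimate (exploitation). The following minimal Python sketch illustrates the idea; the Q-value table and the list of applicable actions are hypothetical placeholders, not objects defined in the text above.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else a greedy one.

    Q is assumed to be a dict mapping (state, action) pairs to value
    estimates; `actions` is the list of actions applicable in `state`.
    """
    if random.random() < epsilon:
        return random.choice(actions)                                # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))        # exploit
```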
Feedback, Goals and Performance.
Compared to supervised learning, the amount of feedback the learning system gets in RL is much less. In supervised learning, for every learning sample the correct output is given in a training set. The performance of the learning system can be measured relative to the number of correct answers, resulting in a predictive accuracy. The difficulty lies in learning this mapping, and in whether this mapping generalizes to new, unclassified, examples. In unsupervised learning, the difficulty lies in constructing a useful partitioning of the data such that classes naturally arise. In reinforcement learning there is only some information available about performance, in the form of one scalar signal. This feedback is evaluative rather than instructive. Using this limited signal for feedback renders a need to put more effort into using it to evaluate and improve behavior during learning.
A second aspect about feedback and performance is related to the stochastic nature of the problem formulation. In supervised and unsupervised learning, the data is usually considered static, i.e. a data set is given and performance can be measured with respect to this data. The learning samples for the learner originate from a fixed distribution, i.e. the data set. From a RL perspective, the data can be seen as a moving target. The learning process is driven by the current policy, but this policy will change over time. That means that the distribution over states and rewards will change because of this. In machine learning the problem of a changing distribution of learning samples is termed concept drift [Maloof(2003)] and it demands special features to deal with it. In RL this problem is dealt with by exploration, a constant interaction between evaluation and improvement of policies, and additionally the use of learning rate adaptation schemes.
A third aspect of feedback is the question "where do the numbers come from?". In many sequential decision tasks, suitable reward functions present themselves quite naturally. For games in which there are winning, losing and draw situations, the reward function is easy to specify. In some situations special care has to be taken in giving rewards for states or actions, and their relative size is also important. When the agent will encounter a large negative reward before it finally gets a small positive reward, this positive reward might get overshadowed. All problems posed will have some optimal policy, but whether that policy tackles the right problem depends on whether the reward function is in accordance with the right goals. In some problems it can be useful to provide the agent with rewards for reaching intermediate subgoals. This can be helpful in problems which require very long action sequences.
Representations.
One of the most important aspects in learning sequential decision making is repre-
sentation. Two central issues are what should be represented, and how things should
be represented. The first issue is dealt with in this chapter. Key components that can
or should be represented are models of the dynamics of the environment, reward dis-
tributions, value functions and policies. For some algorithms all components are ex-
plicitly stored in tables, for example in classic DP algorithms. Actor-critic methods
keep separate, explicit representations of both value functions and policies. How-
ever, in most RL algorithms just a value function is represented whereas policy
decisions are derived from this value function online. Methods that search in policy
space do not represent value functions explicitly, but instead an explicitly repre-
sented policy is used to compute values when necessary. Overall, the choice for not
representing certain elements can influence the choice for a type of algorithm, and
its efficiency.
The question of how various structures can be represented is dealt with exten-
sively in this book, starting from the next chapter. Structures such as policies, tran-
sition functions and value functions can be represented in more compact form by us-
ing various structured knowledge representation formalisms and this enables much
more efficient solution mechanisms and scaling up to larger domains.
3 A Formal Framework
The elements of the RL problem as described in the introduction to this chapter can be formalized using the Markov decision process (MDP) framework. In this section we will formally describe components such as states, actions and policies, as well as the goals of learning using different kinds of optimality criteria. MDPs are extensively described in [Puterman(1994)] and [Boutilier et al(1999)]. They can be seen as stochastic extensions of finite automata and also as Markov processes augmented with actions.
Although general MDPs may have infinite (even uncountable) state and action spaces, we limit the discussion to finite-state and finite-action problems. In the next chapter we will encounter continuous spaces and in later chapters we will encounter situations arising in the first-order logic setting in which infinite spaces can quite naturally occur.
3.1 Markov Decision Processes.
MDPs consist of states, actions, transitions between states and a reward function
definition. We consider each of them in turn.
States.
The set of environmental states $S$ is defined as the finite set $\{s_1, \ldots, s_N\}$ where the size of the state space is $N$, i.e. $|S| = N$. A state is a unique characterization of all that is important in a state of the problem that is modeled. For example, in chess a complete configuration of board pieces of both black and white is a state. In the next chapter we will encounter the use of features that describe the state. In those contexts, it becomes necessary to distinguish between legal and illegal states, for some combinations of features might not result in an actually existing state in the problem. In this chapter, we will confine ourselves to the discrete state set $S$ in which each state is represented by a distinct symbol, and all states $s \in S$ are legal.
Actions.
The set of actions $A$ is defined as the finite set $\{a_1, \ldots, a_K\}$ where the size of the action space is $K$, i.e. $|A| = K$. Actions can be used to control the system state. The set of actions that can be applied in some particular state $s \in S$ is denoted $A(s)$, where $A(s) \subseteq A$. In some systems, not all actions can be applied in every state, but in general we will assume that $A(s) = A$ for all $s \in S$. In more structured representations (e.g. by means of features), the fact that some actions are not applicable in some states is modeled by a precondition function $pre: S \times A \rightarrow \{true, false\}$, stating whether action $a \in A$ is applicable in state $s \in S$.
The Transition Function.
By applying action $a \in A$ in a state $s \in S$, the system makes a transition from $s$ to a new state $s' \in S$, based on a probability distribution over the set of possible transitions. The transition function $T$ is defined as $T: S \times A \times S \rightarrow [0,1]$, i.e. the probability of ending up in state $s'$ after doing action $a$ in state $s$ is denoted $T(s,a,s')$. It is required that for all actions $a$, and all states $s$ and $s'$, $T(s,a,s') \geq 0$ and $T(s,a,s') \leq 1$. Furthermore, for all states $s$ and actions $a$, $\sum_{s' \in S} T(s,a,s') = 1$, i.e. $T$ defines a proper probability distribution over possible next states. Instead of a precondition function, it is also possible to set$^1$ $T(s,a,s') = 0$ for all states $s' \in S$ if $a$ is not applicable in $s$. For talking about the order in which actions occur, we will define a discrete global clock, $t = 1, 2, \ldots$. Using this, the notation $s_t$ denotes the state at time $t$ and $s_{t+1}$ denotes the state at time $t+1$. This enables us to compare different states (and actions) occurring ordered in time during interaction.
The system being controlled is Markovian if the result of an action does not depend on the previous actions and visited states (history), but only depends on the current state, i.e.
$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t) = T(s_t, a_t, s_{t+1})$$
The idea of Markovian dynamics is that the current state $s$ gives enough information to make an optimal decision; it is not important which states and actions preceded $s$. Another way of saying this is that if you select an action $a$, the probability distribution over next states is the same as the last time you tried this action in the same state. More general models can be characterized by being $k$-Markov, i.e. the last $k$ states are sufficient, such that Markov is actually 1-Markov. However, each $k$-Markov problem can be transformed into an equivalent Markov problem. The Markov property forms a boundary between the MDP and more general models such as POMDPs.
$^1$ Although this is the same, the explicit distinction between an action not being applicable in a state and a zero probability for transitions with that action is lost in this way.
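The usual transformation from a $k$-Markov problem to a 1-Markov problem is to augment the state with the last $k$ observations, so that the augmented state again contains all information needed for prediction. The short Python sketch below illustrates this idea; the helper name and the example observations are illustrative assumptions, not part of the text above.

```python
from collections import deque

def make_k_markov(k):
    """Return a function that maps a stream of observations to augmented
    states consisting of the last k observations: a 1-Markov description
    of a k-Markov process."""
    history = deque(maxlen=k)

    def augment(observation):
        history.append(observation)
        return tuple(history)

    return augment

# Hypothetical usage: feed observations one by one and use the returned
# tuples as states for any of the algorithms in this chapter.
augment = make_k_markov(k=3)
state = augment("obs-1")   # ('obs-1',)
state = augment("obs-2")   # ('obs-1', 'obs-2')
state = augment("obs-3")   # ('obs-1', 'obs-2', 'obs-3')
```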
The Reward Function.
The reward function$^2$ specifies rewards for being in a state, or doing some action in a state. The state reward function is defined as $R: S \rightarrow \mathbb{R}$, and it specifies the reward obtained in states. However, two other definitions exist. One can define either $R: S \times A \rightarrow \mathbb{R}$ or $R: S \times A \times S \rightarrow \mathbb{R}$. The first one gives rewards for performing an action in a state, and the second gives rewards for particular transitions between states. All definitions are interchangeable, though the last one is convenient in model-free algorithms (see Section 7), because there we usually need both the starting state and the resulting state in backing up values. Throughout this book we will mainly use $R(s,a,s')$, but deviate from this when more convenient.
The reward function is an important part of the MDP that specifies implicitly the goal of learning. For example, in episodic tasks such as the games Tic-Tac-Toe and chess, one can assign a positive reward value to all states in which the agent has won, a negative reward value to all states in which the agent loses, and a zero reward value to all states where the final outcome of the game is a draw. The goal of the agent is to reach positive valued states, which means winning the game. Thus, the reward function is used to give direction in which way the system, i.e. the MDP, should be controlled. Often, the reward function assigns non-zero reward to non-goal states as well, which can be interpreted as defining sub-goals for learning.
The Markov Decision Process.
Putting all elements together results in the definition of a Markov decision process, which will be the base model for the large majority of methods described in this book.

Definition 3.1 A Markov decision process is a tuple $\langle S, A, T, R \rangle$ in which $S$ is a finite set of states, $A$ a finite set of actions, $T$ a transition function defined as $T: S \times A \times S \rightarrow [0,1]$, and $R$ a reward function defined as $R: S \times A \times S \rightarrow \mathbb{R}$.
The transition function $T$ and the reward function $R$ together define the model of the MDP. MDPs are often depicted as a state transition graph, where the nodes correspond to states and (directed) edges denote transitions. A typical domain that is frequently used in the MDP literature is the maze [Matthews(1922)], in which the reward function assigns a positive reward for reaching the exit state.
$^2$ Although we talk about rewards here, with the usual connotation of something positive, the reward function merely gives a scalar feedback signal. This can be interpreted as negative (punishment) or positive (reward). The various origins of work on MDPs in the literature create additional confusion around the reward function. In the operations research literature, one usually speaks of a cost function instead, and the goal of learning and optimization is to minimize this function.
There are several distinct types of systems that can be modelled by this definition of an MDP. In episodic tasks, there is the notion of episodes of some length, where
the goal is to take the agent from a starting state to a goal state. An initial state distribution $I: S \rightarrow [0,1]$ gives for each state the probability of the system being started in that state. Starting from a state $s$, the system progresses through a sequence of states, based on the actions performed. In episodic tasks, there is a specific subset $G \subseteq S$, denoted the goal state area, containing states (usually with some distinct reward) where the process ends. We can furthermore distinguish between finite, fixed horizon tasks in which each episode consists of a fixed number of steps, indefinite horizon tasks in which each episode can end but episodes can have arbitrary length, and infinite horizon tasks where the system does not end at all. The last type of model is usually called a continuing task.
Episodic tasks, i.e. tasks in which there are so-called goal states, can be modelled using the same model defined in Definition 3.1. This is usually done by means of absorbing states or terminal states, i.e. states from which every action results in a transition to that same state with probability 1 and reward 0. Formally, for an absorbing state $s$, it holds that $T(s,a,s) = 1$ and $R(s,a,s') = 0$ for all states $s' \in S$ and actions $a \in A$. When entering an absorbing state, the process is reset and restarts in a new starting state. Episodic tasks and absorbing states can in this way be elegantly modelled in the same framework as continuing tasks.
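To make the formal definition concrete, a finite MDP can be stored directly as tables, which is also how the DP algorithms later in this chapter assume it is represented. The following Python sketch is one possible encoding; the two-state example values are made up for illustration and are not taken from the text.

```python
from typing import Dict, List, Tuple

State, Action = str, str

class MDP:
    """A finite MDP <S, A, T, R> stored as explicit tables."""
    def __init__(self,
                 states: List[State],
                 actions: List[Action],
                 T: Dict[Tuple[State, Action, State], float],
                 R: Dict[Tuple[State, Action, State], float]):
        self.states, self.actions, self.T, self.R = states, actions, T, R

    def transition(self, s: State, a: Action, s2: State) -> float:
        return self.T.get((s, a, s2), 0.0)   # unlisted transitions have probability 0

    def reward(self, s: State, a: Action, s2: State) -> float:
        return self.R.get((s, a, s2), 0.0)

# A tiny two-state example (illustrative numbers only):
mdp = MDP(states=["s1", "s2"], actions=["a1", "a2"],
          T={("s1", "a1", "s1"): 0.2, ("s1", "a1", "s2"): 0.8,
             ("s1", "a2", "s1"): 1.0,
             ("s2", "a1", "s2"): 1.0, ("s2", "a2", "s1"): 1.0},
          R={("s1", "a1", "s2"): 1.0})
```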
3.2 Policies
Given an MDP $\langle S, A, T, R \rangle$, a policy is a computable function that outputs for each state $s \in S$ an action $a \in A$ (or $a \in A(s)$). Formally, a deterministic policy $\pi$ is a function defined as $\pi: S \rightarrow A$. It is also possible to define a stochastic policy as $\pi: S \times A \rightarrow [0,1]$ such that for each state $s \in S$, it holds that $\pi(s,a) \geq 0$ and $\sum_{a \in A} \pi(s,a) = 1$. We will assume deterministic policies in this book unless stated otherwise.
Application of a policy to an MDP is done in the following way. First, a start state $s_0$ from the initial state distribution $I$ is generated. Then, the policy $\pi$ suggests the action $a_0 = \pi(s_0)$ and this action is performed. Based on the transition function $T$ and reward function $R$, a transition is made to state $s_1$, with probability $T(s_0, a_0, s_1)$, and a reward $r_0 = R(s_0, a_0, s_1)$ is received. This process continues, producing $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, \ldots$. If the task is episodic, the process ends in a state $s_{goal}$ and is restarted in a new state drawn from $I$. If the task is continuing, the sequence of states can be extended indefinitely.
The policy is part of the agent and its aim is to control the environment modeled as an MDP. A fixed policy induces a stationary transition distribution over the MDP which can be transformed into a Markov system$^3$ $\langle S', T' \rangle$ where $S' = S$ and $T'(s,s') = T(s,a,s')$ whenever $\pi(s) = a$.
$^3$ In other words, if $\pi$ is fixed, the system behaves as a stochastic transition system with a stationary distribution over states.
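The interaction just described, applying a fixed policy to an MDP and recording the resulting states and rewards, can be written as a short simulation loop. The sketch below samples one trajectory from the tabular MDP class shown earlier; the episode length cutoff and the helper for sampling a successor state are illustrative choices, not part of the formal framework.

```python
import random

def sample_next_state(mdp, s, a):
    """Sample s' according to the transition probabilities T(s, a, .)."""
    states = mdp.states
    probs = [mdp.transition(s, a, s2) for s2 in states]
    return random.choices(states, weights=probs, k=1)[0]

def rollout(mdp, policy, s0, max_steps=100):
    """Follow a deterministic policy (a dict state -> action) from s0."""
    trajectory, s = [], s0
    for _ in range(max_steps):
        a = policy[s]
        s2 = sample_next_state(mdp, s, a)
        r = mdp.reward(s, a, s2)
        trajectory.append((s, a, r))
        s = s2
    return trajectory
```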
3.3 Optimality Criteria and Discounting
In the previous sections, we have defined the environment (the MDP) and the agent (i.e. the controlling element, or policy). Before we can talk about algorithms for computing optimal policies, we have to define what that means. That is, we have to define what the model of optimality is. There are two ways of looking at optimality. First, there is the aspect of what is actually being optimized, i.e. what is the goal of the agent? Second, there is the aspect of how optimal the manner is in which that goal is being optimized. The first aspect is related to gathering reward and is treated in this section. The second aspect is related to the efficiency and optimality of algorithms, and this is briefly touched upon and dealt with more extensively in Section 5 and further.
The goal of learning in an MDP is to gather rewards. If the agent was only concerned about the immediate reward, a simple optimality criterion would be to optimize $E[r_t]$. However, there are several ways of taking the future into account in deciding how to behave now. There are basically three models of optimality in the MDP, which are sufficient to cover most of the approaches in the literature. They are strongly related to the types of tasks that were defined in Section 3.1.
$$\text{a)} \;\; E\left[\sum_{t=0}^{h} r_t\right] \qquad \text{b)} \;\; E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] \qquad \text{c)} \;\; \lim_{h \to \infty} E\left[\frac{1}{h}\sum_{t=0}^{h} r_t\right]$$
Fig. 2 Optimality: a) finite horizon, b) discounted, infinite horizon, c) average reward.
The finite horizon model simply takes a finite horizon of length $h$ and states that the agent should optimize its expected reward over this horizon, i.e. the next $h$ steps (see Figure 2a). One can think of this in two ways. The agent could in the first step take the $h$-step optimal action, after this the $(h-1)$-step optimal action, and so on. Another way is that the agent will always take the $h$-step optimal action, which is called receding-horizon control. The problem with this model, however, is that the (optimal) choice for the horizon length $h$ is not always known.
In the infinite-horizon model, the long-run reward is taken into account, but the rewards that are received in the future are discounted according to how far away in time they will be received. A discount factor $\gamma$, with $0 \leq \gamma < 1$, is used for this (see Figure 2b). Note that in this discounted case, rewards obtained later are discounted more than rewards obtained earlier. Additionally, the discount factor ensures that – even with an infinite horizon – the sum of the rewards obtained is finite. In episodic tasks, i.e. in tasks where the horizon is finite, the discount factor is not needed or can equivalently be set to 1. If $\gamma = 0$ the agent is said to be myopic, which means that it is only concerned about immediate rewards. The discount factor can be interpreted in several ways: as an interest rate, a probability of living another step, or a mathematical trick for bounding the infinite sum. The discounted, infinite-horizon model is mathematically more convenient, but conceptually similar to the finite horizon model. Most algorithms in this book use this model of optimality.
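The claim that the discounted sum is finite follows from a standard geometric series argument. Assuming rewards are bounded in absolute value by some constant $r_{\max}$ (an assumption not stated explicitly above), one obtains:
$$\left|\sum_{t=0}^{\infty} \gamma^t r_t\right| \;\leq\; \sum_{t=0}^{\infty} \gamma^t \, r_{\max} \;=\; \frac{r_{\max}}{1-\gamma}, \qquad 0 \leq \gamma < 1.$$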
A third optimality model is the average-reward model, maximizing the long-run average reward (see Figure 2c). Sometimes this is called the gain optimal policy, and in the limit, as the discount factor approaches 1, it is equal to the infinite-horizon discounted model. A difficult problem with this criterion is that we cannot distinguish between two policies of which one receives a lot of reward in the initial phases and another one which does not. This initial difference in reward is hidden in the long-run average. This problem can be solved by using a bias optimal model in which the long-run average is still being optimized, but policies are preferred if they additionally get extra reward initially. See [Mahadevan(1996)] for a survey on average reward RL.
Choosing between these optimality criteria can be related to the learning problem. If the length of the episode is known, the finite-horizon model is best. However, when this is not known, or when the task is continuing, the infinite-horizon model is more suitable. [Koenig and Liu(2002)] gives an extensive overview of different ways of modeling MDPs and their relationship with optimality.
The second kind of optimality in this section is related to the more general aspect
of the optimality of the learning process itself. We will encounter various concepts
in the remainder of this book. We will briefly summarize three important notions
here.
Learning optimality can be explained in terms of what the end result of learning
might be. A first concern is whether the agent is able to obtain optimal performance
in principle. For some algorithms there are proofs stating this, but for some not. In
other words, is there a way to ensure that the learning process will reach a global
optimum, or merely a local optimum, or even an oscillation between performances?
A second kind of optimality is related to the speed of converging to a solution. We
can distinguish between two learning methods by looking at how many interactions
are needed, or how much computation is needed per interaction. And related to that,
what will the performance be after a certain period of time? In supervised learning
the optimality criterion is often defined in terms of predictive accuracy which is
different from optimality in the MDP setting. Also, it is important to look at how
much experimentation is necessary, or even allowed, for reaching optimal behavior.
For example, a learning robot or helicopter might not be allowed to make many
mistakes during learning. A last kind of optimality is related to how much reward is
not obtained by the learned policy, as compared to an optimal one. This is usually
called the regret of a policy.
4 Value Functions and Bellman Equations
In the preceding sections we have defined MDPs and optimality criteria that can be
useful for learning optimal policies. In this section we define value functions, which
are a way to link the optimality criteria to policies. Most learning algorithms for
MDPs compute optimal policies by learning value functions. A value function represents an estimate of how good it is for the agent to be in a certain state (or how good it is to perform a certain action in that state). The notion of how good is expressed in terms of an optimality criterion, i.e. in terms of the expected return. Value functions are defined for particular policies.
The value of a state $s$ under policy $\pi$, denoted $V^{\pi}(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. We will use the infinite-horizon, discounted model in this section, such that this can be expressed$^4$ as:
$$V^{\pi}(s) = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\Big|\; s_t = s \right\} \qquad (1)$$
A similar state-action value function $Q: S \times A \rightarrow \mathbb{R}$ can be defined as the expected return starting from state $s$, taking action $a$, and thereafter following policy $\pi$:
$$Q^{\pi}(s,a) = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\Big|\; s_t = s, a_t = a \right\}$$
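These expectations can be approximated directly from sampled interaction. A minimal sketch, assuming trajectories are given as lists of (state, action, reward) tuples such as those produced by the rollout sketch earlier, computes the discounted return from a given time step and a simple first-visit Monte Carlo estimate of $V^{\pi}(s)$:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^k * r_{t+k} over a finite (truncated) reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def mc_value_estimate(trajectories, state, gamma):
    """First-visit Monte Carlo estimate of V^pi(state) from sampled trajectories."""
    returns = []
    for traj in trajectories:
        for t, (s, a, r) in enumerate(traj):
            if s == state:                       # first visit of `state`
                rewards = [step[2] for step in traj[t:]]
                returns.append(discounted_return(rewards, gamma))
                break
    return sum(returns) / len(returns) if returns else 0.0
```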
One fundamental property of value functions is that they satisfy certain recursive properties. For any policy $\pi$ and any state $s$ the expression in Equation 1 can recursively be defined in terms of a so-called Bellman Equation [Bellman(1957)]:
$$V^{\pi}(s) = E_{\pi}\left\{ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots \;\Big|\; s_t = s \right\}
= E_{\pi}\left\{ r_t + \gamma V^{\pi}(s_{t+1}) \;\Big|\; s_t = s \right\}
= \sum_{s'} T(s, \pi(s), s') \left( R(s, \pi(s), s') + \gamma V^{\pi}(s') \right) \qquad (2)$$
It denotes that the expected value of a state is defined in terms of the immediate reward and the values of possible next states, weighted by their transition probabilities, and additionally a discount factor. $V^{\pi}$ is the unique solution to this set of equations. Note that multiple policies can have the same value function, but for a given policy $\pi$, $V^{\pi}$ is unique.
The goal for any given MDP is to find a best policy, i.e. the policy that receives the most reward. This means maximizing the value function of Equation 1 for all states $s \in S$. An optimal policy, denoted $\pi^*$, is such that $V^{\pi^*}(s) \geq V^{\pi}(s)$ for all $s \in S$ and all policies $\pi$. It can be proven that the optimal solution $V^* = V^{\pi^*}$ satisfies the following equation:
$$V^*(s) = \max_{a \in A} \sum_{s' \in S} T(s,a,s') \left( R(s,a,s') + \gamma V^*(s') \right) \qquad (3)$$
$^4$ Note that we use $E_{\pi}$ for the expected value under policy $\pi$.
This expression is called the Bellman optimality equation. It states that the value of a state under an optimal policy must be equal to the expected return for the best action in that state. To select an optimal action given the optimal state value function $V^*$, the following rule can be applied:
$$\pi^*(s) = \arg\max_{a} \sum_{s' \in S} T(s,a,s') \left( R(s,a,s') + \gamma V^*(s') \right) \qquad (4)$$
We call this policy the greedy policy, denoted $\pi_{greedy}(V)$, because it greedily selects the best action using the value function $V$. An analogous optimal state-action value function is:
$$Q^*(s,a) = \sum_{s'} T(s,a,s') \left( R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right)$$
Q-functions are useful because they make the weighted summation over different alternatives (such as in Equation 4) using the transition function unnecessary. No forward-reasoning step is needed to compute an optimal action in a state. This is the reason that in model-free approaches, i.e. in case $T$ and $R$ are unknown, Q-functions are learned instead of V-functions. The relation between $Q^*$ and $V^*$ is given by
$$Q^*(s,a) = \sum_{s' \in S} T(s,a,s') \left( R(s,a,s') + \gamma V^*(s') \right) \qquad (5)$$
$$V^*(s) = \max_{a} Q^*(s,a) \qquad (6)$$
Now, analogously to Equation 4, optimal action selection can be simply put as:
$$\pi^*(s) = \arg\max_{a} Q^*(s,a) \qquad (7)$$
That is, the best action is the action that has the highest expected utility based on the possible next states resulting from taking that action. One can, analogously to the expression in Equation 4, define a greedy policy $\pi_{greedy}(Q)$ based on $Q$. In contrast to $\pi_{greedy}(V)$, there is no need to consult the model of the MDP; the Q-function suffices.
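The practical difference between Equations 4 and 7 shows up directly in code: acting greedily with respect to $V$ requires the transition and reward model for a one-step lookahead, whereas acting greedily with respect to $Q$ does not. A sketch, reusing the tabular MDP class introduced earlier (the Q-table representation is again an illustrative assumption):

```python
def greedy_from_V(mdp, V, s, gamma):
    """Equation 4: one-step lookahead through the model T and R."""
    def backup(a):
        return sum(mdp.transition(s, a, s2) *
                   (mdp.reward(s, a, s2) + gamma * V[s2])
                   for s2 in mdp.states)
    return max(mdp.actions, key=backup)

def greedy_from_Q(Q, s, actions):
    """Equation 7: no model needed, just the Q-table."""
    return max(actions, key=lambda a: Q[(s, a)])
```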
5 Solving Markov Decision Processes
Now that we have defined MDPs, policies, optimality criteria and value functions, it is time to consider the question of how to compute optimal policies. Solving a given MDP means computing an optimal policy $\pi^*$. Several dimensions exist along which algorithms have been developed for this purpose. The most important distinction is that between model-based and model-free algorithms.
Model-based algorithms exist under the general name of DP. The basic assumption in these algorithms is that a model of the MDP is known beforehand, and can be used to compute value functions and policies using the Bellman equation (see Equation 3). Most methods are aimed at computing state value functions, which can, in the presence of the model, be used for optimal action selection. In this chapter we will focus on iterative procedures for computing value functions and policies.
Model-free algorithms, under the general name of RL, do not rely on the availability of a perfect model. Instead, they rely on interaction with the environment, i.e. a simulation of the policy, thereby generating samples of state transitions and rewards. These samples are then used to estimate state-action value functions. Because a model of the MDP is not known, the agent has to explore the MDP to obtain information. This naturally induces an exploration-exploitation trade-off which has to be balanced to obtain an optimal policy.
A very important underlying mechanism, the so-called generalized policy iteration (GPI) principle, present in all methods, is depicted in Figure 3. This principle consists of two interacting processes. The policy evaluation step estimates the utility of the current policy $\pi$, that is, it computes $V^{\pi}$. There are several ways of computing this. In model-based algorithms, one can use the model to compute it directly or iteratively approximate it. In model-free algorithms, one can simulate the policy and estimate its utility from the sampled execution traces. The main purpose of this step is to gather information about the policy for computing the second step, the policy improvement step. In this step, the values of the actions are evaluated for every state, in order to find possible improvements, i.e. possible other actions in particular states that are better than the action the current policy proposes. This step computes an improved policy $\pi'$ from the current policy $\pi$ using the information in $V^{\pi}$. Both the evaluation and the improvement steps can be implemented in various ways, and interleaved in several distinct ways. The bottom line is that there is a policy that drives value learning, i.e. it determines the value function, but in turn there is a value function that can be used by the policy to select good actions.
Note that it is also possible to have an implicit representation of the policy, which means that only the value function is stored, and a policy is computed on-the-fly for each state based on the value function when needed. This is common practice in model-free algorithms (see Section 7). And vice versa, it is also possible to have implicit representations of value functions in the context of an explicit policy representation. Another interesting aspect is that, in general, a value function does not have to be perfectly accurate. In many cases it suffices that sufficient distinction is present between suboptimal and optimal actions, such that small errors in values do not influence policy optimality. This is also important in the approximation and abstraction methods discussed in the next chapter.
Planning as a RL Problem.
The MDP formalism is a general formalism for decision-theoretic planning, which entails that standard (deterministic) planning problems can be formalized as such too. All the algorithms in this chapter can – in principle – be used for these planning problems too. In order to solve planning problems in the MDP framework we
Fig. 3 a) The algorithms in Section 5 can be seen as instantiations of Generalized Policy Iteration (GPI) [Sutton and Barto(1998)]. The policy evaluation step estimates $V^{\pi}$, the policy's performance. The policy improvement step improves the policy $\pi$ based on the estimates in $V^{\pi}$. b) The gradual convergence of both the value function and the policy to optimal versions.
have to specify goals and rewards. We can assume that the transition function $T$ is given, accompanied by a precondition function. In planning we are given a goal function $G: S \rightarrow \{true, false\}$ that defines which states are goal states. The planning task is to compute a sequence of actions $a_t, a_{t+1}, \ldots, a_{t+n}$ such that applying this sequence from a start state will lead to a state $s \in G$. All transitions are assumed to be deterministic, i.e. for all states $s \in S$ and actions $a \in A$ there exists only one state $s' \in S$ such that $T(s,a,s') = 1$. All states in $G$ are assumed to be absorbing. The only thing left is to specify the reward function. We can specify this in such a way that a positive reinforcement is received once a goal state is reached, and zero otherwise:
$$R(s_t, a_t, s_{t+1}) = \begin{cases} 1, & \text{if } s_t \notin G \text{ and } s_{t+1} \in G \\ 0, & \text{otherwise} \end{cases}$$
Now, depending on whether the transition function and reward function are known
to the agent, one can solve this planning task with either model-based or model-free
learning. The difference with classic planning is that the learned policy will apply
to all states.
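As a small illustration, the goal-based reward function above is straightforward to express in code; the goal test `is_goal` is a hypothetical predicate standing in for the goal function $G$:

```python
def planning_reward(s, a, s_next, is_goal):
    """Reward 1 exactly on the transition that first enters the goal set G."""
    return 1.0 if (not is_goal(s)) and is_goal(s_next) else 0.0
```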
6 Dynamic Programming: Model-based Solution Techniques
The term DP refers to a class of algorithms that is able to compute optimal policies in the presence of a perfect model of the environment. The assumption that a model is available will be hard to ensure for many applications. However, we will see that from a theoretical viewpoint, as well as from an algorithmic viewpoint, DP methods are very relevant because they define fundamental computational mechanisms which are also used when no model is available. The methods in this section all assume a standard MDP $\langle S, A, T, R \rangle$, where the state and action sets are finite and discrete such that they can be stored in tables. Furthermore, transition, reward and value functions are assumed to store values for all states and actions separately.
6.1 Fundamental DP Algorithms
Two core DP methods are policy iteration [Howard(1960)] and value iteration [Bellman(1957)]. In the first, the GPI mechanism is clearly separated into two steps, whereas the second represents a tight integration of policy evaluation and improvement. We will consider both of these algorithms in turn.
6.1.1 Policy Iteration
Policy iteration (PI) [Howard(1960)] iterates between the two phases of GPI. The policy evaluation phase computes the value function of the current policy and the policy improvement phase computes an improved policy by a maximization over the value function. This is repeated until the process converges to an optimal policy.
Policy Evaluation: The Prediction Problem.
A first step is to find the value function $V^{\pi}$ of a fixed policy $\pi$. This is called the prediction problem. It is a part of the complete problem, that of computing an optimal policy. Remember from the previous sections that for all $s \in S$,
$$V^{\pi}(s) = \sum_{s' \in S} T(s, \pi(s), s') \left( R(s, \pi(s), s') + \gamma V^{\pi}(s') \right) \qquad (8)$$
If the dynamics of the system are known, i.e. a model of the MDP is given, then these equations form a system of $|S|$ equations in $|S|$ unknowns (the values of $V^{\pi}$ for each $s \in S$). This can be solved by linear programming (LP). However, an iterative procedure is possible, and in fact common in DP and RL. The Bellman equation is transformed into an update rule which updates the current value function $V^{\pi}_k$ into $V^{\pi}_{k+1}$ by 'looking one step further in the future', thereby extending the planning horizon by one step:
$$V^{\pi}_{k+1}(s) = E_{\pi}\left\{ r_t + \gamma V^{\pi}_k(s_{t+1}) \;\Big|\; s_t = s \right\}
= \sum_{s'} T(s, \pi(s), s') \left( R(s, \pi(s), s') + \gamma V^{\pi}_k(s') \right) \qquad (9)$$
The sequence of approximations $V^{\pi}_k$ can be shown to converge as $k$ goes to infinity. In order to converge, the update rule is applied to each state $s \in S$ in each iteration. It replaces the old value for that state by a new one that is based on the expected value of possible successor states and intermediate rewards, weighted by the transition probabilities. This operation is called a full backup because it is based on all possible transitions from that state.
A more general formulation can be given by defining a backup operator $B^{\pi}$ over arbitrary real-valued functions $\varphi$ over the state space (e.g. a value function):
$$(B^{\pi}\varphi)(s) = \sum_{s' \in S} T(s, \pi(s), s') \left( R(s, \pi(s), s') + \gamma \varphi(s') \right) \qquad (10)$$
The value function $V^{\pi}$ of a fixed policy $\pi$ is a fixed point of this backup operator, i.e. $V^{\pi} = B^{\pi}V^{\pi}$. A useful special case of this backup operator is defined with respect to a fixed action $a$:
$$(B^a\varphi)(s) = R(s) + \gamma \sum_{s' \in S} T(s,a,s') \varphi(s')$$
Now LP for solving the prediction problem can be stated as follows. Computing $V^{\pi}$ can be accomplished by solving the Bellman equations (see Equation 3) for all states. The optimal value function $V^*$ can be found by using an LP solver that computes $V^* = \arg\min_V \sum_s V(s)$ subject to $V(s) \geq (B^a V)(s)$ for all $a$ and $s$.
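For a fixed policy, the system of $|S|$ linear equations can also be solved directly, without iteration, since $V^{\pi} = R^{\pi} + \gamma T^{\pi} V^{\pi}$ implies $V^{\pi} = (I - \gamma T^{\pi})^{-1} R^{\pi}$. A small numpy sketch of this exact policy evaluation, assuming the tabular MDP class from earlier and a deterministic policy given as a dict:

```python
import numpy as np

def evaluate_policy_exact(mdp, policy, gamma):
    """Solve (I - gamma * T_pi) V = R_pi for V, where T_pi and R_pi are the
    transition matrix and expected one-step reward under the fixed policy."""
    n = len(mdp.states)
    idx = {s: i for i, s in enumerate(mdp.states)}
    T_pi = np.zeros((n, n))
    R_pi = np.zeros(n)
    for s in mdp.states:
        a = policy[s]
        for s2 in mdp.states:
            p = mdp.transition(s, a, s2)
            T_pi[idx[s], idx[s2]] = p
            R_pi[idx[s]] += p * mdp.reward(s, a, s2)
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    return {s: V[idx[s]] for s in mdp.states}
```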
Policy Improvement.
Now that we know the value function $V^{\pi}$ of a policy $\pi$ as the outcome of the policy evaluation step, we can try to improve the policy. First we identify the value of all actions by using:
$$Q^{\pi}(s,a) = E_{\pi}\left\{ r_t + \gamma V^{\pi}(s_{t+1}) \;\Big|\; s_t = s, a_t = a \right\} \qquad (11)$$
$$= \sum_{s'} T(s,a,s') \left( R(s,a,s') + \gamma V^{\pi}(s') \right) \qquad (12)$$
If now $Q^{\pi}(s,a)$ is larger than $V^{\pi}(s)$ for some $a \in A$, then we could do better by choosing action $a$ instead of the current $\pi(s)$. In other words, we can improve the current policy by selecting a different, better, action in a particular state. In fact, we can evaluate all actions in all states and choose the best action in all states. That is, we can compute the greedy policy $\pi'$ by selecting the best action in each state, based on the current value function $V^{\pi}$:
$$\pi'(s) = \arg\max_{a} Q^{\pi}(s,a)
= \arg\max_{a} E\left\{ r_t + \gamma V^{\pi}(s_{t+1}) \;\Big|\; s_t = s, a_t = a \right\}
= \arg\max_{a} \sum_{s'} T(s,a,s') \left( R(s,a,s') + \gamma V^{\pi}(s') \right) \qquad (13)$$
Require: $V(s) \in \mathbb{R}$ and $\pi(s) \in A(s)$ arbitrarily for all $s \in S$
{POLICY EVALUATION}
repeat
    $\Delta := 0$
    for each $s \in S$ do
        $v := V(s)$
        $V(s) := \sum_{s'} T(s, \pi(s), s') \left( R(s, \pi(s), s') + \gamma V(s') \right)$
        $\Delta := \max(\Delta, |v - V(s)|)$
until $\Delta < \sigma$
{POLICY IMPROVEMENT}
policy-stable := true
for each $s \in S$ do
    $b := \pi(s)$
    $\pi(s) := \arg\max_a \sum_{s'} T(s,a,s') \left( R(s,a,s') + \gamma V(s') \right)$
    if $b \neq \pi(s)$ then policy-stable := false
if policy-stable then stop; else go to POLICY EVALUATION
Algorithm 1: Policy Iteration [Howard(1960)]
Computing an improved policy by greedily selecting the best action with respect to the value function of the original policy is called policy improvement. If the policy cannot be improved in this way, it means that the policy is already optimal and its value function satisfies the Bellman equation for the optimal value function. In a similar way one can also perform these steps for stochastic policies by blending the action probabilities into the expectation operator.
Summarizing, policy iteration [Howard(1960)] starts with an arbitrarily initialized policy $\pi_0$. Then a sequence of iterations follows in which the current policy is evaluated, after which it is improved. The first step, the policy evaluation step, computes $V^{\pi_k}$, making use of Equation 9 in an iterative way. The second step, the policy improvement step, computes $\pi_{k+1}$ from $\pi_k$ using $V^{\pi_k}$. For each state, using Equation 4, the optimal action is determined. If for all states $s$, $\pi_{k+1}(s) = \pi_k(s)$, the policy is stable and the policy iteration algorithm can stop. Policy iteration generates a sequence of alternating policies and value functions
$$\pi_0 \rightarrow V^{\pi_0} \rightarrow \pi_1 \rightarrow V^{\pi_1} \rightarrow \pi_2 \rightarrow V^{\pi_2} \rightarrow \pi_3 \rightarrow V^{\pi_3} \rightarrow \ldots \rightarrow \pi^*$$
The complete algorithm can be found in Algorithm 1.
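A compact Python rendering of Algorithm 1 is sketched below, again assuming the tabular MDP class introduced earlier; the stopping threshold sigma and the arbitrary initial policy are illustrative choices.

```python
def policy_iteration(mdp, gamma=0.9, sigma=1e-6):
    """Policy iteration: iterative policy evaluation + greedy improvement."""
    V = {s: 0.0 for s in mdp.states}
    policy = {s: mdp.actions[0] for s in mdp.states}   # arbitrary initial policy

    def backup(s, a):
        return sum(mdp.transition(s, a, s2) *
                   (mdp.reward(s, a, s2) + gamma * V[s2])
                   for s2 in mdp.states)

    while True:
        # Policy evaluation: sweep until the value function stops changing.
        while True:
            delta = 0.0
            for s in mdp.states:
                v = V[s]
                V[s] = backup(s, policy[s])
                delta = max(delta, abs(v - V[s]))
            if delta < sigma:
                break
        # Policy improvement: act greedily with respect to V.
        policy_stable = True
        for s in mdp.states:
            b = policy[s]
            policy[s] = max(mdp.actions, key=lambda a: backup(s, a))
            if b != policy[s]:
                policy_stable = False
        if policy_stable:
            return policy, V
```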
For finite MDPs, i.e. when state and action spaces are finite, policy iteration converges after a finite number of iterations. Each policy $\pi_{k+1}$ is a strictly better policy than $\pi_k$, unless $\pi_k = \pi^*$, in which case the algorithm stops. And because for a finite MDP the number of different policies is finite, policy iteration converges in finite time. In practice, it usually converges after a small number of iterations.
Although policy iteration computes the optimal policy for a given MDP in finite time, it is relatively inefficient. In particular the first step, the policy evaluation step, is computationally expensive. Value functions for all intermediate policies $\pi_0, \ldots, \pi_k, \ldots, \pi^*$ are computed, which involves multiple sweeps through the complete state space per iteration. A bound on the number of iterations is not known [Littman et al(1995)] and depends on the MDP transition structure, but in practice policy iteration often converges after few iterations.

Require: initialize $V$ arbitrarily (e.g. $V(s) := 0$ for all $s \in S$)
repeat
    $\Delta := 0$
    for each $s \in S$ do
        $v := V(s)$
        for each $a \in A(s)$ do
            $Q(s,a) := \sum_{s'} T(s,a,s') \left( R(s,a,s') + \gamma V(s') \right)$
        $V(s) := \max_a Q(s,a)$
        $\Delta := \max(\Delta, |v - V(s)|)$
until $\Delta < \sigma$
Algorithm 2: Value Iteration [Bellman(1957)]
6.1.2 Value Iteration
The policy iteration algorithm completely separates the evaluation and improve-
ment phases. In the evaluation step, the value function must be computed in the
limit. However, it is not necessary to wait for full convergence, but it is possible to
stop evaluating earlier and improve the policy based on the evaluation so far. The ex-
treme point of truncating the evaluation step is the value iteration [Bellman(1957)]
algorithm. It breaks off evaluation after just one iteration. In fact, it immediately
blends the policy improvement step into its iterations, thereby purely focusing on
estimating directly the value function. Necessary updates are computed on-the-fly.
In essence, it combines a truncated version of the policy evaluation step with the
policy improvement step, which is essentially Equation 3 turned into one update
rule:
$$V_{t+1}(s) = \max_{a} \sum_{s'} T(s,a,s') \left( R(s,a,s') + \gamma V_t(s') \right) \qquad (14)$$
$$= \max_{a} Q_{t+1}(s,a). \qquad (15)$$
Using Equations (14) and (15), the value iteration algorithm (see Algorithm 2) can be stated as follows: starting with a value function $V_0$ over all states, one iteratively updates the value of each state according to (14) to get the next value functions $V_t$ ($t = 1, 2, 3, \ldots$). It produces the following sequence of value functions:
$$V_0 \rightarrow V_1 \rightarrow V_2 \rightarrow V_3 \rightarrow V_4 \rightarrow V_5 \rightarrow V_6 \rightarrow V_7 \rightarrow \ldots \rightarrow V^*$$
Actually, in the way it is computed, it also produces the intermediate Q-value functions, such that the sequence is
$$V_0 \rightarrow Q_1 \rightarrow V_1 \rightarrow Q_2 \rightarrow V_2 \rightarrow Q_3 \rightarrow V_3 \rightarrow Q_4 \rightarrow V_4 \rightarrow \ldots \rightarrow V^*$$
Value iteration is guaranteed to converge in the limit towards V^*, i.e. the Bellman optimality Equation (3) holds for each state. A deterministic policy π^* for all states s ∈ S can be computed using Equation 4. If we use the same general backup operator mechanism used in the previous section, we can define value iteration in the following way:
(B^* φ)(s) = max_a Σ_{s' ∈ S} T(s,a,s') [ R(s,a,s') + γ φ(s') ]   (16)
The backup operator B^* functions as a contraction mapping on the value function. If we let π^* denote the optimal policy and V^* its value function, we have the fixed-point relationship V^* = B^*V^*, where (B^*V)(s) = max_a (B_a V)(s). If we define Q^*(s,a) = B_a V^*, then we have π^*(s) = π_greedy(V^*)(s) = argmax_a Q^*(s,a). That is, the algorithm starts with an arbitrary value function V_0, after which it iterates V_{t+1} = B^*V_t until ||V_{t+1} − V_t||_S < ε, i.e. until the distance between subsequent value function approximations is small enough.
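As an illustration of Equations (14)-(16) and Algorithm 2, a value iteration loop might look as follows. This is a sketch under the same assumed dictionary representation of the MDP as in the policy iteration sketch above; the epsilon threshold plays the role of σ in Algorithm 2, and the synchronous update corresponds to iterating V_{t+1} = B^*V_t.

def value_iteration(T, gamma=0.95, epsilon=1e-8):
    """Iterate the optimal backup operator until successive value functions are close."""
    V = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        V_new = {}
        for s in T:
            # Q(s,a) = sum_{s'} T(s,a,s') [R(s,a,s') + gamma V(s')]
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a]) for a in T[s]}
            V_new[s] = max(q.values())
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new                      # synchronous (Jacobi-style) update
        if delta < epsilon:
            break
    # extract a deterministic greedy policy from V (Equation 4 style)
    policy = {s: max(T[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a]))
              for s in T}
    return V, policy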
6.2 Efficient DP Algorithms
The policy iteration and value iteration algorithms can be seen as spanning a spec-
trum of DP approaches. This spectrum ranges from complete separation of eval-
uation and improvement steps to a complete integration of these steps. Clearly, in between these extreme points there is much room for variations on algorithms. Let us first
consider the computational complexity of the extreme points.
Complexity.
Value iteration works by producing successive approximations of the optimal value function. Each iteration can be performed in O(|A||S|^2) steps, or faster if T is sparse. However, the number of iterations can grow exponentially in the discount factor, cf. [Bertsekas and Tsitsiklis(1996)]. This follows from the fact that a larger γ implies that a longer sequence of future rewards has to be taken into account, and hence a larger number of value iteration steps, because each step only extends the horizon taken into account in V by one step. The complexity of value iteration is linear in the number of actions and quadratic in the number of states, but usually the transition matrix is sparse. In practice policy iteration converges in far fewer iterations, but each evaluation step is expensive: each iteration has a complexity of O(|A||S|^2 + |S|^3), which can grow large quickly. A worst-case bound on the number of iterations is not known [Littman et al(1995)Littman, Dean, and Kaelbling].
Linear programming is a common tool that can be used for the evaluation
too. In general, the number of iterations and value backups can quickly grow ex-
tremely large when the problem size grows. The state spaces of games such as
backgammon and chess consist of too many states to perform just one full sweep.
In this section we will describe some efficient variations on DP approaches. De-
tailed coverage of complexity results for the solution of MDPs can be found in
[Littman et al(1995)Littman, Dean, and Kaelbling, Bertsekas and Tsitsiklis(1996),
Boutilier et al(1999)Boutilier, Dean, and Hanks].
The efficiency of DP can be improved along roughly two lines. The first is a tighter integration of the evaluation and improvement steps of the GPI process. We will discuss this issue briefly in the next section. The second is the use of (heuristic) search algorithms in combination with DP algorithms. For example, using search as an exploration mechanism can highlight important parts of the state space such that value backups can be concentrated on these parts. This is the underlying mechanism used in the methods discussed briefly in Section 6.2.2.
6.2.1 Styles of Updating
The full backup updates in DP algorithms can be done in several ways. We have
assumed in the description of the algorithms that in each step an old and a new
value function are kept in memory. Each update puts a new value in the new ta-
ble, based on the information of the old. This is called synchronous, or Jacobi-
style updating [Sutton and Barto(1998)]. This is useful for explanation of algo-
rithms and theoretical proofs of convergence. However, there are two more com-
mon ways for updates. One can keep a single table and do the updating directly
in there. This is called in-place updating [Sutton and Barto(1998)] or Gauss-Seidel
[Bertsekas and Tsitsiklis(1996)] and usually speeds up convergence, because dur-
ing one sweep of updates, some updates use already newly updated values of other
states. Another type of updating is called asynchronous updating which is an ex-
tension of the in-place updates, but here updates can be performed in any order.
An advantage is that the updates may be distributed unevenly throughout the state(-
action) space, with more updates being given to more important parts of this space.
For all these methods convergence can be proved under the general condition that
values are updated infinitely often but with a finite frequency.
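The difference between these updating styles is easy to see in code. The sketch below, under the same assumed MDP dictionaries as before, contrasts a synchronous (Jacobi-style) sweep, which reads only the old table, with an in-place (Gauss-Seidel) sweep, which lets later backups in the same sweep see values that were just updated.

def jacobi_sweep(V, T, gamma=0.95):
    """Synchronous sweep: every backup reads the old table, results go to a new one."""
    return {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a]) for a in T[s])
            for s in T}

def gauss_seidel_sweep(V, T, gamma=0.95):
    """In-place sweep: later backups in the same sweep already see the new values."""
    for s in T:
        V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a]) for a in T[s])
    return V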
Modified Policy Iteration.
Modified policy iteration (MPI) [Puterman and Shin(1978)] strikes a middle ground between value and policy iteration. MPI maintains the two separate steps of GPI, but neither step is necessarily computed in the limit. The key insight here is that for policy improvement, one does not need an exactly evaluated policy in order to improve it. For example, the policy estimation step can be approximate, after which a policy improvement step can follow. In general, both steps can be performed quite independently by different means. For example, instead of iteratively applying the Bellman update rule from Equation 15, one can perform the policy estimation step by using a sampling procedure such as Monte Carlo estimation [Sutton and Barto(1998)]. These general schemes, in which estimation and improvement are mixed, are captured by the generalized policy iteration mechanism depicted in Figure 3. Policy iteration and value iteration are the two extreme cases of modified policy iteration, and MPI is a general method for asynchronous updating.
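As a rough sketch of this idea, and reusing the assumed MDP dictionaries from the earlier sketches, modified policy iteration can be written as a loop that performs only m evaluation sweeps before each greedy improvement; small m brings the scheme close to value iteration, while large m approaches policy iteration.

def modified_policy_iteration(T, gamma=0.95, m=5, iterations=100):
    """GPI with a truncated evaluation step: m evaluation sweeps per greedy improvement."""
    V = {s: 0.0 for s in T}
    policy = {s: next(iter(T[s])) for s in T}

    def q(s, a):
        # one-step lookahead value of action a in state s under the current V
        return sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])

    for _ in range(iterations):
        for _ in range(m):                               # truncated (approximate) evaluation
            V = {s: q(s, policy[s]) for s in T}
        policy = {s: max(T[s], key=lambda a: q(s, a)) for s in T}  # greedy improvement
    return policy, V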
6.2.2 Heuristics and Search
In many realistic problems, only a fraction of the state space is relevant to the prob-
lem of reaching the goal state from some state s. This has inspired a number of
algorithms that focus computation on states that seem most relevant for finding an
optimal policy from a start state s. These algorithms usually display good anytime
behavior, i.e. they produce good or reasonable policies fast, after which they are
gradually improved. In addition, they can be seen as implementing various ways of
asynchronous DP.
Envelopes and Fringe States.
One form of asynchronous method is the PLEXUS system
[Dean et al(1995)Dean, Kaelbling, Kirman, and Nicholson]. It was designed
for goal-based reward functions, i.e. episodic tasks in which only goal states get
positive reward. It starts with an approximated version of the MDP in which not the
full state space is contained. This smaller version of the MDP is called an envelope
and it includes the agent’s current state and the goal state. A special OUT state
represents all the states outside the envelope. The initial envelope is constructed
by a forward search until a goal state is found. The envelope can be extended by
considering states outside the envelope that can be reached with high probability.
The intuitive idea is to include in the envelope all the states that are likely to be
reached on the way to the goal. Once the envelope has been constructed a policy is
computed through policy iteration. If at any point the agent leaves the envelope, it
has to replan by extending the envelope. This combination of learning and planning
still uses policy iteration, but on a much smaller (and presumably more relevant
with respect to the goal) state space.
A related method proposed in [Tash and Russell(1994)] considers goal-based
tasks too. However, instead of the single OUT state, they keep a fringe of states
on the edge of the envelope and use a heuristic to estimate values of the other states.
When computing a policy for the envelope, all fringe states become absorbing states
with the heuristic set as their value. Over time the heuristic values of the fringe states
converge to the optimal values of those states.
Similar to the previous methods, the LAO* algorithm [Hansen and Zilberstein(2001)] also alternates between an expansion phase and a policy generation phase. It too keeps a fringe of states outside the envelope, such that expansions can be larger than in the envelope method of [Dean et al(1995)Dean, Kaelbling, Kirman, and Nicholson]. The motivation behind LAO* was to extend the classic search algorithm AO*, cf. [?], to cyclic domains such as MDPs.
Search and Planning in DP.
Real-time dynamic programming (RTDP) [Barto et al(1995)Barto, Bradtke, and Singh] also combines forward search with DP. It is used as an alternative to value iteration in which only a subset of the values in the state space is backed up in each iteration. RTDP performs trials from a randomly selected state to a goal state, by simulating the greedy policy using an admissible heuristic function as the initial value function. It then backs up values fully only along these trials, such that backups are concentrated on the relevant parts of the state space. The approach was later extended into labeled RTDP [Bonet and Geffner(2003b)], where some states are labeled as solved, which means that their value has already converged. Furthermore, it was recently extended to bounded RTDP [Brendan McMahan et al(2005)Brendan McMahan, Likhachev, and Gordon], which keeps lower and upper bounds on the optimal value function. Other recent methods along these lines are focussed DP [Ferguson and Stentz(2004)] and heuristic search-DP [Bonet and Geffner(2003a)].
7 Reinforcement Learning: Model-Free Solution Techniques
The previous section has reviewed several methods for computing an optimal policy
for an MDP assuming that a (perfect) model is available. RL is primarily concerned
with how to obtain an optimal policy when such a model is not available. RL adds to MDPs a focus on approximation and incomplete information, and the need for sampling and exploration. In contrast with the algorithms discussed in the previous section, model-free methods do not rely on the availability of a priori known transition and reward models, i.e. a model of the MDP. The lack of a model generates a
need to sample the MDP to gather statistical knowledge about this unknown model.
Many model-free RL techniques exist that probe the environment by doing actions.
for each episode do
    s ∈ S is initialized as the starting state
    t := 0
    repeat
        choose an action a ∈ A(s)
        perform action a
        observe the new state s' and received reward r
        update T̃, R̃, Q̃ and/or Ṽ using the experience ⟨s, a, r, s'⟩
        s := s'
    until s' is a goal state
Algorithm 3: A general algorithm for online RL
In doing so, they estimate the same kind of state value and state-action value functions as
model-based techniques. This section will review model-free methods along with
several efficient extensions.
In model-free contexts one still has a choice between two options. The first is to learn the transition and reward model from interaction with the environment. After that, when the model is (approximately or sufficiently) correct, all the DP methods from the previous section apply. This type of learning is called indirect RL. The second option, called direct RL, is to step right into estimating values for actions, without even estimating a model of the MDP. Additionally, mixed forms between these two exist too. For example, one can still do model-free estimation of action values, but use an approximated model to speed up value learning, by using this model to perform additional, full backups of values (see Section 7.3). Most model-free methods, however, focus on direct estimation of (action) values.
A second choice one has to make is what to do with temporal credit assignment. It is difficult to assess the utility of some action if its real effects can only be perceived much later. One possibility is to wait until the "end" (e.g. of an episode) and punish or reward specific actions along the path taken. However, this takes a lot of memory, and often, with ongoing tasks, it is not known beforehand whether, or when, there will be an "end". Instead, one
can use similar mechanisms as in value iteration to adjust the estimated value of
a state based on the immediate reward and the estimated (discounted) value of the
next state. This is generally called temporal difference learning which is a general
mechanism underlying the model-free methods in this section. The main difference
with the update rules for DP approaches (such as Equation 14) is that the transition
function T and reward function R cannot appear in the update rules now. The gen-
eral class of algorithms that interact with the environment and update their estimates
after each experience is called online.
A general template for online RL is given in Algorithm 3. It shows an interaction loop in which the agent selects an action (by whatever means) based on its current state, gets feedback in the form of the resulting state and an associated reward, after which it updates its estimated values stored in Ṽ and Q̃, and possibly statistics concerning T̃ and R̃ (in case of some form of indirect learning). The selection of the action is based on the current state s and the value function (either Q or V). To solve the exploration-exploitation problem, usually a separate exploration mechanism ensures that sometimes the best action (according to current estimates of action values) is taken (exploitation) but sometimes a different action is chosen (exploration). Various choices for exploration, ranging from random to sophisticated, exist and we will see some examples in Section 7.3.
Exploration.
One important aspect of model-free algorithms is that there is a need for explo-
ration. Because the model is unknown, the learner has to try out different ac-
tions to see their results. A learning algorithm has to strike a balance between
exploration and exploitation, i.e. in order to gain a lot of reward the learner has
to exploit its current knowledge about good actions, although it sometimes must
try out different actions to explore the environment for possible better actions.
The most basic exploration strategy is the ε-greedy policy, i.e. the learner takes its current best action with probability (1 − ε) and a (randomly selected) other action with probability ε. There are many more ways of doing exploration (see [Wiering(1999), Reynolds(2002), Ratitch(2005)] for overviews) and in Section 7.3 we will see some examples. One additional method that is often used in combination with the algorithms in this section is the Boltzmann (or: softmax) exploration strategy. It is only slightly more complicated than the ε-greedy strategy. The action selection strategy is still random, but selection probabilities are weighted by their relative Q-values. This makes it more likely for the agent to choose very good actions, whereas two actions that have similar Q-values will have almost the same probability to get selected. Its general form is
probability to get selected. Its general form is
P(an) = eQ(s,a)
T
ieQ(s,ai)
T
(17)
in which P(a_n) is the probability of selecting action a_n and T is the temperature parameter. Higher values of T will move the selection strategy more towards a purely random strategy and lower values will move it towards a fully greedy strategy. A combination of both ε-greedy and Boltzmann exploration can be obtained by taking the best action with probability (1 − ε) and otherwise an action computed according to Equation 17 [Wiering(1999)].
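To illustrate the two strategies, the following sketch implements ε-greedy and Boltzmann (softmax) action selection over a tabular Q-function. The Q table is assumed to be a dictionary keyed by (state, action) pairs, and tau plays the role of the temperature T in Equation 17; these names are illustrative only.

import math
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def boltzmann(Q, state, actions, tau=1.0):
    """Softmax selection: probabilities proportional to exp(Q(s,a)/tau) (Equation 17)."""
    prefs = [math.exp(Q[(state, a)] / tau) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]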
Another simple method to stimulate exploration is optimistic Q-value initializa-
tion; one can initialize all Q-values to high values – e.g. an a priori defined upper-
bound – at the start of learning. Because Q-values will decrease during learning,
actions that have not been tried a number of times will have a large enough value to
get selected when using Boltzmann exploration for example. Another solution with
a similar effect is to keep counters on the number of times a particular state-action
pair has been selected.
7.1 Temporal Difference Learning
Temporal difference learning algorithms learn estimates of values based on other estimates. Each step in the world generates a learning example which can be used to bring some value estimate in accordance with the immediate reward and the estimated value of the next state or state-action pair. An intuitive example, along the lines of [Sutton and Barto(1998)] (Chapter 6), is the following.
Imagine you have to predict at what time your guests can arrive for a small dinner at your house. Before cooking, you have to go to the supermarket, the butcher and the wine seller, in that order. You have estimates of driving times between all locations, and you predict that you can manage each of the last two stores in 10 minutes, but given the crowded time of day, your estimate for the supermarket is half an hour. Based on this prediction, you have notified your guests that they can arrive no earlier than 18.00h. Once you find out while in the supermarket that it will take you only 10 minutes to get all the things you need, you can adjust your estimate of when you will arrive back home by 20 minutes. However, once on your way
from the butcher to the wine seller, you see that there is quite some traffic along the
way and it takes you 30 minutes longer to get there. Finally you arrive 10 minutes
later than you predicted in the first place. The bottom line of this example is that
you can adjust your estimate about what time you will be back home every time you
have obtained new information about in-between steps. Each time you can adjust
your estimate on how long it will still take based on actually experienced times of
parts of your path. This is the main principle of TD learning: you do not have to
wait until the end of a trial to make updates along your path.
TD methods learn their value estimates based on estimates of other values, which
is called bootstrapping. They have an advantage over DP in that they do not require
a model of the MDP. Another advantage is that they are naturally implemented in
an online, incremental fashion such that they can be easily used in various circum-
stances. No full sweeps through the full state space are needed; only along experi-
enced paths values get updated, and updates are effected after each step.
TD(0).
TD(0) is a member of the family of TD learning algorithms [Sutton(1988)]. It solves the prediction problem, i.e. it estimates V^π for some policy π, in an online, incremental fashion. TD(0) can be used to evaluate a policy and works through the use of the following update rule⁵:

V_{k+1}(s) := V_k(s) + α [ r + γ V_k(s') − V_k(s) ]
⁵ The learning parameter α should comply with some criteria on its value, and the way it is changed. In the algorithms in this section, one often chooses a small, fixed learning parameter, or it is decreased every iteration.
Require: discount factor γ, learning parameter α
initialize Q arbitrarily (e.g. Q(s,a) := 0, ∀s ∈ S, ∀a ∈ A)
for each episode do
    s is initialized as the starting state
    repeat
        choose an action a ∈ A(s) based on an exploration strategy
        perform action a
        observe the new state s' and received reward r
        Q(s,a) := Q(s,a) + α [ r + γ · max_{a' ∈ A(s')} Q(s',a') − Q(s,a) ]
        s := s'
    until s' is a goal state
Algorithm 4: Q-Learning [Watkins and Dayan(1992)]
where α ∈ [0,1] is the learning rate, which determines by how much values get updated. This backup is performed after experiencing the transition from state s to s' based on the action a, while receiving reward r. The difference with DP backups such as the one used in Equation 14 is that the update is still done by bootstrapping, but it is based on an observed transition, i.e. it uses a sample backup instead of a full backup. Only the value of one successor state is used, instead of a weighted average of all possible successor states. When using the value function V^π for action selection, a model is needed to compute an expected value over all action outcomes (e.g. see Equation 4).
The learning rate α has to be decreased appropriately for learning to converge. Sometimes the learning rate can be defined for states separately, as in α(s), in which case it can be dependent on how often the state is visited. The next two algorithms learn Q-functions directly from samples, removing the need for a transition model for action selection.
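A minimal TD(0) evaluation loop might look as follows. The episodic environment interface (reset() and step(action) returning the next state, the reward and a done flag) and the policy given as a function from states to actions are assumptions of the sketch, not part of the text.

from collections import defaultdict

def td0_evaluation(env, policy, episodes=1000, alpha=0.1, gamma=0.95):
    """Estimate V^pi with the TD(0) update V(s) := V(s) + alpha [r + gamma V(s') - V(s)]."""
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V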
Q-learning.
One of the most basic and popular methods to estimate Q-value functions in a
model-free fashion is the Q-learning algorithm [Watkins(1989), Watkins and Dayan(1992)],
see Algorithm 4.
The basic idea in Q-learning is to incrementally estimate Q-values for actions, based on feedback (i.e. rewards) and the agent's Q-value function. The update rule is a variation on the theme of TD learning, using Q-values and a built-in max-operator over the Q-values of the next state in order to update Q_k into Q_{k+1}:

Q_{k+1}(s_t,a_t) = Q_k(s_t,a_t) + α [ r_t + γ max_a Q_k(s_{t+1},a) − Q_k(s_t,a_t) ]   (18)

The agent makes a step in the environment from state s_t to s_{t+1} using action a_t while receiving reward r_t. The update takes place on the Q-value of action a_t in the state s_t from which this action was executed.
Q-learning is exploration-insensitive: it will converge to the optimal policy regardless of the exploration policy being followed, under the assumption that each state-action pair is visited an infinite number of times and the learning parameter α is decreased appropriately [Watkins and Dayan(1992), Bertsekas and Tsitsiklis(1996)].
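Algorithm 4 with ε-greedy exploration plugged in might be implemented as in the following sketch; the environment interface and the explicit list of actions are assumptions made for illustration.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning (Equation 18) with epsilon-greedy exploration."""
    Q = defaultdict(float)                     # Q[(state, action)]; optimistic init is also possible
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:      # explore
                a = random.choice(actions)
            else:                              # exploit current estimates
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q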
SARSA.
Q-learning is an off-policy learning algorithm, which means that while following some exploration policy π, it aims at estimating the optimal policy π^*. A related on-policy algorithm that learns the Q-value function for the policy the agent is actually executing is the SARSA algorithm [Rummery and Niranjan(1994), Rummery(1995), Sutton(1996)], which stands for State–Action–Reward–State–Action. It uses the following update rule:

Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + α [ r_t + γ Q_t(s_{t+1},a_{t+1}) − Q_t(s_t,a_t) ]   (19)

where the action a_{t+1} is the action that is executed by the current policy for state s_{t+1}. Note that the max-operator in Q-learning is replaced by the estimate of the value of the next action according to the policy. This learning algorithm will still converge in the limit to the optimal value function (and policy) under the condition that all states and actions are tried infinitely often and the policy converges in the limit to the greedy policy, i.e. such that exploration does not occur anymore. SARSA is especially useful in non-stationary environments, in which one will never reach an optimal policy. It is also useful when function approximation is used, because off-policy methods can diverge in that case. However, off-policy methods are needed in many situations, such as learning with hierarchically structured policies.
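The on-policy character shows up directly in code: the bootstrap target uses the action actually selected next, rather than a max. A sketch of one SARSA episode, reusing the assumed environment interface and an epsilon_greedy helper like the one shown earlier:

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
    """One episode of SARSA (Equation 19): bootstrap on the action actually selected next."""
    s = env.reset()
    a = epsilon_greedy(Q, s, actions, epsilon)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, s_next, actions, epsilon)   # on-policy choice
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next
    return Q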
Actor-Critic Learning.
Another class of algorithms that precede Q-learning and SARSA are actor–critic methods [Witten(1977), Barto et al(1983)Barto, Sutton, and Anderson, Konda and Tsitsiklis(2003)], which learn on-policy. This branch of TD methods keeps a separate policy independent of the value function. The policy is called the actor and the value function the critic. The critic – typically a state-value function – evaluates, or: criticizes, the actions executed by the actor. After action selection, the critic evaluates the action using the TD-error:

δ_t = r_t + γ V(s_{t+1}) − V(s_t)

The purpose of this error is to strengthen or weaken the selection of this action in this state. A preference for an action a in some state s can be represented as p(s,a), such that this preference can be modified using:
p(s_t,a_t) := p(s_t,a_t) + β δ_t

where the parameter β determines the size of the update. There are other versions of actor–critic methods, differing mainly in how preferences are changed, or how experience is used (for example using eligibility traces, see the next section). An advantage of having a separate policy representation is that if there are many actions, or when the action space is continuous, there is no need to consider all actions' Q-values in order to select one of them. A second advantage is that they can learn stochastic policies naturally. Furthermore, a priori knowledge about policy constraints can be used, e.g. see [Främling(2005)].
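A tabular actor-critic update along these lines might look as follows: the critic maintains V and computes the TD-error δ_t, while the actor stores preferences p(s,a) that are turned into a softmax policy. The environment interface and the table layout are assumptions of the sketch.

import math
import random

def actor_critic_episode(env, actions, V, prefs, alpha=0.1, beta=0.1, gamma=0.95):
    """One episode of a simple tabular actor-critic: critic learns V, actor learns preferences."""
    s = env.reset()
    done = False
    while not done:
        # actor: softmax over preferences p(s, a)
        weights = [math.exp(prefs[(s, a)]) for a in actions]
        a = random.choices(actions, weights=weights, k=1)[0]
        s_next, r, done = env.step(a)
        delta = r + gamma * V[s_next] - V[s]      # critic's TD-error
        V[s] += alpha * delta                     # critic update
        prefs[(s, a)] += beta * delta             # actor update: strengthen or weaken the action
        s = s_next

# usage: V and prefs can be collections.defaultdict(float) tables, updated over many episodes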
Average Reward Temporal Difference Learning.
We have explained Q-learning and related algorithms in the context of discounted,
infinite-horizon MDPs. Q-learning can also be adapted to the average-reward frame-
work, for example in the R-learning algorithm [Schwartz(1993)]. Other extensions
of algorithms to the average reward framework exist (see [Mahadevan(1996)] for an
overview).
7.2 Monte Carlo Methods
Other algorithms that use more unbiased estimates are Monte Carlo (MC) tech-
niques. They keep frequency counts of transitions and rewards and base their values
on these estimates. MC methods only require samples to estimate average sample
returns. For example, in MC policy evaluation, for each state sSall returns ob-
tained from sare kept and the value of a state sSis just their average. In other
words, MC algorithms treat the long-term reward as a random variable and take as
its estimate the sampled mean. In contrast with one-step TD methods, MC estimates
values based on averaging sample returns observed during interaction. Especially
for episodic tasks this can be very useful, because samples from complete returns
can be obtained. One way of using MC is by using it for the evaluation step in policy iteration. However, because the sampling is dependent on the current policy π, only returns for actions suggested by π are evaluated. Thus, exploration is of key importance here, just as in other model-free methods.
A distinction can be made between every-visit MC, which averages over all visits of a state s ∈ S in all episodes, and first-visit MC, which averages over just the returns obtained from the first visit to a state s ∈ S for all episodes. Both variants will converge to V^π for the current policy π over time. MC methods can also be applied
to the problem of estimating action values. One way of ensuring enough explo-
ration is to use exploring starts, i.e. each state-action pair has a non-zero probability
of being selected as the initial pair. MC methods can be used for both on-policy
and off-policy control, and the general pattern complies with the generalized policy
iteration procedure. The fact that MC methods do not bootstrap makes them less de-
pendent on the Markov assumption. TD methods too focus on sampled experience,
although they do use bootstrapping.
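A first-visit MC evaluation sketch, under the same assumed episodic environment interface as before, collects one episode at a time and averages the returns observed from each state's first occurrence:

from collections import defaultdict

def first_visit_mc(env, policy, episodes=1000, gamma=0.95):
    """First-visit Monte Carlo policy evaluation: V(s) is the average first-visit return."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(episodes):
        # generate one episode following the policy
        trajectory, s, done = [], env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            trajectory.append((s, r))
            s = s_next
        # compute returns backwards and record them for first visits only
        first_visit = {}
        for t, (s, _) in enumerate(trajectory):
            first_visit.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(trajectory))):
            s, r = trajectory[t]
            G = r + gamma * G
            if first_visit[s] == t:
                returns_sum[s] += G
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}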
Learning a Model.
We have described MC methods in the context of learning value functions. Methods
similar to MC can also be used to estimate a model of the MDP. An average over
sample transition probabilities experienced during interaction can be used to gradu-
ally estimate transition probabilities. The same can be done for immediate rewards.
Indirect RL algorithms make use of this to strike a balance between model-based
and model-free learning. They are essentially model-free, but learn a transition and
reward model in parallel with model-free RL, and use this model to do more effi-
cient value function learning (see also the next section). An example of this is the
DYNA model [Sutton(1991a)]. Another method that often employs model learning
is prioritized sweeping [Moore and Atkeson(1993)]. Learning a model can also be
very useful to learn in continuous spaces where the transition model is defined over
a discretized version of the underlying (infinite) state space [Großmann(2000)].
Relations with Dynamic Programming.
The methods in this section solve essentially similar problems as DP techniques. RL
approaches can be seen as asynchronous DP. There are some important differences
in both approaches though.
RL approaches avoid the exhaustive sweeps of DP by restricting computation to sampled trajectories (either real or simulated), or their neighborhood. This
can exploit situations in which many states have low probabilities of occurring in
actual trajectories. The backups used in DP are simplified by using sampling. In-
stead of generating and evaluating all of a state’s possible immediate successors,
the estimate of a backup’s effect is done by sampling from the appropriate distribu-
tion. MC methods use this to base their estimates completely on the sample returns,
without bootstrapping using values of other, sampled, states. Furthermore, the fo-
cus on learning (action) value functions in RL is easily amenable to function ap-
proximation approaches. Representing value functions and/or policies can be done
more compactly than lookup-table representations by using numeric regression al-
gorithms without breaking the standard RL interaction process; one can just feed
the update values into a regression engine.
An interesting point here is the similarity between Q-learning and value iteration
on the one hand and SARSA and policy iteration on the other hand. In the first two
methods, the updates immediately combine policy evaluation and improvement into
one step by using the max-operator. In contrast, the second two methods separate
evaluation and improvement of the policy. In this respect, value iteration can be
considered as off-policy because it aims at directly estimating V^*, whereas policy
iteration estimates values for the current policy and is on-policy. However, in the
model-based setting the distinction is only superficial, because instead of samples
that can be influenced by an on-policy distribution, a model is available such that
the distribution over states and rewards is known.
7.3 Efficient Exploration and Value Updating
The methods in the previous section have shown that both prediction and control
can be learned using samples from interaction with the environment, without hav-
ing access to a model of the MDP. One problem with these methods is that they
often need a large number of experiences to converge. In this section we describe a
number of extensions used to speed up learning. One direction for improvement lies in the exploration. One can – in principle – use MC sampling until one knows everything about the MDP, but this simply takes too long. Using more information enables more focused exploration procedures to generate experience more efficiently. Another direction is to put more effort into using the experience for updating multiple
values of the value function on each step. Improving exploration generates better
samples, whereas improving updates will squeeze more information from samples.
Efficient Exploration.
We have already encountered ε-greedy and Boltzmann exploration. Although com-
monly used, these are relatively simple undirected exploration methods. They are
mainly driven by randomness. In addition, they are stateless, i.e. the exploration is
driven without knowing which areas of the state space have been explored so far.
A large class of directed exploration methods that use additional information about the learning process has been proposed in the literature. The focus of
these methods is to do more uniform exploration of the state space and to balance
the relative benefits of discovering new information relative to exploiting current
knowledge. Most methods use or learn a model of the MDP in parallel with RL. In
addition they learn an exploration value function.
Several options for directed exploration are available. One distinction between
methods is whether to work locally (e.g. exploration of individual state-action pairs)
or globally by considering information about parts or the complete state-space when
making a decision to explore. Furthermore, there are several other classes of explo-
ration algorithms.
Counter-based or recency-based methods keep records of how often, or how
long ago, a state-action pair has been visited. Error-based methods, of which pri-
oritized sweeping [Moore and Atkeson(1993)] is one example, use an exploration
bonus based on the error in the value of states. Other methods base exploration on
the uncertainty about the value of a state, or the confidence about the state’s cur-
rent value. They decide whether to explore by calculating the probability that an
explorative action will uncover a larger reward than already found. The interval es-
timation (IE) method [Kaelbling(1993)] is an example of this kind of methods. IE
uses a statistical model to measure the degree of uncertainty of each Q(s,a)-value.
An upper bound can be calculated on the likely value of each Q-value, and the ac-
tion with the highest upper bound is taken. If the action taken happens to be a poor
choice, the upper bound will be decreased when the statistical model is updated.
Good actions will continue to have a high upper bound and will be chosen often.
In contrast to counter- and recency-based exploration, IE is concerned with action
exploration and not with state space exploration. [Wiering(1999)] (see also [?]) in-
troduced an extension to model-based RL in the model-based interval estimation
algorithm in which the same idea is used for estimates of transition probabilities.
Another recent method that deals explicitly with the exploration-exploitation trade-off is the E^3 method [Kearns and Singh(1998)]. E^3 stands for explicit exploration and exploitation. It learns by updating a model of the environment by
collecting statistics. The state space is divided into known and unknown parts. On
every step a decision is made whether the known part contains sufficient opportu-
nities for getting rewards or whether the unknown part should be explored to ob-
tain possibly more reward. An important aspect of this algorithm is that it was the
first general near-optimal (tabular) RL algorithm with provable bounds on compu-
tation time. The approach was extended in [Brafman and Tennenholtz(2002)] into
the more general algorithm R-MAX. It too provides a polynomial bound on computation time for reaching near-optimal policies. As a last example, [Ratitch(2005)] presents an approach for efficient, directed exploration based on more sophisticated
characteristics of the MDP such as an entropy measure over state transitions. An
interesting feature of this approach is that these characteristics can be computed be-
fore learning and be used in combination with other exploration methods, thereby
improving their behavior.
For a more detailed coverage of exploration strategies we refer the reader to
[Ratitch(2005)] and [Wiering(1999)].
Guidance and Shaping.
Exploration methods can be used to speed up learning and focus attention to relevant
areas in the state space. The exploration methods mainly use statistics derived from
the problem before or during learning. However, sometimes more information is
available that can be used to guide the learner. For example, if a reasonable policy for
a domain is available, it can be used to generate more useful learning samples than
(random) exploration could do. In fact, humans are usually very bad at specifying optimal policies, but considerably good at specifying reasonable ones⁶.
The work in behavioral cloning [Bain and Sammut(1995)] takes an extreme
point on the guidance spectrum in that the goal is to replicate example behavior from
expert traces, i.e. to clone this behavior. This type of guidance moves learning more
⁶ Quote taken from the invited talk by Leslie Kaelbling at the European Workshop on Reinforcement Learning (EWRL), Utrecht, 2001.
in the direction of supervised learning. Another way to help the agent is by shaping
[Mataric(1994), Dorigo and Colombetti(1997), Ng et al(1999)Ng, Harada, and Russell].
Shaping pushes the reward closer to the subgoals of behavior, and thus encourages
the agent to incrementally improve its behavior by searching the policy space more
effectively. This is also related to the general issue of giving rewards to appropriate
subgoals, and the gradual increase in difficulty of tasks. The agent can be trained
on increasingly more difficult problems, which can also be considered as a form of
guidance.
Various other mechanisms can be used to provide guidance to RL algorithms,
such as decompositions [?], heuristic rules for better exploration [Främling(2005)]
and various types of transfer in which knowledge learned in one problem is trans-
ferred to other, related problems, e.g. see [?].
Eligibility Traces.
In MC methods, the updates are based on the entire sequence of observed rewards
until the end of an episode. In TD methods, the estimates are based on the samples
of immediate rewards and the next states. An intermediate approach is to use the
n-step truncated return R_t^{(n)}, obtained from a whole sequence of returns:

R_t^{(n)} = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... + γ^n V_t(s_{t+n})

With this, one can compute the updates of values based on several n-step returns. The family of TD(λ), with 0 ≤ λ ≤ 1, combines n-step returns weighted proportionally to λ^{n−1}.
The problem with this is that we would have to wait indefinitely to compute R_t^{(∞)}. This view is useful for theoretical analysis and understanding of n-step backups; it is called the forward view of the TD(λ) algorithm. However, the usual way to implement this kind of update is called the backward view of the TD(λ) algorithm and is done by using eligibility traces, which are an incremental implementation of the same idea.
Eligibility traces are a way to perform n-step backups in an elegant way. For each state s ∈ S, an eligibility e_t(s) is kept in memory. The traces are initialized at 0 and updated at every time step according to:

e_t(s) = γλ e_{t−1}(s)         if s ≠ s_t
e_t(s) = γλ e_{t−1}(s) + 1     if s = s_t

where λ is the trace decay parameter. The trace for each state is increased every time that state is visited and decays exponentially otherwise. Now δ_t is the temporal difference error at stage t:

δ_t = r_t + γ V(s_{t+1}) − V(s_t)
On every step, all states are updated in proportion to their eligibility traces, as in:

V(s) := V(s) + α δ_t e_t(s)
The forward and backward views on eligibility traces can be proved equivalent [Sutton and Barto(1998)]. For λ = 1, TD(λ) is essentially the same as MC, because it considers the complete return, and for λ = 0, TD(λ) uses just the immediate return, as in all one-step RL algorithms. Eligibility traces are a general mechanism to learn from n-step returns. They can be combined with all of the model-free methods we have described in the previous section. [Watkins(1989)] combined Q-learning with eligibility traces in the Q(λ)-algorithm. [Peng and Williams(1996)] proposed a similar algorithm, and [Wiering and Schmidhuber(1998)] and [Reynolds(2002)] both proposed efficient versions of Q(λ). The problem with combining eligibility traces with learning control is that special care has to be taken in case of exploratory actions, which can break the intended meaning of the n-step return for the current policy that is followed. In [Watkins(1989)]'s version, eligibility traces are reset every time an exploratory action is taken. [Peng and Williams(1996)]'s version is, in that respect, more efficient, in that traces do not have to be set to zero every time. SARSA(λ) [Sutton and Barto(1998)] is safer in this respect, because action selection is on-policy. Another recent on-policy learning algorithm is the QV(λ) algorithm by [Wiering(2005)]. In QV(λ)-learning two value functions are learned; TD(λ) is used for learning a state value function V, and one-step Q-learning is used for learning a state-action value function, based on V.
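The backward view translates almost directly into code. The sketch below implements tabular TD(λ) for policy evaluation with accumulating eligibility traces, again under the assumed environment interface used in the earlier sketches:

from collections import defaultdict

def td_lambda(env, policy, episodes=1000, alpha=0.1, gamma=0.95, lam=0.8):
    """TD(lambda) with accumulating eligibility traces (backward view)."""
    V = defaultdict(float)
    for _ in range(episodes):
        e = defaultdict(float)                 # eligibility traces, reset per episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + gamma * V[s_next] - V[s]
            e[s] += 1.0                        # accumulate trace for the visited state
            for state in list(e):              # update all states in proportion to their trace
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam        # decay every trace
            s = s_next
    return V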
Learning and Using a Model: Learning and Planning.
Even though RL methods can function without a model of the MDP, such a model can be useful to speed up learning, or to bias exploration. A learned model can also be useful for more efficient value updating. A general guideline is that when experience is costly, it pays off to learn a model. In RL, model-learning is usually targeted at the specific learning task defined by the MDP, i.e. determined by the rewards and the goal. In general, learning a model is most often useful because it gives knowledge about the dynamics of the environment, such that it can be used for other tasks too (see [?] for extensive elaboration on this point).
The DYNA architecture [Sutton(1990), Sutton(1991b), Sutton(1991a), Sutton and Barto(1998)] is a simple way to use the model to amplify experiences. Algorithm 5 shows DYNA-Q, which combines Q-learning with planning. In a continuous loop, Q-learning is interleaved with series of extra updates using a model that is constantly updated too. DYNA needs fewer interactions with the environment, because it replays experience to do more value updates.
A related method that makes more use of experience using a learned model is prioritized sweeping (PS) [Moore and Atkeson(1993)]. Instead of selecting states to be updated randomly (as in DYNA), PS prioritizes updates based on their change in values.
Require: initialize Q and Model arbitrarily
repeat
    s ∈ S is the start state
    a := ε-greedy(s, Q)
    update Q
    update Model
    for i := 1 to n do
        s := randomly selected observed state
        a := random, previously selected action from s
        update Q using the model
until sufficient performance
Algorithm 5: Dyna-Q [Sutton and Barto(1998)]
Once a state is updated, the PS algorithm considers all states that can reach that state, by looking at the transition model, and sees whether these states will have to be updated as well. The order of the updates is determined by the size of the value updates. The general mechanism can be summarized as follows. In each step, i) one remembers the old value of the current state, ii) one updates the state value with a full backup using the learned model, iii) one sets the priority of the current state to 0, iv) one computes the change δ in value as the result of the backup, and v) one uses this difference to modify predecessors of the current state (determined by the model); all states leading to the current state get a priority update of δ × T. The number of value backups is a parameter to be set in the algorithm. Overall, PS focuses the backups where they are expected to most quickly reduce the error. Another example of using planning in model-based RL is [Wiering(2002)].
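A minimal Dyna-Q loop in the spirit of Algorithm 5 might look as follows. For simplicity the sketch assumes a deterministic environment, so the learned model can store a single (reward, next state) pair per state-action; the interface names are, as before, illustrative assumptions.

import random
from collections import defaultdict

def dyna_q(env, actions, episodes=500, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Dyna-Q: interleave Q-learning updates with n planning updates from a learned model."""
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (r, s') for observed transitions
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # direct RL update from real experience
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            # model learning (deterministic-world assumption)
            model[(s, a)] = (r, s_next)
            # planning: replay n previously observed transitions from the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                p_best = max(Q[(ps_next, act)] for act in actions)
                Q[(ps, pa)] += alpha * (pr + gamma * p_best - Q[(ps, pa)])
            s = s_next
    return Q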
References
[Bain and Sammut(1995)] Bain M, Sammut C (1995) A framework for behavioral cloning. Ma-
chine Learning 15
[Barto et al(1983)Barto, Sutton, and Anderson] Barto AG, Sutton RS, Anderson CW (1983) Neu-
ronlike elements that can solve difficult learning control problems. IEEE Transactions on Sys-
tems, Man, and Cybernetics 13:835–846
[Barto et al(1995)Barto, Bradtke, and Singh] Barto AG, Bradtke SJ, Singh SP (1995) Learning to
act using real-time dynamic programming. Artificial Intelligence 72(1):81–138
[Bellman(1957)] Bellman RE (1957) Dynamic Programming. Princeton University Press, Prince-
ton, New Jersey
[Bertsekas(1995)] Bertsekas D (1995) Dynamic Programming and Optimal Control, volumes 1
and 2. Athena Scientific, Belmont, MA
[Bertsekas and Tsitsiklis(1996)] Bertsekas DP, Tsitsiklis J (1996) Neuro-Dynamic Programming.
Athena Scientific, Belmont, MA
[Bonet and Geffner(2003a)] Bonet B, Geffner H (2003a) Faster heuristic search algorithms for
planning with uncertainty and full feedback. In: Proceedings of the International Joint Con-
ference on Artificial Intelligence (IJCAI), pp 1233–1238
[Bonet and Geffner(2003b)] Bonet B, Geffner H (2003b) Labeled rtdp: Improving the conver-
gence of real-time dynamic programming. In: Proceedings of the International Conference on
Artificial Planning Systems (ICAPS’03), pp 12–21
[Boutilier(1999)] Boutilier C (1999) Knowledge representation for stochastic decision processes.
Lecture Notes in Computer Science 1600:111–152
[Boutilier et al(1999)Boutilier, Dean, and Hanks] Boutilier C, Dean T, Hanks S (1999) Decision
theoretic planning: Structural assumptions and computational leverage. Journal of Artificial
Intelligence Research 11:1–94
[Brafman and Tennenholtz(2002)] Brafman R, Tennenholtz M (2002) R-MAX - a general poly-
nomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning
Research 3:213–231
[Brendan McMahan et al(2005)Brendan McMahan, Likhachev, and Gordon] Brendan McMahan
H, Likhachev M, Gordon GJ (2005) Bounded real-time dynamic programming: Rtdp with
monotone upper bounds and performance guarantees. In: Proceedings of the International
Conference on Machine Learning (ICML), pp 569–576
[Dean et al(1995)Dean, Kaelbling, Kirman, and Nicholson] Dean T, Kaelbling LP, Kirman J,
Nicholson A (1995) Planning under time constraints in stochastic domains. Artificial Intel-
ligence 76:35–74
[Dorigo and Colombetti(1997)] Dorigo M, Colombetti M (1997) Robot shaping: an experiment
in behavior engineering. MIT Press, Cambridge, MA
[Ferguson and Stentz(2004)] Ferguson D, Stentz A (2004) Focussed dynamic programming: Ex-
tensive comparative results. Tech. Rep. CMU-RI-TR-04-13, Robotics Institute, Carnegie Mel-
lon University, Pittsburgh, Pennsylvania
[Främling(2005)] Främling K (2005) Bi-memory model for guiding exploration by pre-existing
knowledge. In: Driessens K, Fern A, van Otterlo M (eds) Proceedings of the ICML-2005
Workshop on Rich Representations for Reinforcement Learning, pp 21–26
[Großmann(2000)] Großmann A (2000) Adaptive state-space quantisation and multi-task rein-
forcement learning using constructive neural networks. In: Meyer JA, Berthoz A, Floreano D,
Roitblat HL, Wilson SW (eds) From Animals to Animats: Proceedings of The International
Conference on Simulation of Adaptive Behavior (SAB), pp 160–169
[Hansen and Zilberstein(2001)] Hansen EA, Zilberstein S (2001) LAO*: A heuristic search algo-
rithm that finds solutions with loops. Artificial Intelligence 129:35–62
[Howard(1960)] Howard RA (1960) Dynamic Programming and Markov Processes. The MIT
Press, Cambridge, Massachusetts
[Kaelbling(1993)] Kaelbling LP (1993) Learning in Embedded Systems. MIT Press, Cambridge,
MA
[Kaelbling et al(1996)Kaelbling, Littman, and Moore] Kaelbling LP, Littman ML, Moore AW
(1996) Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4:237–
285
[Kearns and Singh(1998)] Kearns M, Singh S (1998) Near-optimal reinforcement learning in
polynomial time. In: Proceedings of the International Conference on Machine Learning
(ICML)
[Koenig and Liu(2002)] Koenig S, Liu Y (2002) The interaction of representations and planning
objectives for decision-theoretic planning. Journal of Experimental and Theoretical Artificial
Intelligence
[Konda and Tsitsiklis(2003)] Konda V, Tsitsiklis J (2003) Actor-critic algorithms. SIAM Journal
on Control and Optimization 42(4):1143–1166
[Littman et al(1995)Littman, Dean, and Kaelbling] Littman ML, Dean TL, Kaelbling LP (1995)
On the complexity of solving markov decision problems. In: Proceedings of the National
Conference on Artificial Intelligence (AAAI), pp 394–402
[Mahadevan(1996)] Mahadevan S (1996) Average reward reinforcement learning: Foundations,
algorithms, and empirical results. Machine Learning 22:159–195
[Maloof(2003)] Maloof MA (2003) Incremental rule learning with partial instance memory for
changing concepts. In: Proceedings of the International Joint Conference on Neural Networks,
pp 2764–2769
[Mataric(1994)] Mataric M (1994) Reward functions for accelerated learning. In: Proceedings of
the International Conference on Machine Learning (ICML)
[Matthews(1922)] Matthews WH (1922) Mazes and Labyrinths: A General Account of their His-
tory and Developments. Longmans, Green and Co., London, reprinted in 1970 by Dover Pub-
lications, New York, under the title ’Mazes & Labyrinths: Their History & Development
[Moore and Atkeson(1993)] Moore AW, Atkeson CG (1993) Prioritized sweeping: Reinforce-
ment learning with less data and less time. Machine Learning 13(1):103–130
[Ng et al(1999)Ng, Harada, and Russell] Ng AY, Harada D, Russell SJ (1999) Policy invariance
under reward transformations: Theory and application to reward shaping. In: Proceedings of
the International Conference on Machine Learning (ICML), pp xx–xx
[Peng and Williams(1996)] Peng J, Williams RJ (1996) Incremental multi-step q-learning. Ma-
chine Learning 22:283–290
[Puterman(1994)] Puterman ML (1994) Markov Decision Processes—Discrete Stochastic Dy-
namic Programming. John Wiley & Sons, Inc., New York, NY
[Puterman and Shin(1978)] Puterman ML, Shin MC (1978) Modified policy iteration algorithms
for discounted Markov decision processes. Management Science 24:1127–1137
[Ratitch(2005)] Ratitch B (2005) On characteristics of markov decision processes and reinforce-
ment learning in large domains. PhD thesis, The School of Computer Science, McGill Uni-
versity, Montreal
[Reynolds(2002)] Reynolds SI (2002) Reinforcement learning with exploration. PhD thesis, The
School of Computer Science, The University of Birmingham, UK
[Rummery(1995)] Rummery GA (1995) Problem solving with reinforcement learning. PhD the-
sis, Cambridge University, Engineering Department, Cambridge, England
[Rummery and Niranjan(1994)] Rummery GA, Niranjan M (1994) On-line Q-Learning using
connectionist systems. Tech. Rep. CUED/F-INFENG/TR 166, Cambridge University, Engi-
neering Department
[Schaeffer and Plaat(1997)] Schaeffer J, Plaat A (1997) Kasparov versus deep blue: The re-match.
International Computer Chess Association Journal 20(2):95–101
[Schwartz(1993)] Schwartz A (1993) A reinforcement learning method for maximizing undis-
counted rewards. In: Proceedings of the International Conference on Machine Learning
(ICML), pp 298–305
[Sutton(1990)] Sutton R (1990) Integrated architectures for learning, planning, and reacting based
on approximating dynamic programming. In: Proceedings of the International Conference on
Machine Learning (ICML), pp 216–224
[Sutton(1988)] Sutton RS (1988) Learning to predict by the methods of temporal differences.
Machine Learning 3:9–44
[Sutton(1991a)] Sutton RS (1991a) DYNA, an integrated architecture for learning, planning and
reacting. In: Working Notes of the AAAI Spring Symposium on Integrated Intelligent Archi-
tectures, pp 151–155
[Sutton(1991b)] Sutton RS (1991b) Reinforcement learning architectures for animats. In: From
Animals to Animats: Proceedings of The International Conference on Simulation of Adaptive
Behavior (SAB), pp 288–296
[Sutton(1996)] Sutton RS (1996) Generalization in reinforcement learning: Successful examples
using sparse coarse coding. In: Touretzky DS, Mozer MC, Hasselmo ME (eds) Proceedings
of the Neural Information Processing Conference (NIPS), vol 8, pp 1038–1044
[Sutton and Barto(1998)] Sutton RS, Barto AG (1998) Reinforcement Learning: an Introduction.
The MIT Press, Cambridge
[Tash and Russell(1994)] Tash J, Russell S (1994) Control strategies for a stochastic planner. In:
Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI’94), pp
1079–1085
[Watkins(1989)] Watkins CJCH (1989) Learning from delayed rewards. PhD thesis, King’s Col-
lege, Cambridge, England
[Watkins and Dayan(1992)] Watkins CJCH, Dayan P (1992) Q-learning. Machine Learning
8(3/4), special Issue on Reinforcement Learning
[Wiering(1999)] Wiering MA (1999) Explorations in efficient reinforcement learning. PhD thesis,
Faculteit der Wiskunde, Informatica, Natuurkunde en Sterrenkunde, Universiteit van Amster-
dam
[Wiering(2002)] Wiering MA (2002) Model-based reinforcement learning in dynamic environ-
ments. Tech. Rep. UU-CS-2002-029, Institute of Information and Computing Sciences, Uni-
versity of Utrecht, The Netherlands
[Wiering(2005)] Wiering MA (2005) QV(λ)-learning: A new on-policy reinforcement learning algorithm. In: Proceedings of the 7th European Workshop on Reinforcement Learning
[Wiering and Schmidhuber(1998)] Wiering MA, Schmidhuber JH (1998) Fast online Q(λ). Machine Learning xx:xxx–xxx
[Winston(1991)] Winston WL (1991) Operations research applications and algorithms, 2nd edn.
Thomson Information/Publishing Group, Boston
[Witten(1977)] Witten IH (1977) An adaptive optimal controller for discrete-time markov envi-
ronments. Information and Control 34:286–295
... Deep-RL. Reinforcement Learning (RL) [28] is a prominent branch of machine learning where agents learn optimal strategies through interactions with an environment, receiving feedback in the form of rewards or penalties. The fundamental goal of RL is to deduce a policy that maximizes the expected cumulative reward over time, emphasizing decision-making to achieve specific objectives. ...
Preprint
Recent years have witnessed a growing trend toward employing deep reinforcement learning (Deep-RL) to derive heuristics for combinatorial optimization (CO) problems on graphs. Maximum Coverage Problem (MCP) and its probabilistic variant on social networks, Influence Maximization (IM), have been particularly prominent in this line of research. In this paper, we present a comprehensive benchmark study that thoroughly investigates the effectiveness and efficiency of five recent Deep-RL methods for MCP and IM. These methods were published in top data science venues, namely S2V-DQN, Geometric-QN, GCOMB, RL4IM, and LeNSE. Our findings reveal that, across various scenarios, the Lazy Greedy algorithm consistently outperforms all Deep-RL methods for MCP. In the case of IM, theoretically sound algorithms like IMM and OPIM demonstrate superior performance compared to Deep-RL methods in most scenarios. Notably, we observe an abnormal phenomenon in IM problem where Deep-RL methods slightly outperform IMM and OPIM when the influence spread nearly does not increase as the budget increases. Furthermore, our experimental results highlight common issues when applying Deep-RL methods to MCP and IM in practical settings. Finally, we discuss potential avenues for improving Deep-RL methods. Our benchmark study sheds light on potential challenges in current deep reinforcement learning research for solving combinatorial optimization problems.
... After obtaining the data, the agent updates its strategy by learning to maximize the total reward value. The Markov Decision Process formalizes the environment of RL [34], which is a memoryless random process. MDP consists of 5 elements [ , , , , ] s A P R γ , where s is a finite set of states, A is a finite set of actions, P is a state transition probability matrix, R is a reward function used to score the decisions of the agent, and γ is a discount factor, and [0,1] γ ∈ . ...
Article
Full-text available
With the development and strengthening of interception measures, the traditional penetration methods of high-speed unmanned aerial vehicles (UAVs) are no longer able to meet the penetration requirements in diversified and complex combat scenarios. Due to the advancement of Artificial Intelligence technology in recent years, intelligent penetration methods have gradually become promising solutions. In this paper, a penetration strategy for high-speed UAVs based on improved Deep Reinforcement Learning (DRL) is proposed, in which Long Short-Term Memory (LSTM) networks are incorporated into a classical Soft Actor–Critic (SAC) algorithm. A three-dimensional (3D) planar engagement scenario of a high-speed UAV facing two interceptors with strong maneuverability is constructed. According to the proposed LSTM-SAC approach, the reward function is designed based on the criteria for successful penetration, taking into account energy and flight range constraints. Then, an intelligent penetration strategy is obtained by extensive training, which utilizes the motion states of both sides to make decisions and generate the penetration overload commands for the high-speed UAV. The simulation results show that compared with the classical SAC algorithm, the proposed algorithm has a training efficiency improvement of 75.56% training episode reduction. Meanwhile, the LSTM-SAC approach achieves a successful penetration rate of more than 90% in hypothetical complex scenarios, with a 40% average increase compared with the conventional programmed penetration methods.
Preprint
This paper introduces a flight envelope protection algorithm for the longitudinal axis that leverages reinforcement learning (RL). By considering limits on variables such as angle of attack, load factor, and pitch rate, the algorithm counteracts excessive pilot or control commands with restoring actions. Unlike traditional methods that require manual tuning, RL facilitates the approximation of complex functions within the trained model, streamlining the design process. This study demonstrates the promising results of RL in enhancing flight envelope protection, offering a novel and easy-to-scale method for safety-ensured flight.
Preprint
Maximizing storage performance in geological carbon storage (GCS) is crucial for commercial deployment, but traditional optimization demands resource-intensive simulations, posing computational challenges. This study introduces the multimodal latent dynamic (MLD) model, a deep learning framework for fast flow prediction and well control optimization in GCS. The MLD model includes a representation module for compressed latent representations, a transition module for system state evolution, and a prediction module for flow responses. A novel training strategy combining regression loss and joint-embedding consistency loss enhances temporal consistency and multi-step prediction accuracy. Unlike existing models, the MLD model supports diverse input modalities, allowing comprehensive data interactions. Because the MLD model can be framed as the environment of a Markov decision process (MDP), it can be used to train deep reinforcement learning agents, specifically with the soft actor-critic (SAC) algorithm, to maximize net present value (NPV) through continuous interactions. The approach outperforms traditional methods, achieving the highest NPV while reducing computational resources by over 60%. It also demonstrates strong generalization performance, providing improved decisions for new scenarios based on knowledge from previous ones.
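The three-module structure described above can be pictured with a skeletal PyTorch sketch; the module types, layer sizes, and interfaces below are assumptions for illustration rather than details of the MLD model itself.

import torch
import torch.nn as nn

class LatentDynamicsModel(nn.Module):
    """Skeleton of an encode -> latent transition -> decode surrogate model."""
    def __init__(self, obs_dim: int, action_dim: int, latent_dim: int = 64):
        super().__init__()
        # Representation module: compress observations into a latent state.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        # Transition module: evolve the latent state under a control action.
        self.transition = nn.Sequential(nn.Linear(latent_dim + action_dim, 256),
                                        nn.ReLU(), nn.Linear(256, latent_dim))
        # Prediction module: map latent states back to flow responses / observations.
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, obs_dim))

    def forward(self, obs, action):
        z = self.encoder(obs)
        z_next = self.transition(torch.cat([z, action], dim=-1))
        return self.decoder(z_next), z, z_next   # predicted next obs + latents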
Chapter
Neural Networks for Control highlights key issues in learning control and identifies research directions that could lead to practical solutions for control problems in critical application domains. It addresses general issues of neural network based control and neural network learning with regard to specific problems of motion planning and control in robotics, and takes up application domains well suited to the capabilities of neural network controllers. The appendix describes seven benchmark control problems. Contributors: Andrew G. Barto, Ronald J. Williams, Paul J. Werbos, Kumpati S. Narendra, L. Gordon Kraft, III, David P. Campagna, Mitsuo Kawato, Bartlett W. Mel, Christopher G. Atkeson, David J. Reinkensmeyer, Derrick Nguyen, Bernard Widrow, James C. Houk, Satinder P. Singh, Charles Fisher, Judy A. Franklin, Oliver G. Selfridge, Arthur C. Sanderson, Lyle H. Ungar, Charles C. Jorgensen, C. Schley, Martin Herman, James S. Albus, Tsai-Hong Hong, Charles W. Anderson, and W. Thomas Miller, III. Bradford Books imprint.
Chapter
These sixty contributions from researchers in ethology, ecology, cybernetics, artificial intelligence, robotics, and related fields delve into the behaviors and underlying mechanisms that allow animals and, potentially, robots to adapt and survive in uncertain environments. They focus in particular on simulation models in order to help characterize and compare various organizational principles or architectures capable of inducing adaptive behavior in real or artificial animals. Jean-Arcady Meyer is Director of Research at CNRS, Paris. Stewart W. Wilson is a Scientist at The Rowland Institute for Science, Cambridge, Massachusetts. Bradford Books imprint
Chapter
The Animals to Animats Conference brings together researchers from ethology, psychology, ecology, artificial intelligence, artificial life, robotics, engineering, and related fields to further understanding of the behaviors and underlying mechanisms that allow natural and synthetic agents (animats) to adapt and survive in uncertain environments. The work presented focuses on well-defined models (robotic, computer-simulation, and mathematical) that help to characterize and compare various organizational principles or architectures underlying adaptive behavior in both natural animals and animats. Bradford Books imprint.
Chapter
This is the fifteenth volume in the Machine Intelligence series, founded in 1965 by Donald Michie, and includes papers by a number of eminent AI figures including John McCarthy, Alan Robinson, Robert Kowalski and Mike Genesereth. The book is centred on the theme of intelligent agents and covers a wide range of topics, including: - Representations of consciousness (John McCarthy, Stanford University and Donald Michie, Edinburgh University) - SoftBots (Bruce Blumberg, MIT Media Lab) - Parallel implementations of logic (Alan Robinson, Syracuse University) - Machine learning (Stephen Muggleton, Oxford University) - Machine vision (Andrew Blake, Oxford University) - Machine-based scientific discovery in molecular biology (Mike Sternberg, Imperial Cancer Research Fund).
Book
Made-Up Minds addresses fundamental questions of learning and concept invention by means of an innovative computer program that is based on the cognitive-developmental theory of psychologist Jean Piaget. Drescher uses Piaget's theory as a source of inspiration for the design of an artificial cognitive system called the schema mechanism, and then uses the system to elaborate and test Piaget's theory. The approach is original enough that readers need not have extensive knowledge of artificial intelligence, and a chapter summarizing Piaget assists readers who lack a background in developmental psychology. The schema mechanism learns from its experiences, expressing discoveries in its existing representational vocabulary, and extending that vocabulary with new concepts. A novel empirical learning technique, marginal attribution, can find results of an action that are obscure because each occurs rarely in general, although reliably under certain conditions. Drescher shows that several early milestones in the Piagetian infant's invention of the concept of persistent object can be replicated by the schema mechanism.
Chapter
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
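Since R-learning is less familiar than Q-learning, a tabular sketch of its update rules may help; the environment interface (reset/step), exploration scheme, and step sizes below are illustrative assumptions, not a specific library API.

import numpy as np

def r_learning(env, n_states, n_actions, steps=100_000,
               alpha=0.1, beta=0.01, epsilon=0.1, seed=0):
    """Tabular R-learning (average-reward reinforcement learning).

    Maintains relative action values R[s, a] and a running estimate rho of
    the average reward per step. `env` is assumed to be a continuing-task
    environment exposing reset() -> state and step(action) -> (next_state, reward).
    """
    rng = np.random.default_rng(seed)
    R = np.zeros((n_states, n_actions))
    rho = 0.0
    s = env.reset()
    for _ in range(steps):
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(R[s]))
        s_next, reward = env.step(a)
        # relative-value update: rewards are measured against the average rho
        R[s, a] += alpha * (reward - rho + R[s_next].max() - R[s, a])
        # update rho only when the executed action is (still) greedy
        if R[s, a] == R[s].max():
            rho += beta * (reward - rho + R[s_next].max() - R[s].max())
        s = s_next
    return R, rho

The key difference from Q-learning is the explicit running estimate rho of the average reward, which takes the place of discounting in the undiscounted, average-reward setting discussed above.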