Reinforcement Learning and Markov Decision Processes
Martijn van Otterlo and Marco Wiering
Abstract Situated in between supervised learning and unsupervised learning, the paradigm of reinforcement learning deals with learning in sequential decision making problems in which there is limited feedback. This text introduces the intuitions and concepts behind Markov decision processes and two classes of algorithms for computing optimal behaviors: reinforcement learning and dynamic programming. First, the formal framework of Markov decision processes is defined, accompanied by the definition of value functions and policies. The main part of this text deals with introducing foundational classes of algorithms for learning optimal behaviors, based on various definitions of optimality with respect to the goal of learning sequential decisions. Additionally, it surveys efficient extensions of the foundational algorithms, differing mainly in the way feedback given by the environment is used to speed up learning, and in the way they concentrate on relevant parts of the problem. For both model-based and model-free settings these efficient extensions have proven useful in scaling up to larger problems.
This text was assembled from the initial chapters of The Logic of Adaptive Behavior by Martijn van Otterlo (2008), later published at IOS Press (2009). The purpose of this chapter is to show i) which kinds of algorithms, concepts, techniques etc. one can assume when writing chapters on specific topics (i.e. concepts such as MDPs and Q-learning can be assumed, and do not have to be explained again), and ii) the type of notation that is expected (e.g. the AI-style of RL, as opposed to the Bertsekas and Tsitsiklis (1996) OR-style). The specifics and length of this chapter are – of course – subject to change.
Martijn van Otterlo
Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Heverlee, Belgium, e-mail:
martijn.vanotterlo@cs.kuleuven.be
Marco Wiering
Department of Artificial Intelligence, University of Groningen, The Netherlands, e-mail: m.a.wiering@rug.nl
1 Introduction
Markov Decision Processes (MDPs) [Puterman(1994)] are an intuitive and fundamental formalism for decision-theoretic planning (DTP) [Boutilier et al(1999), Boutilier(1999)], reinforcement learning (RL) [Bertsekas and Tsitsiklis(1996), Sutton and Barto(1998), Kaelbling et al(1996)] and other learning problems in stochastic domains. In this model, an environment is modelled as a set of states, and actions can be performed to control the system's state. The goal is to control the system in such a way that some performance criterion is maximized. Many problems, such as (stochastic) planning problems, learning robot control and game playing, have successfully been modelled in terms of an MDP. In fact, MDPs have become the de facto standard formalism for learning sequential decision making.
DTP [Boutilier et al(1999)], i.e. planning using decision-theoretic notions to represent uncertainty and plan quality, is an important extension of the AI planning paradigm, adding the ability to deal with uncertainty in action effects and the ability to deal with less-defined goals. Furthermore, it adds a significant dimension in that it considers situations in which factors such as resource consumption and uncertainty demand solutions of varying quality, for example in real-time decision situations. There are many connections between AI planning, research done in the field of operations research [Winston(1991)] and control theory [Bertsekas(1995)], as most work in these fields on sequential decision making can be viewed as instances of MDPs. The notion of a plan in AI planning, i.e. a series of actions from a start state to a goal state, is extended to the notion of a policy, which is a mapping from all states to an (optimal) action, based on decision-theoretic measures of optimality with respect to some goal to be optimized.
As an example, consider a typical planning domain involving boxes to be moved around, where the goal is to move some particular boxes to a designated area. This type of problem can be solved using AI planning techniques. Consider now a slightly more realistic extension in which some of the actions can fail, or have uncertain side-effects that can depend on factors beyond the operator's control, and where the goal is specified by giving credit for how many boxes are put in the right place. In this type of environment, the notion of a plan is less suitable, because a sequence of actions can have many different outcomes, depending on the effects of the operators used in the plan. Instead, the methods in this chapter are concerned with policies that map states onto actions in such a way that the expected outcome of the operators will have the intended effects. The expectation over actions is based on a decision-theoretic expectation with respect to their probabilistic outcomes and the credits associated with the problem goals. The MDP framework allows for online solutions that learn optimal policies gradually through simulated trials, and additionally, it allows for approximate solutions with respect to resources such as computation time. Finally, the model allows for numeric, decision-theoretic measurement of the quality of policies and of learning performance.
environment: You are in state 65. You have 4 possible actions.
agent: I'll take action 2.
environment: You have received a reward of 7 units. You are now in state 15. You have 2 possible actions.
agent: I'll take action 1.
environment: You have received a reward of 4 units. You are now in state 65. You have 4 possible actions.
agent: I'll take action 2.
environment: You have received a reward of 5 units. You are now in state 44. You have 5 possible actions.
...
Fig. 1 Example of interaction between an agent and its environment, from a RL perspective.
For example, policies can be ordered by how much credit they receive, or by how much computation is needed for a particular performance.
This chapter will cover the broad spectrum of methods that have been developed in the literature to compute good or optimal policies for problems modelled as an MDP. The term RL is associated with the more difficult setting in which no (prior) knowledge about the MDP is presented. The task of the algorithm is then to interact, or experiment with the environment (i.e. the MDP), in order to gain knowledge about how to optimize its behavior, being guided by the evaluative feedback (rewards). The model-based setting, in which the full transition dynamics and reward distributions are known, is usually characterized by the use of dynamic programming (DP) techniques. However, we will see that the underlying basis is very similar, and that mixed forms occur.
2 Learning Sequential Decision Making
RL is a general class of algorithms in the field of machine learning that aims at
allowing an agent to learn how to behave in an environment, where the only feed-
back consists of a scalar reward signal. RL should not be seen as characterized by a
particular class of learning methods, but rather as a learning problem or a paradigm.
The goal of the agent is to perform actions that maximize the reward signal in the
long run.
The distinction between the agent and the environment might not always be the most intuitive one. We will draw a boundary based on control [Sutton and Barto(1998)]: everything the agent cannot control is considered part of the environment. For example, although the motors of a robot agent might be considered part of the agent, their exact functioning in the environment is beyond the agent's control. It can give commands to gear up or down, but their physical realization can be influenced by many things.
An example of interaction with the environment is given in Figure 1. It shows how the interaction between an agent and the environment can take place. The agent can choose an action in each state, and the perceptions the agent gets from the environment are the environment's state after each action plus the scalar reward signal at each step. Here a discrete model is used in which there are distinct numbers for each state and action. The way the interaction is depicted is highly general in the sense that one just talks about states and actions as discrete symbols. In the rest of this book we will be more concerned with interactions in which states and actions have more structure, such that a state can be something like "there are two blue boxes and one white one and you are standing next to a blue box". However, this figure clearly shows the mechanism of sequential decision making.
There are several important aspects in learning sequential decision making which
we will describe in this section, after which we will describe formalizations in the
next sections.
Approaching Sequential Decision Making.
There are several classes of algorithms that deal with the problem of sequential
decision making. In this book we deal specifically with the topic of learning, but
some other options exist.
The first solution is the programming solution. An intelligent system for sequen-
tial decision making can – in principle – be programmed to handle all situations. For
each possible state an appropriate or optimal action can be specified a priori. How-
ever, this puts a heavy burden on the designer or programmer of the system. All sit-
uations should be foreseen in the design phase and programmed into the agent. This
is a tedious and almost impossible task for most interesting problems, and it only
works for problems which can be modelled completely. In most realistic problems
this is not possible due to the sheer size of the problem, or the intrinsic uncertainty
in the system. A simple example is robot control in which factors such as lighting or
temperature can have a large, and unforeseen, influence on the behavior of camera
and motor systems. Furthermore, in situations where the problem changes, for ex-
ample due to new elements in the description of the problem or changing dynamic
of the system, a programmed solution will no longer work. Programmed solutions
are brittle in that they will only work for completely known, static problems with
fixed probability distributions.
A second solution uses search and planning for sequential decision making. The successful chess program Deep Blue [Schaeffer and Plaat(1997)] was able to defeat the human world champion Garry Kasparov by smart, brute force search algorithms that used a model of the dynamics of chess, tuned to Kasparov's style of playing. When the dynamics of the system are known, one can search or plan from the current state to a desirable goal state. However, when there is uncertainty about the action outcomes, standard search and planning algorithms do not apply. Admissible heuristics can solve some problems concerning the reward-based nature of sequential decision making, but the probabilistic effects of actions pose a difficult problem. Probabilistic planning algorithms exist, but their performance is not as good as that of their deterministic counterparts. An additional problem is that planning and
search focus on specific start and goal states. In contrast, we are looking for policies
which are defined for all states, and are defined with respect to rewards.
The third solution is learning, and this will be the main topic of this book. Learn-
ing has several advantages in sequential decision making. First, it relieves the de-
signer of the system from the difficult task of deciding upon everything in the design
phase. Second, it can cope with uncertainty, goals specified in terms of reward mea-
sures, and with changing situations. Third, it is aimed at solving the problem for
every state, as opposed to a mere plan from one state to another. Additionally, al-
though a model of the environment can be used or learned, it is not necessary in
order to compute optimal policies, as exemplified by RL methods. Everything can be learned from interaction with the environment.
Online versus Off-line Learning.
One important aspect in the learning task we consider in this book is the distinction
between online and off-line learning. The difference between these two types is
influenced by factors such as whether one wants to control a real-world entity – such
as a robot playing robot soccer or a machine in a factory – or whether all necessary
information is available. Online learning performs learning directly on the problem
instance. Off-line learning uses a simulator of the environment as a cheap way to
get many training examples for safe and fast learning.
Learning the controller directly on the real task is often not possible. For ex-
ample, the learning algorithms in this chapter sometimes need millions of training
instances which can be too time-consuming to collect. Instead, a simulator is much
faster, and in addition it can be used to provide arbitrary training situations, includ-
ing situations that rarely happen in the real system. Furthermore, it provides a "safe"
training situation in which the agent can explore and make mistakes. Obtaining neg-
ative feedback in the real task in order to learn to avoid these situations, might entail
destroying the machine that is controlled, which is unacceptable. Often one uses a
simulation to obtain a reasonable policy for a given problem, after which some parts
of the behavior are fine-tuned on the real task. For example, a simulation might pro-
vide the means for learning a reasonable robot controller, but some physical factors
concerning variance in motor and perception systems of the robot might make addi-
tional fine-tuning necessary. A simulation is just a model of the real problem, such
that small differences between the two are natural, and learning might make up for
that difference. Many problems in the literature, however, are simulations of games
and optimization problems, such that the distinction disappears.
Credit Assignment.
An important aspect of sequential decision making is the fact that whether an action is "good" or "bad" cannot be decided upon right away. The appropriateness of actions is completely determined by the goal the agent is trying to pursue.
The real problem is that the effect of actions with respect to the goal can be much
delayed. For example, the opening moves in chess have a large influence on win-
ning the game. However, between the first opening moves and receiving a reward
for winning the game, a couple of tens of moves might have been played. Decid-
ing how to give credit to the first moves – which did not get the immediate reward
for winning – is a difficult problem called the temporal credit assignment problem.
Each move in a winning chess game contributes more or less to the success of the
last move, although some moves along this path can be less optimal or even bad. A
related problem is the structural credit assignment problem, in which the problem is
to distribute feedback over the structure representing the agent’s policy. For exam-
ple, the policy can be represented by a structure containing parameters (e.g. a neural
network). Deciding which parameters have to be updated forms the structural credit
assignment problem.
The Exploration-Exploitation Trade-off.
If we know a complete model of the dynamics of the problem, there exist methods (e.g. DP) that can compute optimal policies from this model. However, in the more general case where we do not have access to this knowledge (e.g. RL), it becomes necessary to interact with the environment to learn a correct policy by trial-and-error. The agent has to explore the environment by performing actions and perceiving their consequences (i.e. the effects on the environment and the obtained rewards). The only feedback the agent gets consists of rewards; it does not get told what the right action would have been. At some point in time, it will have a policy with a particular performance. In order to see whether there are possible improvements to this policy, it sometimes has to try out various actions to see their results. This might result in worse performance, because these actions might be worse than the current policy. However, without trying them, it might never find possible improvements. In addition, if the world is not stationary, the agent has to explore to keep its policy up-to-date. So, in order to learn it has to explore, but in order to perform well it should exploit what it already knows. Balancing these two things is called the exploration-exploitation problem.
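A common way to balance the two is ε-greedy action selection: with a small probability ε the agent tries a random action (exploration), and otherwise it picks the action with the highest current value estimate (exploitation). The following minimal Python sketch illustrates the idea; the Q-value table and the list of applicable actions are hypothetical placeholders, not objects defined in the text above.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else a greedy one.

    Q is assumed to be a dict mapping (state, action) pairs to value
    estimates; `actions` is the list of actions applicable in `state`.
    """
    if random.random() < epsilon:
        return random.choice(actions)                                # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))        # exploit
```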
Feedback, Goals and Performance.
Compared to supervised learning, the amount of feedback the learning system gets in RL is much less. In supervised learning, for every learning sample the correct output is given in a training set. The performance of the learning system can be measured relative to the number of correct answers, resulting in a predictive accuracy. The difficulty lies in learning this mapping, and in whether this mapping generalizes to new, unclassified, examples. In unsupervised learning, the difficulty lies in constructing a useful partitioning of the data such that classes naturally arise. In reinforcement learning there is only some information available about performance, in the form of one scalar signal. This feedback is evaluative rather than instructive. Using this limited signal for feedback renders a need to put more effort into using it to evaluate and improve behavior during learning.
A second aspect about feedback and performance is related to the stochastic nature of the problem formulation. In supervised and unsupervised learning, the data is usually considered static, i.e. a data set is given and performance can be measured with respect to this data. The learning samples for the learner originate from a fixed distribution, i.e. the data set. From a RL perspective, the data can be seen as a moving target. The learning process is driven by the current policy, but this policy will change over time. That means that the distribution over states and rewards will change because of this. In machine learning the problem of a changing distribution of learning samples is termed concept drift [Maloof(2003)] and it demands special features to deal with it. In RL this problem is dealt with by exploration, a constant interaction between evaluation and improvement of policies, and additionally the use of learning rate adaptation schemes.
A third aspect of feedback is the question "where do the numbers come from?". In many sequential decision tasks, suitable reward functions present themselves quite naturally. For games in which there are winning, losing and draw situations, the reward function is easy to specify. In some situations special care has to be taken in giving rewards for states or actions, and their relative size is also important. When the agent will encounter a large negative reward before it finally gets a small positive reward, this positive reward might get overshadowed. All problems posed will have some optimal policy, but whether that policy tackles the right problem depends on whether the reward function is in accordance with the right goals. In some problems it can be useful to provide the agent with rewards for reaching intermediate subgoals. This can be helpful in problems which require very long action sequences.
Representations.
One of the most important aspects in learning sequential decision making is repre-
sentation. Two central issues are what should be represented, and how things should
be represented. The first issue is dealt with in this chapter. Key components that can
or should be represented are models of the dynamics of the environment, reward dis-
tributions, value functions and policies. For some algorithms all components are ex-
plicitly stored in tables, for example in classic DP algorithms. Actor-critic methods
keep separate, explicit representations of both value functions and policies. How-
ever, in most RL algorithms just a value function is represented whereas policy
decisions are derived from this value function online. Methods that search in policy
space do not represent value functions explicitly, but instead an explicitly repre-
sented policy is used to compute values when necessary. Overall, the choice for not
representing certain elements can influence the choice for a type of algorithm, and
its efficiency.
The question of how various structures can be represented is dealt with exten-
sively in this book, starting from the next chapter. Structures such as policies, tran-
sition functions and value functions can be represented in more compact form by us-
ing various structured knowledge representation formalisms and this enables much
more efficient solution mechanisms and scaling up to larger domains.
3 A Formal Framework
The elements of the RL problem as described in the introduction to this chapter can be formalized using the Markov decision process (MDP) framework. In this section we will formally describe components such as states, actions and policies, as well as the goals of learning using different kinds of optimality criteria. MDPs are extensively described in [Puterman(1994)] and [Boutilier et al(1999)]. They can be seen as stochastic extensions of finite automata and also as Markov processes augmented with actions.
Although general MDPs may have infinite (even uncountable) state and action spaces, we limit the discussion to finite-state and finite-action problems. In the next chapter we will encounter continuous spaces and in later chapters we will encounter situations arising in the first-order logic setting in which infinite spaces can quite naturally occur.
3.1 Markov Decision Processes.
MDPs consist of states, actions, transitions between states and a reward function
definition. We consider each of them in turn.
States.
The set of environmental states $S$ is defined as the finite set $\{s_1, \ldots, s_N\}$ where the size of the state space is $N$, i.e. $|S| = N$. A state is a unique characterization of all that is important in a state of the problem that is modeled. For example, in chess a complete configuration of board pieces of both black and white is a state. In the next chapter we will encounter the use of features that describe the state. In those contexts, it becomes necessary to distinguish between legal and illegal states, for some combinations of features might not result in an actually existing state in the problem. In this chapter, we will confine ourselves to the discrete state set $S$ in which each state is represented by a distinct symbol, and all states $s \in S$ are legal.
Actions.
The set of actions $A$ is defined as the finite set $\{a_1, \ldots, a_K\}$ where the size of the action space is $K$, i.e. $|A| = K$. Actions can be used to control the system state. The set of actions that can be applied in some particular state $s \in S$ is denoted $A(s)$, where $A(s) \subseteq A$. In some systems, not all actions can be applied in every state, but in general we will assume that $A(s) = A$ for all $s \in S$. In more structured representations (e.g. by means of features), the fact that some actions are not applicable in some states is modeled by a precondition function $pre: S \times A \rightarrow \{true, false\}$, stating whether action $a \in A$ is applicable in state $s \in S$.
The Transition Function.
By applying action $a \in A$ in a state $s \in S$, the system makes a transition from $s$ to a new state $s' \in S$, based on a probability distribution over the set of possible transitions. The transition function $T$ is defined as $T: S \times A \times S \rightarrow [0,1]$, i.e. the probability of ending up in state $s'$ after doing action $a$ in state $s$ is denoted $T(s,a,s')$. It is required that for all actions $a$, and all states $s$ and $s'$, $T(s,a,s') \geq 0$ and $T(s,a,s') \leq 1$. Furthermore, for all states $s$ and actions $a$, $\sum_{s' \in S} T(s,a,s') = 1$, i.e. $T$ defines a proper probability distribution over possible next states. Instead of a precondition function, it is also possible to set$^1$ $T(s,a,s') = 0$ for all states $s' \in S$ if $a$ is not applicable in $s$. For talking about the order in which actions occur, we will define a discrete global clock, $t = 1, 2, \ldots$. Using this, the notation $s_t$ denotes the state at time $t$ and $s_{t+1}$ denotes the state at time $t+1$. This enables us to compare different states (and actions) occurring ordered in time during interaction.
The system being controlled is Markovian if the result of an action does not depend on the previous actions and visited states (history), but only depends on the current state, i.e.
$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t) = T(s_t, a_t, s_{t+1})$$
The idea of Markovian dynamics is that the current state $s$ gives enough information to make an optimal decision; it is not important which states and actions preceded $s$. Another way of saying this is that if you select an action $a$, the probability distribution over next states is the same as the last time you tried this action in the same state. More general models can be characterized by being $k$-Markov, i.e. the last $k$ states are sufficient, such that Markov is actually 1-Markov. However, each $k$-Markov problem can be transformed into an equivalent Markov problem. The Markov property forms a boundary between the MDP and more general models such as POMDPs.
$^1$ Although this is the same, the explicit distinction between an action not being applicable in a state and a zero probability for transitions with that action is lost in this way.
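The usual transformation from a $k$-Markov problem to a 1-Markov problem is to augment the state with the last $k$ observations, so that the augmented state again contains all information needed for prediction. The short Python sketch below illustrates this idea; the helper name and the example observations are illustrative assumptions, not part of the text above.

```python
from collections import deque

def make_k_markov(k):
    """Return a function that maps a stream of observations to augmented
    states consisting of the last k observations: a 1-Markov description
    of a k-Markov process."""
    history = deque(maxlen=k)

    def augment(observation):
        history.append(observation)
        return tuple(history)

    return augment

# Hypothetical usage: feed observations one by one and use the returned
# tuples as states for any of the algorithms in this chapter.
augment = make_k_markov(k=3)
state = augment("obs-1")   # ('obs-1',)
state = augment("obs-2")   # ('obs-1', 'obs-2')
state = augment("obs-3")   # ('obs-1', 'obs-2', 'obs-3')
```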
The Reward Function.
The reward function$^2$ specifies rewards for being in a state, or doing some action in a state. The state reward function is defined as $R: S \rightarrow \mathbb{R}$, and it specifies the reward obtained in states. However, two other definitions exist. One can define either $R: S \times A \rightarrow \mathbb{R}$ or $R: S \times A \times S \rightarrow \mathbb{R}$. The first one gives rewards for performing an action in a state, and the second gives rewards for particular transitions between states. All definitions are interchangeable, though the last one is convenient in model-free algorithms (see Section 7), because there we usually need both the starting state and the resulting state in backing up values. Throughout this book we will mainly use $R(s,a,s')$, but deviate from this when more convenient.
The reward function is an important part of the MDP that specifies implicitly the goal of learning. For example, in episodic tasks such as the games Tic-Tac-Toe and chess, one can assign a positive reward value to all states in which the agent has won, a negative reward value to all states in which the agent loses, and a zero reward value to all states where the final outcome of the game is a draw. The goal of the agent is to reach positive valued states, which means winning the game. Thus, the reward function is used to give direction in which way the system, i.e. the MDP, should be controlled. Often, the reward function assigns non-zero reward to non-goal states as well, which can be interpreted as defining sub-goals for learning.
The Markov Decision Process.
Putting all elements together results in the definition of a Markov decision process, which will be the base model for the large majority of methods described in this book.

Definition 3.1 A Markov decision process is a tuple $\langle S, A, T, R \rangle$ in which $S$ is a finite set of states, $A$ a finite set of actions, $T$ a transition function defined as $T: S \times A \times S \rightarrow [0,1]$, and $R$ a reward function defined as $R: S \times A \times S \rightarrow \mathbb{R}$.
The transition function $T$ and the reward function $R$ together define the model of the MDP. MDPs are often depicted as a state transition graph, where the nodes correspond to states and (directed) edges denote transitions. A typical domain that is frequently used in the MDP literature is the maze [Matthews(1922)], in which the reward function assigns a positive reward for reaching the exit state.
$^2$ Although we talk about rewards here, with the usual connotation of something positive, the reward function merely gives a scalar feedback signal. This can be interpreted as negative (punishment) or positive (reward). The various origins of work on MDPs in the literature create additional confusion around the reward function. In the operations research literature, one usually speaks of a cost function instead, and the goal of learning and optimization is to minimize this function.
There are several distinct types of systems that can be modelled by this definition of an MDP. In episodic tasks, there is the notion of episodes of some length, where
the goal is to take the agent from a starting state to a goal state. An initial state distribution $I: S \rightarrow [0,1]$ gives for each state the probability of the system being started in that state. Starting from a state $s$, the system progresses through a sequence of states, based on the actions performed. In episodic tasks, there is a specific subset $G \subseteq S$, denoted the goal state area, containing states (usually with some distinct reward) where the process ends. We can furthermore distinguish between finite, fixed horizon tasks in which each episode consists of a fixed number of steps, indefinite horizon tasks in which each episode can end but episodes can have arbitrary length, and infinite horizon tasks where the system does not end at all. The last type of model is usually called a continuing task.
Episodic tasks, i.e. tasks in which there are so-called goal states, can be modelled using the same model defined in Definition 3.1. This is usually done by means of absorbing states or terminal states, i.e. states from which every action results in a transition to that same state with probability 1 and reward 0. Formally, for an absorbing state $s$, it holds that $T(s,a,s) = 1$ and $R(s,a,s') = 0$ for all states $s' \in S$ and actions $a \in A$. When entering an absorbing state, the process is reset and restarts in a new starting state. Episodic tasks and absorbing states can in this way be elegantly modelled in the same framework as continuing tasks.
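To make the formal definition concrete, a finite MDP can be stored directly as tables, which is also how the DP algorithms later in this chapter assume it is represented. The following Python sketch is one possible encoding; the two-state example values are made up for illustration and are not taken from the text.

```python
from typing import Dict, List, Tuple

State, Action = str, str

class MDP:
    """A finite MDP <S, A, T, R> stored as explicit tables."""
    def __init__(self,
                 states: List[State],
                 actions: List[Action],
                 T: Dict[Tuple[State, Action, State], float],
                 R: Dict[Tuple[State, Action, State], float]):
        self.states, self.actions, self.T, self.R = states, actions, T, R

    def transition(self, s: State, a: Action, s2: State) -> float:
        return self.T.get((s, a, s2), 0.0)   # unlisted transitions have probability 0

    def reward(self, s: State, a: Action, s2: State) -> float:
        return self.R.get((s, a, s2), 0.0)

# A tiny two-state example (illustrative numbers only):
mdp = MDP(states=["s1", "s2"], actions=["a1", "a2"],
          T={("s1", "a1", "s1"): 0.2, ("s1", "a1", "s2"): 0.8,
             ("s1", "a2", "s1"): 1.0,
             ("s2", "a1", "s2"): 1.0, ("s2", "a2", "s1"): 1.0},
          R={("s1", "a1", "s2"): 1.0})
```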
3.2 Policies
Given an MDP $\langle S, A, T, R \rangle$, a policy is a computable function that outputs for each state $s \in S$ an action $a \in A$ (or $a \in A(s)$). Formally, a deterministic policy $\pi$ is a function defined as $\pi: S \rightarrow A$. It is also possible to define a stochastic policy as $\pi: S \times A \rightarrow [0,1]$ such that for each state $s \in S$, it holds that $\pi(s,a) \geq 0$ and $\sum_{a \in A} \pi(s,a) = 1$. We will assume deterministic policies in this book unless stated otherwise.
Application of a policy to an MDP is done in the following way. First, a start state $s_0$ from the initial state distribution $I$ is generated. Then, the policy $\pi$ suggests the action $a_0 = \pi(s_0)$ and this action is performed. Based on the transition function $T$ and reward function $R$, a transition is made to state $s_1$, with probability $T(s_0, a_0, s_1)$, and a reward $r_0 = R(s_0, a_0, s_1)$ is received. This process continues, producing $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, \ldots$. If the task is episodic, the process ends in a state $s_{goal}$ and is restarted in a new state drawn from $I$. If the task is continuing, the sequence of states can be extended indefinitely.
The policy is part of the agent and its aim is to control the environment modeled as an MDP. A fixed policy induces a stationary transition distribution over the MDP which can be transformed into a Markov system$^3$ $\langle S', T' \rangle$ where $S' = S$ and $T'(s,s') = T(s,a,s')$ whenever $\pi(s) = a$.
$^3$ In other words, if $\pi$ is fixed, the system behaves as a stochastic transition system with a stationary distribution over states.
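The interaction just described, applying a fixed policy to an MDP and recording the resulting states and rewards, can be written as a short simulation loop. The sketch below samples one trajectory from the tabular MDP class shown earlier; the episode length cutoff and the helper for sampling a successor state are illustrative choices, not part of the formal framework.

```python
import random

def sample_next_state(mdp, s, a):
    """Sample s' according to the transition probabilities T(s, a, .)."""
    states = mdp.states
    probs = [mdp.transition(s, a, s2) for s2 in states]
    return random.choices(states, weights=probs, k=1)[0]

def rollout(mdp, policy, s0, max_steps=100):
    """Follow a deterministic policy (a dict state -> action) from s0."""
    trajectory, s = [], s0
    for _ in range(max_steps):
        a = policy[s]
        s2 = sample_next_state(mdp, s, a)
        r = mdp.reward(s, a, s2)
        trajectory.append((s, a, r))
        s = s2
    return trajectory
```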
3.3 Optimality Criteria and Discounting
In the previous sections, we have defined the environment (the MDP) and the agent (i.e. the controlling element, or policy). Before we can talk about algorithms for computing optimal policies, we have to define what that means. That is, we have to define what the model of optimality is. There are two ways of looking at optimality. First, there is the aspect of what is actually being optimized, i.e. what is the goal of the agent? Second, there is the aspect of how optimal the manner is in which that goal is being optimized. The first aspect is related to gathering reward and is treated in this section. The second aspect is related to the efficiency and optimality of algorithms, and this is briefly touched upon and dealt with more extensively in Section 5 and further.
The goal of learning in an MDP is to gather rewards. If the agent was only concerned about the immediate reward, a simple optimality criterion would be to optimize $E[r_t]$. However, there are several ways of taking the future into account in deciding how to behave now. There are basically three models of optimality in the MDP, which are sufficient to cover most of the approaches in the literature. They are strongly related to the types of tasks that were defined in Section 3.1.
$$\text{a)} \;\; E\left[\sum_{t=0}^{h} r_t\right] \qquad \text{b)} \;\; E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] \qquad \text{c)} \;\; \lim_{h \to \infty} E\left[\frac{1}{h}\sum_{t=0}^{h} r_t\right]$$
Fig. 2 Optimality: a) finite horizon, b) discounted, infinite horizon, c) average reward.
The finite horizon model simply takes a finite horizon of length $h$ and states that the agent should optimize its expected reward over this horizon, i.e. the next $h$ steps (see Figure 2a). One can think of this in two ways. The agent could in the first step take the $h$-step optimal action, after this the $(h-1)$-step optimal action, and so on. Another way is that the agent will always take the $h$-step optimal action, which is called receding-horizon control. The problem with this model, however, is that the (optimal) choice for the horizon length $h$ is not always known.
In the infinite-horizon model, the long-run reward is taken into account, but the rewards that are received in the future are discounted according to how far away in time they will be received. A discount factor $\gamma$, with $0 \leq \gamma < 1$, is used for this (see Figure 2b). Note that in this discounted case, rewards obtained later are discounted more than rewards obtained earlier. Additionally, the discount factor ensures that – even with an infinite horizon – the sum of the rewards obtained is finite. In episodic tasks, i.e. in tasks where the horizon is finite, the discount factor is not needed or can equivalently be set to 1. If $\gamma = 0$ the agent is said to be myopic, which means that it is only concerned about immediate rewards. The discount factor can be interpreted in several ways: as an interest rate, a probability of living another step, or a mathematical trick for bounding the infinite sum. The discounted, infinite-horizon model is mathematically more convenient, but conceptually similar to the finite horizon model. Most algorithms in this book use this model of optimality.
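The claim that the discounted sum is finite follows from a standard geometric series argument. Assuming rewards are bounded in absolute value by some constant $r_{\max}$ (an assumption not stated explicitly above), one obtains:
$$\left|\sum_{t=0}^{\infty} \gamma^t r_t\right| \;\leq\; \sum_{t=0}^{\infty} \gamma^t \, r_{\max} \;=\; \frac{r_{\max}}{1-\gamma}, \qquad 0 \leq \gamma < 1.$$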
A third optimality model is the average-reward model, maximizing the long-run average reward (see Figure 2c). Sometimes this is called the gain optimal policy, and in the limit, as the discount factor approaches 1, it is equal to the infinite-horizon discounted model. A difficult problem with this criterion is that we cannot distinguish between two policies of which one receives a lot of reward in the initial phases and another one which does not. This initial difference in reward is hidden in the long-run average. This problem can be solved by using a bias optimal model in which the long-run average is still being optimized, but policies are preferred if they additionally get extra reward initially. See [Mahadevan(1996)] for a survey on average reward RL.
Choosing between these optimality criteria can be related to the learning problem. If the length of the episode is known, the finite-horizon model is best. However, when this is not known, or when the task is continuing, the infinite-horizon model is more suitable. [Koenig and Liu(2002)] gives an extensive overview of different ways of modeling MDPs and their relationship with optimality.
The second kind of optimality in this section is related to the more general aspect
of the optimality of the learning process itself. We will encounter various concepts
in the remainder of this book. We will briefly summarize three important notions
here.
Learning optimality can be explained in terms of what the end result of learning
might be. A first concern is whether the agent is able to obtain optimal performance
in principle. For some algorithms there are proofs stating this, but for some not. In
other words, is there a way to ensure that the learning process will reach a global
optimum, or merely a local optimum, or even an oscillation between performances?
A second kind of optimality is related to the speed of converging to a solution. We
can distinguish between two learning methods by looking at how many interactions
are needed, or how much computation is needed per interaction. And related to that,
what will the performance be after a certain period of time? In supervised learning
the optimality criterion is often defined in terms of predictive accuracy which is
different from optimality in the MDP setting. Also, it is important to look at how
much experimentation is necessary, or even allowed, for reaching optimal behavior.
For example, a learning robot or helicopter might not be allowed to make many
mistakes during learning. A last kind of optimality is related to how much reward is
not obtained by the learned policy, as compared to an optimal one. This is usually
called the regret of a policy.
4 Value Functions and Bellman Equations
In the preceding sections we have defined MDPs and optimality criteria that can be
useful for learning optimal policies. In this section we define value functions, which
are a way to link the optimality criteria to policies. Most learning algorithms for
MDPs compute optimal policies by learning value functions. A value function represents an estimate of how good it is for the agent to be in a certain state (or how good it is to perform a certain action in that state). The notion of how good is expressed in terms of an optimality criterion, i.e. in terms of the expected return. Value functions are defined for particular policies.
The value of a state $s$ under policy $\pi$, denoted $V^{\pi}(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. We will use the infinite-horizon, discounted model in this section, such that this can be expressed$^4$ as:
$$V^{\pi}(s) = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\Big|\; s_t = s \right\} \qquad (1)$$
A similar state-action value function $Q: S \times A \rightarrow \mathbb{R}$ can be defined as the expected return starting from state $s$, taking action $a$, and thereafter following policy $\pi$:
$$Q^{\pi}(s,a) = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\Big|\; s_t = s, a_t = a \right\}$$
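These expectations can be approximated directly from sampled interaction. A minimal sketch, assuming trajectories are given as lists of (state, action, reward) tuples such as those produced by the rollout sketch earlier, computes the discounted return from a given time step and a simple first-visit Monte Carlo estimate of $V^{\pi}(s)$:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^k * r_{t+k} over a finite (truncated) reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def mc_value_estimate(trajectories, state, gamma):
    """First-visit Monte Carlo estimate of V^pi(state) from sampled trajectories."""
    returns = []
    for traj in trajectories:
        for t, (s, a, r) in enumerate(traj):
            if s == state:                       # first visit of `state`
                rewards = [step[2] for step in traj[t:]]
                returns.append(discounted_return(rewards, gamma))
                break
    return sum(returns) / len(returns) if returns else 0.0
```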
One fundamental property of value functions is that they satisfy certain recursive properties. For any policy $\pi$ and any state $s$ the expression in Equation 1 can recursively be defined in terms of a so-called Bellman Equation [Bellman(1957)]:
$$V^{\pi}(s) = E_{\pi}\left\{ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots \;\Big|\; s_t = s \right\}
= E_{\pi}\left\{ r_t + \gamma V^{\pi}(s_{t+1}) \;\Big|\; s_t = s \right\}
= \sum_{s'} T(s, \pi(s), s') \left( R(s, \pi(s), s') + \gamma V^{\pi}(s') \right) \qquad (2)$$
It denotes that the expected value of a state is defined in terms of the immediate reward and the values of possible next states, weighted by their transition probabilities, and additionally a discount factor. $V^{\pi}$ is the unique solution to this set of equations. Note that multiple policies can have the same value function, but for a given policy $\pi$, $V^{\pi}$ is unique.
The goal for any given MDP is to find a best policy, i.e. the policy that receives the most reward. This means maximizing the value function of Equation 1 for all states $s \in S$. An optimal policy, denoted $\pi^*$, is such that $V^{\pi^*}(s) \geq V^{\pi}(s)$ for all $s \in S$ and all policies $\pi$. It can be proven that the optimal solution $V^* = V^{\pi^*}$ satisfies the following equation:
$$V^*(s) = \max_{a \in A} \sum_{s' \in S} T(s,a,s') \left( R(s,a,s') + \gamma V^*(s') \right) \qquad (3)$$
$^4$ Note that we use $E_{\pi}$ for the expected value under policy $\pi$.
This expression is called the Bellman optimality equation. It states that the value of a state under an optimal policy must be equal to the expected return for the best action in that state. To select an optimal action given the optimal state value function $V^*$, the following rule can be applied:
$$\pi^*(s) = \arg\max_{a} \sum_{s' \in S} T(s,a,s') \left( R(s,a,s') + \gamma V^*(s') \right) \qquad (4)$$
We call this policy the greedy policy, denoted $\pi_{greedy}(V)$, because it greedily selects the best action using the value function $V$. An analogous optimal state-action value function is:
$$Q^*(s,a) = \sum_{s'} T(s,a,s') \left( R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right)$$
Q-functions are useful because they make the weighted summation over different alternatives (such as in Equation 4) using the transition function unnecessary. No forward-reasoning step is needed to compute an optimal action in a state. This is the reason that in model-free approaches, i.e. in case $T$ and $R$ are unknown, Q-functions are learned instead of V-functions. The relation between $Q^*$ and $V^*$ is given by
$$Q^*(s,a) = \sum_{s' \in S} T(s,a,s') \left( R(s,a,s') + \gamma V^*(s') \right) \qquad (5)$$
$$V^*(s) = \max_{a} Q^*(s,a) \qquad (6)$$
Now, analogously to Equation 4, optimal action selection can be simply put as:
$$\pi^*(s) = \arg\max_{a} Q^*(s,a) \qquad (7)$$
That is, the best action is the action that has the highest expected utility based on the possible next states resulting from taking that action. One can, analogously to the expression in Equation 4, define a greedy policy $\pi_{greedy}(Q)$ based on $Q$. In contrast to $\pi_{greedy}(V)$, there is no need to consult the model of the MDP; the Q-function suffices.
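The practical difference between Equations 4 and 7 shows up directly in code: acting greedily with respect to $V$ requires the transition and reward model for a one-step lookahead, whereas acting greedily with respect to $Q$ does not. A sketch, reusing the tabular MDP class introduced earlier (the Q-table representation is again an illustrative assumption):

```python
def greedy_from_V(mdp, V, s, gamma):
    """Equation 4: one-step lookahead through the model T and R."""
    def backup(a):
        return sum(mdp.transition(s, a, s2) *
                   (mdp.reward(s, a, s2) + gamma * V[s2])
                   for s2 in mdp.states)
    return max(mdp.actions, key=backup)

def greedy_from_Q(Q, s, actions):
    """Equation 7: no model needed, just the Q-table."""
    return max(actions, key=lambda a: Q[(s, a)])
```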
5 Solving Markov Decision Processes
Now that we have defined MDPs, policies, optimality criteria and value functions, it is time to consider the question of how to compute optimal policies. Solving a given MDP means computing an optimal policy $\pi^*$. Several dimensions exist along which algorithms have been developed for this purpose. The most important distinction is that between model-based and model-free algorithms.
Model-based algorithms exist under the general name of DP. The basic assumption in these algorithms is that a model of the MDP is known beforehand, and can be used to compute value functions and policies using the Bellman equation (see Equation 3). Most methods are aimed at computing state value functions, which can, in the presence of the model, be used for optimal action selection. In this chapter we will focus on iterative procedures for computing value functions and policies.
Model-free algorithms, under the general name of RL, do not rely on the availability of a perfect model. Instead, they rely on interaction with the environment, i.e. a simulation of the policy, thereby generating samples of state transitions and rewards. These samples are then used to estimate state-action value functions. Because a model of the MDP is not known, the agent has to explore the MDP to obtain information. This naturally induces an exploration-exploitation trade-off which has to be balanced to obtain an optimal policy.
A very important underlying mechanism, the so-called generalized policy iteration (GPI) principle, present in all methods, is depicted in Figure 3. This principle consists of two interacting processes. The policy evaluation step estimates the utility of the current policy $\pi$, that is, it computes $V^{\pi}$. There are several ways of computing this. In model-based algorithms, one can use the model to compute it directly or iteratively approximate it. In model-free algorithms, one can simulate the policy and estimate its utility from the sampled execution traces. The main purpose of this step is to gather information about the policy for computing the second step, the policy improvement step. In this step, the values of the actions are evaluated for every state, in order to find possible improvements, i.e. possible other actions in particular states that are better than the action the current policy proposes. This step computes an improved policy $\pi'$ from the current policy $\pi$ using the information in $V^{\pi}$. Both the evaluation and the improvement steps can be implemented in various ways, and interleaved in several distinct ways. The bottom line is that there is a policy that drives value learning, i.e. it determines the value function, but in turn there is a value function that can be used by the policy to select good actions.
Note that it is also possible to have an implicit representation of the policy, which means that only the value function is stored, and a policy is computed on-the-fly for each state based on the value function when needed. This is common practice in model-free algorithms (see Section 7). And vice versa, it is also possible to have implicit representations of value functions in the context of an explicit policy representation. Another interesting aspect is that, in general, a value function does not have to be perfectly accurate. In many cases it suffices that sufficient distinction is present between suboptimal and optimal actions, such that small errors in values do not influence policy optimality. This is also important in the approximation and abstraction methods discussed in the next chapter.
Planning as a RL Problem.
The MDP formalism is a general formalism for decision-theoretic planning, which entails that standard (deterministic) planning problems can be formalized as such too. All the algorithms in this chapter can – in principle – be used for these planning problems too. In order to solve planning problems in the MDP framework we
Fig. 3 a) The algorithms in Section 5 can be seen as instantiations of Generalized Policy Iteration (GPI) [Sutton and Barto(1998)]. The policy evaluation step estimates $V^{\pi}$, the policy's performance. The policy improvement step improves the policy $\pi$ based on the estimates in $V^{\pi}$. b) The gradual convergence of both the value function and the policy to optimal versions.
have to specify goals and rewards. We can assume that the transition function $T$ is given, accompanied by a precondition function. In planning we are given a goal function $G: S \rightarrow \{true, false\}$ that defines which states are goal states. The planning task is to compute a sequence of actions $a_t, a_{t+1}, \ldots, a_{t+n}$ such that applying this sequence from a start state will lead to a state $s \in G$. All transitions are assumed to be deterministic, i.e. for all states $s \in S$ and actions $a \in A$ there exists only one state $s' \in S$ such that $T(s,a,s') = 1$. All states in $G$ are assumed to be absorbing. The only thing left is to specify the reward function. We can specify this in such a way that a positive reinforcement is received once a goal state is reached, and zero otherwise:
$$R(s_t, a_t, s_{t+1}) = \begin{cases} 1, & \text{if } s_t \notin G \text{ and } s_{t+1} \in G \\ 0, & \text{otherwise} \end{cases}$$
Now, depending on whether the transition function and reward function are known
to the agent, one can solve this planning task with either model-based or model-free
learning. The difference with classic planning is that the learned policy will apply
to all states.
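As a small illustration, the goal-based reward function above is straightforward to express in code; the goal test `is_goal` is a hypothetical predicate standing in for the goal function $G$:

```python
def planning_reward(s, a, s_next, is_goal):
    """Reward 1 exactly on the transition that first enters the goal set G."""
    return 1.0 if (not is_goal(s)) and is_goal(s_next) else 0.0
```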
6 Dynamic Programming: Model-based Solution Techniques
The term DP refers to a class of algorithms that is able to compute optimal policies in the presence of a perfect model of the environment. The assumption that a model is available will be hard to ensure for many applications. However, we will see that from a theoretical viewpoint, as well as from an algorithmic viewpoint, DP methods are very relevant because they define fundamental computational mechanisms which are also used when no model is available. The methods in this section all assume a standard MDP $\langle S, A, T, R \rangle$, where the state and action sets are finite and discrete such that they can be stored in tables. Furthermore, transition, reward and value functions are assumed to store values for all states and actions separately.
6.1 Fundamental DP Algorithms
Two core DP methods are policy iteration [Howard(1960)] and value iteration [Bellman(1957)]. In the first, the GPI mechanism is clearly separated into two steps, whereas the second represents a tight integration of policy evaluation and improvement. We will consider both of these algorithms in turn.
6.1.1 Policy Iteration
Policy iteration (PI) [Howard(1960)] iterates between the two phases of GPI. The policy evaluation phase computes the value function of the current policy and the policy improvement phase computes an improved policy by a maximization over the value function. This is repeated until the process converges to an optimal policy.
Policy Evaluation: The Prediction Problem.
A first step is to find the value function $V^{\pi}$ of a fixed policy $\pi$. This is called the prediction problem. It is a part of the complete problem, that of computing an optimal policy. Remember from the previous sections that for all $s \in S$,
$$V^{\pi}(s) = \sum_{s' \in S} T(s, \pi(s), s') \left( R(s, \pi(s), s') + \gamma V^{\pi}(s') \right) \qquad (8)$$
If the dynamics of the system are known, i.e. a model of the MDP is given, then these equations form a system of $|S|$ equations in $|S|$ unknowns (the values of $V^{\pi}$ for each $s \in S$). This can be solved by linear programming (LP). However, an iterative procedure is possible, and in fact common in DP and RL. The Bellman equation is transformed into an update rule which updates the current value function $V^{\pi}_k$ into $V^{\pi}_{k+1}$ by 'looking one step further in the future', thereby extending the planning horizon by one step:
$$V^{\pi}_{k+1}(s) = E_{\pi}\left\{ r_t + \gamma V^{\pi}_k(s_{t+1}) \;\Big|\; s_t = s \right\}
= \sum_{s'} T(s, \pi(s), s') \left( R(s, \pi(s), s') + \gamma V^{\pi}_k(s') \right) \qquad (9)$$
The sequence of approximations $V^{\pi}_k$ can be shown to converge as $k$ goes to infinity. In order to converge, the update rule is applied to each state $s \in S$ in each iteration. It replaces the old value for that state by a new one that is based on the expected value of possible successor states and intermediate rewards, weighted by the transition probabilities. This operation is called a full backup because it is based on all possible transitions from that state.
A more general formulation can be given by defining a backup operator $B^{\pi}$ over arbitrary real-valued functions $\varphi$ over the state space (e.g. a value function):
$$(B^{\pi}\varphi)(s) = \sum_{s' \in S} T(s, \pi(s), s') \left( R(s, \pi(s), s') + \gamma \varphi(s') \right) \qquad (10)$$
The value function $V^{\pi}$ of a fixed policy $\pi$ is a fixed point of this backup operator, i.e. $V^{\pi} = B^{\pi}V^{\pi}$. A useful special case of this backup operator is defined with respect to a fixed action $a$:
$$(B^a\varphi)(s) = R(s) + \gamma \sum_{s' \in S} T(s,a,s') \varphi(s')$$
Now LP for solving the prediction problem can be stated as follows. Computing $V^{\pi}$ can be accomplished by solving the Bellman equations (see Equation 3) for all states. The optimal value function $V^*$ can be found by using an LP solver that computes $V^* = \arg\min_V \sum_s V(s)$ subject to $V(s) \geq (B^a V)(s)$ for all $a$ and $s$.
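For a fixed policy, the system of $|S|$ linear equations can also be solved directly, without iteration, since $V^{\pi} = R^{\pi} + \gamma T^{\pi} V^{\pi}$ implies $V^{\pi} = (I - \gamma T^{\pi})^{-1} R^{\pi}$. A small numpy sketch of this exact policy evaluation, assuming the tabular MDP class from earlier and a deterministic policy given as a dict:

```python
import numpy as np

def evaluate_policy_exact(mdp, policy, gamma):
    """Solve (I - gamma * T_pi) V = R_pi for V, where T_pi and R_pi are the
    transition matrix and expected one-step reward under the fixed policy."""
    n = len(mdp.states)
    idx = {s: i for i, s in enumerate(mdp.states)}
    T_pi = np.zeros((n, n))
    R_pi = np.zeros(n)
    for s in mdp.states:
        a = policy[s]
        for s2 in mdp.states:
            p = mdp.transition(s, a, s2)
            T_pi[idx[s], idx[s2]] = p
            R_pi[idx[s]] += p * mdp.reward(s, a, s2)
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    return {s: V[idx[s]] for s in mdp.states}
```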
Policy Improvement.
Now that we know the value function $V^{\pi}$ of a policy $\pi$ as the outcome of the policy evaluation step, we can try to improve the policy. First we identify the value of all actions by using:
$$Q^{\pi}(s,a) = E_{\pi}\left\{ r_t + \gamma V^{\pi}(s_{t+1}) \;\Big|\; s_t = s, a_t = a \right\} \qquad (11)$$
$$= \sum_{s'} T(s,a,s') \left( R(s,a,s') + \gamma V^{\pi}(s') \right) \qquad (12)$$
If now $Q^{\pi}(s,a)$ is larger than $V^{\pi}(s)$ for some $a \in A$, then we could do better by choosing action $a$ instead of the current $\pi(s)$. In other words, we can improve the current policy by selecting a different, better, action in a particular state. In fact, we can evaluate all actions in all states and choose the best action in all states. That is, we can compute the greedy policy $\pi'$ by selecting the best action in each state, based on the current value function $V^{\pi}$:
$$\pi'(s) = \arg\max_{a} Q^{\pi}(s,a)
= \arg\max_{a} E\left\{ r_t + \gamma V^{\pi}(s_{t+1}) \;\Big|\; s_t = s, a_t = a \right\}
= \arg\max_{a} \sum_{s'} T(s,a,s') \left( R(s,a,s') + \gamma V^{\pi}(s') \right) \qquad (13)$$
Require: $V(s) \in \mathbb{R}$ and $\pi(s) \in A(s)$ arbitrarily for all $s \in S$
{POLICY EVALUATION}
repeat
    $\Delta := 0$
    for each $s \in S$ do
        $v := V(s)$
        $V(s) := \sum_{s'} T(s, \pi(s), s') \left( R(s, \pi(s), s') + \gamma V(s') \right)$
        $\Delta := \max(\Delta, |v - V(s)|)$
until $\Delta < \sigma$
{POLICY IMPROVEMENT}
policy-stable := true
for each $s \in S$ do
    $b := \pi(s)$
    $\pi(s) := \arg\max_a \sum_{s'} T(s,a,s') \left( R(s,a,s') + \gamma V(s') \right)$
    if $b \neq \pi(s)$ then policy-stable := false
if policy-stable then stop; else go to POLICY EVALUATION
Algorithm 1: Policy Iteration [Howard(1960)]
Computing an improved policy by greedily selecting the best action with respect to the value function of the original policy is called policy improvement. If the policy cannot be improved in this way, it means that the policy is already optimal and its value function satisfies the Bellman equation for the optimal value function. In a similar way one can also perform these steps for stochastic policies by blending the action probabilities into the expectation operator.
Summarizing, policy iteration [Howard(1960)] starts with an arbitrarily initialized policy $\pi_0$. Then a sequence of iterations follows in which the current policy is evaluated, after which it is improved. The first step, the policy evaluation step, computes $V^{\pi_k}$, making use of Equation 9 in an iterative way. The second step, the policy improvement step, computes $\pi_{k+1}$ from $\pi_k$ using $V^{\pi_k}$. For each state, using Equation 4, the optimal action is determined. If for all states $s$, $\pi_{k+1}(s) = \pi_k(s)$, the policy is stable and the policy iteration algorithm can stop. Policy iteration generates a sequence of alternating policies and value functions
$$\pi_0 \rightarrow V^{\pi_0} \rightarrow \pi_1 \rightarrow V^{\pi_1} \rightarrow \pi_2 \rightarrow V^{\pi_2} \rightarrow \pi_3 \rightarrow V^{\pi_3} \rightarrow \ldots \rightarrow \pi^*$$
The complete algorithm can be found in Algorithm 1.
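A compact Python rendering of Algorithm 1 is sketched below, again assuming the tabular MDP class introduced earlier; the stopping threshold sigma and the arbitrary initial policy are illustrative choices.

```python
def policy_iteration(mdp, gamma=0.9, sigma=1e-6):
    """Policy iteration: iterative policy evaluation + greedy improvement."""
    V = {s: 0.0 for s in mdp.states}
    policy = {s: mdp.actions[0] for s in mdp.states}   # arbitrary initial policy

    def backup(s, a):
        return sum(mdp.transition(s, a, s2) *
                   (mdp.reward(s, a, s2) + gamma * V[s2])
                   for s2 in mdp.states)

    while True:
        # Policy evaluation: sweep until the value function stops changing.
        while True:
            delta = 0.0
            for s in mdp.states:
                v = V[s]
                V[s] = backup(s, policy[s])
                delta = max(delta, abs(v - V[s]))
            if delta < sigma:
                break
        # Policy improvement: act greedily with respect to V.
        policy_stable = True
        for s in mdp.states:
            b = policy[s]
            policy[s] = max(mdp.actions, key=lambda a: backup(s, a))
            if b != policy[s]:
                policy_stable = False
        if policy_stable:
            return policy, V
```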
For finite MDPs, i.e. when state and action spaces are finite, policy iteration converges after a finite number of iterations. Each policy $\pi_{k+1}$ is a strictly better policy than $\pi_k$, unless $\pi_k = \pi^*$, in which case the algorithm stops. And because for a finite MDP the number of different policies is finite, policy iteration converges in finite time. In practice, it usually converges after a small number of iterations.
Although policy iteration computes the optimal policy for a given MDP in finite time, it is relatively inefficient. In particular the first step, the policy evaluation step, is computationally expensive. Value functions for all intermediate policies $\pi_0, \ldots, \pi_k, \ldots, \pi^*$ are computed, which involves multiple sweeps through the complete state space per iteration. A bound on the number of iterations is not known [Littman et al(1995)] and depends on the MDP transition structure, but in practice policy iteration often converges after few iterations.

Require: initialize $V$ arbitrarily (e.g. $V(s) := 0$ for all $s \in S$)
repeat
    $\Delta := 0$
    for each $s \in S$ do
        $v := V(s)$
        for each $a \in A(s)$ do
            $Q(s,a) := \sum_{s'} T(s,a,s') \left( R(s,a,s') + \gamma V(s') \right)$
        $V(s) := \max_a Q(s,a)$
        $\Delta := \max(\Delta, |v - V(s)|)$
until $\Delta < \sigma$
Algorithm 2: Value Iteration [Bellman(1957)]
6.1.2 Value Iteration
The policy iteration algorithm completely separates the evaluation and improve-
ment phases. In the evaluation step, the value function must be computed in the
limit. However, it is not necessary to wait for full convergence, but it is possible to
stop evaluating earlier and improve the policy based on the evaluation so far. The ex-
treme point of truncating the evaluation step is the value iteration [Bellman(1957)]
algorithm. It breaks off evaluation after just one iteration. In fact, it immediately
blends the policy improvement step into its iterations, thereby purely focusing on
estimating directly the value function. Necessary updates are computed on-the-fly.
In essence, it combines a truncated version of the policy evaluation step with the
policy improvement step, which is essentially Equation 3 turned into one update
rule:
$$V_{t+1}(s) = \max_{a} \sum_{s'} T(s,a,s') \left( R(s,a,s') + \gamma V_t(s') \right) \qquad (14)$$
$$= \max_{a} Q_{t+1}(s,a). \qquad (15)$$
Using Equations (14) and (15), the value iteration algorithm (see Algorithm 2) can be stated as follows: starting with a value function $V_0$ over all states, one iteratively updates the value of each state according to (14) to get the next value functions $V_t$ ($t = 1, 2, 3, \ldots$). It produces the following sequence of value functions:
$$V_0 \rightarrow V_1 \rightarrow V_2 \rightarrow V_3 \rightarrow V_4 \rightarrow V_5 \rightarrow V_6 \rightarrow V_7 \rightarrow \ldots \rightarrow V^*$$
Actually, in the way it is computed, it also produces the intermediate Q-value functions, such that the sequence is
$$V_0 \rightarrow Q_1 \rightarrow V_1 \rightarrow Q_2 \rightarrow V_2 \rightarrow Q_3 \rightarrow V_3 \rightarrow Q_4 \rightarrow V_4 \rightarrow \ldots \rightarrow V^*$$
Value iteration is guaranteed to converge in the limit towards V^*, i.e. the Bellman optimality Equation (3) holds for each state. A deterministic policy π^* for all states s ∈ S can be computed using Equation 4. If we use the same general backup operator mechanism used in the previous section, we can define value iteration in the following way:
(B^* φ)(s) = max_a Σ_{s' ∈ S} T(s,a,s') [ R(s,a,s') + γ φ(s') ]   (16)
The backup operator B^* functions as a contraction mapping on the value function. If we let π^* denote the optimal policy and V^* its value function, we have the fixed-point relationship V^* = B^*V^*, where (B^*V)(s) = max_a (B_a V)(s). If we define Q^*(s,a) = B_a V^*, then we have π^*(s) = π_greedy(V^*)(s) = argmax_a Q^*(s,a). That is, the algorithm starts with an arbitrary value function V_0, after which it iterates V_{t+1} = B^*V_t until ||V_{t+1} − V_t||_S < ε, i.e. until the distance between subsequent value function approximations is small enough.
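As an illustration of Equations (14)-(16) and Algorithm 2, a value iteration loop might look as follows. This is a sketch under the same assumed dictionary representation of the MDP as in the policy iteration sketch above; the epsilon threshold plays the role of σ in Algorithm 2, and the synchronous update corresponds to iterating V_{t+1} = B^*V_t.

def value_iteration(T, gamma=0.95, epsilon=1e-8):
    """Iterate the optimal backup operator until successive value functions are close."""
    V = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        V_new = {}
        for s in T:
            # Q(s,a) = sum_{s'} T(s,a,s') [R(s,a,s') + gamma V(s')]
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a]) for a in T[s]}
            V_new[s] = max(q.values())
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new                      # synchronous (Jacobi-style) update
        if delta < epsilon:
            break
    # extract a deterministic greedy policy from V (Equation 4 style)
    policy = {s: max(T[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a]))
              for s in T}
    return V, policy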
6.2 Efficient DP Algorithms
The policy iteration and value iteration algorithms can be seen as spanning a spec-
trum of DP approaches. This spectrum ranges from complete separation of eval-
uation and improvement steps to a complete integration of these steps. Clearly, in between these extreme points there is much room for variations on algorithms. Let us first
consider the computational complexity of the extreme points.
Complexity.
Value iteration works by producing successive approximations of the optimal value function. Each iteration can be performed in O(|A||S|^2) steps, or faster if T is sparse. However, the number of iterations can grow exponentially in the discount factor, cf. [Bertsekas and Tsitsiklis(1996)]. This follows from the fact that a larger γ implies that a longer sequence of future rewards has to be taken into account, and hence a larger number of value iteration steps, because each step only extends the horizon taken into account in V by one step. The complexity of value iteration is linear in the number of actions and quadratic in the number of states, but usually the transition matrix is sparse. In practice policy iteration converges in far fewer iterations, but each evaluation step is expensive: each iteration has a complexity of O(|A||S|^2 + |S|^3), which can grow large quickly. A worst-case bound on the number of iterations is not known [Littman et al(1995)Littman, Dean, and Kaelbling].
Linear programming is a common tool that can be used for the evaluation
too. In general, the number of iterations and value backups can quickly grow ex-
tremely large when the problem size grows. The state spaces of games such as
backgammon and chess consist of too many states to perform just one full sweep.
In this section we will describe some efficient variations on DP approaches. De-
tailed coverage of complexity results for the solution of MDPs can be found in
[Littman et al(1995)Littman, Dean, and Kaelbling, Bertsekas and Tsitsiklis(1996),
Boutilier et al(1999)Boutilier, Dean, and Hanks].
The efficiency of DP can be improved along roughly two lines. The first is a tighter integration of the evaluation and improvement steps of the GPI process. We will discuss this issue briefly in the next section. The second is the use of (heuristic) search algorithms in combination with DP algorithms. For example, using search as an exploration mechanism can highlight important parts of the state space such that value backups can be concentrated on these parts. This is the underlying mechanism used in the methods discussed briefly in Section 6.2.2.
6.2.1 Styles of Updating
The full backup updates in DP algorithms can be done in several ways. We have
assumed in the description of the algorithms that in each step an old and a new
value function are kept in memory. Each update puts a new value in the new ta-
ble, based on the information of the old. This is called synchronous, or Jacobi-
style updating [Sutton and Barto(1998)]. This is useful for explanation of algo-
rithms and theoretical proofs of convergence. However, there are two more com-
mon ways for updates. One can keep a single table and do the updating directly
in there. This is called in-place updating [Sutton and Barto(1998)] or Gauss-Seidel
[Bertsekas and Tsitsiklis(1996)] and usually speeds up convergence, because dur-
ing one sweep of updates, some updates use already newly updated values of other
states. Another type of updating is called asynchronous updating which is an ex-
tension of the in-place updates, but here updates can be performed in any order.
An advantage is that the updates may be distributed unevenly throughout the state(-
action) space, with more updates being given to more important parts of this space.
For all these methods convergence can be proved under the general condition that
values are updated infinitely often but with a finite frequency.
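The difference between these updating styles is easy to see in code. The sketch below, under the same assumed MDP dictionaries as before, contrasts a synchronous (Jacobi-style) sweep, which reads only the old table, with an in-place (Gauss-Seidel) sweep, which lets later backups in the same sweep see values that were just updated.

def jacobi_sweep(V, T, gamma=0.95):
    """Synchronous sweep: every backup reads the old table, results go to a new one."""
    return {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a]) for a in T[s])
            for s in T}

def gauss_seidel_sweep(V, T, gamma=0.95):
    """In-place sweep: later backups in the same sweep already see the new values."""
    for s in T:
        V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a]) for a in T[s])
    return V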
Modified Policy Iteration.
Modified policy iteration (MPI) [Puterman and Shin(1978)] strikes a middle ground between value and policy iteration. MPI maintains the two separate steps of GPI, but neither step is necessarily computed in the limit. The key insight here is that for policy improvement, one does not need an exactly evaluated policy in order to improve it. For example, the policy estimation step can be approximate, after which a policy improvement step can follow. In general, both steps can be performed quite independently by different means. For example, instead of iteratively applying the Bellman update rule from Equation 15, one can perform the policy estimation step by using a sampling procedure such as Monte Carlo estimation [Sutton and Barto(1998)]. These general schemes, in which estimation and improvement are mixed, are captured by the generalized policy iteration mechanism depicted in Figure 3. Policy iteration and value iteration are the two extreme cases of modified policy iteration, and MPI is a general method for asynchronous updating.
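As a rough sketch of this idea, and reusing the assumed MDP dictionaries from the earlier sketches, modified policy iteration can be written as a loop that performs only m evaluation sweeps before each greedy improvement; small m brings the scheme close to value iteration, while large m approaches policy iteration.

def modified_policy_iteration(T, gamma=0.95, m=5, iterations=100):
    """GPI with a truncated evaluation step: m evaluation sweeps per greedy improvement."""
    V = {s: 0.0 for s in T}
    policy = {s: next(iter(T[s])) for s in T}

    def q(s, a):
        # one-step lookahead value of action a in state s under the current V
        return sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])

    for _ in range(iterations):
        for _ in range(m):                               # truncated (approximate) evaluation
            V = {s: q(s, policy[s]) for s in T}
        policy = {s: max(T[s], key=lambda a: q(s, a)) for s in T}  # greedy improvement
    return policy, V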
6.2.2 Heuristics and Search
In many realistic problems, only a fraction of the state space is relevant to the prob-
lem of reaching the goal state from some state s. This has inspired a number of
algorithms that focus computation on states that seem most relevant for finding an
optimal policy from a start state s. These algorithms usually display good anytime
behavior, i.e. they produce good or reasonable policies fast, after which they are
gradually improved. In addition, they can be seen as implementing various ways of
asynchronous DP.
Envelopes and Fringe States.
One form of asynchronous method is the PLEXUS system
[Dean et al(1995)Dean, Kaelbling, Kirman, and Nicholson]. It was designed
for goal-based reward functions, i.e. episodic tasks in which only goal states get
positive reward. It starts with an approximated version of the MDP in which not the
full state space is contained. This smaller version of the MDP is called an envelope
and it includes the agent’s current state and the goal state. A special OUT state
represents all the states outside the envelope. The initial envelope is constructed
by a forward search until a goal state is found. The envelope can be extended by
considering states outside the envelope that can be reached with high probability.
The intuitive idea is to include in the envelope all the states that are likely to be
reached on the way to the goal. Once the envelope has been constructed a policy is
computed through policy iteration. If at any point the agent leaves the envelope, it
has to replan by extending the envelope. This combination of learning and planning
still uses policy iteration, but on a much smaller (and presumably more relevant
with respect to the goal) state space.
A related method proposed in [Tash and Russell(1994)] considers goal-based
tasks too. However, instead of the single OUT state, they keep a fringe of states
on the edge of the envelope and use a heuristic to estimate values of the other states.
When computing a policy for the envelope, all fringe states become absorbing states
with the heuristic set as their value. Over time the heuristic values of the fringe states
converge to the optimal values of those states.
Similar to the previous methods, the LAO* algorithm [Hansen and Zilberstein(2001)] also alternates between an expansion phase and a policy generation phase. It too keeps a fringe of states outside the envelope, such that expansions can be larger than in the envelope method of [Dean et al(1995)Dean, Kaelbling, Kirman, and Nicholson]. The motivation behind LAO* was to extend the classic search algorithm AO*, cf. [?], to cyclic domains such as MDPs.
Search and Planning in DP.
Real-time dynamic programming (RTDP) [Barto et al(1995)Barto, Bradtke, and Singh] also combines forward search with DP. It is used as an alternative to value iteration in which only a subset of the values in the state space is backed up in each iteration. RTDP performs trials from a randomly selected state to a goal state, by simulating the greedy policy using an admissible heuristic function as the initial value function. It then backs up values fully only along these trials, such that backups are concentrated on the relevant parts of the state space. The approach was later extended into labeled RTDP [Bonet and Geffner(2003b)], where some states are labeled as solved, which means that their value has already converged. Furthermore, it was recently extended to bounded RTDP [Brendan McMahan et al(2005)Brendan McMahan, Likhachev, and Gordon], which keeps lower and upper bounds on the optimal value function. Other recent methods along these lines are focussed DP [Ferguson and Stentz(2004)] and heuristic search-DP [Bonet and Geffner(2003a)].
7 Reinforcement Learning: Model-Free Solution Techniques
The previous section has reviewed several methods for computing an optimal policy
for an MDP assuming that a (perfect) model is available. RL is primarily concerned
with how to obtain an optimal policy when such a model is not available. RL adds to MDPs a focus on approximation and incomplete information, and the need for sampling and exploration. In contrast with the algorithms discussed in the previous section, model-free methods do not rely on the availability of a priori known transition and reward models, i.e. a model of the MDP. The lack of a model generates a
need to sample the MDP to gather statistical knowledge about this unknown model.
Many model-free RL techniques exist that probe the environment by doing actions.
for each episode do
    s ∈ S is initialized as the starting state
    t := 0
    repeat
        choose an action a ∈ A(s)
        perform action a
        observe the new state s' and received reward r
        update T̃, R̃, Q̃ and/or Ṽ using the experience ⟨s, a, r, s'⟩
        s := s'
    until s' is a goal state
Algorithm 3: A general algorithm for online RL
In doing so, they estimate the same kind of state value and state-action value functions as
model-based techniques. This section will review model-free methods along with
several efficient extensions.
In model-free contexts one still has a choice between two options. The first is to learn the transition and reward model from interaction with the environment. After that, when the model is (approximately or sufficiently) correct, all the DP methods from the previous section apply. This type of learning is called indirect RL. The second option, called direct RL, is to step right into estimating values for actions, without even estimating a model of the MDP. Additionally, mixed forms between these two exist too. For example, one can still do model-free estimation of action values, but use an approximated model to speed up value learning, by using this model to perform additional, full backups of values (see Section 7.3). Most model-free methods, however, focus on direct estimation of (action) values.
A second choice one has to make is what to do with temporal credit assignment. It is difficult to assess the utility of some action if its real effects can only be perceived much later. One possibility is to wait until the "end" (e.g. of an episode) and punish or reward specific actions along the path taken. However, this takes a lot of memory, and often, with ongoing tasks, it is not known beforehand whether, or when, there will be an "end". Instead, one
can use similar mechanisms as in value iteration to adjust the estimated value of
a state based on the immediate reward and the estimated (discounted) value of the
next state. This is generally called temporal difference learning which is a general
mechanism underlying the model-free methods in this section. The main difference
with the update rules for DP approaches (such as Equation 14) is that the transition
function T and reward function R cannot appear in the update rules now. The gen-
eral class of algorithms that interact with the environment and update their estimates
after each experience is called online.
A general template for online RL is given in Algorithm 3. It shows an interaction loop in which the agent selects an action (by whatever means) based on its current state, gets feedback in the form of the resulting state and an associated reward, after which it updates its estimated values stored in Ṽ and Q̃, and possibly statistics concerning T̃ and R̃ (in case of some form of indirect learning). The selection of the action is based on the current state s and the value function (either Q or V). To solve the exploration-exploitation problem, usually a separate exploration mechanism ensures that sometimes the best action (according to current estimates of action values) is taken (exploitation) but sometimes a different action is chosen (exploration). Various choices for exploration, ranging from random to sophisticated, exist and we will see some examples in Section 7.3.
Exploration.
One important aspect of model-free algorithms is that there is a need for explo-
ration. Because the model is unknown, the learner has to try out different ac-
tions to see their results. A learning algorithm has to strike a balance between
exploration and exploitation, i.e. in order to gain a lot of reward the learner has
to exploit its current knowledge about good actions, although it sometimes must
try out different actions to explore the environment for possible better actions.
The most basic exploration strategy is the ε-greedy policy, i.e. the learner takes its current best action with probability (1 − ε) and a (randomly selected) other action with probability ε. There are many more ways of doing exploration (see [Wiering(1999), Reynolds(2002), Ratitch(2005)] for overviews) and in Section 7.3 we will see some examples. One additional method that is often used in combination with the algorithms in this section is the Boltzmann (or: softmax) exploration strategy. It is only slightly more complicated than the ε-greedy strategy. The action selection strategy is still random, but selection probabilities are weighted by their relative Q-values. This makes it more likely for the agent to choose very good actions, whereas two actions that have similar Q-values will have almost the same probability to get selected. Its general form is
probability to get selected. Its general form is
P(an) = eQ(s,a)
T
ieQ(s,ai)
T
(17)
in which P(a_n) is the probability of selecting action a_n and T is the temperature parameter. Higher values of T will move the selection strategy more towards a purely random strategy and lower values will move it towards a fully greedy strategy. A combination of both ε-greedy and Boltzmann exploration can be obtained by taking the best action with probability (1 − ε) and otherwise an action computed according to Equation 17 [Wiering(1999)].
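To illustrate the two strategies, the following sketch implements ε-greedy and Boltzmann (softmax) action selection over a tabular Q-function. The Q table is assumed to be a dictionary keyed by (state, action) pairs, and tau plays the role of the temperature T in Equation 17; these names are illustrative only.

import math
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def boltzmann(Q, state, actions, tau=1.0):
    """Softmax selection: probabilities proportional to exp(Q(s,a)/tau) (Equation 17)."""
    prefs = [math.exp(Q[(state, a)] / tau) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]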
Another simple method to stimulate exploration is optimistic Q-value initializa-
tion; one can initialize all Q-values to high values – e.g. an a priori defined upper-
bound – at the start of learning. Because Q-values will decrease during learning,
actions that have not been tried a number of times will have a large enough value to
get selected when using Boltzmann exploration for example. Another solution with
a similar effect is to keep counters on the number of times a particular state-action
pair has been selected.
7.1 Temporal Difference Learning
Temporal difference learning algorithms learn estimates of values based on other estimates. Each step in the world generates a learning example which can be used to bring some value estimate in accordance with the immediate reward and the estimated value of the next state or state-action pair. An intuitive example, along the lines of [Sutton and Barto(1998)] (Chapter 6), is the following.
Imagine you have to predict at what time your guests can arrive for a small dinner at your house. Before cooking, you have to go to the supermarket, the butcher and the wine seller, in that order. You have estimates of driving times between all locations, and you predict that you can manage each of the last two stores in 10 minutes, but given the crowded time of day, your estimate for the supermarket is half an hour. Based on this prediction, you have notified your guests that they can arrive no earlier than 18.00h. Once you find out while in the supermarket that it will take you only 10 minutes to get all the things you need, you can adjust your estimate of when you will arrive back home by 20 minutes. However, once on your way
from the butcher to the wine seller, you see that there is quite some traffic along the
way and it takes you 30 minutes longer to get there. Finally you arrive 10 minutes
later than you predicted in the first place. The bottom line of this example is that
you can adjust your estimate about what time you will be back home every time you
have obtained new information about in-between steps. Each time you can adjust
your estimate on how long it will still take based on actually experienced times of
parts of your path. This is the main principle of TD learning: you do not have to
wait until the end of a trial to make updates along your path.
TD methods learn their value estimates based on estimates of other values, which
is called bootstrapping. They have an advantage over DP in that they do not require
a model of the MDP. Another advantage is that they are naturally implemented in
an online, incremental fashion such that they can be easily used in various circum-
stances. No full sweeps through the full state space are needed; only along experi-
enced paths values get updated, and updates are effected after each step.
TD(0).
TD(0) is a member of the family of TD learning algorithms [Sutton(1988)]. It solves the prediction problem, i.e. it estimates V^π for some policy π, in an online, incremental fashion. TD(0) can be used to evaluate a policy and works through the use of the following update rule⁵:

V_{k+1}(s) := V_k(s) + α [ r + γ V_k(s') − V_k(s) ]
⁵ The learning parameter α should comply with some criteria on its value, and the way it is changed. In the algorithms in this section, one often chooses a small, fixed learning parameter, or it is decreased every iteration.
Require: discount factor γ, learning parameter α
initialize Q arbitrarily (e.g. Q(s,a) := 0, ∀s ∈ S, ∀a ∈ A)
for each episode do
    s is initialized as the starting state
    repeat
        choose an action a ∈ A(s) based on an exploration strategy
        perform action a
        observe the new state s' and received reward r
        Q(s,a) := Q(s,a) + α [ r + γ · max_{a' ∈ A(s')} Q(s',a') − Q(s,a) ]
        s := s'
    until s' is a goal state
Algorithm 4: Q-Learning [Watkins and Dayan(1992)]
where α ∈ [0,1] is the learning rate, which determines by how much values get updated. This backup is performed after experiencing the transition from state s to s' based on the action a, while receiving reward r. The difference with DP backups such as the one used in Equation 14 is that the update is still done by bootstrapping, but it is based on an observed transition, i.e. it uses a sample backup instead of a full backup. Only the value of one successor state is used, instead of a weighted average of all possible successor states. When using the value function V^π for action selection, a model is needed to compute an expected value over all action outcomes (e.g. see Equation 4).
The learning rate α has to be decreased appropriately for learning to converge. Sometimes the learning rate can be defined for states separately, as in α(s), in which case it can be dependent on how often the state is visited. The next two algorithms learn Q-functions directly from samples, removing the need for a transition model for action selection.
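A minimal TD(0) evaluation loop might look as follows. The episodic environment interface (reset() and step(action) returning the next state, the reward and a done flag) and the policy given as a function from states to actions are assumptions of the sketch, not part of the text.

from collections import defaultdict

def td0_evaluation(env, policy, episodes=1000, alpha=0.1, gamma=0.95):
    """Estimate V^pi with the TD(0) update V(s) := V(s) + alpha [r + gamma V(s') - V(s)]."""
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V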
Q-learning.
One of the most basic and popular methods to estimate Q-value functions in a
model-free fashion is the Q-learning algorithm [Watkins(1989), Watkins and Dayan(1992)],
see Algorithm 4.
The basic idea in Q-learning is to incrementally estimate Q-values for actions, based on feedback (i.e. rewards) and the agent's Q-value function. The update rule is a variation on the theme of TD learning, using Q-values and a built-in max-operator over the Q-values of the next state in order to update Q_k into Q_{k+1}:

Q_{k+1}(s_t,a_t) = Q_k(s_t,a_t) + α [ r_t + γ max_a Q_k(s_{t+1},a) − Q_k(s_t,a_t) ]   (18)

The agent makes a step in the environment from state s_t to s_{t+1} using action a_t while receiving reward r_t. The update takes place on the Q-value of action a_t in the state s_t from which this action was executed.
Q-learning is exploration-insensitive: it will converge to the optimal policy regardless of the exploration policy being followed, under the assumption that each state-action pair is visited an infinite number of times and the learning parameter α is decreased appropriately [Watkins and Dayan(1992), Bertsekas and Tsitsiklis(1996)].
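Algorithm 4 with ε-greedy exploration plugged in might be implemented as in the following sketch; the environment interface and the explicit list of actions are assumptions made for illustration.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning (Equation 18) with epsilon-greedy exploration."""
    Q = defaultdict(float)                     # Q[(state, action)]; optimistic init is also possible
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:      # explore
                a = random.choice(actions)
            else:                              # exploit current estimates
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q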
SARSA.
Q-learning is an off-policy learning algorithm, which means that while following some exploration policy π, it aims at estimating the optimal policy π^*. A related on-policy algorithm that learns the Q-value function for the policy the agent is actually executing is the SARSA algorithm [Rummery and Niranjan(1994), Rummery(1995), Sutton(1996)], which stands for State–Action–Reward–State–Action. It uses the following update rule:

Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + α [ r_t + γ Q_t(s_{t+1},a_{t+1}) − Q_t(s_t,a_t) ]   (19)

where the action a_{t+1} is the action that is executed by the current policy for state s_{t+1}. Note that the max-operator in Q-learning is replaced by the estimate of the value of the next action according to the policy. This learning algorithm will still converge in the limit to the optimal value function (and policy) under the condition that all states and actions are tried infinitely often and the policy converges in the limit to the greedy policy, i.e. such that exploration does not occur anymore. SARSA is especially useful in non-stationary environments, in which one will never reach an optimal policy. It is also useful when function approximation is used, because off-policy methods can diverge in that case. However, off-policy methods are needed in many situations, such as learning with hierarchically structured policies.
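The on-policy character shows up directly in code: the bootstrap target uses the action actually selected next, rather than a max. A sketch of one SARSA episode, reusing the assumed environment interface and an epsilon_greedy helper like the one shown earlier:

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
    """One episode of SARSA (Equation 19): bootstrap on the action actually selected next."""
    s = env.reset()
    a = epsilon_greedy(Q, s, actions, epsilon)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, s_next, actions, epsilon)   # on-policy choice
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next
    return Q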
Actor-Critic Learning.
Another class of algorithms that precede Q-learning and SARSA are actor–critic methods [Witten(1977), Barto et al(1983)Barto, Sutton, and Anderson, Konda and Tsitsiklis(2003)], which learn on-policy. This branch of TD methods keeps a separate policy independent of the value function. The policy is called the actor and the value function the critic. The critic – typically a state-value function – evaluates, or: criticizes, the actions executed by the actor. After action selection, the critic evaluates the action using the TD-error:

δ_t = r_t + γ V(s_{t+1}) − V(s_t)

The purpose of this error is to strengthen or weaken the selection of this action in this state. A preference for an action a in some state s can be represented as p(s,a), such that this preference can be modified using:
p(s_t,a_t) := p(s_t,a_t) + β δ_t

where the parameter β determines the size of the update. There are other versions of actor–critic methods, differing mainly in how preferences are changed, or how experience is used (for example using eligibility traces, see the next section). An advantage of having a separate policy representation is that if there are many actions, or when the action space is continuous, there is no need to consider all actions' Q-values in order to select one of them. A second advantage is that they can learn stochastic policies naturally. Furthermore, a priori knowledge about policy constraints can be used, e.g. see [Främling(2005)].
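A tabular actor-critic update along these lines might look as follows: the critic maintains V and computes the TD-error δ_t, while the actor stores preferences p(s,a) that are turned into a softmax policy. The environment interface and the table layout are assumptions of the sketch.

import math
import random

def actor_critic_episode(env, actions, V, prefs, alpha=0.1, beta=0.1, gamma=0.95):
    """One episode of a simple tabular actor-critic: critic learns V, actor learns preferences."""
    s = env.reset()
    done = False
    while not done:
        # actor: softmax over preferences p(s, a)
        weights = [math.exp(prefs[(s, a)]) for a in actions]
        a = random.choices(actions, weights=weights, k=1)[0]
        s_next, r, done = env.step(a)
        delta = r + gamma * V[s_next] - V[s]      # critic's TD-error
        V[s] += alpha * delta                     # critic update
        prefs[(s, a)] += beta * delta             # actor update: strengthen or weaken the action
        s = s_next

# usage: V and prefs can be collections.defaultdict(float) tables, updated over many episodes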
Average Reward Temporal Difference Learning.
We have explained Q-learning and related algorithms in the context of discounted,
infinite-horizon MDPs. Q-learning can also be adapted to the average-reward frame-
work, for example in the R-learning algorithm [Schwartz(1993)]. Other extensions
of algorithms to the average reward framework exist (see [Mahadevan(1996)] for an
overview).
7.2 Monte Carlo Methods
Other algorithms that use more unbiased estimates are Monte Carlo (MC) tech-
niques. They keep frequency counts of transitions and rewards and base their values
on these estimates. MC methods only require samples to estimate average sample
returns. For example, in MC policy evaluation, for each state sSall returns ob-
tained from sare kept and the value of a state sSis just their average. In other
words, MC algorithms treat the long-term reward as a random variable and take as
its estimate the sampled mean. In contrast with one-step TD methods, MC estimates
values based on averaging sample returns observed during interaction. Especially
for episodic tasks this can be very useful, because samples from complete returns
can be obtained. One way of using MC is by using it for the evaluation step in policy iteration. However, because the sampling is dependent on the current policy π, only returns for actions suggested by π are evaluated. Thus, exploration is of key importance here, just as in other model-free methods.
A distinction can be made between every-visit MC, which averages over all visits of a state s ∈ S in all episodes, and first-visit MC, which averages over just the returns obtained from the first visit to a state s ∈ S for all episodes. Both variants will converge to V^π for the current policy π over time. MC methods can also be applied
to the problem of estimating action values. One way of ensuring enough explo-
ration is to use exploring starts, i.e. each state-action pair has a non-zero probability
of being selected as the initial pair. MC methods can be used for both on-policy
and off-policy control, and the general pattern complies with the generalized policy
iteration procedure. The fact that MC methods do not bootstrap makes them less de-
pendent on the Markov assumption. TD methods too focus on sampled experience,
although they do use bootstrapping.
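A first-visit MC evaluation sketch, under the same assumed episodic environment interface as before, collects one episode at a time and averages the returns observed from each state's first occurrence:

from collections import defaultdict

def first_visit_mc(env, policy, episodes=1000, gamma=0.95):
    """First-visit Monte Carlo policy evaluation: V(s) is the average first-visit return."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(episodes):
        # generate one episode following the policy
        trajectory, s, done = [], env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            trajectory.append((s, r))
            s = s_next
        # compute returns backwards and record them for first visits only
        first_visit = {}
        for t, (s, _) in enumerate(trajectory):
            first_visit.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(trajectory))):
            s, r = trajectory[t]
            G = r + gamma * G
            if first_visit[s] == t:
                returns_sum[s] += G
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}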
Learning a Model.
We have described MC methods in the context of learning value functions. Methods
similar to MC can also be used to estimate a model of the MDP. An average over
sample transition probabilities experienced during interaction can be used to gradu-
ally estimate transition probabilities. The same can be done for immediate rewards.
Indirect RL algorithms make use of this to strike a balance between model-based
and model-free learning. They are essentially model-free, but learn a transition and
reward model in parallel with model-free RL, and use this model to do more effi-
cient value function learning (see also the next section). An example of this is the
DYNA model [Sutton(1991a)]. Another method that often employs model learning
is prioritized sweeping [Moore and Atkeson(1993)]. Learning a model can also be
very useful to learn in continuous spaces where the transition model is defined over
a discretized version of the underlying (infinite) state space [Großmann(2000)].
Relations with Dynamic Programming.
The methods in this section solve essentially similar problems as DP techniques. RL
approaches can be seen as asynchronous DP. There are some important differences
in both approaches though.
RL approaches avoid the exhaustive sweeps of DP by restricting computation to sampled trajectories (either real or simulated), or their neighborhood. This
can exploit situations in which many states have low probabilities of occurring in
actual trajectories. The backups used in DP are simplified by using sampling. In-
stead of generating and evaluating all of a state’s possible immediate successors,
the estimate of a backup’s effect is done by sampling from the appropriate distribu-
tion. MC methods use this to base their estimates completely on the sample returns,
without bootstrapping using values of other, sampled, states. Furthermore, the fo-
cus on learning (action) value functions in RL is easily amenable to function ap-
proximation approaches. Representing value functions and/or policies can be done
more compactly than lookup-table representations by using numeric regression al-
gorithms without breaking the standard RL interaction process; one can just feed
the update values into a regression engine.
An interesting point here is the similarity between Q-learning and value iteration
on the one hand and SARSA and policy iteration on the other hand. In the first two
methods, the updates immediately combine policy evaluation and improvement into
one step by using the max-operator. In contrast, the second two methods separate
evaluation and improvement of the policy. In this respect, value iteration can be
considered as off-policy because it aims at directly estimating V^*, whereas policy
iteration estimates values for the current policy and is on-policy. However, in the
model-based setting the distinction is only superficial, because instead of samples
that can be influenced by an on-policy distribution, a model is available such that
the distribution over states and rewards is known.
7.3 Efficient Exploration and Value Updating
The methods in the previous section have shown that both prediction and control
can be learned using samples from interaction with the environment, without hav-
ing access to a model of the MDP. One problem with these methods is that they
often need a large number of experiences to converge. In this section we describe a
number of extensions used to speed up learning. One direction for improvement lies in the exploration. One can – in principle – use MC sampling until one knows everything about the MDP, but this simply takes too long. Using more information enables more focused exploration procedures to generate experience more efficiently. Another direction is to put more effort into using the experience for updating multiple
values of the value function on each step. Improving exploration generates better
samples, whereas improving updates will squeeze more information from samples.
Efficient Exploration.
We have already encountered ε-greedy and Boltzmann exploration. Although com-
monly used, these are relatively simple undirected exploration methods. They are
mainly driven by randomness. In addition, they are stateless, i.e. the exploration is
driven without knowing which areas of the state space have been explored so far.
A large class of directed exploration methods that use additional information about the learning process has been proposed in the literature. The focus of
these methods is to do more uniform exploration of the state space and to balance
the relative benefits of discovering new information relative to exploiting current
knowledge. Most methods use or learn a model of the MDP in parallel with RL. In
addition they learn an exploration value function.
Several options for directed exploration are available. One distinction between
methods is whether to work locally (e.g. exploration of individual state-action pairs)
or globally by considering information about parts or the complete state-space when
making a decision to explore. Furthermore, there are several other classes of explo-
ration algorithms.
Counter-based or recency-based methods keep records of how often, or how
long ago, a state-action pair has been visited. Error-based methods, of which pri-
oritized sweeping [Moore and Atkeson(1993)] is one example, use an exploration
bonus based on the error in the value of states. Other methods base exploration on
the uncertainty about the value of a state, or the confidence about the state’s cur-
rent value. They decide whether to explore by calculating the probability that an
explorative action will uncover a larger reward than already found. The interval es-
timation (IE) method [Kaelbling(1993)] is an example of this kind of methods. IE
uses a statistical model to measure the degree of uncertainty of each Q(s,a)-value.
An upper bound can be calculated on the likely value of each Q-value, and the ac-
tion with the highest upper bound is taken. If the action taken happens to be a poor
choice, the upper bound will be decreased when the statistical model is updated.
Good actions will continue to have a high upper bound and will be chosen often.
In contrast to counter- and recency-based exploration, IE is concerned with action
exploration and not with state space exploration. [Wiering(1999)] (see also [?]) in-
troduced an extension to model-based RL in the model-based interval estimation
algorithm in which the same idea is used for estimates of transition probabilities.
Another recent method that deals explicitly with the exploration-exploitation trade-off is the E^3 method [Kearns and Singh(1998)]. E^3 stands for explicit exploration and exploitation. It learns by updating a model of the environment by
collecting statistics. The state space is divided into known and unknown parts. On
every step a decision is made whether the known part contains sufficient opportu-
nities for getting rewards or whether the unknown part should be explored to ob-
tain possibly more reward. An important aspect of this algorithm is that it was the
first general near-optimal (tabular) RL algorithm with provable bounds on compu-
tation time. The approach was extended in [Brafman and Tennenholtz(2002)] into
the more general algorithm R-MAX. It too provides a polynomial bound on computation time for reaching near-optimal policies. As a last example, [Ratitch(2005)] presents an approach for efficient, directed exploration based on more sophisticated
characteristics of the MDP such as an entropy measure over state transitions. An
interesting feature of this approach is that these characteristics can be computed be-
fore learning and be used in combination with other exploration methods, thereby
improving their behavior.
For a more detailed coverage of exploration strategies we refer the reader to
[Ratitch(2005)] and [Wiering(1999)].
Guidance and Shaping.
Exploration methods can be used to speed up learning and focus attention to relevant
areas in the state space. The exploration methods mainly use statistics derived from
the problem before or during learning. However, sometimes more information is
available that can be used to guide the learner. For example, if a reasonable policy for
a domain is available, it can be used to generate more useful learning samples than
(random) exploration could do. In fact, humans are usually very bad at specifying optimal policies, but considerably good at specifying reasonable ones⁶.
The work in behavioral cloning [Bain and Sammut(1995)] takes an extreme
point on the guidance spectrum in that the goal is to replicate example behavior from
expert traces, i.e. to clone this behavior. This type of guidance moves learning more
⁶ Quote taken from the invited talk by Leslie Kaelbling at the European Workshop on Reinforcement Learning (EWRL), Utrecht, 2001.
in the direction of supervised learning. Another way to help the agent is by shaping
[Mataric(1994), Dorigo and Colombetti(1997), Ng et al(1999)Ng, Harada, and Russell].
Shaping pushes the reward closer to the subgoals of behavior, and thus encourages
the agent to incrementally improve its behavior by searching the policy space more
effectively. This is also related to the general issue of giving rewards to appropriate
subgoals, and the gradual increase in difficulty of tasks. The agent can be trained
on increasingly more difficult problems, which can also be considered as a form of
guidance.
Various other mechanisms can be used to provide guidance to RL algorithms,
such as decompositions [?], heuristic rules for better exploration [Främling(2005)]
and various types of transfer in which knowledge learned in one problem is trans-
ferred to other, related problems, e.g. see [?].
Eligibility Traces.
In MC methods, the updates are based on the entire sequence of observed rewards
until the end of an episode. In TD methods, the estimates are based on the samples
of immediate rewards and the next states. An intermediate approach is to use the
n-step truncated return R_t^{(n)}, obtained from a whole sequence of returns:

R_t^{(n)} = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... + γ^n V_t(s_{t+n})

With this, one can compute the updates of values based on several n-step returns. The family of TD(λ), with 0 ≤ λ ≤ 1, combines n-step returns weighted proportionally to λ^{n−1}.
The problem with this is that we would have to wait indefinitely to compute R_t^{(∞)}. This view is useful for theoretical analysis and understanding of n-step backups; it is called the forward view of the TD(λ) algorithm. However, the usual way to implement this kind of update is called the backward view of the TD(λ) algorithm and is done by using eligibility traces, which are an incremental implementation of the same idea.
Eligibility traces are a way to perform n-step backups in an elegant way. For each state s ∈ S, an eligibility e_t(s) is kept in memory. The traces are initialized at 0 and updated at every time step according to:

e_t(s) = γλ e_{t−1}(s)         if s ≠ s_t
e_t(s) = γλ e_{t−1}(s) + 1     if s = s_t

where λ is the trace decay parameter. The trace for each state is increased every time that state is visited and decays exponentially otherwise. Now δ_t is the temporal difference error at stage t:

δ_t = r_t + γ V(s_{t+1}) − V(s_t)
On every step, all states are updated in proportion to their eligibility traces, as in:

V(s) := V(s) + α δ_t e_t(s)
The forward and backward views on eligibility traces can be proved equivalent [Sutton and Barto(1998)]. For λ = 1, TD(λ) is essentially the same as MC, because it considers the complete return, and for λ = 0, TD(λ) uses just the immediate return, as in all one-step RL algorithms. Eligibility traces are a general mechanism to learn from n-step returns. They can be combined with all of the model-free methods we have described in the previous section. [Watkins(1989)] combined Q-learning with eligibility traces in the Q(λ)-algorithm. [Peng and Williams(1996)] proposed a similar algorithm, and [Wiering and Schmidhuber(1998)] and [Reynolds(2002)] both proposed efficient versions of Q(λ). The problem with combining eligibility traces with learning control is that special care has to be taken in case of exploratory actions, which can break the intended meaning of the n-step return for the current policy that is followed. In [Watkins(1989)]'s version, eligibility traces are reset every time an exploratory action is taken. [Peng and Williams(1996)]'s version is, in that respect, more efficient, in that traces do not have to be set to zero every time. SARSA(λ) [Sutton and Barto(1998)] is safer in this respect, because action selection is on-policy. Another recent on-policy learning algorithm is the QV(λ) algorithm by [Wiering(2005)]. In QV(λ)-learning two value functions are learned; TD(λ) is used for learning a state value function V, and one-step Q-learning is used for learning a state-action value function, based on V.
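The backward view translates almost directly into code. The sketch below implements tabular TD(λ) for policy evaluation with accumulating eligibility traces, again under the assumed environment interface used in the earlier sketches:

from collections import defaultdict

def td_lambda(env, policy, episodes=1000, alpha=0.1, gamma=0.95, lam=0.8):
    """TD(lambda) with accumulating eligibility traces (backward view)."""
    V = defaultdict(float)
    for _ in range(episodes):
        e = defaultdict(float)                 # eligibility traces, reset per episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + gamma * V[s_next] - V[s]
            e[s] += 1.0                        # accumulate trace for the visited state
            for state in list(e):              # update all states in proportion to their trace
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam        # decay every trace
            s = s_next
    return V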
Learning and Using a Model: Learning and Planning.
Even though RL methods can function without a model of the MDP, such a model can be useful to speed up learning, or to bias exploration. A learned model can also be useful for more efficient value updating. A general guideline is that when experience is costly, it pays off to learn a model. In RL, model-learning is usually targeted at the specific learning task defined by the MDP, i.e. determined by the rewards and the goal. In general, learning a model is most often useful because it gives knowledge about the dynamics of the environment, such that it can be used for other tasks too (see [?] for extensive elaboration on this point).
The DYNA architecture [Sutton(1990), Sutton(1991b), Sutton(1991a), Sutton and Barto(1998)] is a simple way to use the model to amplify experiences. Algorithm 5 shows DYNA-Q, which combines Q-learning with planning. In a continuous loop, Q-learning is interleaved with series of extra updates using a model that is constantly updated too. DYNA needs fewer interactions with the environment, because it replays experience to do more value updates.
A related method that makes more use of experience using a learned model is prioritized sweeping (PS) [Moore and Atkeson(1993)]. Instead of selecting states to be updated randomly (as in DYNA), PS prioritizes updates based on their change in values.
Require: initialize Q and Model arbitrarily
repeat
    s ∈ S is the start state
    a := ε-greedy(s, Q)
    update Q
    update Model
    for i := 1 to n do
        s := randomly selected observed state
        a := random, previously selected action from s
        update Q using the model
until sufficient performance
Algorithm 5: Dyna-Q [Sutton and Barto(1998)]
Once a state is updated, the PS algorithm considers all states that can reach that state, by looking at the transition model, and sees whether these states will have to be updated as well. The order of the updates is determined by the size of the value updates. The general mechanism can be summarized as follows. In each step, i) one remembers the old value of the current state, ii) one updates the state value with a full backup using the learned model, iii) one sets the priority of the current state to 0, iv) one computes the change δ in value as the result of the backup, and v) one uses this difference to modify predecessors of the current state (determined by the model); all states leading to the current state get a priority update of δ × T. The number of value backups is a parameter to be set in the algorithm. Overall, PS focuses the backups where they are expected to most quickly reduce the error. Another example of using planning in model-based RL is [Wiering(2002)].
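A minimal Dyna-Q loop in the spirit of Algorithm 5 might look as follows. For simplicity the sketch assumes a deterministic environment, so the learned model can store a single (reward, next state) pair per state-action; the interface names are, as before, illustrative assumptions.

import random
from collections import defaultdict

def dyna_q(env, actions, episodes=500, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Dyna-Q: interleave Q-learning updates with n planning updates from a learned model."""
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (r, s') for observed transitions
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # direct RL update from real experience
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            # model learning (deterministic-world assumption)
            model[(s, a)] = (r, s_next)
            # planning: replay n previously observed transitions from the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                p_best = max(Q[(ps_next, act)] for act in actions)
                Q[(ps, pa)] += alpha * (pr + gamma * p_best - Q[(ps, pa)])
            s = s_next
    return Q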
References
[Bain and Sammut(1995)] Bain M, Sammut C (1995) A framework for behavioral cloning. Ma-
chine Learning 15
[Barto et al(1983)Barto, Sutton, and Anderson] Barto AG, Sutton RS, Anderson CW (1983) Neu-
ronlike elements that can solve difficult learning control problems. IEEE Transactions on Sys-
tems, Man, and Cybernetics 13:835–846
[Barto et al(1995)Barto, Bradtke, and Singh] Barto AG, Bradtke SJ, Singh SP (1995) Learning to
act using real-time dynamic programming. Artificial Intelligence 72(1):81–138
[Bellman(1957)] Bellman RE (1957) Dynamic Programming. Princeton University Press, Prince-
ton, New Jersey
[Bertsekas(1995)] Bertsekas D (1995) Dynamic Programming and Optimal Control, volumes 1
and 2. Athena Scientific, Belmont, MA
[Bertsekas and Tsitsiklis(1996)] Bertsekas DP, Tsitsiklis J (1996) Neuro-Dynamic Programming.
Athena Scientific, Belmont, MA
[Bonet and Geffner(2003a)] Bonet B, Geffner H (2003a) Faster heuristic search algorithms for
planning with uncertainty and full feedback. In: Proceedings of the International Joint Con-
ference on Artificial Intelligence (IJCAI), pp 1233–1238
[Bonet and Geffner(2003b)] Bonet B, Geffner H (2003b) Labeled rtdp: Improving the conver-
gence of real-time dynamic programming. In: Proceedings of the International Conference on
Artificial Planning Systems (ICAPS’03), pp 12–21
[Boutilier(1999)] Boutilier C (1999) Knowledge representation for stochastic decision processes.
Lecture Notes in Computer Science 1600:111–152
[Boutilier et al(1999)Boutilier, Dean, and Hanks] Boutilier C, Dean T, Hanks S (1999) Decision
theoretic planning: Structural assumptions and computational leverage. Journal of Artificial
Intelligence Research 11:1–94
[Brafman and Tennenholtz(2002)] Brafman R, Tennenholtz M (2002) R-MAX - a general poly-
nomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning
Research 3:213–231
[Brendan McMahan et al(2005)Brendan McMahan, Likhachev, and Gordon] Brendan McMahan
H, Likhachev M, Gordon GJ (2005) Bounded real-time dynamic programming: Rtdp with
monotone upper bounds and performance guarantees. In: Proceedings of the International
Conference on Machine Learning (ICML), pp 569–576
[Dean et al(1995)Dean, Kaelbling, Kirman, and Nicholson] Dean T, Kaelbling LP, Kirman J,
Nicholson A (1995) Planning under time constraints in stochastic domains. Artificial Intel-
ligence 76:35–74
[Dorigo and Colombetti(1997)] Dorigo M, Colombetti M (1997) Robot shaping: an experiment
in behavior engineering. MIT Press, Cambridge, MA
[Ferguson and Stentz(2004)] Ferguson D, Stentz A (2004) Focussed dynamic programming: Ex-
tensive comparative results. Tech. Rep. CMU-RI-TR-04-13, Robotics Institute, Carnegie Mel-
lon University, Pittsburgh, Pennsylvania
[Främling(2005)] Främling K (2005) Bi-memory model for guiding exploration by pre-existing
knowledge. In: Driessens K, Fern A, van Otterlo M (eds) Proceedings of the ICML-2005
Workshop on Rich Representations for Reinforcement Learning, pp 21–26
[Großmann(2000)] Großmann A (2000) Adaptive state-space quantisation and multi-task rein-
forcement learning using constructive neural networks. In: Meyer JA, Berthoz A, Floreano D,
Roitblat HL, Wilson SW (eds) From Animals to Animats: Proceedings of The International
Conference on Simulation of Adaptive Behavior (SAB), pp 160–169
[Hansen and Zilberstein(2001)] Hansen EA, Zilberstein S (2001) LAO*: A heuristic search algo-
rithm that finds solutions with loops. Artificial Intelligence 129:35–62
[Howard(1960)] Howard RA (1960) Dynamic Programming and Markov Processes. The MIT
Press, Cambridge, Massachusetts
[Kaelbling(1993)] Kaelbling LP (1993) Learning in Embedded Systems. MIT Press, Cambridge,
MA
[Kaelbling et al(1996)Kaelbling, Littman, and Moore] Kaelbling LP, Littman ML, Moore AW
(1996) Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4:237–
285
[Kearns and Singh(1998)] Kearns M, Singh S (1998) Near-optimal reinforcement learning in
polynomial time. In: Proceedings of the International Conference on Machine Learning
(ICML)
[Koenig and Liu(2002)] Koenig S, Liu Y (2002) The interaction of representations and planning
objectives for decision-theoretic planning. Journal of Experimental and Theoretical Artificial
Intelligence
[Konda and Tsitsiklis(2003)] Konda V, Tsitsiklis J (2003) Actor-critic algorithms. SIAM Journal
on Control and Optimization 42(4):1143–1166
[Littman et al(1995)Littman, Dean, and Kaelbling] Littman ML, Dean TL, Kaelbling LP (1995)
On the complexity of solving markov decision problems. In: Proceedings of the National
Conference on Artificial Intelligence (AAAI), pp 394–402
[Mahadevan(1996)] Mahadevan S (1996) Average reward reinforcement learning: Foundations,
algorithms, and empirical results. Machine Learning 22:159–195
[Maloof(2003)] Maloof MA (2003) Incremental rule learning with partial instance memory for
changing concepts. In: Proceedings of the International Joint Conference on Neural Networks,
pp 2764–2769
[Mataric(1994)] Mataric M (1994) Reward functions for accelerated learning. In: Proceedings of
the International Conference on Machine Learning (ICML)
[Matthews(1922)] Matthews WH (1922) Mazes and Labyrinths: A General Account of their His-
tory and Developments. Longmans, Green and Co., London, reprinted in 1970 by Dover Pub-
lications, New York, under the title ’Mazes & Labyrinths: Their History & Development
[Moore and Atkeson(1993)] Moore AW, Atkeson CG (1993) Prioritized sweeping: Reinforce-
ment learning with less data and less time. Machine Learning 13(1):103–130
[Ng et al(1999)Ng, Harada, and Russell] Ng AY, Harada D, Russell SJ (1999) Policy invariance
under reward transformations: Theory and application to reward shaping. In: Proceedings of
the International Conference on Machine Learning (ICML), pp xx–xx
[Peng and Williams(1996)] Peng J, Williams RJ (1996) Incremental multi-step q-learning. Ma-
chine Learning 22:283–290
[Puterman(1994)] Puterman ML (1994) Markov Decision Processes—Discrete Stochastic Dy-
namic Programming. John Wiley & Sons, Inc., New York, NY
[Puterman and Shin(1978)] Puterman ML, Shin MC (1978) Modified policy iteration algorithms
for discounted Markov decision processes. Management Science 24:1127–1137
[Ratitch(2005)] Ratitch B (2005) On characteristics of markov decision processes and reinforce-
ment learning in large domains. PhD thesis, The School of Computer Science, McGill Uni-
versity, Montreal
[Reynolds(2002)] Reynolds SI (2002) Reinforcement learning with exploration. PhD thesis, The
School of Computer Science, The University of Birmingham, UK
[Rummery(1995)] Rummery GA (1995) Problem solving with reinforcement learning. PhD the-
sis, Cambridge University, Engineering Department, Cambridge, England
[Rummery and Niranjan(1994)] Rummery GA, Niranjan M (1994) On-line Q-Learning using
connectionist systems. Tech. Rep. CUED/F-INFENG/TR 166, Cambridge University, Engi-
neering Department
[Schaeffer and Plaat(1997)] Schaeffer J, Plaat A (1997) Kasparov versus deep blue: The re-match.
International Computer Chess Association Journal 20(2):95–101
[Schwartz(1993)] Schwartz A (1993) A reinforcement learning method for maximizing undis-
counted rewards. In: Proceedings of the International Conference on Machine Learning
(ICML), pp 298–305
[Sutton(1990)] Sutton R (1990) Integrated architectures for learning, planning, and reacting based
on approximating dynamic programming. In: Proceedings of the International Conference on
Machine Learning (ICML), pp 216–224
[Sutton(1988)] Sutton RS (1988) Learning to predict by the methods of temporal differences.
Machine Learning 3:9–44
[Sutton(1991a)] Sutton RS (1991a) DYNA, an integrated architecture for learning, planning and
reacting. In: Working Notes of the AAAI Spring Symposium on Integrated Intelligent Archi-
tectures, pp 151–155
[Sutton(1991b)] Sutton RS (1991b) Reinforcement learning architectures for animats. In: From
Animals to Animats: Proceedings of The International Conference on Simulation of Adaptive
Behavior (SAB), pp 288–296
[Sutton(1996)] Sutton RS (1996) Generalization in reinforcement learning: Successful examples
using sparse coarse coding. In: Touretzky DS, Mozer MC, Hasselmo ME (eds) Proceedings
of the Neural Information Processing Conference (NIPS), vol 8, pp 1038–1044
[Sutton and Barto(1998)] Sutton RS, Barto AG (1998) Reinforcement Learning: an Introduction.
The MIT Press, Cambridge
[Tash and Russell(1994)] Tash J, Russell S (1994) Control strategies for a stochastic planner. In:
Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI’94), pp
1079–1085
[Watkins(1989)] Watkins CJCH (1989) Learning from delayed rewards. PhD thesis, King’s Col-
lege, Cambridge, England
[Watkins and Dayan(1992)] Watkins CJCH, Dayan P (1992) Q-learning. Machine Learning
8(3/4), special Issue on Reinforcement Learning
[Wiering(1999)] Wiering MA (1999) Explorations in efficient reinforcement learning. PhD thesis,
Faculteit der Wiskunde, Informatica, Natuurkunde en Sterrenkunde, Universiteit van Amster-
dam
[Wiering(2002)] Wiering MA (2002) Model-based reinforcement learning in dynamic environ-
ments. Tech. Rep. UU-CS-2002-029, Institute of Information and Computing Sciences, Uni-
versity of Utrecht, The Netherlands
[Wiering(2005)] Wiering MA (2005) QV(λ)-learning: A new on-policy reinforcement learning algorithm. In: Proceedings of the 7th European Workshop on Reinforcement Learning
[Wiering and Schmidhuber(1998)] Wiering MA, Schmidhuber JH (1998) Fast online Q(λ). Machine Learning xx:xxx–xxx
[Winston(1991)] Winston WL (1991) Operations research applications and algorithms, 2nd edn.
Thomson Information/Publishing Group, Boston
[Witten(1977)] Witten IH (1977) An adaptive optimal controller for discrete-time markov envi-
ronments. Information and Control 34:286–295
... Deep-RL. Reinforcement Learning (RL) [28] is a prominent branch of machine learning where agents learn optimal strategies through interactions with an environment, receiving feedback in the form of rewards or penalties. The fundamental goal of RL is to deduce a policy that maximizes the expected cumulative reward over time, emphasizing decision-making to achieve specific objectives. ...
Preprint
Recent years have witnessed a growing trend toward employing deep reinforcement learning (Deep-RL) to derive heuristics for combinatorial optimization (CO) problems on graphs. Maximum Coverage Problem (MCP) and its probabilistic variant on social networks, Influence Maximization (IM), have been particularly prominent in this line of research. In this paper, we present a comprehensive benchmark study that thoroughly investigates the effectiveness and efficiency of five recent Deep-RL methods for MCP and IM. These methods were published in top data science venues, namely S2V-DQN, Geometric-QN, GCOMB, RL4IM, and LeNSE. Our findings reveal that, across various scenarios, the Lazy Greedy algorithm consistently outperforms all Deep-RL methods for MCP. In the case of IM, theoretically sound algorithms like IMM and OPIM demonstrate superior performance compared to Deep-RL methods in most scenarios. Notably, we observe an abnormal phenomenon in IM problem where Deep-RL methods slightly outperform IMM and OPIM when the influence spread nearly does not increase as the budget increases. Furthermore, our experimental results highlight common issues when applying Deep-RL methods to MCP and IM in practical settings. Finally, we discuss potential avenues for improving Deep-RL methods. Our benchmark study sheds light on potential challenges in current deep reinforcement learning research for solving combinatorial optimization problems.
... After obtaining the data, the agent updates its strategy by learning to maximize the total reward value. The Markov Decision Process formalizes the environment of RL [34], which is a memoryless random process. MDP consists of 5 elements [ , , , , ] s A P R γ , where s is a finite set of states, A is a finite set of actions, P is a state transition probability matrix, R is a reward function used to score the decisions of the agent, and γ is a discount factor, and [0,1] γ ∈ . ...
Article
Full-text available
With the development and strengthening of interception measures, the traditional penetration methods of high-speed unmanned aerial vehicles (UAVs) are no longer able to meet the penetration requirements in diversified and complex combat scenarios. Due to the advancement of Artificial Intelligence technology in recent years, intelligent penetration methods have gradually become promising solutions. In this paper, a penetration strategy for high-speed UAVs based on improved Deep Reinforcement Learning (DRL) is proposed, in which Long Short-Term Memory (LSTM) networks are incorporated into a classical Soft Actor–Critic (SAC) algorithm. A three-dimensional (3D) planar engagement scenario of a high-speed UAV facing two interceptors with strong maneuverability is constructed. According to the proposed LSTM-SAC approach, the reward function is designed based on the criteria for successful penetration, taking into account energy and flight range constraints. Then, an intelligent penetration strategy is obtained by extensive training, which utilizes the motion states of both sides to make decisions and generate the penetration overload commands for the high-speed UAV. The simulation results show that compared with the classical SAC algorithm, the proposed algorithm has a training efficiency improvement of 75.56% training episode reduction. Meanwhile, the LSTM-SAC approach achieves a successful penetration rate of more than 90% in hypothetical complex scenarios, with a 40% average increase compared with the conventional programmed penetration methods.
Preprint
This paper introduces a flight envelope protection algorithm for the longitudinal axis that leverages reinforcement learning (RL). By considering limits on variables such as angle of attack, load factor, and pitch rate, the algorithm counteracts excessive pilot or control commands with restoring actions. Unlike traditional methods that require manual tuning, RL facilitates the approximation of complex functions within the trained model, streamlining the design process. This study demonstrates the promising results of RL in enhancing flight envelope protection, offering a novel and easy-to-scale method for safety-ensured flight.
Preprint
Maximizing storage performance in geological carbon storage (GCS) is crucial for commercial deployment, but traditional optimization demands resource-intensive simulations, posing computational challenges. This study introduces the multimodal latent dynamic (MLD) model, a deep learning framework for fast flow prediction and well control optimization in GCS. The MLD model includes a representation module for compressed latent representations, a transition module for system state evolution, and a prediction module for flow responses. A novel training strategy combining regression loss and joint-embedding consistency loss enhances temporal consistency and multi-step prediction accuracy. Unlike existing models, the MLD model supports diverse input modalities, allowing comprehensive data interactions. Because the MLD model can be framed as the environment of a Markov decision process (MDP), it can be used to train deep reinforcement learning agents, specifically with the soft actor-critic (SAC) algorithm, to maximize net present value (NPV) through continuous interactions. The approach outperforms traditional methods, achieving the highest NPV while reducing computational resources by over 60%. It also demonstrates strong generalization performance, providing improved decisions for new scenarios based on knowledge from previous ones.
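The three-module structure described above can be pictured with a skeletal PyTorch sketch; the module types, layer sizes, and interfaces below are assumptions for illustration rather than details of the MLD model itself.

import torch
import torch.nn as nn

class LatentDynamicsModel(nn.Module):
    """Skeleton of an encode -> latent transition -> decode surrogate model."""
    def __init__(self, obs_dim: int, action_dim: int, latent_dim: int = 64):
        super().__init__()
        # Representation module: compress observations into a latent state.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        # Transition module: evolve the latent state under a control action.
        self.transition = nn.Sequential(nn.Linear(latent_dim + action_dim, 256),
                                        nn.ReLU(), nn.Linear(256, latent_dim))
        # Prediction module: map latent states back to flow responses / observations.
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, obs_dim))

    def forward(self, obs, action):
        z = self.encoder(obs)
        z_next = self.transition(torch.cat([z, action], dim=-1))
        return self.decoder(z_next), z, z_next   # predicted next obs + latents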
Chapter
Neural Networks for Control highlights key issues in learning control and identifies research directions that could lead to practical solutions for control problems in critical application domains. It addresses general issues of neural network based control and neural network learning with regard to specific problems of motion planning and control in robotics, and takes up application domains well suited to the capabilities of neural network controllers. The appendix describes seven benchmark control problems. Contributors: Andrew G. Barto, Ronald J. Williams, Paul J. Werbos, Kumpati S. Narendra, L. Gordon Kraft, III, David P. Campagna, Mitsuo Kawato, Bartlett W. Mel, Christopher G. Atkeson, David J. Reinkensmeyer, Derrick Nguyen, Bernard Widrow, James C. Houk, Satinder P. Singh, Charles Fisher, Judy A. Franklin, Oliver G. Selfridge, Arthur C. Sanderson, Lyle H. Ungar, Charles C. Jorgensen, C. Schley, Martin Herman, James S. Albus, Tsai-Hong Hong, Charles W. Anderson, and W. Thomas Miller, III. Bradford Books imprint.
Chapter
These sixty contributions from researchers in ethology, ecology, cybernetics, artificial intelligence, robotics, and related fields delve into the behaviors and underlying mechanisms that allow animals and, potentially, robots to adapt and survive in uncertain environments. They focus in particular on simulation models in order to help characterize and compare various organizational principles or architectures capable of inducing adaptive behavior in real or artificial animals. Jean-Arcady Meyer is Director of Research at CNRS, Paris. Stewart W. Wilson is a Scientist at The Rowland Institute for Science, Cambridge, Massachusetts. Bradford Books imprint
Chapter
The Animals to Animats Conference brings together researchers from ethology, psychology, ecology, artificial intelligence, artificial life, robotics, engineering, and related fields to further understanding of the behaviors and underlying mechanisms that allow natural and synthetic agents (animats) to adapt and survive in uncertain environments. The work presented focuses on well-defined models (robotic, computer-simulation, and mathematical) that help to characterize and compare various organizational principles or architectures underlying adaptive behavior in both natural animals and animats. Bradford Books imprint.
Chapter
This is the fifteenth volume in the Machine Intelligence series, founded in 1965 by Donald Michie, and includes papers by a number of eminent AI figures including John McCarthy, Alan Robinson, Robert Kowalski and Mike Genesereth. The book is centred on the theme of intelligent agents and covers a wide range of topics, including: - Representations of consciousness (John McCarthy, Stanford University and Donald Michie, Edinburgh University) - SoftBots (Bruce Blumberg, MIT Media Lab) - Parallel implementations of logic (Alan Robinson, Syracuse University) - Machine learning (Stephen Muggleton, Oxford University) - Machine vision (Andrew Blake, Oxford University) - Machine-based scientific discovery in molecular biology (Mike Sternberg, Imperial Cancer Research Fund).
Book
Made-Up Minds addresses fundamental questions of learning and concept invention by means of an innovative computer program that is based on the cognitive-developmental theory of psychologist Jean Piaget. Drescher uses Piaget's theory as a source of inspiration for the design of an artificial cognitive system called the schema mechanism, and then uses the system to elaborate and test Piaget's theory. The approach is original enough that readers need not have extensive knowledge of artificial intelligence, and a chapter summarizing Piaget assists readers who lack a background in developmental psychology. The schema mechanism learns from its experiences, expressing discoveries in its existing representational vocabulary, and extending that vocabulary with new concepts. A novel empirical learning technique, marginal attribution, can find results of an action that are obscure because each occurs rarely in general, although reliably under certain conditions. Drescher shows that several early milestones in the Piagetian infant's invention of the concept of persistent object can be replicated by the schema mechanism.
Chapter
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
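Since R-learning is less familiar than Q-learning, a tabular sketch of its update rules may help; the environment interface (reset/step), exploration scheme, and step sizes below are illustrative assumptions, not a specific library API.

import numpy as np

def r_learning(env, n_states, n_actions, steps=100_000,
               alpha=0.1, beta=0.01, epsilon=0.1, seed=0):
    """Tabular R-learning (average-reward reinforcement learning).

    Maintains relative action values R[s, a] and a running estimate rho of
    the average reward per step. `env` is assumed to be a continuing-task
    environment exposing reset() -> state and step(action) -> (next_state, reward).
    """
    rng = np.random.default_rng(seed)
    R = np.zeros((n_states, n_actions))
    rho = 0.0
    s = env.reset()
    for _ in range(steps):
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(R[s]))
        s_next, reward = env.step(a)
        # relative-value update: rewards are measured against the average rho
        R[s, a] += alpha * (reward - rho + R[s_next].max() - R[s, a])
        # update rho only when the executed action is (still) greedy
        if R[s, a] == R[s].max():
            rho += beta * (reward - rho + R[s_next].max() - R[s].max())
        s = s_next
    return R, rho

The key difference from Q-learning is the explicit running estimate rho of the average reward, which takes the place of discounting in the undiscounted, average-reward setting discussed above.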