A Reinforcement Learning Method for
Maximizing Undiscounted Rewards
Anton Schwartz
Computer Science Dept.
Stanford University
Stanford, CA 94305
schwartz@cs.stanford.edu
To appear in Machine Learning: Proceedings of the Tenth International Conference,
Morgan Kaufmann, San Mateo, CA, 1993.
Abstract
While most Reinforcement Learning work utilizes temporal discounting to evaluate performance, the reasons for this are unclear. Is it out of desire or necessity? We argue that it is not out of desire, and seek to dispel the notion that temporal discounting is necessary by proposing a framework for undiscounted optimization. We present a metric of undiscounted performance and an algorithm for finding action policies that maximize that measure. The technique, which we call R-learning, is modelled after the popular Q-learning algorithm [17]. Initial experimental results are presented which attest to a great improvement over Q-learning in some simple cases.
1 Introduction
In the paradigm of Reinforcement Learning (RL), an agent finds itself in an environment and must learn by trial and error to take actions so as to maximize, over the long run, rewards which it receives in return. The techniques for accomplishing this estimate the rewards that performing an action may reap over the course of time, so as to choose actions maximizing that measure. This requires summarizing a possibly infinite sequence of rewards in a finite measure. To do this, researchers have relied on temporal discounting, a practice of giving exponentially diminishing importance to rewards far in the future, so as to summarize the rewards in a single finite number.
While temporal discounting is convenient, it comes at a price. For one, it may make behaviors with quick but mediocre results look more attractive than efficient behaviors reaping long-term benefits. Moreover, even when it favors behaviors with adequate farsightedness, it can greatly impede the process by which Q-learning arrives at its solutions.
In this paper we outline some of the problems caused by temporal discounting, and set out to provide a workable alternative. We put forth an undiscounted measure of policy performance, and present an algorithm designed to arrive at policies which maximize that measure. The method, which we call R-learning, resembles Q-learning inasmuch as it measures the value—for a different notion of value—of state-action pairs, relative to its current policy, and in any state recommends that action which maximizes value. The key element introduced is a measure of average reward, which serves as a standard of comparison. When examined according to a measure of utility that depends on this average, improvements to policies become more salient. This boosts performance in a number of ways.
While all of the principal ideas and techniques presented in this paper (including the formal notions of average and average-adjusted value, Theorem 3, and the R-learning algorithm) are the result of our own work, we became aware while writing that the ideas were first created and explored in the Dynamic Programming literature, often with much greater generality and elegance. As a result, this paper may serve as an introduction to the concepts of undiscounted Dynamic Programming, and as the first attempt to bring those concepts to bear on the problems and practices of Reinforcement Learning. The R-learning algorithm remains, to our knowledge, a novel technical contribution.
A more qualitative account of the ideas presented in this
paper is given in [13].
2 Background
In RL, the learner's environment is modelled as a Markov Decision Process (MDP). An MDP specifies a set of states and a set of actions. At each step in time the process is in some state, and an action must be chosen; this action has the effect of changing the current state and producing a scalar reinforcement value. The reinforcement value, or reward, represents the extent to which we can consider the action to have had immediately desirable or undesirable consequences. Formally, an MDP is described by a 4-tuple ⟨S, A, P, r⟩, where P : S × A × S → [0, 1] gives the probability (written P_{xy}(a)) of moving into state y when action a is performed in state x, and r : S × A → ℝ gives the corresponding expected reward. In this paper we will deal exclusively with finite MDP's, i.e., ones with finite sets S and A. A policy is a mapping from S to A, suggesting which action to perform in each state.¹ The goal of RL methods is to arrive, by performing actions and observing their outcomes, at a policy which maximizes the rewards accumulated over time.
Q-learning [17] is the most widely used and studied method for RL and, like most, it uses discounted value as its criterion of optimality. The discounted value, or discounted return, of a policy π in a state x is defined as the expected value

    V_γ^π(x) = E[ Σ_{t=0}^∞ γ^t r_t ]    (1)

where r_t, a random variable, represents the reward that will be received t time steps after the learner begins executing policy π in state x, and γ is a temporal discounting constant, 0 ≤ γ < 1.

Q-learning seeks to find a policy π which maximizes V_γ^π(x) in all states x. (We call such policies γ-optimal.) To do so, it makes particular use of the action-value form of (1),

    Q_γ^π(x, a) = V_γ^{a;π}(x)    (2)

in which a;π denotes the nonstationary policy which performs action a once and thereafter follows policy π. The algorithm for Q-learning is based on the relation

    Q_γ^π(x, a) = r(x, a) + γ Σ_y P_{xy}(a) V_γ^π(y)    (3)

and on the fact that for γ-optimal policies π (and no others),

    V_γ^π(x) = max_a Q_γ^π(x, a)  for all x.    (4)
Q-learning operates by maintaining a function Q̂ : S × A → ℝ, which initially maps all values to zero, and is updated every time an action is executed as follows. Let v ←_β z, for 0 ≤ β ≤ 1, denote the operation of assigning to variable v the value (1 − β)v + βz. If action a is performed in state x, resulting in an immediate reward r_imm and a transition into state y, then the value of Q̂ on input (x, a) is modified by performing

    Q̂(x, a) ←_β  r_imm + γ max_{a'} Q̂(y, a')    (5)

for an appropriate learning rate β. At any time, the Q-values induce a policy π_Q̂ which maps each state to the action maximizing Q̂. Given certain assumptions, this procedure is guaranteed to arrive at a policy which maximizes discounted value from every state [18].
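As a concrete illustration, Equation 5 amounts to only a few lines of code. The sketch below is our own and not part of the original presentation; it assumes a tabular representation in which Q is a dictionary mapping each state to a dictionary of action values, and the helper names (q_update, greedy_action) are ours.

    # Minimal sketch of the tabular Q-learning update of Equation 5 (illustrative only).
    def q_update(Q, x, a, r_imm, y, gamma, beta):
        """Apply Q(x,a) <-_beta  r_imm + gamma * max_a' Q(y,a')."""
        target = r_imm + gamma * max(Q[y].values())      # one-step lookahead estimate
        Q[x][a] = (1 - beta) * Q[x][a] + beta * target   # the "<-_beta" averaging operation

    def greedy_action(Q, x):
        """The policy induced by the current Q-values: the action maximizing Q(x, .)."""
        return max(Q[x], key=Q[x].get)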
In cases where the value V_1^π(x) exists and is finite (cf. Equation 1), V_1 can be used as an undiscounted measure of performance, and Q-learning with γ = 1 can be shown to converge to V_1-optimal policies [18]. But the MDP's of which the above stipulation holds are goal tasks in which one seeks to reach an absorbing set of “goal” states via a minimal path; and in those domains researchers tend to use discounted Q-learning for other reasons² (e.g., [8, 16]).

¹Except where otherwise noted, we use policy to refer to a stationary policy, i.e., one which does not vary over time [12].

Figure 1: A trivial MDP on which discounted and undiscounted measures may disagree. States are indicated by circles, actions are given in italics, their associated state transitions are given by arrows, and their immediate rewards are given in parentheses. [The figure shows three states: 1 ("Earth"), 2 ("Heaven"), and 3 ("Hell"). From state 1, action A leads to state 2 with reward +0 and action B leads to state 3 with reward +1000; in state 2 both actions yield +1 per step, and in state 3 both actions yield −1 per step.]
3 Measures of Performance
If we look at the graphs which papers in RL use to report the performance of their systems, we find that they are almost universally graphs of undiscounted measures: either total cumulative reward or average reward per time step (e.g., [6, 8, 9, 16]). But Q-learning maximizes a future-discounted measure of reward instead. The problem is that these criteria, in general, need not coincide.
3.1 Discounted Performance
For instance, consider the MDP of Figure 1. Here, attention to (undiscounted) future rewards will clearly mandate that a policy choose action A in state 1. But for any γ < 500/501 ≈ 0.998, Q_γ^π(1, B) > Q_γ^π(1, A) regardless of π, which makes methods such as Q-learning prefer action B. In fact, given any γ, there is some value we can set for the reward r(1, B) which makes the γ-discounted criterion favor action B over action A.

It is true that for any finite MDP there is some sufficiently large γ for which the discounted and undiscounted measures agree. However, proper choice of such a γ requires detailed knowledge of the domain—knowledge that we do not want to presuppose. Even with such knowledge, a parameter such as γ that needs to be tailored to suit individual domains is clearly undesirable.
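To make the arithmetic behind the 500/501 threshold explicit, the short computation below compares the two discounted values in state 1. It is our own illustration, under our reading of Figure 1: action A yields +0 and then +1 on every subsequent step, while action B yields +1000 and then −1 on every subsequent step.

    # Discounted values of the two choices in state 1 of Figure 1 (our reading of the figure).
    def v_choose_A(gamma):
        return 0 + gamma / (1 - gamma)        # +0 now, then +1 forever

    def v_choose_B(gamma):
        return 1000 - gamma / (1 - gamma)     # +1000 now, then -1 forever

    for gamma in (0.9, 0.99, 0.998, 0.999):
        preferred = "A" if v_choose_A(gamma) > v_choose_B(gamma) else "B"
        print(gamma, round(v_choose_A(gamma), 1), round(v_choose_B(gamma), 1), preferred)
    # The preference flips from B to A only once gamma exceeds 500/501, roughly 0.998.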
In light of these observations, one may wonder why researchers use discounting at all. We address several possible answers to this question:

Discounting to compensate for rate of interest. Discounted value is the primary criterion for performance in the field of Operations Research, where dynamic programming has been used and studied extensively. There the motivation is economic: one assumes that interest is available on earned rewards at the rate of 100i % per unit of time. In order to compensate for the interest one can earn on rewards achieved in the present, a reward of r units t time steps into the future must be evaluated by its present value, r(1 + i)^{−t}. This gives motivation for using γ = 1/(1 + i) in the above framework. However, this explanation can be dismissed out of hand by observing that RL researchers do not let interest accrue on rewards in their reportings of cumulative reward, nor do they evaluate the rewards in terms of their present value at any moment in time.

²These reasons, too lengthy to be presented in this paper, will be explained in detail elsewhere. They do not apply to the R-learning method.
Discounting to express the finiteness of an agent's lifetime. Another possible reason to value present rewards more than future rewards is the assumption that the agent might die sometime before it is able to reap future rewards. In this case, the γ represents an assumption that the agent has a probability 1 − γ of dying at any step in time, in which case all future rewards will be zero. A different intuition with the same mathematical interpretation is this: Discounting to express uncertainty about the future. Some have argued that future rewards may be uncertain because of a changing environment [1], and use discounting to reflect that any reward expected at future time t may turn out to be zero instead with a probability of 1 − γ^t. But both of these interpretations of discounting beg the following question: If researchers assume that agents may die or that rewards may turn to naught, why do they measure performance in domains where neither of these eventualities comes to pass?
This paper will proceed on the premise that the reason why people use discounting is none of these but a more practical one: By discounting future rewards, one makes their infinite sum finite. Which is to say, researchers use a discounted value measure because there exists no workable alternative. We proceed to provide such an alternative.
3.2 Undiscounted Performance
We wish to compare total undiscounted rewards reaped by a policy from a state; but since policies are likely to accrue rewards steadily over time, it is generally impossible to merely compare the infinite sums of rewards. Naturally, we turn to some comparison of the finite cumulative sums. Define the n-period value of a policy π as

    V_n^π(x) = E[ Σ_{t=0}^{n−1} r_t ]

One way to obtain a finite measure is to look at the average reward per unit of time incurred by a policy over the long run. We define the average reward of a policy π started in state x as

    ρ^π(x) = lim_{n→∞} (1/n) V_n^π(x)
While a policy needs to maximize average reward in order for us to consider it optimal in an undiscounted sense, the converse may not be true. For example, in some state x, two policies π₁ and π₂ may yield the same average reward even though π₁ will constantly outperform π₂ in that V_n^{π₁}(x) − V_n^{π₂}(x) ≥ c > 0 for large n. In such cases we would like to have an additional measure which is sensitive to such constants c. One possibility is

    lim_{n→∞} [ V_n^π(x) − n·ρ^π(x) ]

the limiting difference between cumulative performance and the line through the origin with slope ρ^π(x). (We may think of this line as the cumulative performance of the hypothetical reference case in which π were to reap its average reward at every step.) This measure, while intuitive, may not be well defined when the policy reaches periodic limit cycles. But in such cases we can generalize it to a measure which is guaranteed to exist. We define the average-adjusted value [5] of policy π in state x as the Cesàro or “limit in the average” [4] version of the simpler expression above:

    R^π(x) = lim_{N→∞} (1/N) Σ_{n=0}^{N−1} [ V_n^π(x) − n·ρ^π(x) ]    (6)

If we summarize the performance of a policy π in state x by a linear function in n approximating V_n^π(x), then ρ^π(x) is that line's slope and R^π(x) the y-intercept.
These two values, then, provide us with an undiscounted standard of performance. We wish foremost to maximize ρ and secondarily (ρ's being equal) to maximize R. So let us define the total value of a policy π from state x as the ordered pair

    T^π(x) = ⟨ρ^π(x), R^π(x)⟩

and use lexicographical order as the ordering on T. That is, we say that T₁ ≥ T₂ if and only if ρ₁ > ρ₂, or ρ₁ = ρ₂ and R₁ ≥ R₂.

We define undiscounted action-values analogously to the discounted case:

    T^π(x, a) = T^{a;π}(x)

To make parallels to the discounted case clear, we will use ρ^π(x, a) and R^π(x, a) to refer to the two components of T^π(x, a), respectively. Let us write T^{π₁} ≥ T^{π₂} to mean that T^{π₁}(x) ≥ T^{π₂}(x) for all states x. We say that π is T-optimal whenever T^π ≥ T^{π'} for all policies π'.
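The ordering just defined is plain lexicographic comparison of the pairs ⟨ρ, R⟩. The helper below is ours and purely illustrative, with a total value represented as a Python pair (rho, R).

    def total_value_geq(T1, T2):
        """Lexicographic ordering on total values represented as (rho, R) pairs."""
        rho1, R1 = T1
        rho2, R2 = T2
        return rho1 > rho2 or (rho1 == rho2 and R1 >= R2)

For exact values this coincides with Python's built-in tuple comparison T1 >= T2.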
Like γ-optimal policies, stationary T-optimal policies are guaranteed to exist for finite MDP's [5]. We will henceforth use π* to denote T-optimal policies.

The notion of T-optimality exists under a variety of names in the literature [11]. It is a stronger condition than average-reward optimality, but weaker than other forms including Blackwell optimality [3]. Blackwell optimality, the most selective one in the literature according to which all finite MDP's have optimal stationary policies, may be viewed as lexicographically maximizing an infinite series of which ρ and R are the first two terms. So far we have not seen the need to utilize the higher order terms.
We briefly explore an important property of average reward:

Fact 1 If two states x and y are such that executing policy π from either state leads to the same ergodic set³ of states in S, then ρ^π(x) = ρ^π(y).

This has the following consequence:

Corollary 2 For any finite MDP, either
1. ρ^{π*}(x) is independent of x, or
2. There are states between which no policy can ever guarantee passage.
The corollary tells us that in all cases of interest to RL, the
optimal policy has a single average reward, since MDP’s of
which the second clause holds would normally violate the
frequency of visit assumptions required for the convergence
of stochastic approximation methods such as Q-learning
[18].
While for many MDP's there exist policies that induce multiple ergodic sets, a simplification commonly made in the Dynamic Programming literature is that all policies are unichain, i.e., give rise to a single ergodic set [11]. In light of Fact 1, this lets one assume that all policies, not only π*, achieve a single state-independent average reward. At times we will make this assumption so as to facilitate the presentation of R-learning. A discussion of the behavior of R-learning in the general multichain case is forthcoming.
Attention to R may seem frivolous, as R represents merely a constant offset to cumulative value, likely to be quickly dominated by the repeated contributions of ρ. But for terminal goal tasks, average reward is merely a function of the goal state that is reached, invariant of the behavior that leads to the goal.⁴ Thus, R is of great importance; it is the entire measure of efficiency in reaching the goal.

Moreover, even if we are only interested in maximizing average reward, R turns out to be a crucial instrument in that maximization. For any policy π, states which lead to the same ergodic set—and all states do, given the unichain assumption—share a common value of ρ^π, so their total values differ only with regard to their second component, R^π. Likewise, for all actions that lead to the same ergodic set, the action-values T^π(x, a) share a constant ρ component. Clearly, then, R is the key to policy improvement. When an action a is found for which R^π(x, a) > R^π(x, π(x)), then changing π to choose a in x results in an improvement of R^π(x), increasing ρ^π(x) if x is a recurrent state under π.
³Any policy, applied to an MDP, gives rise to a Markov chain. An ergodic set of a Markov chain is a minimal set of states which, once entered, will never be left [7].

⁴Any policy which reaches the goal, strictly speaking, remains there forever. The average reward of a policy reflects this fact, even though a researcher generally stops the simulation at that point and begins a new trial from a different state. In practice, learners are rarely if ever allowed to execute a fixed stationary policy, so ρ^π need not express the average reward observed during experimental trials.
4 The Connection Between Discounted and
Undiscounted Value
Before proposing a method for learning T-optimal policies,
we note a connection between discounted and undiscounted
value. The main result of interest is the following [11]:
Theorem 3 For any policy π and state x,

    V_γ^π(x) = ρ^π(x)/(1 − γ) + R^π(x) + e_γ^π(x),  where lim_{γ→1} e_γ^π(x) = 0.

We may read this statement as saying that for values of γ approaching 1, the discounted value V_γ^π(x) is composed of two nonvanishing terms: one that is a large constant multiple of the average reward expected from starting policy π in state x, and one that is the average-adjusted value of π. We will refer to these as the ρ-term and R-term, respectively. So for γ close to 1, V_γ may be seen as approximating the lexicographical preference of T by giving much more weight to ρ than to R. But the approximation is imperfect, as is manifest in the fact that V_γ dictates opting for a quick constant bonus in reward over a long-term improvement in average reward when ΔR, the gain in finite short-term reward, is greater than Δρ/(1 − γ), the scaled difference in long-term average reward. This explains the faulty choice in the example of Figure 1.
Conversely, a simple corollary of Theorem 3 lets us understand average-adjusted value in terms of discounted value:

Corollary 4 For any policy π and state x,

    R^π(x) = lim_{γ→1} E[ Σ_{t=0}^∞ γ^t (r_t − ρ^π(x)) ]    (7)

If we think of the quantities r_t − ρ^π(x) as average-adjusted rewards [13], then this corollary lets us view R^π(x) as the expected discounted sum of average-adjusted rewards, for vanishingly little discounting (cf. Equation 1).
5 Learning T-Optimal Policies
When we express average-adjusted values in the form of Equation 7, it is easy to verify that they obey the following recurrence relation:

    R^π(x) = r(x, π(x)) − ρ^π(x) + Σ_y P_{xy}(π(x)) R^π(y)    (8)

The recurrence, like the expression for value in Corollary 4, is analogous to the discounted case, but with the following two modifications:

1. Rewards are average-adjusted by subtracting out ρ.
2. The effect of γ is eliminated by bringing γ arbitrarily close to 1.
We may see that the first modification enables the second as follows: In the representation of V_γ^π given by Theorem 3, the first term is the one which blows up for γ close to 1 when ρ ≠ 0; this is what prevents us from simply using the undiscounted sum V_1 in the first place. But average-adjusting the incoming rewards reduces the problem to one where ρ = 0, eliminating the first term altogether. This done, we are free to let γ approach 1 in order to eliminate the contribution of the high order term e_γ^π(x). What is left is a measure of the R-term alone which, as we have mentioned, is precisely that upon which we wish to base action selection.
By applying these two modifications to the standard Q-learning algorithm, what results is a technique for approximating R^π(x, a) instead of Q_γ^π(x, a), and hence maximizing total rather than γ-discounted value. The only additional machinery needed is a mechanism for approximating the average reward ρ of the successive policies suggested by the algorithm. The method, R-learning, is now presented:
The R-Learning Algorithm

1. Begin with a table of real numbers R(x, a), all initialized to zero, and a real-valued variable ρ, also initialized to zero.

2. Repeat:

2a. From the current state x, perform an action a, chosen by some exploration/action-selection mechanism (this is an orthogonal component of the system, just as it is for Q-learning). Observe the immediate reward r_imm received and the subsequent state y.

2b. Update R(x, a) according to:

    R(x, a) ←_β  r_imm − ρ + R(y)    (9)

where R(y) denotes max_{a'} R(y, a') and β is a learning rate parameter.

2c. If R(x, a) = R(x) (i.e., if a agrees with the policy π_R), then update ρ according to:

    ρ ←_α  r_imm + R(y) − R(x)    (10)

where α is a learning rate parameter.
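For concreteness, one pass through steps 2b and 2c can be written out as follows. This is our own rendering of the algorithm as stated above, not code from the paper; R is assumed to be a dictionary of per-state action-value dictionaries, rho a float, and beta and alpha the two learning rates.

    # Sketch of one R-learning update (steps 2b and 2c); the exploration mechanism
    # that selects the action a is left abstract, as in the text.
    def r_learning_step(R, rho, x, a, r_imm, y, beta, alpha):
        """Returns the updated rho; the table R is modified in place."""
        R_x = max(R[x].values())              # R(x) = max_a' R(x, a'), taken before the update
        R_y = max(R[y].values())              # R(y) = max_a' R(y, a')
        greedy = (R[x][a] == R_x)             # did a agree with the policy pi_R?
        # Step 2b: R(x,a) <-_beta  r_imm - rho + R(y)
        R[x][a] = (1 - beta) * R[x][a] + beta * (r_imm - rho + R_y)
        # Step 2c: only on greedy steps, rho <-_alpha  r_imm + R(y) - R(x)
        if greedy:
            rho = (1 - alpha) * rho + alpha * (r_imm + R_y - R_x)
        return rho

Whether step 2c should see the table before or after the step 2b update is not pinned down above; the sketch uses the pre-update values.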
One might wonder why we do not simply approximate ρ via exponential averaging of immediate rewards (i.e., ρ ←_α r_imm), performed on every tick. This is so as to restrict our attention to the policy π_R, uninfluenced by the conflicting behavior of the exploration mechanism that is ultimately responsible for action choice. Exploratory actions, which generally incur subaverage rewards, would skew the approximation of ρ if included in the calculation.
The appearance of the term R(y) − R(x) in Equation 10 plays the role of compensating for known variations in the rewards received in different states; it serves to adjust for periods of low reward when the agent is using its optimal policy to recover from suboptimal exploratory actions, and more generally to minimize the variance of the values used to estimate ρ.
One may easily show that whenever R-learning converges it
must arrive at a T-optimal policy. But unlike the discounted
methods, whose mathematics have been extensively ex-
plored (e.g., [18]), a proof of the convergence of R-learning
has not been established. Nonetheless, related techniques
such as undiscounted Policy Improvement [5] are well un-
derstood, and work toward convergence results is currently
under way.
6 Advantages of R-Learning
R-learning is designed to arrive at T-optimal policies, and
we have already argued for the advantages of undiscounted
performance criteria. But in addition to maximizing undis-
counted performance, R-learning displays several compu-
tational advantages over existing techniques. As a result,
even when one can be sure of choosing γ so as to allow Q-learning to arrive at T-optimal policies, it is often preferable to use R-learning (bearing in mind that the algorithm has not yet been proven to converge).
For discounted methods such as Q-learning to arrive at policies that do not overlook temporally distant rewards, they must use large values of γ. But it is well known that for successive approximation techniques the geometric rate of decay in the approximation error is proportional to γ, so that high values of γ can slow the convergence dramatically [2]. As a result, one might expect an undiscounted method such as R-learning to converge extremely slowly. In fact the opposite is generally true, for the following reasons:
6.1 Better Initial Estimates
In cases where the optimal policy has nonzero average reward and γ is near 1, Q-learning spends much of its time converging to the large ρ-term of Q_γ, which is invariant of the action, whereas the R-term is the one needed for action choice. During this time, the true values of Q_γ necessary for action selection may be entirely obscured by approximation error, causing poor performance. R-values, in contrast, have no contribution from ρ. As a result, the values, which begin at zero, already reflect their ρ-term, and need only converge on their R-term.
6.2 Faster Propagation of Rewards
Many researchers have pointed out that information about rewards can be propagated very slowly across states by Q-learning [10, 16]. In particular, we note that the ρ-terms of the Q-values, though constant over a state space, may be updated only locally, and one state at a time. By contrast, R-learning uses ρ to effectively store a common ρ-term for all its action-values. This term is updated on every iteration, and lets information be propagated instantly throughout the state space. Section 7.2 presents experimental data attesting to the resulting speedup.
6.3 Value Disambiguation
Even for MDP's where γ-optimal policies are T-optimal for small γ—in which case one would hope for Q-learning to converge rapidly despite the above concerns—there turn out to be compelling reasons to prefer R-learning. Let us examine any case where temporal credit assignment is necessary—where some T-suboptimal action gives immediate reward as large as that of the T-optimal action. That is, for some state x of an MDP with T-optimal policy π*, r(x, a) ≥ r(x, π*(x)) but T^{π*}(x, a) < T^{π*}(x, π*(x)). Then one may show that there is a value γ₀ such that 0 ≤ γ₀ < 1, and for which Q_γ^{π*}(x, π*(x)) > Q_γ^{π*}(x, a) holds for all γ₀ < γ < 1.⁵ To be sure to prefer the T-optimal action, one must choose a γ within this interval. This granted, one would like to choose a γ at the very bottom of the range so as to speed convergence. But choosing γ arbitrarily close to γ₀ will result in an arbitrarily small difference between Q_γ^{π*}(x, π*(x)) and Q_γ^{π*}(x, a).⁶ This is problematic for at least the following two reasons:
First, for stochastic MDP’s, approximation errors of Q-
values are due not only to the process of value iteration,
but also to the Monte Carlo sampling it employs. When
differences in true action-values for competing actions are
small they are more likely to be obscured by such errors,
which may cause suboptimal actions to be chosen. An
example of this phenomenon is presented in Section 7.1.
Secondly, many popular exploration methods select actions
stochastically according to their relative action-values [6,
16, 17]. In such cases, an action whose Q-value differs
from the optimal one by a small amount will be chosen
almost as frequently as the optimal action. This may result
in a substantial loss of cumulative reward.
One may see that R-learning minimizes these problems, because using average-adjusted value maximizes the differences in value in the following sense:

    R^{π*}(x, π*(x)) − R^{π*}(x, a) = sup_{0<γ<1} [ Q_γ^{π*}(x, π*(x)) − Q_γ^{π*}(x, a) ]
In summary, choosing γ to be large for Q-learning results in poor initial estimates and slow convergence, while choosing γ small, when it does not result in T-suboptimal solutions, may reduce performance dramatically for other reasons. R-learning eliminates the hazards that accompany small values of γ, while using other means to speed convergence.
⁵This follows from the proof of existence of stationary Blackwell optimal policies ([3], Theorem 5).

⁶In goal tasks where nonzero reward is given only upon achieving the goal, this is the familiar problem wherein temporal differencing causes all actions in states distant from the goal to have approximately zero value.
6.4 Linearity of Undiscounted Values
In the case of a deterministic MDP, we may read the recurrence for R (Equation 8) as telling us that

    R^π(x) = r(x, π(x)) − ρ^π + R^π(y)

where y is the state reached by performing π(x) in x. Since unichain policies have ρ^π constant, it follows that for regions of S where following π incurs a constant reward, the average-adjusted value of successive states encountered will change linearly. By contrast, the discounting present in Q-values makes increases and decreases in those values exponential, growing in magnitude with temporal proximity to rewards.

Linearity is very desirable in the case where S is a metric space, and the policy action causes uniform movement along some dimension of S. In this case the R-values for the policy action vary linearly over that dimension, potentially permitting the use of simple and accurate linear interpolation to speed up learning and even allowing learning over continuous state spaces.
Finally, the linearity of values can be useful in situations
where we want to combine action-values for multiple indi-
vidual tasks to determine values for a composite task. Singh
[14] has explored this possibility for sets of goal tasks.
6.5 Q-Learning Seen as a Special Case of R-learning
Occasions might arise where temporal discounting is a useful feature for any of the reasons given in Section 3.1. As we saw, the discounted value measures total value in the case where at each point in time there is a probability 1 − γ of an exogenous incident after which all rewards are zero. Given any MDP, we may introduce this assumption explicitly by adding one absorbing state ω with autotransitions yielding reward zero, to which all actions from all other states lead with probability 1 − γ. (Such actions receive their normal reward, since it is only the state they result in that changes.) Accordingly, we multiply all preexisting transition probabilities by γ. Now let quantities marked by a dot denote values pertaining to the modified MDP. We first observe that for all policies π and all states x of the new decision problem, ρ̇^π(x) = 0, and build this fact into the algorithm by eliminating step 2c. Now by Equation 8 we have, for all x ≠ ω,

    Ṙ^π(x) = ṙ(x, π(x)) − ρ̇^π(x) + Σ_y Ṗ_{xy}(π(x)) Ṙ^π(y)
           = r(x, π(x)) + γ Σ_{y≠ω} P_{xy}(π(x)) Ṙ^π(y) + (1 − γ) Ṙ^π(ω)
           = r(x, π(x)) + γ Σ_y P_{xy}(π(x)) Ṙ^π(y)

since Ṙ^π(ω) = 0. In other words, Ṙ^π obeys the same recurrence relation as V_γ^π. By modifying the R-learning update of step 2b to reflect the recurrence of Ṙ instead of R, we are left with precisely the Q-learning algorithm.
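The construction just described is easy to write down explicitly. The sketch below is ours (not from the paper); it assumes the MDP is given as NumPy arrays P of shape (S, A, S) and r of shape (S, A), and it appends the absorbing state as index S.

    import numpy as np

    def add_absorbing_state(P, r, gamma):
        """Build the modified MDP in which every action also leads, with probability
        1 - gamma, to a new zero-reward absorbing state omega (index S)."""
        S, A, _ = P.shape
        P_dot = np.zeros((S + 1, A, S + 1))
        P_dot[:S, :, :S] = gamma * P       # scale all preexisting transitions by gamma
        P_dot[:S, :, S] = 1 - gamma        # every action may jump to omega instead
        P_dot[S, :, S] = 1.0               # omega is absorbing under every action
        r_dot = np.zeros((S + 1, A))
        r_dot[:S, :] = r                   # actions keep their normal rewards; omega yields 0
        return P_dot, r_dot

Since every policy on the modified process has zero average reward, running R-learning on it with step 2c removed leaves exactly the recurrence derived above.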
Figure 2: A simple MDP with temporally distant rewards. [The figure shows a cycle of 50 states, numbered 0 through 49; in each state the actions move and stay yield reward +0, except for move in state 49, which yields +50 and returns the process to state 0.]
Figure 3: Performance of R-Learning versus Q-learning. [Average reward per 5000-action interval (y-axis, 0.3 to 1) plotted against the number of actions performed (x-axis, 0 to 250,000), one curve per method.]
7 Experimental Results
7.1 Value Disambiguation
An initial experiment compared Q-learning and R-learning in the simple domain pictured in Figure 2, modified to yield stochastic rewards that always deviate from the values given by either +1 or −1 (with equal probability). Figure 3 shows the results. The x-axis measures number of actions performed, while the y-axis measures average reward per 5000-action interval. (The results are averaged over 50 trials.) We used random exploration with a fixed probability 0.05 of a random action at any time step; an exploration method that favors actions with near-optimal values would make the advantage of R-learning more pronounced.
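For reference, the environment used in this experiment can be reconstructed roughly as follows. This is our own reading of Figure 2 and of the description above (a 50-state cycle with a single +50 payoff and ±1 reward noise), not code from the original study.

    import random

    N_STATES = 50   # states 0 through 49, as in Figure 2

    def step(x, action):
        """One transition of the (noisy) Figure 2 domain; action is 'move' or 'stay'.
        Every reward deviates from its nominal value by +1 or -1 with equal probability."""
        if action == 'move':
            y = (x + 1) % N_STATES
            nominal = 50 if x == N_STATES - 1 else 0   # +50 only on the move out of state 49
        else:
            y, nominal = x, 0
        return y, nominal + random.choice((+1, -1))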
Figure 4: Performance of R-Learning versus Q-learning (all rewards increased by 100). [Average reward per 5000-action interval (y-axis, 100 to 101) plotted against the number of actions performed (x-axis, 0 to 250,000).]
Figure 5: An MDP with two cycles. Action choice is irrelevant except in state 0. [From state 0, one action enters a long loop through states 1–19 whose final transition back to state 0 yields +30, and the other enters a short loop through states 1'–9' whose final transition back to state 0 yields +10; all other rewards are +0.]
Figure 6: Comparison of learning mechanisms on an MDP with two cycles. Performance of Q-learning is poor compared to R-learning because of ramping. [Average reward (y-axis, 1 to 1.5) plotted against the number of actions performed (x-axis, 0 to 60,000).]
Figure 4 shows a comparison of Q-learning and R-learning in the same domain with all rewards increased uniformly by 100. Notice the period in which R-learning stalls while its estimate of ρ climbs from an initial value of zero up to the final value of 101.

Both runs use γ = 0.9 and β = 0.2. Because of the exploration strategy, an optimal policy will have an average reward of 0.95. Note that the fixed β is responsible for the fact that the Q-values never converge to the optimal value.
7.2 Convergence Rate
A second experiment used the double-loop domain shown in Figure 5. Average rewards of the two methods are plotted in Figure 6, compiled over 100 runs. The reason for the poor performance by Q-learning is that the shorter cycle, though it gives less per-step payoff than the longer one, allows rewards to be propagated more quickly. As a result, its Q-values converge faster than the longer one's, making it look more favorable during the long process of convergence.

These experiments use the same parameters as the previous ones, except that here γ is increased to 0.99 to allow Q-learning to learn the T-optimal policy at all. For larger γ, the effect is even more pronounced: When γ = 0.999 instead of 0.99, the speedup of R-learning over Q-learning is more than forty-fold.
8 Related Work
In Section 2 we remarked that Q-learning may be used to find V_1-optimal policies in cases where that undiscounted measure is finite. Now we may further observe that in such cases, ρ^π = 0, R^π = V_1^π, and the recurrence for R reduces to that of V_1. In this sense, the theory of total value subsumes that of V_1, and R-learning is a generalization of undiscounted Q-learning.
Whereas Q-learning is an asynchronous, stochastic approximation version of the well-understood dynamic programming method of value iteration, R-learning has no such precise analog in the literature. Though several successive approximation algorithms exist for computing T, none of them approximate ρ explicitly; this added complexity present in R-learning, while making it more robust in the multichain case, necessitates new methods of analysis, which we are working to develop. We know of no successive approximation methods in the Dynamic Programming literature for handling the multichain case.
We have recently learned of the related work of Westerdale [19], who presents a bucket brigade technique based on the notions of average and average-adjusted value. Since he does not draw any connections to the Dynamic Programming or Reinforcement Learning literature, the precise contribution of this work is difficult to assess.
The general notion that performance should be viewed relative to a standard of reference, which underlies average-adjusted value and the workings of R-learning, has appeared in many places in the psychological literature, as well as in Sutton's Reinforcement Comparison algorithm [15]. A presentation of R-learning from a more psychological standpoint is offered in [13].
9 Conclusion
Until now, most of the work in RL has maintained one stan-
dard of performance while using algorithms that maximize
another. The discounted algorithms have been well understood, but the performance standards have not: In goal tasks, maximal undiscounted value V_1 has been sought, while, in recurrent domains, high average reward has been the aim. The concept of total reward presented here subsumes these two criteria, and is the basis for R-learning, a method we have proposed to maximize them both. We have used total reward as a tool to gain a better understanding of some of the computational shortcomings of the discounted techniques, and have shown that R-learning may ameliorate many of those problems.
While informal tests have shown R-learning to be applicable to a variety of domains, more empirical and theoretical work is necessary to establish its viability as a robust algorithm for reinforcement learning. Some of this work is currently under way.
Acknowledgements
I wish to thank Nils Nilsson, Rich Sutton, George John, and
Moshe Tennenholtz for their helpful comments. This work
was supported by an IBM Graduate Fellowship.
References
[1] A. G. Barto, R. S. Sutton, and C. J. C. H. Watkins. Learning
and sequential decision making. Technical Report COINS
89-95, Dept. of Computer and Information Science, Univer-
sity of Massachusetts, Amherst, 1989.
[2] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed
Computation: Numerical Methods. Prentice Hall, Engle-
wood Cliffs, NJ, 1989.
[3] D. Blackwell. Discrete dynamic programming. Ann. Math.
Statist., 33:719–726, 1962.
[4] G. H. Hardy. Divergent Series. Clarendon Press, Oxford,
1949.
[5] R. A. Howard. Dynamic Programming and Markov Pro-
cesses. MIT Press, Cambridge, MA, 1960.
[6] L. P. Kaelbling. Learning in Embedded Systems. PhD thesis,
Stanford University, 1990.
[7] J. G. Kemeny and J. L. Snell. Finite Markov Chains. Van
Nostrand, Princeton, NJ, 1960.
[8] L.-J. Lin. Programming robots using reinforcement learning and teaching. In Proceedings AAAI-91, pages 781–786. MIT Press, Cambridge, MA, 1991.
[9] S. Mahadevan and J. Connell. Automatic programming of
behavior-based robots using reinforcement learning. In Pro-
ceedings AAAI-91, pages 768–773. MIT Press, Cambridge,
MA, 1991.
[10] R. A. McCallum. Using transitional proximity for faster
reinforcement learning. In Proceedings of the Ninth Inter-
national Workshop on Machine Learning, pages 316–321.
Morgan Kaufmann, San Mateo, CA, 1992.
[11] M. L. Puterman. Markov decision processes. In D. P. Hey-
man and M. J. Sobel, editors, Handbooks in OR & MS, Vol. 2,
pages 331–434. Elsevier, North-Holland, 1990.
[12] S. M. Ross. Introduction to Stochastic Dynamic Program-
ming. Academic Press, New York, 1983.
[13] A. Schwartz. Thinking locally to act globally: A novel
approach to reinforcement learning. In Proceedings of the
Fifteenth Annual Conference of the Cognitive Science Soci-
ety. Lawrence Erlbaum, Hillsdale, NJ, 1993.
[14] S. P. Singh. Transfer of learning by composing solutions for
elemental sequential tasks. Machine Learning, 8(3/4):323–
339, May 1992.
[15] R. S. Sutton. Temporal Credit Assignment in Reinforcement
Learning. PhD thesis, Department of Computer and Infor-
mation Sciences, University of Massachusetts, 1984.
[16] R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Workshop on Machine Learning, pages 216–224. Morgan Kaufmann, San Mateo, CA, 1990.
[17] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD
thesis, King’s College, Cambridge, 1989.
[18] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine
Learning, 8:279–292, 1992.
[19] T. H. Westerdale. Quasimorphisms or queasymorphisms? Modeling finite automaton environments. In G. J. E. Rawlins, editor, Foundations of Genetic Algorithms, pages 128–147. Morgan Kaufmann, San Mateo, CA, 1991.