A Reinforcement Learning Method for
Maximizing Undiscounted Rewards
Anton Schwartz
Computer Science Dept.
Stanford University
Stanford, CA 94305
schwartz@cs.stanford.edu
To appear in Machine Learning: Proceedings of the Tenth International Conference,
Morgan Kaufmann, San Mateo, CA, 1993.
Abstract
While most Reinforcement Learning work utilizes temporal discounting to evaluate performance, the reasons for this are unclear. Is it out of desire or necessity? We argue that it is not out of desire, and seek to dispel the notion that temporal discounting is necessary by proposing a framework for undiscounted optimization. We present a metric of undiscounted performance and an algorithm for finding action policies that maximize that measure. The technique, which we call R-learning, is modelled after the popular Q-learning algorithm [17]. Initial experimental results are presented which attest to a great improvement over Q-learning in some simple cases.
1 Introduction
In the paradigm of Reinforcement Learning (RL), an agent finds itself in an environment and must learn by trial and error to take actions so as to maximize, over the long run, rewards which it receives in return. The techniques for accomplishing this estimate the rewards that performing an action may reap over the course of time, so as to choose actions maximizing that measure. This requires summarizing a possibly infinite sequence of rewards in a finite measure. To do this, researchers have relied on temporal discounting, a practice of giving exponentially diminishing importance to rewards far in the future, so as to summarize the rewards in a single finite number.
While temporal discounting is convenient, it comes at a price. For one, it may make behaviors with quick but mediocre results look more attractive than efficient behaviors reaping long-term benefits. Moreover, even when it favors behaviors with adequate farsightedness, it can greatly impede the process by which Q-learning arrives at its solutions.
In this paper we outline some of the problems caused by temporal discounting, and set out to provide a workable alternative. We put forth an undiscounted measure of policy performance, and present an algorithm designed to arrive at policies which maximize that measure. The method, which we call R-learning, resembles Q-learning inasmuch as it measures the value—for a different notion of value—of state-action pairs, relative to its current policy, and in any state recommends that action which maximizes value. The key element introduced is a measure of average reward, which serves as a standard of comparison. When examined according to a measure of utility that depends on this average, improvements to policies become more salient. This boosts performance in a number of ways.
While all of the principal ideas and techniques presented in this paper (including the formal notions of average and average-adjusted value, Theorem 3, and the R-learning algorithm) are the result of our own work, we became aware while writing that the ideas were first created and explored in the Dynamic Programming literature, often with much greater generality and elegance. As a result, this paper may serve as an introduction to the concepts of undiscounted Dynamic Programming, and as the first attempt to bring those concepts to bear on the problems and practices of Reinforcement Learning. The R-learning algorithm remains, to our knowledge, a novel technical contribution.
A more qualitative account of the ideas presented in this
paper is given in [13].
2 Background
In RL, the learner's environment is modelled as a Markov Decision Process (MDP). An MDP specifies a set of states and a set of actions. At each step in time the process is in some state, and an action must be chosen; this action has the effect of changing the current state and producing a scalar reinforcement value. The reinforcement value, or reward, represents the extent to which we can consider the action to have had immediately desirable or undesirable consequences. Formally, an MDP is described by a 4-tuple ⟨S, A, P, r⟩, where P : S × A × S → [0, 1] gives the probability (written P_{xy}(a)) of moving into state y when action a is performed in state x, and r : S × A → ℝ gives the corresponding expected reward. In this paper we will deal exclusively with finite MDP's, i.e., ones with finite sets S and A. A policy is a mapping from S to A, suggesting which action to perform in each state.¹ The goal of RL methods is to arrive, by performing actions and observing their outcomes, at a policy which maximizes the rewards accumulated over time.
Q-learning [17] is the most widely used and studied method for RL and, like most, it uses discounted value as its criterion of optimality. The discounted value, or discounted return, of a policy π in a state x is defined as the expected value

    V_γ^π(x) = E[ Σ_{t=0}^∞ γ^t r_t ]    (1)

where r_t, a random variable, represents the reward that will be received t time steps after the learner begins executing policy π in state x, and γ is a temporal discounting constant, 0 ≤ γ < 1.

Q-learning seeks to find a policy π which maximizes V_γ^π(x) in all states x. (We call such policies γ-optimal.) To do so, it makes particular use of the action-value form of (1),

    Q_γ^π(x, a) = V_γ^{a;π}(x)    (2)

in which a;π denotes the nonstationary policy which performs action a once and thereafter follows policy π. The algorithm for Q-learning is based on the relation

    Q_γ^π(x, a) = r(x, a) + γ Σ_y P_{xy}(a) V_γ^π(y)    (3)

and on the fact that for γ-optimal policies π (and no others),

    V_γ^π(x) = max_a Q_γ^π(x, a)  for all x.    (4)
Q-learning operates by maintaining a function Q̂ : S × A → ℝ, which initially maps all values to zero, and is updated every time an action is executed as follows. Let v ←_β z, for 0 ≤ β ≤ 1, denote the operation of assigning to variable v the value (1 − β)v + βz. If action a is performed in state x, resulting in an immediate reward r_imm and a transition into state y, then the value of Q̂ on input (x, a) is modified by performing

    Q̂(x, a) ←_β  r_imm + γ max_{a'} Q̂(y, a')    (5)

for an appropriate learning rate β. At any time, the Q-values induce a policy π_Q̂ which maps each state to the action maximizing Q̂. Given certain assumptions, this procedure is guaranteed to arrive at a policy which maximizes discounted value from every state [18].
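As a concrete illustration, Equation 5 amounts to only a few lines of code. The sketch below is our own and not part of the original presentation; it assumes a tabular representation in which Q is a dictionary mapping each state to a dictionary of action values, and the helper names (q_update, greedy_action) are ours.

    # Minimal sketch of the tabular Q-learning update of Equation 5 (illustrative only).
    def q_update(Q, x, a, r_imm, y, gamma, beta):
        """Apply Q(x,a) <-_beta  r_imm + gamma * max_a' Q(y,a')."""
        target = r_imm + gamma * max(Q[y].values())      # one-step lookahead estimate
        Q[x][a] = (1 - beta) * Q[x][a] + beta * target   # the "<-_beta" averaging operation

    def greedy_action(Q, x):
        """The policy induced by the current Q-values: the action maximizing Q(x, .)."""
        return max(Q[x], key=Q[x].get)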
In cases where the value V_1^π(x) exists and is finite (cf. Equation 1), V_1 can be used as an undiscounted measure of performance, and Q-learning with γ = 1 can be shown to converge to V_1-optimal policies [18]. But the MDP's of which the above stipulation holds are goal tasks in which one seeks to reach an absorbing set of “goal” states via a minimal path; and in those domains researchers tend to use discounted Q-learning for other reasons² (e.g., [8, 16]).

¹Except where otherwise noted, we use policy to refer to a stationary policy, i.e., one which does not vary over time [12].

Figure 1: A trivial MDP on which discounted and undiscounted measures may disagree. States are indicated by circles, actions are given in italics, their associated state transitions are given by arrows, and their immediate rewards are given in parentheses. [The figure shows three states: 1 ("Earth"), 2 ("Heaven"), and 3 ("Hell"). From state 1, action A leads to state 2 with reward +0 and action B leads to state 3 with reward +1000; in state 2 both actions yield +1 per step, and in state 3 both actions yield −1 per step.]
3 Measures of Performance
If we look at the graphs which papers in RL use to report the performance of their systems, we find that they are almost universally graphs of undiscounted measures: either total cumulative reward or average reward per time step (e.g., [6, 8, 9, 16]). But Q-learning maximizes a future-discounted measure of reward instead. The problem is that these criteria, in general, need not coincide.
3.1 Discounted Performance
For instance, consider the MDP of Figure 1. Here, attention to (undiscounted) future rewards will clearly mandate that a policy choose action A in state 1. But for any γ < 500/501 ≈ 0.998, Q_γ^π(1, B) > Q_γ^π(1, A) regardless of π, which makes methods such as Q-learning prefer action B. In fact, given any γ, there is some value we can set for the reward r(1, B) which makes the γ-discounted criterion favor action B over action A.

It is true that for any finite MDP there is some sufficiently large γ for which the discounted and undiscounted measures agree. However, proper choice of such a γ requires detailed knowledge of the domain—knowledge that we do not want to presuppose. Even with such knowledge, a parameter such as γ that needs to be tailored to suit individual domains is clearly undesirable.
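To make the arithmetic behind the 500/501 threshold explicit, the short computation below compares the two discounted values in state 1. It is our own illustration, under our reading of Figure 1: action A yields +0 and then +1 on every subsequent step, while action B yields +1000 and then −1 on every subsequent step.

    # Discounted values of the two choices in state 1 of Figure 1 (our reading of the figure).
    def v_choose_A(gamma):
        return 0 + gamma / (1 - gamma)        # +0 now, then +1 forever

    def v_choose_B(gamma):
        return 1000 - gamma / (1 - gamma)     # +1000 now, then -1 forever

    for gamma in (0.9, 0.99, 0.998, 0.999):
        preferred = "A" if v_choose_A(gamma) > v_choose_B(gamma) else "B"
        print(gamma, round(v_choose_A(gamma), 1), round(v_choose_B(gamma), 1), preferred)
    # The preference flips from B to A only once gamma exceeds 500/501, roughly 0.998.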
In light of these observations, one may wonder why researchers use discounting at all. We address several possible answers to this question:

Discounting to compensate for rate of interest. Discounted value is the primary criterion for performance in the field of Operations Research, where dynamic programming has been used and studied extensively. There the motivation is economic: one assumes that interest is available on earned rewards at the rate of 100i % per unit of time. In order to compensate for the interest one can earn on rewards achieved in the present, a reward of r units t time steps into the future must be evaluated by its present value, r(1 + i)^{−t}. This gives motivation for using γ = 1/(1 + i) in the above framework. However, this explanation can be dismissed out of hand by observing that RL researchers do not let interest accrue on rewards in their reportings of cumulative reward, nor do they evaluate the rewards in terms of their present value at any moment in time.

²These reasons, too lengthy to be presented in this paper, will be explained in detail elsewhere. They do not apply to the R-learning method.
Discounting to express the finiteness of an agent's lifetime. Another possible reason to value present rewards more than future rewards is the assumption that the agent might die sometime before it is able to reap future rewards. In this case, the γ represents an assumption that the agent has a probability 1 − γ of dying at any step in time, in which case all future rewards will be zero. A different intuition with the same mathematical interpretation is this: Discounting to express uncertainty about the future. Some have argued that future rewards may be uncertain because of a changing environment [1], and use discounting to reflect that any reward expected at future time t may turn out to be zero instead with a probability of 1 − γ^t. But both of these interpretations of discounting beg the following question: If researchers assume that agents may die or that rewards may turn to naught, why do they measure performance in domains where neither of these eventualities comes to pass?
This paper will proceed on the premise that the reason why people use discounting is none of these but a more practical one: By discounting future rewards, one makes their infinite sum finite. Which is to say, researchers use a discounted value measure because there exists no workable alternative. We proceed to provide such an alternative.
3.2 Undiscounted Performance
We wish to compare total undiscounted rewards reaped by a policy from a state; but since policies are likely to accrue rewards steadily over time, it is generally impossible to merely compare the infinite sums of rewards. Naturally, we turn to some comparison of the finite cumulative sums. Define the n-period value of a policy π as

    V_n^π(x) = E[ Σ_{t=0}^{n−1} r_t ]

One way to obtain a finite measure is to look at the average reward per unit of time incurred by a policy over the long run. We define the average reward of a policy π started in state x as

    ρ^π(x) = lim_{n→∞} (1/n) V_n^π(x)
While a policy needs to maximize average reward in order for us to consider it optimal in an undiscounted sense, the converse may not be true. For example, in some state x, two policies π₁ and π₂ may yield the same average reward even though π₁ will constantly outperform π₂ in that V_n^{π₁}(x) − V_n^{π₂}(x) ≥ c > 0 for large n. In such cases we would like to have an additional measure which is sensitive to such constants c. One possibility is

    lim_{n→∞} [ V_n^π(x) − n·ρ^π(x) ]

the limiting difference between cumulative performance and the line through the origin with slope ρ^π(x). (We may think of this line as the cumulative performance of the hypothetical reference case in which π were to reap its average reward at every step.) This measure, while intuitive, may not be well defined when the policy reaches periodic limit cycles. But in such cases we can generalize it to a measure which is guaranteed to exist. We define the average-adjusted value [5] of policy π in state x as the Cesàro or “limit in the average” [4] version of the simpler expression above:

    R^π(x) = lim_{N→∞} (1/N) Σ_{n=0}^{N−1} [ V_n^π(x) − n·ρ^π(x) ]    (6)

If we summarize the performance of a policy π in state x by a linear function in n approximating V_n^π(x), then ρ^π(x) is that line's slope and R^π(x) the y-intercept.
These two values, then, provide us with an undiscounted standard of performance. We wish foremost to maximize ρ and secondarily (ρ's being equal) to maximize R. So let us define the total value of a policy π from state x as the ordered pair

    T^π(x) = ⟨ρ^π(x), R^π(x)⟩

and use lexicographical order as the ordering on T. That is, we say that T₁ ≥ T₂ if and only if ρ₁ > ρ₂, or ρ₁ = ρ₂ and R₁ ≥ R₂.

We define undiscounted action-values analogously to the discounted case:

    T^π(x, a) = T^{a;π}(x)

To make parallels to the discounted case clear, we will use ρ^π(x, a) and R^π(x, a) to refer to the two components of T^π(x, a), respectively. Let us write T^{π₁} ≥ T^{π₂} to mean that T^{π₁}(x) ≥ T^{π₂}(x) for all states x. We say that π is T-optimal whenever T^π ≥ T^{π'} for all policies π'.
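The ordering just defined is plain lexicographic comparison of the pairs ⟨ρ, R⟩. The helper below is ours and purely illustrative, with a total value represented as a Python pair (rho, R).

    def total_value_geq(T1, T2):
        """Lexicographic ordering on total values represented as (rho, R) pairs."""
        rho1, R1 = T1
        rho2, R2 = T2
        return rho1 > rho2 or (rho1 == rho2 and R1 >= R2)

For exact values this coincides with Python's built-in tuple comparison T1 >= T2.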
Like γ-optimal policies, stationary T-optimal policies are guaranteed to exist for finite MDP's [5]. We will henceforth use π* to denote T-optimal policies.

The notion of T-optimality exists under a variety of names in the literature [11]. It is a stronger condition than average-reward optimality, but weaker than other forms including Blackwell optimality [3]. Blackwell optimality, the most selective one in the literature according to which all finite MDP's have optimal stationary policies, may be viewed as lexicographically maximizing an infinite series of which ρ and R are the first two terms. So far we have not seen the need to utilize the higher order terms.
We briefly explore an important property of average reward:

Fact 1 If two states x and y are such that executing policy π from either state leads to the same ergodic set³ of states in S, then ρ^π(x) = ρ^π(y).

This has the following consequence:

Corollary 2 For any finite MDP, either
1. ρ^{π*}(x) is independent of x, or
2. There are states between which no policy can ever guarantee passage.
The corollary tells us that in all cases of interest to RL, the
optimal policy has a single average reward, since MDP’s of
which the second clause holds would normally violate the
frequency of visit assumptions required for the convergence
of stochastic approximation methods such as Q-learning
[18].
While for many MDP's there exist policies that induce multiple ergodic sets, a simplification commonly made in the Dynamic Programming literature is that all policies are unichain, i.e., give rise to a single ergodic set [11]. In light of Fact 1, this lets one assume that all policies, not only π*, achieve a single state-independent average reward. At times we will make this assumption so as to facilitate the presentation of R-learning. A discussion of the behavior of R-learning in the general multichain case is forthcoming.
Attention to R may seem frivolous, as R represents merely a constant offset to cumulative value, likely to be quickly dominated by the repeated contributions of ρ. But for terminal goal tasks, average reward is merely a function of the goal state that is reached, invariant of the behavior that leads to the goal.⁴ Thus, R is of great importance; it is the entire measure of efficiency in reaching the goal.

Moreover, even if we are only interested in maximizing average reward, R turns out to be a crucial instrument in that maximization. For any policy π, states which lead to the same ergodic set—and all states do, given the unichain assumption—share a common value of ρ^π, so their total values differ only with regard to their second component, R^π. Likewise, for all actions that lead to the same ergodic set, the action-values T^π(x, a) share a constant ρ component. Clearly, then, R is the key to policy improvement. When an action a is found for which R^π(x, a) > R^π(x, π(x)), then changing π to choose a in x results in an improvement of R^π(x), increasing ρ^π(x) if x is a recurrent state under π.
³Any policy, applied to an MDP, gives rise to a Markov chain. An ergodic set of a Markov chain is a minimal set of states which, once entered, will never be left [7].

⁴Any policy which reaches the goal, strictly speaking, remains there forever. The average reward of a policy reflects this fact, even though a researcher generally stops the simulation at that point and begins a new trial from a different state. In practice, learners are rarely if ever allowed to execute a fixed stationary policy, so ρ^π need not express the average reward observed during experimental trials.
4 The Connection Between Discounted and
Undiscounted Value
Before proposing a method for learning T-optimal policies,
we note a connection between discounted and undiscounted
value. The main result of interest is the following [11]:
Theorem 3 For any policy π and state x,

    V_γ^π(x) = ρ^π(x)/(1 − γ) + R^π(x) + e_γ^π(x),  where lim_{γ→1} e_γ^π(x) = 0.

We may read this statement as saying that for values of γ approaching 1, the discounted value V_γ^π(x) is composed of two nonvanishing terms: one that is a large constant multiple of the average reward expected from starting policy π in state x, and one that is the average-adjusted value of π. We will refer to these as the ρ-term and R-term, respectively. So for γ close to 1, V_γ may be seen as approximating the lexicographical preference of T by giving much more weight to ρ than to R. But the approximation is imperfect, as is manifest in the fact that V_γ dictates opting for a quick constant bonus in reward over a long-term improvement in average reward when ΔR, the gain in finite short-term reward, is greater than Δρ/(1 − γ), the scaled difference in long-term average reward. This explains the faulty choice in the example of Figure 1.
Conversely, a simple corollary of Theorem 3 lets us understand average-adjusted value in terms of discounted value:

Corollary 4 For any policy π and state x,

    R^π(x) = lim_{γ→1} E[ Σ_{t=0}^∞ γ^t (r_t − ρ^π(x)) ]    (7)

If we think of the quantities r_t − ρ^π(x) as average-adjusted rewards [13], then this corollary lets us view R^π(x) as the expected discounted sum of average-adjusted rewards, for vanishingly little discounting (cf. Equation 1).
5 Learning T-Optimal Policies
When we express average-adjusted values in the form of Equation 7, it is easy to verify that they obey the following recurrence relation:

    R^π(x) = r(x, π(x)) − ρ^π(x) + Σ_y P_{xy}(π(x)) R^π(y)    (8)

The recurrence, like the expression for value in Corollary 4, is analogous to the discounted case, but with the following two modifications:

1. Rewards are average-adjusted by subtracting out ρ.
2. The effect of γ is eliminated by bringing γ arbitrarily close to 1.
We may see that the first modification enables the second as follows: In the representation of V_γ^π given by Theorem 3, the first term is the one which blows up for γ close to 1 when ρ ≠ 0; this is what prevents us from simply using the undiscounted sum V_1 in the first place. But average-adjusting the incoming rewards reduces the problem to one where ρ = 0, eliminating the first term altogether. This done, we are free to let γ approach 1 in order to eliminate the contribution of the high order term e_γ^π(x). What is left is a measure of the R-term alone which, as we have mentioned, is precisely that upon which we wish to base action selection.
By applying these two modifications to the standard Q-learning algorithm, what results is a technique for approximating R^π(x, a) instead of Q_γ^π(x, a), and hence maximizing total rather than γ-discounted value. The only additional machinery needed is a mechanism for approximating the average reward ρ of the successive policies suggested by the algorithm. The method, R-learning, is now presented:
The R-Learning Algorithm

1. Begin with a table of real numbers R(x, a), all initialized to zero, and a real-valued variable ρ, also initialized to zero.

2. Repeat:

2a. From the current state x, perform an action a, chosen by some exploration/action-selection mechanism (this is an orthogonal component of the system, just as it is for Q-learning). Observe the immediate reward r_imm received and the subsequent state y.

2b. Update R(x, a) according to:

    R(x, a) ←_β  r_imm − ρ + R(y)    (9)

where R(y) denotes max_{a'} R(y, a') and β is a learning rate parameter.

2c. If R(x, a) = R(x) (i.e., if a agrees with the policy π_R), then update ρ according to:

    ρ ←_α  r_imm + R(y) − R(x)    (10)

where α is a learning rate parameter.
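For concreteness, one pass through steps 2b and 2c can be written out as follows. This is our own rendering of the algorithm as stated above, not code from the paper; R is assumed to be a dictionary of per-state action-value dictionaries, rho a float, and beta and alpha the two learning rates.

    # Sketch of one R-learning update (steps 2b and 2c); the exploration mechanism
    # that selects the action a is left abstract, as in the text.
    def r_learning_step(R, rho, x, a, r_imm, y, beta, alpha):
        """Returns the updated rho; the table R is modified in place."""
        R_x = max(R[x].values())              # R(x) = max_a' R(x, a'), taken before the update
        R_y = max(R[y].values())              # R(y) = max_a' R(y, a')
        greedy = (R[x][a] == R_x)             # did a agree with the policy pi_R?
        # Step 2b: R(x,a) <-_beta  r_imm - rho + R(y)
        R[x][a] = (1 - beta) * R[x][a] + beta * (r_imm - rho + R_y)
        # Step 2c: only on greedy steps, rho <-_alpha  r_imm + R(y) - R(x)
        if greedy:
            rho = (1 - alpha) * rho + alpha * (r_imm + R_y - R_x)
        return rho

Whether step 2c should see the table before or after the step 2b update is not pinned down above; the sketch uses the pre-update values.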
One might wonder why we do not simply approximate ρ via exponential averaging of immediate rewards (i.e., ρ ←_α r_imm), performed on every tick. This is so as to restrict our attention to the policy π_R, uninfluenced by the conflicting behavior of the exploration mechanism that is ultimately responsible for action choice. Exploratory actions, which generally incur subaverage rewards, would skew the approximation of ρ if included in the calculation.
The appearance of the term R(y) − R(x) in Equation 10 plays the role of compensating for known variations in the rewards received in different states; it serves to adjust for periods of low reward when the agent is using its optimal policy to recover from suboptimal exploratory actions, and more generally to minimize the variance of the values used to estimate ρ.
One may easily show that whenever R-learning converges it
must arrive at a T-optimal policy. But unlike the discounted
methods, whose mathematics have been extensively ex-
plored (e.g., [18]), a proof of the convergence of R-learning
has not been established. Nonetheless, related techniques
such as undiscounted Policy Improvement [5] are well un-
derstood, and work toward convergence results is currently
under way.
6 Advantages of R-Learning
R-learning is designed to arrive at T-optimal policies, and
we have already argued for the advantages of undiscounted
performance criteria. But in addition to maximizing undis-
counted performance, R-learning displays several compu-
tational advantages over existing techniques. As a result,
even when one can be sure of choosing γ so as to allow Q-learning to arrive at T-optimal policies, it is often preferable to use R-learning (bearing in mind that the algorithm has not yet been proven to converge).
For discounted methods such as Q-learning to arrive at policies that do not overlook temporally distant rewards, they must use large values of γ. But it is well known that for successive approximation techniques the geometric rate of decay in the approximation error is proportional to γ, so that high values of γ can slow the convergence dramatically [2]. As a result, one might expect an undiscounted method such as R-learning to converge extremely slowly. In fact the opposite is generally true, for the following reasons:
6.1 Better Initial Estimates
In cases where the optimal policy has nonzero average reward and γ is near 1, Q-learning spends much of its time converging to the large ρ-term of Q_γ, which is invariant of the action, whereas the R-term is the one needed for action choice. During this time, the true values of Q_γ necessary for action selection may be entirely obscured by approximation error, causing poor performance. R-values, in contrast, have no contribution from ρ. As a result, the values, which begin at zero, already reflect their ρ-term, and need only converge on their R-term.
6.2 Faster Propagation of Rewards
Many researchers have pointed out that information about rewards can be propagated very slowly across states by Q-learning [10, 16]. In particular, we note that the ρ-terms of the Q-values, though constant over a state space, may be updated only locally, and one state at a time. By contrast, R-learning uses ρ to effectively store a common ρ-term for all its action-values. This term is updated on every iteration, and lets information be propagated instantly throughout the state space. Section 7.2 presents experimental data attesting to the resulting speedup.
6.3 Value Disambiguation
Even for MDP's where γ-optimal policies are T-optimal for small γ—in which case one would hope for Q-learning to converge rapidly despite the above concerns—there turn out to be compelling reasons to prefer R-learning. Let us examine any case where temporal credit assignment is necessary—where some T-suboptimal action gives immediate reward as large as that of the T-optimal action. That is, for some state x of an MDP with T-optimal policy π*, r(x, a) ≥ r(x, π*(x)) but T^{π*}(x, a) < T^{π*}(x, π*(x)). Then one may show that there is a value γ₀ such that 0 ≤ γ₀ < 1, and for which Q_γ^{π*}(x, π*(x)) > Q_γ^{π*}(x, a) holds for all γ₀ < γ < 1.⁵ To be sure to prefer the T-optimal action, one must choose a γ within this interval. This granted, one would like to choose a γ at the very bottom of the range so as to speed convergence. But choosing γ arbitrarily close to γ₀ will result in an arbitrarily small difference between Q_γ^{π*}(x, π*(x)) and Q_γ^{π*}(x, a).⁶ This is problematic for at least the following two reasons:
First, for stochastic MDP’s, approximation errors of Q-
values are due not only to the process of value iteration,
but also to the Monte Carlo sampling it employs. When
differences in true action-values for competing actions are
small they are more likely to be obscured by such errors,
which may cause suboptimal actions to be chosen. An
example of this phenomenon is presented in Section 7.1.
Secondly, many popular exploration methods select actions
stochastically according to their relative action-values [6,
16, 17]. In such cases, an action whose Q-value differs
from the optimal one by a small amount will be chosen
almost as frequently as the optimal action. This may result
in a substantial loss of cumulative reward.
One may see that R-learning minimizes these problems, because using average-adjusted value maximizes the differences in value in the following sense:

    R^{π*}(x, π*(x)) − R^{π*}(x, a) = sup_{0<γ<1} [ Q_γ^{π*}(x, π*(x)) − Q_γ^{π*}(x, a) ]
In summary, choosing γ to be large for Q-learning results in poor initial estimates and slow convergence, while choosing γ small, when it does not result in T-suboptimal solutions, may reduce performance dramatically for other reasons. R-learning eliminates the hazards that accompany small values of γ, while using other means to speed convergence.
⁵This follows from the proof of existence of stationary Blackwell optimal policies ([3], Theorem 5).

⁶In goal tasks where nonzero reward is given only upon achieving the goal, this is the familiar problem wherein temporal differencing causes all actions in states distant from the goal to have approximately zero value.
6.4 Linearity of Undiscounted Values
In the case of a deterministic MDP, we may read the recurrence for R (Equation 8) as telling us that

    R^π(x) = r(x, π(x)) − ρ^π + R^π(y)

where y is the state reached by performing π(x) in x. Since unichain policies have ρ^π constant, it follows that for regions of S where following π incurs a constant reward, the average-adjusted value of successive states encountered will change linearly. By contrast, the discounting present in Q-values makes increases and decreases in those values exponential, growing in magnitude with temporal proximity to rewards.

Linearity is very desirable in the case where S is a metric space, and the policy action causes uniform movement along some dimension of S. In this case the R-values for the policy action vary linearly over that dimension, potentially permitting the use of simple and accurate linear interpolation to speed up learning and even allowing learning over continuous state spaces.
Finally, the linearity of values can be useful in situations
where we want to combine action-values for multiple indi-
vidual tasks to determine values for a composite task. Singh
[14] has explored this possibility for sets of goal tasks.
6.5 Q-Learning Seen as a Special Case of R-learning
Occasions might arise where temporal discounting is a useful feature for any of the reasons given in Section 3.1. As we saw, the discounted value measures total value in the case where at each point in time there is a probability 1 − γ of an exogenous incident after which all rewards are zero. Given any MDP, we may introduce this assumption explicitly by adding one absorbing state ω with autotransitions yielding reward zero, to which all actions from all other states lead with probability 1 − γ. (Such actions receive their normal reward, since it is only the state they result in that changes.) Accordingly, we multiply all preexisting transition probabilities by γ. Now let quantities marked by a dot denote values pertaining to the modified MDP. We first observe that for all policies π and all states x of the new decision problem, ρ̇^π(x) = 0, and build this fact into the algorithm by eliminating step 2c. Now by Equation 8 we have, for all x ≠ ω,

    Ṙ^π(x) = ṙ(x, π(x)) − ρ̇^π(x) + Σ_y Ṗ_{xy}(π(x)) Ṙ^π(y)
           = r(x, π(x)) + γ Σ_{y≠ω} P_{xy}(π(x)) Ṙ^π(y) + (1 − γ) Ṙ^π(ω)
           = r(x, π(x)) + γ Σ_y P_{xy}(π(x)) Ṙ^π(y)

since Ṙ^π(ω) = 0. In other words, Ṙ^π obeys the same recurrence relation as V_γ^π. By modifying the R-learning update of step 2b to reflect the recurrence of Ṙ instead of R, we are left with precisely the Q-learning algorithm.
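The construction just described is easy to write down explicitly. The sketch below is ours (not from the paper); it assumes the MDP is given as NumPy arrays P of shape (S, A, S) and r of shape (S, A), and it appends the absorbing state as index S.

    import numpy as np

    def add_absorbing_state(P, r, gamma):
        """Build the modified MDP in which every action also leads, with probability
        1 - gamma, to a new zero-reward absorbing state omega (index S)."""
        S, A, _ = P.shape
        P_dot = np.zeros((S + 1, A, S + 1))
        P_dot[:S, :, :S] = gamma * P       # scale all preexisting transitions by gamma
        P_dot[:S, :, S] = 1 - gamma        # every action may jump to omega instead
        P_dot[S, :, S] = 1.0               # omega is absorbing under every action
        r_dot = np.zeros((S + 1, A))
        r_dot[:S, :] = r                   # actions keep their normal rewards; omega yields 0
        return P_dot, r_dot

Since every policy on the modified process has zero average reward, running R-learning on it with step 2c removed leaves exactly the recurrence derived above.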
Figure 2: A simple MDP with temporally distant rewards. [The figure shows a cycle of 50 states, numbered 0 through 49; in each state the actions move and stay yield reward +0, except for move in state 49, which yields +50 and returns the process to state 0.]
Figure 3: Performance of R-Learning versus Q-learning. [Average reward per 5000-action interval (y-axis, 0.3 to 1) plotted against the number of actions performed (x-axis, 0 to 250,000), one curve per method.]
7 Experimental Results
7.1 Value Disambiguation
An initial experiment compared Q-learning and R-learning in the simple domain pictured in Figure 2, modified to yield stochastic rewards that always deviate from the values given by either +1 or −1 (with equal probability). Figure 3 shows the results. The x-axis measures number of actions performed, while the y-axis measures average reward per 5000-action interval. (The results are averaged over 50 trials.) We used random exploration with a fixed probability 0.05 of a random action at any time step; an exploration method that favors actions with near-optimal values would make the advantage of R-learning more pronounced.
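For reference, the environment used in this experiment can be reconstructed roughly as follows. This is our own reading of Figure 2 and of the description above (a 50-state cycle with a single +50 payoff and ±1 reward noise), not code from the original study.

    import random

    N_STATES = 50   # states 0 through 49, as in Figure 2

    def step(x, action):
        """One transition of the (noisy) Figure 2 domain; action is 'move' or 'stay'.
        Every reward deviates from its nominal value by +1 or -1 with equal probability."""
        if action == 'move':
            y = (x + 1) % N_STATES
            nominal = 50 if x == N_STATES - 1 else 0   # +50 only on the move out of state 49
        else:
            y, nominal = x, 0
        return y, nominal + random.choice((+1, -1))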
Figure 4: Performance of R-Learning versus Q-learning (all rewards increased by 100). [Average reward per 5000-action interval (y-axis, 100 to 101) plotted against the number of actions performed (x-axis, 0 to 250,000).]
Figure 5: An MDP with two cycles. Action choice is irrelevant except in state 0. [From state 0, one action enters a long loop through states 1–19 whose final transition back to state 0 yields +30, and the other enters a short loop through states 1'–9' whose final transition back to state 0 yields +10; all other rewards are +0.]
Figure 6: Comparison of learning mechanisms on an MDP with two cycles. Performance of Q-learning is poor compared to R-learning because of ramping. [Average reward (y-axis, 1 to 1.5) plotted against the number of actions performed (x-axis, 0 to 60,000).]
Figure 4 shows a comparison of Q-learning and R-learning in the same domain with all rewards increased uniformly by 100. Notice the period in which R-learning stalls while its estimate of ρ climbs from an initial value of zero up to the final value of 101.

Both runs use γ = 0.9 and β = 0.2. Because of the exploration strategy, an optimal policy will have an average reward of 0.95. Note that the fixed β is responsible for the fact that the Q-values never converge to the optimal value.
7.2 Convergence Rate
A second experiment used the double-loop domain shown in Figure 5. Average rewards of the two methods are plotted in Figure 6, compiled over 100 runs. The reason for the poor performance by Q-learning is that the shorter cycle, though it gives less per-step payoff than the longer one, allows rewards to be propagated more quickly. As a result, its Q-values converge faster than the longer one's, making it look more favorable during the long process of convergence.

These experiments use the same parameters as the previous ones, except that here γ is increased to 0.99 to allow Q-learning to learn the T-optimal policy at all. For larger γ, the effect is even more pronounced: When γ = 0.999 instead of 0.99, the speedup of R-learning over Q-learning is more than forty-fold.
8 Related Work
In Section 2 we remarked that Q-learning may be used to find V_1-optimal policies in cases where that undiscounted measure is finite. Now we may further observe that in such cases, ρ^π = 0, R^π = V_1^π, and the recurrence for R reduces to that of V_1. In this sense, the theory of total value subsumes that of V_1, and R-learning is a generalization of undiscounted Q-learning.
Whereas Q-learning is an asynchronous, stochastic approximation version of the well-understood dynamic programming method of value iteration, R-learning has no such precise analog in the literature. Though several successive approximation algorithms exist for computing T, none of them approximate ρ explicitly; this added complexity present in R-learning, while making it more robust in the multichain case, necessitates new methods of analysis, which we are working to develop. We know of no successive approximation methods in the Dynamic Programming literature for handling the multichain case.
We have recently learned of the related work of Westerdale [19], who presents a bucket brigade technique based on the notions of average and average-adjusted value. Since he does not draw any connections to the Dynamic Programming or Reinforcement Learning literature, the precise contribution of this work is difficult to assess.
The general notion that performance should be viewed relative to a standard of reference, which underlies average-adjusted value and the workings of R-learning, has appeared in many places in the psychological literature, as well as in Sutton's Reinforcement Comparison algorithm [15]. A presentation of R-learning from a more psychological standpoint is offered in [13].
9 Conclusion
Until now, most of the work in RL has maintained one stan-
dard of performance while using algorithms that maximize
another. The discounted algorithms have been well understood, but the performance standards have not: In goal tasks, maximal undiscounted value V_1 has been sought, while, in recurrent domains, high average reward has been the aim. The concept of total reward presented here subsumes these two criteria, and is the basis for R-learning, a method we have proposed to maximize them both. We have used total reward as a tool to gain a better understanding of some of the computational shortcomings of the discounted techniques, and have shown that R-learning may ameliorate many of those problems.
While informal tests have shown R-learning to be applicable to a variety of domains, more empirical and theoretical work is necessary to establish its viability as a robust algorithm for reinforcement learning. Some of this work is currently under way.
Acknowledgements
I wish to thank Nils Nilsson, Rich Sutton, George John, and
Moshe Tennenholtz for their helpful comments. This work
was supported by an IBM Graduate Fellowship.
References
[1] A. G. Barto, R. S. Sutton, and C. J. C. H. Watkins. Learning
and sequential decision making. Technical Report COINS
89-95, Dept. of Computer and Information Science, Univer-
sity of Massachusetts, Amherst, 1989.
[2] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed
Computation: Numerical Methods. Prentice Hall, Engle-
wood Cliffs, NJ, 1989.
[3] D. Blackwell. Discrete dynamic programming. Ann. Math.
Statist., 33:719–726, 1962.
[4] G. H. Hardy. Divergent Series. Clarendon Press, Oxford,
1949.
[5] R. A. Howard. Dynamic Programming and Markov Pro-
cesses. MIT Press, Cambridge, MA, 1960.
[6] L. P. Kaelbling. Learning in Embedded Systems. PhD thesis,
Stanford University, 1990.
[7] J. G. Kemeny and J. L. Snell. Finite Markov Chains. Van
Nostrand, Princeton, NJ, 1960.
[8] L.-J. Lin. Programming robots using reinforcement learning and teaching. In Proceedings AAAI-91, pages 781–786. MIT Press, Cambridge, MA, 1991.
[9] S. Mahadevan and J. Connell. Automatic programming of
behavior-based robots using reinforcement learning. In Pro-
ceedings AAAI-91, pages 768–773. MIT Press, Cambridge,
MA, 1991.
[10] R. A. McCallum. Using transitional proximity for faster
reinforcement learning. In Proceedings of the Ninth Inter-
national Workshop on Machine Learning, pages 316–321.
Morgan Kaufmann, San Mateo, CA, 1992.
[11] M. L. Puterman. Markov decision processes. In D. P. Hey-
man and M. J. Sobel, editors, Handbooks in OR & MS, Vol. 2,
pages 331–434. Elsevier, North-Holland, 1990.
[12] S. M. Ross. Introduction to Stochastic Dynamic Program-
ming. Academic Press, New York, 1983.
[13] A. Schwartz. Thinking locally to act globally: A novel
approach to reinforcement learning. In Proceedings of the
Fifteenth Annual Conference of the Cognitive Science Soci-
ety. Lawrence Erlbaum, Hillsdale, NJ, 1993.
[14] S. P. Singh. Transfer of learning by composing solutions for
elemental sequential tasks. Machine Learning, 8(3/4):323–
339, May 1992.
[15] R. S. Sutton. Temporal Credit Assignment in Reinforcement
Learning. PhD thesis, Department of Computer and Infor-
mation Sciences, University of Massachusetts, 1984.
[16] R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Workshop on Machine Learning, pages 216–224. Morgan Kaufmann, San Mateo, CA, 1990.
[17] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD
thesis, King’s College, Cambridge, 1989.
[18] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine
Learning, 8:279–292, 1992.
[19] T. H. Westerdale. Quasimorphisms or queasymorphisms? Modeling finite automaton environments. In G. J. E. Rawlins, editor, Foundations of Genetic Algorithms, pages 128–147. Morgan Kaufmann, San Mateo, CA, 1991.