Altered dopamine signaling is involved in many human disorders, from Parkinson's disease to drug addiction. Yet the normal functions of dopamine have long been the subject of debate. There is extensive evidence that dopamine affects learning, especially the reinforcement of actions that produce desirable results¹. Specifically, electrophysiological studies suggest that bursts and pauses of dopamine cell firing encode the reward prediction errors (RPEs) of reinforcement learning (RL) theory². In this framework, RPE signals are used to update estimated values of states and actions, and these updated values affect subsequent decisions when similar situations are re-encountered. Further support for a link between phasic dopamine and RPE comes from measurements of dopamine release using fast-scan cyclic voltammetry (FSCV)³,⁴ and optogenetic manipulations⁵,⁶.

There is also extensive evidence that dopamine modulates arousal and motivation⁷,⁸. Drugs that produce prolonged increases in dopamine release (for example, amphetamines) can markedly enhance psychomotor activation, whereas drugs or toxins that interfere with dopamine transmission have the opposite effect. Over slow timescales (tens of minutes), microdialysis studies have demonstrated that dopamine release ([DA]) is strongly correlated with behavioral activity, especially in the nucleus accumbens⁹ (that is, mesolimbic [DA]). It is widely thought that slow (tonic) [DA] changes are involved in motivation¹⁰–¹². However, faster [DA] changes also appear to have a motivational function¹³. Subsecond increases in mesolimbic [DA] accompany motivated approach behaviors¹⁴,¹⁵, and dopamine ramps lasting several seconds have been reported as rats approach anticipated rewards¹⁶, without any obvious connection to RPE. Overall, the role of dopamine in motivation is still considered to be mysterious¹².

We sought to better understand just how dopamine contributes to motivation and to learning simultaneously. We found that mesolimbic [DA] conveys a motivational signal in the form of state values, which are moment-by-moment estimates of available future reward. These values were used for making decisions about whether to work, that is, to invest time and effort in activities that are not immediately rewarded, to obtain future rewards. When there was an unexpected change in value, the corresponding change in [DA] not only influenced motivation to work, but also served as an RPE learning signal, reinforcing specific choices. Rather than separate functions of phasic and tonic [DA], our data support a unified view in which the same dynamically fluctuating [DA] signal influences both current and future motivated behavior.
RESULTS
Motivation to work adapts to recent reward history
We used an adaptive decision-making task (Fig. 1a and Online Methods) that is closely related to the reinforcement learning framework (a ‘two-armed bandit’). On each trial, a randomly chosen nose poke port lit up (Light-On), indicating that the rat might profitably approach and place its nose in that port (Center-In). The rat had to wait in this position for a variable delay (0.75–1.25 s) until an auditory white noise burst (Go cue) prompted the rat to make a brief leftward or rightward movement to an adjacent side port. Unlike previous behavioral tasks using the same apparatus, the Go cue did not specify which way to move; instead, the rat had to learn through trial and error which option was currently more likely to be rewarded. Left and right choices had separate reward probabilities (each was either 10, 50 or 90%), and these probabilities changed periodically without any explicit signal. On rewarded trials only, entry into the side port (Side-In) immediately triggered an audible click (the reward cue) as a food hopper delivered a sugar pellet to a separate food port at the opposite side of the chamber.
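For readers who want a concrete picture of these contingencies, the following is a minimal sketch (not the authors' task code) of the bandit structure just described: two response options with independent reward probabilities drawn from 10%, 50% or 90% that change between blocks without warning. The block length, function names and random policy are illustrative assumptions.

```python
import random

REWARD_PROBS = (0.1, 0.5, 0.9)  # nominal per-side reward probabilities

def new_block():
    """Draw independent left/right reward probabilities for a new block."""
    return {"left": random.choice(REWARD_PROBS),
            "right": random.choice(REWARD_PROBS)}

def run_session(n_trials=300, block_len=60, policy=None):
    """Simulate choices against block-wise probabilities. `block_len` is an
    illustrative assumption; the paper only states that probabilities changed
    periodically without any explicit signal."""
    probs, outcomes = new_block(), []
    for t in range(n_trials):
        if t > 0 and t % block_len == 0:
            probs = new_block()                       # unsignaled block change
        choice = policy(outcomes) if policy else random.choice(["left", "right"])
        rewarded = random.random() < probs[choice]
        outcomes.append((choice, rewarded))
    return outcomes

# Example: random policy, then inspect the overall reward rate
session = run_session()
print(sum(r for _, r in session) / len(session))
```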
Mesolimbic dopamine signals the value of work

Arif A Hamid¹,²,⁷, Jeffrey R Pettibone¹,⁷, Omar S Mabrouk³,⁴, Vaughn L Hetrick¹, Robert Schmidt¹,⁶, Caitlin M Vander Weele¹,⁶, Robert T Kennedy³,⁴, Brandon J Aragona¹,² & Joshua D Berke¹,²,⁵

¹Department of Psychology, University of Michigan, Ann Arbor, Michigan, USA. ²Neuroscience Graduate Program, University of Michigan, Ann Arbor, Michigan, USA. ³Department of Chemistry, University of Michigan, Ann Arbor, Michigan, USA. ⁴Department of Pharmacology, University of Michigan, Ann Arbor, Michigan, USA. ⁵Department of Biomedical Engineering, University of Michigan, Ann Arbor, Michigan, USA. ⁶Present address: BrainLinks-BrainTools Cluster of Excellence and Bernstein Center, University of Freiburg, Germany (R.S.), and Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA (C.M.V.W.). ⁷These authors contributed equally to this work. Correspondence should be addressed to J.D.B. (jdberke@umich.edu).
Received 20 May; accepted 8 October; published online 23 November 2015; doi:10.1038/nn.4173

Dopamine cell firing can encode errors in reward prediction, providing a learning signal to guide future behavior. Yet dopamine is also a key modulator of motivation, invigorating current behavior. Existing theories propose that fast (phasic) dopamine fluctuations support learning, whereas much slower (tonic) dopamine changes are involved in motivation. We examined dopamine release in the nucleus accumbens across multiple time scales, using complementary microdialysis and voltammetric methods during adaptive decision-making. We found that minute-by-minute dopamine levels covaried with reward rate and motivational vigor. Second-by-second dopamine release encoded an estimate of temporally discounted future reward (a value function). Changing dopamine immediately altered willingness to work and reinforced preceding action choices by encoding temporal-difference reward prediction errors. Our results indicate that dopamine conveys a single, rapidly evolving decision variable, the available reward for investment of effort, which is employed for both learning and motivational functions.
Trained rats readily adapted their behavior in at least two respects (Fig. 1b,c). First, actions followed by rewards were more likely to be subsequently selected (that is, they were reinforced), producing left and right choice probabilities that scaled with actual reward probabilities¹⁷ (Fig. 1d).

Second, rats were more motivated to perform the task while it produced a higher rate of reward¹⁸,¹⁹. This was apparent from latency (the time taken from Light-On until the Center-In nose poke), which scaled inversely with reward rate (Fig. 1e–g). When reward rate was higher, rats were more likely to be already waiting near the center ports at Light-On (engaged trials; Supplementary Fig. 1), producing very short latencies. Higher reward rates also produced shorter latencies even when rats were not already engaged at Light-On (Supplementary Fig. 1), as a result of an elevated moment-by-moment probability (hazard rate) of choosing to begin work (Fig. 1h,i).
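As a rough illustration of how the latency distributions in Fig. 1h,i can be summarized, here is a small sketch (our own, not the paper's Online Methods code) that converts a set of Light-On-to-Center-In latencies into a survivor curve and a discrete-time hazard rate; the bin width and time range are assumptions.

```python
import numpy as np

def survivor_and_hazard(latencies_s, bin_width=0.1, t_max=5.0):
    """Survivor curve: fraction of trials with no Center-In yet at each time point.
    Hazard rate: probability of Center-In in each bin, given it has not happened
    yet, expressed as % per second (as in Fig. 1i)."""
    latencies_s = np.asarray(latencies_s, dtype=float)
    edges = np.arange(0.0, t_max + bin_width, bin_width)
    counts, _ = np.histogram(latencies_s, bins=edges)
    n = latencies_s.size
    still_waiting = n - np.cumsum(counts)                  # trials remaining after each bin
    survivor = still_waiting / n
    at_risk = np.concatenate(([n], still_waiting[:-1]))    # trials at risk entering each bin
    with np.errstate(divide="ignore", invalid="ignore"):
        hazard = np.where(at_risk > 0, counts / at_risk, np.nan) / bin_width * 100.0
    return edges[1:], survivor, hazard

# Example with synthetic latencies (in seconds)
times, surv, haz = survivor_and_hazard(np.random.exponential(2.0, size=1000))
```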
These latency observations are consistent with optimal foraging theories²⁰, which argue that reward rate is a key decision variable (‘currency’). As animals perform actions and experience rewards, they construct estimates of reward rate and can use these estimates to help decide whether engaging in an activity is worthwhile. In a stable environment, the best estimate of reward rate is simply the total magnitude of past rewards received over a long time period, divided by the duration of that period. It has been proposed that such a long-term average reward rate is encoded by slow (tonic) changes in [DA]¹⁰. However, under shifting conditions such as our trial-and-error task, the reward rate at a given time is better estimated by more local measures. Reinforcement learning algorithms use past reward experiences to update estimates of future reward from each state: a set of these estimates is called a value function²¹.
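To make the distinction between a long-run average and a more local reward-rate estimate concrete, here is a brief sketch (an illustration under assumed parameters, not the authors' analysis code) contrasting the session-wide average with a leaky-integrator estimate that tracks recent rewards; the time constant is an assumption.

```python
import numpy as np

def long_run_rate(rewards, dt=1.0):
    """Total reward divided by total elapsed time (stable-environment estimate)."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards.sum() / (rewards.size * dt)

def local_rate(rewards, tau=10.0, dt=1.0):
    """Leaky-integrator (exponentially weighted) reward rate with time constant tau.
    tau is an illustrative assumption; any 'recent history' window would serve."""
    rate, decay, trace = [], np.exp(-dt / tau), 0.0
    for r in np.asarray(rewards, dtype=float):
        trace = decay * trace + (1.0 - decay) * (r / dt)
        rate.append(trace)
    return np.array(rate)

# Example: a rich block followed by a lean block of trials
rewards = np.r_[np.random.rand(50) < 0.8, np.random.rand(50) < 0.2].astype(float)
print(long_run_rate(rewards), local_rate(rewards)[[49, 99]])
```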
Figure 1 Adaptive choice and motivation in the trial-and-error task. (a) Sequence of behavioral events (in rewarded trials). (b) Choice behavior in a representative session. Numbers at top denote nominal block-by-block reward probabilities for left (purple) and right (green) choices. Tick marks indicate actual choices and outcomes on each trial (tall ticks indicate rewarded trials, short ticks unrewarded). The same choice data is shown below in smoothed form (thick lines, seven-trial smoothing). (c) Relationship between reward rate and latency for the same session. Tick marks indicate only whether trials were rewarded or not, regardless of choice. Solid black line shows reward rate and cyan line shows latency (on inverted log scale), both smoothed in the same way as in b. (d) Choices progressively adapted toward the block reward probabilities (data set for d–i: n = 14 rats, 125 sessions, 2,738 ± 284 trials per rat). (e) Reward rate breakdown by block reward probabilities. (f) Latencies by block reward probabilities. Latencies became rapidly shorter when reward rate was higher. (g) Latencies by proportion of recent trials rewarded. Error bars represent s.e.m. (h) Latency distributions presented as survivor curves (the average fraction of trials for which the Center-In event has not yet happened, by time elapsed from Light-On), broken down by proportion of recent trials rewarded. (i) Same latency distributions as h, but presented as hazard rates (the instantaneous probability that the Center-In event will happen, if it has not happened yet). The initial bump in the first second after Light-On reflects engaged trials (Supplementary Fig. 1); after that, hazard rates are relatively stable and continue to scale with reward history.

Minute-by-minute dopamine correlates with reward rate
To test whether changes in [DA] accompany reward rate during adaptive decision-making, we first employed microdialysis in the nucleus accumbens combined with liquid chromatography–mass spectrometry.
This method allows us to simultaneously assay a wide range of neurochemicals, including all of the well-known low-molecular-weight striatal neurotransmitters, neuromodulators and their metabolites (Fig. 2a), each with 1-min time resolution. We performed regression analyses to assess relationships between these neurochemicals and a range of behavioral factors: reward rate, the number of trials attempted (as an index of a more general form of activation/arousal), the degree of exploitation versus exploration (an important decision parameter that has been suggested to involve [DA]; Online Methods) and the cumulative reward obtained (as an index of progressively increasing factors such as satiety).
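The per-analyte regressions can be sketched as follows. This is a schematic reimplementation on simulated stand-in data, using ordinary least squares and a Bonferroni threshold over 4 behavioral measures × 19 analytes as described in the Figure 2b legend; it is not the authors' analysis code, and all variable names are ours.

```python
import numpy as np
from scipy import stats

def simple_r2_and_p(x, y):
    """R^2 and p value for a one-predictor linear regression (y ~ x)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    df1, df2 = 1, len(y) - 2
    f = (r2 / df1) / ((1.0 - r2) / df2)
    return r2, stats.f.sf(f, df1, df2)

# Simulated stand-in data: 444 one-minute samples, 19 analytes, 4 behavioral measures
rng = np.random.default_rng(0)
behavior = rng.standard_normal((444, 4))   # reward rate, attempts, exploitation, cumulative rewards
analytes = rng.standard_normal((444, 19))
alpha = 0.05 / (4 * 19)                    # Bonferroni correction over all analyte x measure tests
for i in range(analytes.shape[1]):
    for j in range(behavior.shape[1]):
        r2, p = simple_r2_and_p(behavior[:, j], analytes[:, i])
        if p < alpha:
            print(f"analyte {i} vs measure {j}: R2={r2:.2f}, p={p:.2g}")
```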
We found a clear overall relationship between [DA] and ongoing reward rate (R² = 0.15, P < 10⁻¹⁶). Among the 19 tested analytes, [DA] had by far the strongest relationship to reward rate (Fig. 2b), and this relationship was significant in six of seven individual sessions, from six different rats (P = 0.0052 or lower in each case; Fig. 2c and Supplementary Fig. 2). Modest relationships were also found for the dopamine metabolites DOPAC and 3-MT. We found a weak relationship between [DA] and the number of trials attempted, but this was entirely accounted for by reward rate; that is, if the regression model already included reward rate, adding number of attempts did not improve model fit. We did not find support for alternative proposals that tonic [DA] is related to exploration or exploitation, as higher [DA] was not associated with an altered probability of choosing the better left or right option (Fig. 2b and Supplementary Fig. 2). [DA] also showed no relationship to the cumulative total rewards earned (though there was a strong relationship between cumulative reward and the dopamine metabolite HVA, among other neurochemicals; Fig. 2b and Supplementary Fig. 3).
We conclude that higher reward rate is associated specifically with higher average [DA], rather than with other striatal neuromodulators, and with increased motivation to work. This finding supports the proposal that [DA] helps to mediate the effects of reward rate on motivation¹⁰. However, rather than signaling an especially long-term rate of reward, [DA] tracked minute-by-minute fluctuations in reward rate. We therefore needed to assess whether this result truly reflects an aspect of [DA] signaling that is inherently slow (tonic) or could instead be explained by rapidly changing [DA] levels that signal a rapidly changing decision variable.
Figure 2 Minute-by-minute dopamine levels track reward rate. (a) Total ion chromatogram of a single representative microdialysis sample, illustrating the set of detected analytes in this experiment. The x axis indicates chromatography retention times, the y axis indicates intensity of ion detection for each analyte (normalized to peak values). Inset, locations of each microdialysis probe in the nucleus accumbens (all data shown in the same Paxinos atlas section; six were on the left side and one on the right). DA, dopamine; 3-MT, 3-methoxytyramine; NE, norepinephrine; NM, normetanephrine; 5-HT, serotonin; DOPAC, 3,4-dihydroxyphenylacetic acid; HVA, homovanillic acid; 5HIAA, 5-hydroxyindole-3-acetic acid; ACh, acetylcholine. (b) Regression analysis results indicating strength of linear relationships between each analyte and each of four behavioral measures (reward rate, number of attempts, exploitation index and cumulative rewards). Data are from six rats (seven sessions, total of 444 1-min samples). Color scale shows P values, Bonferroni-corrected for multiple comparisons (4 behavioral measures × 19 analytes), with red bars indicating a positive relationship and blue bars indicating a negative relationship. Given that both reward rate and attempts showed significant correlations with [DA], we constructed a regression model that included these predictors and an interaction term. In this model, R² remained at 0.15 and only reward rate showed a significant partial effect (P < 2.38 × 10⁻¹²). (c) An alternative assessment of the relationship between minute-long [DA] samples and behavioral variables. In each of the seven sessions, [DA] levels were divided into three equal-sized bins (low, medium and high); different colors indicate different sessions. For each behavioral variable, means were compared across [DA] levels using one-way ANOVA. There was a significant main effect of reward rate (F(2,18) = 10.02, P = 0.0012), but no effect of attempts (F(2,18) = 1.21, P = 0.32), exploitation index (F(2,18) = 0.081, P = 0.92) or cumulative rewards (F(2,18) = 0.181, P = 0.84). Post hoc comparisons using the Tukey test revealed that the mean reward rates of low and high [DA] differed significantly (P = 0.00082). See also Supplementary Figures 2 and 3.

Dopamine signals time-discounted available future reward
To help distinguish these possibilities, we used FSCV to assess task-related [DA] changes on fast timescales (from tenths of seconds to tens of seconds; Fig. 3). In each trial, [DA] rapidly increased as rats poked their nose in the start hole (Fig. 3c,d), and for all rats this increase was more closely related to this approach behavior than to the onset of the light cue (for data from each of the single sessions from all six rats, see Supplementary Fig. 4). A second abrupt increase in [DA] occurred following presentation of the Go cue (Fig. 3c,d). If received, the reward cue prompted a third abrupt increase (Fig. 3c,d). [DA] rose still further as the rat approached the food port (Fig. 3c,d), then declined once the reward was obtained. The same overall pattern of task-related [DA] change was observed in all rats, albeit with some variation (Supplementary Fig. 4).
[DA] increases did not simply accompany movements, given that, on the infrequent trials in which the rat approached the food port without hearing the reward cue, we observed no corresponding increase in [DA] (Fig. 3c,d).
The overall ramping up of [DA] as rats drew progressively closer to reward suggested some form of reward expectation¹⁶. Specifically, we hypothesized that [DA] continuously signals a value function: the temporally discounted reward predicted from the current moment. To make this more clear, consider a hypothetical agent moving through a sequence of distinct, unrewarded states leading up to an expected reward (perhaps a rat running at constant speed along a familiar maze arm; Fig. 4a). As the reward is more discounted when more distant, the value function will progressively rise until the reward is obtained.
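A minimal sketch of this discounted value function is given below (our own illustration, not the paper's model code), with exponential discounting and a hyperbolic variant for comparison as in Fig. 4a; the time step, discount parameters and reward magnitudes are assumptions chosen only to show the curve shapes.

```python
import numpy as np

def exponential_value(t_to_reward, reward=1.0, gamma=0.95, dt=0.1):
    """V(t) = gamma^(steps to reward) * reward; rises as the reward gets closer."""
    steps = np.asarray(t_to_reward, float) / dt
    return reward * gamma ** steps

def hyperbolic_value(t_to_reward, reward=1.0, k=1.0):
    """Hyperbolic alternative: V(t) = reward / (1 + k * delay to reward)."""
    return reward / (1.0 + k * np.asarray(t_to_reward, float))

time_to_reward = np.arange(10.0, -0.001, -0.1)   # agent approaching reward over 10 s
v_exp = exponential_value(time_to_reward)
v_hyp = hyperbolic_value(time_to_reward)
# A cue that makes an uncertain reward certain scales the expected reward,
# so the value trajectory jumps upward at the cue (as in Fig. 4a, bottom):
v_uncertain = exponential_value(time_to_reward, reward=0.5)
```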
This value function describes the time-varying level of motivation. If a reward is distant (so strongly discounted), animals are less likely to choose to work for it. Once engaged, animals are increasingly motivated, and so less likely to quit, as they detect progress toward the reward (the value function produces a ‘goal gradient’)²². If the reward is smaller or less reliable, the value function will be lower, indicating less incentive to begin work. Moving closer to our real situation, suppose that reward is equally likely to be obtained, or not, on any given trial, but a cue indicates this outcome halfway through the trial (Fig. 4a). The increasing value function should initially reflect the overall 0.5 reward probability, but if the reward cue occurs, estimated value should promptly jump to that of the (discounted) full reward.
Such unpredicted sudden transitions to states with a different value produce ‘temporal-difference’ RPEs (Fig. 4b). In particular, if the value function is low (for example, the trajectory indicating 0.25 expectation of reward), the reward cue produces a large RPE, as value jumps up to the discounted value of the now-certain reward. If instead reward expectation was higher (for example, the 0.75 trajectory), the RPE produced by the reward cue is smaller. Given that temporal-difference RPEs reflect sudden shifts in value, under some conditions they can be challenging to dissociate from value itself. However, RPE and value signals are not identical. In particular, as reward gets closer, the state value progressively increases but RPE remains zero unless events occur with unpredicted value or timing.
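The size of the temporal-difference RPE at the reward cue can be made concrete with a short worked example (a sketch; the discount factor and delay to reward are assumed values for illustration, not parameters fitted in the paper).

```python
# Temporal-difference RPE at the reward cue: delta = r + gamma * V(next) - V(current).
# No food is delivered at the cue itself (r = 0); the cue moves the agent into a
# state whose value is the discounted value of a now-certain reward ~2 s away.
gamma = 0.95          # per-0.1-s discount factor (assumed)
steps_to_reward = 20  # ~2 s at 0.1-s steps (assumed)
v_after_cue = 1.0 * gamma ** steps_to_reward

for p_reward in (0.25, 0.75):
    v_before_cue = p_reward * v_after_cue      # value still scaled by reward probability
    delta = 0.0 + v_after_cue - v_before_cue   # larger RPE when expectation was lower
    print(p_reward, round(delta, 3))
```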
Figure 3 A succession of within-trial dopamine increases. (a) Examples of FSCV data from a single session. Color plots display consecutive voltammograms (every 0.1 s) as a vertical colored strip; examples of individual voltammograms are shown at top (taken from marked time points). Dashed vertical lines indicate Side-In events for rewarded (red) and unrewarded (blue) trials. Black traces below indicate raw current values, at the applied voltage corresponding to the dopamine peak. (b) [DA] fluctuations for each of the 312 completed trials of the same session, aligned to key behavioral events. For Light-On and Center-In alignments, trials are sorted by latency (pink dots mark Light-On times; white dots mark Center-In times). For the other alignments, rewarded (top) and unrewarded (bottom) trials are shown separately, but otherwise in the order in which they occurred. [DA] changes aligned to Light-On were assessed relative to a 2-s baseline period ending 1 s before Light-On. For the other alignments, [DA] is shown relative to a 2-s baseline ending 1 s before Center-In. (c) Average [DA] changes during a single session (same data as b; shaded area represents s.e.m.). (d) Average event-aligned [DA] change across all six animals, for rewarded and unrewarded trials (see Supplementary Fig. 4 for each individual session). Data are normalized by the peak average rewarded [DA] in each session and are shown relative to the same baseline epochs as in b. Black arrows indicate increasing levels of event-related [DA] during the progression through rewarded trials. Colored bars at top indicate time periods with statistically significant differences (red, rewarded trials greater than baseline, one-tailed t tests for each 100-ms time point individually; blue, same for unrewarded trials; black, rewarded trials different to unrewarded trials, two-tailed t tests; all statistical thresholds set to P = 0.05, uncorrected).

Our task includes additional features, such as variable timing between events and many trials. We therefore considered what the ‘true’ value function should look like, on average, based on actual times to future rewards (Fig. 4c).
At the beginning of a trial, reward is at least several seconds away and may not occur at all until a later trial. During correct trial performance, each subsequent, variably timed event indicates to the rat that rewards are getting closer and more likely, and thus causes a jump in state value. For example, hearing the Go cue indicates both that reward is closer and that the rat will not lose out by moving too soon (an impulsive procedural error). Hearing the reward cue indicates that reward is now certain and only a couple of seconds away.
To assess how the intertwined decision variables, state value and RPE, are encoded by phasic [DA], we compared our FSCV measurements to the dynamically varying state value and RPE of a reinforcement learning model (Online Methods). This simplified model consisted of a set of discrete states (Supplementary Fig. 5) whose values were updated using temporal-difference RPEs. When the actual sequence of behavioral events experienced by the rat was given as input, the model's value function consisted of a series of increases in each trial (Fig. 4d,e), resembling the observed time course of [DA] (Fig. 3c).
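A compact sketch of a tabular temporal-difference model of this kind is shown below. It is our own simplified rendering: the parameter values (α = 0.4, γ = 0.95) follow the text, but the state names and trial generator are illustrative and the actual state space in Supplementary Fig. 5 is richer. It is intended only to show how event sequences produce a rising value function V and event-locked RPEs δ.

```python
import numpy as np

STATES = ["pre_trial", "light_on", "center_in", "go_cue", "reward_cue", "food_port", "end"]
IDX = {s: i for i, s in enumerate(STATES)}

def run_td(trials, alpha=0.4, gamma=0.95):
    """Tabular TD(0): V[s] <- V[s] + alpha * delta, where
    delta = r + gamma * V[s'] - V[s] at each state transition."""
    V = np.zeros(len(STATES))
    trace = []  # (state, current V, delta) at each transition, for comparison with [DA]
    for rewarded in trials:
        seq = ["pre_trial", "light_on", "center_in", "go_cue"]
        seq += ["reward_cue", "food_port"] if rewarded else []
        seq += ["end"]
        for s, s_next in zip(seq[:-1], seq[1:]):
            r = 1.0 if s_next == "food_port" else 0.0     # pellet retrieved at the food port
            delta = r + gamma * V[IDX[s_next]] - V[IDX[s]]
            V[IDX[s]] += alpha * delta
            trace.append((s, V[IDX[s]], delta))
    return V, trace

# Example: a session in which half of the trials are rewarded
rng = np.random.default_rng(1)
values, trace = run_td(rng.random(300) < 0.5)
print(dict(zip(STATES, np.round(values, 2))))
```

Within rewarded trials the learned values rise from pre-trial toward the food port, and unsignaled transitions (reward cue received or omitted) generate positive or negative δ, which is the qualitative pattern compared against the [DA] traces in Figure 4.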
Consistent with the idea that state value represents motivation to work, model state value early in each trial correlated with behavioral latencies for all rats (across a wide range of model parameter settings; Supplementary Fig. 5). We identified model parameters (learning rate = 0.4, discount factor = 0.95) that maximized this behavioral correlation across all rats combined and examined the corresponding within-trial correlation between [DA] and model variables. For all of the six FSCV rats, we found a clear and highly significant positive correlation between phasic [DA] and state value V (Fig. 4f). [DA] and RPE were also positively correlated, as expected given that V and RPE partially covary. However, in every case, [DA] had a significantly stronger relationship to V than to RPE (Fig. 4f and Supplementary Fig. 5).
Figure 4 Within-trial dopamine fluctuations reflect state value dynamics. (a) Top, temporal discounting: the motivational value of rewards is lower when they are distant in time. With the exponential discounting commonly used in RL models, value is lower by a constant factor γ for each time step of separation from reward. People and other animals may actually use hyperbolic discounting that can optimize reward rate (as rewards/time is inherently hyperbolic). Time parameters were chosen here simply to illustrate the distinct curve shapes. Bottom, effect of reward cue or omission on state value. At trial start, the discounted value of a future reward will be less if that reward is less likely. Lower value provides less motivational drive to start work, producing, for example, longer latencies. If a cue signals that upcoming reward is certain, the value function jumps up to the (discounted) value of that reward. For simplicity, the value of subsequent rewards is not included. (b) The reward prediction error δ reflects abrupt changes in state value. If the discounted value of work reflects an unlikely reward (for example, probability = 0.25), a reward cue prompts a larger δ than if the reward was likely (for example, probability = 0.75). Note that in this idealized example, δ would be zero at all other times. (c) Task events signal times to reward. Data are from the example session shown in Figure 3c. Bright red indicates actual times to the very next reward, dark red indicates subsequent rewards. Green arrowheads indicate average times to next reward (harmonic mean, only including rewards in the next 60 s). As the trial progresses, average times-to-reward get shorter. If the reward cue is received, rewards are reliably obtained ~2 s later. Task events are considered to prompt transitions between different internal states (Supplementary Fig. 5) whose learned values reflect these different experienced times to reward. (d) Average state value of the RL model for rewarded (red) and unrewarded (blue) trials, aligned on the Side-In event. The exponentially discounting model received the same sequence of events as in Figure 3c, and model parameters (α = 0.68, γ = 0.98) were chosen for the strongest correlation to behavior (comparing state values at Center-In to latencies in this session, Spearman r = −0.34). Model values were binned at 100 ms, and only bins with at least three events (state transitions) were plotted. (e) Example of the [DA] signal during a subset of trials from the same session compared with model variables. Black arrows indicate Center-In events, red arrows indicate Side-In with reward cue, and blue arrows indicate Side-In alone (omission). Scale bars represent 20 nM ([DA]), 0.2 (V) and 0.2 (δ). Dashed gray lines mark the passage of time in 10-s intervals. (f) Within-trial [DA] fluctuations were more strongly correlated with model state value (V) than with RPE (δ). For every rat, the [DA]:V correlation was significant (number of trials for each rat: 312, 229, 345, 252, 200, 204; P < 10⁻¹⁴ in each case; Wilcoxon signed-rank test of the null hypothesis that median correlation within trials is zero) and significantly greater than the [DA]:δ correlation (P < 10⁻²⁴ in each case, Wilcoxon signed-rank test). Group-wise, both [DA]:V and [DA]:δ correlations were significantly nonzero, and the difference between them was also significant (n = 6 sessions, all comparisons P = 0.031, Wilcoxon signed-rank test). Model parameters (α = 0.4, γ = 0.95) were chosen to maximize the average behavioral correlation across all six rats (Spearman r = −0.28), but the stronger [DA] correlation to V than to δ was seen for all parameter combinations (Supplementary Fig. 5). (g) Model variables were maximally correlated with [DA] signals ~0.5 s later, consistent with a slight delay caused by the time taken by the brain to process cues, and by the FSCV technique.
We emphasize that this result was not dependent on specific model
parameters; in fact, even if parameters were chosen to maximize
the [DA]:RPE correlation, the [DA]:V correlation was stronger
(Supplementary Fig. 5).
Correlations were maximal when V was compared with the [DA] signal measured ~0.4–0.5 s later (Fig. 4g). This small delay is consistent with the known brief lag associated with the FSCV method using acute electrodes²³ and with prior observations that peak [DA] response occurs ~0.5 s after cue onset with acute FSCV recordings³.
As an alternative method of incorporating temporal distortion that might be produced by FSCV and/or the finite speeds of DA release and uptake, we convolved model variables with a kernel consisting of an exponential rise and fall, and explored the effect of varying kernel time constants. Once again, [DA] always correlated much better with V than with RPE across a wide range of parameter values (Supplementary Fig. 6). We conclude that state value provides a more accurate description of the time course of [DA] fluctuations than RPE alone, even though RPEs can be simultaneously signaled as changes in state value.
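The exponential rise-and-fall kernel used to blur the model variables can be sketched as follows (our own illustration; the time constants shown are assumptions to be scanned, as in Supplementary Fig. 6, and the functional form is one simple way to realize "an exponential rise and fall").

```python
import numpy as np

def rise_fall_kernel(tau_rise=0.2, tau_fall=0.5, dt=0.1, t_max=5.0):
    """Kernel with an exponential rise and an exponential fall, normalized to unit area."""
    t = np.arange(0.0, t_max, dt)
    k = (1.0 - np.exp(-t / tau_rise)) * np.exp(-t / tau_fall)
    return k / k.sum()

def blur(signal, kernel):
    """Causal convolution: each model sample only influences later time points."""
    return np.convolve(signal, kernel, mode="full")[: len(signal)]

# Example: blur a model value trace sampled at 10 Hz before comparing it with [DA]
model_value = np.zeros(200)
model_value[50:100] = np.linspace(0.0, 1.0, 50)   # a ramp toward reward
blurred = blur(model_value, rise_fall_kernel())
# np.corrcoef(blurred, measured_da)[0, 1] would then be computed trial by trial
```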
Abrupt dopamine changes encode RPEs
FSCV electrode signals tend to drift over a timescale of minutes, so standard practice is to assess [DA] fluctuations relative to a pre-trial ‘baseline’ of unknown concentration (as in Fig. 3). Presented this way, reward cues appeared to evoke a higher absolute [DA] level when rewards were less common (Fig. 5a,b), consistent with a conventional RPE-based account of phasic [DA]. However, our model implies a different interpretation of this data (Figs. 4b and 5c). Rather than a jump from a fixed to a variable [DA] level (that encodes RPE), we predicted that the reward cue actually causes a [DA] jump from a variable [DA] level (reflecting variable estimates of upcoming reward) to a fixed [DA] level (that encodes the time-discounted value of the now-certain reward).
Figure 5 Between-trial dopamine shifts reflect updated state values. (a) Less-expected outcomes provoke larger changes in [DA]. [DA] data from all FSCV sessions together (as in Fig. 3d), broken down by recent reward history and shown relative to pre-trial baseline (−3 to −1 s relative to Center-In). Note that the [DA] changes after reward omission last at least several seconds (shift in level), rather than showing a highly transient dip followed by return to baseline, as might be expected for encoding RPEs alone. (b) Quantification of [DA] changes, between baseline and reward feedback (0.5–1.0 s after Side-In for rewarded trials, 1–3 s after Side-In for unrewarded trials). Error bars show s.e.m. (c) Data are presented as in a, but plotted relative to [DA] levels after reward feedback. These [DA] observations are consistent with a variable baseline whose level depends on recent reward history (as in the Fig. 4b model). (d) Alternative accounts of [DA] make different predictions for between-trial [DA] changes. When reward expectation is low, rewarded trials provoke large RPEs, but RPEs should decline across repeated consecutive rewards. Thus, if absolute [DA] levels encode RPE, the peak [DA] evoked by the reward cue should decline between consecutive rewarded trials (and baseline levels should not change). For simplicity, this cartoon omits detailed within-trial dynamics. (e) Predicted pattern of [DA] change under this account, which also does not predict any baseline shift after reward omissions (right). (f) If instead [DA] encodes state values, then peak [DA] should not decline from one reward to the next, but the baseline level should increase (and decrease following unrewarded trials). (g) Predicted pattern of [DA] change for this alternative account. (h) Unexpected rewards cause a shift in baseline, not in peak [DA]. Average FSCV data from consecutive pairs of rewarded trials (all FSCV sessions combined, as in a), shown relative to the pre-trial baseline of the first trial in each pair. Data were grouped into lower reward expectation (left pair of plots, 165 total trials; average time between Side-In events = 11.35 ± 0.22 s, s.e.m.) and higher reward expectation (right pair of plots, 152 total trials; time between Side-In events = 11.65 ± 0.23 s) by a median split of each individual session (using number of rewards in the last ten trials). Dashed lines indicate that reward cues evoked a similar absolute level of [DA] in the second rewarded trial compared with the first. Black arrow indicates the elevated pre-trial [DA] level for the second trial in the pair (mean change in baseline [DA] = 0.108, P = 0.013, one-tailed Wilcoxon signed rank test). No comparable change was observed if the first reward was more expected (right pair of plots; mean change in baseline [DA] = 0.0013, P = 0.108, one-tailed Wilcoxon signed rank test). (i) [DA] changes between consecutive trials follow the pattern expected for value coding, rather than RPE coding alone. Error bars represent ±s.e.m.

To test these competing accounts, we compared [DA] levels between consecutive pairs of rewarded trials with Side-In events < 30 s apart (that is, well within the accepted stable range of FSCV measurements²⁴; for included pairs of trials, the average time between Side-In events was 11.5 s). If the [DA] level evoked by the reward cue reflects RPE, then this level should tend to decline as rats experience
consecutive rewards (Fig. 5d,e). However, if [DA] represents state value, then baseline [DA] should asymptotically increase with repeated rewards while reward cue-evoked [DA] remains more stable (Fig. 5f,g). The latter proved correct (Fig. 5h,i). These results provide clear further evidence that [DA] reflects reward expectation (the value function), not just RPE.
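The consecutive-pair comparison can be sketched schematically as below, on simulated stand-in traces with our own variable names; the post-cue window follows the Figure 5 legend, the baseline window is approximated here relative to Side-In rather than Center-In, and the Wilcoxon signed-rank test is the one named in Figure 5h.

```python
import numpy as np
from scipy.stats import wilcoxon

def pair_baseline_and_peak(trial1, trial2, t):
    """For two consecutive rewarded trials (aligned to Side-In), return the change in
    pre-trial baseline [DA] and in reward-cue-evoked [DA] from trial 1 to trial 2.
    Windows: baseline -3 to -1 s (approximation), peak 0.5-1.0 s after Side-In."""
    base = slice(np.searchsorted(t, -3.0), np.searchsorted(t, -1.0))
    peak = slice(np.searchsorted(t, 0.5), np.searchsorted(t, 1.0))
    return (trial2[base].mean() - trial1[base].mean(),
            trial2[peak].mean() - trial1[peak].mean())

# Simulated stand-in data: 165 pairs of [DA] traces sampled at 10 Hz, -4 to +4 s
rng = np.random.default_rng(2)
t = np.arange(-4.0, 4.0, 0.1)
pairs = [(rng.standard_normal(t.size), rng.standard_normal(t.size) + 0.1) for _ in range(165)]
d_base, d_peak = zip(*(pair_baseline_and_peak(a, b, t) for a, b in pairs))
print(wilcoxon(d_base, alternative="greater"))   # value account: baseline should rise
print(wilcoxon(d_peak))                          # RPE-only account: peak should fall
```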
Considering the microdialysis and FSCV results together, a
parsimonious interpretation is that, across multiple measurement
timescales, [DA] simply signals estimated availability of reward.
The higher minute-by-minute [DA] levels observed with greater
reward rate reflect both the higher values of states distal to rewards
(including baseline periods between active trial performance) and
the greater proportion of time spent in high-value states proximal
to rewards.
By conveying an estimate of available reward, mesolimbic [DA]
could be used as a motivational signal, helping to decide whether it
is worthwhile to engage in effortful activity. At the same time, abrupt
relative changes in [DA] could be detected and used as an RPE signal
for learning. But is the brain actually using [DA] to signal motivation
or learning, or both, during this task?
Dopamine both enhances motivation and reinforces choices
To address this question, we turned to precisely timed, bidirectional, optogenetic manipulations of dopamine. Following an approach validated in previous studies⁶, we expressed channelrhodopsin-2 (ChR2) selectively in dopamine neurons by combining Th-Cre⁺ rats with DIO-ChR2 virus injections and bilateral optic fibers in the ventral tegmental area (Supplementary Fig. 7). We chose optical stimulation parameters (10-ms pulses of blue light at 30 Hz, 0.5-s total duration; Fig. 6a,b) that produced phasic [DA] increases of similar duration and magnitude to those naturally observed with unexpected reward delivery. We provided this stimulation at one of two distinct moments during task performance. We hypothesized that enhancing [DA] coincident with Light-On would increase the estimated motivational value of task performance; this would make the rat more likely to initiate an approach, leading to shorter latencies on the same trial. We further hypothesized that enhancing [DA] at the time of the major RPE (Side-In) would affect learning, as reflected in altered behavior on subsequent trials. In each session, laser stimulation was given at only one of these two times, and on only 30% of trials (randomly selected) to allow within-session comparisons between stimulated and unstimulated trials.
Figure 6 Phasic dopamine manipulations affect both learning and motivation. (a) FSCV measurement of optogenetically evoked [DA] increases. Optic fibers were placed above VTA and [DA] change examined in nucleus accumbens core. Example shows dopamine release evoked by a 0.5-s stimulation train (average of six stimulation events, shaded area indicates ±s.e.m.). (b) Effect of varying the number of laser pulses on evoked dopamine release, for the same 30-Hz stimulation frequency. (c) Dopaminergic stimulation at Side-In reinforces the chosen left or right action. Left, in Th-Cre⁺ rats stimulation of ChR2 increased the probability that the same action would be repeated on the next trial. Circles indicate average data for each of six rats (three sessions each, 384 ± 9.5 trials per session, s.e.m.). Middle, this effect did not occur in Th-Cre⁻ littermate controls (six rats, three sessions each, 342 ± 7 trials per session). Right, in Th-Cre⁺ rats expressing Halorhodopsin, orange laser stimulation at Side-In reduced the chance that the chosen action was repeated on the next trial (five rats, three sessions each, 336 ± 10 trials per session). See Supplementary Figure 8 for additional analyses. (d) Laser stimulation at Light-On causes a shift toward sooner engagement, if the rats were not already engaged. Latency distribution (on log scale, 10 bins per log unit) for non-engaged, completed trials in Th-Cre⁺ rats with ChR2 (n = 4 rats with video analysis; see Supplementary Figure 9 for additional analyses). (e) Same latency data as d, but presented as hazard rates. Laser stimulation (blue ticks at top left) increased the chance that rats would decide to initiate an approach, resulting in more Center-In events 1–2 s later (for these n = 4 rats, one-way ANOVA on hazard rate F(1,3) = 18.1, P = 0.024). See Supplementary Figure 10 for hazard rate time courses from the individual rats.

Providing phasic [DA] at Side-In reinforced choice behavior: it increased the chance that the same left or right action was repeated on the next trial, whether or not the food reward was actually received (n = 6 rats, two-way ANOVA yielded significant main effects for laser, F(1,5) = 224.0, P = 2.4 × 10⁻⁵; for reward, F(1,5) = 41.0, P = 0.0014; without a significant laser × reward interaction, P = 0.174; Fig. 6c and Supplementary Fig. 8c). No reinforcing effect was seen if the same optogenetic stimulation was given in littermate controls (n = 6 Th-Cre⁻ rats, laser main effect F(1,5) = 2.51, P = 0.174; Fig. 6c). For a further group of Th-Cre⁺ animals (n = 5), we instead used the inhibitory opsin Halorhodopsin (eNpHR3.0). Inhibition of dopamine cells at Side-In reduced the probability that the same left or right choice was repeated on the next trial (laser main effect F(1,4) = 18.7, P = 0.012; without a significant laser × reward interaction, P = 0.962). A direct comparison between these three rat groups also demonstrated a group-specific effect of Side-In laser stimulation on choice reinforcement (two-way ANOVA, laser × group interaction F(2,14) = 69.4, P = 5.4 × 10⁻⁸). These observations support the hypothesis that abrupt [DA] fluctuations serve as an RPE learning signal, consistent with prior optogenetic manipulations⁷. However, extra [DA] at Side-In did not affect
subsequent trial latency (Supplementary Fig. 8a,b), indicating that
our artificial [DA] manipulations reproduced some, but not all, types
of behavioral change normally evoked by rewarded trials.
Optogenetic effects on reinforcement were temporally specific:
providing extra [DA] at Light-On (instead of Side-In) on trial n did
not affect the probability that rats made the same choice on trial
n + 1 (laser main effect F(1,5) = 0.031, P = 0.867; Supplementary
Fig. 8c) nor did it affect the probability that choice on trial n was the
same as trial n − 1 (laser main effect F(1,5) = 0.233, P = 0.649).
By contrast, extra [DA] at Light-On markedly affected latency for that very same trial (Fig. 6d and Supplementary Fig. 8). The effect on latencies depended on what the rat was doing at the time of Light-On (two-way ANOVA yielded a significant laser × engaged interaction, F(1,3) = 28.1, P = 0.013). If the rat was already engaged in task performance, the very short latencies became slightly longer on average (median control latency = 0.45 s, median stimulated latency = 0.61 s; simple main effect of laser, F(1,3) = 10.4, P = 0.048). This effect apparently resulted from additional laser-evoked orienting movements on a subset of trials (Supplementary Fig. 9). By contrast, for non-engaged trials extra [DA] significantly reduced latencies (median control latency = 2.64 s, median stimulated latency = 2.16 s; simple main effect of laser, F(1,3) = 32.5, P = 0.011; Fig. 6d). These optogenetic results are consistent with the idea that mesolimbic [DA] is less important for the initiation of simple, cue-evoked responses when a task is already underway²⁵, but is critical for motivating ‘flexible approach’ behaviors²⁶.
The shorter latencies produced by extra [DA] were not the result of rats approaching the start port at faster speeds, as the average approach trajectory was unaffected (Supplementary Fig. 9). Instead, extra [DA] transiently increased the probability that rats initiated the approach behavior. As the approach itself lasted ~1–2 s (Supplementary Fig. 9), the result was an increased rate of Center-In events ~1–2 s after the laser pulse train (Fig. 6e and Supplementary Fig. 10). This effect of Light-On laser stimulation on hazard rates was dependent on rat group (two-way ANOVA, laser × group interaction F(2,14) = 26.28, P = 0.000018). Post hoc pairwise comparison of simple laser effects showed a significant increase in hazard rate for Th-Cre⁺ ChR2 rats (F(1,14) = 62.06, P = 1.63 × 10⁻⁶) and a significant reduction in hazard rate for Th-Cre⁺ eNpHR3.0 rats (F(1,14) = 6.31, P = 0.025), with no significant change in Th-Cre⁻ ChR2 rats (F(1,14) = 2.81, P = 0.116). Overall, we conclude that, beyond just correlating with estimates of reward availability, mesolimbic [DA] helps translate those estimates into decisions to work for reward.
DISCUSSION
A dopamine value signal used for both motivation and learning
Our results help confirm a range of disparate prior ideas, while placing them within a newly integrated theoretical context. First, phasic [DA] has been previously related to motivated approach¹⁴,¹⁵, reward expectation¹⁶ and effort-based decision-making²⁷, but our demonstration that [DA] specifically conveys the temporally discounted value of future rewards grounds this motivational aspect of dopamine fluctuations in the quantitative frameworks of machine learning and optimal foraging theory. This idea is also consistent with findings using other techniques; for example, fMRI signals in ventral striatum (often argued to reflect dopamine signaling) encode reward expectation in the form of temporally discounted subjective value²⁸.
Second, using the complementary method of microdialysis to assess slower changes, we partly confirmed proposals that reward rate is reflected specifically in increased [DA], which in turn enhances motivational vigor¹⁰. However, our critical argument is that this motivational message of reward availability can dynamically change from moment to moment, rather than being an inherently slow (tonic) signal. Using optogenetics, we confirmed that phasic changes in [DA] levels immediately affect willingness to engage in work, supporting the idea that subsecond [DA] fluctuations promptly influence motivational decision-making¹³,²⁹. This dynamic [DA] motivation signal can help to account for detailed patterns of time allocation³⁰. For example, animals take time to reengage in task performance after getting a reward (the post-reinforcement pause), and this pause is longer when the next reward is smaller or more distant. This behavioral phenomenon has been a long-standing puzzle³¹, but fits well with our argument that the time-discounted value of future rewards, conveyed by [DA], influences the moment-by-moment probability (hazard rate) of engaging in work.
Third, we confirmed the vital role of fast [DA] fluctuations, including transient dips, in signaling RPEs to affect learning⁴⁻⁶. However, a notable result from our analyses is that RPEs were conveyed by fast relative changes in the [DA] value signal, rather than by deviations from a steady (tonic) baseline. This interpretation explains for the first time, to the best of our knowledge, how [DA] can simultaneously provide both learning and motivational signals, an important gap in prior theorizing. Our results also highlight the importance of not assuming a consistent baseline [DA] level across trials in voltammetry studies.
One interesting implication is that, among the many postsynaptic mechanisms that are affected by dopamine, some are concerned more with absolute levels and others with fast relative changes. This possibility needs to be investigated further, together with the natural working hypothesis that [DA] effects on neuronal excitability are closely involved in motivational functions³², whereas [DA] effects on spike-timing-dependent plasticity are responsible for reinforcement-driven learning¹. It is also intriguing that a pulse of increased [DA] sufficient to immediately affect latency, or to alter left or right choice on subsequent trials, does not appear to be sufficient to alter latency on subsequent trials. This suggests that state values and left and right action values¹⁷ may be updated via distinct mechanisms or at different times in the trial.
Although dopamine is often labeled as a reward transmitter, [DA] levels dropped during reward consumption, consistent with findings that dopamine is relatively less important for consuming, and apparently enjoying, rewards⁷,³³. Mesolimbic [DA] has also been shown to not be required for performance of simple actions that are immediately followed by reward, such as pressing a lever once to obtain food³⁴. Rather, loss of mesolimbic [DA] reduces motivation to work, in the sense of investing time and effort in activities that are not inherently rewarding or interesting, but may eventually lead to rewards¹². Conversely, increasing [DA] with drugs such as amphetamines increases motivation to engage in prolonged work, in both normal subjects and those with attention-deficit hyperactivity disorder³⁵,³⁶.
Dopamine and decision dynamics
Our interpretation of mesolimbic [DA] as signaling the value of work is based on rat decisions to perform our task rather than alternative ‘default’ behaviors, such as grooming or local exploration. In this view, mesolimbic [DA] helps to determine whether to work, but not which activity is most worthwhile (that is, it is activational more than directional¹²). It may be best considered as signaling the overall motivational excitement associated with reward expectation or, equivalently, the perceived opportunity cost of sloth¹⁰,³⁰.
Based on prior results²⁷, we expect that [DA] signals reward availability without factoring in the costs of effortful work, but we did not parametrically vary such costs here. Other notable limitations of our study are that we only examined [DA] in the nucleus accumbens and we did not selectively manipulate [DA] in various striatal subregions (and other dopamine targets). Our functional account of [DA] effects on behavioral performance is undoubtedly incomplete, and it will be important to explore alternative descriptions, especially more generalizable accounts that apply throughout the striatum. In particular, our observation that mesolimbic [DA] affects the hazard rate of decisions to work seems compatible with a broader influence of striatal [DA] over decision-making, such as setting ‘thresholds’ for decision process completion²⁷,³⁷,³⁸. In sensorimotor striatum, dopamine influences the vigor (and learning) of more elemental actions³⁸,³⁹, and it has been shown that even saccade speed in humans is best predicted by a discounting model that optimizes the rate of reward⁴⁰. In this way, the activational, invigorating role of [DA] on both simple movements and motivation may reflect the same fundamental, computational-level mechanism applied to decision-making processes throughout striatum, affecting behaviors across a range of timescales.
Activational signals are useful, but not sufficient, for adaptive decision-making in general. Choosing between alternative, simultaneously available courses of action requires net value representations for the specific competing options²⁷,⁴¹. Although different subpopulations of dopamine neurons may carry somewhat distinct signals⁴², the aggregate [DA] message received by target regions is unlikely to have sufficient spatial resolution to represent multiple competing values simultaneously⁴³ or sufficient temporal resolution to present them for rapid serial consideration⁴⁴. By contrast, distinct ensembles of GABAergic neurons in the basal ganglia can dynamically encode the value of specific options, including through ramps-to-reward⁴⁵,⁴⁶ that may reflect escalating bids for behavioral control. Such neurons are modulated by dopamine, and in turn provide key feedback inputs to dopamine cells that may contribute to the escalating [DA] patterns observed here.
Relationship between dopamine cell firing and release
Firing rates of presumed dopamine cells have been previously reported to escalate within trials under some conditions (ref. 47), but this has not typically been reported with reward anticipation. Several factors may contribute to this apparent discrepancy with our [DA] measures. The first is the nature of the behavioral task. Many important prior studies of dopamine (refs 2,3) (although not all; ref. 41) used Pavlovian situations, in which outcomes are not determined by the animal's actions. When effortful work is not required to obtain rewards, the learned value of work may be low and corresponding decision variables may be less apparent.

Second, a moving rat receives constantly changing sensory input, and may therefore more easily define and discriminate a set of discrete states leading up to reward, compared with situations in which elapsed time is the sole cue of progress. When such a sequence of states can be more readily recognized, it may be easier to assign a corresponding set of escalating values as reward gets nearer in time. Determining subjects' internal state representations, and their development during training, is an important challenge for future work. It has been argued that ramps in [DA] might actually reflect RPE if space is nonlinearly represented (ref. 48) or if learned values rapidly decay in time (ref. 49). However, these suggestions do not address the critical relationship between [DA] and motivation that we aim to account for here.

Finally, release from dopamine terminals is strongly influenced by local microcircuit mechanisms in striatum (ref. 50), producing a dissociation between dopamine cell firing and [DA] in target regions. This dissociation is not complete: the ability of unexpected sensory events to drive a rapid, synchronized burst of dopamine cell firing is still likely to be of particular importance for abrupt RPE signaling at state transitions. More detailed models of dopamine release, incorporating dopamine cell firing, local terminal control and uptake dynamics, will certainly be needed to understand how [DA] comes to convey a value signal.
METHODS
Methods and any associated references are available in the online
version of the paper.
Note: Any Supplementary Information and Source Data files are available in the
online version of the paper.
ACKNOWLEDGMENTS
We thank K. Berridge, T. Robinson, R. Wise, P. Redgrave, P. Dayan, D. Weissman, A.
Kreitzer, N. Sanderson, D. Leventhal, S. Singh, J. Beeler, M. Walton, S. Nicola and
members of the Berke laboratory for critical reading of various manuscript drafts,
N. Mallet for initial assistance with viral injections, and K. Porter-Stransky for
initial assistance with microdialysis procedures. Th-Cre+ rats were developed
by K. Deisseroth and I. Witten and made available for distribution through RRRC
(http://www.rrrc.us). This work was supported by the National Institute on Drug
Abuse (DA032259, training grant DA007281), the National Institute of Mental
Health (MH093888, MH101697), the National Institute on Neurological Disorders
and Stroke (NS078435, training grant NS076401), and the National Institute of
Biomedical Imaging and Bioengineering (EB003320). R.S. was supported by the
BrainLinks-BrainTools Cluster of Excellence funded by the German Research
Foundation (DFG grant number EXC1086).
AUTHOR CONTRIBUTIONS
A.A.H. performed and analyzed both FSCV and optogenetic experiments,
and J.R.P. performed and analyzed the microdialysis experiments. O.S.M.
assisted with microdialysis, C.M.V.W. assisted with FSCV, V.L.H. assisted with
optogenetics and R.S. assisted with reinforcement learning models. B.J.A. helped
supervise the FSCV experiments and data analysis, and R.T.K. helped supervise
microdialysis experiments. J.D.B. designed and supervised the study, performed
the computational modeling, developed the theoretical interpretation, and wrote
the manuscript.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Reprints and permissions information is available online at http://www.nature.com/
reprints/index.html.
1. Reynolds, J.N., Hyland, B.I. & Wickens, J.R. A cellular mechanism of reward-related
learning. Nature 413, 67–70 (2001).
2. Schultz, W., Dayan, P. & Montague, P.R. A neural substrate of prediction and reward.
Science 275, 1593–1599 (1997).
3. Day, J.J., Roitman, M.F., Wightman, R.M. & Carelli, R.M. Associative learning
mediates dynamic shifts in dopamine signaling in the nucleus accumbens.
Nat. Neurosci. 10, 1020–1028 (2007).
4. Hart, A.S., Rutledge, R.B., Glimcher, P.W. & Phillips, P.E. Phasic dopamine release
in the rat nucleus accumbens symmetrically encodes a reward prediction error term.
J. Neurosci. 34, 698–704 (2014).
5. Kim, K.M. et al. Optogenetic mimicry of the transient activation of dopamine
neurons by natural reward is sufficient for operant reinforcement. PLoS ONE 7,
e33612 (2012).
6. Steinberg, E.E. et al. A causal link between prediction errors, dopamine neurons
and learning. Nat. Neurosci. 16, 966–973 (2013).
7. Berridge, K.C. The debate over dopamine’s role in reward: the case for incentive
salience. Psychopharmacology (Berl.) 191, 391–431 (2007).
8. Beierholm, U. et al. Dopamine modulates reward-related vigor. Neuropsycho-
pharmacology 38, 1495–1503 (2013).
9. Freed, C.R. & Yamamoto, B.K. Regional brain dopamine metabolism: a marker for
the speed, direction, and posture of moving animals. Science 229, 62–65
(1985).
10. Niv, Y., Daw, N. & Dayan, P. How fast to work: response vigor, motivation and tonic
dopamine. Adv. Neural Inf. Process. Syst. 18, 1019 (2006).
11. Cagniard, B., Balsam, P.D., Brunner, D. & Zhuang, X. Mice with chronically elevated
dopamine exhibit enhanced motivation, but not learning, for a food reward.
Neuropsychopharmacology 31, 1362–1370 (2006).
12. Salamone, J.D. & Correa, M. The mysterious motivational functions of mesolimbic
dopamine. Neuron 76, 470–485 (2012).
13. Satoh, T., Nakai, S., Sato, T. & Kimura, M. Correlated coding of motivation and
outcome of decision by dopamine neurons. J. Neurosci. 23, 9913–9923
(2003).
14. Phillips, P.E., Stuber, G.D., Heien, M.L., Wightman, R.M. & Carelli, R.M. Subsecond
dopamine release promotes cocaine seeking. Nature 422, 614–618 (2003).
15. Roitman, M.F., Stuber, G.D., Phillips, P.E., Wightman, R.M. & Carelli, R.M.
Dopamine operates as a subsecond modulator of food seeking. J. Neurosci. 24,
1265–1271 (2004).
16. Howe, M.W., Tierney, P.L., Sandberg, S.G., Phillips, P.E. & Graybiel, A.M. Prolonged
dopamine signaling in striatum signals proximity and value of distant rewards.
Nature 500, 575–579 (2013).
17. Samejima, K., Ueda, Y., Doya, K. & Kimura, M. Representation of action-specific
reward values in the striatum. Science 310, 1337–1340 (2005).
18. Guitart-Masip, M., Beierholm, U.R., Dolan, R., Duzel, E. & Dayan, P. Vigor in the
face of fluctuating rates of reward: an experimental examination. J. Cogn. Neurosci.
23, 3933–3938 (2011).
19. Wang, A.Y., Miura, K. & Uchida, N. The dorsomedial striatum encodes net expected
return, critical for energizing performance vigor. Nat. Neurosci. 16, 639–647 (2013).
20. Stephens, D.W. Foraging theory (Princeton University Press, 1986).
21. Sutton, R.S. & Barto, A.G. Reinforcement Learning: an Introduction (MIT Press,
1998).
22. Hull, C.L. The goal-gradient hypothesis and maze learning. Psychol. Rev. 39, 25
(1932).
23. Venton, B.J., Troyer, K.P. & Wightman, R.M. Response Times of carbon fiber
microelectrodes to dynamic changes in catecholamine concentration. Anal. Chem.
74, 539–546 (2002).
24. Heien, M.L. et al. Real-time measurement of dopamine fluctuations after cocaine
in the brain of behaving rats. Proc. Natl. Acad. Sci. USA 102, 10023–10028
(2005).
25. Nicola, S.M. The flexible approach hypothesis: unification of effort and cue-
responding hypotheses for the role of nucleus accumbens dopamine in the activation
of reward-seeking behavior. J. Neurosci. 30, 16585–16600 (2010).
26. Ikemoto, S. & Panksepp, J. The role of nucleus accumbens dopamine in motivated
behavior: a unifying interpretation with special reference to reward-seeking.
Brain Res. Brain Res. Rev. 31, 6–41 (1999).
27. Gan, J.O., Walton, M.E. & Phillips, P.E. Dissociable cost and benefit encoding of
future rewards by mesolimbic dopamine. Nat. Neurosci. 13, 25–27 (2010).
28. Kable, J.W. & Glimcher, P.W. The neural correlates of subjective value during
intertemporal choice. Nat. Neurosci. 10, 1625–1633 (2007).
29. Adamantidis, A.R. et al. Optogenetic interrogation of dopaminergic modulation of
the multiple phases of reward-seeking behavior. J. Neurosci. 31, 10829–10835
(2011).
30. Niyogi, R.K. et al. Optimal indolence: a normative microscopic approach to work
and leisure. J. R. Soc. Interface 11, 20130969 (2014).
31. Schlinger, H.D., Derenne, A. & Baron, A. What 50 years of research tell us about
pausing under ratio schedules of reinforcement. Behav. Anal. 31, 39 (2008).
32. du Hoffmann, J. & Nicola, S.M. Dopamine invigorates reward seeking by promoting
cue-evoked excitation in the nucleus accumbens. J. Neurosci. 34, 14349–14364
(2014).
33. Cannon, C.M. & Palmiter, R.D. Reward without dopamine. J. Neurosci. 23,
10827–10831 (2003).
34. Ishiwari, K., Weber, S.M., Mingote, S., Correa, M. & Salamone, J.D. Accumbens
dopamine and the regulation of effort in food-seeking behavior: modulation of work
output by different ratio or force requirements. Behav. Brain Res. 151, 83–91
(2004).
35. Rapoport, J.L. et al. Dextroamphetamine. Its cognitive and behavioral effects in
normal and hyperactive boys and normal men. Arch. Gen. Psychiatry 37, 933–943
(1980).
36. Wardle, M.C., Treadway, M.T., Mayo, L.M., Zald, D.H. & de Wit, H. Amping up
effort: effects of d-amphetamine on human effort-based decision-making.
J. Neurosci. 31, 16597–16602 (2011).
37. Nagano-Saito, A. et al. From anticipation to action, the role of dopamine in
perceptual decision making: an fMRI-tyrosine depletion study. J. Neurophysiol. 108,
501–512 (2012).
38. Leventhal, D.K. et al. Dissociable effects of dopamine on learning and performance
within sensorimotor striatum. Basal Ganglia 4, 43–54 (2014).
39. Turner, R.S. & Desmurget, M. Basal ganglia contributions to motor control: a vigorous
tutor. Curr. Opin. Neurobiol. 20, 704–716 (2010).
40. Haith, A.M., Reppert, T.R. & Shadmehr, R. Evidence for hyperbolic temporal
discounting of reward in control of movements. J. Neurosci. 32, 11727–11736
(2012).
41. Morris, G., Nevet, A., Arkadir, D., Vaadia, E. & Bergman, H. Midbrain dopamine
neurons encode decisions for future action. Nat. Neurosci. 9, 1057–1063 (2006).
42. Matsumoto, M. & Hikosaka, O. Two types of dopamine neuron distinctly convey
positive and negative motivational signals. Nature 459, 837–841 (2009).
43. Dreyer, J.K., Herrik, K.F., Berg, R.W. & Hounsgaard, J.D. Influence of phasic and tonic
dopamine release on receptor activation. J. Neurosci. 30, 14273–14283 (2010).
44. McClure, S.M., Daw, N.D. & Montague, P.R. A computational substrate for incentive
salience. Trends Neurosci. 26, 423–428 (2003).
45. Tachibana, Y. & Hikosaka, O. The primate ventral pallidum encodes expected reward
value and regulates motor action. Neuron 76, 826–837 (2012).
46. van der Meer, M.A. & Redish, A.D. Ventral striatum: a critical look at models of
learning and evaluation. Curr. Opin. Neurobiol. 21, 387–392 (2011).
47. Fiorillo, C.D., Tobler, P.N. & Schultz, W. Discrete coding of reward probability and
uncertainty by dopamine neurons. Science 299, 1898–1902 (2003).
48. Gershman, S.J. Dopamine ramps are a consequence of reward prediction errors.
Neural Comput. 26, 467–471 (2014).
49. Morita, K. & Kato, A. Striatal dopamine ramping may indicate flexible reinforcement
learning with forgetting in the cortico-basal ganglia circuits. Front Neural Circuits
8, 36 (2014).
50. Threlfell, S. et al. Striatal dopamine release is triggered by synchronized activity
in cholinergic interneurons. Neuron 75, 58–64 (2012).
ONLINE METHODS
Animals and behavioral task. All animal procedures were approved by the University of Michigan Committee on Use and Care of Animals. Male rats (300–500 g, either wild-type Long-Evans or Th-Cre+ with a Long-Evans background (ref. 51)) were maintained on a reverse 12:12 light:dark cycle and tested during the dark phase. Rats were mildly food deprived, receiving 15 g of standard laboratory rat chow daily in addition to food rewards earned during task performance. Training and testing were performed in computer-controlled Med Associates operant chambers (25 cm × 30 cm at widest point), each with a five-hole nose-poke wall, as previously described (refs 52–54). Training to perform the trial-and-error task typically took ~2 months, and included several pretraining stages (2 d to 2 weeks each, advancing when ~85% of trials were performed without procedural errors).
First, any one of the five nosepoke holes was illuminated (at random), and pok-
ing this hole caused delivery of a 45-mg fruit punch–flavored sucrose pellet into
the Food Port (FR1 schedule). Activation of the food hopper to deliver the pellet
caused an audible click (the reward cue). In the next stage, the hole illuminated
at trial start was always one of the three more-central holes (randomly-selected),
and rats learned to poke and maintain hold for a variable interval (750–1,250 ms)
until Go cue onset (250-ms duration white noise, together with dimming of the
start port). Next, Go cue onset was also paired with illumination of both adjacent
side ports. A leftward or rightward poke to one of these ports was required to
receive a reward (each at 50% probability), and initiated the inter-trial inter-
val (5–10 s randomly selected from a uniform distribution). If the rat poked an
unlit center port (wrong start) or pulled out before the end of the hold period
(false start), the house light turned on for the duration of an inter-trial interval.
During this stage (only), to discourage development of a side bias, a maximum of
three consecutive pokes to the same side were rewarded. Finally, in the complete
trial-and-error task left and right choices had independent reward probabilities,
each maintained for blocks of 40–60 trials (randomly selected block length and
sequence for each session). All combinations of 10, 50 and 90% reward probability
were used except 10:10 and 90:90. There was no event that indicated to the rat
that a trial would be unrewarded other than the omission of the Reward cue and
the absence of the pellet.
For a subset of ChR2 optogenetic sessions, overhead video was captured at 15 frames per s. The frames immediately preceding the Light-On events were extracted, and the positions of the nose tip and neck were marked (by scorers blind to whether that trial included laser stimulation). These positions were used to determine rat distance and orientation relative to the center port (the one that would be illuminated on that trial). Each trial was classified as 'engaged' or 'unengaged', using cutoff values of distance (10.6 cm) and orientation (84°) that minimized the overlap between the aggregate distributions. To assess how path length was affected by optogenetic stimulation, rat head positions were scored for each video frame between Light-On and Center-Nose-In. Engaged trials were further classified by whether the rat was immediately adjacent to one of the three possible center ports, and whether that port was the one that became illuminated at Light-On ('lucky' versus 'unlucky' guesses).
Smoothing of latency (and other) time series for graphical display (Fig. 1b,c)
was performed using the MATLAB filtfilt function with a seven-trial window.
To quantify the impact of prior trial rewards on current trial latency, we used a multiple regression model

\log_{10}(\mathrm{latency}_t) = b_1 r_{t-1} + b_2 r_{t-2} + \dots + b_{10} r_{t-10}

where r = 1 if the corresponding trial was rewarded (and 0 otherwise). All latency analyses excluded trials of zero latency (that is, those for which the rat's nose was already inside the randomly chosen center port at Light-On). For analysis of prior trial outcomes on left/right choice behavior we used another multiple regression model, just as previously described (ref. 55).
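As a concrete illustration, the following MATLAB sketch fits this kind of lagged-outcome regression; the variable names (rewarded, latency) are hypothetical placeholders rather than the authors' code, and fitlm (Statistics Toolbox) is used here simply as one convenient estimator:

% rewarded : nTrials x 1 vector, 1 if the trial was rewarded, 0 otherwise
% latency  : nTrials x 1 vector of Light-On to Center-In latencies (s)
nLags = 10;
trialIdx = (nLags + 1:numel(latency))';          % trials with a full 10-trial history
X = zeros(numel(trialIdx), nLags);
for lag = 1:nLags
    X(:, lag) = rewarded(trialIdx - lag);        % regressors r_{t-1} ... r_{t-10}
end
keep = latency(trialIdx) > 0;                    % exclude zero-latency trials
mdl = fitlm(X(keep, :), log10(latency(trialIdx(keep))));
disp(mdl.Coefficients)                           % b_1 ... b_10 give the reward-history effects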
Latency survivor curves were calculated simply as the proportion of trials for which the Center-In event had not yet occurred, at each 250-ms interval after Light-On (an inverted cumulative latency distribution), smoothed with a three-point moving average (x_t ← 0.25 x_{t−1} + 0.5 x_t + 0.25 x_{t+1}). These survivor curves were then used to calculate hazard rates, as the fraction of the remaining latencies that occurred in each 250-ms bin (the number of Center-In events that happened, divided by the number that could have happened).
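A minimal MATLAB sketch of this calculation follows (latency is an assumed variable of Light-On to Center-In times with zero-latency trials already removed; the 10-s display window is an illustrative assumption, not a value from the paper):

% latency : vector of Light-On to Center-In latencies (s)
binWidth = 0.25;                                      % 250-ms bins
edges    = 0:binWidth:10;                             % assumed display window
counts   = histcounts(latency, edges);                % Center-In events per bin
survivor = 1 - cumsum([0 counts]) / numel(latency);   % fraction with no Center-In yet
% three-point moving average, x_t <- 0.25*x_{t-1} + 0.5*x_t + 0.25*x_{t+1}
survivor = conv(survivor, [0.25 0.5 0.25], 'same');
% hazard rate: fraction of the remaining latencies that occur in each bin
hazard   = (survivor(1:end-1) - survivor(2:end)) ./ max(survivor(1:end-1), eps);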
We defined reward rate as the exponentially weighted moving average of individual rewards (a leaky integrator; refs 56–58). For each session the integrator time constant was chosen to maximize the (negative) correlation between reward rate and behavioral latency. If instead we defined reward rate as simply the number of rewards during each minute (ignoring the contributions of trials in previous minutes to the current reward rate), the relationship between microdialysis-measured [DA] in that minute and reward rate was weaker, although still significant (R² = 0.084, P = 5.5 × 10⁻¹⁰).
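One simple reading of this procedure is sketched below in MATLAB; the candidate range of time constants and the use of the next trial's latency for the correlation are illustrative assumptions, not the authors' exact fitting procedure:

% rewarded : nTrials x 1 vector (0/1); latency : nTrials x 1 vector (s)
taus  = 1:30;                              % candidate time constants, in trials (assumed range)
bestR = Inf;
for tau = taus
    rate = filter(1/tau, [1, -(1 - 1/tau)], rewarded);   % leaky integration of outcomes
    r    = corr(rate(1:end-1), latency(2:end));          % rate after trial t-1 vs latency on trial t
    if r < bestR                                          % keep the most negative correlation
        bestR = r;  bestTau = tau;  rewardRate = rate;
    end
end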
An important parameter in reinforcement learning is the degree to which agents choose the option that is currently estimated to be the best (exploitation) versus trying alternatives to assess whether they are actually better (exploration), and dopamine has been proposed to mediate this trade-off (refs 59,60). To assess this, we examined left/right choices in the second half of each block, by which time choices had typically stabilized (Fig. 1d; this behavioral pattern was also seen for the microdialysis sessions). We defined an exploitation index as the proportion of trials for which rats chose the better option in these second block halves (so values close to 1 would be fully exploitative, and values close to 0.5 would be random/exploratory). As an alternative metric of exploration/exploitation, we examined the number of times that the rat switched between left and right choices in each minute; this metric also showed no significant relationship to any neurochemical assayed in our microdialysis experiments.
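Both metrics are simple to compute; a minimal sketch with hypothetical variable names:

% choseBetter   : logical vector over trials in the second half of each block,
%                 true when the currently higher-probability side was chosen
% choiceSide    : per-trial choice for one session (e.g., 1 = left, 2 = right)
% minuteOfTrial : minute index (1, 2, ...) of each trial within the session
exploitIndex   = mean(choseBetter);                     % ~1 = exploitative, ~0.5 = exploratory
switches       = [0; diff(choiceSide(:)) ~= 0];         % trial-to-trial switches between sides
switchesPerMin = accumarray(minuteOfTrial(:), switches, [], @sum);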
Microdialysis. After 3–6 months of behavioral training, rats were implanted with guide cannulae bilaterally above the nucleus accumbens core (NAcc; +1.3–1.9 mm AP, 1.5 mm ML from bregma) and allowed to recover for at least 1 week before retraining. On test days (3–5 weeks after cannula implantation) a single custom-made microdialysis probe (300-µm diameter) with polyacrylonitrile membrane (Hospal; 20-kDa molecular weight cutoff) was inserted into NAcc, extending 1 mm below the guide cannula. Artificial CSF (composition in mM: CaCl2 1.2, KCl 2.7, NaCl 148, MgCl2 0.85, ascorbate 0.25) was perfused continuously at 2 µl min⁻¹. Rats were placed in the operant chamber with the house light on for an initial 90-min period of probe equilibration, after which samples were collected once every minute. Following five baseline samples, the house light was extinguished to indicate task availability.
For chemical analyses, we employed a modified version of our benzoyl chloride derivatization and HPLC–MS analysis method (ref. 61). Immediately after each 2-µl sample collection, we added 1.5 µl of buffer (sodium carbonate monohydrate, 100 mM), 1.5 µl of 2% benzoyl chloride in acetonitrile, and 1.5 µl of a 13C-labeled internal standard mixture (total mixture volume 6.5 µl). The mixture was vortexed for 2 s between each reagent addition. Since ACh is a quaternary amine and thus not derivatized by benzoyl chloride, it was directly detected in its native form (transition 146→87). Deuterated ACh (d4-ACh) was also added to the internal standard mixture for improved ACh quantification (ref. 62). 5 µl of the sample mixture was automatically injected by a Thermo Accela HPLC system (Thermo Fisher Scientific) onto a reverse-phase Kinetex biphenyl HPLC column (2.1 mm × 100 mm; 1.7-µm particle size; Phenomenex). The HPLC system was interfaced to a HESI II ESI probe and Thermo TSQ Quantum Ultra (Thermo Scientific) triple quadrupole mass spectrometer operating in positive mode. Sample run times for all analytes were 3 min. To quantify neurochemicals in dialysate samples, we constructed six-point external calibration curves encompassing known physiological concentrations. Thermo Xcalibur 2.1 software (Thermo Fisher Scientific) automatically detected chromatographic peaks and quantified concentrations. To reduce noise, each resulting minute-by-minute time series was smoothed with a three-point moving average (as above), then converted to Z-scores to facilitate comparison between subjects.
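For example (a minimal sketch; analyte is an assumed name for one minute-by-minute concentration series):

% analyte : nMinutes x 1 vector of measured concentrations for one session
smoothed = conv(analyte, [0.25 0.5 0.25], 'same');        % three-point moving average
zscored  = (smoothed - mean(smoothed)) / std(smoothed);   % convert to Z-scores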
Regression analysis of microdialysis data was performed stepwise. We first
constructed models with only one behavioral variable as predictor and one out-
come (analyte). If two behavioral variables showed a significant relationship to
a given analyte, we constructed a model with both behavioral variables and an
interaction term, and examined the capacity of each variable to explain analyte
variance without substantial multicollinearity.
To determine cross-correlogram statistical thresholds, we first shuffled the time series for all sessions 200,000 times and calculated the average Pearson correlation coefficients (that is, the zero-lag cross-correlation) for each shuffled pair of time series. Thresholds were based on the tails of the resulting distribution: that is, for uncorrected two-tailed alpha = 0.05, we found the levels outside which 2.5% of the shuffled values lay. As we wished to correct for multiple comparisons, we divided alpha by the number of tests (276; number of cross-correlograms = 23 time series × 22 time series divided by two, as
the cross-correlograms are simply mirror-reversed when the order is changed, plus 23 autocorrelograms).
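A permutation test of this kind can be sketched in MATLAB as follows (x and y are assumed names for two already-Z-scored minute-by-minute series; shuffling one series per iteration is one simple way to build the null distribution):

% x, y : nMinutes x 1 Z-scored time series (e.g., [DA] and reward rate)
nShuffles = 200000;
nullR = zeros(nShuffles, 1);
for k = 1:nShuffles
    xs = x(randperm(numel(x)));                 % shuffle one series
    nullR(k) = corr(xs(:), y(:));               % zero-lag correlation under the null
end
alpha = 0.05 / 276;                             % corrected for 276 comparisons, as above
lo = quantile(nullR, alpha/2);                  % two-tailed thresholds from the null tails
hi = quantile(nullR, 1 - alpha/2);
observedR = corr(x(:), y(:));
isSignificant = observedR < lo || observedR > hi;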
Voltammetry. FSCV electrode construction, data acquisition and analysis were performed as described (ref. 63). Rats were implanted with a guide cannula above the right NAcc (+1.3–2.0 mm AP, 1.5 mm ML from bregma), a Ag/AgCl reference electrode (in the contralateral hemisphere) and a bipolar stimulation electrode aimed at the VTA (−5.2 mm AP, 0.8 mm ML, 7.5 mm DV). Carbon fiber electrodes were lowered acutely into the NAcc. Dopaminergic current was quantified offline by principal component regression (PCR; ref. 24) using training data for dopamine and pH from electrical stimulations. Recording time points that exceeded the PCR residual analysis threshold (Qα) were omitted from further processing or analysis. Current to [DA] conversion was based on in vitro calibrations of electrodes constructed in the same manner with the same exposed fiber length. On many days data were not recorded owing to electrode breakage or obvious movement-related electrical noise. FSCV recordings were made from 41 sessions (14 rats total). We excluded those sessions for which the rat failed to complete at least three blocks of trials, and those in which electrical artifacts caused >10% of trials to violate the assumptions of PCR residual analysis. The remaining ten sessions came from six different rats. To avoid aggregate results being overly skewed by a single animal, we only included one session from each of the six rats (the session with the largest reward-evoked [DA] increase). Upon completion of FSCV testing, animals were deeply anesthetized and electrolytic lesions were created (40 µA for 15 s at the same depth as the recording site) using stainless steel electrodes with 500 µm of exposed tip (AM Systems). Lesion locations were later reconstructed in Nissl-stained sections.
For between-session comparisons we normalized [DA] to the average [DA] difference between the pre-trial baseline and the Food-Port-In-aligned peak levels. To visualize the reward-history dependence of [DA] change between consecutive trials (Fig. 5h), we first extracted time series of normalized [DA] from consecutive pairs of rewarded trials (Side-In event to subsequent Side-In event, separated by less than 30 s). For each session we divided these traces into 'low-reward-rate' and 'high-reward-rate' groups, using the number of rewarded trials in the last 10 that best approximated a median split (so that low- and high-reward-rate groups had similar trial numbers). We then averaged all low-reward-rate traces, and separately all high-reward-rate traces.
Reinforcement learning model. To estimate the time-varying state value and RPE in each trial, we used a Semi-Markov Decision Process (ref. 64) with temporal difference learning, implemented in MATLAB. The model consisted of a set of states, with rat behavioral events determining the times of transitions between states (Supplementary Fig. 5). Each state was associated with a stored ('cached') value of entering that state, V(s). At each state transition a reward prediction error δ was calculated using

\delta_t = r_t + V(s_t) - \gamma^{-n} V(s_{t-n})

where n is the number of time steps since the last state transition (a time step of 50 ms was used throughout), r is defined as one at reward receipt and zero otherwise, and γ specifies the rate at which future rewards are discounted at each time step (γ < 1). The V terms in the equation compare the cached value of the new state to the value predicted, given the prior state value and the elapsed time since the last transition (as illustrated in Fig. 4c). Each state also had an eligibility trace, e(s), that decayed with the same time parameter γ (following the terminology of ref. 21, this is a TD(1) model with replacing traces). RPEs updated the values of the states encountered up to that point, using

V(s) \leftarrow V(s) + \alpha \, \delta_t \, e(s)

where α is the learning rate. V and γ were defined only at state transitions, and V was constrained to be non-negative. The model was 'episodic', as all eligibilities were reset to zero at trial outcome (reward receipt, or omission). V is therefore an estimate of the time-discounted value of the next reward, rather than of total aggregate future reward; with exponential discounting and best-fit parameters, subsequent time-discounted rewards are negligible (but this would not necessarily be the case if hyperbolic discounting were used).
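To make the update rule concrete, the following MATLAB sketch applies the temporal-difference update at each state transition of a single trial. The variable names, the specific γ and α values, and the ordering of trace decay and update are illustrative assumptions; the full model (state definitions, parameter fitting) is simplified here:

% stateSeq     : indices of the states entered during one trial, in order
% stepsBetween : number of 50-ms time steps since the previous transition (same length)
% rewardAt     : 1 at the transition where reward is delivered, 0 elsewhere
% V            : nStates x 1 vector of cached state values (persists across trials)
gamma  = 0.98;                        % per-time-step discount factor (assumed, within 0.9-1)
alphaV = 0.2;                         % learning rate (assumed value)
e = zeros(size(V));                   % eligibility traces; 'episodic', so reset every trial
e(stateSeq(1)) = 1;                   % replacing trace for the first state entered
for k = 2:numel(stateSeq)
    s = stateSeq(k);  n = stepsBetween(k);  r = rewardAt(k);
    sPrev = stateSeq(k - 1);
    e = e * gamma^n;                  % traces decay with the elapsed time since the last transition
    % RPE: cached value of the new state versus the value predicted from the
    % prior state's value and the elapsed time (the formulation given above)
    delta = r + V(s) - V(sPrev) / gamma^n;
    V = max(V + alphaV * delta * e, 0);   % update all eligible states; keep V non-negative
    e(s) = 1;                             % replacing trace for the newly entered state
end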
We also examined the effect of calculating prediction errors slightly differently:

\delta_t = r_t + \gamma^{n} V(s_t) - V(s_{t-n})

This version compares a discounted version of the new state value to the previous state value. As expected, the results were the same. Specifically, the overall [DA] correlation to V remained ~0.4, the overall δ correlation was ~0.2, and in each individual session [DA] was significantly better correlated to V than to δ, across the full parameter space.

We present results using γ in the 0.9 to 1 range, because 0.9 is already a very fast exponential discount rate when using 50-ms time steps. However, we also tested smaller γ (0.05–0.9) and confirmed that the [DA]:δ correlation only diminished in this lower range (data not shown).
To compare within-trial [DA] changes to model variables, we identified all epochs of time (3 s before to 3 s after Center-In) with at least six state transitions (this encompasses both rewarded and unrewarded trials). Since the model can change state value instantaneously, but our FSCV signal cannot (ref. 65), we included an offset lag (so we actually compared V and δ to [DA] a few measurements later). The size of the lag affected the magnitude of the observed correlations (Fig. 4f), but not the basic result. Results were also unchanged if (instead of a lag) we convolved model variables with a kernel consisting of an exponential rise and fall (Supplementary Fig. 6), demonstrating that our results are not a simple artifact of time delays associated with the FSCV method or of sluggish reuptake. Finally, we also tried using the SMDP model with hyperbolic (instead of exponential) discounting (refs 66–69), and again found a consistently stronger correlation between [DA] and V than between [DA] and δ (data not shown).
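As an illustration of the kernel alternative, the sketch below convolves a model variable (sampled at the 50-ms model time step) with an exponential rise-and-fall kernel; the kernel time constants and duration used here are illustrative assumptions, not the values used in the study:

% Vtimeseries : assumed vector of model state values sampled every 50 ms
dt      = 0.05;                                   % model time step (s)
tKernel = 0:dt:3;                                 % kernel support (assumed 3-s duration)
tauRise = 0.3;  tauFall = 0.9;                    % illustrative time constants (s)
kernel  = (1 - exp(-tKernel / tauRise)) .* exp(-tKernel / tauFall);
kernel  = kernel / sum(kernel);                   % unit area, preserving overall scale
Vconv   = conv(Vtimeseries(:), kernel(:));        % causal convolution with the kernel
Vconv   = Vconv(1:numel(Vtimeseries));            % trim back to the original length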
Code availability: custom MATLAB code for the SMDP model is available
on request.
Optogenetics. We used three groups of rats to assess the behavioral effects of VTA DA cell manipulations (first Th-Cre+ with AAV-EF1α-DIO-ChR2-EYFP virus, then littermate Th-Cre− with the same virus, then Th-Cre+ with AAV-EF1α-DIO-eNpHR3.0-EYFP). All virus was produced at the University of North Carolina vector core. In each case rats received bilateral viral injections (0.5 or 1 µl per hemisphere at 50 nl min⁻¹) into the VTA (same coordinates as above). After 3 weeks, we placed bilateral optic fibers (200-µm diameter) under ketamine/xylazine anesthesia with FSCV guidance, at an angle from the sagittal plane, stopping at the location that yielded the most laser-evoked [DA] release in NAc. Once the fibers were cemented in place, we used FSCV to test multiple sets of stimulation parameters from a 445-nm blue laser diode (Casio) with an Arroyo Instruments driver under LabView control. The parameters chosen for behavioral experiments (0.5-s train of 10-ms pulses at 30 Hz, 20 mW power at the fiber tip) typically produced [DA] increases in Th-Cre+/ChR2 rats comparable to those seen with unexpected reward delivery. All rats were allowed to recover from surgery and retrained to pre-surgery performance. Combined behavioral/optogenetic experiments began 5 weeks after virus injection. On alternate days, sessions either included bilateral laser stimulation (on a randomly selected 30% of trials, regardless of block or outcome), or not. In this manner, each rat received three sessions of Light-On stimulation and three sessions of Side-In stimulation, interleaved with control (no laser) sessions, over a 2-week period. Halorhodopsin rats were tested with 1 s of constant 20-mW illumination from a 589-nm (yellow/orange) laser (OEM Systems), starting either at Light-On or Side-In as above. One Th-Cre+/ChR2 rat was excluded from analyses owing to misplaced virus (no viral expression directly below the optic fiber tips).
For statistical analysis of optogenetic effects on behavior we used repeated-measures ANOVA models in SPSS. For each rat we first averaged data across the three sessions with the same optogenetic conditions. Then, to assess reinforcing effects, we examined the two factors of LASER (off versus on) and REWARD (rewarded versus omission), with the dependent measure being the probability that the same action was repeated on the next trial. For assessing effects on median latency we examined the two factors of LASER (off versus on) and ENGAGED (yes versus no). For assessing group-dependent effects on hazard rate we examined the factors of LASER (off versus on) and GROUP (Th-Cre+/ChR2; Th-Cre−/ChR2; Th-Cre+/eNpHR3.0), with the dependent measure being the average hazard rate during the epoch 1–2.5 s after Light-On. This epoch was chosen because it is 1–2 s after the laser stimulation period (0–0.5 s) and approach behaviors have a consistent duration of ~1–2 s (Supplementary Fig. 9). Post hoc tests were Bonferroni-corrected for multiple comparisons.
A Supplementary Methods Checklist is available.
51. Witten, I.B. et al. Recombinase-driver rat lines: tools, techniques, and optogenetic
application to dopamine-mediated reinforcement. Neuron 72, 721–733 (2011).
52. Gage, G.J., Stoetzner, C.R., Wiltschko, A.B. & Berke, J.D. Selective activation of
striatal fast-spiking interneurons during choice execution. Neuron 67, 466–479
(2010).
53. Leventhal, D.K. et al. Basal ganglia beta oscillations accompany cue utilization.
Neuron 73, 523–536 (2012).
54. Schmidt, R., Leventhal, D.K., Mallet, N., Chen, F. & Berke, J.D. Canceling actions
involves a race between basal ganglia pathways. Nat. Neurosci. 16, 1118–1124
(2013).
55. Lau, B. & Glimcher, P.W. Dynamic response-by-response models of matching
behavior in rhesus monkeys. J. Exp. Anal. Behav. 84, 555–579 (2005).
56. Simen, P., Cohen, J.D. & Holmes, P. Rapid decision threshold modulation by reward
rate in a neural network. Neural Netw. 19, 1013–1026 (2006).
57. Daw, N.D., Kakade, S. & Dayan, P. Opponent interactions between serotonin and
dopamine. Neural Netw. 15, 603–616 (2002).
58. Sugrue, L.P., Corrado, G.S. & Newsome, W.T. Matching behavior and the
representation of value in the parietal cortex. Science 304, 1782–1787 (2004).
59. Humphries, M.D., Khamassi, M. & Gurney, K. Dopaminergic control of the exploration-
exploitation trade-off via the basal ganglia. Front Neurosci 6, 9 (2012).
60. Beeler, J.A., Frazier, C.R. & Zhuang, X. Putting desire on a budget: dopamine and
energy expenditure, reconciling reward and resources. Front Integr Neurosci 6, 49
(2012).
61. Song, P., Mabrouk, O.S., Hershey, N.D. & Kennedy, R.T. In vivo neurochemical
monitoring using benzoyl chloride derivatization and liquid chromatography–mass
spectrometry. Anal. Chem. 84, 412–419 (2012).
62. Song, P., Hershey, N.D., Mabrouk, O.S., Slaney, T.R. & Kennedy, R.T. Mass
spectrometry “sensor” for in vivo acetylcholine monitoring. Anal. Chem. 84,
4659–4664 (2012).
63. Aragona, B.J. et al. Regional specificity in the real-time development of phasic
dopamine transmission patterns during acquisition of a cue–cocaine association in
rats. Eur. J. Neurosci. 30, 1889–1899 (2009).
64. Daw, N.D., Courville, A.C. & Touretzky, D.S. Representation and timing in theories
of the dopamine system. Neural Comput. 18, 1637–1677 (2006).
65. Kile, B.M. et al. Optimizing the temporal resolution of fast-scan cyclic voltammetry.
ACS Chem. Neurosci. 3, 285–292 (2012).
66. Mazur, J.E. Tests of an equivalence rule for fixed and variable reinforcer delays.
J. Exp. Psychol. Anim. Behav. Process. 10, 426 (1984).
67. Ainslie, G. Précis of breakdown of will. Behav. Brain Sci. 28, 635–650 (2005).
68. Kobayashi, S. & Schultz, W. Influence of reward delays on responses of dopamine
neurons. J. Neurosci. 28, 7837–7846 (2008).
69. Kacelnik, A. Normative and descriptive models of decision making: time discounting
and risk sensitivity. Ciba Found. Symp. 208, 51–67 (1997).