
Nicotinic receptors in the ventral tegmental area promote uncertainty-seeking

Authors: Jérémie Naudé, Stefania Tolu, Malou Dongelmans, Nicolas Torquet, Sébastien Valverde, Guillaume Rodriguez, Stéphanie Pons, Uwe Maskos, Alexandre Mourot, Fabio Marti & Philippe Faure

Abstract

Cholinergic neurotransmission affects decision-making, notably through the modulation of perceptual processing in the cortex. In addition, acetylcholine acts on value-based decisions through as yet unknown mechanisms. We found that nicotinic acetylcholine receptors (nAChRs) expressed in the ventral tegmental area (VTA) are involved in the translation of expected uncertainty into motivational value. We developed a multi-armed bandit task for mice with three locations, each associated with a different reward probability. We found that mice lacking the nAChR β2 subunit showed less uncertainty-seeking than their wild-type counterparts. Using model-based analysis, we found that reward uncertainty motivated wild-type mice, but not mice lacking the nAChR β2 subunit. Selective re-expression of the β2 subunit in the VTA was sufficient to restore spontaneous bursting activity in dopamine neurons and uncertainty-seeking. Our results reveal an unanticipated role for subcortical nAChRs in motivation induced by expected uncertainty and provide a parsimonious account for a wealth of behaviors related to nAChRs in the VTA expressing the β2 subunit.
© 2016 Nature America, Inc. All rights reserved.
Nature Neuroscience, advance online publication (Articles).
Acetylcholine (ACh) has a well-studied role in arousal, learning
and attention1,2 and modulates perceptual decision-making, notably
through its influence over prefrontal cortices3. Decisions are not
only driven by sensory information, but also by the animal’s expec-
tation of the values associated with alternative choices4,5. ACh also
affects cost-benefit decision-making6,7, albeit through unknown
mechanisms. Notably, effects on value-based decisions induced by
pharmacological manipulations of ACh or dopamine (DA) often
mirror each other5. Systemic pharmacological manipulation of
either DA or ACh receptors affects the choices between alterna-
tives associated with different delays, costs or risk5–7. Disentangling
the respective implications of ACh and DA in decision-making is of
utmost interest, as psychological diseases such as tobacco addiction
or schizophrenia involve alterations of both decision-making and
ACh-DA interactions2,8.
In contrast to ACh, DA exerts a well-defined role in motivation
and reinforcement9. DA neurons encode reward prediction
errors as bursts of action potentials9,10. These bursts may be used
as a teaching signal to learn the value of actions11 or as an incen-
tive signal biasing the ongoing decisions12. The bursting activity
of DA neurons from the VTA is influenced by ACh, notably
through nicotinic acetylcholine receptors containing the β2 subunit
(β2*-nAChRs)2,13–15. Thus, the similarity between the effects of DA
and ACh on decision-making may arise from a nicotinic regula-
tion of the VTA. We hypothesized that endogenous ACh, released
from mesopontine nuclei to the VTA2,5,15, may be involved in
value-based decisions.
In the context of decision-making, the concept of exploration
is opposed to that of exploitation with regard to a known reward
source16,17. Exploration occurs when an animal actively gathers
information about alternative choices with the aim of reducing
the uncertainty level on the consequences of possible actions18–21.
This typically happens in a learning setting when the statistics
of an outcome given a specific action, or its ‘uncertainty’, are
in the process of being estimated. Once the consequences of
possible actions have been estimated, the animal can use this
knowledge of the environment to exploit reward sources efficiently.
However, when the outcome of an action is probabilistic, uncer-
tainty remains as to what will be the outcome of an action every
time it is performed. This known variability of the outcome of an
action, as in a repeated lottery, is referred to as expected uncertainty
or reward risk22,23.
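For an all-or-none outcome delivered with probability p, expected uncertainty can be quantified as the Bernoulli variance p(1 − p), which persists even once p is perfectly known. A minimal illustration (the simulation is ours, for exposition only):

```python
import random

def expected_uncertainty(p):
    """Variance of an all-or-none (Bernoulli) reward delivered with probability p:
    zero for fully predictable outcomes (p = 0 or 1), maximal at p = 0.5."""
    return p * (1 - p)

# Empirical check: even with the statistics of the 'lottery' fully known,
# repeated 50% outcomes keep an irreducible outcome variance.
rng = random.Random(0)
outcomes = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(100_000)]
mean = sum(outcomes) / len(outcomes)
var = sum((o - mean) ** 2 for o in outcomes) / len(outcomes)
# mean ≈ 0.5, var ≈ expected_uncertainty(0.5) = 0.25
```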
The motivation to perform an action can be modulated by expected
uncertainty and lead to uncertainty-seeking or risk-taking. In ani-
mals, it is challenging to distinguish between a motivation to explore
or exploit a probabilistic reward source, as it cannot be easily inferred
whether animals might still try to reduce expected uncertainty by
exploring19,24,25 or whether they are attracted by this ‘known-unknown’
and thus exploit. Nevertheless, the influence of expected uncertainty on
motivational value is experimentally tractable. It has been proposed that
expected uncertainty may be signaled by ACh22 in the context of per-
ceptual decision-making, but this theory has never been connected to
an involvement of DA in value-based decision-making. Moreover, the
neural basis underlying the motivation given to choices associated with
expected uncertainty is not known. We computationally characterized
the influence of VTA β2*-nAChRs on seeking probabilistic rewards in
a multi-armed bandit task for mice and found that these receptors are
involved in translating expected uncertainty into motivational value.
1Sorbonne Universités, UPMC University Paris 06, Institut de Biologie Paris Seine, UM 119, Paris, France. 2CNRS, UMR 8246, Neuroscience Paris Seine, Paris,
France. 3INSERM, U1130, Neuroscience Paris Seine, Paris, France. 4Institut Pasteur, CNRS UMR 3571, Unité NISC, Paris, France. Correspondence should be
addressed to P.F. (phfaure@gmail.com).
Received 9 October 2015; accepted 9 December 2015; published online 18 January 2016; doi:10.1038/nn.4223
Jérémie Naudé1–3, Stefania Tolu1–3, Malou Dongelmans1–3, Nicolas Torquet1–3, Sébastien Valverde1–3,
Guillaume Rodriguez1–3, Stéphanie Pons4, Uwe Maskos4, Alexandre Mourot1–3, Fabio Marti1–3 & Philippe Faure1–3
RESULTS
Mice-adapted multi-armed bandit task based on ICSS
In uncertain environments, living beings have to decide when
to exploit known resources and when to explore alternatives.
This exploitation-exploration dilemma is often studied in the
multi-armed bandit task16,18, in which humans choose between
different slot machines to discover the richest option. To assess the
implication of nAChRs in decision-making under uncertainty, we
designed a spatial version of the bandit task adapted to mice. Studies
of animal choices often rely on food restriction, even though the
satiation level is known to affect decisions under uncertainty26.
To circumvent this issue, we trained mice to perform a sequence
of choices in an open-field in which three locations were explicitly
associated with intra-cranial self-stimulation (ICSS) rewards27,28
(Fig. 1a and Online Methods). Mice could not receive two con-
secutive ICSS at the same location. Consequently, they alternated
between rewarding locations by performing a sequence of choices.
Mice mostly went directly to the next rewarding location, but some-
times wandered around in the open field before reaching the goal
(Fig. 1b). At each location, mice had to choose which next reward-
ing location to go to (amongst the two alternatives) and how directly
they should get there.
We compared the behavior of wild-type (WT) mice under two set-
tings of ICSS delivery: a certain setting (CS) in which all locations
were associated with a given ICSS, and an uncertain setting (US), in
which each location was associated with a different probability of ICSS
delivery (Fig. 1a). Although trajectories in the CS were stereotyped,
reward uncertainty induced a markedly different behavioral pattern
in the US (Fig. 1b). The time to goal was identical for the three loca-
tions in the CS (Fig. 1c), but was greater for locations associated with
lower reward probabilities in the US (F(2,18) = 6.8, P = 0.002, one-way
ANOVA; Fig. 1c). More precisely, the reward probability of the goal
affected the traveled distance (F(2,18) = 7.3, P = 0.002; Fig. 1d), but not
the traveling speed (F(2,18) = 0.48, P = 0.62; Fig. 1e) or the dwell times
(Supplementary Fig. 1a). This contrasts with the effect of reward
intensity, which affected the speed profiles between two rewarding
locations (Supplementary Fig. 1b). Thus, in this setup, reward inten-
sity affected the invigoration of goal-directed movements, whereas
reward probability affected the extent of locomotion, reflecting the
tendency to explore the open field between two visits.
In addition, mice distributed their choices of ICSS according to
the reward probability associated with each location. As expected,
in the CS, mice treated each rewarding location the same way
(Fig. 1f). In the US, however, mice visited the locations associated
with higher ICSS probability more often (F(2,18) = 113, P < 0.001,
one-way ANOVA; Fig. 1f). Because mice could not receive two con-
secutive ICSSs, the repartition on the rewarding locations (Fig. 1f)
arose from a sequence of binary choices in three gambles (G1, G2,
G3) between two respective payoffs (here, G1 = {100 versus 50%},
G2 = {50 versus 25%}, G3 = {100 versus 25%}; Fig. 2a,b). For each
gamble, mice chose the optimal location (associated with the highest
probability of reward; Fig. 2b) more than 50% of the time, but less
than 100% of the time. When they had to choose between a certain
(100%) and an uncertain (50%) ICSS, mice displayed a low prefer-
ence (56%) for the optimal location, suggesting a positive inclination
toward reward uncertainty (Fig. 2a)29,30.
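Because two consecutive ICSSs at the same location are disallowed, the long-run repartition across locations is the stationary distribution of a three-state Markov chain whose transitions are the gamble choice probabilities. A sketch with placeholder transition preferences (only the 56% figure for G1 comes from the text; the other values are hypothetical):

```python
import numpy as np

# States are the rewarding locations A (100%), B (50%), C (25%); no two
# consecutive ICSSs at the same location, so each state leads to one of the
# two others.
states = ["A", "B", "C"]
P = np.array([
    [0.00, 0.60, 0.40],   # at A, gamble G2: B (50%) vs. C (25%) -- hypothetical
    [0.70, 0.00, 0.30],   # at B, gamble G3: A (100%) vs. C (25%) -- hypothetical
    [0.56, 0.44, 0.00],   # at C, gamble G1: A (100%) vs. B (50%), 56% from text
])

# Long-run repartition = stationary distribution of the chain (power iteration)
pi = np.ones(3) / 3
for _ in range(1000):
    pi = pi @ P
print(dict(zip(states, np.round(pi, 3))))  # visits ordered A > B > C
```

With any preferences above chance for the richer option, the stationary distribution reproduces the pattern of Figure 1f: the more probable the ICSS, the more often the location is visited.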
A positive motivational value to expected uncertainty
In standard rodent decision tasks in which there is only a single
choice, the relative influence of expected value and uncertainty on
choices is difficult to dissect, as both parameters vary with reward
probability. For binary outcomes (the choice is rewarded or not),
the expected mean reward corresponds to the reward probability p,
[Figure 1, panels a–f: task schematic with three rewarding locations (P(ICSS) = 100%, 50% and 25%); example trajectories (naive, CS and US); time to goal, traveled distance and instantaneous speed as a function of ICSS probability; choice repartition (n = 19 mice). See the caption below.]
Figure 1 Decisions under uncertainty in a mouse bandit task using
intracranial self-stimulations. (a) Illustration of the spatial multi-armed
bandit task design. Three explicit square locations were placed in the
open field (0.8-m diameter), forming an equilateral triangle (50-cm
side). Mice received an intracranial self-stimulation each time they were
detected in the area of one of the rewarding locations. Animals, which
could not receive two consecutive stimulations at the same location,
alternated between rewarding locations. (b) Trajectories of one mouse
(5 min) before (left) and after (middle) learning in the CS and US (right).
(c) Time to goal (average duration from the last location to the goal) in the
US as a function of the reward probability of the goal. Inset, times to goal
were identical for the three locations in the CS (F(2,18) = 0.53, P = 0.59,
one-way ANOVA). Inset, individual curves. N = 19 mice. (d) Traveled
distance between two consecutive locations. In the US, WT mice traveled
more distance when going toward less probable ICSS reward. Light gray,
individual curves. (e) Instantaneous speed: in the US, the maximal speed
of WT mice did not depend on the expected probability of the reward,
contrary to what was observed in the DS with increasing intensity. Data
are presented as mean ± s.e.m. Time 0 corresponds to the last time of
ICSS delivery or omission. (f) Proportion of choices of the three rewarding
locations as a function of reward probability in the US. Light gray,
individual curves. Inset, proportion of choices were identical for the
three locations in the CS (F(2,18) = 0.16, P = 0.86, one-way ANOVA).
Error bars represent mean ± s.e.m. *P < 0.05, **P < 0.01, ***P < 0.001.
n.s., not significant at P > 0.05.
whereas expected uncertainty is related to reward variance, p(1 – p)
(Fig. 2a). Expected uncertainty is zero for predictable outcomes
(100% or 0% probability) and maximal at 50% probability (the most
unpredictable outcome). In our setup, the difference in expected
uncertainty and value between the outcomes was distinct for each
of the three gambles (Fig. 2a), which provides enough constraints
to differentiate between the influence of two co-varying parameters
(reward mean and variance). We compared computational models of
decision-making16,31 (Online Methods), each representing a different
influence of expected reward and uncertainty on choices, to assess
which model best explained the experimental data (Supplementary
Fig. 2). In the epsilon-greedy model, animals choose the option with the
highest expected reward, except on a fixed proportion of trials on which
they choose at random. In this model, the choices for the
optimal reward are identical whatever the gamble is (Fig. 2b), which
did not correspond to the experimental data. In the softmax model,
choices depend on the difference between the expected rewards of
the two alternatives. The softmax model formalizes that the larger the
difference in rewards is, the higher the probability to select the best
option will be. This model predicts that the proportions of optimal
choices would be sorted in the following order {G2 < G1 < G3}, dif-
fering from what was found experimentally {G1 < G2 < G3}. Finally,
in the uncertainty model, decision is biased toward actions with the
most uncertain consequences by assigning a bonus value32 to their
expected uncertainties19,21,24,29. This last model accurately reproduced
the pattern of mice preferences (Fig. 2b) and best accounted for our
experimental data (Supplementary Fig. 2), as shown by model com-
parison (likelihood penalized for the number of parameters, Bayesian
Information Criterion (BIC); Online Methods). Furthermore, the two
parameters of the ‘uncertainty bonus’ model disentangle two deter-
minants of decision-making: the inverse temperature parameter β
depicts the randomness in choices, whereas the uncertainty-seeking
parameter ϕ represents the value given to expected uncertainty. The
positive uncertainty bonus (ϕ = 1.01 ± 0.24, mean ± s.e.m.) explains
the great attractiveness of the 50% choice in G1 by a powerful motiva-
tion induced by its expected uncertainty. We assessed the robustness
of the data and of the model by fitting four sets of probabilities with
different combinations of expected-reward and expected-uncertainty differences
(Supplementary Fig. 3), and compared alternative models (match-
ing law33 and uncertainty-normalized temperature34; Supplementary
Fig. 2). Overall, we found that expected uncertainty positively biased
the choices in WT mice.
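The three candidate decision rules and their qualitative predictions can be sketched as follows (a minimal illustration; the β, ϕ and ε values are arbitrary placeholders, not the fitted parameters):

```python
import math

GAMBLES = {"G1": (1.0, 0.5), "G2": (0.5, 0.25), "G3": (1.0, 0.25)}

def uncertainty(p):
    return p * (1 - p)  # expected uncertainty (Bernoulli variance)

def softmax_pair(v_opt, v_alt, beta):
    """P(choose the optimal option) under a softmax over two values."""
    return 1.0 / (1.0 + math.exp(-beta * (v_opt - v_alt)))

def p_optimal(p_opt, p_alt, model, beta=2.0, phi=1.0, eps=0.2):
    if model == "eps-greedy":   # best option, minus a fixed lapse probability
        return 1.0 - eps / 2.0
    if model == "softmax":      # value = expected reward only
        return softmax_pair(p_opt, p_alt, beta)
    if model == "uncertainty":  # value = expected reward + phi * uncertainty bonus
        return softmax_pair(p_opt + phi * uncertainty(p_opt),
                            p_alt + phi * uncertainty(p_alt), beta)
    raise ValueError(model)

for model in ("eps-greedy", "softmax", "uncertainty"):
    preds = {g: p_optimal(po, pa, model) for g, (po, pa) in GAMBLES.items()}
    print(model, sorted(preds, key=preds.get))
# eps-greedy predicts identical choices in all gambles; softmax predicts
# G2 < G1 < G3; only the uncertainty model (phi > 0) reproduces G1 < G2 < G3.
```

The dissociation works because the differences in expected reward (0.5, 0.25 and 0.75 in G1, G2 and G3) and in expected uncertainty (−0.25, 0.0625 and −0.1875) point in different directions across the three gambles.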
As stated above, two types of decisions are nested in the task: the
sequence of choices (“which goal?”) and the locomotion (“how to reach
the goal?”). To investigate the influence of uncertainty on the latter,
we performed multiple linear regressions of time to goal. Comparison
of linear models (BIC; Online Methods and Supplementary Fig. 2)
revealed that the time to goal depended on the reward probability of
the goal, but not on the alternative (the location not chosen in the
gamble). These observations suggest a dual-stage process in which
animals first choose which location to go to and then how to reach it.
Furthermore, the dependence on reward history (TR = 0.49 ± 0.21,
mean ± s.e.m.) suggests that when mice had just been rewarded, they
traveled further in the open field (Fig. 2c). We also found that the
time to goal was decreased by the expected reward (TE = −1.63 ± 0.16;
Fig. 2c) and by the expected uncertainty (Tϕ = −1.56 ± 0.33). This
suggests that expected uncertainty increased motivation to go straight
toward the rewarding goal. Thus, model-based analyses suggest that,
in the two decision problems (“which location” and “how to get
there”), mice assign a positive motivational value (ϕ and Tϕ) to the
expected uncertainty of the goal.
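The locomotion model described here is a linear regression of the form TTG = T0 + TR·R(t−1) + TE·p + Tϕ·p(1 − p). A sketch using the WT coefficients reported in the text (the intercept T0 is not reported, so the value below is a hypothetical placeholder):

```python
# TR, TE and TPHI are the WT regression coefficients reported in the text;
# T0 is a hypothetical intercept, chosen only for illustration.
T0, TR, TE, TPHI = 4.0, 0.49, -1.63, -1.56

def time_to_goal(p_goal, rewarded_prev):
    """Predicted time to goal (s): decreases with the expected reward and the
    expected uncertainty of the goal; increases after a rewarded trial."""
    return (T0
            + TR * rewarded_prev                 # reward history (0 or 1)
            + TE * p_goal                        # expected reward
            + TPHI * p_goal * (1 - p_goal))      # expected uncertainty

# Mice head more directly toward more probable (and more uncertain) rewards
print(time_to_goal(1.0, 1), time_to_goal(0.5, 1), time_to_goal(0.25, 1))
```

With both TE and Tϕ negative, the model reproduces the pattern of Figure 1c: time to goal grows as the reward probability of the goal decreases.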
[Figure 2, panels a–c: transition model of choices, P(At|Ct−1) = f(pA, pB); time-to-goal model, TTG(A) = f(Rt−1, pA); expected reward and expected uncertainty as a function of reward probability; observed versus predicted transitions for gambles G1 (100% vs. 50%), G2 (50% vs. 25%) and G3 (100% vs. 25%) under the ε-greedy, softmax and uncertainty models; time-to-goal data, model fits and regression coefficients (T0, TR, TE, Tϕ). See the caption below.]
Figure 2 Model-based analysis of decisions
shows motivation for expected uncertainty.
(a) Illustration of the modeling of the task.
Top, transition model of animal choices.
Each rewarding location is modeled as a
state, labeled {A,B,C}. The probability
of transition from one state to another
depends on the reward probabilities of
the two available options. Middle, expected
reward and uncertainty as a function of
reward probabilities (curves). In the three
gambles, the differences in expected values
(0.5 in G1, 0.25 in G2, 0.75 in G3) and
expected uncertainties (−0.25 in G1, 0.0625
in G2, −0.1875 in G3) are distinct. Bottom,
model of locomotion. The time to goal
depends on both reward history (whether
the mouse received a reward in the previous
location or not) and reward expectation
at the goal. (b) Left, proportions of exploitative
choices (choice of the most valuable
alternative, that is, with the highest
probability of reward in a given gamble)
of the mice, for the three gambles. Dots,
individual data points. Right, predicted
transition of the three decision models (lines)
corresponding to the experimental data
(dots, same values as in the left panel).
Error bars represent mean ± s.e.m. (c) Left,
time to goal (experimental data and model fit,
mean ± s.e.m.) as a function of reward
probability of the goal and reward history.
Data merged from experiments with different sets of reward probability. Right, regression coefficients from the best-fitting model of locomotion,
corresponding to a constant (T0) and the dependencies on reward history (TR), expected reward (TE) and expected uncertainty (Tϕ).
VTA β2*-nAChRs are involved in motivation by uncertainty
In the ICSS bandit task, WT mice displayed a robust preference for
uncertain outcomes. Thus, mice estimate expected uncertainty to
direct their decisions and locomotion29,30. The suggested role of
ACh in signaling expected uncertainty22 prompted us to investigate
whether nAChRs are involved in uncertainty-driven motivation.
We used mice in which the β2 subunit, the most abundant nicotinic
subunit in the brain1,2, was deleted (β2KO mice) in our ICSS bandit
task. In the CS, β2KO mice (β2KO and β2GFP (β2KO mice injected
with a lentivirus expressing just eGFP); Online Methods) learned the
task and responded to different ICSS current intensities similarly to
WT mice (Supplementary Fig. 4), confirming the modest implication
of nAChRs in decision-making with certain rewards35,36. In contrast,
in the US (Fig. 3a), β2KO mice systematically chose the location
associated with the highest uncertainty level (that is, 50% probability)
to a lower extent than WT mice (T(28) = −5.4, P < 0.001, unpaired
t test; Fig. 3b). Furthermore, the relationship between time to goal
and reward probability of the goal (F(2,10) = 0.33, P = 0.72, one-way
ANOVA; Fig. 3c and Supplementary Fig. 4) was abolished in β2KO
animals. These results suggest a role for β2*-nAChRs in decision-
making under uncertainty.
We next tested whether β2*-nAChRs could affect motivation
by expected uncertainty by acting on VTA DA neurons, which are
important for value-based decision-making9,12. Extracellular in vivo
single-unit recordings in anesthetized animals (Fig. 3d) confirmed
that, when compared with those of WT mice, DA neurons from
β2KO mice displayed a decreased firing frequency (2.1 versus 3.2 Hz,
T(74) = 2.4, P < 0.001, Welch t test), lacked bursting activity (U = 1,637,
P = 0.002, Mann-Whitney test; Fig. 3e) and did not respond to a sys-
temic injection of nicotine (104.6 ± 1.34% from baseline frequency,
V = 103, P = 0.95, Wilcoxon test; Fig. 3f,g and Supplementary
Fig. 5e,f)13,14. If β2*-nAChRs underlie uncertainty-driven motivation
in the VTA, then restoring expression of these receptors in the VTA of
β2KO mice should restore both the sensitivity to expected uncertainty
and DA activity. We achieved selective re-expression of the β2 subunit
in the VTA of β2KO mice (β2VEC) using a lentiviral vector13 strategy
(Online Methods). Coronal sections revealed that viral re-expression
was restricted to the VTA (Fig. 3h and Supplementary Fig. 5a–d).
DA cells from β2VEC mice displayed a spontaneous firing frequency
(T(156) = 1.6, P = 0.1, unpaired t test) and bursting activity (U = 3288,
P = 0.9, Mann-Whitney test) similar to those observed in WT mice,
and responded to nicotine (120.2 ± 4.78% from baseline frequency,
[Figure 3, panels a–h: behavioral trajectories in the US for β2KO and β2VEC mice; choice repartition and time to goal versus ICSS probability for WT (n = 19), β2KO (n = 11) and β2VEC (n = 12) mice; in vivo juxtacellular recordings of VTA DA neurons; cumulative distribution of the percentage of spikes within bursts (%SWB), with mean frequency and %SWB per genotype; firing-frequency and %SWB responses to saline (S.) and nicotine (N.); coronal VTA sections showing β2-eGFP and TH colocalization. See the caption below.]
Figure 3 β2*-nAChRs in the VTA affect
choices and locomotion. (a) Behavioral
trajectories after learning in the US for β2KO
(red) and β2VEC (blue) mice. (b) Proportion of
choices of the three rewarding locations plotted
as a function of reward probability in the US for
the WT (black), β2KO (red, n = 11) and β2VEC
(blue, n = 12) mice. Insets, individual curves
for the β2KO (top, red) and β2VEC (bottom,
blue) mice. (c) Time to goal (in seconds) as a
function of reward probability of the goal for the
WT (black), β2KO (red) and β2VEC (blue) mice.
Insets, individual curves for the β2KO (top, red)
and β2VEC (bottom, blue) mice. (d) Examples
of in vivo juxtacellular recordings of the firing
pattern of DA neurons from anesthetized WT
(black), β2KO (red) and β2VEC (blue) mice.
(e) Cumulative distribution of percent of spikes
in a burst (%SWB). Insets, mean frequency
(left) and %SWB (right) of VTA DA neurons from
the three genotypes (obtained from 22 WT mice,
13 β2KO mice and 13 β2VEC mice). (f) Typical
electrophysiological recording illustrating the
effect of intravenous injection of nicotine on the
firing pattern of DA neurons in β2KO (red) and
β2VEC (blue) mice. Dots, individual data points.
(g) Relative variation in firing frequency (left)
and absolute variation in %SWB of DA neurons
from the three genotypes (obtained from 14
WT mice, 13 β2KO mice and 13 β2VEC mice) in response to nicotine. Error bars represent mean ± s.e.m. (h) Coronal sections of the VTA showing
the site of lentivirus injection revealed that β2-eGFP colocalized with TH, a dopaminergic marker. Transduction of β2-eGFP virus was efficient in both
dopaminergic and non-dopaminergic cells. Dots, individual data points. *P < 0.05, **P < 0.01, ***P < 0.001. n.s., not significant at P > 0.05.
Table 1 Behavioral measures and model parameters in the uncertain setting

Measure | WT (mean ± s.e.m.) | β2KO (mean ± s.e.m.) | β2VEC (mean ± s.e.m.) | ANOVA | β2VEC versus β2KO | β2VEC versus WT
Repartition at P = 1/2 (Fig. 3b) | 35.6 ± 0.5% | 31.3 ± 0.5% | 34.8 ± 0.7% | F(2,39) = 13.45, P < 0.001 | T(21) = –3.86, P < 0.001 | T(29) = 0.96, P = 0.35
Gamble 1 (Fig. 4a) | 55.6 ± 2.2% | 69.1 ± 3.5% | 54.1 ± 3.7% | F(2,39) = 6.95, P = 0.002 | T(21) = 3.04, P = 0.006 | T(29) = 0.30, P = 0.77
Uncertainty-seeking parameter (Fig. 4b) | 1.01 ± 0.24 | –0.38 ± 0.47 | 1.21 ± 0.23 | F(2,39) = 6.89, P = 0.003 | T(21) = 3.1, P = 0.005 | T(29) = –0.6, P = 0.56
Inverse temperature parameter (Fig. 4b) | 1.57 ± 0.16 | 1.14 ± 0.28 | 1.21 ± 0.16 | F(2,39) = 1.6, P = 0.22
Reward history coefficient (Fig. 4e) | 0.49 ± 0.12 | 0.52 ± 0.11 | 0.26 ± 0.16 | F(2,39) = 1.6, P = 0.22
Reward expectation coefficient (Fig. 4e) | –1.63 ± 0.16 | –0.02 ± 0.17 | –1.21 ± 0.23 | F(2,39) = 18.4, P < 0.001 | T(21) = 4.0, P < 0.001 | T(29) = 1.55, P = 0.13
Uncertainty expectation coefficient (Fig. 4e) | –1.56 ± 0.33 | 0.58 ± 0.37 | –0.88 ± 0.28 | F(2,39) = 9.8, P = 0.001 | T(21) = 3.2, P = 0.005 | T(29) = 1.42, P = 0.17
V = 960, P < 0.001, Wilcoxon test), suggesting that, as previously estab-
lished13,14, physiological functions were also restored. Notably, β2VEC
mice differed from β2KO animals, but not from WT mice (Table 1),
when analyzing the uncertainty-related choices (Fig. 3b) and the times
to goal (Fig. 3c and Supplementary Fig. 4), indicating a restoration of
the WT phenotype following re-expression of β2 in the VTA.
We next used the model-based analysis to characterize the role of
VTA β2*-nAChRs in decision-making. Transition functions of β2KO
and WT mice differed in particular in G1 (100 versus 50%, T(28) = 3.54,
P = 0.001, unpaired t test; Fig. 4a), suggesting an alteration of deci-
sions under uncertainty. Indeed, the behavior of β2KO mice was best
explained (Supplementary Fig. 6) either by the softmax model or
the uncertainty model in which the sensitivity to uncertainty was
null on average (T(11) = −0.8, P = 0.44, t test; Fig. 4b). Both models
point toward the same interpretation: β2*-nAChRs are necessary for
translating uncertainty signals into motivational value. Accordingly,
uncertainty-seeking was significantly different in β2KO and WT mice
(T(29) = 2.9, P = 0.007, unpaired t test). Notably, the model-based
analysis supports the conclusion that β2*-nAChRs selectively
re-expressed in the VTA restored the positive value of expected uncer-
tainty (Table 1 and Supplementary Fig. 7). Moreover, the analysis
of the trajectories in-between goals indicates that neither expected
reward nor expected uncertainty of the next goal influenced the time
to goal in β2KO mice, whereas both parameters affected time to goal
in β2VEC mice (Table 1 and Fig. 4c–e). Mice from each genotype
all traveled more distance when the previous trial was rewarded,
compared to when it was not (F(2,39) = 0.02, P = 0.98). Together
with the transition model, where the temperature parameter β was
not significantly different between genotypes (F(2,39) = 1.6, P = 0.2,
[Figure 4, panels a–e: transitions in the three gambles for WT, β2KO and β2VEC mice; uncertainty-seeking parameter (ϕ) against exploration parameter (β), color-coded by the predicted transition in gamble 1; time to goal versus ICSS probability and reward history for β2KO and β2VEC mice; regression coefficients (T0, TR, TE, Tϕ). See the caption below.]
Figure 4 Model-based analysis reveals
a role for VTA β2-nAChR in uncertainty-driven
motivation. (a) Transition (proportions of
exploitative choices) in the three gambles,
for the WT (black), β2KO (red) and β2VEC
(blue) mice. Dots, individual data points.
(b) Value of the parameters (β and ϕ)
derived from the model-based analysis
(uncertainty model) of the transition
functions for the WT (black), β2KO (red)
and β2VEC (blue) mice. The color code
indicates the predicted transition in
gamble 1 (100 versus 50% reward
probability) as a function of the parameters
of the model. (c,d) Time to goal as a
function of reward probability of the goal
and reward history for β2KO (c) and β2VEC (d)
mice. Experimental data (black dots with
error bars) and model fit (stripes)
are displayed as mean ± s.e.m. Data are
merged from experiments with four sets
of reward probabilities. (e) Regression
coefficients from the best-fitting model
of locomotion, corresponding to a
constant (T0) and the dependencies on reward history (TR), expected reward (TE) and expected uncertainty (Tϕ), for the WT (black), β2KO (red) and
β2VEC (blue) mice. Data are presented as mean ± s.e.m. *P < 0.05, **P < 0.01, ***P < 0.001. n.s., not significant at P > 0.05.
[Figure 5, panels a–g: task design and trajectories across the three sessions for WT and β2KO mice; choice repartition on the three locations per session; proportion of rewarded choices (WT, n = 13; β2KO, n = 11); model fits and uncertainty-seeking parameter. See the caption below.]
Figure 5 β2*-nAChRs affect decision-making under uncertainty in
a dynamical foraging task. (a) Top, illustration of the task design.
During each session, animals receive stimulations in two (of three)
potential locations, with the two rewarding locations (indicated by an
‘R’ in the colored circle) changing between sessions. Bottom, behavioral
trajectories in the three 2-min sessions for the WT (black) and β2KO
(red) mice. (b,c) Repartition (in %) on the three locations (color-coded
as in a) for the three sessions. Calculation is divided by half-session
durations for the WT (b) and β2KO (c) mice. (d) Proportion of rewarded
choices averaged on three sessions for WT (black) and β2KO (red) mice.
Dots, individual data points. (e,f) Model fits of the experimental data
shown in b and c. (g) Uncertainty-seeking parameter (that is, value
given to uncertainty) of the models for the WT (black) and β2KO (red).
Dots, individual data points. Data are presented as mean ± s.e.m.
*P < 0.05, **P < 0.01, ***P < 0.001.
one-way ANOVA; Fig. 4b), the time-to-goal model strongly suggests
that β2*-nAChRs do not affect the global motivation to explore, but
rather specifically affect expected uncertainty on choices (“which
goal”) and locomotion (“how to reach it”).
β2*-nAChRs and uncertainty-seeking in a dynamic environment
Having characterized the role of β2*-nAChRs in motivation by expected
uncertainty at steady state, we next asked whether our results could be
extended to unstable environments. We analyzed the behavior of WT
and β2KO mice during the learning sessions of the CS (Supplementary
Fig. 8a,b), when reward probabilities were not yet known, and modeled
it with reinforcement-learning (RL) models16,29,31,37,38 (Online
Methods and Supplementary Fig. 8c–f). In the standard RL model,
animals learn the expected value of the three rewarding locations using
reward prediction errors (the difference between actual reward and
predicted value)31. In the model, animals use these values to select the
next action using a softmax decision rule. We extended the standard
RL model to uncertainty learning. Animals can use reward prediction
errors to estimate reward uncertainty21,23,24,39: the larger the errors
(positive or negative) are, the more uncertain the outcomes will be.
This uncertainty RL model best explained the behavior of WT mice
(Supplementary Fig. 8c,e). By contrast, the behavior of β2KO mice
was best accounted for by a standard RL model, that is, without uncer-
tainty bonus (Supplementary Fig. 8d,f).
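The uncertainty RL model described above can be sketched in a few lines. This is a minimal illustration rather than the authors' fitted implementation: the function names are ours, and the uncertainty update (a delta rule on squared reward prediction errors) is one standard way to realize the idea that larger errors signal more uncertain outcomes.

```python
import math
import random

def softmax(scores, beta):
    """Softmax decision rule: map scores to choice probabilities
    (beta is the inverse temperature)."""
    exps = [math.exp(beta * s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def uncertainty_rl_step(V, U, choice, reward, alpha):
    """One trial: update expected value V and expected uncertainty U
    of the chosen location with the same learning rate alpha."""
    delta = reward - V[choice]                     # reward prediction error
    V[choice] += alpha * delta                     # standard value update
    U[choice] += alpha * (delta ** 2 - U[choice])  # squared errors track uncertainty
    return V, U

def choose(V, U, phi, beta):
    """Sample the next location from a softmax over the combined
    motivational value V + phi * U (phi is the uncertainty bonus;
    phi = 0 reduces to the standard RL model)."""
    probs = softmax([v + phi * u for v, u in zip(V, U)], beta)
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

With ϕ > 0 such an agent is drawn toward probabilistic locations even when mean values are matched, mimicking the WT pattern; with ϕ = 0 it behaves like the standard RL account of the β2KO mice.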
To further test the importance of β2*-nAChRs for translat-
ing expected uncertainty into motivational value, we compared
the behavior of WT and β2KO mice in a dynamic setting (DS) in
which the locations delivering the ICSS reward changed over time
(Online Methods). In the DS, mice underwent three consecu-
tive sessions in which only two of the three locations delivered the
ICSS. Overall, WT and β2KO mice adapted their strategies to these
changes in reward contingencies (Fig. 5a). Starting from a random
strategy, both WT and β2KO mice learned the position of the two
rewarding locations in the first session (Fig. 5b,c). However, β2KO
mice persevered in their earlier choices throughout the changes in
outcomes (Fig. 5b,c), resulting in slightly fewer rewarded choices
than for WT mice (T(22) = 2.7, P = 0.01, unpaired t test; Fig. 5d).
Model comparison (RL models; Supplementary Fig. 9 and Online
Methods) suggested that an uncertainty bonus model best explained
the behavior of WT mice (Supplementary Fig. 9). This uncertainty
model reproduced the choices of WT animals during the changes in
rewarding outcomes (Fig. 5e), with a positive bonus given to
uncertainty (ϕ = 2.18 ± 0.77). This is consistent with the results in the
CS task (Fig. 4 and Supplementary Fig. 8). Moreover, model comparison
(Supplementary Fig. 9) suggested that the experimental data
were not better explained by indirect effects arising from learning,
that is, asymmetric (different for reward and omission38) or adaptive
(uncertainty-dependent) learning rates37. In summary, our results and
models support the idea that, in WT mice, expected uncertainty exerts
a direct motivational effect. By contrast, the behavior of β2KO mice
could be explained by either the standard RL model or the expected
uncertainty model (Fig. 5f and Supplementary Fig. 8b,d,f). In this
latter model, the uncertainty-seeking parameter in the β2KO mice
was significantly lower than that of WT mice (T(22) = 2.4, P = 0.027,
unpaired t test; Fig. 5g) and not significantly different from zero
(T(22) = −0.6, P = 0.54). These results provide further evidence that
β2*-nAChRs are involved in uncertainty-seeking.
Uncertainty-seeking in other nAChRs-related behaviors
Finally, using computational approaches, we assessed whether the
role of VTA β2*-nAChRs in uncertainty-seeking might pervade other
decisions about natural rewards, punishments and salient aspects of
the environment. Paradoxically, it has been found that mice lacking
the β2 subunit perform seemingly better than WT mice, displaying
‘improved’ spatial learning40 and passive avoidance41. The spatial
learning test consists of a maze with a reachable food reward at one of
the four arms and an unreachable food at the opposite arm (Fig. 6a).
We simulated this task with a RL model embedding uncertainty-
seeking (Fig. 6b and Online Methods). The model fitted the behavior
of both strains, as the time to reach the food was greater for WT
(with an uncertainty bonus) than for β2KO mice (without bonus)
in the early trials40 (Fig. 6c). This slowly decreasing time to reward
progressively emerges in RL models embedding uncertainty-seeking
(Fig. 6d), but cannot be easily explained in terms of differences in
initial value (novelty-seeking31,32), learning rates or a combination
of both (Supplementary Fig. 10). Hence, interest for the unreach-
able reward may arise in WT mice from uncertainty, integrated at
the level of the VTA. We also assessed whether the same explanation
Figure 6 New interpretation of behaviors related to VTA nAChRs using
the uncertainty model. (a) Spatial learning task40. In a cross maze,
the north arm contained a reachable food reward and the south arm
contained an unreachable food. The initial position of the animal was
variable (east or west). (b) Discretized representation of the task. S1–S4
are the four possible states in the task; R = {0,1} indicates whether the
food reward was attained or not. Arrows represent the possible transitions
between the states. Data adapted with permission from ref. 40.
(c) Simulation (stripes) of the time to reach the food (data: lines with error
bars) along the learning sessions for the WT (black) and β2KO (red) mice.
Parameters were α = 0.54, β = 7.75, ϕ = 1.51 for WT mice and
α = 0.59, β = 7.93, ϕ = 0.04 for β2KO mice. (d) Effect of the uncertainty-seeking
parameter in the simulation of the time to reach the food. (e) Passive
avoidance task41 consisting of a single training trial in which the mouse
received a foot shock upon entering the dark chamber. Data
adapted with permission from ref. 41. (f) Simulation (stripes) of retention
latencies (data: lines with error bars) in response to various intensities
of foot shock. Parameters were β = 1.81, ϕ = 0.53, θ = 1.62 for WT and
β = 1.62, ϕ = 0.15, θ = 1.65 for β2KO mice. Error bars represent mean ± s.e.m.
holds in the case of punishment. We simulated the passive avoid-
ance task, where animals were in a box divided in two compartments
(light and dark). β2KO mice avoided the dark compartment, which
was associated with a foot shock, for a longer time than WT mice41
(Fig. 6e). Uncertainty-seeking can also explain this difference, as the
foot shock induces a single negative prediction error, which results in
uncertainty (Fig. 6f). Expected uncertainty may in that case motivate
WT mice, but not β2KO mice, to explore the dark part of the box in
spite of potential negative consequences. Finally, these models can be
extended to neutral, but potentially uncertain, outcomes. The deficits
of β2KO mice in locomotion in an open-field without rewards13,42
can be understood as a lack of uncertainty-seeking (Supplementary
Fig. 11a–e). Exploration in the open-field is composed of action patterns
related to information-seeking (scanning, rearing and sniffing42,43).
The apparent lack of object recognition observed in β2KO
mice40 can also be interpreted as a lack of curiosity for the objects, that
is, an absence of uncertainty-seeking (Supplementary Fig. 11f,g),
rather than a memory deficit. The uncertainty-seeking model not only
generalizes our results to positive, aversive and neutral natural
outcomes, but also provides a parsimonious interpretation for a wealth
of behaviors associated with β2*-nAChRs13,40–42.
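The passive-avoidance account can be reduced to a toy calculation (an illustration only: we omit the latency mapping f(V, ϕσ) and the θ scaling used for the fits in Fig. 6f, and the names are ours). A single shock of intensity I yields a negative prediction error δ = −I, which both lowers the value of the dark compartment and raises its expected uncertainty σ = δ²; a positive uncertainty bonus ϕ then partly offsets the negative value.

```python
def reentry_tendency(I, phi):
    """Net motivation to re-enter the dark compartment after one
    foot shock of intensity I (arbitrary units). delta = -I is the
    single negative prediction error; sigma = delta**2 its induced
    uncertainty; phi the uncertainty bonus."""
    delta = -I
    V = delta           # value after a full update from the shock
    sigma = delta ** 2  # expected uncertainty left by the error
    return V + phi * sigma

# Uncertainty-seeking (phi > 0, WT-like) yields a higher tendency to
# re-enter than phi = 0 (KO-like), hence shorter retention latencies.
assert reentry_tendency(1.0, phi=0.5) > reentry_tendency(1.0, phi=0.0)
```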
DISCUSSION
Our findings reveal a role for VTA β2*-nAChRs in translating
expected uncertainty into motivational value and suggest that this
receptor is involved in exploratory decisions. Three broad theoretical
types of exploration have been proposed. At one extreme, exploration
is seen as randomness or noise in the choices (as in the softmax or
epsilon-greedy models), which is problematic, as rodents, similar to
primates, display curiosity and refined forms of exploration29,42,43. At
the other extreme, a dichotomy has been postulated between subcorti-
cal systems (such as the VTA) and frontal cortices. Frontal cortices
would mediate flexible exploration by overriding16,17 the influence
of exploitive value, underpinned by DA neurons9,11. Our results lie in
between these two extremes and are consistent with theoretical work
on optimal exploration18,20,21 and intrinsic motivation19,24. In this
view, exploration and exploitation are entangled: uncertainty is given
a value that can be compared to and added to the value of primary
rewards18,20,21,32. Our findings further suggest that motivation driven
by expected uncertainty may be sufficient to explain exploration in
unstable environments. This contrasts with neuroeconomics stud-
ies, where expected uncertainty is defined as risk and corresponds
to the exploitation of the irreducible variability of the outcomes,
whereas exploration is driven only by reducible uncertainty23,44.
However, assigning a given choice to exploration or exploitation is
tricky in non-human animals. It relies on phenomenological models
of behavior in the absence of direct reports of decision strategies.
Thus, our data clearly show that VTA β2*-nAChRs affect motivation
from expected uncertainty in both stable and unstable environments,
but whether this corresponds to motivation to explore, exploit or
both remains unclear. Nevertheless, our results are consistent with a
causal role for the VTA in decisions under uncertainty via a common
currency (a motivational metric) that integrates at least the values of
both expected reward and expected uncertainty20,21,25,32.
DA neurons not only encode reward prediction errors9,10, but
also surprise45, risk25 (that is, expected uncertainty) and resolution
of uncertainty46, which are all linked to information. DA neuron
bursting related to reward is thought to constitute a teaching signal
for actions11 or an incentive signal12 biasing the ongoing behavior.
DA activity not related to expected rewards per se could also act as
an ‘intrinsic’ reinforcement signal24 (or an intrinsic incentive) for
which gathering information would be self-satisfactory, helping the
animal to better predict its environment19,45. ACh is closely related
to information processing1,2. We found that the cholinergic control of
DA could underpin the motivational properties of information. This
finding could explain the observed similarities when ACh or DA are
pharmacologically manipulated during value-based decisions5. Several
mechanisms may underlie functional ACh-DA interactions in the
brain. Mesopontine ACh might directly signal expected uncertainty
(σ² in our model), as proposed for forebrain ACh22. Alternatively,
our data suggest a contribution of β2*-nAChRs to the spontaneous
excitability of DA neurons, with anesthetized β2KO animals lacking
bursting of DA neurons14. In this case, cholinergic signaling onto the
VTA via β2*-nAChRs could serve as a permissive gate15, rendering
DA neurons more responsive (that is, affecting ϕ) to uncertainty
signals generated elsewhere in the mesocorticolimbic loop23. A strong
prediction of these interpretations would be that ACh is implicated in
the encoding of expected uncertainty by DA neurons25. Nevertheless,
we cannot totally exclude, with our lentiviral strategy, downstream
adaptations in β2KO mice or an effect at the level of axon terminals,
where β2*-nAChRs also influence the transfer function between DA
firing and release in the striatum47.
Nicotine hijacks endogenous cholinergic signaling by exerting its
reinforcing effects through β2*-nAChRs in the VTA13. But nicotine
also affects decisions that are not related to nicotine intake itself.
Smokers are actually known to display alterations of the exploration-
exploitation tradeoff48 and of risk-sensitivity49. Notably, tobacco
addiction and pathological gambling, which can be seen as exces-
sive uncertainty-seeking25,30, display a high comorbidity8. Thus,
we suggest that β2*-nAChRs in the VTA, in addition to mediating
reinforcement to nicotine, constitute a key neural component in the
alterations of decision-making under uncertainty observed in nico-
tine addicts48,49.
METHODS
Methods and any associated references are available in the online
version of the paper.
Note: Any Supplementary Information and Source Data files are available in the
online version of the paper.
ACKNOWLEDGMENTS
We thank E. Guigon for discussions, C. Prévost-Solié for technical support, and
J.-P. Changeux, E. Ey, G. Dugué and A. Boo for comments on the manuscript.
This work was supported by the Centre National de la Recherche Scientifique
CNRS UMR 8246, the University Pierre et Marie Curie (Programme Emergence
2012 for J.N. and P.F.), the Agence Nationale pour la Recherche (ANR Programme
Blanc 2012 for P.F., ANR JCJC to A.M.), the Neuropole de Recherche Francilien
(NeRF) of Ile de France, the Foundation for Medical Research (FRM, Equipe FRM
DEQ2013326488 to P.F.), the Bettencourt Schueller Foundation (Coup d’Elan
2012 to P.F.), the Ecole des Neurosciences de Paris (ENP) to P.F., the Fondation
pour la Recherche sur le Cerveau (FRC et les rotariens de France, “espoir en
tête” 2012) to P.F. and the Brain & Behavior Research Foundation for a NARSAD
Young Investigator Grant to A.M. The laboratories of P.F. and U.M. are part of the
École des Neurosciences de Paris Ile-de-France RTRA network. P.F. and U.M. are
members of the Laboratory of Excellence, LabEx Bio-Psy, and P.F. is member of the
DHU Pepsy.
AUTHOR CONTRIBUTIONS
J.N. and P.F. designed the study. S.T. and J.N. performed the virus injections. M.D.,
N.T., G.R. and J.N. performed the behavioral experiments. S.V. and F.M. performed
the electrophysiological recordings. S.P. and U.M. provided the genetic tools.
J.N., F.M. and P.F. analyzed the data. J.N., A.M. and P.F. wrote the manuscript.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Reprints and permissions information is available online at http://www.nature.com/
reprints/index.html.
1. Everitt, B.J. & Robbins, T.W. Central cholinergic systems and cognition. Annu. Rev.
Psychol. 48, 649–684 (1997).
2. Dani, J.A. & Bertrand, D. Nicotinic acetylcholine receptors and nicotinic cholinergic
mechanisms of the central nervous system. Annu. Rev. Pharmacol. Toxicol. 47,
699–729 (2007).
3. Guillem, K. et al. Nicotinic acetylcholine receptor β2 subunits in the medial
prefrontal cortex control attention. Science 333, 888–891 (2011).
4. Rangel, A., Camerer, C. & Montague, P.R. A framework for studying the neurobiology
of value-based decision making. Nat. Rev. Neurosci. 9, 545–556 (2008).
5. Fobbs, W.C. & Mizumori, S.J. Cost-benefit decision circuitry: proposed modulatory
role for acetylcholine. Prog. Mol. Biol. Transl. Sci. 122, 233–261 (2014).
6. Kolokotroni, K.Z., Rodgers, R.J. & Harrison, A.A. Acute nicotine increases both
impulsive choice and behavioral disinhibition in rats. Psychopharmacology (Berl.)
217, 455–473 (2011).
7. Mendez, I.A., Gilbert, R.J., Bizon, J.L. & Setlow, B. Effects of acute administration
of nicotinic and muscarinic cholinergic agonists and antagonists on performance
in different cost-benefit decision making tasks in rats. Psychopharmacology (Berl.)
224, 489–499 (2012).
8. McGrath, D.S. & Barrett, S.P. The comorbidity of tobacco smoking and gambling:
a review of the literature. Drug Alcohol Rev. 28, 676–681 (2009).
9. Schultz, W. Multiple dopamine functions at different time courses. Annu. Rev.
Neurosci. 30, 259–288 (2007).
10. Waelti, P., Dickinson, A. & Schultz, W. Dopamine responses comply with basic
assumptions of formal learning theory. Nature 412, 43–48 (2001).
11. Montague, P.R., Dayan, P. & Sejnowski, T.J. A framework for mesencephalic
dopamine systems based on predictive Hebbian learning. J. Neurosci. 16,
1936–1947 (1996).
12. Berridge, K.C. From prediction error to incentive salience: mesolimbic computation
of reward motivation. Eur. J. Neurosci. 35, 1124–1143 (2012).
13. Maskos, U. et al. Nicotine reinforcement and cognition restored by targeted
expression of nicotinic receptors. Nature 436, 103–107 (2005).
14. Mameli-Engvall, M. et al. Hierarchical control of dopamine neuron-firing patterns
by nicotinic receptors. Neuron 50, 911–921 (2006).
15. Grace, A.A., Floresco, S.B., Goto, Y. & Lodge, D.J. Regulation of firing of
dopaminergic neurons and control of goal-directed behaviors. Trends Neurosci. 30,
220–227 (2007).
16. Daw, N.D., O’Doherty, J.P., Dayan, P., Seymour, B. & Dolan, R.J. Cortical substrates
for exploratory decisions in humans. Nature 441, 876–879 (2006).
17. Frank, M.J., Doll, B.B., Oas-Terpstra, J. & Moreno, F. Prefrontal and striatal
dopaminergic genes predict individual differences in exploration and exploitation.
Nat. Neurosci. 12, 1062–1068 (2009).
18. Gittins, J.C. & Jones, D.M. A dynamic allocation index for the discounted multiarmed
bandit problem. Biometrika 66, 561–565 (1979).
19. Scott, P.D. & Markovitch, S. Learning novel domains through curiosity and
conjecture. IJCAI (US) 1, 669–674 (1989).
20. Kaelbling, L.P. Learning in Embedded Systems (MIT Press, 1993).
21. Meuleau, N. & Bourgine, P. Exploration of multi-state environments: Local measures
and back-propagation of uncertainty. Mach. Learn. 35, 117–154 (1999).
22. Yu, A.J. & Dayan, P. Uncertainty, neuromodulation, and attention. Neuron 46,
681–692 (2005).
23. Bach, D.R. & Dolan, R.J. Knowing how much you don’t know: a neural organization
of uncertainty estimates. Nat. Rev. Neurosci. 13, 572–586 (2012).
24. Oudeyer, P.-Y. & Kaplan, F. What is intrinsic motivation? A typology of computational
approaches. Front. Neurorobot. 1, 6 (2007).
25. Fiorillo, C.D., Tobler, P.N. & Schultz, W. Discrete coding of reward probability and
uncertainty by dopamine neurons. Science 299, 1898–1902 (2003).
26. Schuck-Paim, C., Pompilio, L. & Kacelnik, A. State-dependent decisions cause
apparent violations of rationality in animal choice. PLoS Biol. 2, e402 (2004).
27. Carlezon, W.A. Jr. & Chartoff, E.H. Intracranial self-stimulation (ICSS) in rodents
to study the neurobiology of motivation. Nat. Protoc. 2, 2987–2995 (2007).
28. Kobayashi, T., Nishijo, H., Fukuda, M., Bureš, J. & Ono, T. Task-dependent
representations in rat hippocampal place neurons. J. Neurophysiol. 78, 597–613
(1997).
29. Funamizu, A., Ito, M., Doya, K., Kanzaki, R. & Takahashi, H. Uncertainty in action-
value estimation affects both action choice and learning rate of the choice behaviors
of rats. Eur. J. Neurosci. 35, 1180–1189 (2012).
30. Anselme, P., Robinson, M.J.F. & Berridge, K.C. Reward uncertainty enhances incentive
salience attribution as sign-tracking. Behav. Brain Res. 238, 53–61 (2013).
31. Sutton, R.S. & Barto, A.G. Reinforcement Learning: an introduction (MIT Press,
1998).
32. Kakade, S. & Dayan, P. Dopamine: generalization and bonuses. Neural Netw. 15,
549–559 (2002).
33. Herrnstein, R.J. Relative and absolute strength of response as a function of
frequency of reinforcement. J. Exp. Anal. Behav. 4, 267–272 (1961).
34. Ishii, S., Yoshida, W. & Yoshimoto, J. Control of exploitation-exploration meta-
parameter in reinforcement learning. Neural Netw. 15, 665–687 (2002).
35. Yeomans, J. & Baptista, M. Both nicotinic and muscarinic receptors in ventral
tegmental area contribute to brain-stimulation reward. Pharmacol. Biochem. Behav.
57, 915–921 (1997).
36. Serreau, P., Chabout, J., Suarez, S.V., Naudé, J. & Granon, S. Beta2-containing
neuronal nicotinic receptors as major actors in the flexible choice between conflicting
motivations. Behav. Brain Res. 225, 151–159 (2011).
37. Krugel, L.K., Biele, G., Mohr, P.N., Li, S.-C. & Heekeren, H.R. Genetic variation in
dopaminergic neuromodulation influences the ability to rapidly and flexibly adapt
decisions. Proc. Natl. Acad. Sci. USA 106, 17951–17956 (2009).
38. Niv, Y., Edlund, J.A., Dayan, P. & O’Doherty, J.P. Neural prediction errors reveal a
risk-sensitive reinforcement-learning process in the human brain. J. Neurosci. 32,
551–562 (2012).
39. Balasubramani, P.P., Chakravarthy, V.S., Ravindran, B. & Moustafa, A.A. An extended
reinforcement learning model of basal ganglia to understand the contributions of
serotonin and dopamine in risk-based decision making, reward prediction, and
punishment learning. Front. Comput. Neurosci. 8, 47 (2014).
40. Granon, S., Faure, P. & Changeux, J.-P. Executive and social behaviors under nicotinic
receptor regulation. Proc. Natl. Acad. Sci. USA 100, 9596–9601 (2003).
41. Picciotto, M.R. et al. Abnormal avoidance learning in mice lacking functional high-
affinity nicotine receptor in the brain. Nature 374, 65–67 (1995).
42. Maubourguet, N., Lesne, A., Changeux, J.-P., Maskos, U. & Faure, P. Behavioral
sequence analysis reveals a novel role for β2* nicotinic receptors in exploration.
PLoS Comput. Biol. 4, e1000229 (2008).
43. Gordon, G., Fonio, E. & Ahissar, E. Emergent exploration via novelty management.
J. Neurosci. 34, 12646–12661 (2014).
44. Payzan-LeNestour, E. & Bossaerts, P. Risk, unexpected uncertainty and estimation
uncertainty: Bayesian learning in unstable settings. PLoS Comput. Biol. 7,
e1001048 (2011).
45. Redgrave, P. & Gurney, K. The short-latency dopamine signal: a role in discovering
novel actions? Nat. Rev. Neurosci. 7, 967–975 (2006).
46. Bromberg-Martin, E.S. & Hikosaka, O. Midbrain dopamine neurons signal preference
for advance information about upcoming rewards. Neuron 63, 119–126 (2009).
47. Rice, M.E. & Cragg, S.J. Nicotine amplifies reward-related dopamine signals in
striatum. Nat. Neurosci. 7, 583–584 (2004).
48. Addicott, M.A., Pearson, J.M., Wilson, J., Platt, M.L. & McClernon, F.J. Smoking
and the bandit: a preliminary study of smoker and nonsmoker differences in
exploratory behavior measured with a multiarmed bandit task. Exp. Clin.
Psychopharmacol. 21, 66–73 (2013).
49. Galván, A. et al. Greater risk sensitivity of dorsolateral prefrontal cortex in young
smokers than in nonsmokers. Psychopharmacology (Berl.) 229, 345–355 (2013).
doi:10.1038/nn.4223
ONLINE METHODS
Animals. 40 male C57BL/6J (WT) mice and 47 male knockout SOPF HO ACNB2
(β2KO) mice obtained from Charles River Laboratories (France) were used. β2KO
mice were generated as described previously41. WT and β2KO mice are not
littermates and this could be a potential caveat of the study. However, mutant
mice were generated almost 20 years ago, the line has been backcrossed more
than 20 generations with the wild-type C57BL/6J line, and the β2KO line was
confirmed to be more than 99.99% C57BL/6J. Mice arrived at the animal
facility at 8 weeks of age, and were housed individually for at least 2 weeks before
the electrode implantation. Behavioral tasks started 1 week after implantation to
ensure full recovery. Intracranial self-stimulation (ICSS) does not require food
deprivation; as a consequence, all mice had ad libitum access to food and water
except during behavioral sessions. Temperature (20–22 °C) and humidity
were automatically controlled, and a 12/12-h light-dark cycle
(lights on at 8:30 a.m.) was maintained in the animal facility. All experiments
were performed during the light cycle, between 9:00 a.m. and 5:00 p.m.
Experiments were conducted at Université Pierre et Marie Curie. All procedures
were performed in accordance with the recommendations for animal experi-
ments issued by the European Commission directives 219/1990 and 220/1990,
and approved by Université Pierre et Marie Curie.
Stereotaxic injection of lentivirus. The lentiviral expression vectors β2 subunit–
IRES-eGFP cDNAs and the eGFP cDNA (control) are under the control of the
ubiquitous mouse phosphoglycerate kinase (PGK) promoter. Further details
can be found in ref. 13. β2KO mice aged 8 weeks were anesthetized using
isoflurane. The mouse was introduced into a stereotaxic frame adapted for
mice. Lentivirus (2 µl at 75 ng of p24 protein per µl) was injected bilaterally
at: anteroposterior = −3.4 mm, mediolateral = ±0.5 mm from bregma and
dorsoventral = 4.4 mm from the surface for VTA injection. Mice were implanted
with electrodes 4–5 weeks after viral injection. At the end of the behavioral
experiments, lentiviral re-expression in the VTA was verified using fluorescence
immunohistochemistry. As a control for β2VEC mice, another group of β2KO
mice were injected with lentivirus expressing eGFP only. We did not observe any
difference between β2KO (without lentiviral injections, n = 6) and β2-eGFP mice
(n = 6) in either choices (P = 0.76, unpaired t test) or time-to-goal (P = 0.34,
unpaired t test). We thus pooled the data from both groups to serve as control
for β2VEC data.
In vivo electrophysiological recordings. Extracellular recording electrodes were
constructed from borosilicate glass tubing (1.5 mm O.D. / 1.17 mm I.D.) using
a vertical electrode puller (Narishige). The tip was broken, and electrodes were filled
with a 0.5% sodium acetate solution (wt/vol) and 1.5% neurobiotin (wt/vol),
yielding impedances of 6–9 MΩ.
Animals were anesthetized with chloral hydrate (400 mg per kg of body
weight, intraperitoneal, supplemented as required to maintain optimal anesthe-
sia throughout the experiment), and placed in a stereotaxic apparatus (Kopf
Instruments). The left saphenous vein was catheterized for intravenous admin-
istration of nicotine and the right saphenous vein was catheterized for intravenous
administration of saline solution (NaCl 0.9%, wt/vol). The electrophysiological
activity was sampled in the central region of the VTA (coordinates: 3.1–3.5 mm
posterior to Bregma, ±0.3–0.6 mm lateral to midline and 4–4.7 mm below the
brain surface)50. Spontaneously active DAergic neurons were identified on the
basis of previously established electrophysiological criteria: (1) a typical triphasic
action potential with a marked negative deflection; (2) a characteristic long dura-
tion (>2.0 ms); (3) an action potential width from start to negative trough >1.1 ms;
(4) a slow firing rate (between 1 and 10 Hz) with an irregular single spiking pat-
tern and occasional short, slow bursting activity51. At least 5 min of spontaneous
baseline electrophysiological activity was recorded before intravenous injection
of nicotine (30 µg per kg). At the end of the recording period, the neurons were
stimulated by application of positive currents steps to electroporate neurobiotin
into the neurons to allow DA neurons identification.
Analysis of electrophysiological data. DA cell firing was analyzed with respect to
the average firing rate and the percentage of spikes within bursts (%SWB, number
of spikes within bursts, divided by total number of spikes). Bursts were identified
as discrete events consisting of a sequence of spikes whose onset was defined
by two consecutive spikes within an interval <80 ms and whose termination was
marked by an interval >160 ms (ref. 51). Firing rate and %SWB were evaluated on successive
windows of 60 s, with a 45-s overlapping period14. For each cell, firing frequency
was rescaled as a percentage of its baseline value averaged during the 2 min before
nicotine injection. The effect of nicotine was assessed as a comparison between
the maximum of variation of firing rate and %SWB observed during the first
3 min after saline and nicotine injection. The results are presented as mean ± s.e.m.
of the difference of maximal variation before and after nicotine.
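The burst criteria and %SWB measure above can be implemented directly; a sketch (function and variable names are ours, spike times in seconds):

```python
def percent_spikes_in_bursts(spike_times):
    """%SWB: number of spikes within bursts divided by total spikes, as %.
    Burst onset: two consecutive spikes < 80 ms apart;
    burst offset: an inter-spike interval > 160 ms."""
    n = len(spike_times)
    if n < 2:
        return 0.0
    in_burst = False
    swb = 0
    for i in range(1, n):
        isi = spike_times[i] - spike_times[i - 1]
        if not in_burst and isi < 0.080:
            in_burst = True
            swb += 2          # both spikes of the onset pair count
        elif in_burst and isi <= 0.160:
            swb += 1          # burst continues
        elif in_burst:
            in_burst = False  # ISI > 160 ms terminates the burst
    return 100.0 * swb / n
```

For example, spikes at 0, 50, 170 and 370 ms give one three-spike burst out of four spikes, i.e. %SWB = 75%.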
Fluorescence immunohistochemistry. Following the death of all the lentivirus-
injected mice (GFP and VEC animals), brains were rapidly removed and
fixed in 4% paraformaldehyde. Following a period of at least 3 d of fixation
at 4 °C, serial 60-µm sections were cut from the midbrain with a vibratome.
Immunohistochemistry was performed as follows: free-floating VTA brain
sections were incubated 1 h at 4 °C in a blocking solution of phosphate-
buffered saline (PBS) containing 3% Bovine Serum Albumin (BSA, Sigma; A4503)
(vol/vol) and 0.2% Triton X-100 (vol/vol) and then incubated overnight at 4 °C
with a mouse anti-tyrosine hydroxylase antibody (TH, Sigma, T1299) at 1:200
dilution and a rabbit anti-GFP antibody (Molecular Probes, A-6455) at 1:5,000
dilution in PBS containing 1.5% BSA and 0.2% Triton X-100. The following day,
sections were rinsed with PBS and then incubated 3 h at 22–25 °C with Cy3-
conjugated anti-mouse and Cy2-conjugated anti-rabbit secondary antibodies
(Jackson ImmunoResearch, 715-165-150 and 711-225-152) at 1:200 dilution in
a solution of 1.5% BSA in PBS. After three rinses in PBS, slices were wet-mounted
using Prolong Gold Antifade Reagent (Invitrogen, P36930). Microscopy was car-
ried out with a fluorescent microscope, and images captured using a camera and
ImageJ imaging software.
In the case of electrophysiological recordings, an immunohistochemical
identification of the recorded neurons was performed as described above, with the
addition of 1:200 AMCA-conjugated Streptavidin (Jackson ImmunoResearch) in
the solution. Neurons labeled for both TH and neurobiotin in the VTA50 allowed
us to confirm their neurochemical phenotype.
Electrode implantation and ICSS training. Mice were introduced into a
stereotaxic frame and implanted unilaterally with bipolar stimulating electrodes
for ICSS in the medial forebrain bundle27,28 (MFB, anteroposterior = 1.4 mm,
mediolateral = ±1.2 mm, from the bregma, and dorsoventral = 4.8 mm from the
dura). After recovery from surgery (1 week), the efficacy of electrical stimulation
was verified in an open field with an explicit square location (side = 1 cm) at its
center. Each time a mouse was detected in the area (D = 3 cm) of the location,
a 200-ms train of twenty 0.5-ms biphasic square waves pulsed at 100 Hz was
generated by a stimulator28. Mice self-stimulating at least 50 times in a 5-min session
were kept for the behavioral sessions (3 mice were excluded at this stage, due
to improper electrode implantation). In the certain setting (see below), ICSS
intensity was adjusted so that mice self-stimulated between 50 and 150 times per
session at the end of the training (ninth and tenth session). Current intensity was
subsequently maintained the same throughout the uncertainty setting.
Behavioral data acquisition. Decision-making and locomotor activity were
recorded in a 1-m diameter circular open-field. Experiments were performed
using a video camera, connected to a video-track system, out of sight of the
experimenter. Home-made software (LabVIEW, National Instruments) tracked the
animal, recorded its trajectory (20 frames per s) for 5 min and sent TTL pulses
to the ICSS stimulator when appropriate (see below).
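The closed loop between tracking and stimulation can be sketched as follows (a simplified illustration: the 3-cm detection radius comes from the ICSS-training description above, while the function names and the alternation check, detailed in the next section, are our additions):

```python
import math

def in_detection_area(pos, goal, radius_cm=3.0):
    """True if the tracked position lies within the detection radius
    of a rewarding location (Euclidean distance, coordinates in cm)."""
    return math.dist(pos, goal) <= radius_cm

def maybe_stimulate(pos, goals, last_rewarded):
    """Return the index of the location to reward, or None.
    Stimulation is delivered only when the mouse alternates, i.e.
    a return to the last rewarded location is not reinforced."""
    for i, goal in enumerate(goals):
        if i != last_rewarded and in_detection_area(pos, goal):
            return i  # here the real system would send a TTL pulse
    return None
```

In the actual setup, a non-None index would trigger a TTL pulse to the ICSS stimulator.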
Markovian decision problem by ICSS conditioning. We considered two com-
plementary aspects of motivation: direction and locomotion of the mice. We thus
developed a protocol to simultaneously record the sequential choices
between differently rewarding locations (that is, associated with intracranial self-
stimulation) and the locomotor activity of the mice in between these locations.
After validation of ICSS behavior27, conditioning tasks took place in the 0.8-m
diameter circular open-field. Three explicit square locations were placed in the
open field, forming an equilateral triangle (side = 50 cm). Each time a mouse
was detected in the area of one of the rewarding locations, a stimulation train
was delivered. Animals received stimulations only when they alternated between
rewarding locations. In separate experiments, the intensity or the probability of
stimulation delivery differed for the three rewarding locations. Precise param-
eters (for example, reward probabilities) were pseudo-randomly assigned to each
© 2016Nature America, Inc. All rights reserved.
nature neurOSCIenCe doi:10.1038/nn.4223
rewarding location for each mouse. For each set of (consecutive) experiments, conditioning consisted of one daily session of 5 min for 10 d. Decision-making
was analyzed by expressing data as a series of choices between rewarding
locations (labeled A, B, C). We only considered choices made in an interval of
10s after visiting the previous rewarding location. This restriction is based on
the observation that choices made after 10 s were random (that is, uniformly
distributed) for every condition, and thus probably reflect a disengagement from
the task. This led to the exclusion of fewer than 3% of the total choices made by
the animals (all groups), which suggests that incorporating these late choices
would not significantly change the results. This game implements a Markov decision process (MDP31) consisting of three states (A, B, C), one for each rewarding location, and a transition function corresponding to the proportions of choices in the three gambles. The repartition is defined as the proportion
of states visited by the animal during a session. The transition matrix describes
the proportion of transitions from one state to another. Because animals receive
stimulations only when they alternate between rewarding locations, there is no
repetition of states in the sequence, and the 3 × 3 transition matrix has null diagonal elements. The training consisted of a block (10 daily sessions of 5 min) in a certain setting (CS), in which all locations were associated with ICSS delivery. The test phase consisted of a block (10 daily sessions of 5 min) assessing choice organization in an uncertain setting (US), in which each location was associated with a different probability of ICSS, a validated protocol for studying risky choices52. The foraging phase was performed after the uncertain setting and five supplementary sessions of the deterministic (certain) setting. The foraging phase assessed the exploratory strategy in a dynamic setting (DS), which consisted of three consecutive 5-min sessions. In each session, two out of
three locations delivered the ICSS reward, and the identity of the two rewarding
locations changed every session.
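The repartition and transition matrix described above can be computed directly from a session's sequence of visited locations. A minimal sketch (illustrative Python; the original analyses used Matlab, and the function name is our own):

```python
from collections import Counter

LOCATIONS = ("A", "B", "C")

def repartition_and_transitions(choices):
    """Proportion of visits to each location (repartition) and the 3 x 3
    matrix of transition proportions between locations. The diagonal is
    null because stimulation requires alternating between locations."""
    counts = Counter(choices)
    n = len(choices)
    repartition = {s: counts[s] / n for s in LOCATIONS}
    # Count transitions from each state to each other state
    pair_counts = Counter(zip(choices, choices[1:]))
    transitions = {}
    for src in LOCATIONS:
        total = sum(pair_counts[(src, dst)] for dst in LOCATIONS)
        transitions[src] = {
            dst: pair_counts[(src, dst)] / total if total else 0.0
            for dst in LOCATIONS
        }
    return repartition, transitions
```

For a sequence such as A, B, C, A, B, A, C, B, C, the function returns the visit proportions and, for each state, the proportion of transitions toward each alternative among all transitions leaving that state.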
Analysis of locomotion. Locomotor activity toward the rewarding locations
was measured in terms of time-to-goal, speed profile, dwell time and traveled
distance. Time-to-goal measures the duration between one choice and the next
one. The speed profile corresponds to the instantaneous speed as a function of
time (expressing it as a function of the distance between two locations did not give
any additional information). We averaged the speed profiles on a 10-s interval
(the same used for restricting the choices considered in the analysis), which was
zero-padded if the reward location was attained before 10 s. The dwell time is
defined as the duration between the moment of the detection in the last rewarding
location and the moment when the animal’s speed is greater than 10 cm s−1. The
traveled distance corresponds to the summation of the local distances between
two points of the mouse’s trajectory (20 frames per s) between the last and the
next choice. A multiple linear regression was performed on the time-to-goal, in
the different sets of probabilities of the US setting. We compared models with
increasing number of explanatory variables. As potential explanatory variables,
we included reward history (whether the animal just got rewarded or not, as a
binary variable), the expected reward of the goal, the expected uncertainty of the
goal, the expected reward of the alternative (that is, the location not chosen in
the gamble), and the expected uncertainty of the alternative. We compared these
linear models based on their summed squared errors, penalized for complexity
(Bayesian information criterion):

BIC(TTG) = n ln(SSE/n) + k ln(n)

where n is the number of observations (time-to-goal; n is the same for all regressions), k is the number of explanatory variables, and SSE is the summed squared error from the multiple linear regressions. Constant terms were omitted from the formula for simplicity, as the BICs of the linear regressions were only used for comparisons.
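This comparison can be sketched as follows (illustrative stdlib-only Python; the study's analyses were in Matlab), restricted for brevity to intercept-only versus one-regressor models: each candidate regression is a least-squares fit, and its score is n ln(SSE/n) + k ln(n).

```python
import math

def bic_simple_regression(y, x=None):
    """BIC (constants omitted) of a least-squares fit of y: either an
    intercept-only model, or intercept plus one regressor x.
    BIC = n * ln(SSE / n) + k * ln(n)."""
    n = len(y)
    mean_y = sum(y) / n
    if x is None:
        k = 1
        sse = sum((yi - mean_y) ** 2 for yi in y)
    else:
        k = 2
        mean_x = sum(x) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        sse = sum((yi - (intercept + slope * xi)) ** 2
                  for xi, yi in zip(x, y))
    return n * math.log(sse / n) + k * math.log(n)
```

A model with an extra explanatory variable is retained only if the drop in SSE outweighs the additional ln(n) penalty.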
Computational models of decision-making. In the US, we investigated how well the transition function (that is, the choices) of each genotype could be fitted by variants of decision-making models. At the end of the US, since mice are trained and
choice behavior is at steady-state, we only modeled decision-making, and used
the settings of the task (that is, reward probabilities) as fixed parameters for the
values of the options (see below). In the DS and in the learning phase of the US, we
modeled both learning (see below) and decision-making, and we evaluated how
well the models fit the animals' choices, which were not at steady-state. These
models are thus based on the estimation of the expected payoffs (“value”) and
uncertainties of the options, rather than on objective parameters of the task.
Decision-making models determined the probability Pi of choosing the
next state i, as a function (the choice rule) of a decision variable. Because mice
could not return to the same rewarding location, they had to choose between
the two remaining locations. Accordingly, we modeled decisions between two
alternatives. We considered five choice rules31: the local matching law (Herrnstein33), softmax, epsilon-greedy, uncertainty bonus19,21,39 and uncertainty-controlled randomness34.
• In the local matching law, the probability of choosing an action i (among the two rewarding locations) is given by

P_i = V_i / Σ_j V_j   (1)

where V_i is defined as the value of an option, that is, the expected reward (see below).
• The epsilon-greedy choice rule is

P_i = 1 − ε  if i = argmax_j(V_j)
P_i = ε    otherwise   (2)

where ε is the probability of choosing less valuable options, reflecting undirected exploration.
• The softmax choice rule is

P_i = e^(βV_i) / Σ_j e^(βV_j)   (3)

where β is an inverse temperature parameter reflecting the sensitivity of choice to the difference between decision variables.
In standard reinforcement learning31, the value of an option is the expected
(average) reward. In the US, where the choices are at steady-state, the expected
reward is taken as the reward probability
V_i = E(ICSS_i) = p(ICSS_i)   (4)
In models embedding an exploration bonus, the value depends on both
expected reward and uncertainty16,17,29. Uncertainty may refer to estimation
uncertainty (due to incomplete knowledge or sampling of the outcome), to
the expected uncertainty (or reward risk), related to the estimated variability
of the outcome, or to unexpected uncertainty, that is, uncertainty greater than
expected22,23,44. The expected uncertainty scheme is similar to the mean-
variance approach used in neuroeconomic studies53 and it has also been proposed
to drive exploration19,21,24,25,30. In the US, as mice are trained and choice behavior
is at steady-state, we used this version of the model, where the decision variable
is a compound of the true (that is, not estimated by a learning algorithm) mean
and variance of the payoff
V_i = E(ICSS_i) + ϕσ²(ICSS_i) = p(ICSS_i) + ϕ p(ICSS_i)(1 − p(ICSS_i))   (5)
This compound value is then nested in the softmax choice rule. Note that expected uncertainty (σ_i²) can also be estimated through learning (see equation (10)).
Finally, in the uncertainty-based temperature model (or local control of ran-
domness34), uncertainty associated with all the possible actions at a state controls
the randomness of choices (that is, the temperature parameter). In this strategy,
the randomness of action selection does not depend on the variability of the
possible outcomes. In the softmax model (equation (3)), when different choices yield comparable outcomes, the decision process is random even with a large β, whereas a large difference in values results in greedy action selection even for a small β. To circumvent this issue, it is possible to normalize the temperature parameter β_i for each state i

β_i = β_0 / (E_j(V_j²) − E_j(V_j)²)   (6)
where β_0 is a constant (free) parameter, and E_j(V_j²) − E_j(V_j)² represents the uncertainty (or variability) of the state i (over all the possible actions j) rather than the reward uncertainty associated with a particular action.
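For a choice between the two available locations, the five rules can be sketched as follows (illustrative Python with our own naming; each function returns the probability of option 1). The uncertainty-bonus rule adds ϕ times the Bernoulli reward variance p(1 − p) to the expected reward before the softmax; the uncertainty-controlled-randomness rule divides β_0 by the variability of the available values.

```python
import math

def matching_law(v1, v2):
    """Local matching law: P(option 1) = V1 / (V1 + V2)."""
    return v1 / (v1 + v2)

def epsilon_greedy(v1, v2, eps):
    """Choose the higher-valued option with probability 1 - eps."""
    if v1 == v2:
        return 0.5
    return 1 - eps if v1 > v2 else eps

def softmax2(d1, d2, beta):
    """Softmax over two decision variables, inverse temperature beta."""
    return 1.0 / (1.0 + math.exp(-beta * (d1 - d2)))

def uncertainty_bonus(p1, p2, beta, phi):
    """Compound value: reward probability plus phi times the Bernoulli
    outcome variance p * (1 - p), nested in the softmax."""
    v1 = p1 + phi * p1 * (1 - p1)
    v2 = p2 + phi * p2 * (1 - p2)
    return softmax2(v1, v2, beta)

def uncertainty_temperature(v1, v2, beta0):
    """Softmax whose inverse temperature is beta0 divided by the
    variability of the values available at the current state."""
    var = (v1 ** 2 + v2 ** 2) / 2 - ((v1 + v2) / 2) ** 2
    if var == 0:
        return 0.5  # identical values: choice is fully random
    return softmax2(v1, v2, beta0 / var)
```

Note how the uncertainty bonus can equalize options with unequal payoffs: a ϕ > 0 raises the decision value of a probabilistic location relative to a certain one.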
Reinforcement learning models determined the evolution of the decision vari-
ables, which are in this case estimations of the task parameters. The values of
the rewarding locations were estimated using standard reinforcement models31,
which are based on trial-and-error learning. First, the model computes the dis-
crepancy between the predicted value of the chosen location (Vi) and the actual
reward R at the trial t
δ_{i,t} = R_i(t) − V_{i,t−1}   (7)
where Ri(t) = 1 or 0 depending on whether the animal was rewarded or not. This
reward prediction error is then used to adapt the estimation of the value Vi of the
chosen location only (that is, the values of the other locations are not changed)

V_{i,t} = V_{i,t−1} + αδ_{i,t}   (8)

where α is the learning rate. To test whether nicotinic receptors differentially
affected the sensitivity to reward and reward omission, we used an asymmetric
version of reinforcement learning38
V_{i,t} = V_{i,t−1} + α⁺δ_{i,t}  if δ_{i,t} > 0
V_{i,t} = V_{i,t−1} + α⁻δ_{i,t}  if δ_{i,t} < 0   (9)
where α⁺ and α⁻ are the learning rates for better- and worse-than-expected outcomes, respectively.
We also used an extended version of the reinforcement learning model23,39 to evaluate the expected uncertainty of the rewarding locations. The rationale behind
this model is that uncertain and unpredictable outcomes produce large prediction
errors (positive and negative), by definition. Hence squared prediction errors
(equation (7)) can be used to estimate unpredictability or uncertainty

σ²_{i,t} = σ²_{i,t−1} + α_ϕ ξ_{i,t}   (10)
where α_ϕ is the learning rate for uncertainty, and ξ_{i,t} is the uncertainty (or risk) prediction error of the option i at trial t, that is,

ξ_{i,t} = δ²_{i,t} − σ²_{i,t−1}   (11)
The uncertainty prediction error corresponds to unexpected uncertainty (uncertainty larger than expected), and we tested whether exploration might be directed by this unexpected form of uncertainty, by assigning a bonus to this error term

V*_{i,t} = V_{i,t} + ϕξ_{i,t}   (12)
Finally, uncertainty may exert an indirect effect through learning. It has been
shown in humans that learning rate itself can increase with sudden changes
in uncertainty37,54. We tested the following adaptive learning rate model37,
where the learning rate increases when there is a recent increase m in absolute prediction errors

α_t = α_{t−1} + f(m_t)(1 − α_{t−1})  if m_t > 0
α_t = α_{t−1} + f(m_t)α_{t−1}   if m_t < 0   (13)

where f(m) is a double sigmoid function

f(m_t) = sign(m_t)(1 − e^(−(m_t/λ)²))
where m_t is the slope of the (recent) smoothed absolute reward prediction errors,

m_t = (δ̄_t − δ̄_{t−1}) / (δ̄_t + δ̄_{t−1})

and δ̄_t is the smoothed absolute prediction error. Smoothing of absolute prediction errors is achieved by

δ̄_t = α₁|δ_t| + (1 − α₁)δ̄_{t−1}

The free parameter λ determines the degree to which uncertainty (absolute prediction errors) affects the learning rate, and the other free parameter, α₁, determines the initial learning rate and the speed of δ̄ updating.
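The adaptive-learning-rate idea can be sketched as follows (illustrative Python; the exact shape of the double sigmoid in ref. 37 is paraphrased here, so treat `f` as an assumption): the learning rate is pushed toward 1 when smoothed absolute prediction errors rise, and toward 0 when they fall.

```python
import math

def adapt_learning_rate(alpha, d_prev, d_curr, lam):
    """d_prev, d_curr: smoothed absolute prediction errors on successive
    trials. Their normalized slope m drives alpha up (toward 1) when
    errors grow and down (toward 0) when they shrink; lam controls the
    sensitivity of the (assumed) double-sigmoid gain f."""
    denom = d_curr + d_prev
    m = (d_curr - d_prev) / denom if denom else 0.0
    f = math.copysign(1 - math.exp(-((m / lam) ** 2)), m)
    if m > 0:
        return alpha + f * (1 - alpha)
    return alpha + f * alpha  # f <= 0 here, so alpha decreases
```

The update keeps α within [0, 1] by construction, since the step is proportional to the remaining distance to the nearest bound.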
In the US, at steady-state, we fitted the free parameters of the four decision-making models (none for the matching law, ε for ε-greedy, β for softmax, β and ϕ for the uncertainty model). In the learning phase of the US, we fitted the free parameters of these four models: standard RL (α, β); RL with uncertainty learning and expected uncertainty bonus (α, β, α_ϕ, ϕ); RL with adaptive (uncertainty-dependent) learning rate (α, β, λ); and RL with uncertainty learning and unexpected uncertainty bonus (α, β, α_ϕ, ϕ). We fixed the initial conditions (V(0) = 1 and σ(0) = 0), because the mice underwent the certain setting just beforehand.
In the DS, we fitted the free parameters and initial conditions of these seven models: standard RL (α, β, V(0)); asymmetric learning rates RL (α, α⁺, β, V(0)); RL with uncertainty bonus (α, β, ϕ, V(0), σ(0)); RL with separate learning for value and uncertainty (α, α_ϕ, β, ϕ, V(0), σ(0)); RL with asymmetric learning rates for value and separate uncertainty learning (α, α⁺, α_ϕ, β, ϕ, V(0), σ(0)); RL with uncertainty learning and unexpected uncertainty bonus (α, α_ϕ, β, ϕ, V(0), σ(0)); and RL with adaptive (uncertainty-dependent) learning rate (α, β, λ, V(0)).
In each case, we searched for the free parameters maximizing the likelihood of the observed choices c over all trials t, ∏_t P_{c,t}. We performed the fits of all the parameters individually for each animal, using the population fit (that is, the fit of the average probabilities of choices) as initial conditions. We checked that the mean of the individual fits stayed close to the population fit, and that the optima were non-local (by examining the Hessian matrix55). We used the
fmincon function in Matlab to perform the fits, with the constraints that learning
rates and temperature could not be negative and that learning rates could not
exceed 1. To assess goodness-of-fit, we report negative log likelihoods penalized
for model complexity (Bayesian information criterion; BIC). Smaller BIC values
indicate a better fit. Each of these models has been found to fit experimental
data in at least one given experimental condition (for example, behavioral task
or species16,17,29,38,39). Here, we aimed at accounting for the difference observed
between genotypes, to propose a computational role for the nicotinic modulation
of the VTA. Hence, once the best model is determined, possible differences in the free parameters (for example, ε, β, ϕ) between genotypes or conditions point to the computational role of the β2 subunit-containing nAChRs expressed in the VTA in decision-making processes.
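The fitting procedure (fmincon in the original Matlab code) can be illustrated with a dependency-free sketch: a grid search for the β of a softmax model that minimizes the negative log likelihood of a sequence of observed choices (function names are ours).

```python
import math

def neg_log_likelihood(beta, trials):
    """trials: (value_chosen, value_unchosen) pairs, one per observed
    choice. Returns -sum log P(choice) under a two-option softmax."""
    nll = 0.0
    for v_c, v_u in trials:
        p = 1.0 / (1.0 + math.exp(-beta * (v_c - v_u)))
        nll -= math.log(p)
    return nll

def fit_beta(trials, grid=None):
    """Crude maximum-likelihood estimate of the inverse temperature by
    grid search over [0, 10] (stand-in for a bounded optimizer)."""
    grid = grid if grid is not None else [i * 0.1 for i in range(101)]
    return min(grid, key=lambda b: neg_log_likelihood(b, trials))
```

An animal choosing the better of two options on 8 of 10 trials yields an estimate near ln(4) ≈ 1.39, the β at which the softmax assigns probability 0.8 to the better option; the fitted negative log likelihood would then be penalized by k ln(n) per free parameter to form the BIC.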
Extension of the uncertainty model to previous experiments on β2KO mice. We also aimed at extending our framework by modeling the results from previous studies focusing on the behavioral differences between WT and β2KO mice with reinforcement learning models embedding an uncertainty-based exploratory bonus (equations (5, 7, 8, 10 and 11)). In these experiments, uncertainty was not explicitly controlled but was nevertheless present, as in most decision tasks. We thus used the main difference found in the model-based analysis of our decision task, that is, a positive value given to uncertainty in WT, but not in β2KO, mice, and explored the values for uncertainty estimation to qualitatively match the data.
All experiments were modeled as MDPs with a discretization of the relevant
states for the animals.
In the open-field experiment13,42, we used the symbolic decomposition of the
behavior proposed in ref. 42, by splitting the locomotion of the mice into “active”
versus “inactive” states, and their positions into “center” and “periphery” states.
The active state corresponds to high-speed navigation, while the inactive state
corresponds to low-speed exploration, mainly composed of rearing, scanning and
sniffing behaviors42,43. This double dichotomy gives rise to four states, which we modeled as an MDP with all transitions possible, except for the "stay" transitions (that is, of one state onto itself) and the transitions between periphery-inactive (PI)
and center-inactive (CI) states, which were not found in the data13,42. The dura-
tion of one state was 1 s. We modeled the difference between WT and β2KO mice
by adding an exploratory bonus to the inactive states in WT mice only, which we
deduced from the experimental (average) transition probability and the softmax
decision rule with bonus as follows. In the center-active state, the probability of going to the center-inactive state is given by

P(CI|CA) = e^(βV_CI) / (e^(βV_CI) + e^(βV_PA))

where V_CI and V_PA represent the values associated with the center-inactive and the periphery-active states. We computed the analogous relations between V_PA and V_CI, and between V_PI and V_CA, and fitted β, V_PI and V_CI to reproduce the data.
In the object recognition task40, two objects are placed in an open-field, and the
time spent in the objects area is measured as a function of the behavioral sessions.
We modeled this task as an MDP using a discretization of space, consisting of 25 states corresponding to the open-field without objects, and two states corresponding to the objects. The duration of one state was 1 s. We used the uncertainty
model (no reward being present in the task, we modeled the uncertainties but
not the values) and we fitted the values of α, β, ϕ, and the initial uncertainties of
the objects and of the open-field to reproduce the data.
In the spatial maze40, we modeled an idealized version of this conditioning
task, consisting of four states, corresponding to the arms of the maze. One of
them delivered a reachable food reward (R = 1 if reached), and was absorbing:
the simulation stopped if the agent (the modeled mouse) reached it. The duration
of one state (the mean duration of visiting one arm) was 10 s. We used the
uncertainty model with a single learning rate (α, β, ϕ, V(0), σ(0)) for simplicity.
We simulated the model until the food was reached, and measured the time to
reach the food, as done in the experiment.
In the passive avoidance task41, animals are placed in a box divided into two (light and dark) compartments. The learning phase (which was not modeled) consists of a single foot shock given in the dark compartment, which arguably induces a negative prediction error for this state. We simulated this experiment by considering
a sequential evaluation model representing incentive motivation56, in which the
agent sequentially evaluates the probability to go to the dark compartment until
it decides to accept it. The probability of going to the dark part of the box at any time is given by

P(dark) = 1 / (1 + e^(−β(V − θ)))   (14)

where β is the inverse temperature (sensitivity to value) and θ is a threshold
representing the basal locomotor activity of the animal. In this model, the agent
evaluates the probability of going to the dark part, based on its single experi-
ence of a foot shock, which induced a single, negative, reward prediction error
(equation (7)), resulting both in a decrease in value (equation (8)) and an
increased uncertainty (equation (10)). The time-step for each evaluation was 1 s.
We measured the time before the agents go to the dark part, as done in the experi-
ment41. For each model experiment, standard errors were obtained following
a bootstrap procedure, using the sample size of the original data.
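Under this sequential-evaluation scheme, a constant per-second acceptance probability makes the latency to enter the dark compartment geometrically distributed, so its mean is simply the inverse of that probability. A toy sketch (illustrative Python; the logistic acceptance probability is our reading of the value-and-threshold description above):

```python
import math

def acceptance_probability(value, beta, theta):
    """Per-step (1 s) probability that the agent accepts entering the
    dark compartment: logistic in its value, with threshold theta."""
    return 1.0 / (1.0 + math.exp(-beta * (value - theta)))

def mean_latency(value, beta, theta):
    """Mean waiting time in steps: geometric distribution, mean 1/p.
    Lower (more negative) values give longer avoidance latencies."""
    return 1.0 / acceptance_probability(value, beta, theta)
```

A foot shock lowers the dark compartment's value via a negative prediction error, lengthening the modeled latency, as measured in the experiment.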
Statistical analysis. No statistical methods were used to predetermine sample
sizes. Our sample sizes are comparable to many studies using similar techniques
and animal models. We used a pseudo-randomization procedure, in the sense
that in the behavioral experiments, precise parameters (for example, reward
probabilities) were pseudo-randomly assigned to each rewarding location for
each mouse. The experiments were blind, in the sense that the experimenters
(both in behavioral and electrophysiological experiments) were not aware
of which genotype each mouse belonged to.
Behavioral and model data were analyzed and fitted using Matlab
(The MathWorks). Electrophysiological data were analyzed using R (The R Project).
Code is available on request. Data are plotted as mean ± s.e.m. Total number (n)
of observations in each group and statistics used are indicated in figure captions.
Comparisons between means were classically performed using parametric tests (Student's t test for two groups, or ANOVA for comparing more than two groups) when parameters followed a normal distribution (Shapiro test, P > 0.05), and non-parametric tests (Wilcoxon or Mann-Whitney) otherwise.
Homogeneity of variances was tested preliminarily and the t tests were Welch-
corrected if needed. Multiple comparisons were Bonferroni corrected. All statistical
tests were two-sided. P > 0.05 was considered to be not statistically significant.
A Supplementary Methods Checklist is available.
50. Paxinos, G. & Franklin, K.B. The Mouse Brain in Stereotaxic Coordinates (Gulf
Professional Publishing, 2004).
51. Grace, A.A. & Bunney, B.S. Intracellular and extracellular electrophysiology of nigral
dopaminergic neurons--1. Identification and characterization. Neuroscience 10,
301–315 (1983).
52. Rokosik, S.L. & Napier, T.C. Intracranial self-stimulation as a positive reinforcer to
study impulsivity in a probability discounting paradigm. J. Neurosci. Methods 198,
260–269 (2011).
53. D’Acremont, M. & Bossaerts, P. Neurobiological studies of risk assessment: a
comparison of expected utility and mean-variance approaches. Cogn. Affect. Behav.
Neurosci. 8, 363–374 (2008).
54. Behrens, T.E.J., Woolrich, M.W., Walton, M.E. & Rushworth, M.F.S. Learning the value
of information in an uncertain world. Nat. Neurosci. 10, 1214–1221 (2007).
55. Daw, N.D. Trial-by-trial data analysis using computational models. in Decision
Making, Affect, and Learning: Attention and Performance XXIII (eds. Delgado, M.R.,
Phelps, E.A. & Robbins, T.W.) 3–38 (2011).
56. McClure, S.M., Daw, N.D. & Montague, P.R. A computational substrate for incentive
salience. Trends Neurosci. 26, 423–428 (2003).
... De plus, l'activation optogénétique des neurones cholinergiques du pont va principalement activer les neurones DA (Dautan et al., 2016;Xiao et al., 2016). Malgré l'importance de l'activation des mAChR, il existe un faisceau d'évidence qui pointe le rôle clé du contrôle nicotinique sur les neurones de la VTA (Durandde Cuttoli et al., 2018;Faure et al., 2014;Marti et al., 2011;Naudé et al., 2016;Picciotto et al., 1998;Tolu et al., 2013). Lors de ma thèse, j'ai utilisé un nAChR photo-contrôlable pour sonder, en temps réel et in vivo, l'impact de la modulation nicotinique endogène sur l'activité spontanée des neurones DA de la VTA (Durand-de Cuttoli et al., 2018). ...
... La présence de bursts chez la souris β2 -/est en accord avec le fait que les bursts ne sont pas juste sous contrôle nicotinique (Kitai et al., 1999). L'utilisation du modèle de la souris β2 -/couplé à de la réexpression locale dans la VTA a permis de mettre en évidence le rôle des β2* nAChR dans la locomotion spontanée ou l'exploration (Avale et al., 2008; et dans l'exploration de récompenses incertaines (Naudé et al., 2016). De plus, l'activation optogénétique des afférences cholinergiques dans la VTA conduit à un renforcement positif et promeut la locomotion (Dautan et al., 2016, Xiao et al., 2016. ...
... They are also present on DAergic terminals in the Nucleus Accumbens (NAc) and the prefrontal cortex (Grady et al., 2007;Changeux, 2010). Genetic and pharmacological manipulations have implicated VTA nAChRs in tuning the activity of DA neurons and in mediating the addictive properties of nicotine (Mameli-Engvall et al., 2006;Morel et al., 2014;Naudé et al., 2016;Picciotto et al., 1998;Tapper et al., 2004;Tolu et al., 2013). However, understanding the mechanism by which ACh and nicotine participate in these activities requires to comprehend the spatio-temporal dynamics of nAChRs activation. ...
Thesis
L’accroissement mondial des maladies liées au tabac reste un problème majeur de santé publique. La nicotine, principale substance active du tabac, agit exclusivement sur les récepteurs nicotiniques de l’acétylcholine (nAChR). Dans le système nerveux central, il existe douze sous-unités nicotiniques (α2-10 et β2-4) qui s'assemblent en pentamères selon diverses combinaisons, ce qui génère une grande diversité de récepteurs ayant des propriétés pharmacologiques, des localisations et des fonctions différentes. Outre son rôle indéniable et très bien décrit dans le renforcement, la nicotine peut également déclencher une aversion chez les individus. Ces effets opposés pourraient reposer sur l'activation de nAChR au sein de circuits neuronaux distincts, et sont prédicteurs, lors de la première expérience d’inhalation de tabac, des risques de développer une addiction à la nicotine. Comprendre le rôle des différents isoformes de nAChR et des circuits dans lesquels ils sont exprimés a été un défi majeur dans ce domaine, et un des questionnements de ce travail. Lors de la première partie de ma thèse, nous avons mis au point des nAChR contrôlables par la lumière (LinAChR) qui fonctionnent normalement dans l'obscurité ou sous une lumière verte (500 nm) et sont rapidement inhibés sous une lumière violette (380 nm). Nous avons implémenté cette technologie in vivo, dans l’aire tegmentale ventrale, un noyaudopaminergique qui joue un rôle clé dans le renforcement et la dépendance. Le blocage des LinAChR β2 par la lumière a révélé l’impact du tonus cholinergique endogène sur l'activité spontanée des neurones dopaminergiques, et a permis d’abolir le renforcement à la nicotine chez les souris. Lors de la seconde partie de ma thèse, je me suis intéressée aux mécanismes moléculaires et cellulaires qui sont impliqués dans la régulation de la consommation de nicotine. 
Pour cela, j'ai mis en place une tâche qui repose sur la consommation volontaire de nicotine, et développé une méthode d'analyse qui rend compte de la variabilité comportementale des souris. J'ai ainsi observé qu’une partie des souris développe une aversion spontanée et dose-dépendante à la nicotine. J'ai pu montrer qu'un traitement chronique à la nicotine diminue l'expression fonctionnelle des nAChR β4 dans le noyau interpédonculaire, et que cette régulation à la baisse impacte à la fois l'aversion à la nicotine et sa consommation. Les travaux que j’ai menés suggèrent que la récompense et l'aversion à la nicotine impliquent à la fois des récepteurs et des circuits neuronaux distincts. Mes travaux mettent aussi l'accent sur le développement de nouvelles technologies optiques pour comprendre comment la dynamique d'activation des récepteurs conduit à des modifications d’activité dans des circuits spécifiques, et à des comportements liés à la dépendance.
... 3 This fiber bundle courses via the lateral hypothalamus between multiple regions of the brain's reward system, most notably the ventral tegmental area and the nucleus accumbens (NA). This self-stimulating effect has mainly been used in order to study the neural underpinnings of motivation itself, but some studies have used this rewarding effect to perform spatial conditioning [4][5][6] or to study reward valuation. 7 This suggested to us that this method MOTIVATION Training animals on decision-making tasks to measure their perceptual abilities and match this to ongoing neural activity is an essential part of integrative neuroscience. ...
... It is likely that MFB will further generalize given that it has already been shown to work well in freely moving animals both with nose poke and with spatial conditioning. [4][5][6] Finally, further optimization of MFB stimulation parameters (delay, duration, etc.) could lead to improved results. ...
Article
Full-text available
Perceptual decision-making tasks are essential to many fields of neuroscience. Current protocols generally reward deprived animals with water. However, balancing animals’ deprivation level with their well-being is challenging, and trial number is limited by satiation. Here, we present electrical stimulation of the medial forebrain bundle (MFB) as an alternative that avoids deprivation while yielding stable motivation for thousands of trials. Using licking or lever press as a report, MFB animals learnt auditory discrimination tasks at similar speed to water-deprived mice. Moreover, they more reliably reached higher accuracy in harder tasks, performing up to 4,500 trials per session without loss of motivation. MFB stimulation did not impact the underlying sensory behavior since psychometric parameters and response times are preserved. MFB mice lacked signs of metabolic or behavioral stress compared with water-deprived mice. Overall, MFB stimulation is a highly promising tool for task learning because it enhances task performance while avoiding deprivation.
... Comme je le détaillerai dans les paragraphes suivants, de nombreuses études ont utilisé ce type d'approche, permettant d'immenses progrès sur la compréhension des mécanismes d'apprentissage et d'adaptation. Néanmoins, les neurosciences font aujourd'hui face à des enjeux encore plus complexes puisqu'il s'agit désormais de disséquer de manière plus fine ces processus : déterminer les récepteurs impliqués (Naudé et al., 2016), évaluer les concentrations minimales Stratégies d'adaptation de la souris face à un environnement volatil ETAT DE L'ART. Chapitre 3. Bases neurobiologiques des comportements adaptatifs 97 de neuromodulateurs nécessaires pour observer des conséquences comportementales ou encore comprendre l'influence des motifs de décharge des neurones neuromodulateurs (Ellwood et al., 2017). ...
... Or des souris génétiquement modifiées, n'exprimant pas la sous-unité β2 des récepteurs à l'acétylcholine au niveau de l'aire tegmentale ventrale, semblent montrer un défaut d'exploration des options les plus incertaines dans une tâche de bandit. La modélisation par un modèle d'apprentissage par renforcement ajoutant un bonus à l'exploration des options incertaines semble rendre compte au mieux des résultats(Naudé et al., 2016). Il semblerait donc que l'acétylcholine via son action sur les récepteurs nicotiniquesβ2 de l'aire tegmentale ventrale promeuve l'exploration dirigée. ...
Thesis
Les animaux évoluent dans un environnement en perpétuel changement. L’enjeu de ce travail de thèse a été de comprendre comment les animaux sont capables de s’adapter à des changements abrupts. L’ensemble de ces travaux a été mené au laboratoire chez la souris grâce à un nouveau paradigme expérimental, la tâche de switch. Cette tâche nécessitant des temps expérimentaux longs et l’acquisition d’une grande quantité de données, le développement d’un système expérimental comportemental innovant était essentiel. Une nouvelle chambre opérante automatisée permettant d’implémenter des paradigmes complexes, longs, et de collecter des milliers de données par sujet tout en respectant la physiologie et le bien-être animal a été conçue et validée. Grâce à un tel dispositif, dix-sept souris ont pu réaliser la tâche de switch. Les résultats montrent que les souris sont capables d’apprendre la tâche, de détecter les changements successifs de règles et de s’y adapter grâce à une stratégie de méta-apprentissage par renforcement associée à une heuristique d'exploration basée sur l'action. Par ailleurs, grâce à une technique de neuromodulation expérimentale pharmacologique, il a été possible d’inhiber le cingulaire antérieur de quatre souris dans la tâche de switch à la suite des changements de règles. Les résultats préliminaires sembleraient indiquer une potentielle implication de cette région dans la stratégie comportementale mise en jeu et le processus de méta-apprentissage par renforcement à la suite du changement des règles. Ce travail offre de nouvelles perspectives quant à la compréhension des mécanismes d’adaptation des êtres vivants face à un environnement volatil.
... 2,3 MFB has previously been used as a reward in nose-poke or place preference tasks in freely moving rodents. 4,5 Importantly, this reward does not lead to satiation which allows for flexible protocols and high trial yield per session. The protocol below describes how to use MFB stimulation in head-fixed mice as a reward in tasks in which the animal reports perception of a stimulus by a motor action. ...
Article
Full-text available
Training mice to perform perceptual tasks is a vital part of integrative neuroscience. Replacing classical rewards like water with medial forebrain bundle (MFB) stimulation allows experimenters to avoid deprivation and obtain higher trial numbers per session. Here, we provide a protocol for implementing MFB-based reward in mice. We describe steps for MFB electrode implantation, efficacy testing, and stimulation calibration. After these steps, MFB reward can be used to facilitate sensory discrimination task training and enable nuanced characterization of psychophysical abilities. For complete details on the use and execution of this protocol, please refer to Verdier et al. (2022).¹
... In this study, we demonstrate that expression of a gain-of-function β2 nAChR subunit in the ventral midbrain (VTA) of male rats is sufficient to significantly alter intravenous nicotine SA. Nicotine Reinforcement: VTA β2 nAChRs are Sufficient - Loss-of-function studies using β2 nAChR antagonists (Corrigall, Coen et al. 1994) or β2 knockout mice (Maskos, Molles et al. 2005, Pons, Fattore et al. 2008, Tolu, Eddine et al. 2013) have linked β2-containing nAChRs with nicotine reinforcement, while others (le Novere, Zoli et al. 1999, King, Caldarone et al. 2004, Walters, Brown et al. 2006, Drenan, Grady et al. 2008, Drenan, Grady et al. 2010, Naude, Tolu et al. 2016) have linked nicotine-related phenotypes (place preference, locomotor activation, motivation, etc.) with VTA β2-containing nAChRs. Notably, two prior studies served as a key starting point upon which our study builds. ...
Article
Full-text available
Mesolimbic nicotinic acetylcholine receptor (nAChRs) activation is necessary for nicotine reinforcement behavior, but it is unknown whether selective activation of nAChRs in the dopamine (DA) reward pathway is sufficient to support nicotine reinforcement. In this study, we tested the hypothesis that activation of β2-containing (β2*) nAChRs on VTA neurons is sufficient for intravenous nicotine self-administration (SA). We expressed β2 nAChR subunits with enhanced sensitivity to nicotine (referred to as β2Leu9′Ser) in the VTA of male Sprague Dawley (SD) rats, enabling very low concentrations of nicotine to selectively activate β2* nAChRs on transduced neurons. Rats expressing β2Leu9′Ser subunits acquired nicotine SA at 1.5 μg/kg/infusion, a dose too low to support acquisition in control rats. Saline substitution extinguished responding for 1.5 μg/kg/inf, verifying that this dose was reinforcing. β2Leu9′Ser nAChRs also supported acquisition at the typical training dose in rats (30 μg/kg/inf) and reducing the dose to 1.5 μg/kg/inf caused a significant increase in the rate of nicotine SA. Viral expression of β2Leu9′Ser subunits only in VTA DA neurons (via TH-Cre rats) also enabled acquisition of nicotine SA at 1.5 μg/kg/inf, and saline substitution significantly attenuated responding. Next, we examined electrically-evoked DA release in slices from β2Leu9′Ser rats with a history of nicotine SA. Single-pulse evoked DA release and DA uptake rate were reduced in β2Leu9′Ser NAc slices, but relative increases in DA following a train of stimuli were preserved. These results are the first to report that β2* nAChR activation on VTA neurons is sufficient for nicotine reinforcement in rats.
... When there are strong violations of expectations, locus coeruleus activity may induce a "network reset" that causes a reconfiguration of neuronal activity that clears the way to adapt to these changes (Bouret and Sara, 2005). In modeling and in experimental work, it has been shown that the cholinergic system mediates uncertainty seeking (Naude et al., 2016; Belkaid and Krichmar, 2020). Uncertainty seeking is especially advantageous in situations when reward sources are uncertain. ...
Article
Full-text available
In their book “How the Body Shapes the Way We Think: A New View of Intelligence,” Pfeifer and Bongard put forth an embodied approach to cognition. Because of this position, many of their robot examples demonstrated “intelligent” behavior despite limited neural processing. It is our belief that neurorobots should attempt to follow many of these principles. In this article, we discuss a number of principles to consider when designing neurorobots and experiments using robots to test brain theories. These principles are strongly inspired by Pfeifer and Bongard, but build on their design principles by grounding them in neuroscience and by adding principles based on neuroscience research. Our design principles fall into three categories. First, organisms must react quickly and appropriately to events. Second, organisms must have the ability to learn and remember over their lifetimes. Third, organisms must weigh options that are crucial for survival. We believe that by following these design principles a robot's behavior will be more naturalistic and more successful.
Article
The neural mechanisms by which animals initiate goal-directed actions, choose between options, or explore opportunities remain unknown. Here, we develop a spatial gambling task in which mice, to obtain intracranial self-stimulation rewards, self-determine the initiation, direction, vigor, and pace of their actions based on their knowledge of the outcomes. Using electrophysiological recordings, pharmacology, and optogenetics, we identify a sequence of oscillations and firings in the ventral tegmental area (VTA), orbitofrontal cortex (OFC), and prefrontal cortex (PFC) that co-encodes and co-determines self-initiation and choices. This sequence appeared with learning as an uncued realignment of spontaneous dynamics. Interactions between the structures varied with the reward context, particularly the uncertainty associated with the different options. We suggest that self-generated choices arise from a distributed circuit based on an OFC-VTA core determining whether to wait for or initiate actions, while the PFC is specifically engaged by reward uncertainty in action selection and pace.
Article
There is a great deal of uncertainty in the world. One common source of uncertainty results from incomplete or missing information about probabilistic outcomes (i.e., outcomes that may occur), which influences how people make decisions. The impact of this type of uncertainty may be particularly pronounced for older adults, who, as the primary leaders around the world, make highly impactful decisions with lasting outcomes. This review examines the ways in which uncertainty about probabilistic outcomes is perceived, handled, and represented in the aging brain, with an emphasis on how uncertainty may specifically affect decision making in later life. We describe the role of uncertainty in decision making and aging from four perspectives: 1) theoretical, 2) self-report, 3) behavioral, and 4) neuroscientific. We report evidence of age-related differences in uncertainty among these contexts and describe how these changes may affect decision making. We then integrate the findings across the distinct perspectives, followed by a discussion of important future directions for research on aging and uncertainty, including prospection, domain-specificity in risk-taking behaviors, and choice overload.
Article
Full-text available
When encountering novel environments, animals perform complex yet structured exploratory behaviors. Despite their typical structuring, the principles underlying exploratory patterns are still not sufficiently understood. Here we analyzed exploratory behavioral data from two modalities: whisking and locomotion in rats and mice. We found that these rodents maximized novelty signal-to-noise ratio during each exploration episode, where novelty is defined as the accumulated information gain. We further found that these rodents maximized novelty during outbound exploration, used novelty-triggered withdrawal-like retreat behavior, and explored the environment in a novelty-descending sequence. We applied a hierarchical curiosity model, which incorporates these principles, to both modalities. We show that the model captures the major components of exploratory behavior in multiple timescales: single excursions, exploratory episodes, and developmental timeline. The model predicted that novelty is managed across exploratory modalities. Using a novel experimental setup in which mice encountered a novel object for the first time in their life, we tested and validated this prediction. Further predictions, related to the development of brain circuitry, are described. This study demonstrates that rodents select exploratory actions according to a novelty management framework and suggests a plausible mechanism by which mammalian exploration primitives can be learned during development and integrated in adult exploration of complex environments.
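The abstract above defines novelty as accumulated information gain. One common way to operationalize this (an assumption here, not necessarily the authors' exact formulation) is Bayesian surprise: the KL divergence between the predictive distribution after versus before each observation, summed over the stream of observations:

```python
import math

def bayesian_surprise(counts, obs):
    """Information gain from one observation: KL divergence between the
    categorical predictive distribution after vs. before the count update."""
    total = sum(counts)
    old = [c / total for c in counts]
    new_counts = list(counts)
    new_counts[obs] += 1
    new = [c / (total + 1) for c in new_counts]
    return sum(q * math.log(q / p) for q, p in zip(new, old))

# Accumulate novelty over a stream of observations from a 3-outcome world;
# repeated outcomes contribute less and less information as beliefs settle.
counts = [1, 1, 1]          # uniform prior pseudo-counts
novelty = 0.0
for obs in [0, 0, 1, 0, 2, 0, 0]:
    novelty += bayesian_surprise(counts, obs)
    counts[obs] += 1
print(round(novelty, 4))
```

Early observations carry the most novelty, which matches the paper's finding that exploration proceeds in a novelty-descending sequence.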
Article
Full-text available
Although empirical and neural studies show that serotonin (5HT) plays many functional roles in the brain, prior computational models mostly focus on its role in behavioral inhibition. In this study, we present a model of risk based decision making in a modified Reinforcement Learning (RL)-framework. The model depicts the roles of dopamine (DA) and serotonin (5HT) in Basal Ganglia (BG). In this model, the DA signal is represented by the temporal difference error (δ), while the 5HT signal is represented by a parameter (α) that controls risk prediction error. This formulation that accommodates both 5HT and DA reconciles some of the diverse roles of 5HT particularly in connection with the BG system. We apply the model to different experimental paradigms used to study the role of 5HT: (1) Risk-sensitive decision making, where 5HT controls risk assessment, (2) Temporal reward prediction, where 5HT controls time-scale of reward prediction, and (3) Reward/Punishment sensitivity, in which the punishment prediction error depends on 5HT levels. Thus the proposed integrated RL model reconciles several existing theories of 5HT and DA in the BG.
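The abstract sketches DA as the TD error δ and 5HT as a parameter controlling risk prediction error. As a hedged illustration of risk-sensitive TD learning (a Mihatsch-Neuneier-style asymmetric update, related to but not identical with the paper's model), scaling positive and negative prediction errors differently makes the learned value of a risky option deviate from its mean:

```python
import random

def learn_value(rewards, lr=0.1, kappa=0.0):
    """Risk-sensitive TD learning: positive prediction errors are scaled
    by (1 - kappa) and negative ones by (1 + kappa), so kappa > 0 yields
    a risk-averse value estimate. kappa stands in for the 5HT-like
    risk parameter; delta is the DA-like TD error."""
    q = 0.0
    for r in rewards:
        delta = r - q
        scale = (1 - kappa) if delta > 0 else (1 + kappa)
        q += lr * scale * delta
    return q

# A risky option paying 1.0 or 0.0 with equal probability (mean 0.5)
rng = random.Random(1)
risky = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(2000)]
print(round(learn_value(risky, kappa=0.6), 2))   # risk-averse: below the mean
print(round(learn_value(risky, kappa=0.0), 2))   # risk-neutral: near the mean
```

The risk-averse estimate settles near (1 - kappa)/2 rather than 0.5, which is how an asymmetric error signal turns variance into an effective cost.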
Article
Neuroeconomics is the study of the neurobiological and computational basis of value-based decision making. Its goal is to provide a biologically based account of human behaviour that can be applied in both the natural and the social sciences. This Review proposes a framework to investigate different aspects of the neurobiology of decision making. The framework allows us to bring together recent findings in the field, highlight some of the most important outstanding problems, define a common lexicon that bridges the different disciplines that inform neuroeconomics, and point the way to future applications.
Article
Researchers have recently begun to integrate computational models into the analysis of neural and behavioural data, particularly in experiments on reward learning and decision making. This chapter aims to review and rationalize these methods. It exposes these tools as instances of broadly applicable statistical techniques, considers the questions they are suited to answer, provides a practical tutorial and tips for their effective use, and, finally, suggests some directions for extension or improvement. The techniques are illustrated with fits of simple models to simulated datasets. Throughout, the chapter flags interpretational and technical pitfalls of which authors, reviewers, and readers should be aware. © The International Association for the study of Attention and Performance, 2011. All rights reserved.
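As a concrete instance of the model-fitting workflow the chapter reviews, the sketch below fits the learning rate of a Rescorla-Wagner/softmax model to simulated choices by maximum likelihood. A crude grid search stands in for a proper optimizer, and all task details (reward probabilities, parameter values) are illustrative assumptions:

```python
import math
import random

def simulate(alpha, beta, n=500, seed=0):
    """Choices from a Rescorla-Wagner + softmax agent on a two-armed
    bandit with reward probabilities 0.7 (arm 1) and 0.3 (arm 0)."""
    rng = random.Random(seed)
    q = [0.0, 0.0]
    data = []
    for _ in range(n):
        p1 = 1.0 / (1.0 + math.exp(-beta * (q[1] - q[0])))
        a = 1 if rng.random() < p1 else 0
        r = 1.0 if rng.random() < (0.7 if a == 1 else 0.3) else 0.0
        data.append((a, r))
        q[a] += alpha * (r - q[a])
    return data

def neg_log_lik(alpha, beta, data):
    """Negative log-likelihood of the observed choices under the model."""
    q = [0.0, 0.0]
    nll = 0.0
    for a, r in data:
        p1 = 1.0 / (1.0 + math.exp(-beta * (q[1] - q[0])))
        p = p1 if a == 1 else 1.0 - p1
        nll -= math.log(max(p, 1e-12))
        q[a] += alpha * (r - q[a])
    return nll

data = simulate(alpha=0.2, beta=3.0)
# Grid search over the learning rate (beta held at its true value);
# a real analysis would use a proper optimizer and fit both parameters.
best = min(((a / 20, neg_log_lik(a / 20, 3.0, data)) for a in range(1, 20)),
           key=lambda t: t[1])
print(best[0])
```

The recovered learning rate lands near the generating value, the basic sanity check (parameter recovery) the chapter recommends before fitting real data.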
Article
In order to select which action should be taken, an animal must weigh the costs and benefits of possible outcomes associated with each action. Such decisions, called cost-benefit decisions, likely involve several cognitive processes (including memory) and a vast neural circuitry. Rodent models have allowed research to begin to probe the neural basis of three forms of cost-benefit decision making: effort-, delay-, and risk-based decision making. In this review, we detail the current understanding of the functional circuits that subserve each form of decision making. We highlight the extensive literature by detailing the ability of dopamine to influence decisions by modulating structures within these circuits. Since acetylcholine projects to all of the same important structures, we propose several ways in which the cholinergic system may play a local modulatory role that will allow it to shape these behaviors. A greater understanding of the contribution of the cholinergic system to cost-benefit decisions will permit us to better link the decision and memory processes, and this will help us to better understand and/or treat individuals with deficits in a number of higher cognitive functions including decision making, learning, memory, and language.
Article
This paper presents an action selection technique for reinforcement learning in stationary Markovian environments. This technique may be used in direct algorithms such as Q-learning, or in indirect algorithms such as adaptive dynamic programming. It is based on two principles. The first is to define a local measure of the uncertainty using the theory of bandit problems. We show that such a measure suffers from several drawbacks. In particular, a direct application of it leads to algorithms of low quality that can be easily misled by particular configurations of the environment. The second basic principle was introduced to eliminate this drawback. It consists of assimilating the local measures of uncertainty to rewards, and back-propagating them with the dynamic programming or temporal difference mechanisms. This allows reproducing global-scale reasoning about the uncertainty, using only local measures of it. Numerical simulations clearly show the efficiency of these propositions.
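The paper's second principle, assimilating local uncertainty measures to rewards and back-propagating them with TD mechanisms, can be sketched as follows. This is a minimal toy, not the authors' algorithm: the local uncertainty bonus (1/sqrt of the visit count) and the optimistic initialization are assumptions made for illustration:

```python
import math

def explore_chain(n=7, steps=300, alpha=0.5, gamma=0.9, q0=10.0):
    """Q-learning on a chain of `n` states where the only 'reward' is a
    local uncertainty bonus, 1/sqrt(visit count). TD backups propagate
    uncertainty from rarely visited states, so a purely greedy agent
    (helped by optimistic initial values q0) explores the whole chain."""
    visits = [1] * n
    q = [[q0, q0] for _ in range(n)]    # action 0: step left, 1: step right
    s = n // 2
    for _ in range(steps):
        a = 0 if q[s][0] > q[s][1] else 1          # greedy on propagated uncertainty
        s2 = max(0, s - 1) if a == 0 else min(n - 1, s + 1)
        u = 1.0 / math.sqrt(visits[s2])            # local uncertainty measure
        q[s][a] += alpha * (u + gamma * max(q[s2]) - q[s][a])
        visits[s2] += 1
        s = s2
    return visits

print(explore_chain())
```

Even though each bonus is purely local, the TD backups pull the agent toward distant unvisited states, reproducing the global-scale reasoning about uncertainty the paper describes.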
Article
Evidence suggests that tobacco smoking and gambling frequently co-occur. Although high rates of comorbid smoking and gambling have been documented in studies with clinical populations of pathological gamblers in treatment, in studies using samples drawn from the community, and in large-epidemiological surveys, little empirical attention has been directed towards investigating the exact nature of this relationship. In this review, we stress the literature that has examined the epidemiology, aetiology and environmental factors implicated in comorbid smoking and gambling. Publications included in the review were identified through PsycInfo, PubMed and Medline searches. Although conclusive evidence is lacking, a growing body of literature suggests that smoking and gambling might share similar neurobiological, genetic and/or common environmental influences. Comorbid tobacco smoking and gambling are highly prevalent at the event and syndrome levels. However, research investigating how smoking might affect gambling or vice versa is currently lacking. More studies that examine the impact of this comorbidity on rates of tobacco dependence and problem gambling, as well as implications for treatment outcomes, are needed.
Article
Intracellular recordings were obtained from directly identified rat nigral dopamine cells in vivo. This identification was based on an increase in glyoxylic acid-induced catecholamine fluorescence in the impaled dopamine neurons. One of three compounds was injected intracellularly into each cell to produce the heightened fluorescence: (1) L-DOPA, to increase the intracellular dopamine content by precursor loading; (2) tetrahydrobiopterin, a cofactor for tyrosine hydroxylase, to increase intracellular dopamine concentration through activation of the rate-limiting enzyme for dopamine synthesis and (3) colchicine, to arrest intraneuronal transport and thus allow the build-up of dopamine synthesizing enzymes and dopamine in the soma. In addition, dopamine cells were antidromically activated from the caudate nucleus and collision with a directly elicited action potential was demonstrated. Identified dopamine neurons were shown to possess an input resistance of 31.2 ± 7.4 MΩ (mean ± SD) and a time constant of 12.1 ± 3.2 ms. The action potentials were of long duration (2.75 ± 0.5 ms) with a marked break between the initial segment and the somatodendritic spike components. The initial segment was the only component commonly elicited during antidromic activation. Spontaneously occurring action potentials were usually preceded by a slow, pacemaker-like depolarization. Burst firing by summation of depolarizing afterpotentials was observed to occur spontaneously, but could not be triggered by short depolarizing current pulses. Intravenously administered apomorphine demonstrated the same inhibitory effect on cell firing that was previously reported to occur when recording extracellularly from identified dopaminergic neurons. The determination of the electrophysiological characteristics of a population of cells directly identified as containing a specific neurotransmitter (in this case, dopamine) may allow one to construct better models of a system's functioning. Thus, the high input resistance and long time constant of dopamine-containing cells, combined with their burst/pause firing mode, may be important functionally with respect to a possible modulatory effect of dopamine in postsynaptic target areas.