Table 2. Model comparison criteria.

Source publication
Article
Full-text available
Compared with reward seeking, punishment avoidance learning is less clearly understood at both the computational and neurobiological levels. Here we demonstrate, using computational modelling and fMRI in humans, that learning option values in a relative—context-dependent—scale offers a simple computational solution for avoidance learning. The cont...

Contexts in source publication

Context 1
... model selection. For each model, we estimated the free parameters by likelihood maximization (to calculate the Akaike Information Criterion, AIC, and the Bayesian Information Criterion, BIC) and by Laplace approximation of the model evidence (to calculate the exceedance probability; Tables 2 and 3). ...
Context 2
... model selection. For each model, we estimated the free parameters by likelihood maximization (to calculate the Akaike Information Criterion, AIC, and the Bayesian Information Criterion, BIC) and by Laplace approximation of the model evidence (to calculate the exceedance probability; Tables 2 and 3). After post hoc analyses we found that the RELATIVE model better accounted for the data, in both fixed- and random-effects analyses (compared with the ABSOLUTE LL: T = 4.1, P < 0.001). ...
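A minimal sketch of the model-comparison step described in these contexts, assuming a hypothetical two-option Q-learning model: free parameters are estimated by likelihood maximization and the resulting log-likelihood is converted to AIC and BIC (the exceedance probability additionally requires a Laplace approximation of the model evidence, not shown). Parameter names, bounds, and the delta-rule update are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): fit a simple Q-learning model by
# likelihood maximization, then compute AIC and BIC from the maximized
# log-likelihood, as in the model comparison described above.
import numpy as np
from scipy.optimize import minimize

def negative_log_likelihood(params, choices, outcomes):
    """Negative log-likelihood of a two-option Q-learning model.
    params = (learning_rate, inverse_temperature); both names are assumptions."""
    alpha, beta = params
    q = np.zeros(2)                      # one value per option
    nll = 0.0
    for choice, outcome in zip(choices, outcomes):
        p_choice = np.exp(beta * q) / np.sum(np.exp(beta * q))  # softmax choice rule
        nll -= np.log(p_choice[choice] + 1e-12)
        q[choice] += alpha * (outcome - q[choice])               # delta-rule update
    return nll

def fit_and_score(choices, outcomes):
    n_params, n_trials = 2, len(choices)
    res = minimize(negative_log_likelihood, x0=[0.5, 3.0],
                   args=(choices, outcomes),
                   bounds=[(1e-3, 1.0), (1e-2, 20.0)])
    log_lik = -res.fun
    aic = 2 * n_params - 2 * log_lik                 # Akaike Information Criterion
    bic = n_params * np.log(n_trials) - 2 * log_lik  # Bayesian Information Criterion
    return res.x, aic, bic
```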

Similar publications

Article
Full-text available
The powerful allure of social media platforms has been attributed to the human need for social rewards. Here, we demonstrate that the spread of misinformation on such platforms is facilitated by existing social 'carrots' (e.g., 'likes') and 'sticks' (e.g., 'dislikes') that are dissociated from the veracity of the information shared. Testing 951 par...

Citations

... Among several features characterizing human RL, the notion of outcome (or reward) context dependence has recently risen to prominence 16. More specifically, a series of studies conducted mostly with Western, educated, industrialized, rich and democratic (WEIRD) populations 20 have shown that in many RL tasks participants encode outcomes (that is, rewards and punishments) in a context-dependent manner [21][22][23][24]. While there may not be a consensus yet concerning the exact functional form of such context dependency, the available findings seem to favour the idea that subjective outcomes are calculated relatively, following some form of range normalization [25][26][27]. ...
... In the present work, we sought to assess the cross-cultural stability of another recently discovered but well-documented feature of human behaviour: context-dependent RL. It is important to underscore that, however robust, the vast majority of the results concerning context effects in human RL to date come from WEIRD samples 16,[21][22][23][24][25][26],56. This severely limited the interpretation of context-dependent outcome encoding as a fundamental building block of human RL. ...
... It is important to underscore that, while for parsimony and commensurability purposes we modelled preferences in RL and lottery tasks with the same outcome-scaling model, this does not imply the assumption that both tasks share similar computational processes. Indeed, based on the present and other behavioural findings 13,21,26 it is likely that these different value-scaling schemes arise from different underlying computations altogether: respectively, outcome range adaptation in RL and diminishing marginal utility in lottery (see the Supplementary Information for further consideration). It is nonetheless important to note that here we are not claiming that context-dependent valuation is exclusive to choices based on experience (or reinforcement). ...
Article
Full-text available
Recent evidence indicates that reward value encoding in humans is highly context dependent, leading to suboptimal decisions in some cases, but whether this computational constraint on valuation is a shared feature of human cognition remains unknown. Here we studied the behaviour of n = 561 individuals from 11 countries of markedly different socioeconomic and cultural makeup. Our findings show that context sensitivity was present in all 11 countries. Suboptimal decisions generated by context manipulation were not explained by risk aversion, as estimated through a separate description-based choice task (that is, lotteries) consisting of matched decision offers. Conversely, risk aversion significantly differed across countries. Overall, our findings suggest that context-dependent reward value encoding is a feature of human cognition that remains consistently present across different countries, as opposed to description-based decision-making, which is more permeable to cultural factors.
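The excerpts above describe outcomes being encoded on a relative, context-dependent scale, with range normalization as one candidate functional form. Below is a hedged sketch of how such an encoding step could sit inside a standard Q-learning update; the specific scaling (outcome minus context minimum, divided by the context range) and the parameter values are illustrative assumptions rather than the cited papers' fitted models.

```python
# Hedged sketch of range-normalized outcome encoding in a Q-learning update.
# The exact functional form of context dependence differs across the cited studies;
# this is one common variant, shown for illustration only.
import numpy as np

def relative_q_update(q, choice, outcome, context_range, alpha=0.3):
    """Update the value of the chosen option using a context-scaled outcome.

    q             : array of option values within the current context
    context_range : (r_min, r_max), assumed known or learned for the context
    """
    r_min, r_max = context_range
    scaled = (outcome - r_min) / max(r_max - r_min, 1e-12)  # maps outcome to [0, 1]
    q[choice] += alpha * (scaled - q[choice])
    return q
```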
... It is well known that the medial prefrontal cortex (mPFC) is involved in decision-making during approach-avoidance conflicts (Bechara et al., 2002; Xue et al., 2009; Chen et al., 2013; Friedman et al., 2015; Kim et al., 2017; Monosov, 2017; Siciliano et al., 2019; Fernandez-Leon et al., 2021; Bloem et al., 2022; Jacobs et al., 2022). When the dorsomedial prefrontal cortex (dmPFC) strongly responds to risk, humans prefer less risky choices (Xue et al., 2009), and patients with a ventromedial prefrontal cortex (vmPFC) lesion can show hypersensitivity to reward (Bechara et al., 2002). ...
... When the dorsomedial prefrontal cortex (dmPFC) strongly responds to risk, humans prefer less risky choices (Xue et al., 2009), and patients with a ventromedial prefrontal cortex (vmPFC) lesion can show hypersensitivity to reward (Bechara et al., 2002). Many neurons in the macaque anterior cingulate cortex (ACC) represent the value and uncertainty of rewards and punishments (Monosov, 2017). In rats, the silencing of mPFC neurons promotes cocaine-seeking behavior under the risk of foot shock (Chen et al., 2013), and binge drinking of alcohol under the risk of a bitter tastant (quinine) (Siciliano et al., 2019), whereas activating the mPFC can attenuate these behaviors. ...
... The water reward was omitted at 90% of tone A trials with lever-pull and 10% of tone B trials with lever-pull. To better understand how positive punishment (air-puff) and negative punishment (reward omission) affected the choice behavior (pull or non-pull) of the mouse, we constructed Q-learning models with a maximum of five parameters (Palminteri et al., 2015; Tanimoto et al., 2020) to predict the choice behavior during the training sessions in the air-puff and omission tasks (Supplementary Table S1; see STAR Methods for details). On the basis of our previous study (Tanimoto et al., 2020), we assumed that these tasks included two choices, pull and non-pull, with separate values for pulling and not pulling the lever in each task. ...
Article
Full-text available
Reward-seeking behavior is frequently associated with risk of punishment. There are two types of punishment: positive punishment, which is defined as the addition of an aversive stimulus, and negative punishment, which involves the omission of a rewarding outcome. Although the medial prefrontal cortex (mPFC) is important in avoiding punishment, whether it is important for avoiding both positive and negative punishment and how it contributes to such avoidance are not clear. In this study, we trained male mice to perform decision-making tasks under the risks of positive (air-puff stimulus) and negative (reward omission) punishment, and modeled their behavior with reinforcement learning. Following the training, we pharmacologically inhibited the mPFC. We found that pharmacological inactivation of mPFC enhanced the reward-seeking choice under the risk of positive, but not negative, punishment. In reinforcement learning models, this behavioral change was well explained as an increase in sensitivity to reward, rather than a decrease in the strength of aversion to punishment. Our results suggest that mPFC suppresses reward-seeking behavior by reducing sensitivity to reward under the risk of positive punishment.
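The modelling result quoted above (an increase in reward sensitivity rather than a decrease in punishment aversion) can be illustrated with a two-action Q-learning agent in which separate sensitivity parameters scale rewarding and aversive outcomes before the value update. The sketch below is hypothetical and not the paper's exact model specification; all parameter names are assumptions.

```python
# Illustrative sketch: a pull / non-pull Q-learning agent in which reward
# sensitivity and punishment aversion scale outcomes before the value update.
import numpy as np

def simulate_trial(q, outcome_if_pull, alpha=0.2, beta=3.0,
                   reward_sensitivity=1.0, punishment_aversion=1.0):
    """q = [Q_pull, Q_nonpull]; outcome_if_pull in {+1 reward, -1 air-puff, 0 omission}."""
    p_pull = 1.0 / (1.0 + np.exp(-beta * (q[0] - q[1])))   # softmax over the two actions
    action = int(np.random.rand() >= p_pull)               # 0 = pull, 1 = non-pull
    outcome = outcome_if_pull if action == 0 else 0.0
    utility = (reward_sensitivity * outcome if outcome >= 0
               else punishment_aversion * outcome)
    q[action] += alpha * (utility - q[action])              # delta-rule update
    return action, q
```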
... Specifically, it suggests that studies using different types of stimuli may be incomparable, affecting the generalizability of results. For instance, brain processes associated with learning abstract stimuli (e.g., Daw et al., 2011; Frank et al., 2005; Palminteri et al., 2015; Pessiglione et al., 2006) may not be generalizable to such processes associated with concrete stimuli (e.g., Eppinger et al., 2008; van den Bos et al., 2009). Also, this finding suggests that stimulus choice affects the validity of results. ...
Article
Full-text available
Abstract stimuli (e.g., characters or fractals) and concrete stimuli (e.g., pictures of everyday objects) are used interchangeably in the reinforcement-learning literature. Yet, it is unclear whether the same learning processes underlie learning from these different stimulus types. In two preregistered experiments (N = 50 each), we assessed whether abstract and concrete stimuli yield different reinforcement-learning performance and whether this difference can be explained by verbalization. We argued that concrete stimuli are easier to verbalize than abstract ones, and that people can therefore appeal to the phonological loop, a subcomponent of the working-memory system responsible for storing and rehearsing verbal information, while learning. To test whether this verbalization aids reinforcement-learning performance, we administered a reinforcement-learning task in which participants learned either abstract or concrete stimuli while verbalization was hindered or not. In the first experiment, results showed a more pronounced detrimental effect of hindered verbalization for concrete than abstract stimuli on response times, but not on accuracy. In the second experiment, in which we reduced the response window, results showed the differential effect of hindered verbalization between stimulus types on accuracy, not on response times. These results imply that verbalization aids learning for concrete, but not abstract, stimuli and therefore that different processes underlie learning from these types of stimuli. This emphasizes the importance of carefully considering stimulus types. We discuss these findings in light of generalizability and validity of reinforcement-learning research.
... The present study uses the machine psychology framework to investigate the ability of LLMs to learn from past outcomes to make reward-maximizing choices. The focus is on relatively simple bandit tasks because of their simplicity and tractability, and because they have shed light on fundamental properties of reward encoding and belief updating in past studies [25,32,33]. In the bandit tasks used here, options are grouped in fixed pairings (or triplets) during the initial training phase. ...
Preprint
Full-text available
In-context learning enables large language models (LLMs) to perform a variety of tasks, including learning to make reward-maximizing choices in simple bandit tasks. Given their potential use as (autonomous) decision-making agents, it is important to understand how these models perform such reinforcement learning (RL) tasks and the extent to which they are susceptible to biases. Motivated by the fact that, in humans, it has been widely documented that the value of an outcome depends on how it compares to other local outcomes, the present study focuses on whether similar value encoding biases apply to how LLMs encode rewarding outcomes. Results from experiments with multiple bandit tasks and models show that LLMs exhibit behavioral signatures of a relative value bias. Adding explicit outcome comparisons to the prompt produces opposing effects on performance, enhancing maximization in trained choice sets but impairing generalization to new choice sets. Computational cognitive modeling reveals that LLM behavior is well-described by a simple RL algorithm that incorporates relative values at the outcome encoding stage. Lastly, we present preliminary evidence that the observed biases are not limited to fine-tuned LLMs, and that relative value processing is detectable in the final hidden layer activations of a raw, pretrained model. These findings have important implications for the use of LLMs in decision-making applications.
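The abstract above concludes that LLM choices are well described by a simple RL algorithm that incorporates relative values at the outcome encoding stage. A minimal sketch of that idea follows, assuming a running-average context value and a mixing weight w between absolute and relative encoding; both are assumptions for illustration, not the paper's fitted model.

```python
# Minimal sketch of "relative values at the outcome encoding stage": the reward
# for the chosen option is re-expressed relative to the average outcome observed
# within the same choice set before the value update.
import numpy as np

def relative_encoding_update(q, context_value, choice, reward,
                             alpha=0.3, alpha_ctx=0.3, w=0.5):
    """q: option values; context_value: running average outcome of this choice set."""
    context_value += alpha_ctx * (reward - context_value)      # track the local context
    encoded = (1 - w) * reward + w * (reward - context_value)  # blend absolute and relative
    q[choice] += alpha * (encoded - q[choice])
    return q, context_value
```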
... Numerous studies have endeavored to elucidate the mechanistic underpinnings of changes in risk preference by employing diverse computational models with intricate utility functions tailored to specific contexts [23][24][25][26][27][28]. Notably, models with divisive normalization [24,27,29,30], reference-point centering [26,31,32], and range adaptation [26,33,32] have prominently featured in this exploration. However, while these studies proficiently elucidate the mechanisms through which dynamic modulation of risk preference arises as a consequence of learning-based utility modulation, they lack a comprehensive articulation of the adaptive significance that underlies the specific structure of the utility function and its evolution over time. ...
... As such, we did not explicitly test models incorporating a dynamic utility function. Noteworthy concepts in reinforcement learning and decision-making literature, such as the negative contrast effect [60], reference-point centering [26,31,32], range adaptation [26,33,32], and divisive normalization [24,27,29,30], are likely to provide mechanistic accounts to our findings. For example, the two environment-related regressors may be interpretable as range adaptions at two different time scales. ...
Article
Full-text available
Changes in risk preference have been reported when making a series of independent risky choices or non-foraging economic decisions. Behavioral economics has put forward various explanations for specific changes in risk preference in non-foraging tasks, but a consensus regarding the general principle underlying these effects has not been reached. In contrast, recent studies have investigated human economic risky choices using tasks adapted from foraging theory, which require consideration of past choices and future opportunities to make optimal decisions. In these foraging tasks, human economic risky choices are explained by the ethological principle of fitness maximization, which naturally leads to dynamic risk preference. Here, we conducted two online experiments to investigate whether the principle of fitness maximization can explain risk preference dynamics in a non-foraging task. Participants were asked to make a series of independent risky economic decisions while the environmental richness changed. We found that participants’ risk preferences were influenced by the current and past environments, making them more risk-averse during and after the rich environment compared to the poor environment. These changes in risk preference align with fitness maximization. Our findings suggest that the ethological principle of fitness maximization might serve as a generalizable principle for explaining dynamic preferences, including risk preference, in human economic decision-making.
... Increasing the plausibility that RNT functions to deflate expectations in this way, prediction errors (which are defined as outcomes minus expectations) have a strong influence on momentary mood [74]. However, the mechanics of ascribing reinforcement to an action that is aversive, yet which is believed to preclude a more-aversive outcome, are unclear (but see, e.g., [75,76], for general evidence that counterfactual outcomes and not choosing an action both influence reinforcement learning). ...
Article
Full-text available
Repetitive negative thinking (RNT) is a transdiagnostic construct that encompasses rumination and worry, yet what precisely is shared between rumination and worry is unclear. To clarify this, we develop a meta-control account of RNT. Meta-control refers to the reinforcement and control of mental behavior via computations similar to those that reinforce and control motor behavior. We propose that rumination and worry are coarse terms for failure in meta-control, just as tripping and falling are coarse terms for failure in motor control. We delineate four meta-control stages and risk factors increasing the chance of failure at each, including open-ended thoughts (stage 1), individual differences influencing subgoal execution (stage 2) and switching (stage 3), and challenges inherent to learning adaptive mental behavior (stage 4). Distinguishing these stages therefore elucidates diverse processes that lead to the same behavior of excessive RNT. Our account also subsumes prominent clinical accounts of RNT into a computational cognitive neuroscience framework.
... Recent trends give MAS research in economics a whole new potential range of realism, coming from the association of two present-day major scientific breakthroughs: (i) the steady advances of cognitive neuroscience and neuroeconomics [14,15], and (ii) the progress of machine learning due to the increasing computational power and use of big data methods [16] over the past decade. Even more promising is the synergy of these two fields, with the emergence of machine learning algorithms incorporating decision-theoretic features from neuroeconomics [17,18], or neuroscience models approached from the angle of machine learning [19,20]. Studies have also examined cognitive traits and biases in individual financial decision making [14], yet few have revealed the global impact of individual cognitive traits and biases in large populations of economic agents on the quantifiable financial market dynamics [21]. ...
... Such approaches have also used methods from reinforcement learning [22][23][24] or adaptive learning [25,26]. The framework of reinforcement learning has multiple parallels with decision processes in the brain [17][18][19][20]. Reinforcement learning hence computationally offers the possibility to quantitatively study the agent learning side of price formation, which is so crucial to basic market activity. ...
... Its boundaries are drawn from the literature in neuroscience on the values of the learning rate [17,18,74]. ...
Article
Full-text available
Recent advances in the field of machine learning have yielded novel research perspectives in behavioural economics and financial markets microstructure studies. In this paper we study the impact of individual trader learning characteristics on markets using a stock market simulator designed with a multi-agent architecture. Each agent, representing an autonomous investor, trades stocks through reinforcement learning, using a centralized double-auction limit order book. This approach allows us to study the impact of individual trader traits on the whole stock market at the mesoscale in a bottom-up approach. We chose to test three trader trait aspects: agent learning rate increases, herding behaviour and random trading. As hypothesized, we find that larger learning rates significantly increase the number of crashes. We also find that herding behaviour undermines market stability, while random trading tends to preserve it.
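As a rough illustration of the agent side of such a simulator (not the paper's architecture), a reinforcement-learning trader can be reduced to softmax action selection over buy/hold/sell values that are updated from realised profit, with the learning rate constrained to a plausible range in the spirit of the neuroscience literature cited in the excerpts above. The class and parameter names below are hypothetical.

```python
# Illustrative sketch only: a trader agent with softmax action selection over
# buy / hold / sell values, updated from realised profit with a bounded learning rate.
import numpy as np

class RLTrader:
    def __init__(self, learning_rate=0.1, beta=2.0, seed=0):
        assert 0.01 <= learning_rate <= 0.5   # assumed plausible-range bound
        self.alpha, self.beta = learning_rate, beta
        self.q = np.zeros(3)                  # action values: buy, hold, sell
        self.rng = np.random.default_rng(seed)

    def act(self):
        p = np.exp(self.beta * self.q)
        p /= p.sum()
        return self.rng.choice(3, p=p)        # 0 = buy, 1 = hold, 2 = sell

    def learn(self, action, profit):
        self.q[action] += self.alpha * (profit - self.q[action])
```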
... Notably, we departed from the original model 36 by modelling neutral outcomes as rewards (+1) in the punishment conditions and as punishments (−1) in the reward conditions. This reflects the assumption that participants perceive outcomes in a context-dependent manner 37,38. ...
... Specifically, the model failed to account for participants' correct responses in the avoid-punishment conditions. Previous work using other reinforcement learning tasks suggests that participants perceive outcomes in a context-dependent manner 37,38. On the basis of this idea, we were able to resolve the disagreement between the predicted and observed data by introducing a minor modification to the model, allowing the agent to learn from neutral outcomes (see Methods). ...
Article
Full-text available
Computational phenotyping has emerged as a powerful tool for characterizing individual variability across a variety of cognitive domains. An individual’s computational phenotype is defined as a set of mechanistically interpretable parameters obtained from fitting computational models to behavioural data. However, the interpretation of these parameters hinges critically on their psychometric properties, which are rarely studied. To identify the sources governing the temporal variability of the computational phenotype, we carried out a 12-week longitudinal study using a battery of seven tasks that measure aspects of human learning, memory, perception and decision making. To examine the influence of state effects, each week, participants provided reports tracking their mood, habits and daily activities. We developed a dynamic computational phenotyping framework, which allowed us to tease apart the time-varying effects of practice and internal states such as affective valence and arousal. Our results show that many phenotype dimensions covary with practice and affective factors, indicating that what appears to be unreliability may reflect previously unmeasured structure. These results support a fundamentally dynamic understanding of cognitive variability within an individual.
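The recoding of neutral outcomes mentioned in the excerpts above (treated as +1 in punishment conditions and −1 in reward conditions) can be written as a one-line mapping; the function below is only a sketch of that assumption, with illustrative argument names.

```python
# Sketch of the context-dependent recoding described above: neutral outcomes
# count as rewards (+1) in punishment conditions and as punishments (-1) in
# reward conditions. Names and encodings are assumptions for illustration.
def recode_outcome(outcome, condition):
    """outcome in {'reward', 'neutral', 'punishment'}; condition in {'reward', 'punishment'}."""
    if outcome == 'neutral':
        return 1.0 if condition == 'punishment' else -1.0
    return 1.0 if outcome == 'reward' else -1.0
```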
... Context-dependent value distortion has been associated with neurobiological constraints and the premise that neuronal firing adapts to the characteristics of the stimuli in the context in order to maximize the efficiency of the neural code 17,[20][21][22]. In particular, in the range normalization model, each alternative value is divided by the range of all values (i.e., maximum minus minimum) in the current context [22][23][24][25]. ...
Preprint
Contrary to the predictions of normative theories, choices between two high-value alternatives can be biased by the introduction of a third low-value alternative (dubbed the distractor effect). Normalization-based theories, like divisive and range normalization, explain different forms of the distractor effect by suggesting that the value of each alternative is normalized by a summary statistic of the values encountered in a particular decision context. The decision context can include alternatives encountered over an extended timeframe (temporal context); and alternatives that are available for choice on a given instance (immediate context). To date, the extent to which the immediate and temporal context (co-) shape context-dependent value representations remains unclear. To investigate this, we designed a task in which participants learned the values associated with three different alternatives and provided explicit value estimates before making a series of choices among ternary and binary combinations of those alternatives. We show that context-dependence already emerges in the pre-choice value estimates and is equally present in binary and ternary choice trials. Based on these findings, we conclude that the temporal (and not the immediate) context modulates subjective value representations. Interestingly, the functional form of context-dependence we report runs against both divisive and range normalization theories. Instead, our data are best explained by a stochastic rank-based model, according to which the value of an alternative is distorted by a series of memory-based binary comparisons with previously encountered alternatives.
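Below is a hedged sketch contrasting the two normalization schemes discussed above. Exact functional forms differ between papers: range normalization is written here as division by the context range (maximum minus minimum), following the description in the excerpt, and divisive normalization as division by a saturation constant plus the sum of values; the constant sigma is an illustrative assumption.

```python
# Sketch of the two value-normalization schemes discussed above; common textbook
# versions, not the exact formulations fitted in the cited work.
import numpy as np

def divisive_normalization(values, sigma=1.0):
    """Each value divided by a saturation term plus the sum of all values."""
    values = np.asarray(values, dtype=float)
    return values / (sigma + values.sum())

def range_normalization(values):
    """Each value divided by the range (max minus min) of values in the context,
    per the description above; some variants also subtract the minimum first."""
    values = np.asarray(values, dtype=float)
    span = max(values.max() - values.min(), 1e-12)
    return values / span
```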
... To better understand how positive punishment (air-puff) and negative punishment (reward omission) affected the choice behavior (pull or non-pull) of the mouse, we constructed Q-learning models with a maximum of five parameters 5,24 to predict the choice behavior during the training sessions in the air-puff and omission tasks (Table S1; see STAR Methods for details). On the basis of our previous study 5, we assumed that these tasks included two choices, pull and non-pull, and there were values of pulling the lever (Qpull) and non-pulling of the lever (Qnon-pull) for both tone A and B trials in each task. ...
... To examine whether the best-fitting model was also the best for generating pull-choice behavior similar to that shown by the actual mice, we conducted a model simulation 2,5,24,29. For each mouse, we used the fitted parameters in the S-F and P-S-F models to simulate the lever-pull choice (1, pull; 0, non-pull) in each trial in the order of the actual tone A and B trials (Figures 3A, 3D, and S2A-S2D). ...
Preprint
Full-text available
Reward-seeking behavior is frequently associated with risk of punishment. There are two types of punishment: positive, resulting in an unpleasant outcome, and negative, resulting in omission of a reinforcing outcome. Although the medial prefrontal cortex (mPFC) is important in avoiding punishment, whether it is important for avoiding both positive and negative punishment and how it contributes to such avoidance are not clear. In this study, we trained male mice to perform decision-making tasks under the risks of positive (air-puff stimulus) and negative (reward omission) punishment. We found that pharmacological inactivation of mPFC enhanced the reward-seeking choice under the risk of positive, but not negative, punishment. In reinforcement learning models, this behavioral change was well-explained by hypersensitivity to the reward, rather than a decrease in the strength of aversion to punishment. Our results suggest that mPFC suppresses reward-seeking behavior by reducing sensitivity to reward under the risk of positive punishment.
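A sketch of the model-simulation step quoted in the excerpts above: fitted parameters are used generatively, replaying the actual tone sequence and sampling lever-pull choices so that simulated and observed behaviour can be compared. The model below is a generic two-action Q-learner per tone, not the papers' exact S-F or P-S-F specification; outcome_fn, the parameter names, and the seed are assumptions.

```python
# Hedged sketch of a generative model simulation with fitted parameters:
# replay the actual tone order and sample pull / non-pull choices from the model.
import numpy as np

def simulate_session(trial_tones, outcome_fn, alpha, beta, n_tones=2, seed=0):
    """trial_tones: sequence of tone ids in the actual trial order.
    outcome_fn(tone, action) returns the (scaled) outcome for that trial."""
    rng = np.random.default_rng(seed)
    q = np.zeros((n_tones, 2))            # value of pull / non-pull per tone
    simulated_choices = []
    for tone in trial_tones:
        p_pull = 1.0 / (1.0 + np.exp(-beta * (q[tone, 0] - q[tone, 1])))
        action = 0 if rng.random() < p_pull else 1    # 0 = pull, 1 = non-pull
        outcome = outcome_fn(tone, action)
        q[tone, action] += alpha * (outcome - q[tone, action])
        simulated_choices.append(1 - action)          # record 1 = pull, 0 = non-pull
    return simulated_choices
```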