Nature Human Behaviour | Review article
https://doi.org/10.1038/s41562-023-01670-1
The challenges and prospects of brain-based
prediction of behaviour
Jianxiao Wu1,2, Jingwei Li1,2, Simon B. Eickhoff1,2, Dustin Scheinost3,4,5,6,7 & Sarah Genon1,2
Relating individual brain patterns to behaviour is fundamental in systems
neuroscience. Recently, the predictive modelling approach has become
increasingly popular, largely due to the recent availability of large open
datasets and access to computational resources. This means that we can use
machine learning models and interindividual differences at the brain level
represented by neuroimaging features to predict interindividual differences
in behavioural measures. By doing so, we could identify biomarkers and
neural correlates in a data-driven fashion. Nevertheless, this budding field
of neuroimaging-based predictive modelling is facing issues that may limit
its potential applications. Here we review these existing challenges, as well
as those that we anticipate as the field develops. We focus on the impacts
of these challenges on brain-based predictions. We suggest potential
solutions to address the resolvable challenges, while keeping in mind that
some general and conceptual limitations may also underlie the predictive
modelling approach.
The study of the relationships between individual differences in brain
phenotypes and individual behaviours is fundamental in neuroscience,
from both a basic scientific perspective and an applied perspective.
The term ‘predictive modelling’ refers to the use of machine learning
techniques to build a statistical model for the estimation of behavioural
variables from brain-based neuroimaging data, either structural or
functional1,2. More precisely, a prediction model is trained to predict
particular behavioural variables from brain-based data from a number
of individuals (the training set), and its performance is then evaluated
on unseen data (test set).
The potential practical applications promised by such prediction
approaches in precision medicine, health care, human resources and
education1,3–5 are certainly exciting. Potential future applications may
include the prediction of individual treatment outcomes to guide
treatment choices and dosage, the classification of clinical subgroups
with different brain pathology and thus different treatment require-
ments, and the prediction of future cognitive abilities and mental
health at the developmental stage. As concrete examples that could
be envisioned, brain-based predictions may provide objective bio-
markers when evaluating the effect of cognitive training or cognitive
behavioural therapies (for example, for mild functional cognitive
alterations and anxio-depressive phenotypes, respectively). Although
the effect of these interventions could be more readily investigated
with standard cognitive tests and interviews/questionnaires, respec-
tively, such approaches are prone to many biases (for example, prac-
tice effects, subjectivity biases and expectations biases). As a recent
working example, a prediction model of sustained attention provided
a neuromarker of sustained attention6. This neuromarker can be used
both for predicting attention deficit symptoms and for localizing
targets of potential brain-based treatments. Ultimately, brain-based
prediction could be expected to provide objective biomarkers that
can inform us about the brain mechanisms behind the effects under
scientific investigation. Aided by large, publicly available neuroimag-
ing datasets, accessible computational resources and code sharing
practices, predictive modelling has become a powerful tool for working
towards these future outlooks.
Received: 20 December 2022
Accepted: 27 June 2023
Published online: xx xx xxxx
1Institute of Neuroscience and Medicine, Brain & Behaviour (INM-7), Research Center Jülich, Jülich, Germany. 2Institute for Systems Neuroscience, Medical
Faculty, Heinrich-Heine University Düsseldorf, Düsseldorf, Germany. 3Department of Radiology and Biomedical Imaging, Yale School of Medicine, New
Haven, CT, USA. 4Department of Statistics and Data Science, Yale University, New Haven, CT, USA. 5Child Study Center, Yale School of Medicine, New
Haven, CT, USA. 6Interdepartmental Neuroscience Program, Yale School of Medicine, New Haven, CT, USA. 7Department of Biomedical Engineering,
Yale School of Engineering and Applied Sciences, New Haven, CT, USA. e-mail: j.wu@fz-juelich.de; s.genon@fz-juelich.de
While big data and deep learning have enabled substantial suc-
cesses in many fields, neither has been particularly helpful in improv-
ing the performance of brain-based prediction models. To begin with,
even the easy-to-collect rs-fMRI data are considerably more difficult
to collect than pictures or texts, which are typically used in the fields
of computer vision and natural language processing, respectively.
The lack of truly big data in cognitive neuroscience may explain why
deep learning has often been reported to not outperform simpler
models1,24,27. The potential of deep learning as a more powerful type
of model would thus depend on the possibility of collecting truly big
neuroimaging datasets.
Alternatively, techniques such as few-shot learning could inspire
new solutions to utilize deep learning without acquiring big data.
From the data perspective, the few-shot learning strategy called data
augmentation can be employed to artificially increase the sample size.
Furthermore, simulated rs-fMRI and RSFC data have recently been used
to generate additional datasets30–33. Their applications for predictive
modelling of behaviour remain to be further investigated. From the
parameter perspective, the meta-learning paradigm of few-shot learn-
ing can be useful by training a generalized model on a large dataset,
which can be used for the prediction of different targets in smaller
datasets34. Nevertheless, both strategies impose some requirements
and may not appear beneficial for all types of brain-based predictions.
Augmented or simulated data are limited by the characteristics of
the existing data used for augmentation or simulation. Accordingly,
a non-representative dataset (for example, including only a certain
age group or ethnicity) cannot become population representative
through augmentation. The performance of the meta-learning strategy
depends on the similarity of the prediction target in the large dataset
and the prediction target in the smaller dataset34. This means that the
meta-learning model would be beneficial only for smaller datasets that
use the same or very similar instruments for behavioural measurement
as those existing in the larger datasets in which the original model
is developed.
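As a rough sketch of the data-augmentation idea (an illustration only, not a procedure taken from the studies cited above), connectivity features could be jittered with feature-scaled noise; the function name, array shapes and noise scale below are hypothetical placeholders.

```python
# Minimal sketch of noise-based augmentation for RSFC features.
# Hypothetical shapes: X is (n_subjects, n_edges), y is (n_subjects,).
import numpy as np

def augment_rsfc(X, y, n_copies=2, noise_scale=0.1, seed=0):
    """Create jittered copies of each subject's connectivity profile."""
    rng = np.random.default_rng(seed)
    edge_sd = X.std(axis=0, keepdims=True)      # per-edge variability
    X_aug, y_aug = [X], [y]
    for _ in range(n_copies):
        noise = rng.normal(0.0, noise_scale, size=X.shape) * edge_sd
        X_aug.append(X + noise)                  # jittered features
        y_aug.append(y)                          # targets are unchanged
    return np.vstack(X_aug), np.concatenate(y_aug)

# Example: triple the nominal sample size of a small cohort.
X = np.random.rand(100, 4950)   # e.g. lower triangle of a 100 x 100 FC matrix
y = np.random.rand(100)
X_big, y_big = augment_rsfc(X, y)
print(X_big.shape)              # (300, 4950)
```

As emphasized above, such copies inherit the characteristics (and the biases) of the original sample; augmentation changes the nominal sample size, not the population represented.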
It may instead be more feasible to capitalize on existing data,
including neuroimaging features from multiple modalities, to boost
prediction accuracies. Structural, functional and diffusion MRI probe
different neurobiological aspects, offering complementary informa-
tion for psychometric prediction. In prediction studies based on fMRI,
resting-state and task fMRI features are often combined13,35–37. However,
the benefit of combining these features in terms of prediction perfor-
mance has not been comprehensively investigated. Prediction studies
using multimodal data have found that different types of features
contribute to the prediction, including local connectome18, cortical
area18, cortical thickness17, grey matter volume21, RSFC17,38–40 and task
functional connectivity39,41. Some studies have reported that inte-
grating multimodal MRI data did not actually improve the prediction
performance compared with using a single modality21,39. Furthermore,
combining multimodal features inevitably increases the feature dimen-
sion and in turn the risk of overfitting, requiring feature selection or
reduction techniques, such as stacking18,38,41. Generally, a systematic
evaluation of multimodal psychometric prediction across multiple
distinct cohorts, with an extensive set of neuroimaging features, psy-
chometric measures and model design, would be an important next
step for validating this research direction.
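To make the stacking idea concrete, a minimal sketch is shown below; the three modality feature arrays, their sizes and the choice of ridge regression are hypothetical placeholders, and the first-level predictions are obtained out of fold so that the second-level model does not see the target through them.

```python
# Minimal stacking sketch: one ridge model per modality, then a second-level
# model trained on their cross-validated predictions. All data are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 400
modalities = {
    "rsfc": rng.normal(size=(n, 4950)),       # functional connectivity edges
    "thickness": rng.normal(size=(n, 360)),   # cortical thickness per parcel
    "gmv": rng.normal(size=(n, 360)),         # grey matter volume per parcel
}
y = rng.normal(size=n)                        # psychometric score

# First level: out-of-fold predictions per modality avoid target leakage.
first_level = np.column_stack([
    cross_val_predict(Ridge(alpha=1.0), X_mod, y, cv=5)
    for X_mod in modalities.values()
])

# Second level: combine the three modality-specific predictions.
stacker = LinearRegression().fit(first_level, y)
print(dict(zip(modalities, stacker.coef_.round(3))))
```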
Moreover, psychometric prediction accuracies depend on the
target psychometric variable to be predicted. For behavioural traits in
cognition and socio-affective domains, the definition of the abstract
constructs measured by many behavioural variables and the con-
struct validity of these variables are still debated42–44. The reliability
and validity of these behavioural traits require improvement through
both theoretical and experimental validations. Interestingly, many
studies have reported higher prediction accuracies for cognitive meas-
ures than for mental health traits5,34,38,45,46. It may be assumed that the
prediction of mental health would be particularly difficult in healthy
populations because the participants would show very limited variations
in mental health measures. As the largest and highest-quality
neuroimaging datasets open to the research community include mainly
healthy populations, studies attempting to develop predictive models
of mental health may be limited either by data availability and quality
for clinical populations or by lower prediction accuracies when using
easily accessible data. Relatedly, low test–retest reliability of fMRI
measures may be another source of poor prediction accuracies47,48. As
the reliability of connectivity features computed may depend on data
collection protocols49–51, the selection of reliable data would further
restrict the available sample size.
Among the various types of neuroimaging data, functional data
may be an intuitive choice for relating brain organization to behav-
ioural functions. In particular, task-free resting-state functional mag-
netic resonance imaging (rs-fMRI) scans can be readily collected for
large groups of study participants7, making them popular choices for
neuroimaging-based predictions. In the past ten years, resting-state
functional connectivity (RSFC) has been the most popular input fea-
ture to brain–behaviour prediction models2, in predictions of various
phenotypes including fluid intelligence8–10, attention6,11,12 and working
memory13–15. Brain-based psychometric prediction using other features
such as task-based functional connectivity, grey matter volume, corti-
cal thickness and structural connectivity has also been investigated in
predictions of general cognitive abilities16–18, attentional control19 and
working memory20,21. However, the majority of studies forming the
scientific literature thus far have used RSFC alone or in combination
with other features for psychometric prediction (although this may be
expected to change in the future)2.
As a budding and growing field, brain-based psychometric predic-
tions remain to be improved and validated. Many reviews have analysed
methodological options based on the current state of the field and
given guidance for future studies1,2,4,5,22–24. Practical tutorials have also
been published for guidance on specific implementation details22,25.
Nevertheless, the field faces general and conceptual issues that are
likely to limit the future usefulness of predictive modelling.
In this Review, we discuss the current and anticipated future chal-
lenges in psychometric prediction based on neuroimaging features. For
each challenge, we identify both inherent limitations in brain-based
psychometric predictions that may not be readily solved on the basis
of current resources and aspects that could be addressed with potential
solutions. In the following sections, we discuss the general challenges
of low prediction accuracies, followed by two core issues, generaliz-
ability and interpretability. Finally, we briefly discuss the potential
vulnerability of brain-based prediction models to enhancement and
adversarial attacks.
Low prediction accuracies
Low prediction accuracies limit any potential application of the model.
The general procedure of prediction model development and valida-
tion is described in Fig. 1. A prediction model is assessed by applying
it in a validation sample separate from the training sample and by
measuring the similarity or dissimilarity between the values predicted
for the participants in this sample and the truly observed values of the
psychometric variable for these participants (Box 1). Figure 2 shows
three examples of the most commonly used measure of model accu-
racy (Pearson’s correlation coefficient) and the predicted–observed
relationships underlying the accuracies. This measure indicates the
global linear trend between predicted and observed values, but it can-
not identify systematic biases or the size of errors. Presently, predic-
tion accuracies of various psychometric variables have been reported
from as low as 0.06 to as high as 0.908 (refs. 1,2). This wide range of
accuracies with both low and high values close to the value bounds
reflects the complexity of brain-based psychometric prediction study
design, as model accuracy can be affected by methodological deci-
sions and data characteristics (for example, the amount of relevant
variance in behavioural and/or brain data). While many studies that
showed high prediction accuracies appear to have used very small
samples, in studies using large samples, the prediction accuracies
are usually reported in the range of 0.2 to 0.4 (refs. 26–29), implying
a generally lower accuracy when evaluating brain-based predictions
in population-representative samples. A recent literature survey2 has
evidenced a correlation of r = −0.265 between the sizes of the training
samples and the reported prediction accuracies, demonstrating the
generality of this trend.
One pessimistic view is that current modelling approaches may not
be able to handle the heterogeneity of the population-representative
samples, or that the brain–behaviour relationships captured in neuro-
imaging datasets may simply be too weak52–55. Neuroimaging patterns
may be a reduced summary of endogenous factors and the exposome
that has a limited power to explain interindividual variability in behav-
iour. Crucially, brain-based prediction models need to be justified
on the basis of the additional predictive power not already provided
by non-neuroimaging features that can be easily collected through
questionnaires and interviews, especially considering the high cost
of MRI. From a practical standpoint, it may be useful to investigate the
prediction performance of hybrid models making use of all types of
data available in a realistic situation. For instance, RSFC patterns may
vary with age due to developmental effects in younger populations
and due to ageing in older populations. Similarly, cognitive measures
may be affected by age differently in different age subgroups. Allowing
the prediction model to learn these interactions across a large age
range would thus help the model to predict the target variable more
accurately in general. Finally, the variability of brain–behaviour
association patterns across different subgroups brings forth another
crucial challenge: model generalizability.
Fig. 1 | Model development and validation for neuroimaging-based
psychometric predictions. A machine learning model is first trained using
neuroimaging features and psychometric scores from participants 1 to n (from
the training set). The model learns a relationship between the neuroimaging
features and the psychometric scores. For validation, the model takes in
neuroimaging features from participants n + 1 to n + M (from the test set) and
outputs predicted values for the psychometric scores. The predicted scores can
then be compared to the actual scores using various accuracy measures (Box 1)
to evaluate the performance of the model. To assess the generalizability of the
model, the model needs to be applied to a new dataset in a similar way to its
application in the test set. SHAP, Shapley additive explanation.
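A minimal sketch of this workflow, with synthetic arrays standing in for RSFC features and psychometric scores from two cohorts, could look as follows; the ridge model, split proportion and regularization grid are arbitrary illustrative choices.

```python
# Sketch of the Fig. 1 workflow: train and validate within dataset 1, then
# apply the frozen model to dataset 2 to assess generalization. Data are synthetic.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X1, y1 = rng.normal(size=(800, 4950)), rng.normal(size=800)   # dataset 1
X2, y2 = rng.normal(size=(300, 4950)), rng.normal(size=300)   # dataset 2

# Model development (training) and validation (test) within dataset 1.
X_tr, X_te, y_tr, y_te = train_test_split(X1, y1, test_size=0.3, random_state=0)
model = RidgeCV(alphas=(0.1, 1.0, 10.0)).fit(X_tr, y_tr)
r_internal, _ = pearsonr(y_te, model.predict(X_te))

# Model generalization: apply the trained model unchanged to dataset 2.
r_external, _ = pearsonr(y2, model.predict(X2))
print(f"internal r = {r_internal:.2f}, external r = {r_external:.2f}")
```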
Generalizability of prediction models
The utility of a prediction model depends on its generalizability—that
is, its ability to make accurate predictions on unseen data, first the test
set data and ultimately data from the broader population. In the context
of brain-based psychometric predictions, we discuss generalizability
in terms of both generalizing to completely new cohorts and general-
izing to different subgroups of the population.
Cross-cohort generalizability
Cross-cohort generalizability can be defined as the prediction perfor-
mance of a model in a different dataset from the training dataset (Fig. 1).
Generalizable models are important for discovering neurobiologi-
cal insights general to the population and for deploying prediction
models to broader settings. In most present brain-based prediction
studies, the training and test sets are drawn from the same cohort under
a cross-validation scheme2. Cross-validation helps to evaluate model
performance without requiring additional datasets, but to rigorously
test the cross-cohort generalizability of a model, it is necessary to evalu-
ate the model on completely unrelated datasets. Many studies employ-
ing both internal and external validation have found similar prediction
accuracies in internal and external test sets9,13,56–59. However, most of
these studies had small external test samples (n < 200), calling into
question the representativeness of these test cohorts. In two studies
with large test cohorts (n ~ 1,000), drops in prediction accuracies were
observed when generalizing to new cohorts26,60. It has been suggested
that reproducible brain–behaviour associations may be found only
by using samples with thousands of participants55,61. However, it has
also been shown that generalizable associations and predictions can
be achieved with much smaller samples in some specific cases62,63.
Additionally, it should be noted that the generalizability of a statistical
model is not a direct indication of the generalizability of a brain–behav-
iour association derived from the model, the latter showing a low to
moderate extent of generalizability across cohorts60.
At present, the main challenge from the perspective of
cross-cohort generalizability is the lack of awareness from scientific
investigators and hence the lack of assessment. The need for large
external test cohorts for evaluating prediction models is often over-
looked during the planning phase of a study and later dismissed on the
grounds that such large cohorts are not available for the specific psy-
chometric measure investigated. More generally, the cross-cohort gen-
eralizability of prediction models may be affected and limited by the
similarity of data collection and processing protocols in the different
cohorts60. The need for large datasets has led to researchers’ reliance
on whatever data is provided by the several publicly (or semi-publicly)
shared datasets. Many studies have trained and evaluated prediction
models using the Human Connectome Project Young Adult data, which
were processed with a specific pipeline not always adopted or viable
in other datasets64,65. Ideally, standardizing data collection protocols
and processing pipelines would improve model generalizability in
both research and practical situations. However, imaging conditions
in samples involving children or older adults often result in shorter
scan durations, making it difficult to achieve the same standards that
can be set in samples of healthy young adults66. The need for large
cohorts and varied data specifications may not be fully reconcilable.
Partial solutions would be more robust preprocessing strategies and
prediction models to harmonize data differences or to extract
generalizable information despite the data differences.
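As one crude illustration of such harmonization (a stand-in for dedicated tools such as ComBat, not a recommendation drawn from the studies reviewed here), features can be standardized separately within each cohort before pooling; note that this also removes genuine cohort-level differences along with protocol effects.

```python
# Per-cohort z-scoring as a minimal stand-in for harmonization. Arrays and
# cohort labels are hypothetical.
import numpy as np

def standardize_within_cohort(X, cohort_labels):
    """Z-score every feature separately within each cohort."""
    Xh = np.empty_like(X, dtype=float)
    for c in np.unique(cohort_labels):
        idx = cohort_labels == c
        mu = X[idx].mean(axis=0)
        sd = X[idx].std(axis=0) + 1e-12     # guard against constant features
        Xh[idx] = (X[idx] - mu) / sd
    return Xh

X = np.random.rand(500, 4950)
cohorts = np.repeat(["cohort_A", "cohort_B"], [300, 200])
X_harmonized = standardize_within_cohort(X, cohorts)
```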
Box 1
Measures of model accuracy
To evaluate a model in a validation sample, its predictions need
to be compared against the actual values of the psychometric
variable. The closer the predicted values are to the actual values,
the more accurate the model is. This degree of closeness can be
represented either by correlation metrics examining the linear trend
between all predicted and observed values or by error metrics
examining the absolute differences between each pair of predicted
and observed values.
The most common metric in the literature is Pearson's
correlation coefficient (r) between the predicted and observed
values2, measuring the normalized covariance between the two
variables (r = cov(pred, obs) / (σpred σobs)). This correlation
coefficient indicates
the extent to which a given increase or decrease in one variable
is associated with a similar increase or decrease in the other
variable. Similarly, Spearman’s correlation can be used to
measure the ranked correlation between predicted and observed
values, providing an indication of how well the two groups of
values are monotonically related.
Common error metrics include mean absolute error, mean
squared error (MSE) and root MSE, measuring the average
dierence between predicted and observed values in the validation
sample in slightly dierent manners. In general, the error values
should be normalized by the standard deviation (or the range of
predicted values for absolute errors) of the validation sample,
so that they are comparable to standardized measures from
other samples25.
A high correlation suggests that the predicted values are
generally higher when the observed values are higher, but it
does not mean that the predicted values are numerically close
to the observed values. As a result, the correlation metrics
cannot detect systematic biases where the predicted values are
consistently higher (or lower) than the observed values. It may be
recommended that high correlation accuracies should be validated
with error-based accuracies to check for systematic bias. However,
correlation metrics might be more useful when generalizing a
model to new data where the psychometric variables are similar
to but not the same as those in the training sample and numerical
closeness between the predicted and observed values may not
be required.
Finally, a useful metric for model evaluation is the coefficient of
determination (R2), providing a measure of the goodness of fit of the
model. A simple form of R2 may also be computed as the square of
the correlation coefficient from the correlation metric, or r2. It
should be noted that this r2 measures the goodness of fit between
the predicted–observed relation and its fitted line and hence is not a
direct measure of model fit itself. Using error metrics such as MSE,
the more general R2 can be computed as R2 = 1 − MSE/σ2, where σ2
is the variance of the observed values in the validation sample,
measuring the goodness of fit of the regression equation estimated
by the prediction model to the validation data. The R2 values can also
be interpreted as the ratio of variance explained by the model to the
total variance in the sample, offering an intuitive way to explain the
accuracies measured.
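The measures discussed in this box can be computed directly from the predicted and observed values; the sketch below uses hypothetical arrays and illustrates how a constant offset leaves r at its maximum while the error-based measures, and R2 in particular, reveal the bias.

```python
# Sketch of the accuracy measures described in this box (hypothetical data).
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(y_obs, y_pred):
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    r, _ = pearsonr(y_obs, y_pred)           # linear trend
    rho, _ = spearmanr(y_obs, y_pred)        # monotonic (ranked) trend
    mae = np.mean(np.abs(y_obs - y_pred))    # mean absolute error
    mse = np.mean((y_obs - y_pred) ** 2)     # mean squared error
    nrmse = np.sqrt(mse) / y_obs.std()       # RMSE normalized by observed SD
    r2 = 1.0 - mse / y_obs.var()             # R2 = 1 - MSE / sigma2
    return {"r": r, "rho": rho, "MAE": mae, "NRMSE": nrmse, "R2": r2}

# A systematic bias (constant offset) keeps r = 1 but pushes R2 below zero.
y_obs = np.random.normal(100, 15, size=200)
print(evaluate(y_obs, y_obs + 20))
```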
Generalizability across subgroups within a single dataset
Typically, the test set for evaluating a prediction model is randomly
selected from the cohort or taken from an external validation cohort.
The composition of the test set may be completely random or stratified
for balanced distributions of age, gender and other variables of interest.
While the model performances reflect the average performance in the
test cohort population, they are not informative of potential prediction
biases between test participants. In both medical and non-medical
applications, model bias has been reported for potential mistreat-
ments of subgroups based on gender, ethnicity and socio-economic
status67–69. In connectivity-based prediction, ethnicity-based bias has
been reported where prediction accuracies were lower in African Ameri-
can participants than in white American participants, even if the models
were trained on only African American participants70. Moreover, models
tend to predict lower cognitive scores and higher negative social behav-
iour scores for African American participants70, demonstrating the
potential biases in applications of the prediction models. Such robust
biases call for more balanced samples in scientific approaches, includ-
ing not only more data collection in underrepresented populations but
also the development of brain templates, atlases and preprocessing
tools based on balanced samples.
Common concepts used to define population subgroups such
as gender and ethnicity are complex notions themselves and are
often entangled with socio-economic factors. Relatedly, brain-based
prediction models do not see the population divided into distinct
gender-based or ethnic groups but have been shown to learn complex
profiles relating brain measures, covariates and psychometric vari-
ables70. It was recently demonstrated that individuals who do not follow
the majority trend of brain–phenotype relationships in the training
sample can cause consistent prediction failure71. For instance, if most
older participants in the training sample scored lower for a cognitive
test, a few older participants in the validation sample who scored high
for the cognitive test would become outliers and lead to prediction
failures. In other words, model bias may be caused by any form of
stereotypical brain–behaviour relationships in the training sample,
not specific to an ethnic or gender group. This could lead to further
difficulty in collecting balanced samples as these stereotypical
relationships can hardly be anticipated during the data collection phase.
Fig. 2 | Prediction accuracies measured by Pearson’s correlation. a, Scatter
plot of observed and predicted openness, with a Pearson’s correlation accuracy
of 0.24 (ref. 91), using averaged connectivity matrices across two resting-state
scans (REST12). The blue line shows the fitted line between the observed and
predicted values; the black dashed line marks the line with unit slope and
zero intercept. While the (standardized) observed values have a wide range of
variation (roughly between −20 and 15), the predicted values are tightly scattered
around zero. b, Scatter plot of observed and predicted scores of meaning and
purpose, with a Pearson’s correlation accuracy of 0.17 in African American
participants (AA) and 0.049 in white American participants (WA)70. The blue and
green lines show the fitted line between observed and predicted values in African
American and white American participants, respectively. The correlation appears
slightly higher in African Americans than in white Americans, but the prediction
errors may actually be greater in the former group. c, Scatter plot of observed
and predicted visual working memory performance, with a Pearson’s correlation
accuracy of 0.402 (ref. 21). The blue line shows the fitted line between the
observed and predicted values. Overall, from all three plots, it can be observed
that Pearson’s correlation coefficient is higher when the fitted line has a slope
closer to one. It is also noteworthy that the predicted values in all cases tend to
have smaller variances than the observed values. This reflects the tendency of
machine learning algorithms to generate predictions closer to the sample mean.
Finally, outliers or prediction failures can be observed in all plots even when
correlation accuracies are moderate. As the correlation accuracies measure the
relative goodness of fit, they are less affected by (or reflective of) outliers than
are accuracy measures based on absolute errors. Images adapted from ref. 91
© Dubois, J. et al., CC-BY 4.0 (a), ref. 70 © Li, J. et al., CC-BY 4.0 (b), ref. 21 © Xiao,
Y. et al., CC-BY 4.0 (c).
In cases in which differences in brain–behaviour relationships can
be assumed across different subgroups in the sample, group-specific
models have been used to improve prediction accuracies within certain
subgroups or provide insights into the differences in brain–behaviour
associations across subgroups17,37,72,73. Nevertheless, the validity of and
potential bias in subgroup definitions (for instance, ambiguity in eth-
nicity reporting) could limit the validity of any insights generated. Fur-
thermore, brain–behaviour relationships inferred from group-specific
models should not be simplified in terms of causal relations with the
subgroups, lest we fall into the trap of model bias and mistreatment
again70. Alternatively, an ensemble learning technique called boosting
may be useful for capturing different brain–behaviour relationships
without defining subgroups. In boosting, a sequence of models is
trained where each model assigns more importance to participants
that were wrongly predicted by previous models, thereby automati-
cally identifying the outlying participants.
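A minimal sketch of this strategy on hypothetical data is given below, using an off-the-shelf boosting regressor rather than any model from the studies reviewed here; participants with persistently large errors after boosting are the candidates for the outlying brain–behaviour profiles described above.

```python
# Boosting sketch: later models upweight participants that earlier models
# predicted poorly, without predefining subgroups. Data are synthetic.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 400))   # e.g. network-level RSFC features
y = rng.normal(size=600)          # psychometric score

# The default base learner is a shallow decision tree; each boosting round
# reweights the training sample towards poorly predicted participants.
booster = AdaBoostRegressor(n_estimators=50, learning_rate=0.5,
                            random_state=0).fit(X, y)

errors = np.abs(y - booster.predict(X))
outlier_candidates = np.argsort(errors)[-10:]   # hardest-to-predict participants
```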
From a basic neuroscience perspective, the insights gained from
a biased prediction model may lead to false conclusions regarding
behaviour and social identities, while from a practical perspective,
a biased model deployed for social applications would easily lead to
inequitable treatment of the target populations. To develop a fair pre-
diction model, both dedicated study design and model transparency
are vital. This calls for more population-representative samples, clearly
documented study and model parameters, and interpretable models.
Model interpretability
While accuracy and generalizability are requirements of any predictive
model, interpretability is another crucial goal, if less easy to quantify.
From a basic neuroscience perspective, prediction models need to be
interpretable to contribute to our knowledge about brain–behaviour
relationships, while from a practical perspective, interpretability is
required to evaluate the neurobiological validity of the model and,
relatedly, its trustworthiness. A model with lower accuracy but higher
interpretability may be preferred to a black-box model with higher
accuracy, as the transparency of the former model allows assessments
of the model trustworthiness. For instance, model bias against an ethnic
minority could be identified earlier if the model can be interpreted eas-
ily. Achieving good model interpretability is not trivial and sometimes
requires compromise in prediction performance22.
Many early studies provided an illusion of interpretability by
treating regression weights from machine learning models as fea-
ture importance values for neuroscientific interpretation. Later stud-
ies have demonstrated that these weights are neither stable across
cross-validation folds46,74 nor conceptually valid as measures of brain
feature importance75. It may still be possible to interpret the regression
weights after transforming them into corresponding forward model
weights using the Haufe transform75. While stable predictive networks
may be identified for cognition45, the stability in cross-validation and
generalizability to new cohorts of the transformed weights are still
reported to be low60,74. The reliability of transformed weights may
improve with larger sample size76, making this technique potentially
suitable in large cohorts. Nevertheless, when using functional connec-
tivity as a feature, it may be difficult to align the connectivity edges to
the brain mapping literature or to summarize the feature importance
values into practically useful information. The feature importance of
connectivity edges may be more easily visualized and interpreted by
grouping the connectivity edges in networks or finding the top connec-
tions. For instance, Fig. 3a shows groups of important connections for
predicting cognition within the visual network, within the default mode
network, and between the default mode network and other networks,
while Fig. 3b shows that the most important connections for predicting
fluid intelligence tend to be cross-hemispheric between medial regions
or between temporal regions.
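For a linear model, the Haufe transform replaces the backward (extraction) weights with the covariance between the features and the model output, scaled by the variance of that output; a minimal sketch on hypothetical data follows.

```python
# Haufe transform sketch for a linear prediction model (synthetic data):
# forward pattern a = cov(X, y_hat) / var(y_hat) = cov(X) @ w / var(y_hat).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))             # e.g. 200 connectivity features
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=500)

model = Ridge(alpha=1.0).fit(X, y)
w = model.coef_                              # backward weights (not importance)

Xc = X - X.mean(axis=0)                      # centre the features
y_hat = Xc @ w                               # model output (up to the intercept)
a = (Xc.T @ y_hat) / (len(y_hat) - 1)        # cov(X, y_hat): forward pattern
a /= y_hat.var(ddof=1)                       # scale by output variance

# 'a', not 'w', is the quantity that can be read as a feature importance map.
```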
Many other solutions have been proposed for interpreting pre-
diction models. Borrowing a feature-dropping concept from random
forests77, the feature importance of each feature can be quantified as
the decrease in prediction performance when that feature is removed
from the feature set62,78–80. This has sometimes been referred to as a
‘virtual lesion’ approach in the computational neuroimaging field. Such
simple implementations may not scale well to large feature sets, how-
ever, as each feature is dealt with independently. Alternatively, using
sparse regression models, only a small subset of features is selected
by the regression algorithm for prediction. This leads to an inbuilt
binary interpretation where only the small set of selected features is
considered important. For instance, Fig. 3c shows the feature impor-
tance assignment for predicting novelty seeking by a sparse algorithm,
helping the model interpretations to focus on frontal–subcortical, pari-
etal–frontal and within-frontal connections. This approach identifies
predictive features in a data-driven manner, albeit limited to research
questions for which sparsity can be safely assumed. When using highly
correlated features such as functional connectivity, some algorithms
may fail to include all important features that are correlated to each
other81. Considering a large set of features without feature selection, it
may still be possible to assess feature importance using Shapley addi-
tive explanation82,83. This method determines each feature's contribu-
tion similarly to the ‘virtual lesion’ approach but in all possible subsets
of features, providing a distribution of feature importance for each
feature. Finally, using a recently proposed region-wise framework, each
brain region’s feature set can be assessed instead of individual features.
In this method, a region-wise model is trained and tested to provide a
model accuracy specific to the brain region60. Interpretations based on
region-wise models are easy to illustrate (Fig. 3d) and to some extent
align with the brain mapping literature. Nevertheless, the distributed
aspect of brain organization is not modelled by region-wise models,
limiting strong interpretations to mostly region-specific properties.
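As an illustration of the 'virtual lesion' logic applied to grouped features (a simplified sketch with hypothetical region groupings, not the implementation of the cited studies), importance can be scored as the accuracy lost when all of a region's connections are removed.

```python
# 'Virtual lesion' sketch: drop one region's feature group and record the
# resulting drop in test accuracy. Groupings and data are synthetic.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def lesion_importance(X, y, groups):
    """groups maps a region name to the column indices of its features."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)
    full = Ridge(alpha=1.0).fit(X_tr, y_tr)
    full_r, _ = pearsonr(y_te, full.predict(X_te))
    importance = {}
    for name, cols in groups.items():
        keep = np.setdiff1d(np.arange(X.shape[1]), cols)    # 'lesion' the group
        lesioned = Ridge(alpha=1.0).fit(X_tr[:, keep], y_tr)
        r, _ = pearsonr(y_te, lesioned.predict(X_te[:, keep]))
        importance[name] = full_r - r        # accuracy lost without the group
    return importance

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 300)), rng.normal(size=500)
groups = {f"region_{i}": np.arange(i * 30, (i + 1) * 30) for i in range(10)}
print(lesion_importance(X, y, groups))
```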
Ultimately, useful model interpretations rely on the prediction
accuracy and generalizability of the model. With very low accuracies,
the interpretations generated from the models may be arbitrary at best,
while with low generalizability, the interpretations may be valid only for
the training sample. The challenge hence lies in designing models for
which interpretability can be achieved with minimal or no compromise
in accuracy. Potential directions may include more powerful genera-
tive models, more informative priors and interpretable deep neural
networks. Generative models and deep neural network models may
be combined into deep generative models to achieve the benefits of
both interpretability and accuracy, with specific approaches includ-
ing variational autoencoders84, generative adversarial networks85 and
autoregressive models86,87. With traditional machine learning models,
feature importance based on existing models can help to reduce fea-
ture dimensionality in new models in new cohorts, which offers new
interpretations to validate against the existing model’s interpretation.
In this way, a positive reinforcement loop may exist between boosting
prediction accuracy and interpretability, reducing the need to sacrifice
one for the other.
Enhancement and adversarial attacks
Enhancement and adversarial attacks can threaten the trustworthiness
of neuroimaging-based predictive models. Enhancement attacks are
those where purposeful data alterations can lead to falsely enhanced
model performance, while adversarial attacks are those where spe-
cifically designed noise is added to the data to cause a model to fail88.
An artificially enhanced model may be the result of scientific malprac-
tice or fraud, which, if not discovered, could lead to large amounts of
time and resources wasted in the wrong research direction. Successful
adversarial attacks on deployed models cause prediction outcomes
to become unreliable. In biomedical development, for example, the
effect of a treatment or drug could be faked or exaggerated with data
manipulation of the machine learning model to mislead financial inves-
tors. Similar to the issue of generalizability, the main challenge of these
attacks in the field of neuroimaging-based psychometric prediction is
the lack of awareness. As practical applications of neuroimaging-based
prediction models are still far-fetched at present, there is a lack of
motivation for researchers to design models that are robust to these
attacks. Furthermore, replication studies that might detect enhance-
ment attacks are still rather lacking in the field. While there is no evi-
dence of existing enhancement or adversarial attacks in the field and
no practical solution currently proposed against them, these are crucial
issues to address from the perspective of deploying brain-based predic-
tion models in societal applications.
Simple data enhancement can be done by biased participant selec-
tion. Participants may be retroactively selected on the basis of their
individual prediction outcome or chosen only if they follow certain
brain–phenotype stereotypes. Such manipulations can be detected
if data characteristics and exclusion criteria are reported faithfully,
especially when outliers are excluded on the basis of a threshold. A
more advanced approach involves adding patterns correlated to the
behavioural variable of interest to the imaging features, boosting the
prediction accuracies to almost perfect without inducing statistically
significant differences between the modified features and the original
features88. Furthermore, it is possible to design data enhancements to
cause machine learning models to learn brain–behaviour relationships
that do not exist in the original data. This means that enhancement
attacks may also be detrimental from a basic neuroscience perspective,
as the conclusions drawn would not be valid. This type of attack may
be detected when a replication study fails to generalize the model to
new cohorts but can really be confirmed only if the raw data and data
processing code can be openly examined.
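A toy sketch of such a manipulation (purely schematic, not the procedure of ref. 88) shows why these alterations are hard to spot: mixing a faint copy of the target into a few feature columns can inflate cross-validated accuracy for a target that is unrelated to the original features, while each feature's distribution changes only marginally.

```python
# Schematic enhancement attack: inject a weak target-correlated pattern into
# a small subset of features and compare cross-validated accuracies.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 1000))
y = rng.normal(size=400)                    # truly unrelated to X

X_tampered = X.copy()
leak = 0.2 * (y - y.mean()) / y.std()       # small target-correlated pattern
X_tampered[:, :50] += leak[:, None]         # injected into 50 'edges'

for name, feats in {"original": X, "tampered": X_tampered}.items():
    pred = cross_val_predict(Ridge(alpha=1.0), feats, y, cv=5)
    print(name, round(pearsonr(y, pred)[0], 2))
```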
Fig. 3 | Visualizations of model interpretations. a, Feature importance of all
RSFC edges for predicting general cognition in a young adult cohort with parcels
grouped under networks38. The colours correspond to the Haufe-transformed
weight values. Parcels are grouped under seven networks including the default
mode, control, language, salience/ventral attention (SalVenAttn), dorsal
attention (DorsAttn), auditory, somatomotor (SomMot) and visual networks.
Important connections can be found within the visual network, within the default
mode network, between the default mode network and the control network,
and between the default mode network and the attention networks. b, Feature
importance of top RSFC edges for predicting fluid intelligence in a young adult
cohort shown in their corresponding positions in the brain60. The colours
correspond to the Haufe-transformed weight values. Most top connections can
be found between medial regions or temporal regions across the hemispheres.
c, Feature importance of all RSFC edges for predicting novelty seeking in a young
adult cohort when a sparse algorithm was used. The colours correspond to the
mean weight values across cross-validation splits92. The sparse set of selected
features mostly includes frontal–subcortical, parietal–frontal and within-
frontal connections. d, Brain region importance for predicting fluid cognition
in an ageing cohort on the basis of the RSFC features using the region-wise
approach60. The colours correspond to the prediction accuracies achieved using
brain regional connectivity profiles. The relatively more predictive regions
can be identified in the cingulate cortex, the peripheral visual area, the right
supramarginal gyrus, the right anterior insula, the central sulcus and the right
lateral frontal cortex. Panels b,d reproduced with permission from ref. 60,
Elsevier. Panel c reproduced with permission from ref. 92, Elsevier.
The effects of adversarial attacks in machine learning models for
clinical applications have been investigated89,90. For brain-based pre-
diction models in healthy populations, very minor data manipulations
have been shown to cause the classification accuracy to drop to 0%88.
To design this type of attack, the model parameters must be known,
hence bringing forth additional challenges in achieving both open
science and practical utility. Data validation to identify manipulated
data, if possible, may become paramount in the future of adversarial
attacks. Potentially, machine learning models employed in practical
applications can use online learning where a trained model continues
to receive new batches of data for additional training, while only shar-
ing the model at baseline for scientific purposes.
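A minimal sketch of this setup, using hypothetical data and an off-the-shelf incremental learner, could look as follows; only the baseline model would be shared for scientific scrutiny, while the deployed copy keeps updating on incoming batches.

```python
# Online-learning sketch: a frozen baseline model is shared, while the deployed
# copy keeps learning from new data batches via partial_fit. Data are synthetic.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X0, y0 = rng.normal(size=(500, 300)), rng.normal(size=500)

baseline = SGDRegressor(penalty="l2", alpha=1e-3, random_state=0)
baseline.partial_fit(X0, y0)              # snapshot shared openly

deployed = SGDRegressor(penalty="l2", alpha=1e-3, random_state=0)
deployed.partial_fit(X0, y0)
for _ in range(10):                       # batches arriving after deployment
    Xb, yb = rng.normal(size=(50, 300)), rng.normal(size=50)
    deployed.partial_fit(Xb, yb)          # deployed weights drift from baseline
```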
In the face of potential enhancement and adversarial attacks,
model and study reproducibility enabled by open science is necessary
to detect and address these data manipulations. The field can benefit
from multiple aspects of transparent study design and provenance
tracking, including easier replication, enhancement attack monitoring,
comparison across studies and pooling of results22,88.
Conclusions
Many challenges lie in the way of brain-based predictive modelling
of behaviour before it can be substantially useful for understanding
complex brain–behaviour relationships or for practical applications.
While some limitations are inherent, such as smaller sample sizes
in studies interested in phenotypic measures that are uncommon
in large open datasets, others are solvable, such as the assessment
and improvement of generalizability. By acknowledging this and
addressing the solvable issues, brain-based psychometric predic-
tions can steadily progress towards scientific and practical utility. We
encourage more comprehensive study design, comprising multiple
cohorts to cover more population-representative samples, and ensur-
ing model validity with careful confound handling. Furthermore,
we advocate for model evaluation based on both accuracy and gen-
eralizability. Predictive modelling in neuroscience is a necessarily
interdisciplinary field, which requires combinations of neuroscientific
knowledge, statistical concepts and machine learning techniques
to achieve its potential. Beyond this interdisciplinarity, transpar-
ent models, diverse data and rigorous study designs are the keys to
moving forward.
References
1. Sui, J., Jiang, R., Bustillo, J. & Calhoun, V. Neuroimaging-based
individualized prediction of cognition and behavior for mental
disorders and health: methods and promises. Biol. Psychiatry 88,
818–828 (2020).
2. Yeung, A. W. K., More, S., Wu, J. & Eickhoff, S. B. Reporting
details of neuroimaging studies on individual traits prediction:
a literature survey. NeuroImage 256, 119275 (2022).
3. Cirillo, D. & Valencia, A. Big data analytics for personalized
medicine. Curr. Opin. Biotechnol. 58, 161–167 (2019).
4. Dadi, K. et al. Benchmarking functional connectome-based
predictive models for resting-state fMRI. NeuroImage 192,
115–134 (2019).
5. Dhamala, E., Yeo, B. T. T. & Holmes, A. J. One size does not fit
all: methodological considerations for brain-based predictive
modelling in psychiatry. Biol. Psychiatry 93, 717–728 (2023).
6. Rosenberg, M. D. et al. A neuromarker of sustained attention
from whole-brain functional connectivity. Nat. Neurosci. 19,
165–171 (2016).
7. Lee, M. H., Smyser, C. D. & Shimony, J. S. Resting-state fMRI: a
review of methods and clinical applications. Am. J. Neuroradiol.
34, 1866–1872 (2013).
8. Finn, E. S. et al. Functional connectome fingerprinting: identifying
individuals using patterns of brain connectivity. Nat. Neurosci. 18,
1664–1671 (2015).
9. Ferguson, M. A., Anderson, J. S. & Spreng, R. N. Fluid and flexible
minds: intelligence reflects synchrony in the brain’s intrinsic
network architecture. Netw. Neurosci. 1, 192–207 (2017).
10. Li, J. et al. A neuromarker of individual general fluid intelligence
from the white-matter functional connectome. Transl. Psychiatry
10, 147 (2020).
11. Kumar, S. et al. An information network flow approach for
measuring functional connectivity and predicting behavior. Brain
Behav. 9, e01346 (2019).
12. Rosenberg, M. D. et al. Functional connectivity predicts changes
in attention observed across minutes, days, and months. Proc.
Natl Acad. Sci. USA 117, 3797–3807 (2020).
13. Avery, E. W. et al. Distributed patterns of functional connectivity
predict working memory performance in novel healthy and
memory-impaired individuals. J. Cogn. Neurosci. 32, 241–255
(2020).
14. Pläschke, R. N. et al. Age differences in predicting working
memory performance from network-based functional
connectivity. Cortex 132, 441–459 (2020).
15. Zhang, H. et al. Do intrinsic brain functional networks predict
working memory from childhood to adulthood? Hum. Brain Mapp.
41, 4574–4586 (2020).
16. Girault, J. B. et al. White matter connectomes at birth accurately
predict cognitive abilities at age 2. NeuroImage 192, 145–155
(2019).
17. Jiang, R. et al. Multimodal data revealed different neurobiological
correlates of intelligence between males and females. Brain
Imaging Behav. 14, 1979–1993 (2020).
18. Rasero, J., Sentis, A. I., Yeh, F. C. & Verstynen, T. Integrating across
neuroimaging modalities boosts prediction accuracy of cognitive
ability. PLoS Comput. Biol. 17, e1008347 (2021).
19. Wei, L. et al. Grey matter volume in the executive attention system
predict individual differences in effortful control in young adults.
Brain Topogr. 32, 111–117 (2019).
20. Kaufmann, T. et al. Task modulations and clinical manifestations
in the brain functional connectome in 1615 fMRI datasets.
NeuroImage 147, 243–252 (2017).
21. Xiao, Y. et al. Predicting visual working memory with multimodal
magnetic resonance imaging. Hum. Brain Mapp. 42, 1446–1462
(2021).
22. Scheinost, D. et al. Ten simple rules for predictive modeling of
individual dierences in neuroimaging. NeuroImage 193, 35–45
(2019).
23. Gabrieli, J. D. E., Ghosh, S. S. & Whitfield-Gabrieli, S. Prediction as
a humanitarian and pragmatic contribution from human cognitive
neuroscience. Neuron 85, 11–26 (2015).
24. Pervaiz, U., Vidaurre, D., Woolrich, M. W. & Smith, S. M. Optimising
network modelling methods for fMRI. NeuroImage 221, 116604
(2020).
25. Poldrack, R. A., Huckins, G. & Varoquaux, G. Establishment of best
practices for evidence for prediction: a review. JAMA Psychiatry
77, 534–540 (2020).
26. Sripada, C. et al. Prediction of neurocognition in youth from
resting state fMRI. Mol. Psychiatry 25, 3413–3421 (2019).
27. He, T. et al. Deep neural networks and kernel regression achieve
comparable accuracies for functional connectivity prediction of
behavior and demographics. NeuroImage 206, 116276 (2020).
28. He, L. et al. Functional connectome prediction of anxiety related
to the COVID-19 pandemic. Am. J. Psychiatry 178, 530–540 (2021).
Nature Human Behaviour
Review article https://doi.org/10.1038/s41562-023-01670-1
29. Gao, S., Greene, A. S., Constable, R. T. & Scheinost, D.
Combining multiple connectomes improves predictive
modeling of phenotypic measures. NeuroImage 201,
116038 (2019).
30. Zalesky, A., Fornito, A., Cocchi, L., Gollo, L. L. & Breakspear, M.
Time-resolved resting-state brain networks. Proc. Natl Acad. Sci.
USA 111, 10341–10346 (2014).
31. Bahg, G., Evans, D. G., Galdo, M. & Turner, B. M. Gaussian process
linking functions for mind, brain, and behavior. Proc. Natl Acad.
Sci. USA 117, 29398–29406 (2020).
32. Mihalik, A. et al. Canonical correlation analysis and partial
least squares for identifying brain–behaviour associations:
a tutorial and a comparative study. Biol. Psychiatry 7, 1055–1067
(2022).
33. Gal, S., Tik, N., Bernstein-Eliav, M. & Tavor, I. Predicting individual
traits from unperformed tasks. NeuroImage 249, 118920 (2022).
34. He, T. et al. Meta-matching as a simple framework to translate
phenotypic predictive models from big to small data. Nat.
Neurosci. 25, 795–804 (2022).
35. Takagi, Y., Hirayama, J. I. & Tanaka, S. C. State-unspecific patterns
of whole-brain functional connectivity from resting and multiple
task states predict stable individual traits. NeuroImage 201,
116036 (2019).
36. Burr, D. A. et al. Functional connectivity predicts the dispositional
use of expressive suppression but not cognitive reappraisal. Brain
Behav. 10, e01493 (2020).
37. Jiang, R. et al. Task-induced brain connectivity promotes
the detection of individual differences in brain–behavior
relationships. NeuroImage 207, 116370 (2020).
38. Ooi, L. Q. R. et al. Comparison of individualized behavioral
predictions across anatomical, diffusion and functional
connectivity MRI. NeuroImage 263, 119636 (2022).
39. Dhamala, E., Jamison, K. W., Jaywant, A., Dennis, S. & Kuceyeski, A.
Distinct functional and structural connections predict crystallised
and luid cognition in healthy adults. Hum. Brain Mapp. 42,
3102–3118 (2021).
40. Mansour, L. S., Tian, Y., Yeo, B. T. T., Cropley, V. & Zalesky, A.
High-resolution connectomic fingerprints: mapping neural
identity and behavior. NeuroImage 229, 117695 (2021).
41. Pat, N. et al. Longitudinally stable, brain-based predictive models
mediate the relationships between childhood cognition and
socio-demographic, psychological and genetic factors. Hum.
Brain Mapp. 43, 5520–5542 (2022).
42. Hurtz, G. M. & Donovan, J. J. Personality and job performance: the
Big Five revisited. J. Appl. Psychol. 85, 869–879 (2000).
43. Kane, M. J., Conway, A. R. A., Miura, T. K. & Colflesh, G. J. H.
Working memory, attention control, and the n-back task:
a question of construct validity. J. Exp. Psychol. 33, 615–622
(2007).
44. Sanchez-Cubillo, I. et al. Construct validity of the Trail Making Test:
role of task-switching, working memory, inhibition/interference
control, and visuomotor abilities. J. Int. Neuropsychol. Soc. 15,
438–450 (2009).
45. Chen, J. et al. Shared and unique brain network features predict
cognitive, personality, and mental health scores in the ABCD
study. Nat. Commun. 13, 2217 (2022).
46. Wu, J. et al. A connectivity-based psychometric prediction
framework for brain–behavior relationship studies. Cereb. Cortex
31, 3732–3751 (2021).
47. Noble, S., Scheinost, D. & Constable, R. T. A decade of test–retest
reliability of functional connectivity: a systematic review and
meta-analysis. NeuroImage 203, 116157 (2019).
48. Elliott, M. L. et al. What is the test–retest reliability of common
task-functional MRI measures? New empirical evidence and a
meta-analysis. Psychol. Sci. 31, 792–806 (2020).
49. Patriat, R. et al. The effect of resting condition on resting-state
fMRI reliability and consistency: a comparison between resting
with eyes open, closed, and fixated. NeuroImage 78, 463–473
(2013).
50. Birn, R. M. et al. The effect of scan length on the reliability of
resting-state fMRI connectivity estimates. NeuroImage 83,
550–558 (2013).
51. Bennett, C. M. & Miller, M. B. fMRI reliability: influences of task
and experimental design. Cogn. Affect. Behav. Neurosci. 13,
690–702 (2013).
52. Cremers, H. R., Wager, T. D. & Yarkoni, T. The relation between
statistical power and inference in fMRI. PLoS ONE 12, e0184923
(2017).
53. Kharabian Masouleh, S., Eickhoff, S. B., Hoffstaedter, F. & Genon,
S., Alzheimer’s Disease Neuroimaging Initiative. Empirical
examination of the replicability of associations between brain
structure and psychological variables. eLife 8, e43464 (2019).
54. Genon, S., Eickhoff, S. B. & Kharabian, S. Linking interindividual
variability in brain structure to behaviour. Nat. Rev. Neurosci. 23,
307–318 (2022).
55. Marek, S. et al. Reproducible brain-wide association studies
require thousands of individuals. Nature 603, 654–660 (2022).
56. Beaty, R. E. et al. Robust prediction of individual creative ability
from brain functional connectivity. Proc. Natl Acad. Sci. USA 115,
1087–1092 (2018).
57. Liu, P. et al. The functional connectome predicts feeling of stress
on regular days and during the COVID-19 pandemic. Neurobiol.
Stress 14, 100285 (2021).
58. Ren, Z. et al. Connectome-based predictive modeling of creativity
anxiety. NeuroImage 225, 117469 (2021).
59. Fong, A. H. C. et al. Dynamic functional connectivity during task
performance and rest predicts individual differences in attention
across studies. NeuroImage 188, 14–25 (2019).
60. Wu, J. et al. Cross-cohort replicability and generalizability
of connectivity-based psychometric prediction patterns.
NeuroImage 262, 119569 (2022).
61. Tervo-Clemmens, B. et al. Reply to: Multivariate BWAS can be
replicable with moderate sample sizes. Nature 615, E8–E12 (2023).
62. Rosenberg, M. D. & Finn, E. S. How to establish robust brain–
behavior relationships without thousands of individuals. Nat.
Neurosci. 25, 835–837 (2022).
63. Spisak, T., Bingel, U. & Wager, T. Replicable multivariate BWAS
with moderate sample sizes. Preprint at bioRxiv https://doi.org/
10.1101/2022.06.22.497072 (2022).
64. Van Essen, D. C. et al. The WU-Minn Human Connectome Project:
an overview. NeuroImage 80, 62–79 (2013).
65. Glasser, M. F. et al. The minimal preprocessing pipelines
for the Human Connectome Project. NeuroImage 80,
105–124 (2013).
66. Harms, M. P. et al. Extending the Human Connectome Project
across ages: imaging protocols for the lifespan development and
aging projects. NeuroImage 183, 972–984 (2018).
67. Chouldechova, A., Benavides-Prado, D., Fialko, O. & Vaithianathan,
R. A case study of algorithm-assisted decision making in child
maltreatment hotline screening decisions. Proc. Mach. Learn. Res.
81, 134–148 (2018).
68. Martin, A. R. et al. Clinical use of current polygenic risk scores
may exacerbate health disparities. Nat. Genet. 51, 584–591
(2019).
69. Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting
racial bias in an algorithm used to manage the health of
populations. Science 366, 447–453 (2019).
70. Li, J. et al. Cross-ethnicity/race generalization failure of behavioral
prediction from resting-state functional connectivity. Sci. Adv. 8,
eabj1812 (2022).
71. Greene, A. S. et al. Brain-phenotype models fail for individuals
who defy sample stereotypes. Nature 609, 109–118 (2022).
72. Greene, A. S., Gao, S., Scheinost, D. & Constable, R. T.
Task-induced brain state manipulation improves prediction of
individual traits. Nat. Commun. 9, 2807 (2018).
73. Nostro, A. D. et al. Predicting personality from network-based
resting-state functional connectivity. Brain Struct. Funct. 223,
2699–2719 (2018).
74. Tian, Y. & Zalesky, A. Machine learning prediction of cognition
from functional connectivity: are feature weights reliable?
NeuroImage 245, 118648 (2021).
75. Haufe, S. et al. On the interpretation of weight vectors of
linear models in multivariate neuroimaging. NeuroImage 87,
96–110 (2014).
76. Chen, J. et al. Relationship between prediction accuracy and
feature importance reliability: an empirical and theoretical study.
NeuroImage 274, 120115 (2023).
77. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
78. Yip, S. W., Kiluk, B. & Scheinost, D. Towards addiction prediction:
an overview of cross-validated predictive modeling findings and
considerations for future neuroimaging research. Biol. Psychiatry
Cogn. Neurosci. Neuroimaging 5, 748–758 (2020).
79. Jiang, R., Woo, C. W., Qi, S., Wu, J. & Sui, J. Interpreting brain
biomarkers: challenges and solutions in interpreting machine
learning-based predictive neuroimaging. IEEE Signal Process.
Mag. 39, 107–118 (2022).
80. Chormai, P. et al. Machine learning of large-scale multimodal
brain imaging data reveals neural correlates of hand preference.
NeuroImage 262, 119534 (2022).
81. Zou, H. & Hastie, T. Regularization and variable selection via the
elastic net. J. R. Stat. Soc. B 67, 301–320 (2005).
82. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting
model predictions. In Advances in Neural Information Processing
Systems 30 (NIPS 2017) (eds Guyon, I. et al.) (Curran Associates,
2017).
83. Pat, N., Wang, Y., Bartonicek, A., Candia, J. & Stringaris, A.
Explainable machine learning approach to predict and explain
the relationship between task-based fMRI and individual
differences in cognition. Cereb. Cortex 33, 2682–2703 (2023).
84. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes.
Preprint at arXiv https://doi.org/10.48550/arXiv.1312.6114 (2013).
85. Goodfellow, I. J. et al. Generative adversarial networks. Commun.
ACM 63, 139–144 (2020).
86. van den Oord, A., Kalchbrenner, N. & Kavukcuoglu, K. Pixel
recurrent neural networks. In Proceedings of the 33rd International
Conference on Machine Learning (eds Balcan, M. F. & Weinberger,
K. Q.) Vol. 48, 1747–1756 (Proceedings of Machine Learning
Research, 2016).
87. Fried, D. et al. Speaker-follower models for vision-and-language
navigation. In Advances in Neural Information Processing Systems
31 (NeurIPS 2018) (eds Bengio, S. et al.) (Curran Associates, 2018).
88. Rosenblatt, M. et al. Connectome-based machine learning
models are vulnerable to subtle data manipulations. Patterns (in
the press).
89. Finlayson, S. G. et al. Adversarial attacks on medical machine
learning. Science 363, 1287–1289 (2019).
90. Finlayson, S. G., Chung, H. W., Kohane, I. S. & Beam, A. L.
Adversarial attacks against medical deep learning systems.
Preprint at arXiv https://doi.org/10.48550/arXiv.1804.05296 (2019).
91. Dubois, J., Galdi, P., Han, Y., Paul, L. K. & Adolphs, R. Resting-state
functional brain connectivity best predicts the personality
dimension of openness to experience. Pers. Neurosci. 1, E6 (2018).
92. Jiang, R. et al. Connectome-based individualized prediction of
temperament trait scores. NeuroImage 183, 366–374 (2018).
Acknowledgements
This work was supported by the Deutsche Forschungsgemeinschaft (GE
2835/2-1, EI 816/4-1), the Helmholtz Portfolio Theme 'Supercomputing
and Modelling for the Human Brain' and the European Union's Horizon
2020 Research and Innovation Programme under grant agreement no.
720270 (HBP SGA1) and grant agreement no. 785907 (HBP SGA2).
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to
Jianxiao Wu or Sarah Genon.
Peer review information Nature Human Behaviour thanks the
anonymous reviewers for their contribution to the peer review
of this work.
Reprints and permissions information is available at
www.nature.com/reprints.
Publisher's note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds
exclusive rights to this article under a publishing agreement with
the author(s) or other rightsholder(s); author self-archiving of the
accepted manuscript version of this article is solely governed by the
terms of such publishing agreement and applicable law.
© Springer Nature Limited 2023
Lateralization is a fundamental characteristic of many behaviors and the organization of the brain, and atypical lateralization has been suggested to be linked to various brain-related disorders such as autism and schizophrenia. Right-handedness is one of the most prominent markers of human behavioural lateralization, yet its neurobiological basis remains to be determined. Here, we present a large-scale analysis of handedness, as measured by self-reported direction of hand preference, and its variability related to brain structural and functional organization in the UK Biobank (N = 36,024). A multivariate machine learning approach with multi-modalities of brain imaging data was adopted, to reveal how well brain imaging features could predict individual's handedness (i.e., right-handedness vs. non-right-handedness) and further identify the top brain signatures that contributed to the prediction. Overall, the results showed a good prediction performance, with an area under the receiver operating characteristic curve (AUROC) score of up to 0.72, driven largely by resting-state functional measures. Virtual lesion analysis and large-scale decoding analysis suggested that the brain networks with the highest importance in the prediction showed functional relevance to hand movement and several higher-level cognitive functions including language, arithmetic, and social interaction. Genetic analyses of contributions of common DNA polymorphisms to the imaging-derived handedness prediction score showed a significant heritability (h2=7.55%, p