SCIENTIFIC DATA | (2021) 8:192 | https://doi.org/10.1038/s41597-021-00981-0
www.nature.com/scientificdata
Data sharing practices and data availability upon request differ across scientific disciplines
Leho Tedersoo1,2 ✉, Rainer Küngas1, Ester Oras1,3,4, Kajar Köster1,5, Helen Eenmaa1,6, Äli Leijen1,7, Margus Pedaste7, Marju Raju1,8, Anastasiya Astapova1,9, Heli Lukner1,10, Karin Kogermann1,11 & Tuul Sepp1,12
Data sharing is one of the cornerstones of modern science that enables large-scale analyses and reproducibility. We evaluated data availability in research articles across nine disciplines in Nature and Science magazines and recorded corresponding authors’ concerns, requests and reasons for declining data sharing. Although data sharing has improved in the last decade and particularly in recent years, data availability and willingness to share data still differ greatly among disciplines. We observed that statements of data availability upon (reasonable) request are inefficient and should not be allowed by journals. To improve data sharing at the time of manuscript acceptance, researchers should be better motivated to release their data with real benefits such as recognition, or bonus points in grant and job applications. We recommend that data management costs should be covered by funding agencies; publicly available research data ought to be included in the evaluation of applications; and surveillance of data sharing should be enforced by both academic publishers and funders. These cross-discipline survey data are available from the PlutoF repository.
Introduction
Technological advances and the accumulation of case studies have led many research fields into the era of ‘big data’ - the possibility to integrate data from various sources for secondary analysis, e.g. meta-studies and meta-analyses1,2. Nearly half of researchers commonly use data generated by other scientists3. Data sharing is a scientific norm and an important part of research ethics in all disciplines, and it is increasingly endorsed by publishers, funders and the scientific community4–6. Despite decades of argumentation7, much of the published data is still essentially unavailable for integration into secondary data analysis and evaluation of reproducibility, a proxy for reliability8–10. Furthermore, the deposited data may be incomplete, sometimes intentionally11–14, e.g. when they exhibit mismatching sample codes or lack important metadata such as the sex and age of studied organisms in biological and social sciences.
Although the vast majority of researchers prefer data sharing12,15, scientists tend to be concerned about losing their priority in future publishing and about potential commercial use of their work without their consent or participation12,16,17. Researchers working on human subjects may be bound by legal agreements not to reveal sensitive data16,18. Across research fields, papers indicating available data are cited on average 25% more19. In research using microarrays, papers with access to raw data accumulate on average 69% more citations compared with other articles20. Unfortunately, a higher citation rate has not motivated many researchers enough to release their data, although referees and funding agencies account for bibliometrics when evaluating researchers and their
1Estonian Young Academy of Sciences, Kohtu 6, 10130, Tallinn, Estonia. 2Mycology and Microbiology Center,
University of Tartu, Ravila 14a, 50411, Tartu, Estonia. 3Institute of Chemistry, University of Tartu, Ravila 14a,
50411, Tartu, Estonia. 4Institute of History and Archaeology, University of Tartu, Jakobi 2, 51005, Tartu, Estonia.
5Department of Forest Sciences, University of Helsinki, PO Box 27 (Latokartanonkaari 7), Helsinki, FI-00014, Finland.
6School of Law, University of Tartu, Näituse 20, 50409, Tartu, Estonia. 7Institute of Education, University of Tartu,
Salme 1a, 50103, Tartu, Estonia. 8Department of Musicology, Music Pedagogy and Cultural Management, Estonian
Academy of Music and Theatre, Tatari 13, 10116, Tallinn, Estonia. 9Institute for Cultural Research and Fine Arts,
University of Tartu, Ülikooli 16, 51003, Tartu, Estonia. 10Institute of Physics, University of Tartu, W. Ostwaldi 1, 50411,
Tartu, Estonia. 11Institute of Pharmacy, University of Tartu, Nooruse 1, 50411, Tartu, Estonia. 12Institute of Ecology
and Earth Sciences, University of Tartu, Vanemuise 46, 51003, Tartu, Estonia. e-mail: leho.tedersoo@ut.ee
ANALYSIS
OPEN
proposals21. Multiple case studies have revealed high variation in data availability across different journals and disciplines, ranging from 9 to 76%8,11,13,19,22–24. Data requests to authors are successful in 27–59% of cases, whereas the request is ignored in 14–41% of cases based on previous research10,25–28. To promote access to data, many journals have implemented mandatory data availability statements and require data storage in supplementary materials or specific databases29,30. Because of poor enforcement, this has not always guaranteed access to published data, owing to broken links, lack of metadata or the authors’ unwillingness to share upon request8,26.
This study aims to map and evaluate cross-disciplinary differences in data sharing, authors’ concerns and reasons for denying access to data, and whether these decisions are reflected in article citations (Fig. 1). We selected scholarly articles published in the journals Nature and Science because of their multidisciplinary contents, stringent data availability policies outlined in the authors’ instructions, and high-impact conclusions derived from data of exceptional size, accuracy and/or novelty. We hypothesised that in spite of an overall improvement in data sharing culture, actual data availability and the reasons for declining requests to share data depend on scientific discipline because of field-specific ‘traditions’, ‘sensitivity’ of data, or their economic potential. Our broader goal is to improve data sharing principles and policies among authors, academic publishers and research foundations.
Results
Initial and nal data availability. We evaluated the availability of most critical data in 875 articles across
nine scientic disciplines (TableS1) published in Nature and Science over two 10-year intervals (2000–2009 and
2010–2019) and, in case these data were not available for access, we contacted the authors. e initial (pre-con-
tacting) full and at least partial data availability averaged at 54.2% (range across disciplines, 33.0–82.8%) and
71.8% (40.4–100.0%), respectively. Stepwise logistic regression models revealed that initial data availability
diered by research eld, type of data, journal and publishing period (no vs. full availability: n = 721; Somers’
D = 0.676; R2model = 0.476; P < 0.001). According to the best model (TableS2), the data were less readily avail-
able in materials for energy and catalysis (W = 68.0; β = 1.52 ± 0.19; P < 0.001), psychology (W = 55.6;
β = 1.11 ± 0.15; P < 0.001), optics and photonics (W = 18.8; β = 0.59 ± 0.14; P < 0.001) and forestry (W = 9.8;
β = 0.52 ± 0.19; P = 0.002) compared with other disciplines, especially humanities (Fig.2). Data availability was
relatively lower in the period of 2000–2009 (W = 82.5; β = 0.57 ± 0.10; P < 0.001) and when the most important
data were in the form of a dataset (relative to image/video and model; W = 41.5; β = 1.23 ± 0.19; P < 0.001;
Fig.3). Relatively less data were available for Nature (W = 32.7; β = 0.57 ± 0.19; P < 0.001), with striking sever-
al-fold dierences in optics and photonics (Fig.2).
Fig. 1 Schematic rationale of the study.
Fig. 2 Differences in partial (grey) and full (black) data availability among disciplines depending on journal and publishing period (P1, 2000–2009; P2, 2010–2019) before contacting the authors (n = 875). Letters above bars indicate statistically significant difference groups among disciplines in full data availability compared to no data availability. Asterisks show significant differences in full data availability between journals and publishing periods.
Fig. 3 Types of critical data (n = 875). (a) Distribution of data types among disciplines (blue, dataset; purple, image; black, model); (b) partial (light shades) and full (dark shades) data availability among disciplines depending on the type of critical data (DS, dataset; Img, image; Mod, model) before contacting the author(s).
Upon contacting the authors of 310 papers, the overall data availability improved by 35.0%. Full and at least partial availability averaged 69.5% (range across disciplines, 57.0–87.9%) and 83.2% (64.9–100.0%), respectively (Fig. 4), after 60 days since contacting, a reasonable time frame4. The final data availability (after contacting the authors) was best predicted by scientific discipline, data type and time lapse since publishing (no vs. full availability: n = 580; D = 0.659; R2model = 0.336; Padj < 0.001; Table S2), but with no major changes in the ranking of disciplines or data types compared with the initial data availability (Fig. 4). It took a median of 15 days to receive data from the authors (Fig. 5), with a minimum time of 13 minutes. Four authors sent their data after the 60-day period since the initial request (max. 107 days). The rate of receiving data was unrelated to any studied parameter.
Authors’ responses to data requests. The data were obtained from the authors in 39.4% of data requests on average, with a range of 27.9–56.1% among research fields. The likelihood of receiving data, or of the request being declined or ignored, depended mostly on the time period and field of research. According to the best model (n = 310; D = 0.300; R2model = 0.106; Padj < 0.001; Table S2), the data were obtained slightly less frequently for the earlier time period (29.4% vs. 56.0%; W = 20.4; β = 0.56 ± 0.12; Padj < 0.001). Receiving data upon request tended to be lowest in the field of forestry (W = 3.6; β = 0.31 ± 0.16; Padj = 0.177), especially when compared with microbiology (Fig. 2).
Declining the data request averaged 19.4% and differed most strongly among the research fields. The best model (n = 310; D = 0.508; R2model = 0.221; Padj < 0.001) revealed that the data were not made available
Fig. 4 Differences in partial (grey) and full (black) data availability among disciplines after data requests (n = 672) depending on the type of critical data (DS, dataset; image; model) and publishing period (P1, 2000–2009; P2, 2010–2019). Letters above bars indicate statistically significant difference groups among disciplines in full data availability.
Fig. 5 Histogram of time for receiving data from authors upon request within the 60-day reasonable time period (blue bars) and beyond (purple bar; data excluded from analyses; n = 199 requests). Note the 2-base logarithmic scale until 60 days.
upon request most likely in the fields of social sciences (W = 24.3; β = 1.09 ± 0.22; Padj < 0.001), psychology (W = 20.0; β = 0.73 ± 0.20; Padj < 0.001) and humanities (W = 5.0; β = 0.67 ± 0.30; Padj = 0.078) compared with natural sciences (Fig. 2). Furthermore, the data request was more likely to be declined when the data complexity was high (W = 9.8; β = 0.59 ± 0.19; Padj = 0.005), the paper was not open access in ISI Web of Science (W = 4.0; β = 0.37 ± 0.18; Padj = 0.132) and published in Science rather than Nature (W = 4.6; β = 0.35 ± 0.16; Padj = 0.096), although the two latter figures are non-significant when accounting for multiple testing.
We received no response to 41.3% of our data requests, including two biweekly reminders. Responding to the data request differed most strongly among scientific disciplines and time periods (Fig. 6). Altogether, 28.9% and 49.0% of requests were ignored by the authors of earlier (2000–2009) and later (2010–2019) papers, respectively. According to the best model (n = 310; D = 0.429; R2model = 0.200; Padj < 0.001; Table S2), articles from the earlier time period (W = 9.3; β = 0.41 ± 0.13; Padj = 0.007) and the fields of forestry (W = 13.4; β = 0.57 ± 0.16; Padj < 0.001) and ecology (W = 7.0; β = 0.53 ± 0.20; Padj = 0.024) had the greatest likelihood of no response, whereas social scientists (W = 7.7; β = 0.87 ± 0.31; Padj = 0.016) answered most frequently.
In general, there was no residual effect of time since publication when the publication period was included in the best model. Within the 2010–2019 period, we specifically tested whether the authors publishing in 2019 and 2018 were less likely to share their data because of potential conflicting publishing interests. This hypothesis was not supported, and a non-significant reverse trend was observed, as the proportion of data obtained from the authors increased from 44% in 2010–2017 to 63% in 2018–2019. Accounting for time since publishing across the entire survey period, data availability upon request decayed at a rate of 5.9% year−1 based on an exponential model. This estimate was marginally higher than the 3.5% annual loss of publicly available data (Fig. 7). The number of articles was insufficient to test differences in data decay rates among disciplines.
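An exponential availability model of the form used here, y = a + e^(bx), can be fitted in outline with SciPy's curve_fit. The yearly availability series below is synthetic (generated from the same functional form with a built-in 5.9% rate plus noise), so the recovered parameters are illustrative only:

```python
# Hedged sketch: least-squares fit of y = a + exp(b*x) to a simulated
# availability-by-year series; not the study's actual data.
import numpy as np
from scipy.optimize import curve_fit

x = np.arange(2000, 2020) - 2000                 # years since 2000
rng = np.random.default_rng(1)
y = 20.3 + np.exp(0.059 * x) + rng.normal(0, 0.1, x.size)  # synthetic %

def model(x, a, b):
    # baseline availability plus an exponential trend term
    return a + np.exp(b * x)

(a_hat, b_hat), _ = curve_fit(model, x, y, p0=(20.0, 0.05))
print(round(a_hat, 1), round(b_hat, 3))  # close to the simulated 20.3, 0.059
```

With low noise the fit recovers the simulated rate; on real survey percentages the residual scatter is far larger, which is why the paper reports correlation coefficients alongside each fitted curve.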
Authors’ concerns and reasons for declining data sharing. Upon contacting the authors, we recorded and categorised their concerns and requests related to data sharing (n = 188 authors) and their reasons for declining (n = 65). Altogether, 22.9% of authors were concerned about certain aspects of our request (Fig. 8). Authors of non-open access publications (W = 4.6; β = 0.49 ± 0.23; Padj = 0.064) and in the field of humanities (W = 9.7; β = 1.11 ± 0.36; Padj = 0.004) expressed some type of concern or request relatively more often (Table S2). In particular, researchers in the fields of humanities (W = 15.2; β = 1.36 ± 0.35; Padj < 0.001), materials for energy and catalysis (W = 6.4; β = 0.65 ± 0.26; Padj = 0.022) and ecology (W = 5.6; β = 0.81 ± 0.34; Padj = 0.036) were more concerned about the study’s specific purpose than researchers on average.
Data sharing was declined by 33.0% of the 188 established contacts. When we specifically inquired about the reasons, the lack of time to search for data (29.2%), loss of data (27.7%) and privacy or legal concerns (23.1%) were most commonly indicated by the authors (Fig. 8), whereas no specific answer was provided by 10.8% of authors. According to the best binomial models (Table S2), social scientists indicated data loss more commonly than other researchers (W = 10.9; β = 1.04 ± 0.32; Padj = 0.003) and psychologists pointed most commonly to legal or privacy issues (W = 4.9; β = 0.85 ± 0.38; Padj = 0.078). Declining due to legal issues became increasingly important in more recent publications (days since 01.01.2000: W = 7.2; β = 0.07 ± 0.03; Padj = 0.035). The lack of time to search tended to be more common for older studies (W = 4.0; β = 0.73 ± 0.37; Padj = 0.135).
Data storage options and citations. The ways in which the data were released differed greatly among disciplines (Fig. 9), with the most common storage options being supplementary materials on the publisher’s website (62.2% of articles), various data archives (22.3%) and availability upon request from corresponding authors (19.7%). Although 29.8% of articles declared depositing data in multiple sources, no source was indicated for 35.0% of articles. Declaring data availability upon request (n = 172) ranged from 1.0% in psychology to 52.0% in forestry, with greater frequency in earlier studies (days back since 31.09.2019: W = 15.0; β = 0.016 ± 0.004; P < 0.001) and in articles by non-North American corresponding authors (by primary affiliation; W = 5.6; β = 0.23 ± 0.10; P = 0.018). With a few exceptions (three datasets only commercially available, one removed during final acceptance and one
Fig. 6 Authors’ response to data request (n = 199) depending on discipline (blue, declined; orange, ignored; purple, obtained). Bars indicate 95% CI of Sison and Glaz51. Letters above bars indicate statistically significant difference groups in frequency of data availability by each category based on Tukey post-hoc test and Bonferroni correction.
homepage corrupt), all data were successfully located for the other indicated data sources, but only 42.3% of data could be obtained from the authors upon request in practice. This rate is comparable to that of articles with no such statement (38.3%; Chi-square test: P = 0.501).
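A comparison of this kind can be sketched as a 2×2 chi-square test of independence. The counts below are back-calculated from the reported percentages (42.3% of the 172 "upon request" articles); the size of the no-statement group is an assumption, so the resulting P value only approximates the reported 0.501:

```python
# Hedged sketch: chi-square test of retrieval success for articles with
# vs. without an "available upon request" statement. Counts approximate
# the reported percentages; the second row's total is assumed.
from scipy.stats import chi2_contingency

# rows: with statement, without statement
# cols: data obtained, data not obtained
table = [[73, 99],    # 73/172 ≈ 42.4%
         [52, 84]]    # 52/136 ≈ 38.2% (assumed group size)
chi2, p, dof, expected = chi2_contingency(table)
print(round(chi2, 3), round(p, 3))
```

As in the paper, the difference between the two proportions is far from significance, i.e. declaring "data available upon request" conferred no measurable advantage in actually obtaining the data.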
The number of citations to articles ranged from 0.0 to 692.9 per year (median, 23.1). In contrast to the hypothesis that articles with available data accumulate more citations20, general linear modelling revealed no significant effect of initial or final data availability on annual citations. The model demonstrated that the average number of yearly citations was explained by research discipline (F8,855 = 11.2; R2 = 0.105; P < 0.001), data type (F2,855 = 7.0;
Fig. 7 Decay in critical data availability initially (blue circles; n = 672), at the end of a 60-day contacting period (purple circles; n = 672) and upon request from the authors (black circles; n = 310). Fitted curves: initial availability, y = 31.1 + e^(0.0354x) (R = 0.804; P < 0.001); final availability, y = 47.1 + e^(0.0480x) (R = 0.820; P < 0.001); upon-request availability, y = 20.3 + e^(0.0590x) (R = 0.670; P = 0.004).
Fig. 8 Frequency distribution of authors’ (a) concerns and requests (n = 199) and (b) reasons for declining data sharing (n = 67). White bars indicate answers where no concerns or reasons were specified. Categories in (a): none; purpose; seeing results; citing; privacy; authorship; sharing. Categories in (b): no time to search; data lost; data protected by agreements; not specified; privacy; person moved; purpose unclear; more work in progress; person retired; interpretation problematic; need a good reason; bad experience with sharing data; not shared with strangers; person dead; putting on web in progress.
R2 = 0.016; P < 0.001), open access status (F1,855 = 4.5; R2 = 0.005; P = 0.034) and the interaction term between open access and discipline (F8,855 = 2.94; R2 = 0.027; P = 0.003). Post-hoc tests indicated that articles with a dataset as a critical data source were cited on average 6% more than those with an image or model, and open access articles attracted 9% more citations than regular articles. Because of the high variability in citation counts, it was not possible to test the interaction terms with scientific discipline in the current dataset. We speculate that articles in Nature and Science are heavily cited on the basis of their key findings and interpretations, which may mask the few extra citations arising from re-use of the data.
Discussion
Our study uniquely points to differences among scientific disciplines in data availability, both as published along with the article and upon request from the authors. We demonstrate that in several disciplines, such as forestry, materials for energy and catalysis, and psychology, critical data are still unavailable for re-analysis or meta-analysis for more than half of the papers published in Nature and Science in the last decade. These overall figures roughly match those reported for other journals in various research fields8,11,13,22, but exceed the lowest reported values of around 10% available data13,23,24. Fortunately, data availability tends to improve, albeit slowly, in nearly all disciplines (Figs. 3, 7), which confirms recent implications from psychological and ecological journals13,31. Furthermore, the reverse trend we observed in microbiology corroborates the declining metagenomics sequence data availability22. Typically, such large DNA sequence datasets are used to publish tens of articles over many years by the teams producing these data; hence, releasing both raw data and datasets may jeopardise their expectations of priority publishing. The weak discipline-specific differences between Nature and Science (Fig. 2) may be related to how certain subject editors implemented and enforced stringent data sharing policies.
Aer rigorous attempts to contact the authors, data availability increased by one third on average across dis-
ciplines, with full and at least partial availability reaching 70% and 83%, respectively. ese gures are in the
top end of studies conducted thus far8,22 and indicate the relatively superior overall data availability in Science
and Nature compared with other journals. However, the relative rates of data retrieval upon request, decline
sharing data and ignoring the requests were on par with studies covering other journals and specic research
elds10,12,25,26,28. Across 20 years, we identied the overall loss of data at an estimated rate of 3.5% and 5.9% for
initially available data and data eectively available upon request, respectively. is rate of data decay is much less
than 17% year1 previously reported in plant and animal sciences based on a comparable approach24.
While the majority of data are eventually available, it is alarming that less than half of the data clearly stated to be available upon request could be effectively obtained from the authors. Although there may be objective reasons such as force majeure, these results suggest that many authors declaring data availability upon contacting may have abused the publishers’ or funders’ policy that allows statements of data availability upon request as the only means of data sharing. We find that this infringes research ethics and disables fair competition among research groups. Researchers hiding their own data may be in a power position compared with fair players in situations of big data analysis, when they can access all data (including their own), while others have more limited opportunities. Data sharing is also important for securing the possibility to re-analyse and re-interpret unexpected results9,32 and to detect scientific misconduct25,33. More rigorous control of data release would prevent manuscripts with serious issues in sampling design or analytical procedures from being prepared, reviewed and eventually accepted for publication.
Our study uniquely recorded the authors’ concerns and specific requests when negotiating data sharing. Concerns and hesitations about data sharing are understandable because of potential drawbacks and misunderstandings related to data interpretation and priority of publishing17,34 that may outweigh the benefits of recognition and passive participation in broader meta-studies. Nearly one quarter of researchers expressed various concerns or had specific requests depending on the discipline, especially about the specific objectives of our study. Previous studies with questionnaires about hypothetical data sharing unrelated to actual data sharing reveal
Fig. 9 Preferred ways of data storage in articles (n = 875) representing different disciplines (blue, text and supplement; purple, data archive; yellow, authors’ homepage; vermillion, previous publications; grey, museum; black, upon (reasonable) request; white, none declared).
that nancial interests, priority of additional publishing and fear of challenging the interpretations aer data
re-analysis constitute the authors’ major concerns12,35,36. Another study indicated that two thirds of researchers
sharing biomedical data expected to be invited as co-authors upon use of their data37 although this does not
full the authorship criteria6,38. At least partly related to these issues, the reasons for declining data sharing dif-
fered among disciplines: while social scientists usually referred to the loss of data, psychologists most commonly
pointed out ethical/legal issues. Recently published data were, however, more commonly declined due to ethical/
legal issues, which indicates rising concerns about data protection and potential misuse. Although we oered a
possibility to share anonymised data sets, such trimmed data sets were never obtained from the authors, sug-
gesting that ethical issues were not the only reason for data decline. Because research elds strongly diered in
the frequency of no response to data requests, most unanswered requests can be considered declines that avoid
ocial replies, which may harm the authors’ reputation.
Because we did not sample randomly across journals, our interpretations are limited to the journals Nature and Science. Our study across disciplines did not account for the particular academic editor, which may have partly contributed to the differences among research fields and journals. Not all combinations of disciplines, journals and time periods received the intended 25 replicate articles because of the poor representation of certain research fields in the 2000–2009 period. This may have reduced our ability to detect statistically significant differences among the disciplines. We also obtained estimates of the final data availability for only seven out of nine disciplines. Although we excluded the remaining two disciplines from comparisons of initial and final data availability, this may have slightly altered the overall estimates. The process of screening the potentially relevant articles chronologically backwards resulted in overrepresentation of more recent articles in certain relatively popular disciplines, which may have biased comparisons across disciplines. However, the paucity of a residual year effect and year × discipline interaction in the overall models, and of a residual time effect in separate analyses within research fields, indicates minimal bias (Figure S1).
We recorded the concerns and requests of authors who had issues with initial data sharing. Therefore, these responses may be relatively more sceptical than the opinions of the majority of the scientific community publishing in these journals. It is likely that the authors who did not respond had concerns and reasons for declining similar to those who refused data sharing.
Our experience shows that receiving data typically required long email exchanges with the authors, contacting other referred authors or sending a reminder. Obtaining data took on average 15 days, representing a substantial effort for both parties39. This could have been easily avoided by releasing data upon article acceptance. On the other hand, we received tips for analysis, cautions against potential pitfalls and the authors’ informed consent upon contacting. In our experience, more than two thirds of authors need to be contacted to retrieve important metadata, variance estimates or method specifications for meta-analyses40. Thus, contacting the authors may be commonly required to fill gaps in the data, but such extra specifications are easier to provide compared with searching for and converting old datasets into a universally understandable format.
Due to various concerns and tedious data re-formatting and uploading, authors should be better motivated for data sharing41. Data formatting and releasing certainly benefit from clear instructions and support from funders, institutions and publishers. In certain cases, public recognition, such as badges of open data for articles following the best data sharing practices, and increasing numbers of citations may promote data release by an order of magnitude42. Citable data papers are certainly another way forward43,44, because these provide access to a well-organised dataset and add to the authors’ publication record. Encouraging the listing of published datasets with download and citation metrics in grant and job applications alongside other bibliometric indicators should promote data sharing. Relating released data to publicly available research accounts such as ORCID, ResearcherID and Google Scholar would benefit authors, other researchers and evaluators alike. To account for many authors’ fear of data theft17 and to prioritise the publishing options of data owners, setting a reasonable embargo period for third-party publishing may be needed in specific cases such as immediate data release following data generation45 and dissertations.
All funders, research institutions, researchers, editors and publishers should collectively contribute to turning
data sharing into a win-win situation for all parties and the scientific endeavour in general. Funding agencies may
have a key role here because they lack conflicting interests and can allocate funds specifically to the depositing
and publishing of huge data files46. Funders also have efficient enforcement mechanisms during reporting periods,
with the option of refusing extensions or withholding approval of forthcoming grant applications. We advocate that
funders include published datasets, where relevant, as an evaluation criterion alongside other bibliometric information.
Research institutions may follow the same principles when issuing institutional grants and employing research staff.
Institutions should also insist that their employees follow open data policies45.
Academic publishers also have a major role in shaping data sharing policies. Although the deposition and
maintenance of data incur extra costs to commercial publishers, they should promote data deposition on their servers
or in public repositories. One option is to hire dedicated data editors to evaluate data availability in supplementary
materials or online repositories and to withhold final publication until the data are fully available in a relevant
format47. For efficient handling, clear instructions and a machine-readable data availability statement option (with
a QR code or link to the data) should be provided. In non-open access journals, the data should be accessible
free of charge or at a reduced price to unsubscribed users. Creating dedicated data journals or 'data paper' formats
may promote the publishing and sharing of data that would otherwise pile up in the drawer because of disappointing
results or a lack of time for preparing a regular article. The leading scientometrics platforms Clarivate Analytics,
Google Scholar and Scopus should index data journals equally with regular journals to motivate researchers
to publish their data. There should be a possibility of article withdrawal by the publisher if the data availability
statements are incorrect or the data have been removed post-acceptance30. Much of this workload should fall on
the editors, who in most cases are paid by the supporting association, institution or publisher. The editors should
grant the referees access to these data during the reviewing process48, requesting from them a second opinion about
data availability and the reasons for declining to share. Similarly stringent data sharing policies are increasingly
implemented by various journals26,30,47.
In conclusion, data availability in top scientific journals differs strongly by discipline, but it is improving in
most research fields. As our study exemplifies, the 'data availability upon request' model is insufficient to ensure
access to datasets and other critical materials. Considering the overall data availability patterns and the authors'
concerns and reasons for declining data sharing, we advocate that (a) data release costs ought to be covered by funders;
(b) shared data and the associated bibliometric records should be included in the evaluation of job and grant
applications; and (c) data sharing enforcement should be led by both funding agencies and academic publishers.
Materials and Methods
Data collection. To assess differences in data availability across research disciplines, we focused our
study on Nature and Science, two high-impact, general-interest journals that practise relatively stringent data
availability policies49. Because of major changes in public attitudes and journal policies concerning data sharing,
our survey focused on two study periods, 2000–2009 and 2010–2019. We selected nine scientific disciplines
as defined by the Springer Nature publishing group - biomaterials and biotechnology, ecology, forestry,
humanities, materials for energy and catalysis, microbiology, optics and photonics, psychology and social
sciences (see Table S1 for details) - based on their coverage in Nature and Science and on data-driven research.
These nine disciplines were also selected based on the competence of our team and the objective to cover research
fields as different as possible, spanning the natural sciences, social sciences and humanities. The
articles were searched by discipline, keywords and/or manual browsing as follows. For Nature, our search was
refined as https://www.nature.com/search?order=date_desc&journal=nature&article_type=research&subject=microbiology&date_range=2010-2019 (italicised parts varied). For Science, the corresponding search
string was the following: https://search.sciencemag.org/?searchTerm=microbiology&order=newest&limit=textFields&pageSize=10&startDate=2010-01-01&endDate=2019-08-31&articleTypes=Research%20and%20reviews&source=sciencemag%7CScience. In both journals, the articles were retrieved by browsing search results
chronologically backwards from September 2019 or September 2009 until reaching 25 articles matching the criteria.
When the number of suitable articles was insufficient, we searched using additional discipline-specific
keywords in the title and browsed all issues manually when necessary. In some research fields, 25 articles could
not be found for all journal and time period combinations; therefore, data availability was evaluated for 875
articles in total (TableS1). In each article, we identied a specic analysis or result that was critical for the main
conclusion of that study based on both the authors’ emphasis and our subjective assessment. We determined
whether the underlying data of these critical results - datasets, images (including videos), models (including
scripts, programs and analytical procedures) or physical items - are available in the main text, supplementary
materials or other indicated sources such as specic data repositories, authors’ homepages, museums, or upon
request to the corresponding author (FigureS2). When available, we downloaded these data, checked for relevant
metadata, identiers and other components, and evaluated whether it is theoretically possible to repeat these
specic analyses and include these materials in a eld-specic metastudy. For example, in the case of a dataset, we
evaluated the data table for the presence of relevant metadata and sample codes necessary to perform the analysis;
for any statistical procedure, the authors must have used such a data table in their original work. We considered
the data to be too raw if these either required a large amount of work (other than common data transformations)
to generate the data table or model, or we had doubts whether the same data table can be reproduced with the
methods described. Raw high-throughput sequencing data are typical examples of incomplete datasets, because
these usually lack necessary metadata and require a thorough bioinformatics analysis, with the output depending
on soware and selected options. For further examples, certain optical raw images or videos make no sense with-
out expert ltering, and computer scripts are of limited use without thorough instructions.
If these critical data were unavailable or only partly available (i.e., missing some integral metadata, instructions
or explanations), we contacted the first corresponding author, or the relevant author referred to in relation to access
to the specific item, requesting the data for a meta-study using a pre-defined format and an institutional email
address (Item S1). In the email, we carefully specified the materials required to reproduce a particular figure or table,
to avoid confusion and to avoid upsetting the authors with a vague request. We indicated that the data were intended
for a meta-study on a related topic, to test the authors' willingness to share the data for actual use rather than merely
their intention to share for no concrete purpose. We evaluated the received data for integrity in the same way and
requested further information, if necessary, to meet the standards. We also recorded the responses of corresponding
authors to data requests, including any specific requests, concerns and reasons for declining (Item S1).
The authors were mostly contacted early in the week, and two reminders were sent ca. 14 and 28 days later if
necessary (Item S1). The reminders were also addressed to other corresponding authors where relevant. If emails
were returned with an error message, we contacted other corresponding authors or used an updated email address
found on the internet or in newer publications. We considered 60 days from sending the first email a reasonable
period for the authors to locate and send the requested data4.
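The contact protocol above reduces to three fixed offsets from the first email. A minimal sketch, with the function name and dictionary keys ours rather than from the paper:

```python
from datetime import date, timedelta

def contact_schedule(first_email: date) -> dict:
    """Key dates of the data-request protocol described in the text:
    two reminders ca. 14 and 28 days after the first email, and a
    60-day window before a request is recorded as unanswered."""
    return {
        "reminder_1": first_email + timedelta(days=14),
        "reminder_2": first_email + timedelta(days=28),
        "deadline": first_email + timedelta(days=60),
    }

# Authors were mostly contacted early in the week, e.g. a Monday:
schedule = contact_schedule(date(2020, 1, 6))
print(schedule["deadline"])
```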
For each article, we recorded the details of publication (date printed, journal, discipline), the corresponding
authors (number, country of first affiliation, acquaintance with the contact author) and the data (availability, type,
means of access)50. Data complexity was evaluated based on the relative amount of extra work required of the authors
to polish the raw data (e.g. low-complexity data include raw DNA sequence data, raw images and artefacts;
high-complexity data include bioinformatics-treated molecular datasets, noise-removed images, models and scripts).
As of 23.03.2020, we recorded the open access status and number of citations for each article using searches in the
ISI Web of Science (https://apps.webofknowledge.com/). The citation count was expressed as citations per year,
discounting the first 90 days after publication, when articles initially accrue fewer citations.
Data analysis. e principal aim of this study was to determine the relative importance of scientic discipline
and time period on data availability and authors’ concerns in response to data sharing requests, by accounting
for multiple potentially important covariates (Fig.1). e response variables, i.e. initial and nal data availability
(none, partly or fully available), author’s responses (ignored, data shared or declined), concerns and reasons
for decline, exhibit multinomial distribution50 and were hence transformed to dummy variables. Similarly, the
multi-level explanatory variables (discipline, topic overlap, countries and continents of corresponding authors,
data type and complexity) were transformed to dummies, whereas continuous variables (linear time, number of
citations, time to obtain data, number of corresponding authors) were square root- or logarithm-transformed
where appropriate. All analyses were performed in STATISTICA 12 (StatSo Inc., Tulsa, OK, USA).
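The variable preparation described above can be sketched in a few lines; the rows below are invented for illustration (the real per-article records are in the plutoF dataset50), and the helper name is ours:

```python
import math

def to_dummies(values):
    """Expand one multinomial variable into 0/1 dummy variables,
    one per level (as done for availability, discipline, etc.)."""
    levels = sorted(set(values))
    return {f"is_{lvl}": [int(v == lvl) for v in values] for lvl in levels}

availability = ["none", "partly", "fully", "none"]  # toy response rows
dummies = to_dummies(availability)

citations = [3, 120, 15, 48]                        # toy continuous covariate
citations_sqrt = [math.sqrt(c) for c in citations]  # square root transform
citations_log = [math.log(c) for c in citations]    # logarithm transform

print(dummies["is_none"])  # [1, 0, 0, 1]
```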
Data analysis of the dummy-transformed multinomial and binomial variables was performed using stepwise
logistic regression model selection (binomial error distribution, logit link), with the corrected Akaike information
criterion (AICc) as the selection criterion and Somers' D statistic and model determination coefficients (R2) as
measures of overall goodness of fit. Determination coefficients and Wald's W statistic were used to estimate the
relative importance of explanatory variables. We calculated 95% confidence intervals for multiple proportions51
using the R package MultinomialCI (https://rdrr.io/cran/MultinomialCI/). The inflation of false positives associated
with multiple comparisons was accounted for by Bonferroni correction of P-values (expressed as Padj) where appropriate.
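The Bonferroni adjustment used for Padj above is simple enough to state explicitly; a minimal sketch with our own naming:

```python
def bonferroni(p_values):
    """Bonferroni-adjusted P-values (Padj): each raw P-value is
    multiplied by the number of comparisons and capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

print(bonferroni([0.01, 0.04, 0.20]))
```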
Models with continuous response variables (proportion of available data, annual citations, time to receive the
data) were tested using general linear models in two steps. First, model selection included only dummy and
continuous explanatory variables. Then, the multilevel categorical predictors corresponding to significant dummies,
as well as the significant continuous variables, were included in the final model selection based on forward selection.
To check for potential biases related to the article selection procedure in the two periods, we tested the effect of
discipline, period and year and all their interaction terms on initial data availability while retaining all variables in
the model (Figure S1). Differences between factor levels were tested using Tukey post-hoc tests for unequal sample
sizes, which account for multiple testing.
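The AICc used as the selection criterion in these models follows a standard closed form; the sketch below is our own implementation of that formula, not the STATISTICA internals:

```python
def aicc(log_likelihood: float, k: int, n: int) -> float:
    """Corrected Akaike information criterion:
    AICc = -2*logL + 2k + 2k(k+1)/(n-k-1),
    where k is the number of estimated parameters and n the sample size."""
    aic = -2.0 * log_likelihood + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)

# The small-sample penalty shrinks as n grows relative to k:
print(aicc(-100.0, 3, 30), aicc(-100.0, 3, 875))
```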
Data availability
The entire dataset is available in spreadsheet format in the plutoF data repository50.
Code availability
No specific code was generated for the analysis of these data.
Received: 11 December 2020; Accepted: 29 June 2021;
Published: xx xx xxxx
References
1. Fan, J. et al. Challenges of big data analysis. Nat. Sci. Rev. 1, 293–314 (2014).
2. Kitchin, R. The data revolution: Big data, open data, data infrastructures and their consequences. (Sage Publications, London, 2014).
3. Science Staff. Challenges and opportunities. Science 331, 692–693 (2011).
4. Cech, T. R. et al. Sharing publication-related data and materials: responsibilities of authorship in the life sciences. National Academies Press, Washington, D.C. (2003).
5. Fischer, B. A. & Zigmond, M. J. The essential nature of sharing in science. Sci. Engineer. Ethics 16, 783–799 (2010).
6. Duke, C. S. & Porter, J. H. The ethics of data sharing and reuse in biology. BioScience 63, 483–489 (2013).
7. Fienberg, S. E. et al. Sharing Research Data. National Academy Press, Washington, D.C. (1985).
8. Begley, C. G. & Ioannidis, J. P. Reproducibility in science: improving the standard for basic and preclinical research. Circul. Res. 116, 116–126 (2015).
9. Open Science Collaboration. Estimating the reproducibility of psychological science. Science 349, aac4716 (2015).
10. Hardwicke, T. E. & Ioannidis, J. P. Populating the Data Ark: An attempt to retrieve, preserve, and liberate data from the most highly-cited psychology and psychiatry articles. PLoS One 13, e0201856 (2018).
11. Roche, D. G. et al. Public data archiving in ecology and evolution: how well are we doing? PLoS Biol. 13, e1002295 (2015).
12. Tenopir, C. et al. Changes in data sharing and data reuse practices and perceptions among scientists worldwide. PLoS One 10, e0134826 (2015).
13. Hardwicke, T. E. et al. Data availability, reusability, and analytic reproducibility: Evaluating the impact of a mandatory open data policy at the journal Cognition. R. Soc. Open Sci. 5, 180448 (2018).
14. Witwer, K. W. Data submission and quality in microarray-based microRNA profiling. Clin. Chem. 59, 392–400 (2013).
15. Stuart, D. et al. Whitepaper: Practical challenges for researchers in data sharing. figshare https://doi.org/10.6084/m9.figshare.5975011 (2018).
16. Borgman, C. L. Scholarship in the digital age: Information, infrastructure, and the Internet. MIT Press, Cambridge (2010).
17. Longo, D. L. & Drazen, J. M. Data sharing. New England J. Med. 375, 276–277 (2016).
18. Lewandowsky, S. & Bishop, D. Research integrity: Don't let transparency damage science. Nature 529, 459–461 (2016).
19. Colavizza, G. et al. The citation advantage of linking publications to research data. PLoS One 15, e0230416 (2020).
20. Piwowar, H. A. et al. Sharing detailed research data is associated with increased citation rate. PLoS One 2, e308 (2007).
21. Hicks, D. et al. Bibliometrics: the Leiden Manifesto for research metrics. Nature 520, 429–431 (2015).
22. Eckert, E. M. et al. Every fifth published metagenome is not available to science. PLoS Biol. 18, e3000698 (2020).
23. Sherry, C. et al. Assessment of transparent and reproducible research practices in the psychiatry literature. Preprint at https://osf.io/jtcr/download (2019).
24. Vines, T. H. et al. The availability of research data declines rapidly with article age. Curr. Biol. 24, 94–97 (2014).
25. Wicherts, J. M. et al. The poor availability of psychological research data for reanalysis. Am. Psychol. 61, 726–728 (2006).
26. Vines, T. H. et al. Mandated data archiving greatly improves access to research data. FASEB J. 27, 1304–1308 (2013).
27. Krawczyk, M. & Reuben, E. (Un)available upon request: Field experiment on researchers' willingness to share supplementary materials. Account. Res. 19, 175–186 (2012).
28. Vanpaemel, W. et al. Are we wasting a good crisis? The availability of psychological research data after the storm. Collabra 1, 1–5 (2015).
29. Grant, R. & Hrynaszkiewicz, I. The impact on authors and editors of introducing data availability statements at Nature journals. Int. J. Digit. Curat. 13, 195–203 (2018).
30. Hrynaszkiewicz, I. et al. Developing a research data policy framework for all journals and publishers. Data Sci. J. 19, 5 (2020).
31. Wallach, J. D. et al. Reproducible research practices, transparency, and open access data in the biomedical literature, 2015–2017. PLoS Biol. 16, e2006930 (2018).
32. Kraus, W. L. Do you see what I see? Quality, reliability, and reproducibility in biomedical research. Mol. Endocrinol. 28, 277–280 (2014).
33. Wicherts, J. M. et al. Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLoS One 6, e26828 (2011).
34. Wallis, J. C., Rolando, E. & Borgman, C. L. If we share data, will anyone use them? Data sharing and reuse in the long tail of science and technology. PLoS One 8, e67332 (2013).
35. Blumenthal, D. et al. Withholding research results in academic life science. JAMA 277, 1224–1228 (1997).
36. Kim, Y. & Stanton, J. M. Institutional and individual influences on scientists' data sharing practices. J. Comput. Sci. Edu. 3, 47–56 (2013).
37. Federer, L. M. et al. Biomedical data sharing and reuse: Attitudes and practices of clinical and scientific research staff. PLoS One 10, e0129506 (2015).
38. Patience, G. S. et al. Intellectual contributions meriting authorship: Survey results from the top cited authors across all science categories. PLoS One 14, e0198117 (2019).
39. Volk, C., Lucero, Y. & Barnas, K. Why is data sharing in collaborative natural resource efforts so hard and what can we do to improve it? Environ. Manage. 53, 883–893 (2014).
40. Tedersoo, L. et al. Towards global patterns in the diversity and community structure of ectomycorrhizal fungi. Mol. Ecol. 21, 4160–4170 (2012).
41. Reichman, O. J. et al. Challenges and opportunities of open data in ecology. Science 331, 703–705 (2011).
42. Kidwell, M. C. et al. Badges to acknowledge open practices: A simple, low-cost, effective method for increasing transparency. PLoS Biol. 14, e1002456 (2016).
43. Candela, L., Castelli, D., Manghi, P. & Tani, A. Data journals: a survey. J. Ass. Inform. Sci. Technol. 66, 1747–1762 (2015).
44. Callaghan, S. et al. Making data a first class scientific output: data citation and publication by NERC's Environmental Data Centres. Int. J. Digit. Curat. 7, 107–113 (2012).
45. Dyke, S. O. & Hubbard, T. J. Developing and implementing an institute-wide data sharing policy. Genome Med. 3, 1–8 (2011).
46. Heidorn, P. B. Shedding light on the dark data in the long tail of science. Libr. Trends 57, 280–299 (2008).
47. Langille, M. G. et al. "Available upon request": not good enough for microbiome data! Microbiome 6, 8 (2018).
48. Morey, R. D. et al. The Peer Reviewers' Openness Initiative: incentivizing open research practices through peer review. R. Soc. Open Sci. 3, 150547 (2016).
49. Alsheikh-Ali, A. A. et al. Public availability of published research data in high-impact journals. PLoS One 6, e24357 (2011).
50. Tedersoo, L. et al. Data sharing across disciplines: 'available upon request' holds no promise. University of Tartu, Institute of Ecology and Earth Sciences https://doi.org/10.15156/BIO/1359426 (2021).
51. Sison, C. P. & Glaz, J. Simultaneous confidence intervals and sample size determination for multinomial proportions. J. Am. Stat. Ass. 90, 366–369 (1995).
Acknowledgements
We thank all authors who released their data along with their article or responded to our data request. Although
some of the obtained datasets are used in a series of meta-analyses or released by us upon agreement, we apologise
to the authors who spent a significant amount of time providing data that we cannot use for secondary
analyses. We thank A. Kahru, T. Soomere, Ü. Niinemets and J. Allik for their constructive comments on an earlier
version of the manuscript.
Author contributions
All authors contributed to study design, work with literature and writing. L.T. analysed data and led writing.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41597-021-00981-0.
Correspondence and requests for materials should be addressed to L.T.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2021
... The researchers should thus store data in an understandable format for future sharing. Researchers should also be motivated to release their data with benefits such as recognition, or possibly bonus points in grant and job applications [37]. Data management costs should also be included in funding research. ...
... However, obtaining and understanding the shared data can be challenging for external researchers, as access is not always straightforward, and the dataset may lack clarity or documentation 4,5 . Indeed, requests for data are ignored in 14-41% of cases, while success is achieved in only 27-59% of instances 6,7 . ...
Article
Full-text available
Facilitating data sharing in scientific research, especially in the domain of animal studies, holds immense value, particularly in mitigating distress and enhancing the efficiency of data collection. This study unveils a meticulously curated collection of neural activity data extracted from six electrophysiological datasets recorded from three parietal areas (V6A, PEc, PE) of two Macaca fascicularis during an instructed-delay foveated reaching task. This valuable resource is now accessible to the public, featuring spike timestamps, behavioural event timings and supplementary metadata, all presented alongside a comprehensive description of the encompassing structure. To enhance accessibility, data are stored as HDF5 files, a convenient format due to its flexible structure and the capability to attach diverse information to each hierarchical sub-level. To guarantee ready-to-use datasets, we also provide some MATLAB and Python code examples, enabling users to quickly familiarize themselves with the data structure.
... Second, a manual screening of all identified instances of open software codes was performed to remove all papers that did not provide such materials. During this screening, all instances of codes to be made available at a future date or "upon request" were also excluded since these latter requests often go unfulfilled (see, e.g., Tedersoo et al. 2021;Krähmer, Schächtele, and Schneck 2023). A similar approach, based on 5 keywords and without a second-stage verification, is employed by Alexander (2022) to detect the availability of either data or code, or both, in Demography papers since 2011. ...
... Many researchers, however, are actively sharing their data under data sharing requirements from journals. Several groups have examined the contents of data availability statements to understand how researchers are describing their data sharing [13][14][15][16][17]. Despite the prevalence of data availability statements, these studies found that researchers often failed to meet data sharing requirements for many reasons, including: a complete lack of data availability information; sharing inappropriately, such as by request or putting data into supplementary information; not sharing in a data repository; and sharing limited data instead of all data supporting the article. ...
Article
Full-text available
To determine where data is shared and what data is no longer available, this study analyzed data shared by researchers at a single university. 2166 supplemental data links were harvested from the university’s institutional repository and web scraped using R. All links that failed to scrape or could not be tested algorithmically were tested for availability by hand. Trends in data availability by link type, age of publication, and data source were examined for patterns. Results show that researchers shared data in hundreds of places. About two-thirds of links to shared data were in the form of URLs and one-third were DOIs, with several FTP links and links directly to files. A surprising 13.4% of shared URL links pointed to a website homepage rather than a specific record on a website. After testing, 5.4% the 2166 supplemental data links were found to be no longer available. DOIs were the type of shared link that was least likely to disappear with a 1.7% loss, with URL loss at 5.9% averaged over time. Links from older publications were more likely to be unavailable, with a data disappearance rate estimated at 2.6% per year, as well as links to data hosted on journal websites. The results support best practice guidance to share data in a data repository using a permanent identifier.
... Synthesis of open (publicly archived, free to reuse) data is a powerful tool that is increasingly being used to test pressing questions in ecology and evolution. However, it remains common for valuable datasets to be forgotten after a single use [1][2][3] . This is a missed opportunity and hinders scientific progress. ...
Article
Genetic and genomic data are collected for a vast array of scientific and applied purposes. Despite mandates for public archiving, data are typically used only by the generating authors. The reuse of genetic and genomic datasets remains uncommon because it is difficult, if not impossible, due to non-standard archiving practices and lack of contextual metadata. But as the new field of macrogenetics is demonstrating, if genetic data and their metadata were more accessible and FAIR (findable, accessible, interoperable and reusable) compliant, they could be reused for many additional purposes. We discuss the main challenges with existing genetic and genomic data archives, and suggest best practices for archiving genetic and genomic data. Recognizing that this is a longstanding issue due to little formal data management training within the fields of ecology and evolution, we highlight steps that research institutions and publishers could take to improve data archiving.
Article
Full-text available
Experts from 18 consortia are collaborating on the Human Reference Atlas (HRA) which aims to map the 37 trillion cells in the healthy human body. Information relevant for HRA construction and usage is held by experts, published in scholarly papers, and captured in experimental data. However, these data sources use different metadata schemas and cannot be cross-searched efficiently. This paper documents the compilation of a dataset, named HRAlit, that links the 136 HRA v1.4 digital objects (31 organs with 4,279 anatomical structures, 1,210 cell types, 2,089 biomarkers) to 583,117 experts; 7,103,180 publications; 896,680 funded projects, and 1,816 experimental datasets. The resulting HRAlit has 22 tables with 20,939,937 records including 6 junction tables with 13,170,651 relationships. The HRAlit can be mined to identify leading experts, major papers, funding trends, or alignment with existing ontologies in support of systematic HRA construction and usage.
Article
Full-text available
Datasets consist of measurement data and metadata. Metadata provides context, essential for understanding and (re-)using data. Various metadata standards exist for different methods, systems and contexts. However, relevant information resides at differing stages across the data-lifecycle. Often, this information is defined and standardized only at publication stage, which can lead to data loss and workload increase. In this study, we developed Metadatasheet, a metadata standard based on interviews with members of two biomedical consortia and systematic screening of data repositories. It aligns with the data-lifecycle, allowing synchronous metadata recording within Microsoft Excel, widely used data-recording software. Additionally, we provide an implementation, the Metadata Workbook, that offers user-friendly features like automation, dynamic adaption, metadata integrity checks, and export options for various metadata standards. By design and due to its extensive documentation, the proposed metadata standard simplifies recording and structuring of metadata for biomedical scientists, promoting practicality and convenience in data management. This framework can accelerate scientific progress by enhancing collaboration and knowledge transfer throughout the intermediate steps of data creation.
Article
Full-text available
Efforts to make research results open and reproducible are increasingly reflected by journal policies encouraging or mandating authors to provide data availability statements (DAS). As a consequence of this, there has been a strong uptake of data availability statements in recent literature. Nevertheless, it is still unclear what proportion of these statements actually contain well-formed links to data, for example via a URL or permanent identifier, and if there is an added value in providing such links. We consider 531,889 journal articles published by PLOS and BMC, develop an automatic system for labelling their data availability statements according to four categories based on their content and the type of data availability they display, and finally analyze the citation advantage of different statement categories via regression. We find that, following mandated publisher policies, data availability statements become very common. In 2018, 93.7% of 21,793 PLOS articles and 88.2% of 31,956 BMC articles had data availability statements. Data availability statements containing a link to data in a repository—rather than being available on request or included as supporting information files—are a fraction of the total. In 2017 and 2018, 20.8% of PLOS publications and 12.2% of BMC publications provided a DAS containing a link to data in a repository. We also find an association between articles that include statements that link to data in a repository and up to 25.36% (± 1.07%) higher citation impact on average, using a citation prediction model. We discuss the potential implications of these results for authors (researchers) and journal publishers who make the effort of sharing their data in repositories. All our data and code are made available in order to reproduce and extend our results.
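The labelling step this abstract describes can be approximated with a simple keyword heuristic. The study built a proper automatic classification system, so the regex patterns and category names below are illustrative assumptions for demonstration, not the paper's classifier:

```python
import re

# Ordered rules: the first matching pattern wins. Patterns and category
# names are assumptions loosely modelled on common DAS phrasings.
CATEGORIES = [
    ("repository link",
     re.compile(r"doi\.org|dryad|zenodo|figshare|osf\.io|github\.com", re.I)),
    ("upon request",
     re.compile(r"\b(upon|on)\s+(reasonable\s+)?request\b", re.I)),
    ("supporting information",
     re.compile(r"support(ing|ed)\s+information|supplementary", re.I)),
]

def label_das(statement: str) -> str:
    """Assign a data availability statement to one of four coarse categories."""
    for name, pattern in CATEGORIES:
        if pattern.search(statement):
            return name
    return "other"
```

Running such a labeller over each article's DAS, then regressing citation counts on the resulting category, mirrors the two-stage analysis (classification, then citation-advantage regression) the abstract outlines.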
Article
Full-text available
Have you ever sought to use metagenomic DNA sequences reported in scientific publications? Were you successful? Here, we reveal that metagenomes from no fewer than 20% of the papers found in our literature search, published between 2016 and 2019, were not deposited in a repository or were simply inaccessible. The proportion of inaccessible data within the literature has been increasing year-on-year. Noncompliance with Open Data is best predicted by the scientific discipline of the journal. The number of citations, journal type (e.g., Open Access or subscription journals), and publisher are not good predictors of data accessibility. However, many publications in high-impact factor journals do display a higher likelihood of accessible metagenomic data sets. Twenty-first century science demands compliance with the ethical standard of data sharing of metagenomes and DNA sequence data more broadly. Data accessibility must become one of the routine and mandatory components of manuscript submissions, a requirement that should be applicable across the increasing number of disciplines using metagenomics. Compliance must be ensured and reinforced by funders, publishers, editors, reviewers, and, ultimately, the authors.
Article
Full-text available
An output of the Data Policy Standardisation and Implementation Interest Group (IG) of the Research Data Alliance (RDA). More journals and publishers – and funding agencies and institutions – are introducing research data policies. But as the prevalence of policies increases, there is potential to confuse researchers and support staff with numerous or conflicting policy requirements. We define and describe 14 features of journal research data policies and arrange these into a set of six standard policy types or tiers, which can be adopted by journals and publishers to promote data sharing in a way that encourages good practice and is appropriate for their audience’s perceived needs. Policy features include coverage of topics such as data citation, data repositories, data availability statements, data standards and formats, and peer review of research data. These policy features and types have been created by reviewing the policies of multiple scholarly publishers, which collectively publish more than 10,000 journals, and through discussions and consensus building with multiple stakeholders in research data policy via the Data Policy Standardisation and Implementation Interest Group of the Research Data Alliance. Implementation guidelines for the standard research data policies for journals and publishers are also provided, along with template policy texts which can be implemented by journals in their Information for Authors and publishing workflows. We conclude with a call for collaboration across the scholarly publishing and wider research community to drive further implementation and adoption of consistent research data policies.
Article
Full-text available
Background Reproducibility is a cornerstone of scientific advancement; however, many published works may lack the core components needed for study reproducibility. Aims In this study, we evaluate the state of transparency and reproducibility in the field of psychiatry using specific indicators as proxies for these practices. Methods An increasing number of publications have investigated indicators of reproducibility, including research by Hardwicke et al., from which we based the methodology for our observational, cross-sectional study. From a random 5-year sample of 300 publications in PubMed-indexed psychiatry journals, two researchers extracted data in a duplicate, blinded fashion using a piloted Google form. The publications were examined for indicators of reproducibility and transparency, which included availability of: materials, data, protocol, analysis script, open-access, conflict of interest, funding and online preregistration. Results This study ultimately evaluated 296 randomly-selected publications with a 3.20 median impact factor. Only 107 were available online. Most primary authors originated from the USA, UK and the Netherlands. The top three publication types were cohort studies, surveys and clinical trials. Regarding indicators of reproducibility, 17 publications gave access to necessary materials, four provided an in-depth protocol and one contained the raw data required to reproduce the outcomes. One publication offered its analysis script on request; four provided a protocol availability statement. Only 107 publications were publicly available: 13 were registered in online repositories and four, ten and eight publications included their hypothesis, methods and analysis, respectively. Conflict of interest was addressed by 177 and reported by 31 publications. Of 185 publications with a funding statement, 153 publications were funded and 32 were unfunded.
Conclusions Currently, Psychiatry research has significant potential to improve adherence to reproducibility and transparency practices. Thus, this study presents a reference point for the state of reproducibility and transparency in Psychiatry literature. Future assessments are recommended to evaluate and encourage progress.
Article
Full-text available
Authorship is the currency of an academic career for which the number of papers researchers publish demonstrates creativity, productivity, and impact. To discourage coercive authorship practices and inflated publication records, journals require authors to affirm and detail their intellectual contributions, but this strategy has been unsuccessful as authorship lists continue to grow. Here, we surveyed close to 6000 of the top cited authors in all science categories with a list of 25 research activities that we adapted from the National Institutes of Health (NIH) authorship guidelines. Responses varied widely from individuals in the same discipline, same level of experience, and same geographic region. Most researchers agreed with the NIH criteria and grant authorship to individuals who draft the manuscript, analyze and interpret data, and propose ideas. However, thousands of the researchers also value supervision and contributing comments to the manuscript, whereas the NIH recommends discounting these activities when attributing authorship. People value the minutiae of research beyond writing and data reduction: researchers in the humanities value these activities less than those in pure and applied sciences; individuals from Far East Asia and the Middle East and Northern Africa value them more than anglophones and northern Europeans. While developing national and international collaborations, researchers must recognize differences in people's values when assigning authorship.
Article
Full-text available
This article describes the adoption of a standard policy for the inclusion of data availability statements in all research articles published at the Nature family of journals, and the subsequent research which assessed the impacts that these policies had on authors, editors, and the availability of datasets. The key findings of this research project include the determination of average and median times required to add a data availability statement to an article; and a correlation between the way researchers make their data available, and the time required to add a data availability statement.
Article
Full-text available
Currently, there is a growing interest in ensuring the transparency and reproducibility of the published scientific literature. According to a previous evaluation of 441 biomedical journals articles published in 2000–2014, the biomedical literature largely lacked transparency in important dimensions. Here, we surveyed a random sample of 149 biomedical articles published between 2015 and 2017 and determined the proportion reporting sources of public and/or private funding and conflicts of interests, sharing protocols and raw data, and undergoing rigorous independent replication and reproducibility checks. We also investigated what can be learned about reproducibility and transparency indicators from open access data provided on PubMed. The majority of the 149 studies disclosed some information regarding funding (103, 69.1% [95% confidence interval, 61.0% to 76.3%]) or conflicts of interest (97, 65.1% [56.8% to 72.6%]). Among the 104 articles with empirical data in which protocols or data sharing would be pertinent, 19 (18.3% [11.6% to 27.3%]) discussed publicly available data; only one (1.0% [0.1% to 6.0%]) included a link to a full study protocol. Among the 97 articles in which replication in studies with different data would be pertinent, there were five replication efforts (5.2% [1.9% to 12.2%]). Although clinical trial identification numbers and funding details were often provided on PubMed, only two of the articles without a full text article in PubMed Central that discussed publicly available data at the full text level also contained information related to data sharing on PubMed; none had a conflicts of interest statement on PubMed. Our evaluation suggests that although there have been improvements over the last few years in certain key indicators of reproducibility and transparency, opportunities exist to improve reproducible research practices across the biomedical literature and to make features related to reproducibility more readily visible in PubMed.
Article
Full-text available
Access to data is a critical feature of an efficient, progressive and ultimately self-correcting scientific ecosystem. But the extent to which in-principle benefits of data sharing are realized in practice is unclear. Crucially, it is largely unknown whether published findings can be reproduced by repeating reported analyses upon shared data (‘analytic reproducibility’). To investigate this, we conducted an observational evaluation of a mandatory open data policy introduced at the journal Cognition. Interrupted time-series analyses indicated a substantial post-policy increase in data availability statements (104/417, 25% pre-policy to 136/174, 78% post-policy), although not all data appeared reusable (23/104, 22% pre-policy to 85/136, 62%, post-policy). For 35 of the articles determined to have reusable data, we attempted to reproduce 1324 target values. Ultimately, 64 values could not be reproduced within a 10% margin of error. For 22 articles all target values were reproduced, but 11 of these required author assistance. For 13 articles at least one value could not be reproduced despite author assistance. Importantly, there were no clear indications that original conclusions were seriously impacted. Mandatory open data policies can increase the frequency and quality of data sharing. However, suboptimal data curation, unclear analysis specification and reporting errors can impede analytic reproducibility, undermining the utility of data sharing and the credibility of scientific findings.
Preprint
Objective: Reproducibility is a cornerstone of scientific advancement; however, many published works may lack the core components needed for study reproducibility. In this study, we evaluate the state of transparency and reproducibility in the field of Psychiatry. Methods: An observational, cross-sectional study design was used. From a random sample of 300 publications in PubMed-indexed psychiatry journals, two researchers extracted data in a duplicate and blinded fashion using a piloted Google Form. For this study, we included publications from January 1, 2014 to December 31, 2018. The publications were evaluated for indicators of reproducibility and transparency, which included the availability of materials, data, protocol, analysis script, preregistration, open access, financial conflicts of interest, funding sources, and pre-registration in an online repository. Results: Our study identified 158 journals meeting the inclusion criteria and 90,281 publications from within the timeframe. Of the 300 randomly sampled, 4 were inaccessible, resulting in a final sample of 296 publications. Of the 296, only 107 (36%) were publicly available online. Regarding reproducibility, 17 publications gave access to necessary materials, 4 provided an in-depth protocol, and 1 contained the raw data required to reproduce the outcomes. Conclusions: Currently, researchers in the field of Psychiatry do not adhere to practices that promote reproducibility and transparency. Change is therefore needed. This study presents a reference point for the state of reproducibility and transparency in psychiatry literature, and future assessments are recommended to evaluate progress.