Data sharing practices and data availability upon request differ across scientific disciplines


Data sharing is one of the cornerstones of modern science that enables large-scale analyses and reproducibility. We evaluated data availability in research articles across nine disciplines in Nature and Science magazines and recorded corresponding authors’ concerns, requests and reasons for declining data sharing. Although data sharing has improved in the last decade and particularly in recent years, data availability and willingness to share data still differ greatly among disciplines. We observed that statements of data availability upon (reasonable) request are inefficient and should not be allowed by journals. To improve data sharing at the time of manuscript acceptance, researchers should be better motivated to release their data with real benefits such as recognition, or bonus points in grant and job applications. We recommend that data management costs should be covered by funding agencies; publicly available research data ought to be included in the evaluation of applications; and surveillance of data sharing should be enforced by both academic publishers and funders. These cross-discipline survey data are available from the plutoF repository.
Leho Tedersoo1,2 ✉, Rainer Küngas1, Ester Oras1,3,4, Kajar Köster1,5, Helen Eenmaa1,6, Äli Leijen1,7, Margus Pedaste7, Marju Raju1,8, Anastasiya Astapova1,9, Heli Lukner1,10, Karin Kogermann1,11 & Tuul Sepp
Technological advances and accumulation of case studies have led many research fields into the era of ‘big data’ - the possibility to integrate data from various sources for secondary analysis, e.g. meta-studies and meta-analyses1,2. Nearly half of the researchers commonly use data generated by other scientists3. Data sharing is a scientific norm and an important part of research ethics in all disciplines, also increasingly endorsed by publishers, funders and the scientific community4–6. Despite decades of argumentation7, much of the published data is still essentially unavailable for integration into secondary data analysis and evaluation of reproducibility, a proxy for reliability8–10. Furthermore, the deposited data may also be incomplete, sometimes intentionally11–14, e.g. in cases where these exhibit mismatching sample codes or lack information about important metadata such as sex and age of studied organisms in biological and social sciences.
Although the vast majority of researchers prefer data sharing12,15, scientists tend to be concerned about losing their priority in future publishing and potential commercial use of their work without their consent or participation12,16,17. Researchers working on human subjects may be bound by legal agreements not to reveal sensitive data16,18. Across research fields, papers indicating available data are cited on average 25% more19. In research using microarrays, papers with access to raw data accumulate on average 69% more citations compared with other articles20. Unfortunately, higher citation rate has not motivated many researchers enough to release their data, although referees and funding agencies account for bibliometrics when evaluating researchers and their proposals21. Multiple case studies have revealed high variation in data availability in different journals and disciplines, ranging from 9 to 76%8,11,13,19,22–24. Data requests to authors are successful in 27–59% of cases, whereas the request is ignored in 14–41% of cases based on previous research10,25–28. To promote access to data, many journals have implemented mandatory data availability statements and require data storage in supplementary materials or specific databases29,30. Because of poor enforcement, this has not always guaranteed access to published data: links break, metadata are lacking, or authors are unwilling to share upon request8,26.
1Estonian Young Academy of Sciences, Kohtu 6, 10130, Tallinn, Estonia. 2Mycology and Microbiology Center, University of Tartu, Ravila 14a, 50411, Tartu, Estonia. 3Institute of Chemistry, University of Tartu, Ravila 14a, 50411, Tartu, Estonia. 4Institute of History and Archaeology, University of Tartu, Jakobi 2, 51005, Tartu, Estonia. 5Department of Forest Sciences, University of Helsinki, PO Box 27 (Latokartanonkaari 7), Helsinki, FI-00014, Finland. 6School of Law, University of Tartu, Näituse 20, 50409, Tartu, Estonia. 7Institute of Education, University of Tartu, Salme 1a, 50103, Tartu, Estonia. 8Department of Musicology, Music Pedagogy and Cultural Management, Estonian Academy of Music and Theatre, Tatari 13, 10116, Tallinn, Estonia. 9Institute for Cultural Research and Fine Arts, University of Tartu, Ülikooli 16, 51003, Tartu, Estonia. 10Institute of Physics, University of Tartu, W. Ostwaldi 1, 50411, Tartu, Estonia. 11Institute of Pharmacy, University of Tartu, Nooruse 1, 50411, Tartu, Estonia. 12Institute of Ecology and Earth Sciences, University of Tartu, Vanemuise 46, 51003, Tartu, Estonia. e-mail:
SCIENTIFIC DATA | (2021) 8:192 |
This study aims to map and evaluate cross-disciplinary differences in data sharing, authors’ concerns and reasons for denying access to data, and whether these decisions are reflected in article citations (Fig. 1). We selected the scholarly articles published in journals Nature and Science because of their multidisciplinary contents, stringent data availability policies outlined in authors’ instructions, and high-impact conclusions derived from the data of exceptional size, accuracy and/or novelty. We hypothesised that in spite of overall improvement in data sharing culture, the actual data availability and reasons for declining the requests to share data depend on scientific disciplines because of field-specific ‘traditions’, ‘sensitivity’ of data, or their economic potential. Our broader goal is to improve data sharing principles and policies among authors, academic publishers and research
Initial and final data availability. We evaluated the availability of most critical data in 875 articles across nine scientific disciplines (Table S1) published in Nature and Science over two 10-year intervals (2000–2009 and 2010–2019) and, in case these data were not available for access, we contacted the authors. The initial (pre-contacting) full and at least partial data availability averaged 54.2% (range across disciplines, 33.0–82.8%) and 71.8% (40.4–100.0%), respectively. Stepwise logistic regression models revealed that initial data availability differed by research field, type of data, journal and publishing period (no vs. full availability: n = 721; Somers’ D = 0.676; R²_model = 0.476; P < 0.001). According to the best model (Table S2), the data were less readily available in materials for energy and catalysis (W = 68.0; β = 1.52 ± 0.19; P < 0.001), psychology (W = 55.6; β = 1.11 ± 0.15; P < 0.001), optics and photonics (W = 18.8; β = 0.59 ± 0.14; P < 0.001) and forestry (W = 9.8; β = 0.52 ± 0.19; P = 0.002) compared with other disciplines, especially humanities (Fig. 2). Data availability was relatively lower in the period of 2000–2009 (W = 82.5; β = 0.57 ± 0.10; P < 0.001) and when the most important data were in the form of a dataset (relative to image/video and model; W = 41.5; β = 1.23 ± 0.19; P < 0.001; Fig. 3). Relatively fewer data were available in Nature (W = 32.7; β = 0.57 ± 0.19; P < 0.001), with striking several-fold differences in optics and photonics (Fig. 2).
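As an aid to reading the fit statistics reported above: for a binary outcome, Somers’ D can be interpreted as a rescaled concordance probability, D = 2c − 1, where c is the share of (available, unavailable) article pairs in which the model assigns the higher predicted probability to the available one (ties counting as half). The Python sketch below illustrates this relationship only; the function names and the toy inputs are hypothetical and are not taken from the analyses of this study.

```python
from itertools import product


def somers_d(outcomes, scores):
    """Somers' D of predicted scores with respect to a binary outcome.

    outcomes: iterable of 0/1 labels (hypothetical coding: 1 = data available)
    scores:   predicted probabilities, in the same order as outcomes
    """
    pos = [s for y, s in zip(outcomes, scores) if y == 1]
    neg = [s for y, s in zip(outcomes, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes must be present")
    # Compare every positive case with every negative case.
    concordant = sum(1 for p, n in product(pos, neg) if p > n)
    discordant = sum(1 for p, n in product(pos, neg) if p < n)
    return (concordant - discordant) / (len(pos) * len(neg))


def concordance_from_d(d):
    """Concordance probability c (area under the ROC curve) implied by D."""
    return (d + 1.0) / 2.0


if __name__ == "__main__":
    # Toy example with made-up predictions: 3 concordant and 1 discordant pair.
    print(somers_d([0, 0, 1, 1], [0.1, 0.6, 0.5, 0.9]))
    # Concordance implied by the D value reported for the initial-availability model.
    print(concordance_from_d(0.676))
```

Under this reading, a D of 0.676 corresponds to a concordance of about 0.84, i.e. the model ranks a randomly chosen article with fully available data above one with no available data in roughly five cases out of six.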
Fig. 1 Schematic rationale of the study.
Fig. 2 Differences in partial (grey) and full (black) data availability among disciplines depending on journal and publishing period (P1, 2000–2009; P2, 2010–2019) before contacting the authors (n = 875). Letters above bars indicate statistically significant difference groups among disciplines in full data availability compared to no data availability. Asterisks show significant differences in full data availability between journals and publishing periods.
Fig. 3 Types of critical data (n = 875). (a) Distribution of data types among disciplines (blue, dataset; purple, image; black, model); (b) Partial (light shades) and full (dark shades) data availability among disciplines depending on the type of critical data (DS, dataset; Img, image; Mod, model) before contacting the author(s).
Upon contacting the authors of 310 papers, the overall data availability was improved by 35.0%. Full and at least partial availability averaged 69.5% (range across disciplines, 57.0–87.9%) and 83.2% (64.9–100.0%), respectively (Fig. 4), 60 days after contacting, a reasonable time frame4. The final data availability (after contacting the authors) was best predicted by scientific discipline, data type and time lapse since publishing (no vs. full availability: n = 580; D = 0.659; R²_model = 0.336; Padj < 0.001; Table S2) but with no major changes in the ranking of disciplines or data types compared with the initial data availability (Fig. 4). It took a median of 15 days to receive data from the authors (Fig. 5), with a minimum time of 13 minutes. Four authors sent their data after the 60-day period since the initial request (max. 107 days). The rate of receiving data was unrelated to any studied parameter.
Authors’ responses to data requests. The data were obtained from the authors in 39.4% of data requests on average, with a range of 27.9–56.1% among research fields. The likelihood of receiving data, the request being declined or ignored depended mostly on the time period and field of research. According to the best model (n = 310; D = 0.300; R²_model = 0.106; Padj < 0.001; Table S2), the data were obtained slightly less frequently for the earlier time period (29.4% vs. 56.0%; W = 20.4; β = 0.56 ± 0.12; Padj < 0.001). Receiving data upon request tended to be lowest in the field of forestry (W = 3.6; β = 0.31 ± 0.16; Padj = 0.177), especially when compared with microbiology (Fig. 2).
Declining the data request averaged 19.4% and it differed most strongly among the research fields. The best model (n = 310; D = 0.508; R²_model = 0.221; Padj < 0.001) revealed that the data were not made available upon request most likely in the fields of social sciences (W = 24.3; β = 1.09 ± 0.22; Padj < 0.001), psychology (W = 20.0; β = 0.73 ± 0.20; Padj < 0.001) and humanities (W = 5.0; β = 0.67 ± 0.30; Padj = 0.078) compared with natural sciences (Fig. 2). Furthermore, the data request was more likely to be declined when the data complexity was high (W = 9.8; β = 0.59 ± 0.19; Padj = 0.005), the paper was not open access in ISI Web of Science (W = 4.0; β = 0.37 ± 0.18; Padj = 0.132) and published in Science rather than Nature (W = 4.6; β = 0.35 ± 0.16; Padj = 0.096), although these two latter figures are non-significant when accounting for multiple testing.
Fig. 4 Differences in partial (grey) and full (black) data availability among disciplines after data requests (n = 672) depending on the type of critical data (DS, dataset; image; model) and publishing period (P1, 2000–2009; P2, 2010–2019). Numbers above bars indicate statistically significant difference groups among disciplines in full data availability.
Fig. 5 Histogram of time for receiving data from authors upon request within the 60-day reasonable time period (blue bars) and beyond (purple bar; data excluded from analyses; n = 199 requests). Note the 2-base logarithmic scale until 60 days.
We received no response to 41.3% of our data requests, including two biweekly reminders. Responding to the data request differed most strongly among scientific disciplines and time periods (Fig. 6). Altogether 28.9% and 49.0% of requests were ignored by the authors of earlier (2000–2009) and later (2010–2019) papers, respectively. According to the best model (n = 310; D = 0.429; R²_model = 0.200; Padj < 0.001; Table S2), articles from the earlier time period (W = 9.3; β = 0.41 ± 0.13; Padj = 0.007) and the fields of forestry (W = 13.4; β = 0.57 ± 0.16; Padj < 0.001) and ecology (W = 7.0; β = 0.53 ± 0.20; Padj = 0.024) had the greatest likelihood of no response, whereas social scientists (W = 7.7; β = 0.87 ± 0.31; Padj = 0.016) answered most frequently.
In general, there was no residual effect of time since publication when the publication period was included in the best model. Within the 2010–2019 period, we specifically tested whether the authors publishing in 2019 and 2018 were less likely to share their data because of potential conflicting publishing interests. This hypothesis was not supported and a non-significant reverse trend was observed as the proportion of data obtained from the authors increased from 44% in 2010–2017 to 63% in 2018–2019. Accounting for time since publishing across the entire survey period, the data availability upon request decayed at a rate of 5.9% year⁻¹ based on an exponential model. This estimate was marginally higher than the 3.5% annual loss of publicly available data (Fig. 7). The number of articles was insufficient to test differences in data decay rates among disciplines.
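To give a feel for what such annual loss rates imply over time, the following Python sketch compounds a constant proportional loss per year. This is a simplifying assumption made for illustration; it does not reproduce the parametrisation of the exponential models fitted in this study, and the function names are hypothetical.

```python
import math


def remaining_fraction(annual_loss, years):
    """Share of data still obtainable after `years`, assuming the same
    proportional loss every year (illustrative simplification)."""
    return (1.0 - annual_loss) ** years


def half_life(annual_loss):
    """Years until half of the data are lost under the same assumption."""
    return math.log(0.5) / math.log(1.0 - annual_loss)


if __name__ == "__main__":
    # Annual loss rates quoted in the text for the two kinds of availability.
    rates = {
        "publicly available data": 0.035,
        "data available upon request": 0.059,
    }
    for label, rate in rates.items():
        print(
            f"{label}: {remaining_fraction(rate, 10):.0%} left after 10 years, "
            f"half-life {half_life(rate):.1f} years"
        )
```

Under this simplified reading, data that are only available upon request would halve in roughly 11 years, compared with roughly 19 years for publicly deposited data.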
Authors’ concerns and reasons for declining data sharing. Upon contacting the authors, we recorded and categorised their concerns and requests related to data sharing (n = 188 authors) and their reasons for decline (n = 65). Altogether 22.9% of authors were concerned about certain aspects of our request (Fig. 8). Authors of non-open access publications (W = 4.6; β = 0.49 ± 0.23; Padj = 0.064) and the field of humanities (W = 9.7; β = 1.11 ± 0.36; Padj = 0.004) expressed concerns or requests of any type relatively more often (Table S2). In particular, researchers in the fields of humanities (W = 15.2; β = 1.36 ± 0.35; Padj < 0.001), materials for energy and catalysis (W = 6.4; β = 0.65 ± 0.26; Padj = 0.022) and ecology (W = 5.6; β = 0.81 ± 0.34; Padj = 0.036) were more concerned about the study’s specific purpose than researchers on average.
Data sharing was declined by 33.0% of the 188 established contacts. When we specifically inquired about the reasons, the lack of time to search for data (29.2%), loss of data (27.7%) and privacy or legal concerns (23.1%) were most commonly indicated by the authors (Fig. 8), whereas no specific answer was provided by 10.8% of authors. According to the best binomial models (Table S2), social scientists indicated data loss more commonly than other researchers (W = 10.9; β = 1.04 ± 0.32; Padj = 0.003) and psychologists pointed most commonly to legal or privacy issues (W = 4.9; β = 0.85 ± 0.38; Padj = 0.078). Data decline due to legal issues became increasingly important in more recent publications (days since 01.01.2000: W = 7.2; β = 0.07 ± 0.03; Padj = 0.035). The lack of time to search tended to be more common for older studies (W = 4.0; β = 0.73 ± 0.37; Padj = 0.135).
Data storage options and citations. The ways in which the data were released differed greatly among disciplines (Fig. 9), with most common storage options being the supplementary materials on the publisher’s website (62.2% of articles), various data archives (22.3%) and upon request from corresponding authors (19.7%). Although 29.8% of articles declared depositing data in multiple sources, no source was indicated for 35.0% of articles. Declaring data availability upon request (n = 172) ranged from 1.0% in psychology to 52.0% in forestry, with greater frequency in earlier (days back since 31.09.2019: W = 15.0; β = 0.016 ± 0.004; P < 0.001) studies and articles by non-North American corresponding authors (by primary affiliation; W = 5.6; β = 0.23 ± 0.10; P = 0.018). With a few exceptions (three datasets only commercially available, one removed during final acceptance and one homepage corrupt), all data were successfully located for other indicated data sources, but only 42.3% of data could be obtained from the authors upon request in practice. This rate is comparable to articles with no such statement (38.3%; Chi-square test: P = 0.501).
Fig. 6 Authors’ response to data request (n = 199) depending on discipline (blue, declined; orange, ignored; purple, obtained). Bars indicate 95% CI of Sison and Glaz51. Letters above bars indicate statistically significant difference groups in frequency of data availability by each category based on Tukey post-hoc test and Bonferroni
The number of citations to articles ranged from 0.0 to 692.9 per year (median, 23.1). In contrast to the hypothesis that articles with available data accumulate more citations20, general linear modelling revealed no significant effect of initial or final data availability on annual citations. The model demonstrated that the average number of yearly citations was explained by research discipline (F(8,855) = 11.2; R² = 0.105; P < 0.001), data type (F(2,855) = 7.0; R² = 0.016; P < 0.001), open access status (F(1,855) = 4.5; R² = 0.005; P = 0.034) and the interaction term between open access and discipline (F(8,855) = 2.94; R² = 0.027; P = 0.003). Post-hoc tests indicated that articles with a dataset as a critical data source were cited on average 6% more than those with an image or model, and open access articles attracted 9% more citations than regular articles. Because of high variability in citation counts, it was not possible to test the interaction terms with scientific discipline in the current dataset. We speculate that the articles in Nature and Science are heavily cited on the basis of their key findings and interpretations that may mask the few extra citations arising from re-use of the data.
Fig. 7 Decay in critical data availability initially (blue circles; n = 672), at the end of a 60-day contacting period (purple circles; n = 672) and upon request from the authors (black circles; n = 310). Fitted curves: initial availability, y = 31.1 + e^…, R = 0.804, P < 0.001; final availability, y = 47.1 + e^…, R = 0.820, P < 0.001; upon request availability, y = 20.3 + e^…, R = 0.670, P = 0.004.
Fig. 8 Frequency distribution of authors’ (a) concerns and requests (n = 199) and (b) reasons for declining data sharing (n = 67). White bars indicate answers where no concerns or reasons were specified.
Our study uniquely points to differences among scientific disciplines in data availability as published along with the article and upon request from the authors. We demonstrate that in several disciplines such as forestry, materials for energy and catalysis and psychology, critical data are still unavailable for re-analysis or meta-analysis for more than half of the papers published in Nature and Science in the last decade. These overall figures roughly match those reported for other journals in various research fields8,11,13,22, but exceed the lowest reported values of around 10% available data13,23,24. Fortunately, data availability tends to improve, albeit slowly, in nearly all disciplines (Figs. 3, 7), which confirms recent implications from psychological and ecological journals13,31. Furthermore, the reverse trend we observed in microbiology corroborates the declining metagenomics sequence data availability22. Typically, such large DNA sequence data sets are used to publish tens of articles over many years by the teams producing these data; hence releasing both raw data and datasets may jeopardise their expectations of priority publishing. The weak discipline-specific differences among Nature and Science (Fig. 2) may be related to how certain subject editors implemented and enforced stringent data sharing policies.
Aer rigorous attempts to contact the authors, data availability increased by one third on average across dis-
ciplines, with full and at least partial availability reaching 70% and 83%, respectively. ese gures are in the
top end of studies conducted thus far8,22 and indicate the relatively superior overall data availability in Science
and Nature compared with other journals. However, the relative rates of data retrieval upon request, decline
sharing data and ignoring the requests were on par with studies covering other journals and specic research
elds10,12,25,26,28. Across 20 years, we identied the overall loss of data at an estimated rate of 3.5% and 5.9% for
initially available data and data eectively available upon request, respectively. is rate of data decay is much less
than 17% year1 previously reported in plant and animal sciences based on a comparable approach24.
While the majority of data are eventually available, it is alarming that less than a half of the data clearly stated to be available upon request could be effectively obtained from the authors. Although there may be objective reasons such as force majeure, these results suggest that many authors declaring data availability upon contacting may have abused the publishers’ or funders’ policy that allows statements of data availability upon request as the only means of data sharing. We find that this infringes research ethics and disables fair competition among research groups. Researchers hiding their own data may be in a power position compared with fair players in situations of big data analysis, when they can access all data (including their own), while others have more limited opportunities. Data sharing is also important for securing a possibility to re-analyse and re-interpret unexpected results9,32 and detect scientific misconduct25,33. More rigorous control of data release would prevent manuscripts with serious issues in sampling design or analytical procedures from being prepared, reviewed and eventually accepted for publication.
Our study uniquely recorded the authors’ concerns and specific requests when negotiating data sharing. Concerns and hesitations about data sharing are understandable because of potential drawbacks and misunderstandings related to data interpretation and priority of publishing17,34 that may outweigh the benefits of recognition and passive participation in broader meta-studies. Nearly one quarter of researchers expressed various concerns or had specific requests depending on the discipline, especially about the specific objectives of our study. Previous studies with questionnaires about hypothetical data sharing unrelated to actual data sharing reveal that financial interests, priority of additional publishing and fear of challenging the interpretations after data re-analysis constitute the authors’ major concerns12,35,36. Another study indicated that two thirds of researchers sharing biomedical data expected to be invited as co-authors upon use of their data37 although this does not fulfil the authorship criteria6,38. At least partly related to these issues, the reasons for declining data sharing differed among disciplines: while social scientists usually referred to the loss of data, psychologists most commonly pointed out ethical/legal issues. Recently published data were, however, more commonly declined due to ethical/legal issues, which indicates rising concerns about data protection and potential misuse. Although we offered a possibility to share anonymised data sets, such trimmed data sets were never obtained from the authors, suggesting that ethical issues were not the only reason for data decline. Because research fields strongly differed in the frequency of no response to data requests, most unanswered requests can be considered declines that avoid official replies, which may harm the authors’ reputation.
Fig. 9 Preferred ways of data storage in articles (n = 875) representing different disciplines (blue, text and supplement; purple, data archive; yellow, authors’ homepage; vermillion, previous publications; grey, museum; black, upon (reasonable) request; white, none declared).
Because we did not sample randomly across journals, our interpretations are limited to the journals Nature and Science. Our study across disciplines did not account for the particular academic editor, which may have partly contributed to the differences among research fields and journals. Not all combinations of disciplines, journals and time periods received the intended 25 replicate articles because of the poor representation of certain research fields in the 2000–2009 period. This may have reduced our ability to detect statistically significant differences among the disciplines. We obtained estimates for the final data availability for only seven out of nine disciplines. Although we excluded the remaining two disciplines from comparisons of initial and final data availability, it may have slightly altered the overall estimates. The process of screening the potentially relevant articles chronologically backwards resulted in overrepresentation of more recent articles in certain relatively popular disciplines, which may have biased comparisons across disciplines. However, the paucity of residual year effect and year × discipline interaction in overall models and residual time effect in separate analyses within research fields indicate a minimal bias (Figure S1).
We recorded the concerns and requests of authors that had issues with initial data sharing. Therefore, these responses may be relatively more sceptical than the opinions of the majority of the scientific community publishing in these journals. It is likely that the authors who did not respond may have concerns and reasons for declining similar to those who refused data sharing.
Our experience shows that receiving data typically required long email exchanges with the authors, contacting other referred authors or sending a reminder. Obtaining data took on average 15 days, representing a substantial effort to both parties39. This could have been easily avoided by releasing data upon article acceptance. On the other hand, we received tips for analysis, caution against potential pitfalls and the authors’ informed consent upon contacting. According to our experience, more than two thirds of the authors need to be contacted for retrieving important metadata, variance estimates or specifying methods for meta-analyses40. Thus, contacting the authors may be commonly required to fill gaps in the data, but such extra specifications are easier to provide compared with searching and converting old datasets into a universally understandable format.
Due to various concerns and tedious data re-formatting and uploading, the authors should be better motivated for data sharing41. Data formatting and releasing certainly benefits from clear instructions and support from funders, institutions and publishers. In certain cases, public recognition such as badges of open data for articles following the best data sharing practices and increasing numbers of citations may promote data release by an order of magnitude42. Citable data papers are certainly another way forward43,44, because these provide access to a well-organised dataset and add to the authors’ publication record. Encouraging the listing of published data sets with download and citation metrics in grant and job applications alongside other bibliometric indicators should promote data sharing. Relating released data in publicly available research accounts such as ORCID, ResearcherID and Google Scholar would benefit authors, other researchers and evaluators. To account for many authors’ fear of data theft17 and to prioritise the publishing options of data owners, setting a reasonable embargo period for third-party publishing may be needed in specific cases such as immediate data release following data generation45 and dissertations.
All funders, research institutions, researchers, editors and publishers should collectively contribute to turn data sharing into a win-win situation for all parties and the scientific endeavour in general. Funding agencies may have a key role here due to the lack of conflicting interests and a possibility of exclusive allocation to depositing and publishing huge data files46. Funders have efficient enforcing mechanisms during reporting periods, with an option to refuse extensions or the approval of forthcoming grant applications. We advocate that funders should include published data sets, if relevant, as an evaluation criterion besides other bibliometric information. Research institutions may follow the same principles when issuing institutional grants and employing research staff. Institutions should also insist that their employees follow open data policies45.
Academic publishers also have a major role in shaping data sharing policies. Although deposition and maintenance of data incur extra costs to commercial publishers, they should promote data deposition in their servers or public repositories. An option is to hire specific data editors for evaluating data availability in supplementary materials or online repositories and refusing final publishing before the data are fully available in a relevant format47. For efficient handling, clear instructions and a machine-readable data availability statement option (with a QR code or link to the data) should be provided. In non-open access journals, the data should be accessible free of charge or at reduced price to unsubscribed users. Creating specific data journals or ‘data paper’ formats may promote publishing and sharing data that would otherwise pile up in the drawer because of disappointing results or the lack of time for preparing a regular article. The leading scientometrics platforms Clarivate Analytics, Google Scholar and Scopus should index data journals equally with regular journals to motivate researchers publishing their data. There should be a possibility of article withdrawal by the publisher, if the data availability statements are incorrect or the data have been removed post-acceptance30. Much of the workload should stay on the editors who are paid by the supporting association, institution or publisher in most cases. The editors should grant the referees access to these data during the reviewing process48, asking them for a second opinion about data availability and reasons for declining to do so. Similar stringent data sharing policies are increasingly implemented by various journals26,30,47.
In conclusion, data availability in top scientific journals differs strongly by discipline, but it is improving in
most research fields. As our study exemplifies, the ‘data availability upon request’ model is insufficient to ensure
access to datasets and other critical materials. Considering the overall data availability patterns, authors’ concerns
and reasons for declining data sharing, we advocate that (a) data releasing costs ought to be covered by funders;
(b) shared data and the associated bibliometric records should be included in the evaluation of job and grant
applications; and (c) data sharing enforcement should be led by both funding agencies and academic publishers.
Materials and Methods
Data collection. To assess differences in data availability in different research disciplines, we focused our
study on Nature and Science, two high-impact, general-interest journals that practise relatively stringent data
availability policies49. Because of major changes in the public attitude and journals’ policies about data sharing,
our survey was focused on two study periods, 2000–2009 and 2010–2019. We selected nine scientific disciplines
as defined by the Springer Nature publishing group - biomaterials and biotechnology, ecology, forestry,
humanities, materials for energy and catalysis, microbiology, optics and photonics, psychology and social
sciences (see Table S1 for details) - for analysis based on their coverage in Nature and Science journals and data-driven
research. These nine disciplines were selected based on the competence of our team and the objective
to cover research fields as different as possible, including natural sciences, social sciences and humanities. The
articles were searched by discipline, keywords and/or manual browsing as follows. For Nature, our search was
refined as
ject=microbiology&date_range=2010-2019 (italicised parts varied). For Science, the corresponding search
string was the following:−
reviews&source=sciencemag%7CScience. In both journals, the articles were retrieved by browsing search results
chronologically backwards from September 2019 or September 2009 until reaching 25 articles matching the criteria.
When the number of suitable articles was insufficient, we searched by using additional discipline-specific
keywords in the title and browsed all issues manually when necessary. In some research fields, 25 articles could
not be found for all journal and time period combinations and therefore, data availability was evaluated for 875
articles in total (Table S1). In each article, we identified a specific analysis or result that was critical for the main
conclusion of that study based on both the authors’ emphasis and our subjective assessment. We determined
whether the underlying data of these critical results - datasets, images (including videos), models (including
scripts, programs and analytical procedures) or physical items - were available in the main text, supplementary
materials or other indicated sources such as specific data repositories, authors’ homepages, museums, or upon
request to the corresponding author (Figure S2). When available, we downloaded these data, checked for relevant
metadata, identifiers and other components, and evaluated whether it was theoretically possible to repeat these
specific analyses and include these materials in a field-specific metastudy. For example, in the case of a dataset, we
evaluated the data table for the presence of relevant metadata and sample codes necessary to perform the analysis;
for any statistical procedure, the authors must have used such a data table in their original work. We considered
the data to be too raw if they either required a large amount of work (other than common data transformations)
to generate the data table or model, or we had doubts whether the same data table could be reproduced with the
methods described. Raw high-throughput sequencing data are typical examples of incomplete datasets, because
these usually lack necessary metadata and require a thorough bioinformatics analysis, with the output depending
on software and selected options. For further examples, certain optical raw images or videos make no sense without
expert filtering, and computer scripts are of limited use without thorough instructions.
If these critical data were unavailable or only partly available (i.e., missing some integral metadata, instructions
or explanations), we contacted the first corresponding author or a relevant author referred to in relation to access
to the specific item, requesting the data for a metastudy by using a pre-defined format and an institutional email
address (Item S1). In the email, we carefully specified the materials required to produce a particular figure or table
to avoid confusion and upsetting the authors with a messy request. We indicated that the data were intended for a
metastudy in a related topic to test the authors’ willingness to share the data for actual use, not just their intention
to share for no reasonable purpose. We similarly evaluated the received data for integrity and requested further
information, if necessary, to meet the standards. We also recorded the responses of corresponding authors to data
requests, including any specific requests or concerns and reasons for declining (Item S1).
The authors were mostly contacted early in the week and two reminders were sent ca. 14 and 28 days later if
necessary (Item S1). The reminders were also addressed to other corresponding authors if relevant. If emails were
returned with an error message, we contacted other corresponding authors or used an updated email address
found on the internet or in newer publications. We considered 60 days from sending the first email a reasonable
time period for the authors to locate and send the requested data4.
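The contact schedule described above amounts to a simple date calculation. The following is a minimal sketch and not code used in the study (the authors state that no specific code was generated); the helper name `contact_schedule` and the example date are hypothetical, while the offsets of 14, 28 and 60 days are taken from the text.

```python
from datetime import date, timedelta

# Offsets in days after the first email, as stated in the text
# (reminders at ca. 14 and 28 days, waiting period of 60 days).
REMINDER_OFFSETS = (14, 28)
RESPONSE_WINDOW = 60


def contact_schedule(first_email: date) -> dict:
    """Return reminder dates and the end of the waiting period (hypothetical helper)."""
    return {
        "reminders": [first_email + timedelta(days=d) for d in REMINDER_OFFSETS],
        "deadline": first_email + timedelta(days=RESPONSE_WINDOW),
    }


# Illustrative first-contact date (a Monday); not taken from the study.
schedule = contact_schedule(date(2019, 10, 7))
```

Because the reminders were sent at approximately these intervals, the computed dates should be read as nominal targets.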
For each article, we recorded the details of publishing (date printed, journal, discipline), corresponding
authors (number, country of first affiliation, acquaintance with the contact author) and data (availability, type, ways
of access)50. Data complexity was evaluated based on the authors’ relative amount of extra work to polish the
raw data (e.g. low-complexity data include raw DNA sequence data, raw images, artefacts; high-complexity data
include bioinformatics-treated molecular data sets, noise-removed images, models and scripts). As of 23.03.2020,
we recorded the open access status and number of citations for each article using searches in the ISI Web of
Science (https://apps.webo…). The citation count was expressed as citations per year, discounting
the first 90 days, when citations are initially fewer.
Data analysis. e principal aim of this study was to determine the relative importance of scientic discipline
and time period on data availability and authors’ concerns in response to data sharing requests, by accounting
for multiple potentially important covariates (Fig.1). e response variables, i.e. initial and nal data availability
(none, partly or fully available), author’s responses (ignored, data shared or declined), concerns and reasons
for decline, exhibit multinomial distribution50 and were hence transformed to dummy variables. Similarly, the
multi-level explanatory variables (discipline, topic overlap, countries and continents of corresponding authors,
data type and complexity) were transformed to dummies, whereas continuous variables (linear time, number of
citations, time to obtain data, number of corresponding authors) were square root- or logarithm-transformed
where appropriate. All analyses were performed in STATISTICA 12 (StatSo Inc., Tulsa, OK, USA).
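To illustrate the variable preparation described above, the following minimal sketch shows dummy (indicator) coding of a multi-level variable and the two transformations of continuous variables. It is not the authors' procedure, which was carried out in STATISTICA; the function names and example values are hypothetical, and the use of log(x + 1) is an assumption made so that zero counts remain defined.

```python
import math


def to_dummies(values):
    """Code a categorical variable as one 0/1 indicator column per level."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


def sqrt_transform(values):
    """Square root transformation of a non-negative continuous variable."""
    return [math.sqrt(v) for v in values]


def log_transform(values):
    """Logarithm transformation; log(x + 1) is an assumption to handle zeros."""
    return [math.log1p(v) for v in values]


# Made-up example values, not data from the study.
availability = ["none", "partly", "fully", "fully", "none"]
dummies = to_dummies(availability)
citations = [0, 3, 8, 15, 24]
log_citations = log_transform(citations)
```

Each observation has exactly one indicator set to 1, so a three-level response such as data availability becomes three binary variables that can each be modelled separately.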
Data analysis of the dummy-transformed multinomial and binomial variables was performed using stepwise
logistic regression model selection with a binomial link function using corrected Akaike information criterion
(AICc) as a selection criterion, and Somers’ D statistic and model determination coefficients (R²) as measures of
overall goodness of fit. Determination coefficients and Wald’s W statistic were used to estimate the relative importance
of explanatory variables. We calculated 95% confidence intervals for multiple proportions51 using the R
package multinomialCI. Increasing false discovery rates related to multiple
comparisons were accounted for by using Bonferroni correction of P-values (expressed as Padj) where appropriate.
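Two of the quantities named above have simple closed forms. The sketch below uses the standard textbook definitions of the corrected Akaike information criterion and the Bonferroni adjustment; it is not the STATISTICA implementation used in the study, and the function names and example numbers are hypothetical.

```python
def aicc(log_likelihood: float, k: int, n: int) -> float:
    """Corrected Akaike information criterion for k parameters and n observations."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = 2 * k - 2 * log_likelihood
    # Small-sample correction term; it vanishes as n grows.
    return aic + (2 * k * (k + 1)) / (n - k - 1)


def bonferroni(p_values):
    """Bonferroni-adjusted P-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Made-up inputs for illustration only.
score = aicc(log_likelihood=-50.0, k=3, n=100)
p_adj = bonferroni([0.01, 0.04, 0.5])
```

In stepwise selection, the candidate model with the lowest AICc is preferred at each step; the correction term penalises additional parameters more strongly when the sample is small.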
Models with continuous response variables (proportion of available data, annual citations, time to receive
data) were tested using general linear models in two steps. First, the model selection included only dummy and
continuous explanatory variables. Multilevel categorical predictors corresponding to significant dummies as well
as significant continuous variables were included in the final model selection based on forward selection. To
check for potential biases related to the article selection procedure in both periods, we tested the effect of discipline,
period and year and all their interaction terms on initial data availability by retaining all variables in the
model (Figure S1). Differences in these factor levels were tested using Tukey post-hoc tests for unequal sample
size, which account for multiple testing issues.
Data availability
The entire dataset is available in spreadsheet format in the plutoF data repository50.
Code availability
No specic code was generated for analysis of these data.
Received: 11 December 2020; Accepted: 29 June 2021;
Published: xx xx xxxx
Acknowledgements
We thank all authors who released their data along with their article or responded to our data request. Although
some of the obtained datasets are used in a series of meta-analyses or released by us upon agreement, we apologise
to the authors who spent a significant amount of time to provide the data, which we cannot use for secondary
analyses. We thank A. Kahru, T. Soomere, Ü. Niinemets and J. Allik for their constructive comments on an earlier
version of the manuscript.
Author contributions
All authors contributed to study design, work with literature and writing. L.T. analysed data and led writing.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary information e online version contains supplementary material available at https://doi.
Correspondence and requests for materials should be addressed to L.T.
Reprints and permissions information is available at
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons license, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons license and your intended use is not permitted
by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder. To view a copy of this license, visit
© The Author(s) 2021
