
Data sharing practices and data availability upon request differ across scientific disciplines

July 2021
Scientific Data 8(1):192


DOI:10.1038/s41597-021-00981-0

License
CC BY 4.0

Authors:

Leho Tedersoo

University of Tartu

Ester Oras

University of Tartu

Kajar Köster

University of Eastern Finland


Data sharing is one of the cornerstones of modern science that enables large-scale analyses and reproducibility. We evaluated data availability in research articles across nine disciplines in Nature and Science magazines and recorded corresponding authors’ concerns, requests and reasons for declining data sharing. Although data sharing has improved in the last decade and particularly in recent years, data availability and willingness to share data still differ greatly among disciplines. We observed that statements of data availability upon (reasonable) request are inefficient and should not be allowed by journals. To improve data sharing at the time of manuscript acceptance, researchers should be better motivated to release their data with real benefits such as recognition, or bonus points in grant and job applications. We recommend that data management costs should be covered by funding agencies; publicly available research data ought to be included in the evaluation of applications; and surveillance of data sharing should be enforced by both academic publishers and funders. These cross-discipline survey data are available from the plutoF repository.



SCIENTIFIC DATA | (2021) 8:192 | https://doi.org/10.1038/s41597-021-00981-0

www.nature.com/scientificdata

Data sharing practices and data availability upon request differ across scientific disciplines

Leho Tedersoo1,2 ✉, Rainer Küngas1, Ester Oras1,3,4, Kajar Köster1,5, Helen Eenmaa1,6, Äli Leijen1,7, Margus Pedaste7, Marju Raju1,8, Anastasiya Astapova1,9, Heli Lukner1,10, Karin Kogermann1,11 & Tuul Sepp1,12


Introduction

Technological advances and the accumulation of case studies have led many research fields into the era of ‘big data’, i.e. the possibility to integrate data from various sources for secondary analysis such as meta-studies and meta-analyses [1,2]. Nearly half of researchers commonly use data generated by other scientists [3]. Data sharing is a scientific norm and an important part of research ethics in all disciplines, and it is increasingly endorsed by publishers, funders and the scientific community [4–6]. Despite decades of argumentation [7], much of the published data is still essentially unavailable for integration into secondary data analysis and for evaluation of reproducibility, a proxy for reliability [8–10]. Furthermore, the deposited data may be incomplete, sometimes intentionally [11–14], e.g. when they exhibit mismatching sample codes or lack important metadata such as the sex and age of studied organisms in biological and social sciences.

Although the vast majority of researchers prefer data sharing [12,15], scientists tend to be concerned about losing their priority in future publishing and about potential commercial use of their work without their consent or participation [12,16,17]. Researchers working on human subjects may be bound by legal agreements not to reveal sensitive data [16,18]. Across research fields, papers indicating available data are cited on average 25% more [19]. In research using microarrays, papers with access to raw data accumulate on average 69% more citations than other articles [20]. Unfortunately, a higher citation rate has not motivated many researchers enough to release their data, although referees and funding agencies account for bibliometrics when evaluating researchers and their proposals [21]. Multiple case studies have revealed high variation in data availability among journals and disciplines, ranging from 9 to 76% [8,11,13,19,22–24]. According to previous research, data requests to authors are successful in 27–59% of cases, whereas the request is ignored in 14–41% of cases [10,25–28]. To promote access to data, many journals have implemented mandatory data availability statements and require data storage in supplementary materials or specific databases [29,30]. Because of poor enforcement, these measures have not always guaranteed access to published data, owing to broken links, the lack of metadata or the authors’ unwillingness to share upon request [8,26].

1Estonian Young Academy of Sciences, Kohtu 6, 10130, Tallinn, Estonia. 2Mycology and Microbiology Center, University of Tartu, Ravila 14a, 50411, Tartu, Estonia. 3Institute of Chemistry, University of Tartu, Ravila 14a, 50411, Tartu, Estonia. 4Institute of History and Archaeology, University of Tartu, Jakobi 2, 51005, Tartu, Estonia. 5Department of Forest Sciences, University of Helsinki, PO Box 27 (Latokartanonkaari 7), Helsinki, FI-00014, Finland. 6School of Law, University of Tartu, Näituse 20, 50409, Tartu, Estonia. 7Institute of Education, University of Tartu, Salme 1a, 50103, Tartu, Estonia. 8Department of Musicology, Music Pedagogy and Cultural Management, Estonian Academy of Music and Theatre, Tatari 13, 10116, Tallinn, Estonia. 9Institute for Cultural Research and Fine Arts, University of Tartu, Ülikooli 16, 51003, Tartu, Estonia. 10Institute of Physics, University of Tartu, W. Ostwaldi 1, 50411, Tartu, Estonia. 11Institute of Pharmacy, University of Tartu, Nooruse 1, 50411, Tartu, Estonia. 12Institute of Ecology and Earth Sciences, University of Tartu, Vanemuise 46, 51003, Tartu, Estonia. ✉e-mail: leho.tedersoo@ut.ee

Content courtesy of Springer Nature, terms of use apply. Rights reserved

This study aims to map and evaluate cross-disciplinary differences in data sharing, authors’ concerns and reasons for denying access to data, and whether these decisions are reflected in article citations (Fig. 1). We selected scholarly articles published in the journals Nature and Science because of their multidisciplinary contents, stringent data availability policies outlined in authors’ instructions, and high-impact conclusions derived from data of exceptional size, accuracy and/or novelty. We hypothesised that, in spite of overall improvement in data sharing culture, the actual data availability and reasons for declining requests to share data depend on scientific discipline because of field-specific ‘traditions’, ‘sensitivity’ of data, or their economic potential. Our broader goal is to improve data sharing principles and policies among authors, academic publishers and research foundations.

Results

Initial and final data availability. We evaluated the availability of the most critical data in 875 articles across nine scientific disciplines (Table S1) published in Nature and Science over two 10-year intervals (2000–2009 and 2010–2019) and, where these data were not available for access, we contacted the authors. The initial (pre-contacting) full and at least partial data availability averaged 54.2% (range across disciplines, 33.0–82.8%) and 71.8% (40.4–100.0%), respectively. Stepwise logistic regression models revealed that initial data availability differed by research field, type of data, journal and publishing period (no vs. full availability: n = 721; Somers’ D = 0.676; R²(model) = 0.476; P < 0.001). According to the best model (Table S2), the data were less readily available in materials for energy and catalysis (W = 68.0; β = −1.52 ± 0.19; P < 0.001), psychology (W = 55.6; β = −1.11 ± 0.15; P < 0.001), optics and photonics (W = 18.8; β = −0.59 ± 0.14; P < 0.001) and forestry (W = 9.8; β = −0.52 ± 0.19; P = 0.002) compared with other disciplines, especially humanities (Fig. 2). Data availability was relatively lower in the period of 2000–2009 (W = 82.5; β = −0.57 ± 0.10; P < 0.001) and when the most important data were in the form of a dataset (relative to image/video and model; W = 41.5; β = −1.23 ± 0.19; P < 0.001; Fig. 3). Relatively less data were available for Nature (W = 32.7; β = −0.57 ± 0.19; P < 0.001), with striking several-fold differences in optics and photonics (Fig. 2).
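As a reading aid for the test statistics reported above, the following minimal Python sketch converts a W value into a P value. It assumes that W is a Wald chi-square with one degree of freedom, which the text does not state explicitly; the function name is hypothetical and the sketch is not the authors' analysis code.

```python
import math


def chi2_p_value_1df(w: float) -> float:
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom.

    For 1 df, P(X > w) equals erfc(sqrt(w / 2)), so only the standard
    library is needed.
    """
    return math.erfc(math.sqrt(w / 2.0))


# W values quoted in the text for forestry (reported P = 0.002)
# and optics and photonics (reported P < 0.001).
for field, w in [("forestry", 9.8), ("optics and photonics", 18.8)]:
    print(f"{field}: W = {w}, P = {chi2_p_value_1df(w):.4f}")
```

Under this assumption the forestry value comes out close to the reported P = 0.002, which suggests the one-degree-of-freedom reading is at least plausible for single model terms.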

[Figure 1: flow diagram. Recoverable labels: Data request; Article; Discipline; Journal; Age/time period; Country/continent; No. corr. authors; Open access; Data complexity; No response; Data declined; Data obtained; Requests & concerns; Time to obtain data; Recommendations for data sharing policies; Data sharing initial; Data sharing final; Reminder 1; Reminder 2; Storage options; Citations; Reason for decline.]

Fig. 1 Schematic rationale of the study.


[Figure 2: bar charts of data availability (%) by discipline (Ecology; Forestry; Humanities; Materials for energy & catalysis; Biomaterials & biotechnology; Microbiology; Optics & photonics; Psychology; Social sciences), with bars for Science and Nature in periods P1 and P2.]

Fig. 2 Differences in partial (grey) and full (black) data availability among disciplines depending on journal and publishing period (P1, 2000–2009; P2, 2010–2019) before contacting the authors (n = 875). Letters above bars indicate statistically significant difference groups among disciplines in full data availability compared to no data availability. Asterisks show significant differences in full data availability between journals and publishing periods.

[Figure 3: two panels by discipline (Ecology; Forestry; Humanities; Materials for energy & catalysis; Biomaterials & biotechnology; Microbiology; Optics & photonics; Psychology; Social sciences). (a) Frequency of critical data types (%); (b) data availability (%) for each data type (DS, Img, Mod).]

Fig. 3 Types of critical data (n = 875). (a) Distribution of data types among disciplines (blue, dataset; purple, image; black, model); (b) Partial (light shades) and full (dark shades) data availability among disciplines depending on the type of critical data (DS, dataset; Img, image; Mod, model) before contacting the author(s).


Upon contacting the authors of 310 papers, the overall data availability improved by 35.0%. After 60 days since contacting, a reasonable time frame [4], full and at least partial availability averaged 69.5% (range across disciplines, 57.0–87.9%) and 83.2% (64.9–100.0%), respectively (Fig. 4). The final data availability (after contacting the authors) was best predicted by scientific discipline, data type and time lapse since publishing (no vs. full availability: n = 580; D = 0.659; R²(model) = 0.336; Padj < 0.001; Table S2), but with no major changes in the ranking of disciplines or data types compared with the initial data availability (Fig. 4). It took a median of 15 days to receive data from the authors (Fig. 5), with a minimum time of 13 minutes. Four authors sent their data after the 60-day period since the initial request (max. 107 days). The rate of receiving data was unrelated to any studied parameter.

Authors’ responses to data requests. The data were obtained from the authors in 39.4% of data requests on average, with a range of 27.9–56.1% among research fields. The likelihood of receiving data, or of the request being declined or ignored, depended mostly on the time period and field of research. According to the best model (n = 310; D = 0.300; R²(model) = 0.106; Padj < 0.001; Table S2), the data were obtained slightly less frequently for the earlier time period (29.4% vs. 56.0%; W = 20.4; β = 0.56 ± 0.12; Padj < 0.001). Receiving data upon request tended to be lowest in the field of forestry (W = 3.6; β = −0.31 ± 0.16; Padj = 0.177), especially when compared with microbiology (Fig. 2).

Declining the data request averaged 19.4% and differed most strongly among the research fields. The best model (n = 310; D = 0.508; R²(model) = 0.221; Padj < 0.001) revealed that the data were most likely not to be made available upon request in the fields of social sciences (W = 24.3; β = −1.09 ± 0.22; Padj < 0.001), psychology (W = 20.0; β = −0.73 ± 0.20; Padj < 0.001) and humanities (W = 5.0; β = −0.67 ± 0.30; Padj = 0.078) compared with natural sciences (Fig. 2). Furthermore, the data request was more likely to be declined when the data complexity was high (W = 9.8; β = −0.59 ± 0.19; Padj = 0.005), when the paper was not open access in ISI Web of Science (W = 4.0; β = −0.37 ± 0.18; Padj = 0.132) and when it was published in Science rather than Nature (W = 4.6; β = −0.35 ± 0.16; Padj = 0.096), although the two latter figures are non-significant when accounting for multiple testing.

[Figure 4: bar charts of data availability (%) by discipline (Ecology; Forestry; Humanities; Materials for energy & catalysis; Microbiology; Psychology; Social sciences), grouped by data type (DS, Image, Model) and publishing period (P1, P2).]

Fig. 4 Differences in partial (grey) and full (black) data availability among disciplines after data requests (n = 672) depending on the type of critical data (DS, dataset; image; model) and publishing period (P1, 2000–2009; P2, 2010–2019). Numbers above bars indicate statistically significant difference groups among disciplines in full data availability.

[Figure 5: histogram. X-axis: data obtained upon request (days), with ticks at 0, 1, 2, 4, 8, 16, 32, 60 and 128; y-axis: number of datasets.]

Fig. 5 Histogram of time for receiving data from authors upon request within the 60-day reasonable time period (blue bars) and beyond (purple bar; data excluded from analyses; n = 199 requests). Note the 2-base logarithmic scale until 60 days.

We received no response to 41.3% of our data requests, despite two biweekly reminders. Responding to the data request differed most strongly among scientific disciplines and time periods (Fig. 6). Altogether 28.9% and 49.0% of requests were ignored by the authors of earlier (2000–2009) and later (2010–2019) papers, respectively. According to the best model (n = 310; D = 0.429; R²(model) = 0.200; Padj < 0.001; Table S2), articles from the earlier time period (W = 9.3; β = 0.41 ± 0.13; Padj = 0.007) and the fields of forestry (W = 13.4; β = −0.57 ± 0.16; Padj < 0.001) and ecology (W = 7.0; β = −0.53 ± 0.20; Padj = 0.024) had the greatest likelihood of no response, whereas social scientists (W = 7.7; β = 0.87 ± 0.31; Padj = 0.016) answered most frequently.

In general, there was no residual effect of time since publication when the publication period was included in the best model. Within the 2010–2019 period, we specifically tested whether the authors publishing in 2018 and 2019 were less likely to share their data because of potential conflicting publishing interests. This hypothesis was not supported, and a non-significant reverse trend was observed, as the proportion of data obtained from the authors increased from 44% in 2010–2017 to 63% in 2018–2019. Accounting for time since publishing across the entire survey period, the data availability upon request decayed at a rate of 5.9% year⁻¹ based on an exponential model. This estimate was marginally higher than the 3.5% annual loss of publicly available data (Fig. 7). The number of articles was insufficient to test differences in data decay rates among disciplines.
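To give a feel for what these annual decay rates imply, the sketch below computes the fraction of data still obtainable after a number of years and the corresponding half-life. It assumes that the reported percentages act as continuous exponential rate constants; the text does not specify whether the rates are continuous or compounded annually, so this is an illustration rather than a reproduction of the authors' model, and the function names are hypothetical.

```python
import math


def remaining_fraction(rate_per_year: float, years: float) -> float:
    """Fraction still available after `years`, assuming continuous exponential decay."""
    return math.exp(-rate_per_year * years)


def half_life(rate_per_year: float) -> float:
    """Years until availability halves under the same assumption."""
    return math.log(2.0) / rate_per_year


# Rates quoted in the text: 5.9% per year upon request, 3.5% per year for public data.
for label, rate in [("upon request", 0.059), ("publicly available", 0.035)]:
    print(
        f"{label}: {remaining_fraction(rate, 10):.0%} left after 10 years, "
        f"half-life {half_life(rate):.1f} years"
    )
```

Read this way, the gap between the two rates is easier to judge: the upon-request route would lose half of its data in roughly 12 years, against roughly 20 years for publicly deposited data.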

Authors’ concerns and reasons for declining data sharing. Upon contacting the authors, we recorded and categorised their concerns and requests related to data sharing (n = 188 authors) and their reasons for declining (n = 65). Altogether 22.9% of authors were concerned about certain aspects of our request (Fig. 8). Authors of non-open access publications (W = 4.6; β = 0.49 ± 0.23; Padj = 0.064) and those in the field of humanities (W = 9.7; β = 1.11 ± 0.36; Padj = 0.004) expressed concerns or requests of any type relatively more often (Table S2). In particular, researchers in the fields of humanities (W = 15.2; β = 1.36 ± 0.35; Padj < 0.001), materials for energy and catalysis (W = 6.4; β = 0.65 ± 0.26; Padj = 0.022) and ecology (W = 5.6; β = 0.81 ± 0.34; Padj = 0.036) were more concerned about the study’s specific purpose than researchers on average.

Data sharing was declined by 33.0% of the 188 established contacts. When we specifically inquired about the reasons, the lack of time to search for data (29.2%), loss of data (27.7%) and privacy or legal concerns (23.1%) were most commonly indicated by the authors (Fig. 8), whereas no specific answer was provided by 10.8% of authors. According to the best binomial models (Table S2), social scientists indicated data loss more commonly than other researchers (W = 10.9; β = 1.04 ± 0.32; Padj = 0.003) and psychologists pointed most commonly to legal or privacy issues (W = 4.9; β = 0.85 ± 0.38; Padj = 0.078). Declining data sharing due to legal issues became increasingly important in more recent publications (days since 01.01.2000: W = 7.2; β = 0.07 ± 0.03; Padj = 0.035). The lack of time to search tended to be more common for older studies (W = 4.0; β = −0.73 ± 0.37; Padj = 0.135).
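The percentages of reasons for declining can be translated back into approximate numbers of authors. The short sketch below does this under the assumption that all four percentages refer to the n = 65 authors who gave reasons, as stated in the text; the helper name is hypothetical and the counts are back-calculated, not taken from the published dataset.

```python
def share_to_count(percent: float, total: int) -> int:
    """Nearest whole number of authors corresponding to a reported percentage."""
    return round(percent / 100.0 * total)


N_REASONS = 65  # authors who gave reasons for declining, as stated in the text

reported_shares = {
    "no time to search": 29.2,
    "data lost": 27.7,
    "privacy or legal concerns": 23.1,
    "no specific answer": 10.8,
}

# Back-calculate the approximate number of authors behind each percentage.
counts = {
    reason: share_to_count(pct, N_REASONS)
    for reason, pct in reported_shares.items()
}
for reason, n in counts.items():
    print(f"{reason}: about {n} of {N_REASONS}")
```

Each reported percentage corresponds to a whole number of authors out of 65, which is consistent with that denominator.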

[Figure 6: bar chart of frequency (%) by discipline (Ecology; Forestry; Humanities; Materials for energy & catalysis; Microbiology; Psychology; Social sciences).]

Fig. 6 Authors’ response to data request (n = 199) depending on discipline (blue, declined; orange, ignored; purple, obtained). Bars indicate 95% CI of Sison and Glaz [51]. Letters above bars indicate statistically significant difference groups in frequency of data availability by each category based on Tukey post-hoc test and Bonferroni correction.

Data storage options and citations. The ways in which the data were released differed greatly among disciplines (Fig. 9), with the most common storage options being the supplementary materials on the publisher’s website (62.2% of articles), various data archives (22.3%) and upon request from corresponding authors (19.7%). Although 29.8% of articles declared depositing data in multiple sources, no source was indicated for 35.0% of articles. Declaring data availability upon request (n = 172) ranged from 1.0% in psychology to 52.0% in forestry, with greater frequency in earlier studies (days back since 31.09.2019: W = 15.0; β = 0.016 ± 0.004; P < 0.001) and in articles by non-North American corresponding authors (by primary affiliation; W = 5.6; β = 0.23 ± 0.10; P = 0.018). With a few exceptions (three datasets only commercially available, one removed during final acceptance and one homepage corrupt), all data were successfully located for other indicated data sources, but only 42.3% of data could be obtained from the authors upon request in practice. This rate is comparable to that of articles with no such statement (38.3%; Chi-square test: P = 0.501).
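The kind of comparison made in the last sentence can be sketched as a Pearson chi-square test on a 2 × 2 table. The counts below are HYPOTHETICAL: they were chosen only to give proportions in the neighbourhood of those reported, because the underlying counts are not given in the text, so the resulting P value is not expected to reproduce the published one. The function name is likewise an assumption.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square (no continuity correction) and P value for the
    table [[a, b], [c, d]].

    With 1 degree of freedom the upper-tail P value is erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# HYPOTHETICAL counts, not taken from the study.
# Rows: "available upon request" statement present / absent.
# Columns: data obtained / not obtained.
chi2, p = chi_square_2x2(44, 60, 78, 128)
print(f"chi-square = {chi2:.2f}, P = {p:.2f}")
```

With proportions this close, the test statistic stays small and the difference is far from significant, which is the qualitative point of the comparison.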

The number of citations to articles ranged from 0.0 to 692.9 per year (median, 23.1). In contrast to the hypothesis that articles with available data accumulate more citations [20], general linear modelling revealed no significant effect of initial or final data availability on annual citations. The model demonstrated that the average number of yearly citations was explained by research discipline (F(8,855) = 11.2; R² = 0.105; P < 0.001), data type (F(2,855) = 7.0; R² = 0.016; P < 0.001), open access status (F(1,855) = 4.5; R² = 0.005; P = 0.034) and the interaction term between open access and discipline (F(8,855) = 2.94; R² = 0.027; P = 0.003). Post-hoc tests indicated that articles with a dataset as a critical data source were cited on average 6% more than those with an image or model, and open access articles attracted 9% more citations than regular articles. Because of high variability in citation counts, it was not possible to test the interaction terms with scientific discipline in the current dataset. We speculate that the articles in Nature and Science are heavily cited on the basis of their key findings and interpretations, which may mask the few extra citations arising from re-use of the data.

[Figure 7: scatter plot with fitted curves. X-axis: year (2000–2020); y-axis: data availability (%). Legend, with exponent coefficients matched to the decay rates given in the text: initial availability, y = 31.1 + e^(0.0354x), R = 0.804, P < 0.001; final availability, y = 47.1 + e^(0.0480x), R = 0.820, P < 0.001; upon request availability, y = 20.3 + e^(0.0590x), R = 0.670, P = 0.004.]

Fig. 7 Decay in critical data availability initially (blue circles; n = 672), at the end of a 60-day contacting period (purple circles; n = 672) and upon request from the authors (black circles; n = 310).

[Figure 8: horizontal bar charts. (a) Number of requests and concerns, by category: none; purpose; seeing results; citing; privacy; authorship; sharing. (b) Number of reasons for declining data sharing, by category: no time to search; data lost; data protected by agreements; not specified; privacy; person moved; purpose unclear; more work in progress; person retired; interpretation problematic; need a good reason; bad experience with sharing data; not shared with strangers; person dead; putting on web in progress.]

Fig. 8 Frequency distribution of authors’ (a) concerns and requests (n = 199) and (b) reasons for declining data sharing (n = 67). White bars indicate answers where no concerns or reasons were specified.

Discussion

Our study uniquely points to differences among scientific disciplines in data availability as published along with the article and upon request from the authors. We demonstrate that in several disciplines such as forestry, materials for energy and catalysis and psychology, critical data are still unavailable for re-analysis or meta-analysis for more than half of the papers published in Nature and Science in the last decade. These overall figures roughly match those reported for other journals in various research fields [8,11,13,22], but exceed the lowest reported values of around 10% available data [13,23,24]. Fortunately, data availability tends to improve, albeit slowly, in nearly all disciplines (Figs. 3, 7), which confirms recent implications from psychological and ecological journals [13,31]. Furthermore, the reverse trend we observed in microbiology corroborates the declining metagenomics sequence data availability [22]. Typically, such large DNA sequence data sets are used to publish tens of articles over many years by the teams producing these data; hence releasing both raw data and datasets may jeopardise their expectations of priority publishing. The weak discipline-specific differences between Nature and Science (Fig. 2) may be related to how certain subject editors implemented and enforced stringent data sharing policies.

Aer rigorous attempts to contact the authors, data availability increased by one third on average across dis-

ciplines, with full and at least partial availability reaching 70% and 83%, respectively. ese gures are in the

top end of studies conducted thus far8,22 and indicate the relatively superior overall data availability in Science

and Nature compared with other journals. However, the relative rates of data retrieval upon request, decline

sharing data and ignoring the requests were on par with studies covering other journals and specic research

elds10,12,25,26,28. Across 20 years, we identied the overall loss of data at an estimated rate of 3.5% and 5.9% for

initially available data and data eectively available upon request, respectively. is rate of data decay is much less

than 17% year−1 previously reported in plant and animal sciences based on a comparable approach24.

While the majority of data are eventually available, it is alarming that less than half of the data clearly stated to be available upon request could be effectively obtained from the authors. Although there may be objective reasons such as force majeure, these results suggest that many authors declaring data availability upon contacting may have abused the publishers’ or funders’ policy that allows statements of data availability upon request as the only means of data sharing. We find that this infringes research ethics and disables fair competition among research groups. Researchers hiding their own data may be in a position of power compared with fair players in situations of big data analysis, when they can access all data (including their own), while others have more limited opportunities. Data sharing is also important for securing the possibility to re-analyse and re-interpret unexpected results [9,32] and to detect scientific misconduct [25,33]. More rigorous control of data release would prevent manuscripts with serious issues in sampling design or analytical procedures from being prepared, reviewed and eventually accepted for publication.

Our study uniquely recorded the authors’ concerns and specific requests when negotiating data sharing. Concerns and hesitations about data sharing are understandable because of potential drawbacks and misunderstandings related to data interpretation and priority of publishing [17,34] that may outweigh the benefits of recognition and passive participation in broader meta-studies. Nearly one quarter of researchers expressed various concerns or had specific requests depending on the discipline, especially about the specific objectives of our study. Previous studies with questionnaires about hypothetical data sharing unrelated to actual data sharing reveal that financial interests, priority of additional publishing and fear of challenging the interpretations after data re-analysis constitute the authors’ major concerns [12,35,36]. Another study indicated that two thirds of researchers sharing biomedical data expected to be invited as co-authors upon use of their data [37], although this does not fulfil the authorship criteria [6,38]. At least partly related to these issues, the reasons for declining data sharing differed among disciplines: while social scientists usually referred to the loss of data, psychologists most commonly pointed out ethical/legal issues. Recently published data were, however, more commonly declined due to ethical/legal issues, which indicates rising concerns about data protection and potential misuse. Although we offered a possibility to share anonymised data sets, such trimmed data sets were never obtained from the authors, suggesting that ethical issues were not the only reason for data decline. Because research fields strongly differed in the frequency of no response to data requests, most unanswered requests can be considered declines that avoid official replies, which may harm the authors’ reputation.

[Figure 9: stacked bar chart of relative frequency (%) by discipline (Ecology; Forestry; Humanities; Materials for energy & catalysis; Biomaterials & biotechnology; Microbiology; Optics & photonics; Psychology; Social sciences).]

Fig. 9 Preferred ways of data storage in articles (n = 875) representing different disciplines (blue, text and supplement; purple, data archive; yellow, authors’ homepage; vermillion, previous publications; grey, museum; black, upon (reasonable) request; white, none declared).

Because we did not sample randomly across journals, our interpretations are limited to the journals Nature and Science. Our study across disciplines did not account for the particular academic editor, which may have partly contributed to the differences among research fields and journals. Not all combinations of disciplines, journals and time periods received the intended 25 replicate articles because of the poor representation of certain research fields in the 2000–2009 period. This may have reduced our ability to detect statistically significant differences among the disciplines. We also obtained estimates for the final data availability for seven out of nine disciplines. Although we excluded the remaining two disciplines from comparisons of initial and final data availability, this may have slightly altered the overall estimates. The process of screening the potentially relevant articles chronologically backwards resulted in overrepresentation of more recent articles in certain relatively popular disciplines, which may have biased comparisons across disciplines. However, the paucity of a residual year effect and year × discipline interaction in overall models, and of a residual time effect in separate analyses within research fields, indicates a minimal bias (Figure S1).

We recorded the concerns and requests of authors who had issues with initial data sharing. Therefore, these responses may be relatively more sceptical than the opinions of the majority of the scientific community publishing in these journals. It is likely that the authors who did not respond have concerns and reasons for declining similar to those of the authors who refused data sharing.

Our experience shows that receiving data typically required long email exchanges with the authors, contacting other referred authors or sending a reminder. Obtaining data took on average 15 days, representing a substantial effort to both parties39. This could have been easily avoided by releasing data upon article acceptance. On the other hand, we received tips for analysis, caution against potential pitfalls and the authors’ informed consent upon contacting. According to our experience, more than two thirds of the authors need to be contacted for retrieving important metadata, variance estimates or specifying methods for meta-analyses40. Thus, contacting the authors may be commonly required to fill gaps in the data, but such extra specifications are easier to provide compared with searching and converting old datasets into a universally understandable format.

Due to various concerns and tedious data re-formatting and uploading, the authors should be better motivated for data sharing41. Data formatting and releasing certainly benefit from clear instructions and support from funders, institutions and publishers. In certain cases, public recognition such as badges of open data for articles following the best data sharing practices and increasing numbers of citations may promote data release by an order of magnitude42. Citable data papers are certainly another way forward43,44, because these provide access to a well-organised dataset and add to the authors’ publication record. Encouraging the listing of published data sets with download and citation metrics in grant and job applications alongside other bibliometric indicators should promote data sharing. Linking released data to publicly available research accounts such as ORCID, ResearcherID and Google Scholar would benefit authors, other researchers and evaluators alike. To account for many authors’ fear of data theft17 and to prioritise the publishing options of data owners, setting a reasonable embargo period for third-party publishing may be needed in specific cases such as immediate data release following data generation45 and dissertations.

All funders, research institutions, researchers, editors and publishers should collectively contribute to turning data sharing into a win-win situation for all parties and the scientific endeavour in general. Funding agencies may have a key role here due to the lack of conflicting interests and a possibility of exclusive allocation to depositing and publishing huge data files46. Funders have efficient enforcing mechanisms during reporting periods, with an option to refuse extensions or the approval of forthcoming grant applications. We advocate that funders should include published data sets, if relevant, as an evaluation criterion besides other bibliometric information. Research institutions may follow the same principles when issuing institutional grants and employing research staff. Institutions should also insist that their employees follow open data policies45.

Academic publishers also have a major role in shaping data sharing policies. Although deposition and maintenance of data incur extra costs to commercial publishers, they should promote data deposition in their servers or public repositories. An option is to hire specific data editors for evaluating data availability in supplementary materials or online repositories and refusing final publishing before the data are fully available in a relevant format47. For efficient handling, clear instructions and a machine-readable data availability statement option (with a QR code or link to the data) should be provided. In non-open access journals, the data should be accessible free of charge or at reduced price to unsubscribed users. Creating specific data journals or ‘data paper’ formats may promote publishing and sharing data that would otherwise pile up in the drawer because of disappointing results or the lack of time for preparing a regular article. The leading scientometrics platforms Clarivate Analytics, Google Scholar and Scopus should index data journals equally with regular journals to motivate researchers to publish their data. There should be a possibility of article withdrawal by the publisher, if the data availability statements are incorrect or the data have been removed post-acceptance30. Much of the workload should stay on the editors who are paid by the supporting association, institution or publisher in most cases. The editors should grant the referees access to these data during the reviewing process48, asking them for a second opinion about


data availability and reasons for declining to do so. Similar stringent data sharing policies are increasingly implemented by various journals26,30,47.

In conclusion, data availability in top scientific journals differs strongly by discipline, but it is improving in most research fields. As our study exemplifies, the ‘data availability upon request’ model is insufficient to ensure access to datasets and other critical materials. Considering the overall data availability patterns, authors’ concerns and reasons for declining data sharing, we advocate that (a) data releasing costs ought to be covered by funders; (b) shared data and the associated bibliometric records should be included in the evaluation of job and grant applications; and (c) data sharing enforcement should be led by both funding agencies and academic publishers.

Materials and Methods

Data collection. To assess differences in data availability in different research disciplines, we focused our study on Nature and Science, two high-impact, general-interest journals that practise relatively stringent data availability policies49. Because of major changes in the public attitude and journals’ policies about data sharing, our survey was focused on two study periods, 2000–2009 and 2010–2019. We selected nine scientific disciplines as defined by the Springer Nature publishing group - biomaterials and biotechnology, ecology, forestry, humanities, materials for energy and catalysis, microbiology, optics and photonics, psychology and social sciences (see Table S1 for details) - for analysis based on their coverage in Nature and Science journals and data-driven research. These nine disciplines were selected based on the competence of our team and the objective to cover as different research fields as possible including natural sciences, social sciences and humanities. The articles were searched by discipline, keywords and/or manual browsing as follows. For Nature, our search was refined as https://www.nature.com/search?order=date_desc&journal=nature&article_type=research&subject=microbiology&date_range=2010-2019 (italicised parts varied). For Science, the corresponding search string was the following: https://search.sciencemag.org/?searchTerm=microbiology&order=newest&limit=textFields&pageSize=10&startDate=2010-01-01&endDate=2019-08-31&articleTypes=Research%20and%20reviews&source=sciencemag%7CScience. In both journals, the articles were retrieved by browsing search results chronologically backwards from September 2019 or September 2009 until reaching 25 articles matching the criteria. When the number of suitable articles was insufficient, we searched by using additional discipline-specific keywords in the title and browsed all issues manually when necessary. In some research fields, 25 articles could not be found for all journal and time period combinations and therefore, data availability was evaluated for 875 articles in total (Table S1). In each article, we identified a specific analysis or result that was critical for the main conclusion of that study based on both the authors’ emphasis and our subjective assessment. We determined whether the underlying data of these critical results - datasets, images (including videos), models (including scripts, programs and analytical procedures) or physical items - were available in the main text, supplementary materials or other indicated sources such as specific data repositories, authors’ homepages, museums, or upon request to the corresponding author (Figure S2). When available, we downloaded these data, checked for relevant metadata, identifiers and other components, and evaluated whether it was theoretically possible to repeat these specific analyses and include these materials in a field-specific metastudy. For example, in the case of a dataset, we evaluated the data table for the presence of relevant metadata and sample codes necessary to perform the analysis; for any statistical procedure, the authors must have used such a data table in their original work. We considered the data to be too raw if these either required a large amount of work (other than common data transformations) to generate the data table or model, or we had doubts whether the same data table could be reproduced with the methods described. Raw high-throughput sequencing data are typical examples of incomplete datasets, because these usually lack necessary metadata and require a thorough bioinformatics analysis, with the output depending on software and selected options. For further examples, certain optical raw images or videos make no sense without expert filtering, and computer scripts are of limited use without thorough instructions.
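As an illustration of how the varying parts of the Nature search URL quoted above (subject and date range) fit together, the following is a minimal Python sketch. It is not the authors' procedure, since the articles were browsed manually; the function name `nature_search_url` and its parameter names are hypothetical, and only the query keys that appear in the quoted URL are used.

```python
from urllib.parse import urlencode

NATURE_SEARCH = "https://www.nature.com/search"

def nature_search_url(subject: str, date_range: str) -> str:
    """Assemble the Nature search URL quoted in the text for one discipline and period."""
    params = {
        "order": "date_desc",        # newest first, for browsing backwards in time
        "journal": "nature",
        "article_type": "research",
        "subject": subject,          # varied by discipline
        "date_range": date_range,    # varied by study period, e.g. "2000-2009"
    }
    return f"{NATURE_SEARCH}?{urlencode(params)}"

url = nature_search_url("microbiology", "2010-2019")
```

Keeping the fixed query keys in one place and passing only the discipline and period as arguments makes it harder for the search settings to drift between the 18 discipline and period combinations.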

If these critical data were unavailable or only partly available (i.e., missing some integral metadata, instructions or explanations), we contacted the first corresponding author or a relevant author referred in relation to access to the specific item, requesting the data for a metastudy by using a pre-defined format and an institutional email address (Item S1). In the email, we carefully specified the materials required to produce a particular figure or table to avoid confusion and upsetting the authors with a messy request. We indicated that the data are intended for a metastudy in a related topic to test the authors’ willingness to share the data for actual use, not just their intention to share for no reasonable purpose. We similarly evaluated the received data for integrity and requested further information, if necessary, to meet the standards. We also recorded the responses of corresponding authors to data requests, including any specific requests or concerns and reasons for declining (Item S1).

The authors were mostly contacted early in the week and two reminders were sent ca. 14 and 28 days later if necessary (Item S1). The reminders were also addressed to other corresponding authors if relevant. If emails were returned with an error message, we contacted other corresponding authors or used an updated email address found from the internet or newer publications. We considered 60 days from sending the first email a reasonable time period for the authors to locate and send the requested data4.
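The contact timeline described above (a first email, reminders after about 14 and 28 days, and a 60-day window counted from the first email) can be sketched as a small date calculation. This is an illustrative Python sketch only; the function `contact_schedule` and the example start date are hypothetical, and the text gives the reminder offsets as approximate.

```python
from datetime import date, timedelta

REMINDER_OFFSETS_DAYS = (14, 28)  # "ca. 14 and 28 days" after the first email
RESPONSE_WINDOW_DAYS = 60         # window counted from the first email

def contact_schedule(first_email: date) -> dict:
    """Return the nominal reminder dates and the end of the 60-day response window."""
    return {
        "first_email": first_email,
        "reminders": [first_email + timedelta(days=d) for d in REMINDER_OFFSETS_DAYS],
        "deadline": first_email + timedelta(days=RESPONSE_WINDOW_DAYS),
    }

# Hypothetical example: a first request sent on a Monday.
schedule = contact_schedule(date(2019, 10, 7))
```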

For each article, we recorded the details of publishing (date printed, journal, discipline), corresponding authors (number, country of first affiliation, acquaintance with the contact author) and data (availability, type, ways of access)50. Data complexity was evaluated based on the authors’ relative amount of extra work to polish the raw data (e.g. low-complexity data include raw DNA sequence data, raw images, artefacts; high-complexity data include bioinformatics-treated molecular data sets, noise-removed images, models and scripts). As of 23.03.2020, we recorded the open access status and number of citations for each article using searches in the ISI Web of Science (https://apps.webofknowledge.com/). The citation count was expressed as citations per year, discounting the first 90 days with initially fewer citations.


Data analysis. The principal aim of this study was to determine the relative importance of scientific discipline and time period on data availability and authors’ concerns in response to data sharing requests, by accounting for multiple potentially important covariates (Fig. 1). The response variables, i.e. initial and final data availability (none, partly or fully available), authors’ responses (ignored, data shared or declined), concerns and reasons for decline, exhibit multinomial distribution50 and were hence transformed to dummy variables. Similarly, the multi-level explanatory variables (discipline, topic overlap, countries and continents of corresponding authors, data type and complexity) were transformed to dummies, whereas continuous variables (linear time, number of citations, time to obtain data, number of corresponding authors) were square root- or logarithm-transformed where appropriate. All analyses were performed in STATISTICA 12 (StatSoft Inc., Tulsa, OK, USA).
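The dummy coding and transformations described above can be illustrated with a short Python sketch. The analyses themselves were run in STATISTICA, so this is not the authors' code; the helper `to_dummies`, the example values and the use of log(x + 1) to accommodate zero counts are assumptions made for illustration.

```python
import math

def to_dummies(values, levels):
    """Turn one categorical variable into one 0/1 indicator list per level."""
    return {level: [1 if v == level else 0 for v in values] for level in levels}

# Made-up author responses, using the three categories named in the text.
responses = ["ignored", "data shared", "declined", "data shared"]
dummies = to_dummies(responses, ["ignored", "data shared", "declined"])

# Made-up citation counts with the two transformations mentioned in the text.
citations = [0, 3, 15, 99]
sqrt_citations = [math.sqrt(c) for c in citations]
log_citations = [math.log(c + 1) for c in citations]  # +1 is an assumption for zeros
```

Each indicator list can then serve as a binary response or predictor in its own right, which is what allows a multinomial variable to be analysed with binomial models.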

Data analysis of the dummy-transformed multinomial and binomial variables was performed using stepwise logistic regression model selection with a binomial link function using corrected Akaike information criterion (AICc) as a selection criterion, and Somers’ D statistic and model determination coefficients (R²) as measures of overall goodness of fit. Determination coefficients and Wald’s W statistic were used to estimate the relative importance of explanatory variables. We calculated 95% confidence intervals for multiple proportions51 using the R package MultinomialCI (https://rdrr.io/cran/MultinomialCI/). Increasing false discovery rates related to multiple comparisons were accounted for by using Bonferroni correction of P-values (expressed as Padj) where appropriate.
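For readers unfamiliar with the two corrections named above, the following Python sketch shows the standard definitions: AICc adds a small-sample penalty of 2k(k + 1)/(n − k − 1) to AIC = 2k − 2 ln L, and the Bonferroni correction multiplies each P-value by the number of comparisons, capped at 1. The function names and example values are hypothetical, and the sketch does not reproduce the STATISTICA model selection itself.

```python
def aicc(log_likelihood: float, k: int, n: int) -> float:
    """Corrected Akaike information criterion for k parameters and n observations."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n <= k + 1")
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)

def bonferroni(p_values):
    """Bonferroni-adjusted P-values: p * m, capped at 1, for m comparisons."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Made-up values for illustration only.
example_aicc = aicc(log_likelihood=-100.0, k=3, n=199)
example_padj = bonferroni([0.01, 0.04, 0.5])
```

During stepwise selection, the candidate model with the lowest AICc is preferred; the correction term shrinks towards zero as n grows relative to k.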

Models with continuous response variables (proportion of available data, annual citations, time to receive data) were tested using general linear models in two steps. First, the model selection included only dummy and continuous explanatory variables. Multilevel categorical predictors corresponding to significant dummies as well as significant continuous variables were included in the final model selection based on forward selection. To check for potential biases related to the article selection procedure in both periods, we tested the effect of discipline, period and year and all their interaction terms on initial data availability by retaining all variables in the model (Figure S1). Differences in these factor levels were tested using Tukey post-hoc tests for unequal sample size, which account for multiple testing issues.

Data availability

The entire dataset is available in a spreadsheet format in the plutoF data repository50.

Code availability

No specific code was generated for analysis of these data.

Received: 11 December 2020; Accepted: 29 June 2021;

Published: xx xx xxxx

References

1. Fan, J. et al. Challenges of big data analysis. Nat. Sci. Rev. 1, 293–314 (2014).
2. Kitchin, R. The data revolution: Big data, open data, data infrastructures and their consequences. (Sage Publications, London, 2014).
3. Science Staff. Challenges and opportunities. Science 331, 692–693 (2011).
4. Cech, T. R. et al. Sharing publication-related data and materials: responsibilities of authorship in the life sciences. National Academies Press, Washington, D.C. (2003).
5. Fischer, B. A. & Zigmond, M. J. The essential nature of sharing in science. Sci. Engineer. Ethics 16, 783–799 (2010).
6. Duke, C. S. & Porter, H. H. The ethics of data sharing and reuse in biology. BioScience 63, 483–489 (2013).
7. Fienberg, S. E. et al. Sharing Research Data. National Academy Press, Washington, D.C. (1985).
8. Begley, C. G. & Ioannidis, J. P. Reproducibility in science: improving the standard for basic and preclinical research. Circul. Res. 116, 116–126 (2015).
9. Open Science Collaboration. Estimating the reproducibility of psychological science. Science 349, aac4716 (2015).
10. Hardwicke, T. E. & Ioannidis, J. P. Populating the Data Ark: An attempt to retrieve, preserve, and liberate data from the most highly-cited psychology and psychiatry articles. PLoS One 13, e0201856 (2018).
11. Roche, D. G. et al. Public data archiving in ecology and evolution: how well are we doing? PLoS Biol. 13, e1002295 (2015).
12. Tenopir, C. et al. Changes in data sharing and data reuse practices and perceptions among scientists worldwide. PLoS One 10, e0134826 (2015).
13. Hardwicke, T. E. et al. Data availability, reusability, and analytic reproducibility: Evaluating the impact of a mandatory open data policy at the journal Cognition. R. Soc. Open Sci. 5, 180448 (2018).
14. Witwer, K. W. Data submission and quality in microarray-based microRNA profiling. Clin. Chem. 59, 392–400 (2013).
15. Stuart, D. et al. Whitepaper: Practical challenges for researchers in data sharing. figshare https://doi.org/10.6084/m9.figshare.5975011 (2018).
16. Borgman, C. L. Scholarship in the digital age: Information, infrastructure, and the Internet. MIT Press, Cambridge (2010).
17. Longo, D. L. & Drazen, J. M. Data sharing. New England J. Med. 375, 276–277 (2016).
18. Lewandowsky, S. & Bishop, D. Research integrity: Don’t let transparency damage science. Nature 529, 459–461 (2016).
19. Colavizza, G. et al. The citation advantage of linking publications to research data. PLoS One 15, e0230416 (2020).
20. Piwowar, H. A. et al. Sharing detailed research data is associated with increased citation rate. PLoS One 2, e308 (2007).
21. Hicks, D. et al. Bibliometrics: the Leiden Manifesto for research metrics. Nature 520, 429–431 (2015).
22. Eckert, E. M. et al. Every fifth published metagenome is not available to science. PLoS Biol. 18, e3000698 (2020).
23. Sherry, C. et al. Assessment of transparent and reproducible research practices in the psychiatry literature. Preprint at https://osf.io/jtcr/download (2019).
24. Vines, T. H. et al. The availability of research data declines rapidly with article age. Curr. Biol. 24, 94–97 (2014).
25. Wicherts, J. M. et al. The poor availability of psychological research data for reanalysis. Am. Psychol. 61, 726–728 (2006).
26. Vines, T. H. et al. Mandated data archiving greatly improves access to research data. FASEB J. 27, 1304–1308 (2013).
27. Krawczyk, M. & Reuben, E. (Un)available upon request: Field experiment on researchers’ willingness to share supplementary materials. Account. Res. 19, 175–186 (2012).
28. Vanpaemel, W. et al. Are we wasting a good crisis? The availability of psychological research data after the storm. Collabra 1, 1–5 (2015).
29. Grant, R. & Hrynaszkiewicz, I. The impact on authors and editors of introducing data availability statements at Nature journals. Int. J. Digit. Curat. 13, 195–203 (2018).
30. Hrynaszkiewicz, I. et al. Developing a research data policy framework for all journals and publishers. Data Sci. J. 19, 5 (2020).


31. Wallach, J. D. et al. Reproducible research practices, transparency, and open access data in the biomedical literature, 2015–2017. PLoS Biol. 16, e2006930 (2018).
32. Kraus, W. L. Do you see what I see? Quality, reliability, and reproducibility in biomedical research. Mol. Endocrinol. 28, 277–280 (2014).
33. Wicherts, J. M. et al. Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLoS One 6, e26828 (2011).
34. Wallis, J. C., Rolando, E. & Borgman, C. L. If we share data, will anyone use them? Data sharing and reuse in the long tail of science and technology. PLoS One 8, e67332 (2013).
35. Blumenthal, D. et al. Withholding research results in academic life science. JAMA 277, 1224–1228 (1997).
36. Kim, Y. & Stanton, J. M. Institutional and individual influences on scientists’ data sharing practices. J. Comput. Sci. Edu. 3, 47–56 (2013).
37. Federer, L. M. et al. Biomedical data sharing and reuse: Attitudes and practices of clinical and scientific research staff. PLoS One 10, e0129506 (2015).
38. Patience, G. S. et al. Intellectual contributions meriting authorship: Survey results from the top cited authors across all science categories. PLoS One 14, e0198117 (2019).
39. Volk, C., Lucero, Y. & Barnas, K. Why is data sharing in collaborative natural resource efforts so hard and what can we do to improve it? Environ. Manage. 53, 883–893 (2014).
40. Tedersoo, L. et al. Towards global patterns in the diversity and community structure of ectomycorrhizal fungi. Mol. Ecol. 21, 4160–4170 (2012).
41. Reichman, O. J. et al. Challenges and opportunities of open data in ecology. Science 331, 703–705 (2011).
42. Kidwell, M. C. et al. Badges to acknowledge open practices: A simple, low-cost, effective method for increasing transparency. PLoS Biol. 14, e1002456 (2016).
43. Candela, L., Castelli, D., Manghi, P. & Tani, A. Data journals: a survey. J. Ass. Inform. Sci. Technol. 66, 1747–1762 (2015).
44. Callaghan, S. et al. Making data a first class scientific output: data citation and publication by NERC’s Environmental Data Centres. Int. J. Digit. Curat. 7, 107–113 (2012).
45. Dyke, S. O. & Hubbard, T. J. Developing and implementing an institute-wide data sharing policy. Genome Med. 3, 1–8 (2011).
46. Heidorn, P. B. Shedding light on the dark data in the long tail of science. Libr. Trends 57, 280–299 (2008).
47. Langille, M. G. et al. “Available upon request”: not good enough for microbiome data! Microbiome 6, 8 (2018).
48. Morey, R. D. et al. The Peer Reviewers’ Openness Initiative: incentivizing open research practices through peer review. R. Soc. Open Sci. 3, 150547 (2016).
49. Alsheikh-Ali, A. A. et al. Public availability of published research data in high-impact journals. PLoS One 6, e24357 (2011).
50. Tedersoo, L. et al. Data sharing across disciplines: ‘available upon request’ holds no promise. University of Tartu; Institute of Ecology and Earth Sciences https://doi.org/10.15156/BIO/1359426 (2021).
51. Sison, C. P. & Glaz, J. Simultaneous confidence intervals and sample size determination for multinomial proportions. J. Am. Stat. Ass. 90, 366–369 (1995).

Acknowledgements

We thank all authors who released their data along with their article or responded to our data request. Although some of the obtained datasets are used in a series of meta-analyses or released by us upon agreement, we apologise to the authors who spent a significant amount of time to provide the data, which we cannot use for secondary analyses. We thank A. Kahru, T. Soomere, Ü. Niinemets and J. Allik for their constructive comments on an earlier version of the manuscript.

Author contributions

All authors contributed to study design, work with literature and writing. L.T. analysed data and led writing.

Competing interests

The authors declare no competing interests.

Additional information

Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41597-021-00981-0.

Correspondence and requests for materials should be addressed to L.T.

Reprints and permissions information is available at www.nature.com/reprints.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Content courtesy of Springer Nature, terms of use apply. Rights reserved

Terms and Conditions

Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center GmbH (“Springer Nature”).

Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers and authorised users (“Users”), for small-

scale personal, non-commercial use provided that all copyright, trade and service marks and other proprietary notices are maintained. By

accessing, sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of use (“Terms”). For these

purposes, Springer Nature considers academic use (by researchers and students) to be non-commercial.

These Terms are supplementary and will apply in addition to any applicable website terms and conditions, a relevant site licence or a personal

subscription. These Terms will prevail over any conflict or ambiguity with regards to the relevant terms, a site licence or a personal subscription

(to the extent of the conflict or ambiguity only). For Creative Commons-licensed articles, the terms of the Creative Commons license used will

apply.

We collect and use personal data to provide access to the Springer Nature journal content. We may also use these personal data internally within

ResearchGate and Springer Nature and as agreed share it, in an anonymised way, for purposes of tracking, analysis and reporting. We will not

otherwise disclose your personal data outside the ResearchGate or the Springer Nature group of companies unless we have your permission as

detailed in the Privacy Policy.

While Users may use the Springer Nature journal content for small scale, personal non-commercial use, it is important to note that Users may

not:

use such content for the purpose of providing other users with access on a regular or large scale basis or as a means to circumvent access

control;

use such content where to do so would be considered a criminal or statutory offence in any jurisdiction, or gives rise to civil liability, or is

otherwise unlawful;

falsely or misleadingly imply or suggest endorsement, approval , sponsorship, or association unless explicitly agreed to by Springer Nature in

writing;

use bots or other automated methods to access the content or redirect messages

override any security feature or exclusionary protocol; or

share the content in order to create substitute for Springer Nature products or services or a systematic database of Springer Nature journal

content.

In line with the restriction against commercial use, Springer Nature does not permit the creation of a product or service that creates revenue,

royalties, rent or income from our content or its inclusion as part of a paid for service or for other commercial gain. Springer Nature journal

content cannot be used for inter-library loans and librarians may not upload Springer Nature journal content on a large scale into their, or any

other, institutional repository.

These terms of use are reviewed regularly and may be amended at any time. Springer Nature is not obligated to publish any information or

content on this website and may remove it or features or functionality at our sole discretion, at any time with or without notice. Springer Nature

may revoke this licence to you at any time and remove access to any copies of the Springer Nature journal content which have been saved.

To the fullest extent permitted by law, Springer Nature makes no warranties, representations or guarantees to Users, either express or implied

with respect to the Springer nature journal content and all parties disclaim and waive any implied warranties or warranties imposed by law,

including merchantability or fitness for any particular purpose.

Please note that these rights do not automatically extend to content, data or other material published by Springer Nature that may be licensed

from third parties.

If you would like to use or distribute our Springer Nature journal content to a wider audience or on a regular basis or in any other manner not

expressly permitted by these Terms, please contact Springer Nature at

onlineservice@springernature.com

Available via license: CC BY 4.0

Content may be subject to copyright.

Content uploaded by Leho Tedersoo

Content may be subject to copyright.

Randomized Controlled Trials in Evidence-Based Dentistry

Chapter

Jun 2024

Neurophysiological recordings from parietal areas of macaque brain during an instructed-delay reaching task

Article

Full-text available

Jun 2024

Facilitating data sharing in scientific research, especially in the domain of animal studies, holds immense value, particularly in mitigating distress and enhancing the efficiency of data collection. This study unveils a meticulously curated collection of neural activity data extracted from six electrophysiological datasets recorded from three parietal areas (V6A, PEc, PE) of two Macaca fascicularis during an instructed-delay foveated reaching task. This valuable resource is now accessible to the public, featuring spike timestamps, behavioural event timings and supplementary metadata, all presented alongside a comprehensive description of the encompassing structure. To enhance accessibility, data are stored as HDF5 files, a convenient format due to its flexible structure and the capability to attach diverse information to each hierarchical sub-level. To guarantee ready-to-use datasets, we also provide some MATLAB and Python code examples, enabling users to quickly familiarize themselves with the data structure.

Open science practices in demographic research: An appraisal

Article

Full-text available

Jun 2024
DEMOGR RES

Ugofilippo Basellini

Measuring data rot: An analysis of the continued availability of shared data from a Single University

Article

Full-text available

Jun 2024
PLOS ONE

Kristin A. Briney

To determine where data is shared and what data is no longer available, this study analyzed data shared by researchers at a single university. 2166 supplemental data links were harvested from the university’s institutional repository and web scraped using R. All links that failed to scrape or could not be tested algorithmically were tested for availability by hand. Trends in data availability by link type, age of publication, and data source were examined for patterns. Results show that researchers shared data in hundreds of places. About two-thirds of links to shared data were in the form of URLs and one-third were DOIs, with several FTP links and links directly to files. A surprising 13.4% of shared URL links pointed to a website homepage rather than a specific record on a website. After testing, 5.4% the 2166 supplemental data links were found to be no longer available. DOIs were the type of shared link that was least likely to disappear with a 1.7% loss, with URL loss at 5.9% averaged over time. Links from older publications were more likely to be unavailable, with a data disappearance rate estimated at 2.6% per year, as well as links to data hosted on journal websites. The results support best practice guidance to share data in a data repository using a permanent identifier.

Best practices for genetic and genomic data archiving

Article

May 2024
Nat. Ecol. Evol.

Genetic and genomic data are collected for a vast array of scientific and applied purposes. Despite mandates for public archiving, data are typically used only by the generating authors. The reuse of genetic and genomic datasets remains uncommon because it is difficult, if not impossible, due to non-standard archiving practices and lack of contextual metadata. But as the new field of macrogenetics is demonstrating, if genetic data and their metadata were more accessible and FAIR (findable, accessible, interoperable and reusable) compliant, they could be reused for many additional purposes. We discuss the main challenges with existing genetic and genomic data archives, and suggest best practices for archiving genetic and genomic data. Recognizing that this is a longstanding issue due to little formal data management training within the fields of ecology and evolution, we highlight steps that research institutions and publishers could take to improve data archiving.

Reporting guidelines for terrestrial respirometry: Building openness, transparency of metabolic rate and evaporative water loss data

Article

Jun 2024

Data as the next challenge in atomistic machine learning

Article

Jun 2024

Publication, funding, and experimental data in support of Human Reference Atlas construction and usage

Article

Jun 2024

Experts from 18 consortia are collaborating on the Human Reference Atlas (HRA) which aims to map the 37 trillion cells in the healthy human body. Information relevant for HRA construction and usage is held by experts, published in scholarly papers, and captured in experimental data. However, these data sources use different metadata schemas and cannot be cross-searched efficiently. This paper documents the compilation of a dataset, named HRAlit, that links the 136 HRA v1.4 digital objects (31 organs with 4,279 anatomical structures, 1,210 cell types, 2,089 biomarkers) to 583,117 experts; 7,103,180 publications; 896,680 funded projects, and 1,816 experimental datasets. The resulting HRAlit has 22 tables with 20,939,937 records including 6 junction tables with 13,170,651 relationships. The HRAlit can be mined to identify leading experts, major papers, funding trends, or alignment with existing ontologies in support of systematic HRA construction and usage.

Preliminary assessment of the knowledge gaps to improve nature conservation of soil biodiversity

Article

May 2024

From Planning Stage Towards FAIR Data: A Practical Metadatasheet For Biomedical Scientists

Article

May 2024

Datasets consist of measurement data and metadata. Metadata provides context, essential for understanding and (re-)using data. Various metadata standards exist for different methods, systems and contexts. However, relevant information resides at differing stages across the data-lifecycle. Often, this information is defined and standardized only at publication stage, which can lead to data loss and workload increase. In this study, we developed Metadatasheet, a metadata standard based on interviews with members of two biomedical consortia and systematic screening of data repositories. It aligns with the data-lifecycle allowing synchronous metadata recording within Microsoft Excel, a widespread data recording software. Additionally, we provide an implementation, the Metadata Workbook, that offers user-friendly features like automation, dynamic adaption, metadata integrity checks, and export options for various metadata standards. By design and due to its extensive documentation, the proposed metadata standard simplifies recording and structuring of metadata for biomedical scientists, promoting practicality and convenience in data management. This framework can accelerate scientific progress by enhancing collaboration and knowledge transfer throughout the intermediate steps of data creation.

The citation advantage of linking publications to research data

Article

Apr 2020
PLOS ONE

Efforts to make research results open and reproducible are increasingly reflected by journal policies encouraging or mandating authors to provide data availability statements. As a consequence of this, there has been a strong uptake of data availability statements in recent literature. Nevertheless, it is still unclear what proportion of these statements actually contain well-formed links to data, for example via a URL or permanent identifier, and if there is an added value in providing such links. We consider 531,889 journal articles published by PLOS and BMC, develop an automatic system for labelling their data availability statements according to four categories based on their content and the type of data availability they display, and finally analyze the citation advantage of different statement categories via regression. We find that, following mandated publisher policies, data availability statements become very common. In 2018 93.7% of 21,793 PLOS articles and 88.2% of 31,956 BMC articles had data availability statements. Data availability statements containing a link to data in a repository (rather than being available on request or included as supporting information files) are a fraction of the total. In 2017 and 2018, 20.8% of PLOS publications and 12.2% of BMC publications provided DAS containing a link to data in a repository. We also find an association between articles that include statements that link to data in a repository and up to 25.36% (± 1.07%) higher citation impact on average, using a citation prediction model. We discuss the potential implications of these results for authors (researchers) and journal publishers who make the effort of sharing their data in repositories. All our data and code are made available in order to reproduce and extend our results.

Every fifth published metagenome is not available to science

Article

Apr 2020
PLOS BIOL

Have you ever sought to use metagenomic DNA sequences reported in scientific publications? Were you successful? Here, we reveal that metagenomes from no fewer than 20% of the papers found in our literature search, published between 2016 and 2019, were not deposited in a repository or were simply inaccessible. The proportion of inaccessible data within the literature has been increasing year-on-year. Noncompliance with Open Data is best predicted by the scientific discipline of the journal. The number of citations, journal type (e.g., Open Access or subscription journals), and publisher are not good predictors of data accessibility. However, many publications in high-impact factor journals do display a higher likelihood of accessible metagenomic data sets. Twenty-first century science demands compliance with the ethical standard of data sharing of metagenomes and DNA sequence data more broadly. Data accessibility must become one of the routine and mandatory components of manuscript submissions, a requirement that should be applicable across the increasing number of disciplines using metagenomics. Compliance must be ensured and reinforced by funders, publishers, editors, reviewers, and, ultimately, the authors.

Developing a Research Data Policy Framework for All Journals and Publishers

Article

Feb 2020
Data Sci J

An output of the Data Policy Standardisation and Implementation Interest Group (IG) of the Research Data Alliance (RDA). More journals and publishers – and funding agencies and institutions – are introducing research data policies. But as the prevalence of policies increases, there is potential to confuse researchers and support staff with numerous or conflicting policy requirements. We define and describe 14 features of journal research data policies and arrange these into a set of six standard policy types or tiers, which can be adopted by journals and publishers to promote data sharing in a way that encourages good practice and is appropriate for their audience’s perceived needs. Policy features include coverage of topics such as data citation, data repositories, data availability statements, data standards and formats, and peer review of research data. These policy features and types have been created by reviewing the policies of multiple scholarly publishers, which collectively publish more than 10,000 journals, and through discussions and consensus building with multiple stakeholders in research data policy via the Data Policy Standardisation and Implementation Interest Group of the Research Data Alliance. Implementation guidelines for the standard research data policies for journals and publishers are also provided, along with template policy texts which can be implemented by journals in their Information for Authors and publishing workflows. We conclude with a call for collaboration across the scholarly publishing and wider research community to drive further implementation and adoption of consistent research data policies.

Assessment of transparent and reproducible research practices in the psychiatry literature

Article

Feb 2020

Background: Reproducibility is a cornerstone of scientific advancement; however, many published works may lack the core components needed for study reproducibility. Aims: In this study, we evaluate the state of transparency and reproducibility in the field of psychiatry using specific indicators as proxies for these practices. Methods: An increasing number of publications have investigated indicators of reproducibility, including research by Hardwicke et al., on which we based the methodology for our observational, cross-sectional study. From a random 5-year sample of 300 publications in PubMed-indexed psychiatry journals, two researchers extracted data in a duplicate, blinded fashion using a piloted Google form. The publications were examined for indicators of reproducibility and transparency, which included availability of: materials, data, protocol, analysis script, open-access, conflict of interest, funding and online preregistration. Results: This study ultimately evaluated 296 randomly-selected publications with a 3.20 median impact factor. Only 107 were available online. Most primary authors originated from USA, UK and the Netherlands. The top three publication types were cohort studies, surveys and clinical trials. Regarding indicators of reproducibility, 17 publications gave access to necessary materials, four provided in-depth protocol and one contained raw data required to reproduce the outcomes. One publication offered its analysis script on request; four provided a protocol availability statement. Only 107 publications were publicly available: 13 were registered in online repositories and four, ten and eight publications included their hypothesis, methods and analysis, respectively. Conflict of interest was addressed by 177 and reported by 31 publications. Of 185 publications with a funding statement, 153 publications were funded and 32 were unfunded. Conclusions: Currently, Psychiatry research has significant potential to improve adherence to reproducibility and transparency practices. Thus, this study presents a reference point for the state of reproducibility and transparency in Psychiatry literature. Future assessments are recommended to evaluate and encourage progress.

Intellectual contributions meriting authorship: Survey results from the top cited authors across all science categories

Article

Jan 2019
PLOS ONE

Authorship is the currency of an academic career for which the number of papers researchers publish demonstrates creativity, productivity, and impact. To discourage coercive authorship practices and inflated publication records, journals require authors to affirm and detail their intellectual contributions but this strategy has been unsuccessful as authorship lists continue to grow. Here, we surveyed close to 6000 of the top cited authors in all science categories with a list of 25 research activities that we adapted from the National Institutes of Health (NIH) authorship guidelines. Responses varied widely from individuals in the same discipline, same level of experience, and same geographic region. Most researchers agreed with the NIH criteria and grant authorship to individuals who draft the manuscript, analyze and interpret data, and propose ideas. However, thousands of the researchers also value supervision and contributing comments to the manuscript, whereas the NIH recommends discounting these activities when attributing authorship. People value the minutiae of research beyond writing and data reduction: researchers in the humanities value it less than those in pure and applied sciences; individuals from Far East Asia and Middle East and Northern Africa value these activities more than anglophones and northern Europeans. While developing national and international collaborations, researchers must recognize differences in people's values while assigning authorship.

The Impact on Authors and Editors of Introducing Data Availability Statements at Nature Journals

Article

Dec 2018

This article describes the adoption of a standard policy for the inclusion of data availability statements in all research articles published at the Nature family of journals, and the subsequent research which assessed the impacts that these policies had on authors, editors, and the availability of datasets. The key findings of this research project include the determination of average and median times required to add a data availability statement to an article; and a correlation between the way researchers make their data available, and the time required to add a data availability statement.

Reproducible research practices, transparency, and open access data in the biomedical literature, 2015–2017

Article

Nov 2018
PLOS BIOL

Currently, there is a growing interest in ensuring the transparency and reproducibility of the published scientific literature. According to a previous evaluation of 441 biomedical journal articles published in 2000–2014, the biomedical literature largely lacked transparency in important dimensions. Here, we surveyed a random sample of 149 biomedical articles published between 2015 and 2017 and determined the proportion reporting sources of public and/or private funding and conflicts of interests, sharing protocols and raw data, and undergoing rigorous independent replication and reproducibility checks. We also investigated what can be learned about reproducibility and transparency indicators from open access data provided on PubMed. The majority of the 149 studies disclosed some information regarding funding (103, 69.1% [95% confidence interval, 61.0% to 76.3%]) or conflicts of interest (97, 65.1% [56.8% to 72.6%]). Among the 104 articles with empirical data in which protocols or data sharing would be pertinent, 19 (18.3% [11.6% to 27.3%]) discussed publicly available data; only one (1.0% [0.1% to 6.0%]) included a link to a full study protocol. Among the 97 articles in which replication in studies with different data would be pertinent, there were five replication efforts (5.2% [1.9% to 12.2%]). Although clinical trial identification numbers and funding details were often provided on PubMed, only two of the articles without a full text article in PubMed Central that discussed publicly available data at the full text level also contained information related to data sharing on PubMed; none had a conflicts of interest statement on PubMed. Our evaluation suggests that although there have been improvements over the last few years in certain key indicators of reproducibility and transparency, opportunities exist to improve reproducible research practices across the biomedical literature and to make features related to reproducibility more readily visible in PubMed.

Data availability, reusability, and analytic reproducibility: Evaluating the impact of a mandatory open data policy at the journal Cognition

Article

Aug 2018

Access to data is a critical feature of an efficient, progressive and ultimately self-correcting scientific ecosystem. But the extent to which in-principle benefits of data sharing are realized in practice is unclear. Crucially, it is largely unknown whether published findings can be reproduced by repeating reported analyses upon shared data (‘analytic reproducibility’). To investigate this, we conducted an observational evaluation of a mandatory open data policy introduced at the journal Cognition. Interrupted time-series analyses indicated a substantial post-policy increase in data available statements (104/417, 25% pre-policy to 136/174, 78% post-policy), although not all data appeared reusable (23/104, 22% pre-policy to 85/136, 62%, post-policy). For 35 of the articles determined to have reusable data, we attempted to reproduce 1324 target values. Ultimately, 64 values could not be reproduced within a 10% margin of error. For 22 articles all target values were reproduced, but 11 of these required author assistance. For 13 articles at least one value could not be reproduced despite author assistance. Importantly, there were no clear indications that original conclusions were seriously impacted. Mandatory open data policies can increase the frequency and quality of data sharing. However, suboptimal data curation, unclear analysis specification and reporting errors can impede analytic reproducibility, undermining the utility of data sharing and the credibility of scientific findings.

The Data Revolution: Big Data, Open Data, Data Infrastructures & Their Consequences

Book

Jan 2014

Rob Kitchin

Assessment of transparent and reproducible research practices in the psychiatry literature

Preprint

Aug 2019

Objective: Reproducibility is a cornerstone of scientific advancement; however, many published works may lack the core components needed for study reproducibility. In this study, we evaluate the state of transparency and reproducibility in the field of Psychiatry. Methods: An observational, cross-sectional study design was used. From a random sample of 300 publications in PubMed-indexed psychiatry journals, two researchers extracted data in a duplicate and blinded fashion using a piloted Google Form. For this study, we included publications from January 1, 2014 to December 31, 2018. The publications were evaluated for indicators of reproducibility and transparency, which included the availability of materials, data, protocol, analysis script, preregistration, open access, financial conflicts of interest, funding sources, and pre-registration in an online repository. Results: Our study identified 158 journals meeting the inclusion criteria and 90,281 publications from within the timeframe. Of the 300 randomly sampled, 4 were inaccessible, resulting in a final sample of 296 publications. Of the 296, only 107 (36%) were publicly available online. Regarding reproducibility, 17 publications gave access to necessary materials, 4 provided an in-depth protocol, and 1 contained the raw data required to reproduce the outcomes. Conclusions: Currently, researchers in the field of Psychiatry do not adhere to practices that promote reproducibility and transparency. Change is therefore needed. This study presents a reference point for the state of reproducibility and transparency in psychiatry literature, and future assessments are recommended to evaluate progress.
