Refinement of Experimental Design and Conduct in Laboratory Animal Research
Jeremy D. Bailoo, Thomas S. Reichlin, and Hanno Würbel

Jeremy D. Bailoo, PhD, and Thomas S. Reichlin, PhD, are postdoctoral fellows and Hanno Würbel, PhD, is professor and head of the Division of Animal Welfare at the Veterinary Public Health Institute of the University of Bern, Switzerland. Address correspondence and reprint requests to Hanno Würbel, Division of Animal Welfare, VPH Institute, Vetsuisse Faculty, University of Bern, Länggassstrasse 120, 3012 Bern, Switzerland, or email hanno.wuerbel@vetsuisse.unibe.ch.
Abstract
The scientific literature of laboratory animal research is replete with papers reporting poor reproducibility of results as well as failure to translate results to clinical trials in humans. This may stem in part from poor experimental design and conduct of animal experiments. Despite widespread recognition of these problems and implementation of guidelines to attenuate them, a review of the literature suggests that experimental design and conduct of laboratory animal research are still in need of refinement. This paper will review and discuss possible sources of biases, highlight advantages and limitations of strategies proposed to alleviate them, and provide a conceptual framework for improving the reproducibility of laboratory animal research.
Key Words: 3R; refinement; ARRIVE; reproducibility; internal validity; external validity; standardization; preregistration
What Is the Problem?
In 2005, the biomedical research community was startled by a paper entitled "Why Most Published Research Findings Are False" (Ioannidis 2005). Based on systematic reviews and simulations, the author concluded that "for most study designs and settings, it is more likely for a research claim to be false than true." Was this just an alarmist claim or is there indeed a problem with the validity of biomedical research? Despite some debate about the validity of Ioannidis' original analysis (e.g., Goodman and Greenland 2007), evidence has accumulated over the past 10 years that tends to favor the latter view. This is further supported by a recent commentary in Nature (Macleod 2011) that underscores concerns that experimental design and conduct need to improve in laboratory animal research.
Poor Reproducibility and Translational
Failure
The use of animals for research is a privilege granted to scientists with the explicit understanding that this use provides significant new knowledge without causing unnecessary harm. However, poor reproducibility of results from animal experiments across many research areas (cf. Richter et al. 2009) and widespread failure to translate preclinical animal research to clinical trials (i.e., translational failure; e.g., Kola and Landis 2004; Howells et al. 2014; van der Worp et al. 2010) suggest that these expectations are not met. For example, of more than 500 neuroprotective interventions that were effective in animal models of ischemic stroke, none was found to be effective in humans (O'Collins et al. 2006). A 10-year review (1991–2000) of drug development revealed that the main causes of attrition at the clinical trials stage are lack of efficacy and safety, which together account for 60% of the overall attrition rate (Kola and Landis 2004). These authors therefore concluded that animal studies which better predict the efficacy and safety of drugs in clinical trials are needed to reduce translational failure.
The Study of the Scientific Validity of Laboratory Animal Research
The empirical study of the scientific validity of laboratory animal research is an emerging field (Macleod 2011), and several lines of evidence highlight both current and potential problems. For example, translational failure in drug development could indicate that the construct validity of animal models is poor (Box 1). Construct validity refers to the degree to which a test measures what it claims to be measuring (Cronbach and Meehl 1955), and there is increasing concern that the construct validity of many animal models for human diseases is indeed questionable (e.g., Editor 2011; Nestler and Hyman 2010). However, construct validity depends on the specific disease that is modeled, and there is no simple method for assessing construct validity. Furthermore, improvements in animal models usually go hand in hand with
advances in research on the construct that is being modeled.
Therefore, improving the construct validity of animal models
depends on advances in research rather than adherence to
methods or policies.
Another aspect related to construct validity concerns the health and well-being of the animals used for research. Growing evidence indicates that current standard practices of housing and care of laboratory animals are associated with abnormal brain and behavioral development and other signs of poor welfare, which may also compromise the scientific validity of research findings (Garner 2005; Knight 2001; Martin et al. 2010; Würbel 2001). Whether animal welfare matters in terms of the scientific validity of a research finding, however, depends on the area of research and on the specific research question.
Although highly relevant, construct validity and animal welfare will therefore not be further discussed in this article. Instead, we will focus our discussion on two fundamental aspects of scientific validity, both of which are relevant across all fields of laboratory animal research and are determined by experimental design and conduct: internal and external validity.
Box 1. Glossary of Key Terms

Bias: Systematic deviation from the true value of the estimated treatment effect caused by failures in the design, conduct, or analysis of an experiment.

Attrition bias: The unequal distribution of dropouts or nonresponders between treatment groups. This can lead to a systematic difference between treatment groups and may lead to an incorrect ascription of a causal relation between the treatment and the dependent variable.

Detection bias: Systematic differences between treatment groups in how outcomes are assessed. This can be reduced or avoided by blinding or masking.

Performance bias: Systematic differences in animal care and handling between treatment groups. This can be reduced or avoided by blinding or masking.

Selection bias: The biased allocation of subjects to treatment groups. Biased allocation can lead to systematic differences in the baseline characteristics between groups. This can be avoided by randomized allocation and allocation concealment.

Blinding/masking: Keeping the persons involved in the experiment (those who perform the experiment, collect data, assess outcomes, etc.) unaware of the treatment allocation.

Types of error

False negative (β): The failure to reject the null hypothesis when it is false. This is often due to small sample sizes (underpowered study designs).

False positive (α): The rejection of the null hypothesis when it is true. This is often due to some form of bias.

Randomization

Simple: Randomized allocation of subjects to the different treatment groups based on a single sequence of random assignments. This may lead to imbalanced groups and group sizes when the number of subjects is small.

Stratified: Allocation of subjects to blocks of subjects sharing similar baseline characteristics (e.g., sex, age, body size) followed by randomized allocation of the subjects of each block to the different treatment groups. This is intended to counterbalance potential covariates across treatment groups.

Reproducibility: The ability of a result to be replicated by an independent experiment in the same or a different laboratory.

Validity

Construct validity: The degree to which inferences are warranted from the sampling properties of an experiment (e.g., units, settings, treatments, and outcomes) to the entities these samples are intended to represent.

External validity: The extent to which the results of an animal experiment provide a correct basis for generalizations to other populations of animals (including humans) and/or other environmental conditions.

Internal validity: The extent to which the design, conduct, and analysis of the experiment eliminate the possibility of bias so that the inference of a causal relationship between an experimental treatment and variation in an outcome measure is warranted.

Definitions adapted from van der Worp et al. (2010) and from the Cochrane Collaboration.
Internal and External Validity of Laboratory
Animal Research
Internal validity refers to the extent to which a causal relation between an experimental treatment and variation in an outcome measure is warranted (Box 1). It critically depends on the extent to which experimental design and conduct minimize systematic error (also called bias). Already some 15 to 20 years ago, reports were published indicating that fundamental aspects of proper scientific conduct were often ignored, thereby compromising the internal validity of research findings (Festing and Altman 2002; McCance 1995). Several recent studies suggest that not much has changed to date. For example, a systematic review of animal experiments conducted in publicly funded research establishments in the United Kingdom and United States revealed that only a few authors reported using randomization (13%) or blinding (14%) to avoid bias in animal selection and outcome assessment (Kilkenny et al. 2009). Others found that only 3% of all studies reported an a priori sample size calculation (Sena et al. 2007), and in even fewer cases was a primary outcome variable defined (Macleod 2011). Similar results were obtained from various reviews of preclinical neurological research (Frantzias et al. 2011; van der Worp et al. 2010; Vesterinen et al. 2010), indicating that systematic bias may be widespread in laboratory animal research.
In clinical research, similar problems became apparent several years earlier, resulting in the CONSORT statement intended to improve the reporting of randomized clinical trials (Begg 1996; Moher et al. 2001; Schulz et al. 2010). Based on the CONSORT statement, and with the aim to improve the reporting of animal studies, Kilkenny et al. (2010) recently developed the Animal Research: Reporting of In Vivo Experiments (ARRIVE) guidelines, a 20-item checklist of information to be reported in publications of animal research. To date, these guidelines have been endorsed by over 430 journals, funders, universities, and learned societies (www.NC3Rs.org.uk) in the hope that such guidelines will not only improve the quality of scientific reporting but also the internal validity of the research.
In contrast to internal validity, external validity extends beyond the specific experimental setting and refers to the generalizability of research findings, i.e., how applicable they are to other environmental conditions, experimenters, study populations, and even to other strains or species of animals (including humans; Lehner 1996; Box 1). Poor external validity may thus contribute to both poor reproducibility of a research finding (e.g., when the same study replicated in a different laboratory by a different experimenter produces different results) and translational failure (e.g., when a treatment shown to be efficacious in an animal model is not efficacious in a clinical trial in humans).
Importantly, some of the strategies employed to increase internal validity may at the same time decrease external validity. For example, common strategies of standardizing experiments by using homogeneous study populations to maximize test sensitivity inevitably compromise the external validity of the research findings, resulting in poor reproducibility (Richter et al. 2009, 2010, 2011; van der Worp et al. 2010; Würbel 2000, 2002; Würbel and Garner 2007).
Scope for Refinement of Laboratory Animal Research
Taken together, there seems to be considerable scope for refinement of experimental design and conduct to improve both the internal and external validity of laboratory animal research. In the following sections, we will explore this in more detail and propose potential ways of refinement as well as promising areas of future research.
Internal Validity: Refinement of Experimental Conduct to Avoid Systematic Biases
Although 235 different types of bias in biomedical research have been characterized (Chavalarias and Ioannidis 2010), van der Worp et al. (2010) consider four types of bias to be particularly relevant with respect to the internal validity of laboratory animal research: selection bias, attrition bias, performance bias, and detection bias.
Selection bias refers to the biased allocation of animals to treatment groups and can be avoided by randomization (Box 1). Because selection bias may occur either consciously or subconsciously, methods based on active decisions by the experimenter (e.g., picking animals "at random" from their cages) are not considered true randomization. Tossing coins or throwing dice provides simple ways of randomization, but for some purposes random number generators (e.g., www.random.org) may be preferable. Even the use of allegedly "homogeneous" study populations (such as same-sex, same-age inbred mice raised under identical housing conditions) does not preclude the need for randomization, because individual differences still prevail. This is best illustrated by studies with inbred mice showing that variation within strains is often significantly greater than between strains (Wahlsten 2010).
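As an illustration only, such a computer-based allocation can be scripted in a few lines. The sketch below is a minimal example in Python; the animal identifiers, group labels, and fixed seed are hypothetical and are not taken from any of the studies cited here.

```python
import random

def simple_randomization(animal_ids, groups, seed=42):
    """Randomly allocate animals to treatment groups.

    Shuffling a balanced list of group labels (as done here) keeps group sizes
    equal while still removing experimenter-driven choices from the allocation;
    with a single unrestricted random sequence, group sizes could become
    unequal when the number of animals is small.
    """
    labels = [groups[i % len(groups)] for i in range(len(animal_ids))]
    rng = random.Random(seed)  # a fixed seed only so the allocation can be documented
    rng.shuffle(labels)
    return dict(zip(animal_ids, labels))

# Hypothetical example: 12 mice allocated to control vs. treatment
allocation = simple_randomization([f"mouse_{i:02d}" for i in range(1, 13)],
                                  ["control", "treatment"])
print(allocation)
```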
In many cases, it is possible to use stratified randomization instead of simple randomization. In stratified randomization, the study population is divided into discrete subpopulations based on systematic differences in factors that are likely to affect the outcome measures, such as sex, age, littermates, disease severity, treatment dose, etc. The animals of each subpopulation are then separately allocated at random to the different treatment groups. Through this, the factor levels defining the different subpopulations are counterbalanced among all treatment groups. The use of statistical methods designed to analyze such factorial designs results in the removal of the variation between the strata from the error term, thereby increasing the precision and statistical power of the experiment (Altman and Bland 1999).
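A stratified allocation can be sketched in the same way: animals are first grouped by the stratification factors, and each stratum is then randomized separately. The following Python fragment is a hypothetical illustration under assumed factor names (sex, litter), not a prescription from the literature cited above.

```python
import random
from collections import defaultdict

def stratified_randomization(animals, strata_keys, groups, seed=1):
    """Allocate animals to treatment groups separately within each stratum.

    `animals` is a list of dicts describing each animal; `strata_keys` are the
    variables (e.g., sex, litter) that define the strata to be counterbalanced.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for a in animals:
        strata[tuple(a[k] for k in strata_keys)].append(a)

    allocation = {}
    for members in strata.values():
        labels = [groups[i % len(groups)] for i in range(len(members))]
        rng.shuffle(labels)  # randomize within the stratum only
        for animal, label in zip(members, labels):
            allocation[animal["id"]] = label
    return allocation

# Hypothetical example: counterbalance sex and litter across control/treatment
animals = [{"id": f"m{i}", "sex": "f" if i % 2 else "m", "litter": i // 4}
           for i in range(16)]
print(stratified_randomization(animals, ["sex", "litter"], ["control", "treatment"]))
```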
Selection bias may also occur when the criteria for inclusion or exclusion of animals are poorly defined. Complications that require exclusion of animals are an inherent risk in animal studies, especially with animal models involving invasive surgical procedures (e.g., Jüni et al. 2001) and in models of stroke (Crossley et al. 2008). For ethical reasons, humane endpoints need to be defined a priori, and animals that reach humane endpoints may be lost from the subsequent analysis. However, it may also be justifiable to exclude animals for scientific reasons if complications occur that are unrelated to the experimental treatment and render the outcome measures meaningless. To avoid bias, however, all criteria for inclusion and exclusion of animals need to be predefined, and the person deciding on inclusion or exclusion needs to be unaware of the treatment allocation (van der Worp et al. 2010). If these criteria are not well specified, one risks the induction of attrition bias, the unequal distribution of dropouts among treatment groups.
Performance bias may occur whenever there is a systematic difference in the interaction with the animals (e.g., animal care, experimental procedures) between the treatment groups, apart from the treatment under investigation (Jüni et al. 2001; Box 1). For example, differences in the quality of experimenter handling of stressed vs. nonstressed mice may occur due to higher fearfulness and stress reactivity in the stressed mice (Hurst and West 2010). In contrast, detection bias occurs when the outcome is measured differently in animals of different treatment groups. Again, both performance bias and detection bias may occur either consciously or subconsciously, and the best way to avoid these biases is blinding (also known as masking).
Blinding is considered complete when the investigator and everyone else involved in the experiment (animal care personnel, laboratory technicians, outcome assessors, etc.) are unaware of the animals' allocation to treatments. In contrast to randomization, blinding is not always possible, for example, when scoring behavior among treatment groups that differ visibly (e.g., strains of mice differing in coat color). Thus, it is important that authors explicitly report the blinding status of all people whose involvement may affect the outcome of the study (Kilkenny et al. 2010a; Moher et al. 2010).
Other relevant sources of bias include sample sizes that are either too small or too large, a poor definition of the primary (and secondary) outcome variable(s), and the use of inappropriate statistical analyses, all of which may result in poor statistical conclusion validity (Cozby and Bates 2011). Whenever possible, a formal sample size calculation (and power analysis) should be performed that specifies the minimal effect size considered to be relevant (e.g., Cohen's d or f), the desired statistical power (1−β), and the level of statistical significance (α). Some have argued that such calculations are only applicable to "confirmatory research" but not to "exploratory research" since effect sizes may be unknown and research in the exploratory mode "will often test many different strategies in parallel, and this is only feasible if small sample sizes are used" (Kimmelman et al. 2014). However, neither unknown effect sizes nor the exploratory nature of research should be taken as excuses for violating fundamental principles of good scientific practice. Tools such as NCSS PASS, G*Power, and the resource equation method (Mead 1990) (to name just a few) facilitate sample size calculations. This is even possible when knowledge about the sample distribution is incomplete, because usually a minimal relevant effect size can be specified a priori. Furthermore, testing many different hypotheses in parallel using small sample sizes will inevitably produce spurious results that undermine the reliability of the research (Button et al. 2013).
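As a hedged illustration of such an a priori calculation, the sketch below computes an approximate per-group sample size for a two-group comparison from a minimal relevant effect size (Cohen's d), α, and the desired power, using the normal approximation. Dedicated tools such as G*Power or NCSS PASS (mentioned above) implement more exact versions; the numbers chosen here are placeholders, not recommendations.

```python
from math import ceil
from scipy.stats import norm

def n_per_group(effect_size_d, alpha=0.05, power=0.8, two_sided=True):
    """Approximate per-group sample size for comparing two independent means.

    Normal-approximation formula: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2
    animals per group.
    """
    z_alpha = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size_d) ** 2)

# Placeholder example: detect d = 0.8 with alpha = 0.05 and 80% power
print(n_per_group(0.8))  # about 25 animals per group under this approximation
```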
Both overpowered and underpowered studies are unethical, albeit for different reasons. Overpowered studies use more animals than needed to detect a significant effect of a given size. This is relatively rare, however, because it violates one of the 3R principles (reduction), and ethics committees are trained to spot reduction potential. From a scientific perspective, large sample sizes are not a problem as such, as long as a minimal effect size is defined. However, any two treatments will be significantly different if the measurement precision and sample size are large enough, and so overpowered designs may lead to bias when biologically irrelevant effect sizes are considered relevant because of their statistical significance. Underpowered studies are much more prevalent, even though they are much more problematic from both ethical and scientific points of view (cf. Button et al. 2013). Underpowered studies are unable to detect biologically relevant effect sizes, and as a result, the animals are essentially wasted on inconclusive research. On the other hand, there are obvious economic incentives to keep sample sizes small. In addition, it appears that the well-intended yet one-sided focus of ethics committees on reduction may further promote underpowered study designs (Demétrio et al. 2013). In the human clinical trial literature, the ethical and scientific costs of underpowered study designs have long been recognized (Halpern 2002); it is crucial that formal power calculations become standard practice in animal research so that scientific gain is maximized while animal use is minimized (Button et al. 2013; Kilkenny et al. 2010b; Macleod 2011).
Recent evidence from preclinical neurological research indicates that there are also too many statistically significant (i.e., positive) results in the literature (Ioannidis and Trikalinos 2007; Tsilidis et al. 2013). These authors concluded that selective analysis and selective outcome reporting are the most likely causes. Selective analysis occurs when several statistical analyses are performed but only the one with the "best" (i.e., most significant) result is presented (Ioannidis 2008; Tsilidis et al. 2013). Similarly, selective outcome reporting occurs when many outcome variables are analyzed but only the variables that are significantly affected by the treatment are reported (Tsilidis et al. 2013). While the possible merits of selective reporting are still debated (e.g., de Winter and Happee 2013; van Assen et al. 2014), we maintain that to avoid these potential biases, the primary (and secondary) outcome variable(s) as well as the statistical approach(es) to testing for treatment effects need to be
specified before the onset of the study. Ultimately, the best way to achieve this would be the prospective registration of all animal studies (see below).
Finally, as the use of the scientific method requires reproducibility and falsifiability, the sharing of collected data (i.e., public data archiving) and validation of published analytical methods should become more common (Molloy 2011). Although this topic is not without issue or debate (e.g., Alsheikh-Ali et al. 2011; Editor 2014; Nelson 2009; Roche et al. 2014), the transparency of collected data can only improve the quality of published scientific results.
Do Reporting Guidelines Help?
The common approach to reducing poor experimental conduct has been the implementation of reporting guidelines. This started with the CONSORT statement to improve the reporting of human clinical trials about 20 years ago (Begg 1996; Moher et al. 2001; Schulz et al. 2010) and was recently extended to animal research by the ARRIVE guidelines (Kilkenny et al. 2010b). Similar reporting guidelines are available for other areas of research, such as STROBE for epidemiology (von Elm et al. 2007), PRISMA for systematic reviews and meta-analyses (Moher 2009), and several others listed by the EQUATOR Network (www.equator-network.org). More recently, it has been proposed that animal experiments should be preregistered (Chambers 2013), similarly to clinical trials, which according to the Declaration of Helsinki (WMA 2013) must be registered in a publicly accessible database (e.g., www.ClinicalTrials.gov) before recruitment of the first subject. Preregistration should help to avoid inappropriate research practices, including "inadequate statistical power, selective reporting of results, undisclosed analytic flexibility, and publication bias" (Chambers 2013). All of these initiatives reflect the pervasive nature of bias in biomedical research.
So, do reporting guidelines improve experimental conduct? Although there is only indirect evidence, there is good reason to believe that they do indeed. For example, systematic reviews and meta-analyses in preclinical research on stroke, multiple sclerosis, and Parkinson's disease indicate that poor reporting of study quality attributes (e.g., randomization, blinding, sample size calculation, etc.) correlates with overstated treatment effects (Rooke et al. 2011; Sena et al. 2007; Vesterinen et al. 2010). It is therefore plausible that better reporting correlates with better quality of study conduct. Although, theoretically, reports of study quality may be faked, such outright fraud is hopefully uncommon. It is more likely that the advocacy of reporting guidelines will raise awareness of the importance of rigorous experimental conduct (Landis et al. 2012). Nevertheless, a recent analysis of papers published in the PLoS and Nature journals after the endorsement of the ARRIVE guidelines found as yet very little improvement in reporting standards, indicating that authors are still ignoring, and referees and editors are not enforcing, these guidelines (Baker et al. 2014).
External Validity: Refinement of Experimental Design to Avoid Spurious Results
Reproducibility is a cornerstone of the scientific method, and poor reproducibility threatens the credibility of the entire field of animal research (Johnson 2013; Richter et al. 2009). Although better internal validity will also improve the reproducibility of results, reproducibility of a result is primarily a function of external validity (Richter et al. 2009; Würbel 2000).
By definition, external validity refers to the applicability of results to other environmental conditions, experimenters, study populations, and even to other strains or species of animals (including humans; Lehner 1996; Box 1). External validity therefore defines how generalizable results are. This also includes reproducibility, which is defined as the ability of a result to be replicated by an independent experiment either in the same or in a different laboratory (Box 1). However, the relationship between external validity and reproducibility is not so straightforward. External validity (i.e., the range of conditions to which a result can be generalized) is an inherent feature of a result; some results are more externally valid than others. For example, pre-pulse inhibition (PPI) of the startle reflex to acoustic stimuli is highly conserved across many species, including mice and humans, and is fairly robust against variation in environmental conditions (Geyer et al. 2002). Thus, PPI has very high external validity. Because of this, PPI is also highly reproducible across different laboratories despite considerable variation in conditions among laboratories. In contrast, the locomotor activity of mice on an elevated zero-maze or plus-maze has very little external validity, as it is highly sensitive to test conditions (e.g., handling; Hurst and West 2010), and differences between strains of mice are highly inconsistent despite considerable efforts to equate conditions across laboratories (e.g., Crabbe et al. 1999; Richter et al. 2011). Therefore, experiments should be designed in ways that permit estimation of the external validity of the results. This can only be achieved if relevant features of the study design, such as animal characteristics and environmental conditions, are varied systematically (Würbel 2000, 2002).
Interestingly, this is contrary to conventional wisdom in laboratory animal science. The gold standard of experimental design adopted from the pure sciences (mathematics, physics, chemistry) is to hold constant all factors except for the independent variable(s) under investigation. This has become a central dogma in laboratory animal science that is referred to as standardization. Thus, laboratory animal science textbooks advise researchers to standardize their experiments by using genetically uniform animals, selecting these for maximal phenotypic uniformity (e.g., same age, same weight, etc.), and keeping all environmental and procedural factors constant (Beynen, Festing, et al. 2001; Beynen, Gärtner, et al. 2001). Such homogenization of study populations may compromise both the external validity and reproducibility of the results, an effect that has been referred to as the
standardization fallacy (Würbel 2000, 2002). The same fallacy was highlighted 80 years ago by the eminent Ronald A. Fisher (1935, p. 102): "The exact standardisation of experimental conditions, which is often thoughtlessly advocated as a panacea, always carries with it the real disadvantage that a highly standardised experiment supplies direct information only with respect to the narrow range of conditions achieved by standardisation. Standardisation, therefore, weakens rather than strengthens our ground for inferring a like result, when, as is invariably the case in practice, these conditions are somewhat varied."
Indeed, despite rigorous standardization of the experimental conditions across laboratories, several multi-laboratory studies revealed large proportions of results that were idiosyncratic to one laboratory (Crabbe et al. 1999; Richter et al. 2011; Wolfer et al. 2004). The reason for this may be that many environmental factors (e.g., staff, noise, etc.) cannot be equalized between laboratories, so that different laboratories inevitably standardize to different local environments (Richter et al. 2009; Würbel and Garner 2007). Therefore, standardization may actually be a cause of, rather than a cure for, poor reproducibility (Richter et al. 2009). Thus, not surprisingly, van der Worp et al. (2010) listed homogeneous study populations as a main source of poor external validity in preclinical animal research, which to some extent may also contribute to translational failure.
Some scientists have argued that we simply need to report more parameters that may potentially affect outcome measures (e.g., Arndt and Surjo 2001; Philip et al. 2010; Surjo and Arndt 2001). In this case, however, reporting guidelines will not help, and the attempt to promote extensive lists of methodological detail to facilitate interpretation of conflicting findings has been referred to as the listing fallacy (Würbel 2002). If anything, such lists may induce interpretation bias by attracting attention to differences in the listed parameters, although there may be many more parameters that were not considered, were considered to be irrelevant or too difficult to assess, or simply could not be listed. As long as a particular parameter has not been varied systematically within a given experiment, it is no more likely to explain conflicting findings than any other parameter, listed or unlisted, that differed between the respective experiments (Würbel 2002).
Statistical and Experimental Solutions
Among studies investigating behavioral differences between different inbred and mutant strains of mice (behavioral phenotyping), current estimates of the proportion of irreproducible results (false discovery proportion, FDP) from multi-laboratory studies range between 30 and 60% (Benjamini et al. 2014; Kafkafi et al. 2005, 2014). It is likely that similar FDPs apply to other areas of research.
Various solutions have been proposed to reduce the risk of obtaining such spurious results. For example, Johnson (2013) suggested lowering the critical P value of statistical significance from 0.05 to 0.005 or even 0.001 to match conventional evidence thresholds used in Bayesian testing. Assuming that approximately one-half of the hypotheses tested by scientists are true, Johnson (2013) estimated that between 17% and 25% of marginally significant scientific findings are false positives. However, to lower the proportion of false positives without increasing the proportion of false negatives, sample sizes would have to be increased by about 50% to 100% to achieve similar statistical power (Johnson 2013). Moreover, a general decrease of critical P values does not take into account that both external validity and reproducibility depend on the nature of the measured effect.
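The interplay between α, statistical power, and the share of false-positive findings can be made explicit with a standard, deliberately simplified calculation; it is not Johnson's Bayes-factor analysis, and the numbers below are illustrative assumptions only.

```latex
% If a fraction \pi_0 of tested hypotheses is truly null, the expected
% proportion of false positives among all significant findings is
\[
\mathrm{FPP} \;=\; \frac{\alpha\,\pi_0}{\alpha\,\pi_0 + (1-\beta)\,(1-\pi_0)} .
\]
% With \pi_0 = 0.5, \alpha = 0.05 and power 1-\beta = 0.8 this gives roughly 6%;
% with low power (1-\beta = 0.2) it rises to about 20%, illustrating why lowering
% \alpha without raising sample sizes mainly trades false positives for false negatives.
```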
Kafkafi and colleagues (2005, 2014) have therefore proposed to raise the benchmark for significant results in a more specific way. According to their "random laboratory model", laboratories should be considered as a sample, representing the population of all potential laboratories, and the interaction noise (the treatment × laboratory variance) should be added as a random factor to the individual animal noise (the within-laboratory variance). Similarly to the suggestion of lowering P values (Johnson 2013), this inflation of within-laboratory variance would generate a larger yardstick for the significance of treatment effects (Benjamini et al. 2014; Kafkafi et al. 2005, 2014), albeit in a more specific way. Using data from several multi-laboratory studies, the authors showed that this method may reduce the FDP considerably without losing too much statistical power. The difficulty with this approach is that such specificity will be achieved only if the treatments and measures are first tested across several laboratories to obtain accurate estimates of between-laboratory variance. This approach may thus not be applicable to animal experiments in general but may be useful for standard preclinical tests of efficacy and toxicity in drug development, as well as for specific large-scale projects, such as the International Mouse Phenotyping Consortium (Brown and Moore 2012a, 2012b; Mallon et al. 2012), which aims to determine the phenotypes of thousands of mutant lines with a battery of standard tests (Benjamini et al. 2014; Kafkafi et al. 2005, 2014).
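A hedged sketch of what such an analysis might look like in practice is given below: the treatment effect is tested against a mixed-effects model in which laboratory and the treatment × laboratory interaction enter as random effects, so that between-laboratory interaction variance inflates the yardstick for the treatment effect. The column names, the placeholder file name, and the use of statsmodels' MixedLM are illustrative assumptions, not the cited authors' own implementation.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per animal, with columns
# 'outcome' (measured variable), 'treatment' (e.g., drug vs. vehicle),
# and 'lab' (laboratory identity in a multi-laboratory study).
data = pd.read_csv("multi_lab_study.csv")  # placeholder file name

# Random intercept for laboratory plus a random treatment slope per laboratory;
# the random slope corresponds to the treatment x laboratory interaction that
# the random laboratory model adds to the within-laboratory (residual) noise.
model = smf.mixedlm("outcome ~ treatment", data,
                    groups=data["lab"],
                    re_formula="~treatment")
result = model.fit(reml=True)
print(result.summary())  # the fixed-effect test for 'treatment' now uses the larger yardstick
```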
Besides these statistical approaches, others have proposed mimicking between-laboratory variability experimentally. These proposals range from conducting an independent replicate study to conducting "real" multi-laboratory studies. For example, the Reproducibility Initiative has established a service to facilitate independent replicate studies (http://validation.scienceexchange.com/), while the Multi-PART consortium aims to develop a platform for international multicenter preclinical stroke trials based on randomized clinical trial design (www.dcn.ed.ac.uk/multipart/).
In addition to such true replications, there are several other ways in which studies may be designed to provide an estimate of the external validity and reproducibility of results. For example, Richter and colleagues (2010, 2011) proposed the "heterogenization" of study populations (rather than homogenization through standardization) by systematically varying a few selected factors. In principle, any aspect of the animals (e.g., genotype, sex, age, body condition, etc.) and their environment (e.g., housing conditions,
experimental protocol) may be used for such heterogenization. By varying two environmental factors using a 2 × 2 factorial design, Richter and colleagues (2010) successfully mimicked variation between independent replicates conducted within their own laboratory (see also Jonker et al. 2013; Wolfinger 2013; Würbel et al. 2013). However, a similarly simple form of heterogenization did not account for between-laboratory variation in a true multi-laboratory study (Richter et al. 2011). Further research is therefore needed to develop heterogenization protocols that mimic between-laboratory variability more effectively.
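To make the idea concrete, the sketch below allocates the animals of a single experiment to the cells of a 2 × 2 heterogenization design crossed with the treatment. The two factors (cage type and test age) and their levels are invented for illustration and do not correspond to the factors used by Richter and colleagues.

```python
import itertools
import random

def heterogenized_design(n_per_cell=4, seed=7):
    """Cross a treatment allocation with a 2 x 2 heterogenization design.

    Each combination of the two heterogenization factors receives the same
    number of control and treated animals, so the treatment effect is estimated
    across deliberately varied conditions rather than one standardized condition.
    """
    rng = random.Random(seed)
    cage_types = ["standard", "enriched"]   # hypothetical factor 1
    test_ages = ["8_weeks", "12_weeks"]     # hypothetical factor 2
    treatments = ["control", "treatment"]

    plan, animal = [], 1
    for cage, age in itertools.product(cage_types, test_ages):
        labels = treatments * (n_per_cell // 2)
        rng.shuffle(labels)  # randomize treatment within each design cell
        for label in labels:
            plan.append({"animal": f"m{animal:02d}", "cage": cage,
                         "age": age, "treatment": label})
            animal += 1
    return plan

for row in heterogenized_design():
    print(row)
```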
In the meantime, simple precautions may be taken as proposed by Paylor (2009), for example, by splitting experiments into small batches of animals that are tested some time apart instead of testing them in one large batch, by using multiple experimenters for testing and data collection instead of using only one, or by spreading test sessions across times of day instead of testing all animals at the same time of day. Assessing the effects of batch, experimenter, or time of day, respectively, will reveal whether such minor variations of conditions affect results and will therefore indicate whether reproducibility across larger variations of conditions (such as between laboratories) may be at stake.
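One simple way to act on these suggestions is to build the test schedule so that animals are spread across batches, experimenters, and times of day, which can then be included as factors in the analysis. The sketch below is only an illustration of the scheduling step; the batch size, experimenter labels, and session times are invented.

```python
import random

def schedule_sessions(allocation, experimenters, sessions, batch_size=8, seed=3):
    """Split an existing treatment allocation into small batches and spread the
    animals across experimenters and times of day.

    `allocation` maps animal IDs to treatment groups (e.g., from a randomization
    script); experimenters and session times are hypothetical labels whose
    effects can later be assessed as factors in the analysis.
    """
    rng = random.Random(seed)
    animals = list(allocation)
    rng.shuffle(animals)  # randomize test order before assigning batches
    schedule = []
    for i, animal in enumerate(animals):
        schedule.append({
            "animal": animal,
            "treatment": allocation[animal],
            "batch": i // batch_size + 1,
            "experimenter": experimenters[i % len(experimenters)],
            "session": sessions[(i // len(experimenters)) % len(sessions)],
        })
    return schedule

# Hypothetical usage with two experimenters and morning/afternoon sessions
example = schedule_sessions(
    {f"m{i:02d}": ("control" if i % 2 else "treatment") for i in range(16)},
    experimenters=["exp_A", "exp_B"],
    sessions=["morning", "afternoon"])
for row in example:
    print(row)
```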
Conclusions
Reproducibility and falsifiability are cornerstones of the scientific method, and it is because of these principles that science is often viewed as self-correcting, at least in the long term. The common consensus is that failures in reproducibility of animal research are not a consequence of scientific misconduct (e.g., Collins and Tabak 2014). However, negligence in experimental design, conduct, and publication (whether conscious or not) continues to plague animal research and, despite numerous initiatives to curb it, persists (Baker et al. 2014).
Facing these problems and their underlying causes, as discussed here, is hopefully a step towards effective refinement of experimental design and conduct. The ARRIVE guidelines provide a useful tool for improving the internal validity of animal research, and several strategies have been put forward for improving external validity as well. Nevertheless, it seems that greater pressure must be placed on researchers, reviewers, and journal editors to not only endorse such methods of refinement but to rigorously enforce them. Otherwise, the credibility and ethical justification of animal research may be permanently undermined.
Acknowledgments
The authors of this paper were funded by the ERC Advanced Grant "REFINE" (H.W. and J.D.B.), the FP7 Coordination and Support Action "Multi-PART" (H.W. and J.D.B.), and a research grant from the Swiss Federal Food Safety and Veterinary Office (H.W. and T.S.R.).
References
Alsheikh-Ali AA, Qureshi W, Al-Mallah MH, Ioannidis JPA. 2011. Public availability of published research data in high-impact journals. PLoS One 6:e24357.
Altman DG, Bland JM. 1999. How to randomise. BMJ 319:703–704.
Arndt SS, Surjo D. 2001. Methods for the behavioural phenotyping of mouse mutants. How to keep the overview. Behav Brain Res 125:39–42.
Baker D, Lidster K, Sottomayor A, Amor S. 2014. Two years later: Journals are not yet enforcing the ARRIVE guidelines on reporting standards for pre-clinical animal studies. PLoS Biol 12:e1001756.
Begg C. 1996. Improving the quality of reporting of randomized controlled trials. JAMA 276:637–639.
Benjamini Y, Lahav T, Kafkafi N. 2014. Estimating replicability of behavioral phenotyping results in a single laboratory. In: Measuring Behavior 2014: The replicability of measuring behavior.
Beynen AC, Festing MFW, van Montfort MAJ. 2001. Design of animal experiments. In: Van Zutphen LFM, Baumans V, Beynen AC, eds. Principles of Laboratory Animal Science. Revised ed. Amsterdam: Elsevier. p 219–249.
Beynen AC, Gärtner K, van Zutphen LFM. 2001. Standardization of animal experimentation. In: Van Zutphen LFM, Baumans V, Beynen AC, eds. Principles of Laboratory Animal Science. Revised ed. Amsterdam: Elsevier. p 103–110.
Brown SDM, Moore MW. 2012a. The International Mouse Phenotyping Consortium: Past and future perspectives on mouse phenotyping. Mamm Genome 23:632–640.
Brown SDM, Moore MW. 2012b. Towards an encyclopaedia of mammalian gene function: The International Mouse Phenotyping Consortium. Dis Model Mech 5:289–292.
Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, Munafò MR. 2013. Power failure: Why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci 14:365–376.
Chambers CD. 2013. Registered reports: A new publishing initiative at Cortex. Cortex 49:609–610.
Chavalarias D, Ioannidis JPA. 2010. Science mapping analysis characterizes 235 biases in biomedical research. J Clin Epidemiol 63:1205–1215.
Collins FS, Tabak LA. 2014. Policy: NIH plans to enhance reproducibility. Nature 505:612–613.
Cozby P, Bates S. 2011. Methods in Behavioral Research. 11th ed. McGraw-Hill Education.
Crabbe JC, Wahlsten DL, Dudek BC. 1999. Genetics of mouse behavior: Interactions with laboratory environment. Science 284:1670–1672.
Cronbach LJ, Meehl PE. 1955. Construct validity in psychological tests. Psychol Bull 52:281–302.
Crossley NA, Sena E, Goehler J, Horn J, van der Worp B, Bath PMW, Macleod M, Dirnagl U. 2008. Empirical evidence of bias in the design of experimental stroke studies: A metaepidemiologic approach. Stroke 39:929–934.
Demétrio CGB, Menten JFM, Leandro RA, Brien C. 2013. Experimental power considerations-justifying replication for animal care and use committees. Poult Sci 92:2490–2497.
De Winter J, Happee R. 2013. Why selective publication of statistically significant results can be effective. PLoS One 8:e66463.
Editor. 2011. Building a better mouse test. Nat Methods 8:697.
Editor. 2014. Share alike. Nature 507:140.
Festing MFW, Altman DG. 2002. Guidelines for the design and statistical analysis of experiments using laboratory animals. ILAR J 43:244–258.
Fisher RA. 1935. The Design of Experiments. Oliver & Boyd.
Frantzias J, Sena ES, Macleod MR, Al-Shahi Salman R. 2011. Treatment of intracerebral hemorrhage in animal models: Meta-analysis. Ann Neurol 69:389–399.
Garner JP. 2005. Stereotypies and other abnormal repetitive behaviors: Potential impact on validity, reliability, and replicability of scientific outcomes. ILAR J 46:106–117.
Geyer MA, McIlwain KL, Paylor R. 2002. Mouse genetic models for prepulse inhibition: An early review. Mol Psychiatry 7:1039–1053.
Goodman S, Greenland S. 2007. Why most published research findings are false: Problems in the analysis. PLoS Med 4:e168.
Halpern SD. 2002. The continuing unethical conduct of underpowered clinical trials. JAMA 288:358–362.
Howells DW, Sena ES, Macleod MR. 2014. Bringing rigour to translational medicine. Nat Rev Neurol 10:37–43.
Hurst JL, West RS. 2010. Taming anxiety in laboratory mice. Nat Methods 7:825–826.
Ioannidis JPA. 2005. Why most published research findings are false. PLoS Med 2:e124.
Ioannidis JPA. 2008. Why most discovered true associations are inflated. Epidemiology 19:640–648.
Ioannidis JPA, Trikalinos TA. 2007. The appropriateness of asymmetry tests for publication bias in meta-analyses: A large survey. CMAJ 176:1091–1096.
Johnson VE. 2013. Revised standards for statistical evidence. Proc Natl Acad Sci U S A 110:19313–19317.
Jonker RM, Guenther A, Engqvist L, Schmoll T. 2013. Does systematic variation improve the reproducibility of animal experiments? Nat Methods 10:373.
Jüni P, Altman DG, Egger M. 2001. Systematic reviews in health care: Assessing the quality of controlled clinical trials. BMJ 323:42–46.
Kafkafi N, Benjamini Y, Sakov A, Elmer GI, Golani I. 2005. Genotype-environment interactions in mouse behavior: A way out of the problem. Proc Natl Acad Sci U S A 102:4619–4624.
Kafkafi N, Lahav T, Benjamini Y. 2014. What's always wrong with my mouse? In: Measuring Behavior 2014: The replicability of measuring behavior.
Kilkenny C, Browne WJ, Cuthill IC, Emerson M, Altman DG. 2010a. Improving bioscience research reporting: The ARRIVE guidelines for reporting animal research. PLoS Biol 8:e1000412.
Kilkenny C, Browne WJ, Cuthill IC, Emerson M, Altman DG. 2010b. Animal research: Reporting in vivo experiments: The ARRIVE guidelines. J Gene Med 12:561–563.
Kilkenny C, Parsons N, Kadyszewski E, Festing MFW, Cuthill IC, Fry D, Hutton J, Altman DG. 2009. Survey of the quality of experimental design, statistical analysis and reporting of research using animals. PLoS One 4:e7824.
Kimmelman J, Mogil JS, Dirnagl U. 2014. Distinguishing between exploratory and confirmatory preclinical research will improve translation. PLoS Biol 12:e1001863.
Knight J. 2001. Animal data jeopardized by life behind bars. Nature 412:669.
Kola I, Landis J. 2004. Can the pharmaceutical industry reduce attrition rates? Nat Rev Drug Discov 3:711–715.
Landis SC, Amara SG, Asadullah K, Austin CP, Blumenstein R, Bradley EW, Crystal RG, Darnell RB, Ferrante RJ, Fillit H, Finkelstein R, Fisher M, Gendelman HE, Golub RM, Goudreau JL, Gross RA, Gubitz AK, Hesterlee SE, Howells DW, Huguenard J, Kelner K, Koroshetz W, Krainc D, Lazic SE, Levine MS, Macleod MR, McCall JM, Moxley RT 3rd, Narasimhan K, Noble LJ, Perrin S, Porter JD, Steward O, Unger E, Utz U, Silberberg SD. 2012. A call for transparent reporting to optimize the predictive value of preclinical research. Nature 490:187–191.
Lehner PN. 1996. Handbook of Ethological Methods. Cambridge University Press.
Macleod M. 2011. Why animal research needs to improve. Nature 477:511.
Mallon A-M, Iyer V, Melvin D, Morgan H, Parkinson H, Brown SDM, Flicek P, Skarnes WC. 2012. Accessing data from the International Mouse Phenotyping Consortium: State of the art and future plans. Mamm Genome 23:641–652.
Martin B, Ji S, Maudsley S, Mattson MP. 2010. "Control" laboratory rodents are metabolically morbid: Why it matters. Proc Natl Acad Sci U S A 107:6127–6133.
McCance I. 1995. Assessment of statistical procedures used in papers in the Australian Veterinary Journal. Aust Vet J 72:322–329.
Mead R. 1990. The Design of Experiments: Statistical Principles for Practical Applications. Cambridge University Press.
Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, Elbourne D, Egger M, Altman DG. 2010. CONSORT 2010 explanation and elaboration: Updated guidelines for reporting parallel group randomised trials. BMJ 340:c869.
Moher D, Schulz KF, Altman DG. 2001. The CONSORT statement: Revised recommendations for improving the quality of reports of parallel-group randomised trials. Lancet 357:1191–1194.
Moher D. 2009. Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA statement. Ann Intern Med 151:264.
Molloy JC. 2011. The Open Knowledge Foundation: Open data means better science. PLoS Biol 9:e1001195.
Nelson B. 2009. Data sharing: Empty archives. Nature 461:160–163.
Nestler EJ, Hyman SE. 2010. Animal models of neuropsychiatric disorders. Nat Neurosci 13:1161–1169.
O'Collins VE, Macleod MR, Donnan GA, Horky LL, van der Worp BH, Howells DW. 2006. 1,026 experimental treatments in acute stroke. Ann Neurol 59:467–477.
Paylor R. 2009. Questioning standardization in science. Nat Methods 6:253–254.
Philip VM, Duvvuru S, Gomero B, Ansah TA, Blaha CD, Cook MN, Hamre KM, Lariviere WR, Matthews DB, Mittleman G, Goldowitz D, Chesler EJ. 2010. High-throughput behavioral phenotyping in the expanded panel of BXD recombinant inbred strains. Genes Brain Behav 9:129–159.
Richter SH, Garner JP, Auer C, Kunert J, Würbel H. 2010. Systematic variation improves reproducibility of animal experiments. Nat Methods 7:167–168.
Richter SH, Garner JP, Würbel H. 2009. Environmental standardization: Cure or cause of poor reproducibility in animal experiments? Nat Methods 6:257–261.
Richter SH, Garner JP, Zipser B, Lewejohann L, Sachser N, Touma C, Schindler B, Chourbaji S, Brandwein C, Gass P, van Stipdonk N, van der Harst J, Spruijt B, Võikar V, Wolfer DP, Würbel H. 2011. Effect of population heterogenization on the reproducibility of mouse behavior: A multi-laboratory study. PLoS One 6:e16461.
Roche DG, Lanfear R, Binning SA, Haff TM, Schwanz LE, Cain KE, Kokko H, Jennions MD, Kruuk LEB. 2014. Troubleshooting public data archiving: Suggestions to increase participation. PLoS Biol 12:e1001779.
Rooke EDM, Vesterinen HM, Sena ES, Egan KJ, Macleod MR. 2011. Dopamine agonists in animal models of Parkinson's disease: A systematic review and meta-analysis. Parkinsonism Relat Disord 17:313–320.
Schulz KF, Altman DG, Moher D. 2010. CONSORT 2010 statement: Updated guidelines for reporting parallel group randomised trials. BMC Med 8:18.
Sena E, van der Worp HB, Howells D, Macleod M. 2007. How can we improve the pre-clinical development of drugs for stroke? Trends Neurosci 30:433–439.
Surjo D, Arndt SS. 2001. The Mutant Mouse Behaviour network, a medium to present and discuss methods for the behavioural phenotyping. Physiol Behav 73:691–694.
Tsilidis KK, Panagiotou OA, Sena ES, Aretouli E, Evangelou E, Howells DW, Al-Shahi Salman R, Macleod MR, Ioannidis JPA. 2013. Evaluation of excess significance bias in animal studies of neurological diseases. PLoS Biol 11:e1001609.
Vesterinen HM, Sena ES, Ffrench-Constant C, Williams A, Chandran S, Macleod MR. 2010. Improving the translational hit of experimental treatments in multiple sclerosis. Mult Scler 16:1044–1055.
Wahlsten DL. 2010. Mouse Behavioral Testing: How to Use Mice in Behavioral Neuroscience. 1st ed. Elsevier.
WMA. 2013. World Medical Association Declaration of Helsinki: Ethical principles for medical research involving human subjects. JAMA 310:2191–2194.
Wolfer DP, Litvin O, Morf S, Nitsch RM, Lipp H-P, Würbel H. 2004. Laboratory animal welfare: Cage enrichment and mouse behaviour. Nature 432:821–822.
Wolfinger RD. 2013. Reanalysis of Richter et al. (2010) on reproducibility. Nat Methods 10:373–374.
Van Assen MALM, van Aert RCM, Nuijten MB, Wicherts JM. 2014. Why publishing everything is more effective than selective publishing of statistically significant results. PLoS One 9:e84896.
Van der Worp HB, Howells DW, Sena ES, Porritt MJ, Rewell S, O'Collins V, Macleod MR. 2010. Can animal models of disease reliably inform human studies? PLoS Med 7:e1000245.
Von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP. 2007. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: Guidelines for reporting observational studies. Prev Med 45:247–251.
Würbel H. 2000. Behaviour and the standardization fallacy. Nat Genet 26:263.
Würbel H. 2001. Ideal homes? Housing effects on rodent brain and behaviour. Trends Neurosci 24:207–211.
Würbel H. 2002. Behavioral phenotyping enhanced--beyond (environmental) standardization. Genes Brain Behav 1:3–8.
Würbel H, Garner JP. 2007. Refinement of rodent research through environmental enrichment and systematic randomization. Available from: http://www.nc3rs.org.uk/downloaddoc.asp?id=506&page=395&skin=0
Würbel H, Richter SH, Garner JP. 2013. Reply to: Reanalysis of Richter et al. (2010) on reproducibility. Nat Methods 10:374.
... Several rhesus macaque subspecies have been shown to display a variety of physiological and behavioural differences, and significant differences between distinct geographical populations affect macaque populations worldwide, influencing their susceptibility to a variety of diseases and how they metabolise drugs, for instance. 59,60 Yet, instead of responding to this type of considerable evidence, and resolving that future biomedical research and testing must maintain a focus on human biology from start to finish, either the evidence is largely overlooked, or there is a belief that these issues can be overcome with better experimental technique and/or analysis [61][62][63][64] and/or genetic modification. 16,65 This is in stark contrast to other scientific and engineering disciplines that try to rigorously challenge their models, and alter their approaches, based on solid data. ...
Article
Full-text available
The Three Rs have become widely accepted and pursued, and are now the go-to framework that encourages the humane use of animals in science, where no other option is believed to exist. However, many people, including scientists, harbour varying degrees of concern about the value and impact of the Three Rs. This ranges from a continued adherence to the Three Rs principles in the belief that they have performed well, through a belief that there should be more emphasis (or indeed a sole focus) on replacement, to a view that the principles have hindered, rather than helped, a critical approach to animal research that should have resulted in replacement to a much greater extent. This critical review asks questions of the Three Rs and their implementation, and provides an overview of the current situation surrounding animal use in biomedical science (chiefly in research). It makes a case that it is time to move away from the Three Rs and that, while this happens, the principles need to be made more robust and enforced more efficiently. To expedite a shift from animal use in science, toward a much greater and quicker adoption of human-specific New Approach Methodologies (NAMs), some argue for a straightforward focus on the best available science.
... The third step is to verify the studies' external validity, i.e. the extent to which the findings of a study can be generalized and applied to other species, environmental conditions, or experimental settings (53,54). This is especially important in the case of parrots as the Psittaciformes order comprises a vast diversity of species. ...
Preprint
Full-text available
Parrots are popular companion animals but show prevalent and at times severe welfare issues. Nonetheless, there are no scientific tools available to assess parrot welfare. The aim of this systematic review was to identify valid and feasible outcome measures that could be used as welfare indicators for companion parrots. From 1848 peer-reviewed studies retrieved, 98 met our inclusion and exclusion criteria (e.g. experimental studies, captive parrots). For each outcome collected, validity was assessed based on the statistical significance reported by the authors, as other validity parameters were rarely available for evaluation. Feasibility was assigned by considering the need for specific instruments, veterinary-level expertise or handling the parrot. A total of 1512 outcomes were evaluated, of which 572 had a significant p-value and were considered feasible. These included changes in behaviour (e.g. activity level, social interactions, exploration), body measurements (e.g. body weight, plumage condition) and abnormal behaviours, amongst others. However, a high risk of bias undermined the internal validity of these outcomes. Moreover, a strong taxonomic bias, a predominance of studies on parrots in laboratories, and an underrepresentation of companion parrots jeopardized their external validity. These results provide a promising starting point for validating a set of welfare indicators in parrots.
... While rodents are the most prevalently used animal in preclinical animal research, high rates of translational failure concerning drug development have brought into sharp focus the need to study mammalian species that are physiologically more similar to humans, particularly in relation to the aspects of the diseases being modeled (Public Law 89-544, 1966;Public Law 91-579, 1970;Public Law 94-279, 1976;Ioannidis, 2005Ioannidis, , 2006Goodman and Greenland, 2007;Chavalarias and Ioannidis, 2010;Paul et al., 2010;Bailoo et al., 2014b;Gaire et al., 2021). Pigs are an important model in preclinical biomedical research, historically accounting for approximately 6% of all the United States Department of Agriculture (USDA) species protected under the Animal Welfare Act (Public Law 89-544, 1966;Public Law 91-579, 1970;Public Law 94-279, 1976). ...
Article
Full-text available
Pigs can be an important model for preclinical biological research, including neurological diseases such as Alcohol Use Disorder. Such research often involves longitudinal assessment of changes in motor coordination as the disease or disorder progresses. Current motor coordination tests in pigs are derived from behavioral assessments in rodents and lack critical aspects of face and construct validity. While such tests may permit for the comparison of experimental results to rodents, a lack of validation studies of such tests in the pig itself may preclude the drawing of meaningful conclusions. To address this knowledge gap, an apparatus modeled after a horizontally placed ladder and where the height of the rungs could be adjusted was developed. The protocol that was employed within the apparatus mimicked the walk and turn test of the human standardized field sobriety test. Here, five Sinclair miniature pigs were trained to cross the horizontally placed ladder, starting at a rung height of six inches and decreasing to three inches in one-inch increments. It was demonstrated that pigs can reliably learn to cross the ladder, with few errors, under baseline/unimpaired conditions. These animals were then involved in a voluntary consumption of ethanol study where animals were longitudinally evaluated for motor coordination changes at baseline, 2.5, 5, 7.5, and 10% ethanol concentrations subsequently to consuming ethanol. Consistent with our predictions, relative to baseline performance, motor incoordination increased as voluntary consumption of escalating concentrations of ethanol increased. Together these data highlight that the horizontal ladder test (HLT) test protocol is a novel, optimized and reliable test for evaluating motor coordination as well as changes in motor coordination in pigs.
... [10,11] Examples include reverse-Bayes methods [12] and Bayesian alternatives to p-values [13-15]. Whether a Bayesian approach is taken or not, transparent reporting of statistical analyses as well as the refinement of experimental design is a key requirement to optimize the reliability of preclinical research [16], and in this article, we focus on two approaches to filter effective drugs and treatments in exploratory preclinical research for transitioning to a confirmatory replication study. The performance of two preclinical research pipelines is assessed and compared, showing that a methodological shift to methods which incorporate the smallest effect size of interest can improve the reliability of positive preclinical research findings. ...
Article
The success of preclinical research hinges on exploratory and confirmatory animal studies. Traditional null hypothesis significance testing is a common approach to eliminate the chaff from a collection of drugs, so that only the most promising treatments are funneled through to clinical research phases. Balancing the number of false discoveries and false omissions is an important aspect to consider during this process. In this paper, we compare several preclinical research pipelines, based either on null hypothesis significance testing or on Bayesian statistical decision criteria. We build on a recently published large-scale meta-analysis of reported effect sizes in preclinical animal research and elicit a non-informative prior distribution under which both approaches are compared. After correcting for publication bias and shrinkage of effect sizes in replication studies, simulations show that (i) a shift towards statistical approaches which explicitly incorporate the minimum clinically important difference reduces the false discovery rate of frequentist approaches and (ii) a shift towards Bayesian statistical decision criteria can improve the reliability of preclinical animal research by reducing the number of false-positive findings. These benefits hold while keeping low the number of experimental units required for a confirmatory follow-up study. The results show that Bayesian statistical decision criteria can help improve the reliability of preclinical animal research and should be considered more frequently in practice.
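The trade-off described above can be illustrated with a toy simulation. The sketch below is not the authors' code; the drug pool, group sizes, effect sizes, and the smallest effect size of interest (SESOI) threshold are all assumed values chosen for illustration. It contrasts a filter based on p < 0.05 alone with one that additionally requires the estimated effect to exceed the SESOI, and reports the resulting false discovery proportion.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_drugs, n_per_group = 2000, 10       # exploratory stage with small groups (assumed)
sesoi = 0.5                           # smallest effect size of interest, in SD units (assumed)
true_effects = rng.choice([0.0, 0.8], size=n_drugs, p=[0.8, 0.2])  # 80% of drugs ineffective

def passes_filter(rule, delta):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(delta, 1.0, n_per_group)
    _, p = stats.ttest_ind(treated, control)
    estimated_effect = treated.mean() - control.mean()   # in SD units because sigma = 1
    if rule == "nhst":
        return p < 0.05
    return p < 0.05 and estimated_effect > sesoi          # NHST plus SESOI requirement

for rule in ("nhst", "nhst+sesoi"):
    selected = np.array([passes_filter(rule, d) for d in true_effects])
    fdp = np.mean(true_effects[selected] == 0.0) if selected.any() else float("nan")
    print(f"{rule}: {selected.sum()} drugs selected, false discovery proportion {fdp:.2f}")

Run on these assumed values, the SESOI-augmented filter selects fewer drugs but with a markedly lower false discovery proportion, which is the qualitative pattern the abstract describes.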
... At the same time as the 3Rs principle has been gaining influence via its integration into legislative frameworks across the world, its effectiveness is coming under growing scrutiny by various stakeholder groups (22). In particular, the discussion of why implementation of the principle has not progressed further is on the table, as is debate over why its implementation has not had a "stronger" impact on, for example, the number of animals still being used in research (23-27). Expectations concerning measurable effects of the 3Rs principle are high, both from political and from public perspectives. The fact that there has been no substantial and consistent decrease in the absolute number of animals being used in experiments [e.g., (28); for difficulties of comparing the numbers of animals used in the EU see (29)] is perceived as a "missing" 3Rs effect. ...
Article
The 3Rs principle of replacing, reducing and refining the use of animals in science has been gaining widespread support in the international research community and appears in transnational legislation such as the European Directive 2010/63/EU, in a number of national legislative frameworks such as those of Switzerland and the UK, and in other rules and guidance in place in countries around the world. At the same time, progress in technical and biomedical research, along with the changing status of animals in many societies, challenges the view of the 3Rs principle as a sufficient and effective approach to the moral challenges posed by animal use in research. Given this growing awareness of our moral responsibilities to animals, the aim of this paper is to address the question: Can the 3Rs, as a policy instrument for science and research, still guide the morally acceptable use of animals for scientific purposes, and if so, how? The fact that the increased availability of alternatives to animal models has not been accompanied by a corresponding decrease in the number of animals used in research has led to public and political calls for more radical action. However, a focus on the simple measure of total animal numbers distracts from the need for a more nuanced understanding of how the 3Rs principle can have a genuine influence as a guiding instrument in research and testing. Hence, we focus on three core dimensions of the 3Rs in contemporary research: (1) What scientific innovations are needed to advance the goals of the 3Rs? (2) What can be done to facilitate the implementation of existing and new 3R methods? (3) Do the 3Rs still offer an adequate ethical framework given the increasing social awareness of animal needs and human moral responsibilities? By answering these questions, we will identify core perspectives in the debate over the advancement of the 3Rs.
Article
In laboratory animals, there is a scarcity of digestibility data under non-experimental conditions. Such data are important as a basis for generating nutrient requirements, which contributes to the refinement of husbandry conditions. Digestibility trials can also help to identify patterns of absorption and potential factors that influence digestibility. Thus, a digestibility trial with a pelleted diet used as standard feed for laboratory mice was conducted. To identify potential differences between genetic lines, inbred C57BL/6J and outbred CD1 mice (n = 18 each, male, 8 weeks old, housed in groups of three) were used. For seven days, feed intake was recorded and the total faeces per cage collected. The energy, crude nutrient, and mineral content of diet and faecal samples were analyzed to calculate the apparent digestibility (aD). Apparent dry matter and energy digestibility did not differ between the two lines investigated. The C57BL/6J mice had a significantly higher aD of magnesium and potassium and a trend towards a lower aD of sodium compared with mice of the CD1 outbred stock. Lucas tests were performed to calculate the mean true digestibility of the nutrients and revealed uniform linear regressions across data from both common laboratory mouse lines. The mean true digestibility of crude nutrients was > 90%, except for fibre; that of the minerals ranged between 66% and 97%.
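As a concrete illustration of the quantities mentioned in this abstract, the short sketch below computes apparent digestibility from intake and faecal excretion and then applies a Lucas-test regression, in which the slope estimates true digestibility and the negative intercept the endogenous (metabolic) faecal loss. The numbers are hypothetical and are not taken from the study.

import numpy as np

# grams of a nutrient ingested and excreted in faeces per cage over 7 days (hypothetical values)
intake   = np.array([12.1, 14.3, 13.0, 15.2, 11.8, 14.9])
excreted = np.array([ 1.9,  2.1,  2.0,  2.3,  1.8,  2.2])

apparent_digestibility = (intake - excreted) / intake          # aD per cage
print("apparent digestibility (%):", np.round(100 * apparent_digestibility, 1))

# Lucas test: regress the apparently digested amount on intake; the slope estimates
# true digestibility and the negative intercept the endogenous (metabolic) faecal loss.
digested = intake - excreted
slope, intercept = np.polyfit(intake, digested, 1)
print(f"true digestibility ~ {100 * slope:.1f}%, endogenous faecal loss ~ {-intercept:.2f} g")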
Article
Is microbial pathogenesis a predictable scientific field? At a time when we are dealing with coronavirus disease 2019, there is intense interest in knowing about the epidemic potential of other microbial threats and new emerging infectious diseases. To know whether microbial pathogenesis will ever be a predictable scientific field requires knowing whether a host-microbe interaction follows deterministic, stochastic, or chaotic dynamics. If randomness and chaos are absent from virulence, there is hope for prediction in the future regarding the outcome of microbe-host interactions. Chaotic systems are inherently unpredictable, although it is possible to generate short-term probabilistic models, as is done in applications of stochastic processes and machine learning to weather forecasting. Information on the dynamics of a system is also essential for understanding the reproducibility of experiments, a topic of great concern in the biological sciences. Our study finds preliminary evidence for chaotic dynamics in infectious diseases.
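The abstract does not specify how chaotic dynamics were detected, so the sketch below is only a generic illustration of the underlying idea: a positive Lyapunov exponent, i.e. a positive average rate of divergence of nearby trajectories, is a hallmark of chaos. The logistic map is used here as a stand-in for an infectious-disease time series; all parameters are illustrative.

import numpy as np

def lyapunov_logistic(r, x0=0.4, n=10000, burn_in=1000):
    """Average log-derivative along the orbit of the logistic map x -> r*x*(1-x)."""
    x = x0
    for _ in range(burn_in):                      # discard the transient
        x = r * x * (1.0 - x)
    acc = 0.0
    for _ in range(n):
        acc += np.log(abs(r * (1.0 - 2.0 * x)))   # log |f'(x_t)|
        x = r * x * (1.0 - x)
    return acc / n

for r in (2.9, 3.5, 3.9):
    lam = lyapunov_logistic(r)
    label = "chaotic" if lam > 0 else "non-chaotic"
    print(f"r = {r}: Lyapunov exponent ~ {lam:+.3f} ({label})")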
Article
Introduction: Translation is about successfully bringing findings from preclinical contexts into the clinic. This transfer is challenging, as clinical trials frequently fail despite positive preclinical results. Limited robustness of preclinical research has been marked as one of the drivers of such failures. One suggested solution is to improve the external validity of in vitro and in vivo experiments via a suite of complementary strategies. Areas covered: In this review, the authors summarize the literature available on different strategies to improve external validity in in vivo, in vitro, or ex vivo experiments: systematic heterogenization, generalizability tests, and multi-batch and multicenter experiments. Articles that tested or discussed sources of variability in systematically heterogenized experiments were identified, and the most prevalent sources of variability are reviewed further. Special considerations in sample size planning, analysis options, and practical feasibility associated with each strategy are also reviewed. Expert opinion: The strategies reviewed differentially influence variation in experiments. Different research projects, with their unique goals, can leverage the strengths and limitations of each strategy. Applying a combination of these approaches in confirmatory stages of preclinical research putatively increases the chances of success in clinical studies.
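One of the strategies listed above, the multi-batch experiment, can be analysed by treating batch as a random effect so that the treatment effect is judged against between-batch as well as within-batch variation. The sketch below simulates such a design and fits a random-intercept model; the design, group sizes, and effect sizes are assumptions for illustration, not taken from the review.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_batches, n_per_arm = 4, 8                       # assumed multi-batch design
rows = []
for batch in range(n_batches):
    batch_shift = rng.normal(0.0, 0.6)            # batch-to-batch variation
    for treatment in (0, 1):
        scores = rng.normal(0.5 * treatment + batch_shift, 1.0, n_per_arm)
        rows += [{"batch": batch, "treatment": treatment, "y": s} for s in scores]
df = pd.DataFrame(rows)

# Random-intercept model: the treatment effect is estimated while allowing
# each batch its own baseline level.
model = smf.mixedlm("y ~ treatment", data=df, groups=df["batch"]).fit()
print(model.summary())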
Article
Much biomedical research is observational. The reporting of such research is often inadequate, which hampers the assessment of its strengths and weaknesses and of a study's generalisability. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) initiative developed recommendations on what should be included in an accurate and complete report of an observational study. We defined the scope of the recommendations to cover three main study designs: cohort, case-control, and cross-sectional studies. We convened a 2-day workshop in September, 2004, with methodologists, researchers, and journal editors to draft a checklist of items. This list was subsequently revised during several meetings of the coordinating group and in e-mail discussions with the larger group of STROBE contributors, taking into account empirical evidence and methodological considerations. The workshop and the subsequent iterative process of consultation and revision resulted in a checklist of 22 items (the STROBE statement) that relate to the title, abstract, introduction, methods, results, and discussion sections of articles. 18 items are common to all three study designs and four are specific for cohort, case-control, or cross-sectional studies. A detailed explanation and elaboration document is published separately and is freely available on the websites of PLoS Medicine, Annals of Internal Medicine, and Epidemiology. We hope that the STROBE statement will contribute to improving the quality of reporting of observational studies.
Article
Systematic reviews should build on a protocol that describes the rationale, hypothesis, and planned methods of the review; few reviews report whether a protocol exists. Detailed, well-described protocols can facilitate the understanding and appraisal of the review methods, as well as the detection of modifications to methods and selective reporting in completed reviews. We describe the development of a reporting guideline, the Preferred Reporting Items for Systematic reviews and Meta-Analyses for Protocols 2015 (PRISMA-P 2015). PRISMA-P consists of a 17-item checklist intended to facilitate the preparation and reporting of a robust protocol for the systematic review. Funders and those commissioning reviews might consider mandating the use of the checklist to facilitate the submission of relevant protocol information in funding applications. Similarly, peer reviewers and editors can use the guidance to gauge the completeness and transparency of a systematic review protocol submitted for publication in a journal or other medium.
Article
Published research in English-language journals is increasingly required to carry a statement that the study has been approved and monitored by an Institutional Review Board in conformance with 45 CFR 46 standards if the study was conducted in the United States. Alternative language attesting conformity with the Helsinki Declaration is often included when the research was conducted in Europe or elsewhere. The Helsinki Declaration was created by the World Medical Association in 1964 (ten years before the Belmont Report) and has been amended several times. The Helsinki Declaration differs from its American counterpart in several respects, the most significant of which is that it was developed by and for physicians. The term "patient" appears in many places where we would expect to see "subject." It is stated in several places that physicians must either conduct or have supervisory control of the research. The dual role of the physician-researcher is acknowledged, but it is made clear that the role of healer takes precedence over that of scientist. In the United States, the federal government developed and enforces regulations on researchers; in the rest of the world, the profession, or a significant part of it, took the initiative in defining and promoting good research practice, and governments in many countries have worked to harmonize their standards along these lines. The Helsinki Declaration is based less on key philosophical principles and more on prescriptive statements. Although there is significant overlap between the Belmont and Helsinki guidelines, the latter extends much further into research design and publication. Elements of a research protocol, the use of placebos, the obligation to enroll trials in public registries (to ensure that negative findings are not buried), and requirements to share findings with the research and professional communities are included in the Helsinki Declaration. As a practical matter, these are often part of the work of American IRBs, but not always as a formal requirement. Reflecting the socialist nature of many European countries, there is a requirement that provision be made for patients to be made whole regardless of the outcome of the trial or if they happened to have been randomized to a control group that did not enjoy the benefits of a successful experimental intervention.
Article
Mice housed in standard cages show impaired brain development, abnormal repetitive behaviours (stereotypies) and an anxious behavioural profile, all of which can be lessened by making the cage environment more stimulating. But concerns have been raised that enriched housing might disrupt standardization and so affect the precision and reproducibility of behavioural-test results (for example, see ref. 4). Here we show that environmental enrichment increases neither individual variability in behavioural tests nor the risk of obtaining conflicting data in replicate studies. Our findings indicate that the housing conditions of laboratory mice can be markedly improved without affecting the standardization of results.
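The claim that enrichment does not inflate individual variability can be probed with a simple test for equality of variances. The sketch below uses simulated scores (not the study's data) and Levene's test; group sizes, means, and spreads are illustrative assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
standard = rng.normal(50, 10, 40)   # hypothetical test scores, standard cages
enriched = rng.normal(55, 10, 40)   # hypothetical test scores, enriched cages (same spread)

w, p = stats.levene(standard, enriched)
verdict = "no evidence of increased variability" if p > 0.05 else "variability differs between groups"
print(f"Levene W = {w:.2f}, p = {p:.3f} -> {verdict}")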
Article
This summary corresponds to the Spanish translation of the Special Communication published in the Journal of the American Medical Association in August 1996, along with the editorial published in the same issue, "How to Report Randomized Controlled Trials: The CONSORT Statement." It describes the Consolidated Standards for the Reporting of Controlled Clinical Trials, prepared by a working group made up of members of the SORT Group and of the Asilomar Working Group, together with a journal editor and the author of a report on a clinical trial. The work was carried out by means of a Delphi process, and the result was a checklist and a flow diagram. The checklist is made up of 21 items that refer mainly to the methods, results, and discussion sections of the report of a controlled clinical trial, identifying the information necessary to evaluate the internal and external validity of the report; this improvement is judged to benefit patients as well as the editors and reviewers of journals.