
Refinement of Experimental Design and Conduct in Laboratory Animal Research

December 2014
ILAR journal / National Research Council, Institute of Laboratory Animal Resources 55(3):383-91


DOI:10.1093/ilar/ilu037

Source
PubMed

Authors:

Jeremy Davidson Bailoo

Texas Tech University Health Sciences Center

Thomas S Reichlin

Geistlich Pharma AG

Hanno Würbel

Universität Bern

The scientific literature of laboratory animal research is replete with papers reporting poor reproducibility of results as well as failure to translate results to clinical trials in humans. This may stem in part from poor experimental design and conduct of animal experiments. Despite widespread recognition of these problems and implementation of guidelines to attenuate them, a review of the literature suggests that experimental design and conduct of laboratory animal research are still in need of refinement. This paper will review and discuss possible sources of biases, highlight advantages and limitations of strategies proposed to alleviate them, and provide a conceptual framework for improving the reproducibility of laboratory animal research. © The Author 2014. Published by Oxford University Press on behalf of the Institute for Laboratory Animal Research. All rights reserved. For permissions, please email: journals.permissions@oup.com.



Key Words: 3R; refinement; ARRIVE; reproducibility; internal validity; external validity; standardization; preregistration

What Is the Problem?

In 2005, the biomedical research community was startled by a paper entitled “Why Most Published Research Findings Are False” (Ioannidis 2005). Based on systematic reviews and simulations, the author concluded that “for most study designs and settings, it is more likely for a research claim to be false than true.” Was this just an alarmist claim or is there indeed a problem with the validity of biomedical research? Despite some debate about the validity of Ioannidis’ original analysis (e.g., Goodman and Greenland 2007), evidence has accumulated over the past 10 years that tends to favor the latter view. This is further supported by a recent commentary in Nature (Macleod 2011) that underscores concerns that experimental design and conduct need to improve in laboratory animal research.

Poor Reproducibility and Translational Failure

The use of animals for research is a privilege granted to scientists with the explicit understanding that this use provides significant new knowledge without causing unnecessary harm. However, poor reproducibility of results from animal experiments across many research areas (cf. Richter et al. 2009) and widespread failure to translate preclinical animal research to clinical trials (i.e., translational failure; e.g., Kola and Landis 2004; Howells et al. 2014; van der Worp et al. 2010) suggest that these expectations are not met. For example, of more than 500 neuroprotective interventions that were effective in animal models of ischemic stroke, none was found to be effective in humans (O’Collins et al. 2006). A 10-year review (1991–2000) of drug development revealed that the main causes of such attrition at the clinical trials stage are lack of efficacy and safety, which together account for 60% of the overall attrition rate (Kola and Landis 2004). These authors therefore concluded that animal studies which better predict the efficacy and safety of drugs in clinical trials are needed to reduce translational failure.

The Study of the Scientific Validity of Laboratory Animal Research

The empirical study of the scientific validity of laboratory animal research is an emerging field (Macleod 2011), and several lines of evidence highlight both current and potential problems. For example, translational failure in drug development could indicate that the construct validity of animal models is poor (Box 1). Construct validity refers to the degree to which a test measures what it claims to be measuring (Cronbach and Meehl 1955), and there is increasing concern that the construct validity of many animal models for human diseases is indeed questionable (e.g., Editor 2011; Nestler and Hyman 2010). However, construct validity depends on the specific disease that is modeled, and there is no simple method for assessing construct validity. Furthermore, improvements in animal models usually go hand in hand with advances in research on the construct that is being modeled. Therefore, improving the construct validity of animal models depends on advances in research rather than adherence to methods or policies.

Jeremy D. Bailoo, PhD, and Thomas S. Reichlin, PhD, are postdoctoral fellows and Hanno Würbel, PhD, is professor and head of the Division of Animal Welfare at the Veterinary Public Health Institute of the University of Bern, Switzerland.

Address correspondence and reprint requests to Hanno Würbel, Division of Animal Welfare, VPH Institute, Vetsuisse Faculty, University of Bern, Länggassstrasse 120, 3012 Bern, Switzerland or email hanno.wuerbel@vetsuisse.unibe.ch.

ILAR Journal, Volume 55, Number 3, doi: 10.1093/ilar/ilu037

For permissions, please email: journals.permissions@oup.com

Downloaded from http://ilarjournal.oxfordjournals.org/ at World Trade Institute on January 14, 2015

Another aspect related to construct validity concerns the health and well-being of the animals used for research. Growing evidence indicates that current standard practices of housing and care in laboratory animals are associated with abnormal brain and behavioral development and other signs of poor welfare, which may also compromise the scientific validity of research findings (Garner 2005; Knight 2001; Martin et al. 2010; Würbel 2001). Whether animal welfare matters in terms of the scientific validity of a research finding, however, depends on the area of research and on the specific research question.

Although highly relevant, construct validity and animal welfare will therefore not be further discussed in this article. Instead, we will focus our discussion on two fundamental aspects of scientific validity, both of which are relevant across all fields of laboratory animal research and are determined by experimental design and conduct: internal and external validity.

Box 1. Glossary of Key Terms

Bias: Systematic deviation from the true value of the estimated treatment effect caused by failures in the design, conduct, or analysis of an experiment.

• Attrition bias: The unequal distribution of dropouts or nonresponders between treatment groups. This can lead to a systematic difference between treatment groups and may lead to an incorrect ascription of a causal relation between the treatment and the dependent variable.

• Detection bias: Systematic differences between treatment groups in how outcomes are assessed. This can be reduced or avoided by blinding or masking.

• Performance bias: Systematic differences in animal care and handling between treatment groups. This can be reduced or avoided by blinding or masking.

• Selection bias: The biased allocation of subjects to treatment groups. Biased allocation can lead to systematic differences in the baseline characteristics between groups. This can be avoided by randomized allocation and allocation concealment.

Blinding/masking: The maintenance of the persons’ (who perform the experiment, collect data, and assess outcome, etc.) unawareness of the treatment allocation.

Types of error

• False negative (β): The failure to reject the null hypothesis when it is false. This is often due to small sample sizes (underpowered study designs).

• False positive (α): The rejection of the null hypothesis when it is true. This is often due to some form of bias.

Randomization

• Simple: Randomized allocation of subjects to the different treatment groups based on a single sequence of random assignments. This may lead to imbalanced groups and group sizes when the number of subjects is small.

• Stratified: Allocation of subjects to blocks of subjects sharing similar baseline characteristics (e.g., sex, age, body size) followed by randomized allocation of the subjects of each block to the different treatment groups. This is intended to counterbalance potential covariates across treatment groups.

Reproducibility: The ability of a result to be replicated by an independent experiment in the same or a different laboratory.

Validity

• Construct validity: The degree to which inferences are warranted from the sampling properties of an experiment (e.g., units, settings, treatments and outcomes) to the entities these samples are intended to represent.

• External validity: The extent to which the results of an animal experiment provide a correct basis for generalizations to other populations of animals (including humans) and/or other environmental conditions.

• Internal validity: The extent to which the design, conduct, and analysis of the experiment eliminate the possibility of bias so that the inference of a causal relationship between an experimental treatment and variation in an outcome measure is warranted.

Definitions adapted from van der Worp et al. (2010) and from the Cochrane Collaboration.


Internal and External Validity of Laboratory Animal Research

Internal validity refers to the extent to which a causal relation between an experimental treatment and variation in an outcome measure is warranted (Box 1). It critically depends on the extent to which experimental design and conduct minimize systematic error (also called bias). Already some 15 to 20 years ago, reports were published indicating that fundamental aspects of proper scientific conduct were often ignored, thereby compromising the internal validity of research findings (Festing and Altman 2002; McCance 1995). Several recent studies suggest that not much has changed to date. For example, a systematic review of animal experiments conducted in publicly funded research establishments in the United Kingdom and United States revealed that only a few authors reported using randomization (13%) or blinding (14%) to avoid bias in animal selection and outcome assessment (Kilkenny et al. 2009). Others found that only 3% of all studies reported an a priori sample size calculation (Sena et al. 2007) and in even fewer cases was a primary outcome variable defined (Macleod 2011). Similar results were obtained from various reviews of preclinical neurological research (Frantzias et al. 2011; van der Worp et al. 2010; Vesterinen et al. 2010), indicating that systematic bias may be widespread in laboratory animal research.

In clinical research, similar problems became apparent several years earlier, resulting in the CONSORT statement intended to improve the reporting of randomized clinical trials (Begg 1996; Moher et al. 2001; Schulz et al. 2010). Based on the CONSORT statement and with the aim to improve the reporting of animal studies, Kilkenny et al. (2010) recently developed the Animals in Research: Reporting In Vivo Experiments (ARRIVE) guidelines, a 20-item checklist of information to be reported in publications of animal research. To date, these guidelines have been endorsed by over 430 journals, funders, universities, and learned societies (www.NC3Rs.org.uk) in the hope that such guidelines will not only improve the quality of scientific reporting but also the internal validity of the research.

In contrast to internal validity, external validity extends beyond the specific experimental setting and refers to the generalizability of research findings, i.e., how applicable they are to other environmental conditions, experimenters, study populations, and even to other strains or species of animals (including humans; Lehner 1996; Box 1). Poor external validity may thus contribute to both poor reproducibility of a research finding (e.g., when the same study replicated in a different laboratory by a different experimenter produces different results) and translational failure (e.g., when a treatment shown to be efficacious in an animal model is not efficacious in a clinical trial in humans).

Importantly, some of the strategies employed to increase internal validity may at the same time decrease external validity. For example, common strategies of standardizing experiments by using homogenous study populations to maximize test sensitivity inevitably compromise the external validity of the research findings, resulting in poor reproducibility (Richter et al. 2009, 2010, 2011; van der Worp et al. 2010; Würbel 2000, 2002; Würbel and Garner 2007).

Scope for Refinement of Laboratory Animal Research

Taken together, there seems to be considerable scope for refinement of experimental design and conduct to improve both the internal and external validity of laboratory animal research. In the following sections, we will explore this in more detail and propose potential ways of refinement as well as promising areas of future research.

Internal Validity – Refinement of Experimental Conduct to Avoid Systematic Biases

Although 235 different types of bias in biomedical research have been characterized (Chavalarias and Ioannidis 2010), van der Worp et al. (2010) consider four types of bias to be particularly relevant with respect to the internal validity of laboratory animal research: selection bias, attrition bias, performance bias, and detection bias.

Selection bias refers to the biased allocation of animals to treatment groups and can be avoided by randomization (Box 1). Because selection bias may occur either consciously or subconsciously, methods based on active decisions by the experimenter (e.g., picking animals “at random” from their cages) are not considered true randomization. Tossing coins or throwing dice provide simple ways of randomization but for some purposes random number generators (e.g., www.random.org) may be preferable. Even the use of allegedly “homogeneous” study populations (such as same-sex, same-age inbred mice raised under identical housing conditions) does not preclude the need for randomization, because individual differences still prevail. This is best illustrated by studies with inbred mice showing that variation within strains is often significantly greater than between strains (Wahlsten 2010).
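As an illustration of the simple randomization described above, the following is a minimal sketch in Python using the standard library's pseudorandom number generator. The function name, the animal identifiers, and the seed are hypothetical and chosen only for this example; recording the seed is one possible way to make the allocation auditable, not a requirement stated in the text.

```python
import random


def simple_randomization(animal_ids, treatments, seed=None):
    """Assign each animal independently to one randomly chosen treatment.

    No decision by the experimenter enters the allocation; with small
    numbers of animals, the resulting group sizes may be unequal.
    """
    rng = random.Random(seed)  # a recorded seed makes the list reproducible
    return {animal: rng.choice(treatments) for animal in animal_ids}


# Hypothetical example: 12 mice, two treatment groups.
animals = [f"mouse_{i:02d}" for i in range(1, 13)]
groups = ["control", "treated"]
allocation = simple_randomization(animals, groups, seed=2014)
group_sizes = {g: sum(1 for t in allocation.values() if t == g) for g in groups}
print(group_sizes)
```

Because each animal is assigned independently, the two group sizes printed here need not be equal, which is the imbalance that Box 1 notes for simple randomization with small numbers of subjects.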

In many cases, it is possible to use stratified randomization instead of simple randomization. In stratified randomization, the study population is divided into discrete subpopulations based on systematic differences in factors that are likely to affect the outcome measures, such as sex, age, littermates, disease severity, treatment dose, etc. The animals of each subpopulation are then separately allocated at random to the different treatment groups. Through this, the factor levels defining the different subpopulations are counterbalanced among all treatment groups. The use of statistical methods designed to analyze such factorial designs results in the removal of the variation between the strata from the error term, thereby increasing the precision and statistical power of the experiment (Altman and Bland 1999).
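The stratified procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed implementation: the function name, the use of sex as the only stratification factor, and the animal identifiers are hypothetical. Within each stratum the animals are shuffled and then dealt out to the treatments in turn, so that group sizes within a stratum differ by at most one.

```python
import random
from collections import defaultdict


def stratified_randomization(stratum_by_animal, treatments, seed=None):
    """Randomize animals to treatments separately within each stratum."""
    rng = random.Random(seed)

    # Group the animals by their stratum (e.g., sex, litter, age class).
    strata = defaultdict(list)
    for animal, stratum in stratum_by_animal.items():
        strata[stratum].append(animal)

    allocation = {}
    for stratum in sorted(strata):
        members = strata[stratum]
        rng.shuffle(members)      # random order of animals within the stratum
        order = list(treatments)
        rng.shuffle(order)        # random starting treatment for odd-sized strata
        for position, animal in enumerate(members):
            allocation[animal] = order[position % len(order)]
    return allocation


# Hypothetical example: 8 female and 8 male mice, two treatment groups.
sex_by_animal = {
    f"mouse_{i:02d}": ("female" if i % 2 else "male") for i in range(1, 17)
}
allocation = stratified_randomization(sex_by_animal, ["control", "treated"], seed=2014)
```

In this example each sex contributes four animals to each treatment group, so sex is counterbalanced across treatments and can be included as a factor in the analysis.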


Selection bias may also occur when the criteria for inclusion or exclusion of animals are poorly defined. Complications that require exclusion of animals are an inherent risk in animal studies, especially with animal models involving invasive surgical procedures (e.g., Jüni et al. 2001) and in models of stroke (Crossley et al. 2008). For ethical reasons, humane endpoints need to be defined a priori, and animals that reach humane endpoints may be lost from the subsequent analysis. However, it may also be justifiable to exclude animals for scientific reasons if complications occur that are unrelated to the experimental treatment and render the outcome measures meaningless. To avoid bias, however, all criteria for inclusion and exclusion of animals need to be predefined, and the person deciding on inclusion or exclusion needs to be unaware of the treatment allocation (van der Worp et al. 2010). If these criteria are not well specified, one risks the induction of attrition bias, the unequal distribution of dropouts among treatment groups.

Performance bias may occur whenever there is a systematic difference in the interaction with the animals (e.g., animal care, experimental procedures) between the treatment groups, apart from the treatment under investigation (Jüni et al. 2001; Box 1). For example, differences in the quality of experimenter handling exhibited to stressed vs. nonstressed mice may occur due to higher fearfulness and stress reactivity in the stressed mice (Hurst and West 2010). In contrast, detection bias occurs when the outcome is measured differently in animals of different treatment groups. Again, both performance bias and detection bias may occur either consciously or subconsciously, and the best way to avoid these biases is blinding (also known as masking).

Blinding is considered complete when the investigator and everyone else involved in the experiment (animal care personnel, laboratory technicians, outcome assessors, etc.) are unaware of the animals’ allocation to treatments. In contrast to randomization, blinding is not always possible, for example, when scoring behavior among treatment groups that differ visibly (e.g., strains of mice differing in coat color). Thus, it is important that authors explicitly report the blinding status of all people whose involvement may affect the outcome of the study (Kilkenny et al. 2010a; Moher et al. 2010).
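One practical way to support blinding of outcome assessors is to label samples or recordings with neutral codes and to keep the key with a person who is not involved in the assessment. The sketch below illustrates this idea only; the function name, the code format, and the example allocation are hypothetical and are not taken from the guidelines cited above.

```python
import random


def make_blinding_key(allocation, seed=None):
    """Create neutral sample codes that hide the treatment allocation.

    Returns two dictionaries:
      code_by_animal  - used by a third party to label samples or videos
      unblinding_key  - maps each code to its treatment; kept sealed until
                        outcome assessment and analysis are complete
    """
    rng = random.Random(seed)
    codes = [f"S{n:03d}" for n in range(1, len(allocation) + 1)]
    rng.shuffle(codes)  # code order carries no information about treatment
    code_by_animal = dict(zip(allocation, codes))
    unblinding_key = {code_by_animal[a]: t for a, t in allocation.items()}
    return code_by_animal, unblinding_key


# Hypothetical allocation produced by an earlier randomization step.
allocation = {
    "mouse_01": "control", "mouse_02": "treated",
    "mouse_03": "treated", "mouse_04": "control",
}
code_by_animal, unblinding_key = make_blinding_key(allocation, seed=7)
blinded_sheet = sorted(unblinding_key)  # what the outcome assessor sees
```

The outcome assessor works only from `blinded_sheet`; the treatment of each coded sample is recovered from `unblinding_key` after scoring is finished.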

Other relevant sources of bias include sample sizes that are either too small or too large, a poor definition of the primary (and secondary) outcome variable(s), and the use of inappropriate statistical analyses, all of which may result in poor statistical conclusion validity (Cozby and Bates 2011). Whenever possible, a formal sample size calculation (and power analysis) should be performed that specifies the minimal effect size considered to be relevant (e.g., Cohen’s d or f), the desired statistical power (1 − β), and the level of statistical significance (α). Some have argued that such calculations are only applicable to “confirmatory research” but not to “exploratory research” since “effect sizes may be unknown” and “research in the exploratory mode will often test many different strategies in parallel, and this is only feasible if small sample sizes are used” (Kimmelman et al. 2014). However, neither unknown effect sizes nor the exploratory nature of research should be taken as excuses for violating fundamental principles of good scientific practice. Tools such as NCSS PASS, G*Power, and the resource equation method (Mead 1990) (to name just a few) facilitate sample size calculations. This is even possible when knowledge about the sample distribution is incomplete because usually a minimally relevant effect size can be specified a priori. Furthermore, testing many different hypotheses in parallel using small sample sizes will inevitably produce spurious results that undermine the reliability of the research (Button et al. 2013).
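To make the three inputs of a sample size calculation concrete, the sketch below computes the number of animals per group for a two-sided comparison of two group means from Cohen's d, α, and the desired power. It uses the common normal approximation, n = 2((z₁₋α/₂ + z₁₋β)/d)², which is an assumption of this example and not a method taken from the tools named above; exact calculations based on the t distribution give slightly larger numbers.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate animals per group for a two-sided two-group comparison.

    d     - minimal relevant effect size (Cohen's d)
    alpha - level of statistical significance
    power - desired statistical power (1 - beta)
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_beta = standard_normal.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)


print(n_per_group(0.5))  # medium effect: 63 per group under this approximation
print(n_per_group(1.0))  # large effect: 16 per group under this approximation
```

Halving the minimal relevant effect size roughly quadruples the required sample size, which is why the effect size needs to be specified before the experiment rather than after.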

Both overpowered and underpowered studies are unethical, albeit for different reasons. Overpowered studies use more animals than needed to detect a significant effect of a given size. This is relatively rare, however, because it violates one of the 3R principles (reduction), and ethics committees are trained to spot reduction potential. From a scientific perspective, large sample sizes are not a problem as such, as long as a minimal effect size is defined. However, any two treatments will be significantly different if the measurement precision and sample size are large enough, and so overpowered designs may lead to bias when biologically irrelevant effect sizes are considered relevant because of their statistical significance. Underpowered studies are much more prevalent, even though they are much more problematic from both ethical and scientific points of view (cf. Button et al. 2013). Underpowered studies are unable to detect biologically relevant effect sizes, and as a result, the animals are essentially wasted for inconclusive research. On the other hand, there are obvious economic incentives to keep sample sizes small. In addition, it appears that the well-intended yet one-sided focus of ethics committees on reduction may further promote underpowered study designs (Demétrio et al. 2013). In the human clinical trial literature, the ethical and scientific costs of underpowered study designs have long been recognized (Halpern 2002); it is crucial that formal power calculations become standard practice in animal research so that scientific gain is maximized while animal use is minimized (Button et al. 2013; Kilkenny et al. 2010b; Macleod 2011).
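The contrast between underpowered and overpowered designs can be illustrated numerically. The sketch below approximates the power of a two-sided two-group comparison with the normal distribution and ignores the negligible contribution of the opposite tail; the function name and the example numbers are assumptions made for illustration only.

```python
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power to detect effect size d with n animals per group."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    return standard_normal.cdf(d * (n_per_group / 2) ** 0.5 - z_alpha)


# Underpowered: a medium effect (d = 0.5) with 10 animals per group.
print(round(approx_power(0.5, 10), 2))

# Adequately powered: the same effect with 63 animals per group.
print(round(approx_power(0.5, 63), 2))

# Overpowered: a very small effect (d = 0.05) with 10,000 units per group.
print(round(approx_power(0.05, 10_000), 2))
```

In the first case a true medium-sized effect would be missed in most experiments, whereas in the last case an effect that may be biologically irrelevant would be declared significant in most experiments.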

Recent evidence from preclinical neurological research indicates that there are also too many statistically significant (i.e., “positive”) results in the literature (Ioannidis and Trikalinos 2007; Tsilidis et al. 2013). These authors concluded that selective analysis and selective outcome reporting are the most likely causes. Selective analysis occurs when several statistical analyses are performed but only the one with the “best” (i.e., most significant) result is presented (Ioannidis 2008; Tsilidis et al. 2013). Similarly, selective outcome reporting occurs when many outcome variables are analyzed but only the variables that are significantly affected by the treatment are reported (Tsilidis et al. 2013). While the possible merits of selective reporting are still debated (e.g., de Winter and Happee 2013; van Assen et al. 2014), we maintain that to avoid these potential biases, the primary (and secondary) outcome variable(s) as well as the statistical approach(es) to testing for treatment effects need to be specified before the onset of the study. Ultimately, the best way to achieve this would be the prospective registration of all animal studies (see below).
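A short calculation shows why selective outcome reporting inflates the number of positive results. Assuming, for illustration only, that the outcome variables are independent and that the treatment has no true effect on any of them, the probability that at least one of k outcomes is significant at level α is 1 − (1 − α)^k. The function name below is hypothetical.

```python
def prob_any_false_positive(n_outcomes, alpha=0.05):
    """Chance of at least one significant result when no true effect exists.

    Assumes independent outcome variables, each tested at level alpha.
    """
    return 1 - (1 - alpha) ** n_outcomes


for k in (1, 5, 10, 20):
    print(k, round(prob_any_false_positive(k), 3))
```

With 10 outcome variables the chance of at least one spurious “positive” finding is already about 40%, which is why the primary outcome needs to be named before the data are collected.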

Finally, as the use of the scientific method requires reproducibility and falsifiability, the sharing of collected data (i.e., public data archiving) and validation of published analytical methods should become more common (Molloy 2011). Although this topic is not without issue or debate (e.g., Alsheikh-Ali et al. 2011; Editor 2014; Nelson 2009; Roche et al. 2014), the transparency of collected data can only improve the quality of published scientific results.

Do Reporting Guidelines Help?

The common approach to reducing poor experimental conduct has been the implementation of reporting guidelines. This started with the CONSORT statement to improve the reporting of human clinical trials about 20 years ago (Begg 1996; Moher et al. 2001; Schulz et al. 2010) and was recently extended to animal research by the ARRIVE guidelines (Kilkenny et al. 2010b). Similar reporting guidelines are available for other areas of research, such as STROBE for epidemiology (von Elm et al. 2007), PRISMA for systematic reviews and meta-analyses (Moher 2009), and several others listed by the EQUATOR Network (www.equator-network.org). More recently, it has been proposed that animal experiments should be preregistered (Chambers 2013), similarly to clinical trials which according to the Declaration of Helsinki (WMA 2013) must be registered in a publicly accessible database (e.g., www.ClinicalTrials.gov) before recruitment of the first subject. Preregistration should help to avoid “inappropriate research practices, including inadequate statistical power, selective reporting of results, undisclosed analytic flexibility, and publication bias” (Chambers 2013). All of these initiatives reflect the pervasive nature of bias in biomedical research.

So, do reporting guidelines improve experimental conduct? Although there is only indirect evidence, there is good reason to believe that they do indeed. For example, systematic reviews and meta-analyses in preclinical research on stroke, multiple sclerosis, and Parkinson’s disease indicate that poor reporting of study quality attributes (e.g., randomization, blinding, sample size calculation, etc.) correlates with overstated treatment effects (Rooke et al. 2011; Sena et al. 2007; Vesterinen et al. 2010). It is therefore plausible that better reporting correlates with better quality of study conduct. Although, theoretically, the reporting of accurate study quality may be faked, such outright fraud is hopefully uncommon. It is more likely that the advocacy of reporting guidelines will raise awareness of the importance of rigorous experimental conduct (Landis et al. 2012). Nevertheless, a recent analysis of papers published in the PLoS and Nature journals after the endorsement of the ARRIVE guidelines found as yet very little improvement in reporting standards, indicating that authors are still ignoring, and referees and editors are not enforcing, these guidelines (Baker et al. 2014).

External Validity – Refinement of Experimental Design to Avoid Spurious Results

Reproducibility is a cornerstone of the scientific method, and poor reproducibility threatens the credibility of the entire field of animal research (Johnson 2013; Richter et al. 2009). Although better internal validity will also improve the reproducibility of results, reproducibility of a result is primarily a function of external validity (Richter et al. 2009; Würbel 2000).

By definition, external validity refers to the applicability of results to other environmental conditions, experimenters, study populations, and even to other strains or species of animals (including humans; Lehner 1996; Box 1). External validity therefore defines how generalizable results are. This also includes reproducibility, which is defined as the ability of a result to be replicated by an independent experiment either in the same or in a different laboratory (Box 1). However, the relationship between external validity and reproducibility is not so straightforward. External validity (i.e., the range of conditions to which a result can be generalized) is an inherent feature of a result; some results are more externally valid than others. For example, pre-pulse inhibition (PPI) of the startle reflex to acoustic stimuli is highly conserved across many species, including mice and humans, and is fairly robust against variation in environmental conditions (Geyer et al. 2002). Thus, PPI has very high external validity. Because of this, PPI is also highly reproducible across different laboratories despite considerable variation in conditions among laboratories. In contrast, the locomotor activity of mice on an elevated zero-maze or plus-maze has very little external validity, as it is highly sensitive to test conditions (e.g., handling; Hurst and West 2010), and differences between strains of mice are highly inconsistent despite considerable efforts to equate conditions across laboratories (e.g., Crabbe et al. 1999; Richter et al. 2011). Therefore, experiments should be designed in ways that permit estimation of the external validity of the results. This can only be achieved if relevant features of the study design, such as animal characteristics and environmental conditions, are varied systematically (Würbel 2000, 2002).

Interestingly, this is contrary to conventional wisdom in laboratory animal science. The gold standard of experimental design adopted from the pure sciences (mathematics, physics, chemistry) is to hold constant all factors except for the independent variable(s) under investigation. This has become a central dogma in laboratory animal science that is referred to as standardization. Thus, laboratory animal science textbooks advise researchers to standardize their experiments by using genetically uniform animals, selecting these for maximal phenotypic uniformity (e.g., same age, same weight, etc.), and keeping all environmental and procedural factors constant (Beynen, Festing, et al. 2001; Beynen, Gärtner, et al. 2001). Such homogenization of study populations may compromise both the external validity and reproducibility of the results, an effect that has been referred to as the standardization fallacy (Würbel 2000, 2002). The same fallacy was highlighted 80 years ago by the eminent Ronald A. Fisher (1935, p. 102): “The exact standardisation of experimental conditions, which is often thoughtlessly advocated as a panacea, always carries with it the real disadvantage that a highly standardised experiment supplies direct information only with respect to the narrow range of conditions achieved by standardisation. Standardisation, therefore, weakens rather than strengthens our ground for inferring a like result, when, as is invariably the case in practice, these conditions are somewhat varied.”

Indeed, despite rigorous standardization of the experimental conditions across laboratories, several multi-laboratory studies revealed large proportions of results that were idiosyncratic to one laboratory (Crabbe et al. 1999; Richter et al. 2011; Wolfer et al. 2004). The reason for this may be that many environmental factors (e.g., staff, noise, etc.) cannot be equalized between laboratories, so that different laboratories inevitably standardize to different local environments (Richter et al. 2009; Würbel and Garner 2007). Therefore, standardization may actually be a cause of, rather than a cure for, poor reproducibility (Richter et al. 2009). Thus, not surprisingly, van der Worp et al. (2010) listed homogenous study populations as a main source of poor external validity in preclinical animal research, which to some extent may also contribute to translational failure.

Some scientists have argued that we simply need to report more parameters that may potentially affect outcome measures (e.g., Arndt and Surjo 2001; Philip et al. 2010; Surjo and Arndt 2001). In this case, however, reporting guidelines will not help, and the attempt to promote extensive lists of methodological detail to facilitate interpretation of conflicting findings has been referred to as the listing fallacy (Würbel 2002). If anything, such lists may induce interpretation bias by attracting attention to differences in the listed parameters, although there may be many more parameters that were not considered, were considered to be irrelevant or too difficult to assess, or simply could not be listed. As long as a particular parameter has not been varied systematically within a given experiment, it is no more likely to explain conflicting findings than any other parameter, listed or unlisted, that differed between the respective experiments (Würbel 2002).

Statistical and Experimental Solutions

Among studies investigating behavioral differences between different inbred and mutant strains of mice (behavioral phenotyping), current estimates of the proportion of irreproducible results (false discovery proportion, FDP) from multi-laboratory studies range between 30 and 60% (Benjamini et al. 2014; Kafkafi et al. 2005, 2014). It is likely that similar FDPs apply to other areas of research.

Various solutions have been proposed to reduce the risk of obtaining such spurious results. For example, Johnson (2013) suggested lowering the critical P value of statistical significance from 0.05 to 0.005 or even 0.001 to match conventional evidence thresholds used in Bayesian testing. Assuming that approximately one-half of the hypotheses tested by scientists are true, Johnson (2013) estimated that between 17% and 25% of marginally significant scientific findings are false positives. However, to lower the proportion of false positives without increasing the proportion of false negatives, sample sizes would have to be increased by about 50% to 100% to achieve similar statistical power (Johnson 2013). Moreover, a general decrease of critical P values does not take into account that both external validity and reproducibility depend on the nature of the measured effect.
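
The trade-off between a stricter significance threshold and sample size can be illustrated with a short calculation. The following Python sketch is not Johnson's (2013) Bayes-factor analysis; it assumes a two-sided test, a normal approximation, and a fixed effect size and variance, and the function names are hypothetical. Under these assumptions, moving from 0.05 to 0.005 at 80% power requires roughly 1.7 times as many animals, which falls within the range quoted above.

```python
from statistics import NormalDist


def relative_sample_size(alpha_old, alpha_new, power=0.8):
    """Factor by which the sample size must grow to keep `power` when the
    two-sided threshold moves from alpha_old to alpha_new.

    Assumption: normal approximation, so n is proportional to
    (z_{1 - alpha/2} + z_{power})^2 for a fixed effect size and variance.
    """
    z = NormalDist().inv_cdf
    z_power = z(power)
    n_old = (z(1 - alpha_old / 2) + z_power) ** 2
    n_new = (z(1 - alpha_new / 2) + z_power) ** 2
    return n_new / n_old


def false_discovery_proportion(alpha, power, prior_true=0.5):
    """Expected share of significant results that are false positives,
    given the share of tested hypotheses that are actually true."""
    false_positives = alpha * (1 - prior_true)
    true_positives = power * prior_true
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    print(relative_sample_size(0.05, 0.005))      # about 1.7
    print(false_discovery_proportion(0.05, 0.8))
    print(false_discovery_proportion(0.05, 0.2))  # low power inflates the FDP
```

The second function shows why power matters as much as the threshold: with the same critical P value, the expected proportion of false discoveries rises sharply as power falls.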

Kafkafi and colleagues (2005, 2014) have therefore proposed raising the benchmark for significant results in a more specific way. According to their random laboratory model, laboratories should be considered as a sample, representing the population of all potential laboratories, and the interaction noise (the treatment × laboratory variance) should be added as a random factor to the individual animal noise (the within-laboratory variance). Similarly to the suggestion of lowering P values (Johnson 2013), this inflation of within-laboratory variance would generate a larger yardstick for the significance of treatment effects (Benjamini et al. 2014; Kafkafi et al. 2005, 2014), albeit in a more specific way. Using data from several multi-laboratory studies, the authors showed that this method may reduce the FDP considerably without losing too much statistical power. The difficulty with this approach is that such specificity will be achieved only if the treatments and measures are first tested across several laboratories to obtain accurate estimates of between-laboratory variance. This approach may thus not be applicable to animal experiments in general but may be useful for standard preclinical tests of efficacy and toxicity in drug development, as well as for specific large-scale projects, such as the International Mouse Phenotyping Consortium (Brown and Moore 2012a, 2012b; Mallon et al. 2012), which aims to determine the phenotypes of thousands of mutant lines with a battery of standard tests (Benjamini et al. 2014; Kafkafi et al. 2005, 2014).
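
The effect of such a larger yardstick can be sketched numerically. The Python example below is a simplified illustration and not the authors' implementation: it assumes a comparison of two group means with equal group sizes, a previously estimated treatment × laboratory interaction variance, and a normal approximation in place of a t distribution with adjusted degrees of freedom. The function names and the example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def treatment_z(diff, within_var, n_per_group, interaction_var=0.0):
    """Standardized difference between two group means.

    With interaction_var = 0 this is the usual single-laboratory yardstick.
    A positive interaction_var (assumed to come from earlier multi-laboratory
    data) is added to the per-group variance of the mean, which enlarges the
    standard error of the difference.
    """
    se = sqrt(2 * (within_var / n_per_group + interaction_var))
    return diff / se


def two_sided_p(z):
    """Two-sided P value under a normal approximation."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Invented numbers for illustration only.
    z_fixed = treatment_z(diff=1.0, within_var=1.0, n_per_group=10)
    z_random = treatment_z(diff=1.0, within_var=1.0, n_per_group=10,
                           interaction_var=0.1)
    print(two_sided_p(z_fixed))   # significant at 0.05 with the usual yardstick
    print(two_sided_p(z_random))  # no longer significant with the larger one
```

In this toy case the same observed difference passes the conventional threshold within one laboratory but not once the interaction variance is taken into account, which is the intended behavior of the larger yardstick.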

Besides these statistical approaches, others have proposed mimicking between-laboratory variability experimentally. These proposals range from conducting an independent replicate study to conducting real multi-laboratory studies. For example, the Reproducibility Initiative has established a service to facilitate independent replicate studies (http://validation.scienceexchange.com/), while the Multi-PART consortium aims to develop a platform for international multicenter preclinical stroke trials based on randomized clinical trial design (www.dcn.ed.ac.uk/multipart/).

In addition to such true replications, there are several other ways in which studies may be designed to provide an estimate of the external validity and reproducibility of results. For example, Richter and colleagues (2010, 2011) proposed the heterogenization of study populations (rather than homogenization through standardization) by systematically varying a few selected factors. In principle, any aspect of the animals (e.g., genotype, sex, age, body condition, etc.) and their environment (e.g., housing conditions, experimental protocol) may be used for such heterogenization. By varying two environmental factors using a 2 × 2 factorial design, Richter and colleagues (2010) successfully mimicked variation between independent replicates conducted within their own laboratory (see also Jonker et al. 2013; Wolfinger 2013; Würbel et al. 2013). However, a similar simple form of heterogenization did not account for between-laboratory variation in a true multi-laboratory study (Richter et al. 2011). Further research is therefore needed to develop heterogenization protocols that mimic between-laboratory variability more effectively.
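
A 2 × 2 heterogenized design of this kind can be laid out as a balanced, randomized allocation. The Python sketch below is one possible way to do this, not the protocol used in the cited studies: the function name, the factor levels, and the animal identifiers are hypothetical, and the only assumption is that every combination of the two environmental factors should contain the same number of animals from each treatment group.

```python
import itertools
import random
from collections import Counter


def heterogenized_allocation(animal_ids, treatments, factor_a, factor_b, seed=0):
    """Randomly assign animals to treatment x factor_a x factor_b cells so
    that every cell holds the same number of animals."""
    cells = list(itertools.product(factor_a, factor_b))
    block_size = len(cells) * len(treatments)
    ids = list(animal_ids)
    if len(ids) % block_size != 0:
        raise ValueError("number of animals must be a multiple of "
                         f"{block_size} for a balanced design")
    rng = random.Random(seed)  # fixed seed keeps the allocation reproducible
    rng.shuffle(ids)
    slots = [(t, a, b) for a, b in cells for t in treatments]
    slots = slots * (len(ids) // block_size)
    return dict(zip(ids, slots))


if __name__ == "__main__":
    # Hypothetical factor levels, chosen only for illustration.
    allocation = heterogenized_allocation(
        animal_ids=range(1, 17),
        treatments=["control", "treated"],
        factor_a=["standard cage", "enriched cage"],
        factor_b=["morning test", "afternoon test"],
    )
    for animal, cell in sorted(allocation.items()):
        print(animal, cell)
    print(Counter(allocation.values()))  # two animals per cell
```

Because each treatment group is equally represented in every environmental cell, the environmental factors can later be included in the analysis as blocking factors without being confounded with treatment.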

In the meantime, simple precautions may be taken as proposed by Paylor (2009), for example, by splitting experiments into small batches of animals that are tested some time apart instead of testing them in one large batch, by using multiple experimenters for testing and data collection instead of using only one, or by spreading test sessions across time of day instead of testing all animals at the same time of day. Assessing the effects of batch, experimenter, or time of day, respectively, will reveal whether such minor variations of conditions affect results and will therefore indicate whether reproducibility across larger variations of conditions (such as between laboratories) may be at stake.
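
A first look at whether such minor variations matter can be obtained by estimating the treatment effect separately within each batch (or experimenter, or time of day). The Python sketch below is a descriptive illustration under simple assumptions: the record layout, the group labels, and the data are invented, and a formal analysis would instead include batch as a factor in the statistical model.

```python
from collections import defaultdict
from statistics import mean


def effect_by_batch(records):
    """Return the treated-minus-control difference in means for each batch.

    `records` is an iterable of (batch, group, value) tuples, where group is
    either "control" or "treated" (hypothetical labels).
    """
    grouped = defaultdict(lambda: {"control": [], "treated": []})
    for batch, group, value in records:
        grouped[batch][group].append(value)
    return {batch: mean(groups["treated"]) - mean(groups["control"])
            for batch, groups in grouped.items()}


if __name__ == "__main__":
    # Invented example data: the effect is larger in batch 1 than in batch 2.
    data = [
        ("batch 1", "control", 10.0), ("batch 1", "control", 11.0),
        ("batch 1", "treated", 12.0), ("batch 1", "treated", 13.0),
        ("batch 2", "control", 10.0), ("batch 2", "control", 11.0),
        ("batch 2", "treated", 10.5), ("batch 2", "treated", 11.5),
    ]
    effects = effect_by_batch(data)
    print(effects)
    print(max(effects.values()) - min(effects.values()))  # spread across batches
```

A large spread of effects across batches relative to the average effect would suggest that the result is sensitive to minor variations in conditions.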

Conclusions

Reproducibility and falsifiability are cornerstones of the scientific method, and it is because of these principles that science is often viewed as self-correcting, at least in the long term. The common consensus is that failures in reproducibility of animal research are not a consequence of scientific misconduct (e.g., Collins and Tabak 2014). However, negligence in experimental design, conduct, and publication (whether conscious or not) continues to plague animal research and persists despite numerous initiatives to curb it (Baker et al. 2014).

Facing these problems and their underlying causes, as discussed here, is hopefully a step towards effective refinement of experimental design and conduct. The ARRIVE guidelines provide a useful tool for improving the internal validity of animal research, and several strategies have been put forward for improving external validity as well. Nevertheless, it seems that greater pressure must be placed on researchers, reviewers, and journal editors not only to endorse such methods of refinement but also to enforce them rigorously. Otherwise, the credibility and ethical justification of animal research may be permanently undermined.

Acknowledgments

The authors of this paper were funded by the ERC Advanced Grant “REFINE” (H.W. and J.D.B.), the FP7 Coordination and Support Action “Multi-PART” (H.W. and J.D.B.), and a research grant by the Swiss Federal Food Safety and Veterinary Office (H.W. and T.S.R.).

References

Alsheikh-Ali AA, Qureshi W, Al-Mallah MH, Ioannidis JPA. 2011. Public

availability of published research data in high-impact journals. PLoS

One 6:e24357.

Altman DG, Bland JM. 1999. How to randomise. BMJ 319:703–704.

Arndt SS, Surjo D. 2001. Methods for the behavioural phenotyping of mouse

mutants. How to keep the overview. Behav Brain Res 125:39–42.

Baker D, Lidster K, Sottomayor A, Amor S. 2014. Two years later: Journals

are not yet enforcing the ARRIVE guidelines on reporting standards for

pre-clinical animal studies. PLoS Biol 12:e1001756.

Begg C. 1996. Improving the quality of reporting of randomized controlled

trials. JAMA 276:637–639.

Benjamini Y, Lahav T, Kafkafi N. 2014. Estimating replicability of behavioral

phenotyping results in a single laboratory. In: Measuring Behavior 2014:

The replicability of measuring behavior.

Beynen AC, Festing MFW, van Montfort MAJ. 2001. Design of animal ex-

periments. In: Van Zutphen LFM, Baumans V, Beynen AC, eds.

Principles of Laboratory Animal Science. Revised. Amsterdam: Elsevier.

p 219–249.

Beynen AC, Gärtner K, van Zutphen LFM. 2001. Standardization of Animal

Experimentation. In: Van Zutphen LFM, Baumans V, Beynen AC, eds.

Principles of Laboratory Animal Science. Revised. Amsterdam: Elsevier.

p 103–110.

Brown SDM, Moore MW. 2012a. The International Mouse Phenotyping

Consortium: Past and future perspectives on mouse phenotyping.

Mamm Genome 23:632–640.

Brown SDM, Moore MW. 2012b. Towards an encyclopaedia of mammalian

gene function: the International Mouse Phenotyping Consortium. Dis

Model Mech 5:289–292.

Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ,

Munafò MR. 2013. Power failure: Why small sample size undermines

the reliability of neuroscience. Nat Rev Neurosci 14:365–376.

Chambers CD. 2013. Registered reports: A new publishing initiative at Cor-

tex. Cortex 49:609–610.

Chavalarias D, Ioannidis JPA. 2010. Science mapping analysis characterizes

235 biases in biomedical research. J Clin Epidemiol 63:1205–1215.

Collins FS, Tabak LA. 2014. Policy: NIH plans to enhance reproducibility.

Nature 505:612–613.

Cozby P, Bates S. 2011. Methods in Behavioral Research. 11th ed. McGraw-

Hill Education.

Crabbe JC, Wahlsten DL, Dudek BC. 1999. Genetics of Mouse behavior: in-

teractions with laboratory environment. Science 284:1670–1672.

Cronbach LJ, Meehl PE. 1955. Construct validity in psychological tests.

Psychol Bull 52:281–302.

Crossley NA, Sena E, Goehler J, Horn J, van der Worp B, Bath PMW,

Macleod M, Dirnagl U. 2008. Empirical evidence of bias in the design

of experimental stroke studies: A metaepidemiologic approach. Stroke

39:929–934.

Demétrio CGB, Menten JFM, Leandro RA, Brien C. 2013. Experimental

power considerations-justifying replication for animal care and use com-

mittees. Poult Sci 92:2490–2497.

De Winter J, Happee R. 2013. Why selective publication of statistically sig-

niﬁcant results can be effective. PLoS One 8:e66463.

Editor. 2011. Building a better mouse test. Nat Methods 8:697.

Editor. 2014. Share alike. Nature 507:140.

Festing MFW, Altman DG. 2002. Guidelines for the design and statistical

analysis of experiments using laboratory animals. ILAR J 43:244–258.

Fisher RA. 1935. The design of experiments. Oliver & Boyd.

Frantzias J, Sena ES, Macleod MR, Al-Shahi Salman R. 2011. Treatment of

intracerebral hemorrhage in animal models: Meta-analysis. Ann Neurol

69:389–399.

Garner JP. 2005. Stereotypies and Other abnormal repetitive behaviors:

Potential impact on validity, reliability, and replicability of scientiﬁc

outcomes. ILAR J 46:106–117.

Geyer MA, McIlwain KL, Paylor R. 2002. Mouse genetic models for pre-

pulse inhibition: An early review. Mol Psychiatry 7:1039–1053.


Goodman S, Greenland S. 2007. Why most published research ﬁndings are

false: Problems in the analysis. PLoS Med 4:e168.

Halpern SD. 2002. The continuing unethical conduct of underpowered

clinical trials. JAMA 288:358–362.

Howells DW, Sena ES, Macleod MR. 2014. Bringing rigour to translational

medicine. Nat Rev Neurol 10:37–43.

Hurst JL, West RS. 2010. Taming anxiety in laboratory mice. Nat Methods

7:825–826.

Ioannidis JPA. 2005. Why most published research ﬁndings are false. PLoS

Med 2:e124.

Ioannidis JPA. 2008. Why most discovered true associations are inﬂated.

Epidemiology 19:640–648.

Ioannidis JPA, Trikalinos TA. 2007. The appropriateness of asymmetry

tests for publication bias in meta-analyses: A large survey. CMAJ

176:1091–1096.

Johnson VE. 2013. Revised standards for statistical evidence. Proc Natl Acad

Sci U S A 110:19313–19317.

Jonker RM, Guenther A, Engqvist L, Schmoll T. 2013. Does systematic

variation improve the reproducibility of animal experiments? Nat

Methods 10:373.

Jüni P, Altman DG, Egger M. 2001. Systematic reviews in health care:

Assessing the quality of controlled clinical trials. BMJ 323:42–46.

Kafkafi N, Benjamini Y, Sakov A, Elmer GI, Golani I. 2005.

Genotype-environment interactions in mouse behavior: A way out of

the problem. Proc Natl Acad Sci U S A 102:4619–4624.

Kafkafi N, Lahav T, Benjamini Y. 2014. What’s always wrong with my mouse?

In: Measuring Behavior 2014: The replicability of Measuring Behavior.

Kilkenny C, Browne WJ, Cuthill IC, Emerson M, Altman DG. 2010a.

Improving bioscience research reporting: the ARRIVE guidelines for

reporting animal research. PLoS Biol 8:e1000412.

Kilkenny C, Browne WJ, Cuthill IC, Emerson M, Altman DG. 2010b.

Animal research: reporting in vivo experiments: The ARRIVE guide-

lines. J Gene Med 12:561–563.

Kilkenny C, Parsons N, Kadyszewski E, Festing MFW, Cuthill IC, Fry D,

Hutton J, Altman DG. 2009. Survey of the quality of experimental

design, statistical analysis and reporting of research using animals.

PLoS One 4:e7824.

Kimmelman J, Mogil JS, Dirnagl U. 2014. Distinguishing between explor-

atory and conﬁrmatory preclinical research will improve translation.

Jones DR, editor. PLoS Biol 12:e1001863.

Knight J. 2001. Animal data jeopardized by life behind bars. Nature 412:669.

Kola I, Landis J. 2004. Can the pharmaceutical industry reduce attrition rates?

Nat Rev Drug Discov 3:711–715.

Landis SC, Amara SG, Asadullah K, Austin CP, Blumenstein R, Bradley EW,

Crystal RG, Darnell RB, Ferrante RJ, Fillit H, Finkelstein R, Fisher M,

Gendelman HE, Golub RM, Goudreau JL, Gross RA, Gubitz AK,

Hesterlee SE, Howells DW, Huguenard J, Kelner K, Koroshetz W,

Krainc D, Lazic SE, Levine MS, Macleod MR, McCall JM,

Moxley RT 3rd, Narasimhan K, Noble LJ, Perrin S, Porter JD,

Steward O, Unger E, Utz U, Silberberg SD. 2012. A call for

transparent reporting to optimize the predictive value of preclinical

research. Nature 490:187–191.

Lehner PN. 1996. Handbook of Ethological Methods. Cambridge University

Press.

Macleod M. 2011. Why animal research needs to improve. Nature 477:511.

Mallon A-M, Iyer V, Melvin D, Morgan H, Parkinson H, Brown SDM,

Flicek P, Skarnes WC. 2012. Accessing data from the International

Mouse Phenotyping Consortium: state of the art and future plans.

Mamm Genome 23:641–652.

Martin B, Ji S, Maudsley S, Mattson MP. 2010. “Control” laboratory rodents

are metabolically morbid: why it matters. Proc Natl Acad Sci U S A

107:6127–6133.

McCance I. 1995. Assessment of statistical procedures used in papers in the

Australian Veterinary Journal. Aust Vet J 72:322–329.

Mead R. 1990. The design of experiments: Statistical principles for practical

applications. Cambridge University Press.

Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ,

Elbourne D, Egger M, Altman DG. 2010. CONSORT 2010 explanation

and elaboration: Updated guidelines for reporting parallel group rando-

mised trials. BMJ 340:c869.

Moher D, Schulz KF, Altman DG. 2001. The CONSORT statement: Revised

recommendations for improving the quality of reports of parallel-group

randomised trials. Lancet 357:1191–1194.

Moher D. 2009. Preferred Reporting Items for Systematic Reviews and Meta-

Analyses: The PRISMA Statement. Ann Intern Med 151:264.

Molloy JC. 2011. The Open Knowledge Foundation: open data means better

science. PLoS Biol 9:e1001195.

Nelson B. 2009. Data sharing: Empty archives. Nature 461:160–163.

Nestler EJ, Hyman SE. 2010. Animal models of neuropsychiatric disorders.

Nat Neurosci 13:1161–1169.

O’Collins VE, Macleod MR, Donnan GA, Horky LL, van der Worp BH,

Howells DW. 2006. 1,026 experimental treatments in acute stroke. Ann

Neurol 59:467–477.

Paylor R. 2009. Questioning standardization in science. Nat Methods

6:253–254.

Philip VM, Duvvuru S, Gomero B, Ansah TA, Blaha CD, Cook MN,

Hamre KM, Lariviere WR, Matthews DB, Mittleman G, Goldowitz D,

Chesler EJ. 2010. High-throughput behavioral phenotyping in the ex-

panded panel of BXD recombinant inbred strains. Genes Brain Behav

9:129–159.

Richter SH, Garner JP, Auer C, Kunert J, Würbel H. 2010. Systematic vari-

ation improves reproducibility of animal experiments. Nat Methods

7:167–168.

Richter SH, Garner JP, Würbel H. 2009. Environmental standardization: Cure

or cause of poor reproducibility in animal experiments? Nat Methods

6:257–261.

Richter SH, Garner JP, Zipser B, Lewejohann L, Sachser N, Touma C,

Schindler B, Chourbaji S, Brandwein C, Gass P, van Stipdonk N, van

der Harst J, Spruijt B, Võikar V, Wolfer DP, Würbel H. 2011. Effect of

population heterogenization on the reproducibility of mouse behavior:

A multi-laboratory study. PLoS One 6:e16461.

Roche DG, Lanfear R, Binning SA, Haff TM, Schwanz LE, Cain KE,

Kokko H, Jennions MD, Kruuk LEB. 2014. Troubleshooting public

data archiving: Suggestions to increase participation. PLoS Biol 12:

e1001779.

Rooke EDM, Vesterinen HM, Sena ES, Egan KJ, Macleod MR. 2011. Dop-

amine agonists in animal models of Parkinson’s disease: A systematic re-

view and meta-analysis. Parkinsonism Relat Disord 17:313–320.

Schulz KF, Altman DG, Moher D. 2010. CONSORT 2010 Statement: Updat-

ed guidelines for reporting parallel group randomised trials. BMC

Med 8:18.

Sena E, van der Worp HB, Howells D, Macleod M. 2007. How can we im-

prove the pre-clinical development of drugs for stroke? Trends Neurosci

30:433–439.

Surjo D, Arndt SS. 2001. The Mutant Mouse Behaviour network, a medium

to present and discuss methods for the behavioural phenotyping. Physiol

Behav 73:691–694.

Tsilidis KK, Panagiotou OA, Sena ES, Aretouli E, Evangelou E,

Howells DW, Al-Shahi Salman R, Macleod MR, Ioannidis JPA. 2013.

Evaluation of excess signiﬁcance bias in animal studies of neurological

diseases. PLoS Biol 11:e1001609.

Vesterinen HM, Sena ES, Ffrench-Constant C, Williams A, Chandran S,

Macleod MR. 2010. Improving the translational hit of experimental treat-

ments in multiple sclerosis. Mult Scler 16:1044–1055.

Wahlsten DL. 2010. Mouse Behavioral Testing: How to use Mice in Behav-

ioral Neuroscience. First. Elsevier.

WMA. 2013. World Medical Association Declaration of Helsinki: Ethical

principles for medical research involving human subjects. JAMA

310:2191–2194.

Wolfer DP, Litvin O, Morf S, Nitsch RM, Lipp H-P, Würbel H. 2004. Labo-

ratory animal welfare: Cage enrichment and mouse behaviour. Nature

432:821–822.

Wolfinger RD. 2013. Reanalysis of Richter et al. (2010) on reproducibility.

Nat Methods 10:373–374.


Van Assen MALM, van Aert RCM, Nuijten MB, Wicherts JM. 2014. Why

publishing everything is more effective than selective publishing of stat-

istically signiﬁcant results. PLoS One 9:e84896.

Van der Worp HB, Howells DW, Sena ES, Porritt MJ, Rewell S, O’Collins V,

Macleod MR. 2010. Can animal models of disease reliably inform human

studies? PLoS Med 7:e1000245.

Von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC,

Vandenbroucke JP. 2007. The Strengthening the Reporting of

Observational Studies in Epidemiology (STROBE) statement:

Guidelines for reporting observational studies. Prev Med (Baltim)

45:247–251.

Würbel H. 2000. Behaviour and the standardization fallacy. Nat Genet

26:263.

Würbel H. 2001. Ideal homes? Housing effects on rodent brain and behav-

iour. TRENDS Neurosci 24:207–211.

Würbel H. 2002. Behavioral phenotyping enhanced--Beyond (environmental)

standardization. Genes Brain Behav 1:3–8.

Würbel H, Garner JP. 2007. Refinement of rodent research through environmental

enrichment and systematic randomization. Available from:

http://www.nc3rs.org.uk/downloaddoc.asp?id=506&page=395&skin=0

Würbel H, Richter SH, Garner JP. 2013. Reply to: “Reanalysis of Richter

et al. (2010) on reproducibility”. Nat Methods 10:374.

Volume 55, Number 3, doi: 10.1093/ilar/ilu037 2014 391

at World Trade Institute on January 14, 2015http://ilarjournal.oxfordjournals.org/Downloaded from

It’s Time to Review the Three Rs, to Make them More Fit for Purpose in the 21st Century

Article

Full-text available

Apr 2024
ATLA-ALTERN LAB ANIM

Jarrod Bailey

The Three Rs have become widely accepted and pursued, and are now the go-to framework that encourages the humane use of animals in science, where no other option is believed to exist. However, many people, including scientists, harbour varying degrees of concern about the value and impact of the Three Rs. This ranges from a continued adherence to the Three Rs principles in the belief that they have performed well, through a belief that there should be more emphasis (or indeed a sole focus) on replacement, to a view that the principles have hindered, rather than helped, a critical approach to animal research that should have resulted in replacement to a much greater extent. This critical review asks questions of the Three Rs and their implementation, and provides an overview of the current situation surrounding animal use in biomedical science (chiefly in research). It makes a case that it is time to move away from the Three Rs and that, while this happens, the principles need to be made more robust and enforced more efficiently. To expedite a shift from animal use in science, toward a much greater and quicker adoption of human-specific New Approach Methodologies (NAMs), some argue for a straightforward focus on the best available science.

What We (Don't) Know about Parrot Welfare: A Systematic Literature Review

Preprint

Full-text available

Mar 2024

Parrots are popular companion animals but show prevalent and at times severe welfare issues. Nonetheless, there are no scientific tools available to assess parrot welfare. The aim of this systematic review was to identify valid and feasible outcome measures that could be used as welfare indicators for companion parrots. From 1848 peer-reviewed studies retrieved, 98 met our inclusion and exclusion criteria (e.g. experimental studies, captive parrots). For each outcome collected, validity was assessed based on the statistical significance reported by the authors, as other validity parameters were rarely available for evaluation. Feasibility was assigned by considering the need for specific instruments, veterinary-level expertise or handling the parrot. A total of 1512 outcomes were evaluated, of which 572 had a significant p-value and were considered feasible. These included changes in behaviour (e.g. activity level, social interactions, exploration), body measurements (e.g. body weight, plumage condition) and abnormal behaviours, amongst others. However, a high risk of bias undermined the internal validity of these outcomes. Moreover, a strong taxonomic bias, a predominance of studies on parrots in laboratories, and an underrepresentation of companion parrots jeopardized their external validity. These results provide a promising starting point for validating a set of welfare indicators in parrots.

The horizontal ladder test (HLT) protocol: a novel, optimized, and reliable means of assessing motor coordination in Sus scrofa domesticus

Article

Full-text available

Mar 2024

Pigs can be an important model for preclinical biological research, including neurological diseases such as Alcohol Use Disorder. Such research often involves longitudinal assessment of changes in motor coordination as the disease or disorder progresses. Current motor coordination tests in pigs are derived from behavioral assessments in rodents and lack critical aspects of face and construct validity. While such tests may permit for the comparison of experimental results to rodents, a lack of validation studies of such tests in the pig itself may preclude the drawing of meaningful conclusions. To address this knowledge gap, an apparatus modeled after a horizontally placed ladder and where the height of the rungs could be adjusted was developed. The protocol that was employed within the apparatus mimicked the walk and turn test of the human standardized field sobriety test. Here, five Sinclair miniature pigs were trained to cross the horizontally placed ladder, starting at a rung height of six inches and decreasing to three inches in one-inch increments. It was demonstrated that pigs can reliably learn to cross the ladder, with few errors, under baseline/unimpaired conditions. These animals were then involved in a voluntary consumption of ethanol study where animals were longitudinally evaluated for motor coordination changes at baseline, 2.5, 5, 7.5, and 10% ethanol concentrations subsequently to consuming ethanol. Consistent with our predictions, relative to baseline performance, motor incoordination increased as voluntary consumption of escalating concentrations of ethanol increased. Together these data highlight that the horizontal ladder test (HLT) test protocol is a novel, optimized and reliable test for evaluating motor coordination as well as changes in motor coordination in pigs.

Reducing the false discovery rate of preclinical animal research with Bayesian statistical decision criteria

Article

Full-text available

Jul 2023
STAT METHODS MED RES

Riko Kelter

The success of preclinical research hinges on exploratory and confirmatory animal studies. Traditional null hypothesis significance testing is a common approach to eliminate the chaff from a collection of drugs, so that only the most promising treatments are funneled through to clinical research phases. Balancing the number of false discoveries and false omissions is an important aspect to consider during this process. In this paper, we compare several preclinical research pipelines, either based on null hypothesis significance testing or based on Bayesian statistical decision criteria. We build on a recently published large-scale meta-analysis of reported effect sizes in preclinical animal research and elicit a non-informative prior distribution under which both approaches are compared. After correcting for publication bias and shrinkage of effect sizes in replication studies, simulations show that (i) a shift towards statistical approaches which explicitly incorporate the minimum clinically important difference reduces the false discovery rate of frequentist approaches and (ii) a shift towards Bayesian statistical decision criteria can improve the reliability of preclinical animal research by reducing the number of false-positive findings. It is shown that these benefits hold while keeping the number of experimental units low which are required for a confirmatory follow-up study. Results show that Bayesian statistical decision criteria can help in improving the reliability of preclinical animal research and should be considered more frequently in practice.

Advancing the 3Rs: innovation, implementation, ethics and society

Article

Full-text available

Jun 2023

The 3Rs principle of replacing, reducing and refining the use of animals in science has been gaining widespread support in the international research community and appears in transnational legislation such as the European Directive 2010/63/EU, a number of national legislative frameworks like in Switzerland and the UK, and other rules and guidance in place in countries around the world. At the same time, progress in technical and biomedical research, along with the changing status of animals in many societies, challenges the view of the 3Rs principle as a sufficient and effective approach to the moral challenges set by animal use in research. Given this growing awareness of our moral responsibilities to animals, the aim of this paper is to address the question: Can the 3Rs, as a policy instrument for science and research, still guide the morally acceptable use of animals for scientific purposes, and if so, how? The fact that the increased availability of alternatives to animal models has not correlated inversely with a decrease in the number of animals used in research has led to public and political calls for more radical action. However, a focus on the simple measure of total animal numbers distracts from the need for a more nuanced understanding of how the 3Rs principle can have a genuine influence as a guiding instrument in research and testing. Hence, we focus on three core dimensions of the 3Rs in contemporary research: (1) What scientific innovations are needed to advance the goals of the 3Rs? (2) What can be done to facilitate the implementation of existing and new 3R methods? (3) Do the 3Rs still offer an adequate ethical framework given the increasing social awareness of animal needs and human moral responsibilities? By answering these questions, we will identify core perspectives in the debate over the advancement of the 3Rs.

Environmental enrichment: animal welfare and scientific validity

Chapter

Mar 2024

Digestibility of crude nutrients and minerals in C57Bl/6J and CD1 mice fed a pelleted lab rodent diet

Article

Full-text available

Jan 2024

In laboratory animals, there is a scarcity of digestibility data under non-experimental conditions. Such data is important as basis to generate nutrient requirements, which contributes to the refinement of husbandry conditions. Digestibility trials can also help to identify patterns of absorption and potential factors that influence the digestibility. Thus, a digestibility trial with a pelleted diet used as standard feed in laboratory mice was conducted. To identify potential differences between genetic lines, inbred C57Bl/6 J and outbred CD1 mice (n = 18 each, male, 8 weeks-old, housed in groups of three) were used. For seven days, the feed intake was recorded and the total faeces per cage collected. Energy, crude nutrient and mineral content of diet and faecal samples were analyzed to calculate the apparent digestibility (aD). Apparent dry matter and energy digestibility did not differ between both lines investigated. The C57Bl/6 J mice had significantly higher aD of magnesium and potassium and a trend towards a lower aD of sodium than the mice of the CD1 outbred stock. Lucas-tests were performed to calculate the mean true digestibility of the nutrients and revealed a uniformity of the linear regression over data from both common laboratory mouse lines. The mean true digestibility of crude nutrients was > 90%, except for fibre, that of the minerals ranged between 66 and 97%.

Preliminary evidence for chaotic signatures in host-microbe interactions

Article

Full-text available

Jan 2024

Is microbial pathogenesis a predictable scientific field? At a time when we are dealing with coronavirus disease 2019, there is intense interest in knowing about the epidemic potential of other microbial threats and new emerging infectious diseases. To know whether microbial pathogenesis will ever be a predictable scientific field requires knowing whether a host-microbe interaction follows deterministic, stochastic, or chaotic dynamics. If randomness and chaos are absent from virulence, there is hope for prediction in the future regarding the outcome of microbe-host interactions. Chaotic systems are inherently unpredictable, although it is possible to generate short-term probabilistic models, as is done in applications of stochastic processes and machine learning to weather forecasting. Information on the dynamics of a system is also essential for understanding the reproducibility of experiments, a topic of great concern in the biological sciences. Our study finds preliminary evidence for chaotic dynamics in infectious diseases.

Systematic heterogenization revisited: Increasing variation in animal experiments to improve reproducibility?

Article

Oct 2023
J NEUROSCI METH

Mapping strategies towards improved external validity in preclinical translational research

Article

Sep 2023
Expet Opin Drug Discov

Introduction: Translation is about successfully bringing findings from preclinical contexts into the clinic. This transfer is challenging as clinical trials frequently fail despite positive preclinical results. Limited robustness of preclinical research has been marked as one of the drivers of such failures. One suggested solution is to improve the external validity of in vitro and in vivo experiments via a suite of complementary strategies. Areas covered: In this review, the authors summarize the literature available on different strategies to improve external validity in in vivo, in vitro, or ex vivo experiments; systematic heterogenization; generalizability tests; and multi-batch and multicenter experiments. Articles that tested or discussed sources of variability in systematically heterogenized experiments were identified, and the most prevalent sources of variability are reviewed further. Special considerations in sample size planning, analysis options, and practical feasibility associated with each strategy are also reviewed. Expert opinion: The strategies reviewed differentially influence variation in experiments. Different research projects, with their unique goals, can leverage the strengths and limitations of each strategy. Applying a combination of these approaches in confirmatory stages of preclinical research putatively increases the chances of success in clinical studies.

The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) Statement: Guidelines for Reporting Observational Studies

Article

Jun 2008
REV ESP SALUD PUBLIC

Much biomedical research is observational. The reporting of such research is often inadequate, which hampers the assessment of its strengths and weaknesses and of a study's generalisability. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) initiative developed recommendations on what should be included in an accurate and complete report of an observational study. We defined the scope of the recommendations to cover three main study designs: cohort, case-control, and cross-sectional studies. We convened a 2-day workshop in September 2004, with methodologists, researchers, and journal editors to draft a checklist of items. This list was subsequently revised during several meetings of the coordinating group and in e-mail discussions with the larger group of STROBE contributors, taking into account empirical evidence and methodological considerations. The workshop and the subsequent iterative process of consultation and revision resulted in a checklist of 22 items (the STROBE statement) that relate to the title, abstract, introduction, methods, results, and discussion sections of articles. Eighteen items are common to all three study designs and four are specific for cohort, case-control, or cross-sectional studies. A detailed explanation and elaboration document is published separately and is freely available on the websites of PLoS Medicine, Annals of Internal Medicine, and Epidemiology. We hope that the STROBE statement will contribute to improving the quality of reporting of observational studies.

Improving Bioscience Research Reporting: The ARRIVE Guidelines for Reporting Animal Research

Article

Jun 2010

Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement

Article

Jan 2014

Systematic reviews should build on a protocol that describes the rationale, hypothesis, and planned methods of the review; few reviews report whether a protocol exists. Detailed, well-described protocols can facilitate the understanding and appraisal of the review methods, as well as the detection of modifications to methods and selective reporting in completed reviews. We describe the development of a reporting guideline, the Preferred Reporting Items for Systematic reviews and Meta-Analyses for Protocols 2015 (PRISMA-P 2015). PRISMA-P consists of a 17-item checklist intended to facilitate the preparation and reporting of a robust protocol for the systematic review. Funders and those commissioning reviews might consider mandating the use of the checklist to facilitate the submission of relevant protocol information in funding applications. Similarly, peer reviewers and editors can use the guidance to gauge the completeness and transparency of a systematic review protocol submitted for publication in a journal or other medium.

Why most published research findings are false

Article

Jan 2005
Chance

J. P. A. Ioannidis

CONSORT 2010 Explanation and Elaboration: Updated guidelines for reporting parallel group randomised trials

Article

Jan 2010
PLOS MED

Declaration of Helsinki. Ethical Principles for Medical Research Involving Human Subjects

Article

Jan 2009

World Medical Association (WMA)

Published research in English-language journals is increasingly required to carry a statement that the study has been approved and monitored by an Institutional Review Board in conformance with 45 CFR 46 standards if the study was conducted in the United States. Alternative language attesting conformity with the Helsinki Declaration is often included when the research was conducted in Europe or elsewhere. The Helsinki Declaration was created by the World Medical Association in 1964 (ten years before the Belmont Report) and has been amended several times. The Helsinki Declaration differs from its American counterpart in several respects, the most significant of which is that it was developed by and for physicians. The term "patient" appears in many places where we would expect to see "subject." It is stated in several places that physicians must either conduct or have supervisory control of the research. The dual role of the physician-researcher is acknowledged, but it is made clear that the role of healer takes precedence over that of scientist. In the United States, the federal government developed and enforces regulations on research; in the rest of the world, the profession, or a significant part of it, took the initiative in defining and promoting good research practice, and governments in many countries have worked to harmonize their standards along these lines. The Helsinki Declaration is based less on key philosophical principles and more on prescriptive statements. Although there is significant overlap between the Belmont and the Helsinki guidelines, the latter extends much further into research design and publication. Elements in a research protocol, use of placebos, an obligation to enroll trials in public registries (to ensure that negative findings are not buried), and requirements to share findings with the research and professional communities are included in the Helsinki Declaration.
As a practical matter, these are often part of the work of American IRBs, but not always as a formal requirement. Reflecting the socialist nature of many European countries, there is a requirement that provision be made for patients to be made whole regardless of the outcomes of the trial or if they happened to have been randomized to a control group that did not enjoy the benefits of a successful experimental intervention.

Ideal homes? Housing effects on rodent brain and behaviour

Article

TRENDS NEUROSCI

Hanno Würbel

Laboratory animal welfare: cage enrichment and mouse behaviour

Article

Jan 2004

Mice housed in standard cages show impaired brain development, abnormal repetitive behaviours (stereotypies) and an anxious behavioural profile, all of which can be lessened by making the cage environment more stimulating. But concerns have been raised that enriched housing might disrupt standardization and so affect the precision and reproducibility of behavioural-test results (for example, see ref. 4). Here we show that environmental enrichment increases neither individual variability in behavioural tests nor the risk of obtaining conflicting data in replicate studies. Our findings indicate that the housing conditions of laboratory mice can be markedly improved without affecting the standardization of results.

Ethical Principles for Medical Research Involving Human Subjects: World Medical Association Declaration of Helsinki. JAMA 2013;310(20):2191-2194. PMID: 24141714

Article

Nov 2013

World Medical Association

Improving the Quality of Reporting of Randomized Controlled Trials: The CONSORT Statement

Article

Aug 1996

Colin Begg

This summary corresponds to the Spanish translation of the Special Communication published in the Journal of the American Medical Association in August 1996, along with the editorial published in the same issue, "How to Report Randomized Controlled Trials: The CONSORT Statement". It describes the Consolidated Standards of Reporting Trials (CONSORT), prepared by a work group made up of members of the SORT Group and of the Asilomar Work Group, along with a journal editor and the author of a clinical trial report. The work was carried out by means of a Delphi process, and the result was a checklist and a flow diagram. The checklist is made up of 21 items that mainly refer to the methods, results and discussion of a controlled clinical trial report, identifying the information needed to evaluate the internal and external validity of the report. The improvement is judged to be positive for patients, editors and journal reviewers.
