Article · Literature Review

A Dirty Dozen: Twelve P-Value Misconceptions

Authors:
Steven N. Goodman

Abstract

The P value is a measure of statistical evidence that appears in virtually all medical research papers. Its interpretation is made extraordinarily difficult because it is not part of any formal system of statistical inference. As a result, the P value's inferential meaning is widely and often wildly misconstrued, a fact that has been pointed out in innumerable papers and books appearing since at least the 1940s. This commentary reviews a dozen of these common misinterpretations and explains why each is wrong. It also reviews the possible consequences of these improper understandings or representations of its meaning. Finally, it contrasts the P value with its Bayesian counterpart, the Bayes' factor, which has virtually all of the desirable properties of an evidential measure that the P value lacks, most notably interpretability. The most serious consequence of this array of P-value misconceptions is the false belief that the probability of a conclusion being in error can be calculated from the data in a single experiment without reference to external evidence or the plausibility of the underlying mechanism.
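The contrast the abstract draws between a P value and a Bayes factor can be made concrete with a short calculation. The sketch below is illustrative only and is not taken from the paper: it converts a two-sided P value into two well-known lower bounds on the Bayes factor favoring the null hypothesis, the normal-approximation minimum Bayes factor exp(−z²/2) (Edwards, Lindman, and Savage; Goodman) and the −e·p·ln(p) bound of Sellke, Bayarri, and Berger (valid for p < 1/e).

```python
# Illustrative sketch (not code from the paper): contrast a two-sided P value
# with two lower bounds on the Bayes factor favoring the null hypothesis:
# exp(-z^2/2) and -e*p*ln(p) (the latter valid for p < 1/e).
import numpy as np
from scipy import stats

for p in [0.05, 0.01, 0.001]:
    z = stats.norm.isf(p / 2)                 # z-score corresponding to a two-sided P value
    min_bf_null = np.exp(-z ** 2 / 2)         # smallest possible BF for H0 vs best-supported alternative
    sellke_bound = -np.e * p * np.log(p)      # bound on BF for H0 under a broad class of priors
    print(f"p = {p:.3f}  ->  min BF(H0) = {min_bf_null:.3f}, "
          f"-e*p*ln(p) bound = {sellke_bound:.3f}")
```

At p = 0.05 these bounds give at best roughly 2.5:1 to 7:1 evidence against the null, far weaker than a naive reading of "P = 0.05" as "95% certainty" would suggest.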


... The NHST approach to inference has been criticized due to certain limitations and erroneous interpretations of p-values (e.g., [9][10][11][12][13][14][15][16][17][18][19][20][21]), which we briefly describe below. As a result, some methodologists have argued that p-values should be mostly abandoned from scientific practice (e.g., [14,17,22,23]). ...
... Alternatively, the highest density interval of the posterior distribution could be compared to a predefined region of practical equivalence: If the highest density interval does not overlap with the region of equivalence, the alternative hypothesis can be accepted [24,31,32]. Another possibility is the use of Bayes factors [14,[33][34][35][36], which quantify the evidence for the alternative hypothesis relative to the evidence for the null hypothesis. ...
Article
Full-text available
Background Clinical trials often seek to determine the superiority, equivalence, or non-inferiority of an experimental condition (e.g., a new drug) compared to a control condition (e.g., a placebo or an already existing drug). The use of frequentist statistical methods to analyze data for these types of designs is ubiquitous even though they have several limitations. Bayesian inference remedies many of these shortcomings and allows for intuitive interpretations, but is currently difficult for the applied researcher to implement. Results We outline the frequentist conceptualization of superiority, equivalence, and non-inferiority designs and discuss its disadvantages. Subsequently, we explain how Bayes factors can be used to compare the relative plausibility of competing hypotheses. We present baymedr, an R package and web application that provides user-friendly tools for the computation of Bayes factors for superiority, equivalence, and non-inferiority designs. Instructions on how to use baymedr are provided and an example illustrates how existing results can be reanalyzed with baymedr. Conclusions Our baymedr R package and web application enable researchers to conduct Bayesian superiority, equivalence, and non-inferiority tests. baymedr is characterized by a user-friendly implementation, making it convenient for researchers who are not statistical experts. Using baymedr, it is possible to calculate Bayes factors based on raw data and summary statistics.
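To make the frequentist/Bayesian contrast described above tangible, here is a minimal Python sketch; it does not reproduce baymedr's own routines or priors. A two-sample superiority comparison is summarized once as a t-test P value and once as a rough Bayes factor obtained from the BIC approximation of Wagenmakers (2007). The simulated "drug" and "placebo" samples are assumptions for illustration.

```python
# Minimal sketch (not the baymedr implementation): a frequentist two-sample test
# versus a rough Bayes factor from the BIC approximation
# BF01 ~= exp((BIC1 - BIC0) / 2). Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
drug, placebo = rng.normal(0.3, 1, 60), rng.normal(0.0, 1, 60)

t, p = stats.ttest_ind(drug, placebo)          # frequentist superiority test
print(f"t = {t:.2f}, two-sided p = {p:.3f}")

def bic(groups, common_mean):
    # BIC for a normal model with either one common mean (H0) or separate means (H1)
    data = np.concatenate(groups)
    n = data.size
    if common_mean:
        resid, k = data - data.mean(), 2                          # mean + variance
    else:
        resid, k = np.concatenate([g - g.mean() for g in groups]), 3  # two means + variance
    sigma2 = np.mean(resid ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return k * np.log(n) - 2 * loglik

bf01 = np.exp((bic([drug, placebo], False) - bic([drug, placebo], True)) / 2)
print(f"approximate BF01 = {bf01:.2f} (BF10 = {1 / bf01:.2f})")
```

For a real analysis, a dedicated tool such as baymedr or JASP with explicitly specified priors is preferable to this BIC shortcut.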
... We want to use such a generative model to test if a given query annotation Q behaves as if it was "randomly shuffled" on the chromosome. To this end, we set the parameters of the Markov chain so that the expected interval ... [Figure 1 caption: An example of a query annotation Q = {[1,3), [5,7), [9,14)} shown as a set of framed boxes and the corresponding sequence of states of the context-aware Markov chain that induces the annotation (shown as filled circles). Genome context ϕ is shown as black and gray bars with colors corresponding to two distinct class labels; the same colors are also used on transition arrows between successive states of the Markov chain, as the transition probabilities depend on the genome context.] ...
... ; TARs, and Q are regions with a particular epigenetic mark. One could compare the enrichment p-value of Q in R 1 with the enrichment p-value of Q in R 2 ; however, this is not statistically sound, as p-values should not generally be compared with each other [9,21]. ...
... of adjacent intervals of size one, producing a modified reference annotation R′. For example, reference annotation R = {[2,5), [7,9)} becomes R′ = {[2,3), [3,4), [4,5), [7,8), [8,9)}. It is easy to see that B(R, Q) = B(R′, Q) = K(R′, Q), and we can therefore compute the exact PMF for B(R, Q) by computing the PMF for the K(R′, Q) statistic by the MCDP* algorithm. ...
Preprint
Full-text available
An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes, conserved elements, and epigenetic modifications. A common task is to compare two annotations to determine if one is enriched or depleted in the regions covered by the other. We study the problem of assigning statistical significance to such a comparison based on a null model representing two random unrelated annotations. Previous approaches to this problem remain too slow or inaccurate. To incorporate more background information into such analyses and avoid biased results, we propose a new null model based on a Markov chain which differentiates among several genomic contexts. These contexts can capture various confounding factors, such as GC content or sequencing gaps. We then develop a new algorithm for estimating p-values by computing the exact expectation and variance of the test statistics and then estimating the p-value using a normal approximation. Compared to the previous algorithm by Gafurov et al., the new algorithm provides three advances: (1) the running time is improved from quadratic to linear or quasi-linear, (2) the algorithm can handle two different test statistics, and (3) the algorithm can handle both simple and context-dependent Markov chain null models. We demonstrate the efficiency and accuracy of our algorithm on synthetic and real data sets, including the recent human telomere-to-telomere assembly. In particular, our algorithm computed p-values for 450 pairs of human genome annotations using 24 threads in under three hours. The use of genomic contexts to correct for GC-bias also resulted in the reversal of some previously published findings. Availability The software is freely available at https://github.com/fmfi-compbio/mcdp2 under the MIT licence. All data for reproducibility are available at https://github.com/fmfi-compbio/mcdp2-reproducibility
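The last step the abstract describes, turning an exact null expectation and variance of a test statistic into a p-value via a normal approximation, can be sketched in a few lines. The numbers below are invented, and the exact moments are assumed to be supplied by an algorithm such as the one described; this is not the MCDP2 code.

```python
# Sketch of the final step described above (not the MCDP2 implementation):
# given E[T] and Var[T] under the null model, estimate the enrichment p-value
# with a normal approximation. Numbers are made up.
from scipy import stats
import math

observed_overlap = 1520          # hypothetical test statistic on real annotations
null_mean = 1200.0               # exact expectation under the Markov-chain null (assumed given)
null_var = 9000.0                # exact variance under the null (assumed given)

z = (observed_overlap - null_mean) / math.sqrt(null_var)
p_enrichment = stats.norm.sf(z)  # one-sided: probability of an overlap at least this large
print(f"z = {z:.2f}, enrichment p-value ~= {p_enrichment:.2e}")
```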
... We performed statistical inference on the basis of estimation statistics, which is a much more powerful and informative tool than hypothesis testing [22][23][24]. Accordingly, no statistical test was performed and therefore no measure of statistical significance was reported [25][26][27][28]. Furthermore, we calculated Cohen's d to assess the relative magnitude of the effect sizes that we estimated [29]. ...
... Table 3 displays the results regarding volumetric differences between migraine cases and controls in the investigated cerebellar brain regions and brainstem. Of note, individuals with migraine manifested larger volumes than controls in the sub-regions V (mean difference: 72 mm³, 95% CI [13, 132]), crus I (mean difference: 259 mm³, 95% CI [9, 510]), VIIIa (mean difference: 120 mm³, 95% CI [0.9, 238]), and X (mean difference: 14 mm³, 95% CI [1, 27]). ...
... To the best of our knowledge, this is the first study that has estimated volumetric brain differences between individuals with migraine and controls at the cerebellar and brainstem level in a considerably large cohort applying confounder-adjusted quantitative methods. Notably, we found larger gray matter volumes in the sub-regions V (mean difference: 72 mm³, 95% CI [13, 132]), crus I (mean difference: 259 mm³, 95% CI [9, 510]), VIIIa (mean difference: 120 mm³, 95% CI [0.9, 238]), and X (mean difference: 14 mm³, 95% CI [1, 27]). We were also able to show that the cerebellar sub-regions are characterized by a medium-to-high gradient of positive volumetric correlation, conditioning the model within the levels of a broad range of important biological covariates. ...
Article
Full-text available
Background: The cerebellum and the brainstem are two brain structures involved in pain processing and modulation that have also been associated with migraine pathophysiology. The aim of this study was to investigate possible associations between the morphology of the cerebellum and brainstem and migraine, focusing on gray matter differences in these brain areas. Methods: The analyses were based on data from 712 individuals with migraine and 45,681 healthy controls from the UK Biobank study. Generalized linear models were used to estimate the mean gray matter volumetric differences in the brainstem and the cerebellum. The models were adjusted for important biological covariates such as BMI, age, sex, total brain volume, diastolic blood pressure, alcohol intake frequency, current tobacco smoking, assessment center, material deprivation, ethnic background, and a wide variety of health conditions. Secondary analyses investigated volumetric correlation between cerebellar sub-regions. Results: We found larger gray matter volumes in the cerebellar sub-regions V (mean difference: 72 mm³, 95% CI [13, 132]), crus I (mean difference: 259 mm³, 95% CI [9, 510]), VIIIa (mean difference: 120 mm³, 95% CI [0.9, 238]), and X (mean difference: 14 mm³, 95% CI [1, 27]). Conclusions: Individuals with migraine show larger gray matter volumes in several cerebellar sub-regions than controls. These findings support the hypothesis that the cerebellum plays a role in the pathophysiology of migraine.
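The citation contexts above emphasize estimation statistics (effect estimates with confidence intervals and Cohen's d) over significance testing. The sketch below illustrates that reporting style on simulated volumes; it does not reproduce the study's covariate-adjusted generalized linear models, and the group sizes and values are assumptions.

```python
# Illustrative sketch of estimation statistics on simulated volumes (the study
# itself used covariate-adjusted generalized linear models on UK Biobank data):
# report a mean difference with its 95% CI and Cohen's d instead of a p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
migraine = rng.normal(7500, 900, 200)     # hypothetical sub-region volumes, mm^3
controls = rng.normal(7400, 900, 400)

diff = migraine.mean() - controls.mean()
se = np.sqrt(migraine.var(ddof=1) / migraine.size + controls.var(ddof=1) / controls.size)
ci = diff + np.array([-1, 1]) * stats.norm.ppf(0.975) * se

pooled_sd = np.sqrt(((migraine.size - 1) * migraine.var(ddof=1) +
                     (controls.size - 1) * controls.var(ddof=1)) /
                    (migraine.size + controls.size - 2))
cohens_d = diff / pooled_sd
print(f"mean difference = {diff:.0f} mm^3, 95% CI [{ci[0]:.0f}, {ci[1]:.0f}], d = {cohens_d:.2f}")
```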
... Dependence on sample size is one of the criteria that has contributed to the growing unpopularity of statistical significance tests. (9,10,11) Given the circumstances described, a study on the influence of these comorbidities on the prognosis of a patient hospitalized with COVID-19 can only be conceived with the purpose of confirming or verifying the existence and the expected direction of that influence. ...
... This possibility of making the definition of a relevant effect (difference, association, predictive capacity) explicit marks a key contrast between the aspiration to verify and that of merely confirming. Significance tests, which have earned so much discredit through their indiscriminate use in biomedical and public health research, (9,10,11) retain a legitimate niche of application in clinical trials. ...
Article
Introduction: The importance of prior knowledge, together with the objectives, the nature of the data, and the type of design, is argued to be central to the choice of analytical resources in biomedical and public health research. Prior knowledge is rarely taken into account, and this omission leads not only to settling for a simple description when more complex resources are needed, but also to using superfluous analytical ornaments. Development: Prior knowledge can be summarized in an analytical perspective, which in turn can be expressed in three key verbs: explore, confirm, or verify. These three verbs are decisive when choosing the design and the analytical resources, even if they are not made explicit when formulating the research objectives. Conclusions: Except in development research, whose purpose is to obtain a tangible product, there is always an underlying perspective in research, which may be exploratory or confirmatory; in the latter case it may be aimed at providing empirical support for a conjecture or at verifying hypotheses that have previously been grounded on theoretical or empirical bases.
... In general terms, the difficulty in using this estimate lies, for Goodman, in the fact that the probability level is not part of a formal system of statistical inference but is limited to estimating the probability that the null hypothesis is true. Indeed, according to Goodman (2008), it is precisely a confusion on this point that generates the first and most frequent misconception, namely that "if P = .05, the null hypothesis has only a 5% chance of being true" (whereas, if the estimate is based on the assumption that the null hypothesis is true, it cannot simultaneously estimate the probability that it is false). ...
... However, although a growing interest in these models can be observed, the reference to the 5% threshold has maintained an extraordinarily important role in current research. Goodman (2008) describes this tendency as follows: "One of many reasons that P values persist is that they are part of the vocabulary of research; whatever they do or do not mean, the scientific community feels they understand the rules with regard to their use, and are collectively not familiar enough with alternative methodologies or metrics." The point that interests us here is that the reference to P = .05 ...
Article
Full-text available
Some general considerations about measures in psychology are presented regarding how clinicians and researchers represent them. In identifying the presence of a cognitive disorder through psychometric tests, we make choices regarding the structure of the measure, the statistics that allow us to identify the presence of deviance, and the probability values associated with these statistics. In doing so, we use ways of observing data structured throughout school and university training. Using these "spectacles" (arithmetic, Gaussian and probabilistic) is somewhat necessary because the numbers derived from the tests cannot be interpreted without metric, statistical and probabilistic assumptions. On the other hand, the tendency to view psychological measures with arithmetic glasses can create problems in understanding the real usability of tests in the case of psychological dimensions. In addition, the choice of which assumption to use from a probabilistic point of view is not indifferent to the result obtained (particularly in identifying pathological performance thresholds). Understanding the nature of the assumptions we use in these contexts can foster a better awareness of the value and limitations of psychometric observations in the assessment of developmental disorders. Corresponding author: Pierluigi Zoccolotti.
... This is different for the frequentist null hypothesis significance testing. Here, the p value indicates the probability with which the same or an even more extreme effect will be found in hypothetical repetitions of the same experiment if the hypothesis of no effect is true 27 . Secondly, the Bayesian credible interval represents the bounds within which the true value is expected to lie with 95% probability given the observed data ( 28 , chapter 11.3). ...
... For region of interest based analyses we used Bayesian ANCOVA with Bayes factor (BF) hypothesis testing with volume or metabolism as dependent variable, mutation carrier status as independent variable, and age, sex, CDR score, and (for analyses of cognitive scores) education as confounders, to compare the alternative hypothesis against the null hypothesis (i.e., the assumption that there is an effect of carrier status, H1) 26,27 , as implemented in Jeffreys' Amazing Statistics Program (JASP Version 0.16.4), available at jasp-stats.org. We report the Bayes Factor (BF10) quantifying evidence in favor of the alternative hypotheses. ...
Article
Full-text available
We aimed to study atrophy and glucose metabolism of the cholinergic basal forebrain in non-demented mutation carriers for autosomal dominant Alzheimer's disease (ADAD). We determined the level of evidence for or against atrophy and impaired metabolism of the basal forebrain in 167 non-demented carriers of the Colombian PSEN1 E280A mutation and 75 age- and sex-matched non-mutation carriers of the same kindred using a Bayesian analysis framework. We analyzed baseline MRI, amyloid PET, and FDG-PET scans of the Alzheimer’s Prevention Initiative ADAD Colombia Trial. We found moderate evidence against an association of carrier status with basal forebrain volume (Bayes factor (BF10) = 0.182). We found moderate evidence against a difference of basal forebrain metabolism (BF10 = 0.167). There was only inconclusive evidence for an association between basal forebrain volume and delayed memory and attention (BF10 = 0.884 and 0.184, respectively), and between basal forebrain volume and global amyloid load (BF10 = 2.1). Our results distinguish PSEN1 E280A mutation carriers from sporadic AD cases in which cholinergic involvement of the basal forebrain is already detectable in the preclinical and prodromal stages. This indicates an important difference between ADAD and sporadic AD in terms of pathogenesis and potential treatment targets.
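The Bayes factors reported above (e.g., BF10 = 0.182) quantify evidence, but turning them into a probability that the hypothesis is true requires prior odds, which is exactly what a P value cannot supply. A small illustrative sketch follows; the 1:1 prior odds are an assumption for illustration, not the authors' choice.

```python
# Sketch: turning a reported Bayes factor into a posterior probability of H1,
# given explicit prior odds -- the step a P value cannot provide on its own.
def posterior_prob_h1(bf10, prior_odds=1.0):
    post_odds = bf10 * prior_odds
    return post_odds / (1 + post_odds)

for bf10 in [0.182, 0.167, 2.1]:            # values reported in the abstract above
    print(f"BF10 = {bf10:5.3f} -> P(H1 | data) = {posterior_prob_h1(bf10):.2f} "
          f"(with 1:1 prior odds)")
```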
... This view limits the scope of scientific discussion to a mutually exclusive binary. For decades, many authors have pointed out misconceptions of the p-value that lead to erroneous conclusions, nicely summarized in ref. 7. One misconception is that "Studies with p-values on opposite sides of .05 are conflicting".7 Bayesian random effects meta-analysis allows assessing both the mean effect across studies and between-study heterogeneity, with the posterior distribution providing an estimate of the most likely values of the mean and heterogeneity parameters given the data. Additionally, the Bayesian framework allows direct quantification of evidence in favor of or against an effect on a continuous basis. ...
Article
Full-text available
INTRODUCTION Phase 3 trials using the anti‐amyloid antibodies aducanumab, lecanemab, donanemab, and high‐dose gantenerumab in prodromal and mild Alzheimer's disease dementia were heterogeneous in respect to statistical significance of effects. However, heterogeneity of results has not yet directly be quantified. METHODS We used Bayesian random effects meta‐analysis to quantify evidence for or against a treatment effect, and assessed the size of the effect and its heterogeneity. Data were extracted from published studies where available and Web based data reports, assuming a Gaussian data generation process. RESULTS We found moderate evidence in favor of a treatment effect (Bayes factor = 13.2). The effect was moderate to small with −0.33 (95% credible interval −0.54 to −0.10) points on the Clinical Dementia Rating – Sum of Boxes (CDR‐SB) scale. The heterogeneity parameter was low to moderate with 0.21 (0.04 to 0.45) CDR‐SB points. DISCUSSION Heterogeneity across studies was moderate despite some trials reaching statistical significance, while others did not. This suggests that the negative aducanumab and gantenerumab trials are in full agreement with the expected effect sizes.
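A Bayesian random-effects meta-analysis of the kind described can be sketched numerically with a simple grid approximation. The study-level estimates, standard errors, grid ranges, and flat priors below are assumptions for illustration and do not reproduce the authors' model or data.

```python
# Compact sketch (not the authors' analysis) of a Bayesian random-effects
# meta-analysis: study estimates y_i with standard errors s_i, marginal
# likelihood y_i ~ N(mu, s_i^2 + tau^2), evaluated on a (mu, tau) grid.
import numpy as np
from scipy import stats

y = np.array([-0.39, -0.22, -0.45, -0.10])   # hypothetical CDR-SB treatment effects
s = np.array([0.09, 0.10, 0.12, 0.15])       # hypothetical standard errors

mu_grid = np.linspace(-1.0, 0.5, 301)
tau_grid = np.linspace(0.0, 1.0, 201)
M, T = np.meshgrid(mu_grid, tau_grid, indexing="ij")

# log marginal likelihood of all studies at each (mu, tau) grid point
var = s[None, None, :] ** 2 + T[..., None] ** 2
loglik = stats.norm.logpdf(y[None, None, :], loc=M[..., None], scale=np.sqrt(var)).sum(-1)
post = np.exp(loglik - loglik.max())          # flat priors on mu and on tau >= 0
post /= post.sum()

mu_marginal = post.sum(axis=1)                # marginal posterior of the mean effect
mu_mean = (mu_grid * mu_marginal).sum() / mu_marginal.sum()
cdf = np.cumsum(mu_marginal) / mu_marginal.sum()
lo, hi = mu_grid[np.searchsorted(cdf, 0.025)], mu_grid[np.searchsorted(cdf, 0.975)]
print(f"posterior mean effect = {mu_mean:.2f}, 95% credible interval [{lo:.2f}, {hi:.2f}]")
```

A full analysis would use MCMC with explicitly chosen priors on the mean effect and the heterogeneity parameter and would report a Bayes factor alongside the posterior.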
... 15 This is different from the P value, which tells us the probability that a similar or even more extreme effect will occur in future experiments if the null hypothesis were true. 16,17 Additionally, the 95% credible interval of the posterior distribution of the parameter estimates is directly interpretable as the range in which the parameter lies with 95% probability given the data. This is different from the interpretation of the frequentist 95% confidence interval: the parameter estimate will lie in this interval in 95% of future repeated experiments. ...
... ,17,18 Part of the data came from the DELCODE cohort, conducted by the DZNE. Another part of the data came from the ADNI cohorts, accessed via the ADNI database (http://adni.loni.usc.edu/). ...
Article
Full-text available
INTRODUCTION We investigated the association of inflammatory mechanisms with markers of Alzheimer's disease (AD) pathology and rates of cognitive decline in the AD spectrum. METHODS We studied 296 cases from the Deutsches Zentrum für Neurodegenerative Erkrankungen Longitudinal Cognitive Impairment and Dementia Study (DELCODE) cohort, and an extension cohort of 276 cases of the Alzheimer's Disease Neuroimaging Initiative study. Using Bayesian confirmatory factor analysis, we constructed latent factors for synaptic integrity, microglia, cerebrovascular endothelial function, cytokine/chemokine, and complement components of the inflammatory response using a set of inflammatory markers in cerebrospinal fluid. RESULTS We found strong evidence for an association of synaptic integrity, microglia response, and cerebrovascular endothelial function with a latent factor of AD pathology and with rates of cognitive decline. We found evidence against an association of complement and cytokine/chemokine factors with AD pathology and rates of cognitive decline. DISCUSSION Latent factors provided access to directly unobservable components of the neuroinflammatory response and their association with AD pathology and cognitive decline.
... Similarly, there was a small increase from 0 to ~11 % in the number of studies that reported submitting metabolomics data to a repository. Another promising trend was the general increase of ~20-~30 % in the application and/or the reporting of standard statistical validation parameters such as p-values [40], R 2 /Q 2 [41], and false discovery rate (FDR) corrections [42][43][44]. ...
... This is consistent with the results of the survey ( Fig. 1) where ~70-75 % of the 2010 and 2020 papers reported using univariate statistics, ~75 % of the studies used unsupervised multivariate statistics, ~52-73 % of the papers reported using supervised multivariate statistics, and ~61-68 % of the studies used both univariate and multivariate statistics. Since most metabolomics investigators are not necessarily trained biostatisticians, the proper application of statistics to a metabolomics study is a common concern, especially if a biostatistician has not been part of the study design [26][27][28]40,41,55]. This concern is underscored by the results of our survey. ...
Article
A literature survey was conducted to identify current practices used by NMR metabolomics investigators when conducting and reporting their metabolomics studies. A total of 463 papers from 2020 and 80 papers from 2010 were selected from PubMed and were manually analyzed by a team of investigators to assess the extent and completeness of the experimental procedures and protocols reported. A significant number of the papers did not report on essential experimental details, incompletely stated which statistical methods were used, improperly applied supervised multivariate statistical analyses, or lacked validation of statistical models. A large diversity of protocols and software were identified, which suggests a lack of consensus and a relatively limited use of commonly agreed upon standards for conducting and reporting NMR metabolomics studies. The overall intent of the survey is to inform and encourage the NMR metabolomics community to develop and adopt best-practices for the field.
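Of the validation tools mentioned in the citation contexts above, the false discovery rate correction is the easiest to show concretely. Below is a short sketch of the Benjamini-Hochberg procedure applied to a made-up vector of metabolite p-values.

```python
# Sketch of the Benjamini-Hochberg false discovery rate procedure mentioned
# above, applied to a made-up vector of metabolite p-values.
import numpy as np

pvals = np.array([0.0002, 0.004, 0.019, 0.03, 0.045, 0.12, 0.33, 0.61])
q = 0.05                                     # target false discovery rate
order = np.argsort(pvals)
ranked = pvals[order]
m = len(pvals)
below = ranked <= (np.arange(1, m + 1) / m) * q
k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
rejected = np.zeros(m, dtype=bool)
rejected[order[:k]] = True                   # reject the k smallest p-values
print("declared significant after FDR control:", pvals[rejected])
```

In practice the equivalent statsmodels call, multipletests(pvals, method="fdr_bh"), gives the same decisions plus adjusted p-values.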
... Therefore, the ranks of optimization algorithms, sampling procedures, sample size proxies, and climate projection scenarios were based on the mean coefficient estimates or mean relative explained variance. We chose a low significance level and specified Bayes factors to account for the large GAMs with many explanatory variables and the frequent misinterpretation or overinterpretation of p values (Benjamin and Berger, 2019;Goodman, 2008;Ioannidis, 2005;Wasserstein et al., 2019;Nuzzo, 2015). We applied a lower-than-usual significance level, namely α = 0.01 (i.e., p < 0.005 for two-sided distributions; Benjamin and Berger, 2019), and included the 99 % confidence intervals in our results. ...
... Recent models accounted for this and based their threshold for the senescence rate on spring phenology (SIAM model; Keenan and Richardson, 2015) or on seasonal drivers such as the average growing-season temperature or accumulated net photosynthetic product (TDM or PIA models; Liu et al., 2019;Zani et al., 2020). However, Gill et al. (2015) and Chen et al. (2018) observed site-specific responses of leaf phenology to climate change, which could be due to site-specific soil properties (Arend et al., 2016), nutrient availability , and local adaptation (Peaucelle et al., 2019), which are not yet included in the current models. In addition, observations may be biased , and the perceptions of observers at different sites are usually not aligned. ...
Article
Full-text available
Autumn leaf phenology marks the end of the growing season, during which trees assimilate atmospheric CO₂. The length of the growing season is affected by climate change because autumn phenology responds to climatic conditions. Thus, the timing of autumn phenology is often modeled to assess possible climate change effects on future CO₂-mitigating capacities and species compositions of forests. Projected trends have been mainly discussed with regards to model performance and climate change scenarios. However, there has been no systematic and thorough evaluation of how performance and projections are affected by the calibration approach. Here, we analyzed > 2.3 million performances and 39 million projections across 21 process-oriented models of autumn leaf phenology, 5 optimization algorithms, ≥ 7 sampling procedures, and 26 climate model chains from two representative concentration pathways. Calibration and validation were based on > 45 000 observations for beech, oak, and larch from 500 central European sites each. Phenology models had the largest influence on model performance. The best-performing models were (1) driven by daily temperature, day length, and partly by seasonal temperature or spring leaf phenology; (2) calibrated with the generalized simulated annealing algorithm; and (3) based on systematically balanced or stratified samples. Autumn phenology was projected to shift between −13 and +20 d by 2080-2099 compared to 1980-1999. Climate scenarios and sites explained more than 80 % of the variance in these shifts and thus had an influence 8 to 22 times greater than the phenology models. Warmer climate scenarios and better-performing models predominantly projected larger backward shifts than cooler scenarios and poorer models. Our results justify inferences from comparisons of process-oriented phenology models to phenology-driving processes, and we advocate for species-specific models for such analyses and subsequent projections. For sound calibration, we recommend a combination of cross-validations and independent tests, using randomly selected sites from stratified bins based on mean annual temperature and average autumn phenology, respectively. Poor performance and little influence of phenology models on autumn phenology projections suggest that current models are overlooking relevant drivers. While the uncertain projections indicate an extension of the growing season, further studies are needed to develop models that adequately consider the relevant processes for autumn phenology. Summary. This study analyzed the impact of process-oriented models, optimization algorithms, calibration samples, and climate scenarios on the simulated timing of autumn leaf phenology (Fig. 2). The accuracy of the simulated timing was assessed by the root mean square error (RMSE) between the observed and simulated timing of autumn phenology. The future timing was expressed as a projected shift between 1980-1999 and 2080-2099. While the RMSE was related to the models, optimization algorithms, and calibration samples through linear mixed-effects models (LMMs), the projected shift was related to the climate change scenarios, models, optimization algorithms, and calibration samples. The analyzed > 2.3 million RMSEs and 39 million projected shifts were derived from site- and species-specific calibrations (i.e., one set of parameters per site and species vs. one set of parameters per species, respectively).
The calibrations were based on 17 211, 16 954, and 11 602 observed site years for common beech (Fagus sylvatica L.), pedunculate oak (Quercus robur L.), and European larch (Larix decidua MILL.), respectively, which were recorded at 500 central European sites per species. Process-oriented models are a useful tool to study leaf senescence. The assessed phenology models differed in their functions and drivers, which had the largest influence on the accuracy of the simulated autumn phenology (i.e., model performance). In all 21 models, autumn phenology occurs when a threshold related to an accumulated daily senescence rate is reached. While the threshold is either a constant or depends linearly on one or two seasonal drivers, the rate depends on daily temperature and, in all but one model, on day length. Depending on the model, the rate is (1) a monotonically increasing response to cooler days and is (i) amplified or (ii) weakened by shorter days, or it is (2) a sigmoidal response to both cooler and shorter days. In the three most accurate models, the threshold was either a constant or was derived from the timing of spring leaf phenology (site-specific calibration) or the average temperature of the growing season (species-specific calibration). Further, the daily rate of all but one of these models was based on monotonically increasing curves, which were both amplified or weakened by shorter days. Overall, the relatively large influence of the models on the performance justifies inferences from comparisons of process-oriented models to the leaf senescence process. Chosen optimization algorithms must be carefully tuned. The choice of the optimization algorithm and corresponding control settings had the second largest influence on model performance. The models were calibrated with five algorithms (i.e., efficient global optimization based on kriging with or without trust region formation, generalized simulated annealing, particle swarm optimization, and covariance matrix adaptation with evolutionary strategies), each executed with a few and many iterations. In general, generalized simulated annealing found the parameters that led to the best-performing models. Depending on the algorithm, model performance increased with more iterations for calibration. The positive and negative effects of more iterations on subsequent model performance relativize the comparison of algorithms in this study and exemplify the importance of carefully tuning the chosen algorithm to the studied search space. Stratified samples result in the most accurate calibrations. Model performance was influenced relatively little by the choice of the calibration sample in both the site- and species-specific calibrations. The models were calibrated and validated with site-specific 5-fold cross-validation, as well as with species-specific calibration samples that contained 75 % randomly assigned observations from between 2 and 500 sites and corresponding validation samples that contained the remaining observations of these sites or of all sites of the population. For the site-specific cross-validation, observations were selected in a random or systematic procedure. The random procedure assigned the observations randomly. For the systematic procedure, observations were first ordered based on year, mean annual temperature (MAT), or autumn phenology date (AP).
Thus, every fifth observation (i.e., 1 + i, 6 + i, ... with i ∈ {0, 1, ..., 4}; systematically balanced) or every fifth of the n observations (i.e., 1 + i, 2 + i, ..., n/5 + i with i ∈ {0, n/5, ..., 4n/5}; systematically continuous) was assigned to one of the cross-validation samples. For the species-specific calibration, sites were selected in a random, systematic, or stratified procedure. The random procedure randomly assigned 2, 5, 10, 20, 50, 100, or 200 sites from the entire or half of the population according to the average MAT or average AP. For the systematic procedure, sites were first ordered based on average MAT or average AP. Thus, every jth site was assigned to a particular calibration sample with the greatest possible difference in MAT or AP between the 2, 5, 10, 20, 50, 100, or 200 sites. For the stratified procedure, the ordered sites were separated into 12 or 17 equal-sized bins based on MAT or AP, respectively (i.e., the smallest possible size that led to at least one site per bin). Thus, one site per bin was randomly selected and assigned to a particular calibration sample. The effects of these procedures on model performance were analyzed together with the effect of sample size. The results show that at least nine observations per free model parameter (i.e., the parameters that are fitted during calibration) should be used, which advocates for the pooling of sites and thus species-specific models. These models likely perform best when (1) sites are selected in a stratified procedure based on MAT for (2) a cross-validation with systematically balanced observations based on site and year, and their performance (3) should be tested with new sites selected in a stratified procedure based on AP. Projections of autumn leaf phenology are highly uncertain. Projections of autumn leaf phenology to the years 2080-2099 were mostly influenced by the climate change scenarios, whereas the influence of the phenology models was relatively small. The analyzed projections were based on 16 and 10 climate model chains (CMCs) that assume moderate vs. extreme future warming, following the representative concentration pathways (RCPs) 4.5 and 8.5, respectively. Under more extreme warming, the projected autumn leaf phenology occurred 8-9 d later than under moderate warming, specifically shifting by −4 to +20 d (RCP 8.5) vs. −13 to +12 d (RCP 4.5). While autumn phenology was projected to generally occur later according to the better-performing models, the projections were over 6 times more influenced by the climate scenarios than by the phenology models. This small influence of models that differ in their functions and drivers indicates that the modeled relationship between warmer days and slowed senescence rates suppresses the effects of the other drivers considered by the models. However, because some of these drivers are known to considerably influence autumn phenology, the lack of corresponding differences between the projections of current phenology models underscores their uncertainty rather than the reliability of these models.
... P-Value is a statistical tool used to evaluate the probability that the relationship existing between two variables is statistically significant (Goodman 2008). Conventional P-Value interpretation stipulates that a p-value of less than 0.001 points towards strong evidence of a significant correlation, while 0.05 refers to moderate proof of a significant correlation (Goodman 2008; Halsey et al. 2015). Additionally, 0.1 suggests the existence of a possible weak significant correlation, while values above 0.1 show no evidence that the existing correlation is significant. ...
... One major critique was the 2016 American Statistical Association (ASA) statement on p values and statistical significance which outlines six major principles required to understand and report p values accurately 6 : 1) p values indicate how incompatible data are with a statistical model; 2) They are not estimates of the probability of the hypothesis being true or data resulting from random error alone; 3) Scientific conclusions and policy decisions do not depend on p value thresholds; 4) Proper inference demands full reporting and transparency; 5) They do not measure effect; and 6) They alone are insufficient for assessing evidence regarding a model. Although this view on p values is not unanimous, most methodologists concur with the problems associated with equating "statistical significance" to "statistical importance" [7][8][9][10][11][12][13] . To address this issue, multiple researchers have either proposed a shift in the language used or a shift in the scale used to introduce and interpret p values, in order to promote more accurate interpretations and applications of the p value 1,2,7,8,14 . ...
Preprint
Full-text available
The concept of p values is central to research reporting in medicine, with its formal structure largely attributed to Ronald Fisher, who established the term statistical significance and proposed the 0.05 significance threshold in 1925. Despite its century long use, the p value has faced significant scrutiny and debate regarding its proper interpretation and application. This proposal aims to redress this criticism by addressing a root cause of the misconceptions surrounding p values – language. The proposed shift in language involves replacing significance with ‘divergence’. This change aims to ease out the inference of importance attached to significance and emphasize what is actually at stake – divergence between the data and the hypothesized null model as its source. Additionally, the term ’confidence interval’ acquires the same language change and the proposed name is now ‘non-divergent interval’ (nDI) to reflect the range of effect models non-divergent from the observed data at the set divergence threshold. This change can be expected to help researchers and readers better understand the implications of p values, reduce publication bias, and address inappropriate expectations of replicability. It would also curb misinterpretation and over-interpretation of results, promoting a more accurate and nuanced understanding of the use of p values in statistical reporting thereby addressing the existing controversy surrounding their misuse.
... P-values are the most frequently used measure of statistical evidence across all fields of science and research [32]. Despite the frequency of use, understanding p-values is elusive to most users, with widespread misuse being well-documented since the 1940s [33,32]. When conducting a hypothesis test, a test's significance level (alpha) is chosen to determine the acceptable type I error (falsely rejecting the null hypothesis). ...
Preprint
Full-text available
Background: Decisions about health care, such as the effectiveness of new treatments for disease, are regularly made based on evidence from published work. However, poor reporting of statistical methods and results is endemic across health research and risks ineffective or harmful treatments being used in clinical practice. Statistical modelling choices often greatly influence the results. Authors do not always provide enough information to evaluate and repeat their methods, making interpreting results difficult. Our research is designed to understand current reporting practices and inform efforts to educate researchers. Methods: Reporting practices for linear regression were assessed in 95 randomly sampled published papers in the health field from PLOS ONE in 2019, which were randomly allocated to statisticians for post-publication review. The prevalence of reporting practices is described using frequencies, percentages, and Wilson 95% confidence intervals. Results: While 92% of authors reported p-values and 81% reported regression coefficients, only 58% of papers reported a measure of uncertainty, such as confidence intervals or standard errors. Sixty-nine percent of authors did not discuss the scientific importance of estimates, and only 23% directly interpreted the size of coefficients. Conclusion: Our results indicate that statistical methods and results were often poorly reported without sufficient detail to reproduce them. To improve statistical quality and direct health funding to effective treatments, we recommend that statisticians be involved in the research cycle, from study design to post-peer review. The research environment is an ecosystem, and future interventions addressing poor statistical quality should consider the interactions between the individuals, organisations and policy environments. Practical recommendations include journals producing templates with standardised reporting and using interactive checklists to improve reporting practices. Investments in research maintenance and quality control are required to assess and implement these recommendations to improve the quality of health research.
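The preprint reports prevalences "using frequencies, percentages, and Wilson 95% confidence intervals." A minimal sketch of the Wilson score interval follows; the count 55/95 is an assumption chosen to roughly match the reported 58% and is not taken from the paper.

```python
# Sketch of the Wilson score interval used for reporting prevalences,
# e.g. roughly 55 of 95 papers (~58%) reporting a measure of uncertainty.
import math

def wilson_ci(successes, n, z=1.96):
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

lo, hi = wilson_ci(55, 95)
print(f"55/95 = {55/95:.2f}, Wilson 95% CI ({lo:.2f}, {hi:.2f})")
```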
... Scenario B arises when the calculated χ² > the critical χ² and therefore the p-value < α, so the null hypothesis is rejected. A caveat about the χ² test, which can be generalized to the other tests that rely on statistical significance, is that a single statistically significant result should not be taken as a scientific fact; rather, it should draw attention to a phenomenon that appears worth further investigation, including replication (Goodman, 2008 ...
Book
Full-text available
This book teaches how to use the statistical software JASP to run chi-squared analyses.
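Scenario B from the citation context above (calculated χ² exceeding the critical χ², hence p < α) can be reproduced outside JASP in a few lines; the 2×2 contingency table below is made up for illustration.

```python
# Sketch of the decision rule described above (scipy instead of JASP):
# compare the calculated chi-squared with the critical value and p with alpha.
import numpy as np
from scipy import stats

table = np.array([[30, 10],
                  [20, 25]])                 # hypothetical 2x2 contingency table
chi2_calc, p, dof, expected = stats.chi2_contingency(table, correction=False)
alpha = 0.05
chi2_crit = stats.chi2.ppf(1 - alpha, dof)

print(f"chi2 = {chi2_calc:.2f}, critical value = {chi2_crit:.2f}, p = {p:.4f}")
if chi2_calc > chi2_crit:                    # equivalently, p < alpha (scenario B)
    print("reject the null hypothesis of independence")
else:
    print("do not reject the null hypothesis")
```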
... analysis was conducted by examining correlations using the Pearson linear correlation coefficient, Spearman's rank correlation, and Kendall's tau correlation between the specified factors, with a statistical significance level set for results with a p-value < 0.05. Adopting a p-value of this level is common in scientific articles (Genovese et al., 2006; Goodman, 2008). ...
... Spearman correlation coefficients were used for continuous predictors due to the non-normal distribution of the data collected [81]. P-values < 0.05 were considered statistically significant [82,83]. Further analysis included indicators with statistically significant connections. ...
Article
Full-text available
Previous studies have highlighted the significant role of vision in human perception. In this study, we examined whether the assessment of the audiovisual environment at China's iconic Great Wall aligns with these findings to understand how this assessment influences visitor satisfaction and a sense of restoration. In a field survey with 107 participants, an eight-variable structural equation model was used, encompassing sound sources (sounds of technology, human beings, and nature), the pleasantness and eventfulness of the soundscape, visual assessment, visitor satisfaction, and the Short-version Revised Restoration Scale (SRRS). The results revealed that: (1) visual assessment acted as a partial mediator not only between soundscape assessment and visitor satisfaction, but also between the soundscape assessment and restoration. (2) The pleasantness of the soundscape positively correlated with both visitor satisfaction (β = 0.278, p = 0.004) and restoration (β = 0.244, p = 0.000), while human-generated sounds had a negative impact on soundscape pleasantness (β = -0.256, p = 0.019). (3) The combined visual and auditory assessments contributed 60.5% and 39.5% to visitor satisfaction and 42.1% and 36.8% to restoration, respectively, indicating that the soundscape assessment of the Great Wall was higher than expected and comparable to visual assessment.
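The correlation analysis described in the citation contexts above (Pearson, Spearman, and Kendall's tau, each with a p-value threshold of 0.05) is sketched below on simulated ratings; the variable names and the sample size of 107 only loosely mirror the study, and none of the numbers are real data.

```python
# Sketch of the correlation analysis described above (Pearson, Spearman,
# Kendall's tau, each with a p-value), run on simulated ratings.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
pleasantness = rng.normal(3.5, 0.8, 107)                 # hypothetical soundscape ratings
satisfaction = 0.4 * pleasantness + rng.normal(0, 0.7, 107)

for name, test in [("Pearson", stats.pearsonr),
                   ("Spearman", stats.spearmanr),
                   ("Kendall", stats.kendalltau)]:
    r, p = test(pleasantness, satisfaction)
    print(f"{name:8s} r = {r:.2f}, p = {p:.4f}")
```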
... There are many misconceptions about p-values and myths about statistical significance (17). In particular, these include an excessive emphasis on the p < 0.05 threshold as a prejudgment of the significance of a result. ...
Article
Full-text available
The aim of this study was to verify whether there is a relationship between physicians' specialization, place of practice, or views on issues related to compounded medicines and the frequency of prescribing such formulations. Additionally, the study aimed to determine the factors that influence these views. The study also examined the barriers and encouragements, according to physicians, in the use of compounded medicines. An original questionnaire was developed specifically for this study. The included sample consisted of Polish-speaking physicians whose main specialization was gynaecology, gastroenterology, dermatology, family medicine, internal medicine, otorhinolaryngology, paediatric otorhinolaryngology, or paediatrics. Answers were gathered using computer-assisted telephone interviews. The majority of surveyed physicians believed that every Polish patient receives treatment tailored to their personal needs (68.7%) and agreed that compounded medicines are a way to facilitate individualization of pharmacotherapy (79.3%). The frequency of prescribing magistral formulations, as well as views on the quality of compounded medicines, ease of prescribing, and filling a prescription, varied between specializations and places of practice. The problem that negatively affected the practice of the largest group of respondents was complicated and frequently changing legal regulations (63.3%), while the least frequently indicated problem was concern about the quality of preparation (16.7%). Based on the opinions of the surveyed physicians, simplification of the regulations for prescribing and liberalization of reimbursement rules are the main changes to consider in order to improve clinical practice in the areas related to the prescription of compounded medicines.
... The absolute value of Cohen's d (|d|) is reported for each t-test and η² for each ANOVA. No arbitrary level of significance was adopted due to the many disadvantages of and lack of a strong justification for this approach [22,46]. Instead, according to test power calculations, any p-value ≤ 0.10 was considered worthy of interest, while lower p-values were treated as indicating stronger evidence than higher p-values (hence, it was not a 0-1 decision-making rule). ...
Article
Full-text available
It is rather uncontroversial that gender should have no influence on treating others as equal epistemic agents. However, is this view reflected in practice? This paper aims to test whether the gender of the testifier and of the person accused of assault is related to the perception of a testimony's reliability and the guilt of the potential perpetrator. Two experiments were conducted: the subjects (n = 361, 47% female, 53% male) assessed the reliability of the testifier in four scenarios of assault accusation, in which the only difference was the gender of the people presented. During the study, we observed dependencies between gender and the ascription of reliability, but only marginal differences in guilt attribution. The results of our research may constitute an argument for the existence of different epistemic statuses conferred on people depending on their gender and existing gender stereotypes. Our results suggest that gender bias may be situated at a deeper level than the linguistically triggered representation.
... default, except for the method of distribution of the test statistics computation, which we changed to an often more robust Monte Carlo setting with 1 000 000 permutations. The literature highlights that restrictive treatment of the conventional significance level of 0.05 as an absolute threshold for significant variables is a misinterpretation of the concept of the p-value 15,16 . We therefore used a more flexible approach in which p-values below 0.1 were treated as 'potentially significant', and significance was subject to gradation (as a continuous measure of compatibility of the data with the assumed model), rather than the erroneous binary focus on below/above 0.05. ...
Article
Full-text available
The paper aims to define the variables that elevate the risk of VFL recurrence after adequate primary treatment, and to present a Recurrence Risk Model with practical conclusions for handling pVFL and rVFL. Out of 207 patients with primary vocal fold leukoplakia (pVFL), recurrent VFL (rVFL) was diagnosed in 41 (19.8%). All patients were assessed using a trans-nasal flexible video-endoscope with white light and narrow-band imaging (NBI). The primary aim of our study was to investigate whether morphological features of pVFL in white light, the vascular pattern in NBI, and primary histological findings could predict VFL recurrence. To create a model of risk factors, two methods were used: logistic regression and a conditional inference decision tree. The study showed that smoking was the factor that most strongly and significantly increased the likelihood of rVFL, and that older age groups had greater odds of rVFL. Types IV, V, and VI, according to the Ni (2019) classification, were associated with a significantly higher risk of rVFL. An algorithm combining patient-dependent variables and the combination of two classifications improves the predictive value of the presented VFL Recurrence Risk Model.
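The Monte Carlo permutation testing mentioned in the citation context above (the authors used 1,000,000 permutations) can be illustrated with a generic two-group sketch; the group sizes echo the study's 41 recurrent and 166 non-recurrent patients, but the "marker" values are simulated and the comparison is not taken from the paper.

```python
# Sketch of a Monte Carlo permutation test such as the one described above
# (the study used 1,000,000 permutations; 20,000 are used here), on simulated
# data for a two-group comparison.
import numpy as np

rng = np.random.default_rng(7)
recurrent = rng.normal(0.6, 1.0, 41)          # hypothetical marker values, rVFL group
non_recurrent = rng.normal(0.2, 1.0, 166)

observed = recurrent.mean() - non_recurrent.mean()
pooled = np.concatenate([recurrent, non_recurrent])
n1 = recurrent.size

count = 0
n_perm = 20_000
for _ in range(n_perm):
    rng.shuffle(pooled)                       # relabel groups at random
    diff = pooled[:n1].mean() - pooled[n1:].mean()
    if abs(diff) >= abs(observed):
        count += 1
p_mc = (count + 1) / (n_perm + 1)             # add-one correction for a valid Monte Carlo p-value
print(f"observed difference = {observed:.2f}, Monte Carlo p = {p_mc:.4f}")
```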
... The interpretive power of LR is based on the statistical significance of its parameters, which is closely tied to the p-value. Nevertheless, the p-value has recently been criticized because its meaning is often misunderstood [4]. Furthermore, in imbalanced settings where the minority class is the object of interest, the parameter estimates of LR can be biased and the conditional probability of belonging to the class of interest can be underestimated [5,6]. ...
... In the present study, we survey two misinterpretations of non-significant p-values. One misinterpretation is that non-significant p-values indicate evidence for the absence of an effect (Goodman, 2008; Greenland et al., 2016). Although this interpretation occurs frequently in social science research (Greenland et al., 2016), it is unwarranted, because a non-significant p-value itself leaves inconclusive evidence. ...
Article
Full-text available
When used appropriately, non-significant p-values have the potential to further our understanding of what does not work in education, and why. When misinterpreted, they can trigger misguided conclusions, for example about the absence of an effect of an educational intervention, or about a difference in the efficacy of different interventions. We examined the frequency of non-significant p-values in recent volumes of peer-reviewed educational research journals. We also examined how frequently researchers misinterpret non-significance to imply the absence of an effect, or a difference to another significant effect. Within a random sample of 50 peer-reviewed articles, we found that of 528 statistically tested hypotheses, 253 (48%) were non-significant. Of these, 142 (56%) were erroneously interpreted to indicate the absence of an effect, and 59 (23%) to indicate a difference to another significant effect. For 97 (38%) of non-significant results, such misinterpretations were linked to potentially misguided implications for educational theory, practice, or policy. We outline valid ways for dealing with non-significant p-values to improve their utility for education, discussing potential reasons for these misinterpretations and implications for research.
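One valid way to address the "absence of an effect" question raised above, rather than reading it off a non-significant p-value, is an equivalence test. The sketch below implements two one-sided tests (TOST) against a pre-specified margin; the simulated data, the ±0.2 margin, and the simple pooled degrees of freedom are assumptions for illustration, not part of the article.

```python
# Sketch of one valid way to address "absence of an effect" instead of
# misreading a non-significant p-value: two one-sided tests (TOST) against a
# pre-specified equivalence margin. Data and the +/-0.2 margin are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
intervention = rng.normal(0.02, 1.0, 150)     # hypothetical standardized outcomes
control = rng.normal(0.00, 1.0, 150)
margin = 0.2                                   # smallest effect size of interest

diff = intervention.mean() - control.mean()
se = np.sqrt(intervention.var(ddof=1) / intervention.size +
             control.var(ddof=1) / control.size)
df = intervention.size + control.size - 2      # simple (non-Welch) approximation

p_lower = stats.t.sf((diff + margin) / se, df)  # H0: diff <= -margin
p_upper = stats.t.cdf((diff - margin) / se, df) # H0: diff >= +margin
p_tost = max(p_lower, p_upper)
print(f"difference = {diff:.3f}, TOST p = {p_tost:.4f} "
      f"({'equivalent' if p_tost < 0.05 else 'inconclusive'} within +/-{margin})")
```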
... However, understanding why p-values rank among the most ubiquitous but misunderstood statistical metrics 24,25 requires first understanding the history of their development. Thus, in what follows, the key approaches are briefly introduced chronologically, beginning with Fisher's tests of significance and Neyman and Pearson's tests of statistical hypotheses 26 developed in the 1920s. ...
Article
Full-text available
Nursing should rely on the best evidence, but nurses often struggle with statistics, impeding research integration into clinical practice. Statistical significance, a key concept in classical statistics, and its primary metric, the p-value, are frequently misused. This topic has been debated in many disciplines but rarely in nursing. The aim is to present key arguments in the debate surrounding the misuse of p-values, discuss their relevance to nursing, and offer recommendations to address them. The literature indicates that the concept of probability in classical statistics is not easily understood, leading to misinterpretations of statistical significance. A substantial portion of the critique concerning p-values arises from such misunderstandings and imprecise terminology. Consequently, some scholars have argued for the complete abandonment of p-values. Instead of discarding p-values, this article provides a more comprehensive account of their historical context and the information they convey. This will clarify why they are widely used yet often misunderstood. Additionally, the article offers recommendations for accurate interpretation of statistical significance by incorporating other key metrics. To mitigate publication bias resulting from p-value misuse, pre-registering the analysis plan is recommended. The article also explores alternative approaches, particularly Bayes factors, as they may resolve several of these issues. P-values serve a purpose in nursing research as an initial safeguard against the influence of randomness. Much criticism directed towards p-values arises from misunderstandings and inaccurate terminology. Several considerations and measures are recommended, some of which go beyond the conventional, to obtain accurate p-values and to better understand statistical significance. Nurse educators and researchers should consider these in their educational and research reporting practices.
... Generally, the standard p-value does not belong to any formal system of statistical inference. As a consequence, alternatives to the standard (classic) p-value include confidence intervals (CIs), statistical effect sizes (ESs), Bayes factors, and exploratory data analysis (EDA) [37,38]. In this study, the EDA methods were considered for the evaluation of the systematic relations between variables and groups. ...
Article
Full-text available
It is well known that platinum-based antineoplastic agents, including cisplatin (CP), have side effects that limit their use. Nephrotoxicity, neurotoxicity, and hemolytic anemia are the most common side effects. There are few studies on the reduction in these effects that involve nanoencapsulation; however, almost none involve cyclodextrins (CDs). Changes in the hematological and biochemical parameters of healthy Wistar rats treated with solutions of γ-cyclodextrin/resveratrol/cisplatin (γ-CD/Rv/CP) ternary complexes are investigated for the first time. They are intraperitoneally injected with γ-CD/Rv/CP solutions containing 5 mg CP/kg b.w. Single shots were administered to six groups of Wistar rats (six individuals per group) using γ-CD/Rv/CP, γ-CD/CP, and γ-CD/Rv complexes, as well as positive- and negative-control groups, respectively. Thirty-two hematological and biochemical parameters were evaluated from blood samples and used as input variables for the principal component analysis (PCA) discrimination of the groups. The best protection was obtained for the γ-CD/Rv/CP ternary complex, which yielded biochemical values closer to those of the control group. These values differed significantly from those of the γ-CD/CP treated group, especially for the IP, UA, and T-Pro kidney-related biochemical parameters. This finding proves the beneficial influence of Rv during CP administration through CD-based carriers.
... With respect to the interpretation of both single trials and meta-analyses, it is important to emphasize that the statistical significance of findings is not necessarily equivalent to clinical relevance, while statistical non-significance is not equivalent to clinical irrelevance [79][80] . The clinical relevance of an intervention may not be represented accurately when conclusions are based solely on statistical P-values [81][82][83] . For example, the information on statistical significance provided by P-values is highly dependent on sample sizes. ...
Article
Full-text available
The hypothesis that some children with attention-deficit hyperactivity disorder (ADHD) may show sensitivity or allergic reactions to various food items has led to the development of the few-foods (or oligoantigenic) diet. The rationale of the diet is to eliminate certain foods from the diet in order to exclude potential allergens contained either naturally in food or in artificial ingredients with allergenic properties. The oligoantigenic diet attempts to identify individual foods to which a person might be sensitive. First, ADHD symptoms are monitored while multiple foods are excluded from the diet. Subsequently, if symptoms remit, foods are re-introduced, while observing the individual for the return of symptoms. An advantage of the oligoantigenic diet is that it can be tailored to the individual. A growing body of evidence suggests that behavioral symptoms of subgroups of children with ADHD may benefit from the elimination of certain foods. The effect sizes of an oligoantigenic diet regarding improvement of ADHD symptoms have been found to be medium to large. Available evidence suggests that the investigation of the role of food hypersensitivities in ADHD is a promising avenue worthy of further exploration. Further large-scale, randomized controlled studies including assessment of long-term outcome are therefore warranted.
... Traditionally, a p-value < 0.05 has been considered a reasonable significance level, but in some fields such as genomic studies, more stringent levels are adopted. The criticism of this method is extensive across fields of research, [57][58][59][60] including technical limitations such as the difference between the characteristics of the scientific data as opposed to the assumptions upon which the significance tests are defined, sampling issues regarding size and randomness, the arbitrary level of significance, the dichotomous reject/not-reject, the misinterpretation of p-values and the lack of reproducibility. ...
Article
Full-text available
The exposome complements information captured in the genome by covering all external influences and internal (biological) responses of a human being from conception onwards. Such a paradigm goes beyond a single scientific discipline and instead requires a truly interdisciplinary approach. The concept of “historical exposomics” could help bridge the gap between “nature” and “nurture” using both natural and social archives to capture the influence of humans on earth (the Anthropocene) in an interdisciplinary manner. The LuxTIME project served as a test bed for an interdisciplinary exploration of the historical exposome, focusing on the Belval area located in the Minett region in southern Luxembourg. This area evolved from a source of mineral water to steel production through to the current campus for research and development. This article explores the various possibilities of natural and social archives that were considered in creating the historical exposome of Belval and reflects upon possibilities and limitations of the current approaches in assessing the exposome using purely a natural science approach. Issues surrounding significance, visualization, and availability of material suitable to form natural archives are discussed in a critical manner. The “Minett Stories” are presented as a way of creating new historical narratives to support exposome research. New research perspectives on the history of the Anthropocene were opened by investigating the causal relationships between factual evidence and narrative evidence stemming from historical sources. The concept of historical exposome presented here may thus offer a useful conceptual framework for studying the Anthropocene in a truly interdisciplinary fashion.
Article
Full-text available
Sampling distributions are fundamental for statistical inference, yet their abstract nature poses challenges for students. This research investigates the development of high school students’ conceptions of sampling distribution through informal significance tests with the aid of digital technology. The study focuses on how technological tools contribute to forming these conceptions, guided by an emerging theory that describes this process. A workshop for high school students was organized, involving 36 participants working in pairs across four sessions, each with access to a computer. These sessions involved problem-solving activities, with the teacher introducing key concepts in the initial three sessions. The analysis, employing grounded theory, aimed to characterize the nature of students’ conceptions of sampling distribution as evident in their responses. The findings reveal a transition from empirical to informal conceptions of sampling distribution among students, facilitated by computational mediation. This transition is marked by an abstraction process that includes mathematization, processing, uncertainty/randomness, and conditional reasoning. The study underscores the role of digital simulations in teaching statistical concepts, facilitating students’ conceptual shift critical for grasping statistical inference.
Article
Objectives To investigate if a prospective feedback loop that flags older patients at risk of death can reduce non-beneficial treatment at end of life. Design Prospective stepped-wedge cluster randomised trial with usual care and intervention phases. Setting Three large tertiary public hospitals in south-east Queensland, Australia. Participants 14 clinical teams were recruited across the three hospitals. Teams were recruited based on a consistent history of admitting patients aged 75+ years, and needed a nominated lead specialist consultant. Under the care of these teams, there were 4,268 patients (median age 84 years) who were potentially near the end of life and flagged at risk of non-beneficial treatment. Intervention The intervention notified clinicians of patients under their care determined as at-risk of non-beneficial treatment. There were two notification flags: a real-time notification and an email sent to clinicians about the at-risk patients at the end of each screening day. The nudge intervention ran for 16–35 weeks across the three hospitals. Main outcome measures The primary outcome was the proportion of patients with one or more intensive care unit (ICU) admissions. The secondary outcomes examined times from patients being flagged at-risk. Results There was no improvement in the primary outcome of reduced ICU admissions (mean probability difference [intervention minus usual care] = −0.01, 95% confidence interval −0.08 to 0.01). There were no differences for the times to death, discharge, or medical emergency call. There was a reduction in the probability of re-admission to hospital during the intervention phase (mean probability difference −0.08, 95% confidence interval −0.13 to −0.03). Conclusions This nudge intervention was not sufficient to reduce the trial’s non-beneficial treatment outcomes in older hospital patients. Trial registration Australia New Zealand Clinical Trial Registry, ACTRN12619000675123 (registered 6 May 2019).
Chapter
In this chapter, we shall consider a published randomised controlled trial (RCT) involving a controversial retroactive intervention as the basis for highlighting statistical flaws that can creep into the reporting and interpretation of RCTs more generally. These contributions support the standpoint that even for an RCT, the credibility of the reported findings relies on the correct implementation and interpretation of statistical procedures. Within medicine and other health professions, this ought to complement existing training in critical appraisal and data skills by encouraging an interrogation of the more formal aspects of applying and reporting statistics, even at a relatively basic level. We shall also consider data from a large international stroke trial to assist in visualising the influence of random allocation error (as a specific form of random error) on RCT results. This will serve as a concrete example illustrating that for a well-designed RCT, by the very nature of simple randomisation, statistical significance can arise in the form of a false positive favouring the effectiveness of the intervention when the null hypothesis of no effect is true. From a philosophical perspective, we shall also consider an alternative approach to that of merely equating chance events with accidental events. This will include both recognising that such events are an inevitable product of simple randomisation and rejecting the chance-cause dichotomy.
Article
The objective of this study was to determine the optimal heat treatment and build orientation to minimize the susceptibility of additively manufactured (AM) alloy 625 to crevice corrosion. To accomplish this, metal-to-metal and acrylic-to-metal remote crevice assembly (RCA) experiments were carried out for as-made (NT) AM, stress relieved (SR) AM, solution annealed (SA) AM, and solution plus stabilization annealed (SSA) AM alloy 625 in two different build orientations. Current vs. time data from metal-to-metal RCA experiments were analyzed using commercially available statistical analysis software to perform Analysis of Variance (ANOVA). While there was a lack of statistical evidence that build orientation has an effect on crevice corrosion susceptibility, there was strong evidence that heat treatment affects crevice corrosion susceptibility. Specifically, according to Tukey’s Multiple Comparison, alloys that were heat treated had a statistically significant lower charge passed as compared to the NT specimens. This finding was consistent with measured penetration depth, where NT AM specimens had the largest maximum penetration depth. In contrast, acrylic-to-metal RCAs were used to calculate crevice corrosion current density (rate) and repassivation potential. While current densities for the AM materials were comparable, the lateral motion of the active crevice corrosion front on the NT and SR specimens was found to be slow in comparison, resulting in high damage accumulation locally. Both metal-to-metal and acrylic-to-metal RCA results are discussed within the context of non-homogenized microstructures associated with AM.
Preprint
Full-text available
Working Memory (WM) enables us to maintain and directly manipulate mental representations, yet we know little about the neural implementation of this privileged online format. We recorded EEG data as human subjects engaged in a task requiring continuous updates to the locations of objects retained in WM. Analysis of contralateral delay activity (CDA) revealed that mental representations moved across cortex in real time as their remembered locations were updated.
Article
Hypothesis testing is often used for inference in the social sciences. In particular, null hypothesis significance testing (NHST) and its p value have been ubiquitous in published research for decades. Much more recently, null hypothesis Bayesian testing (NHBT) and its Bayes factor have also started to become more commonplace in applied research. Following preliminary work by Wong and colleagues, we investigated how, and to what extent, researchers misapply the Bayes factor in applied psychological research by means of a literature study. Based on a final sample of 167 articles, our results indicate that, not unlike NHST and the p value, the use of NHBT and the Bayes factor also shows signs of misconceptions. We consider the root causes of the identified problems and provide suggestions to improve the current state of affairs. This article aims to assist researchers in drawing the best inferences possible while using NHBT and the Bayes factor in applied research.
Article
Although some data suggest that patients with mutRAS colorectal liver metastases (CRLM) may benefit from anatomic hepatectomy, this topic remains controversial. We performed a systematic review and meta-analysis to determine whether RAS mutation status was associated with prognosis relative to surgical technique [anatomic resection (AR) vs. nonanatomic resection (NAR)] among patients with CRLM. A systematic review and meta-analysis of studies were performed to investigate the association of AR versus NAR with overall and liver-specific disease-free survival (DFS and liver-specific DFS, respectively) in the context of RAS mutation status. Overall, 2018 patients (831 mutRAS vs. 1187 wtRAS) were included from five eligible studies. AR was associated with a 40% improvement in liver-specific DFS [hazard ratio (HR) = 0.6, 95% confidence interval (CI) 0.44–0.81, p = 0.01] and a 28% improvement in overall DFS (HR = 0.72, 95% CI 0.54–0.95, p = 0.02) among patients with mutRAS tumors; in contrast, AR was not associated with any improvement in liver-specific DFS or overall DFS among wtRAS patients. These differences may have been mediated by the 40% decreased incidence in R1 resection among patients with mutRAS tumors who underwent AR versus NAR [relative risk (RR): 0.6, 95% CI 0.40–0.91, p = 0.02]. In contrast, the probability of an R1 resection was not decreased among wtRAS patients who underwent AR versus NAR (RR: 0.93, 95% CI 0.69–1.25, p = 0.62). The data suggest that precision surgery may be relevant to CRLM. Specifically, rather than a parenchymal sparing dogma for all patients, AR may have a role in individuals with mutRAS tumors.
Article
Full-text available
The timing of leaf senescence in deciduous trees influences carbon uptake and the resources available for tree growth, defense, and reproduction. Therefore, simulated biosphere-atmosphere interactions and, eventually, estimates of the biospheric climate change mitigation potential are affected by the accuracy of process-oriented leaf senescence models. However, current leaf senescence models are likely to suffer from a bias towards the mean (BTM). This may lead to overly flat trends, whereby errors would increase with increasing difference from the average timing of leaf senescence, ultimately distorting model performance and projected future shifts. However, such effects of the BTM on model performance and future shifts have rarely been investigated. We analyzed >17 × 10⁶ past dates and >49 × 10⁶ future shifts of leaf senescence simulated by 21 process-oriented models that had been calibrated with >45,000 observations from Central Europe for three major European tree species. The surmised effects on model performance and future shifts occurred in all 21 models, revealing strong model-specific BTM. In general, the models performed only slightly better than a null model that just simulates the average timing of leaf senescence. While standard comparisons of model performance favored models with stronger BTM, future shifts of leaf senescence were smaller when projected by models with weaker BTM. Overall, the future shifts for 2090-2099 relative to 1990-1999 increased by an average of 13-14 days after correcting for the BTM. In conclusion, the BTM substantially affects simulations by state-of-the-art leaf senescence models, which compromises model comparisons and distorts projections of future shifts. Smaller shifts result from flatter trends associated with stronger BTM. Therefore, smaller shifts according to models with weaker BTM illustrate the considerable uncertainty in current leaf senescence projections. It is likely that state-of-the-art projections of future biosphere behavior under global change are distorted by erroneous leaf senescence models. Keywords: compromised model comparisons, distorted model accuracy, distorted projections, mean bias, Nash-Sutcliffe efficiency, phenological difference, root mean squared error
Article
Full-text available
Background Bayesian statistical approaches are extensively used in new statistical methods but have not been adopted at the same rate in clinical and translational (C&T) research. The goal of this paper is to accelerate the transition of new methods into practice by improving the C&T researcher’s ability to gain confidence in interpreting and implementing Bayesian analyses. Methods We developed a Bayesian data analysis plan and implemented that plan for a two-arm clinical trial comparing the effectiveness of a new opioid in reducing time to discharge from the post-operative anesthesia unit and nerve block usage in surgery. Through this application, we offer a brief tutorial on Bayesian methods and exhibit how to apply four Bayesian statistical packages from STATA, SAS, and RStan to conduct linear and logistic regression analyses in clinical research. Results The analysis results in our application were robust to statistical package and consistent across a wide range of prior distributions. STATA was the most approachable package for linear regression but was more limited in the models that could be fitted and easily summarized. SAS and R offered more straightforward documentation and data management for the posteriors. They also offered direct programming of the likelihood making them more easily extendable to complex problems. Conclusion Bayesian analysis is now accessible to a broad range of data analysts and should be considered in more C&T research analyses. This will allow C&T research teams the ability to adopt and interpret Bayesian methodology in more complex problems where Bayesian approaches are often needed.
Article
Null hypothesis significance testing (NHST) is the default approach to statistical analysis and reporting in marketing and the biomedical and social sciences more broadly. Despite its default role, NHST has long been criticized by both statisticians and applied researchers including those within marketing. Therefore, the authors propose a major transition in statistical analysis and reporting. Specifically, they propose moving beyond binary: abandoning NHST as the default approach to statistical analysis and reporting. To facilitate this, they briefly review some of the principal problems associated with NHST. They next discuss some principles that they believe should underlie statistical analysis and reporting. They then use these principles to motivate some guidelines for statistical analysis and reporting. They next provide some examples that illustrate statistical analysis and reporting that adheres to their principles and guidelines. They conclude with a brief discussion.
Article
BACKGROUND The p value has been criticized for an oversimplified determination of whether a treatment effect exists. One alternative is the fragility index. It is a representation of the minimum number of non-events that would need to be converted to events to increase the p value above 0.05. OBJECTIVE To determine the fragility index of randomized controlled trials assessing the efficacy of interventions for patients with diverticular disease since 2010 to assess the robustness of current evidence. DATA SOURCES MEDLINE, Embase, and CENTRAL were searched from inception to August 2022. STUDY SELECTION Articles were eligible for inclusion if they were randomized trials conducted between 2010 and 2022 with parallel, superiority designs evaluating interventions in patients with diverticular disease. Only randomized trials with dichotomous primary outcomes with an associated p-value of less than 0.05 were considered for inclusion. INTERVENTION(S) Any surgical or medical intervention for patients with diverticular disease. MAIN OUTCOME MEASURES The fragility index was determined by adding events and subtracting non-events from the groups with the smaller number of events. Events were added until the p-value exceeded 0.05. The smallest number of events required was considered the fragility index. RESULTS After screening 1,271 citations, 15 randomized trials met inclusion criteria. Nine of the studies evaluated surgical interventions and six evaluated medical interventions. The mean number of patients randomized and lost to follow-up per RCT was 92 (SD 35.3) and 9 (SD 11.4), respectively. The median fragility index was 1 (range: 0-5). The fragility indices for the included studies did not correlate significantly with any study characteristics. LIMITATIONS Small sample, heterogeneity, and lack of inclusion of studies with continuous outcomes. CONCLUSIONS The randomized trials evaluating surgical and medical interventions for diverticular disease are not robust. Changing a single outcome event in most studies was sufficient to make a statistically significant study finding non-significant.
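The fragility-index calculation described in this abstract can be illustrated with a short script. The sketch below is not taken from the review itself: it assumes a two-arm trial summarised as events and non-events per arm, uses Fisher's exact test as the significance test (a common choice, though individual trials may have used other tests), and the example counts are invented.

```python
# A sketch of the fragility-index calculation described above, assuming a
# two-arm trial summarised as events/non-events per arm and using Fisher's
# exact test as the significance test. The example counts are invented.
from scipy.stats import fisher_exact

def fragility_index(events_a, nonevents_a, events_b, nonevents_b, alpha=0.05):
    # Work on the arm with fewer events: convert non-events to events there,
    # one at a time, until the p-value rises above alpha.
    if events_a <= events_b:
        ev, ne, other = events_a, nonevents_a, [events_b, nonevents_b]
    else:
        ev, ne, other = events_b, nonevents_b, [events_a, nonevents_a]

    index = 0
    _, p = fisher_exact([[ev, ne], other])
    while p < alpha and ne > 0:
        ev, ne, index = ev + 1, ne - 1, index + 1
        _, p = fisher_exact([[ev, ne], other])
    return index if p >= alpha else None   # None: significance never lost

# Example: 2/50 vs 10/50 events is significant with Fisher's exact test;
# converting one or two non-events to events in the low-event arm may
# already push the p-value above 0.05.
print(fragility_index(2, 48, 10, 40))
```

A median fragility index of 1, as reported above, means that for the typical included trial a single changed outcome would have erased statistical significance.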
Chapter
Statistical hypothesis testing is among the most misunderstood quantitative analysis methods in data science, despite its seeming simplicity. Having originated from statistics, hypothesis testing has complex interdependencies between its procedural components, which makes it hard to thoroughly comprehend. In this chapter, we discuss the underlying logic behind statistical hypothesis testing and the formal meaning of its components and their connections. Furthermore, we discuss some examples of hypothesis tests.
Article
Optimal management of cancer patients relies heavily on late-phase oncology randomized controlled trials. A comprehensive understanding of the key considerations in designing and interpreting late-phase trials is crucial for improving subsequent trial design, execution, and clinical decision-making. In this review, we explore important aspects of late-phase oncology trial design. We begin by examining the selection of primary endpoints, including the advantages and disadvantages of using surrogate endpoints. We address the challenges involved in assessing tumor progression and discuss strategies to mitigate bias. We define informative censoring bias and its impact on trial results, including illustrative examples of scenarios that may lead to informative censoring. We highlight the traditional roles of the log-rank test and hazard ratio in survival analyses, along with their limitations in the presence of nonproportional hazards as well as an introduction to alternative survival estimands, such as restricted mean survival time or MaxCombo. We emphasize the distinctions between the design and interpretation of superiority and noninferiority trials, and compare Bayesian and frequentist statistical approaches. Finally, we discuss appropriate utilization of phase II and phase III trial results in shaping clinical management recommendations and evaluate the inherent risks and benefits associated with relying on phase II data for treatment decisions.
Preprint
Full-text available
We prove that the probability of "A OR B", where A and B are events or hypotheses that are mutually (recursively) dependent, is given by a "Hyperbolic Sum Rule" (HSR) relation isomorphic to the hyperbolic-tangent double-angle formula, and that this HSR is Maximum Entropy (MaxEnt). The possibility of mutual recursion is excluded by the "Conventional Sum Rule" (CSR) for probabilities, which we also prove MaxEnt albeit within its narrower domain of applicability. We show when the HSR and CSR are respectively applicable. The concatenation property of the HSR is exploited to enable analytical, consistent and scalable calculations for multiple recursive hypotheses; calculations not conveniently available using the CSR. We also show that it is as reasonable to state that "probability is physical" (it is not merely a mathematical construct) as it is to state that "information is physical" (now recognised as a truism of communications network engineering). We relate this treatment to the physics of Quantitative Geometrical Thermodynamics which is defined in complex hyperbolic (Minkowski) spacetime and show how the HSR is isomorphic to other physical quantities. This is a substantially revised version (4th Sept 2023) of the paper "A hyperbolic sum rule for probability: solving the recursive 'Chicken & Egg' problem"
Article
Full-text available
Bayesian statistics, a currently controversial viewpoint concerning statistical inference, is based on a definition of probability as a particular measure of the opinions of ideally consistent people. Statistical inference is modification of these opinions in the light of evidence, and Bayes' theorem specifies how such modifications should be made. The tools of Bayesian statistics include the theory of specific distributions and the principle of stable estimation, which specifies when actual prior opinions may be satisfactorily approximated by a uniform distribution. A common feature of many classical significance tests is that a sharp null hypothesis is compared with a diffuse alternative hypothesis. Often evidence which, for a Bayesian statistician, strikingly supports the null hypothesis leads to rejection of that hypothesis by standard classical procedures. The likelihood principle emphasized in Bayesian statistics implies, among other things, that the rules governing when data collection stops are irrelevant to data interpretation. It is entirely appropriate to collect data until a point has been proven or disproven, or until the data collector runs out of time, money, or patience.
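The point about stopping rules can be made concrete with a small simulation, sketched below under assumed settings (2,000 simulated studies, a significance test after every batch of ten observations): for p-values, unlike for likelihood-based evidence, testing repeatedly and stopping at the first "significant" result inflates the false-positive rate well above the nominal level.

```python
# A small simulation (assumed settings, not from the paper) of why stopping
# rules matter for p-values even though, under the likelihood principle,
# they are irrelevant to the evidence: testing after every batch of ten
# observations and stopping at the first p < .05 inflates the false-positive
# rate well above the nominal 5% even though the null hypothesis is true.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
n_sims, batch, max_looks, alpha = 2000, 10, 10, 0.05

false_positives = 0
for _ in range(n_sims):
    data = np.empty(0)
    for _ in range(max_looks):
        data = np.concatenate([data, rng.normal(0.0, 1.0, batch)])  # H0 true
        if ttest_1samp(data, 0.0).pvalue < alpha:
            false_positives += 1
            break

print(f"False-positive rate with optional stopping: {false_positives / n_sims:.2f} "
      f"(nominal alpha = {alpha})")
```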
Article
Full-text available
This commentary reviews the arguments for and against the use of p-values put forward in the Journal and other forums, and shows that they are all missing both a measure and concept of "evidence." The mathematics and logic of evidential theory are presented, with the log-likelihood ratio used as the measure of evidence. The profoundly different philosophy behind evidential methods (as compared to traditional ones) is presented, as well as a comparative example showing the difference between the two approaches. The reasons why we mistakenly ascribe evidential meaning to p-values and related measures are discussed. Unfamiliarity with the technology and philosophy of evidence is seen as the main reason why certain arguments about p-values persist, and why they are frequently contradictory and confusing.
Article
Full-text available
Tests of statistical significance are often used by investigators in reporting the results of clinical research. Although such tests are useful tools, the significance levels are not appropriate indices of the size or importance of differences in outcome between treatments. Lack of "statistical significance" can be misinterpreted in small studies as evidence that no important difference exists. Confidence intervals are important but underused supplements to tests of significance for reporting the results of clinical investigations. Their usefulness is discussed here, and formulas are presented for calculating confidence intervals with types of data commonly found in clinical trials.
Article
Full-text available
Conventional interpretation of clinical trials relies heavily on the classic p value. The p value, however, represents only a false-positive rate, and does not tell the probability that the investigator's hypothesis is correct, given his observations. This more relevant posterior probability can be quantified by an extension of Bayes' theorem to the analysis of statistical tests, in a manner similar to that already widely used for diagnostic tests. Reanalysis of several published clinical trials according to Bayes' theorem shows several important limitations of classic statistical analysis. Classic analysis is most misleading when the hypothesis in question is already unlikely to be true, when the baseline event rate is low, or when the observed differences are small. In such cases, false-positive and false-negative conclusions occur frequently, even when the study is large, when interpretation is based solely on the p value. These errors can be minimized if revised policies for analysis and reporting of clinical trials are adopted that overcome the known limitations of classic statistical theory with applicable bayesian conventions.
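The diagnostic-test analogy described in this abstract can be sketched in a few lines: treat a "significant" result as a positive test whose sensitivity is the study's power and whose false-positive rate is alpha, then apply Bayes' theorem to a prior probability that the hypothesis is true. The numbers below are illustrative assumptions, not a reanalysis of any of the trials discussed in the paper.

```python
# A sketch of the diagnostic-test analogy described above (illustrative
# numbers only): a "significant" result is treated like a positive test
# whose sensitivity is the study's power and whose false-positive rate is
# alpha; Bayes' theorem converts the prior probability that the hypothesis
# is true into a posterior probability.
def posterior_given_significant(prior, power=0.80, alpha=0.05):
    true_positive = prior * power          # hypothesis true and detected
    false_positive = (1 - prior) * alpha   # hypothesis false, yet "significant"
    return true_positive / (true_positive + false_positive)

# An implausible hypothesis (prior 0.05) still ends up below 50% after a
# significant result; more plausible hypotheses fare much better.
for prior in (0.05, 0.25, 0.50):
    print(f"prior = {prior:.2f} -> posterior = {posterior_given_significant(prior):.2f}")
```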
Article
Full-text available
Given an observed test statistic and its degrees of freedom, one may compute the observed P value with most statistical packages. It is unknown to what extent test statistics and P values are congruent in published medical papers. We checked the congruence of statistical results reported in all the papers of volumes 409-412 of Nature (2001) and a random sample of 63 results from volumes 322-323 of BMJ (2001). We also tested whether the frequencies of the last digit of a sample of 610 test statistics deviated from a uniform distribution (i.e., equally probable digits). 11.6% (21 of 181) and 11.1% (7 of 63) of the statistical results published in Nature and BMJ respectively during 2001 were incongruent, probably mostly due to rounding, transcription, or type-setting errors. At least one such error appeared in 38% and 25% of the papers of Nature and BMJ, respectively. In 12% of the cases, the significance level might change one or more orders of magnitude. The frequencies of the last digit of statistics deviated from the uniform distribution and suggested digit preference in rounding and reporting. This incongruence of test statistics and P values is another example that statistical practice is generally poor, even in the most renowned scientific journals, and that quality of papers should be more controlled and valued.
Article
Full-text available
There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
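The core of this argument can be reproduced with the basic positive-predictive-value calculation (leaving aside the essay's extensions for bias and multiple competing teams): the post-study probability that a "significant" claim is true depends on the pre-study odds R of true to null relationships in the field, power, and alpha. The values of R in the sketch below are illustrative.

```python
# A sketch of the essay's basic positive-predictive-value calculation
# (without its extensions for bias and multiple teams): the probability
# that a "significant" claim is true depends on the pre-study odds R of
# true to null relationships, power, and alpha. Values of R are illustrative.
def prob_claim_true(R, power=0.80, alpha=0.05):
    return (power * R) / (power * R + alpha)

# In an exploratory field probing mostly null relationships (R = 1:100),
# most significant findings are false even with 80% power.
for R in (1 / 100, 1 / 10, 1 / 2, 1.0):
    print(f"R = {R:.3f} -> probability a significant claim is true = {prob_claim_true(R):.2f}")
```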
Article
This paper reviews the research relating Logo, as a programming language, to the learning and teaching of secondary mathematics. The paper makes a clear distinction between Logo used as a programming language, and Logo used in the terms originally put forth by Papert (1980). The main thesis is that research studies in this area often require that the experimental group learn Logo as well as the regular mathematical content that the control group is learning. Thus the Logo group is often learning more. A result of no significant difference can therefore be more important than a casual glance would indicate.
Article
In a 1935 paper and in his book Theory of Probability, Jeffreys developed a methodology for quantifying the evidence in favor of a scientific theory. The centerpiece was a number, now called the Bayes factor, which is the posterior odds of the null hypothesis when the prior probability on the null is one-half. Although there has been much discussion of Bayesian hypothesis testing in the context of criticism of P-values, less attention has been given to the Bayes factor as a practical tool of applied statistics. In this article we review and discuss the uses of Bayes factors in the context of five scientific applications in genetics, sports, ecology, sociology, and psychology. We emphasize the following points:
Article
The prime object of this book is to put into the hands of research workers, and especially of biologists, the means of applying statistical tests accurately to numerical data accumulated in their own laboratories or available in the literature.
Article
I read with interest Joe Fleiss' Letter to the Editor of this journal [18:394 (1987)], entitled "Some Thoughts on Two-Tailed Tests." Certainly Fleiss is one of the more respected members and practitioners in our profession and I respect his opinion. However, I do not believe that he gives sufficient treatment of the one-tailed versus two-tailed issue. I, for one, believe that there are many situations where a one-tailed test is the appropriate test, provided that hypothesis testing itself is meaningful to do. On this point, I shall only state that fundamentally I believe that hypothesis testing is appropriate only in those trials designed to provide a confirmatory answer to a medically and/or scientifically important question. Otherwise, at the analysis stage, one is either in an estimation mode or a hypothesis-generating mode. I should underscore that helping to define the question at the protocol development stage is one of the most important contributions the statistician can make. The question to be answered is the research objective and, I believe most would agree, should be the alternative hypothesis (Ha) in the hypothesis testing framework, the idea being that contradicting the null hypothesis (H0) is seen as evidence in support of the research objective. So one point in support of one-sided tests is that if the question the research is directed toward is unidirectional, then significance tests should be one-sided. A second point is that we should have internal consistency in our construction of the alternative hypothesis. An example of what I mean here is the dose response or dose comparison trial. Few (none that I have asked) statisticians would disagree that dose response as a research objective, captured in the hypothesis specification framework, is Ha: μp ≤ μd1 ≤ μd2, where for simplicity I have assumed that there are two doses, d1 and d2, of the test drug, a placebo (p) control, and that the effect of drug is expected to be nondecreasing. If this is the case and if for some reason the research is conducted in the absence of the d1 group, then why would Ha: μp ≤ μd2 become Ha: μp ≠ μd2? A third point is that if the trial is designed to be confirmatory, then the alternative cannot be two-sided and still be logical. I believe this holds for positive controlled trials as well as for placebo controlled trials. However, it is more likely to gain broader acceptance for placebo controlled trials. To elaborate, suppose the (confirmatory) trial has one dose group and one placebo group and a two-sided alternative was seen as appropriate at the design stage. Suppose further that the analyst is masked and that the only results known are the F statistic and corresponding p value (which is significant). The analyst then has to search for where the difference lies, a situation no different from those that invoke multiple range tests. That search fundamentally precludes a confirmatory conclusion even if the direction favors the drug.
Article
The Fisher and Neyman-Pearson approaches to testing statistical hypotheses are compared with respect to their attitudes to the interpretation of the outcome, to power, to conditioning, and to the use of fixed significance levels. It is argued that despite basic philosophical differences, in their main practical aspects the two theories are complementary rather than contradictory and that a unified approach is possible that combines the best features of both. As applications, the controversies about the Behrens-Fisher problem and the comparison of two binomials (2 × 2 tables) are considered from the present point of view.
Article
The problem of testing a point null hypothesis (or a “small interval” null hypothesis) is considered. Of interest is the relationship between the P value (or observed significance level) and conditional and Bayesian measures of evidence against the null hypothesis. Although one might presume that a small P value indicates the presence of strong evidence against the null, such is not necessarily the case. Expanding on earlier work [especially Edwards, Lindman, and Savage (1963) and Dickey (1977)], it is shown that actual evidence against a null (as measured, say, by posterior probability or comparative likelihood) can differ by an order of magnitude from the P value. For instance, data that yield a P value of .05, when testing a normal mean, result in a posterior probability of the null of at least .30 for any objective prior distribution. (“Objective” here means that equal prior weight is given the two hypotheses and that the prior is symmetric and nonincreasing away from the null; other definitions of “objective” will be seen to yield qualitatively similar results.) The overall conclusion is that P values can be highly misleading measures of the evidence provided by the data against the null hypothesis.
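The magnitude of the discrepancy described in this abstract can be illustrated numerically. The sketch below is an illustration rather than the paper's own bound of at least .30 over its class of "objective" priors: it uses two standard lower bounds on the Bayes factor implied by a two-sided p-value for a normal mean, exp(−z²/2) (alternative placed at the maximum-likelihood estimate) and the later −e·p·ln(p) calibration, and converts each into a posterior probability of the null under equal prior odds.

```python
# A sketch (illustration only, not the paper's symmetric-prior bound of
# about .30) of two standard lower bounds on the Bayes factor implied by a
# two-sided p-value for a normal mean, converted to a posterior probability
# of the null under equal prior odds: exp(-z^2/2) places the alternative at
# the maximum-likelihood estimate, and -e*p*ln(p) is the later
# Sellke-Bayarri-Berger calibration (valid for p < 1/e).
import math
from scipy.stats import norm

def lower_bounds(p):
    z = norm.ppf(1 - p / 2)                  # two-sided z corresponding to p
    bf_mle = math.exp(-z * z / 2)            # alternative fixed at the MLE
    bf_cal = -math.e * p * math.log(p)       # -e * p * ln(p) calibration
    to_posterior = lambda bf: bf / (1 + bf)  # P(H0 | data) at prior odds 1:1
    return bf_mle, to_posterior(bf_mle), bf_cal, to_posterior(bf_cal)

for p in (0.05, 0.01):
    bf1, post1, bf2, post2 = lower_bounds(p)
    print(f"p = {p}: BF >= {bf1:.3f} (P(H0|data) >= {post1:.2f}); "
          f"BF >= {bf2:.3f} (P(H0|data) >= {post2:.2f})")
```

Either way the posterior probability of the null at p = 0.05 is far larger than the 5% figure that the p-value is often, wrongly, taken to imply.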
Article
The Empire of Chance tells how quantitative ideas of chance transformed the natural and social sciences, as well as daily life over the last three centuries. A continuous narrative connects the earliest application of probability and statistics in gambling and insurance to the most recent forays into law, medicine, polling and baseball. Separate chapters explore the theoretical and methodological impact in biology, physics and psychology. Themes recur - determinism, inference, causality, free will, evidence, the shifting meaning of probability - but in dramatically different disciplinary and historical contexts. In contrast to the literature on the mathematical development of probability and statistics, this book centres on how these technical innovations remade our conceptions of nature, mind and society. Written by an interdisciplinary team of historians and philosophers, this readable, lucid account keeps technical material to an absolute minimum. It is aimed not only at specialists in the history and philosophy of science, but also at the general reader and scholars in other disciplines.
Article
In some comparisons - for example, between two means or two proportions - there is a choice between two sided or one sided tests of significance (all comparisons of three or more groups are two sided). This is the eighth in a series of occasional notes on medical statistics. When we use a test of significance to compare two groups we usually start with the null hypothesis that there is no difference between the populations from which the data come. If this hypothesis is not true the alternative hypothesis must be true - that there is a difference. The null hypothesis specifies no direction for the difference, and neither does the alternative hypothesis, so we have a two sided test. In a one sided test the alternative hypothesis does specify a direction - for example, that an active treatment is better than a placebo. This is sometimes justified by saying that we are not interested in the possibility that the active treatment is worse than no treatment. This possibility is still part of the test; it is part of the null hypothesis, which now states that the difference in the population is zero or in favour of the placebo. A one sided test is sometimes appropriate. Luthra et al investigated the effects of laparoscopy and hydrotubation on the fertility of women presenting at an infertility clinic.1 After some months laparoscopy was carried out on those who had still not conceived. These women were then observed for several further months and some of these women also conceived. The conception rate in the period before laparoscopy was compared with that afterwards. The less fertile a woman is the longer it is likely to take her to conceive. Hence, the women who had the laparoscopy should have a lower conception rate (by an unknown amount) than the larger group who entered the study, because the more fertile women had conceived before their turn for laparoscopy came. To see whether laparoscopy increased fertility, Luthra et al tested the null hypothesis that the conception rate after laparoscopy was less than or equal to that before. The alternative hypothesis was that the conception rate after laparoscopy was higher than that before. A two sided test was inappropriate because if the laparoscopy had no effect on fertility the conception rate after laparoscopy was expected to be lower. One sided tests are not often used, and sometimes they are not justified. Consider the following example. Twenty five patients with breast cancer were given radiotherapy treatment of 50 Gy in fractions of 2 Gy over 5 weeks.2 Lung function was measured initially, at one week, at three months, and at one year. The aim of the study was to see whether lung function was lowered following radiotherapy. Some of the results are shown in the table, the forced vital capacity being compared between the initial and each subsequent visit using one sided tests. The direction of the one sided tests was not specified, but it may appear reasonable to test the alternative hypothesis that forced vital capacity decreases after radiotherapy, as there is no reason to suppose that damage to the lungs would increase it. The null hypothesis is that forced vital capacity does not change or increases. If the forced vital capacity increases, this is consistent with the null hypothesis, and the more it increases the more consistent the data are with the null hypothesis. Because the differences are not all in the same direction, at least one P value should be greater than 0.5.
What has been done here is to test the null hypothesis that forced vital capacity does not change or decreases from visit 1 to visit 2 (one week), and to test the null hypothesis that it does not change or increases from visit 1 to visit 3 (three months) or visit 4 (one year). These authors seem to have carried out one sided tests in both directions for each visit and then taken the smaller probability. If there is no difference in the population the probability of getting a significant difference by this approach is 10%, not 5% as it should be. The chance of a spurious significant difference is doubled. Two sided tests should be used, which would give probabilities of 0.26, 0.064, and 0.38, and no significant differences. In general a one sided test is appropriate when a large difference in one direction would lead to the same action as no difference at all. Expectation of a difference in a particular direction is not adequate justification. In medicine, things do not always work out as expected, and researchers may be surprised by their results. For example, Galloe et al found that oral magnesium significantly increased the risk of cardiac events, rather than decreasing it as they had hoped.3 If a new treatment kills a lot of patients we should not simply abandon it; we should ask why this happened. Two sided tests should be used unless there is a very good reason for doing otherwise. If one sided tests are to be used the direction of the test must be specified in advance. One sided tests should never be used simply as a device to make a conventionally non-significant difference significant.
References
1. Luthra P, Bland JM, Stanton SL. Incidence of pregnancy after laparoscopy and hydrotubation. BMJ 1982;284:1013.
2. Lund MB, Myhre KI, Melsom H, Johansen B. The effect on pulmonary function of tangential field technique in radiotherapy for carcinoma of the breast. Br J Radiol 1991;64:520-3.
3. Galloe AM, Rasmussen HS, Jorgensen LN, Aurup P, Balslov S, Cintin C, Graudal N, McNair P. Influence of oral magnesium supplementation on cardiac events among survivors of an acute myocardial infarction. BMJ 1993;307:585-7.
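The 10% figure in the note above can be checked with a small simulation, sketched below under assumed settings (10,000 simulated studies of 25 observations each, true mean zero): for a symmetric test the smaller of the two one-sided p-values is half the two-sided p-value, so declaring significance whenever it falls below 0.05 fires about 10% of the time under the null.

```python
# A small simulation (assumed settings) of the practice criticised in the
# note above: carrying out one-sided tests in both directions and keeping
# the smaller p-value. For a symmetric test the smaller one-sided p equals
# half the two-sided p, so a "5%" rule applied to it fires about 10% of the
# time when the null hypothesis is true.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
n_sims, n = 10_000, 25
hits = 0
for _ in range(n_sims):
    x = rng.normal(0.0, 1.0, n)           # null is true: mean really 0
    p_two_sided = ttest_1samp(x, 0.0).pvalue
    p_min_one_sided = p_two_sided / 2     # smaller of the two one-sided p-values
    hits += p_min_one_sided < 0.05

print(f"False-positive rate when taking the smaller one-sided p: {hits / n_sims:.3f}")
```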
Article
An abstract is unavailable. This article is available as HTML full text and PDF.
Article
This paper concerns interim analysis in clinical trials involving two treatments from the points of view of both classical and Bayesian inference. I criticize classical hypothesis testing in this setting and describe and recommend a Bayesian approach in which sampling stops when the probability that one treatment is the better exceeds a specified value. I consider application to normal sampling analysed in stages and evaluate the gain in average sample number as a function of the number of interim analyses.
Article
For both P-values and confidence intervals, an alpha level is chosen to set limits of acceptable probability for the role of chance in the observed distinctions. The level of alpha is used either for direct comparison with a single P-value, or for determining the extent of a confidence interval. "Statistical significance" is proclaimed if the calculations yield a P-value that is below alpha, or a 1-alpha confidence interval whose range excludes the null result of "no difference." Both the P-value and confidence-interval methods are essentially reciprocal, since they use the same principles of probabilistic calculation; and both can yield distorted or misleading results if the data do not adequately conform to the underlying mathematical requirements. The major scientific disadvantage of both methods is that their "significance" is merely an inference derived from principles of mathematical probability, not an evaluation of substantive importance for the "big" or "small" magnitude of the observed distinction. The latter evaluation has not received adequate attention during the emphasis on probabilistic decisions; and careful principles have not been developed either for the substantive reasoning or for setting appropriate boundaries for "big" or "small." After a century of "significance" inferred exclusively from probabilities, a basic scientific challenge is to develop methods for deciding what is substantively impressive or trivial.
Article
This article has no abstract; the first 100 words appear below. Many medical researchers believe that it would be fruitless to submit for publication any paper that lacks statistical tests of significance. Their belief is not ill founded: editors and referees commonly rely on tests of significance as indicators of a sophisticated and meaningful statistical analysis, as well as the primary means to assess sampling variability in a study. The preoccupation with significance tests is embodied in the focus on whether the P value is less than 0.05; results are considered "significant" or "not significant" according to whether the P value is less than or greater than 0.05. . . . Kenneth J. Rothman, Dr.P.H., Harvard School of Public Health, Boston, MA 02115. *Freiman JA, Chalmers TC, Smith HS Jr, et al: The importance of beta, the Type II error and sample size in the design and interpretation of the randomized controlled trial: survey of 71 "negative" trials. N Engl J Med 299:690-694, 1978
Article
‘Statistical significance’ is commonly tested in biologic research when the investigator has found an impressive difference in two groups of animals or people. If the groups are relatively small, the investigator (or a critical reviewer) becomes worried about a statistical problem. Although the observed difference in the means or percentages is large enough to be biologically (or clinically) significant, do the groups contain enough members for the numerical differences to be ‘statistically significant’?
Article
This paper reviews the role of statistics in causal inference. Special attention is given to the need for randomization to justify causal inferences from conventional statistics, and the need for random sampling to justify descriptive inferences. In most epidemiologic studies, randomization and random sampling play little or no role in the assembly of study cohorts. I therefore conclude that probabilistic interpretations of conventional statistics are rarely justified, and that such interpretations may encourage misinterpretation of nonrandomized studies. Possible remedies for this problem include deemphasizing inferential statistics in favor of data descriptors, and adopting statistical techniques based on more realistic probability models than those in common use.
Article
Overemphasis on hypothesis testing--and the use of P values to dichotomise significant or non-significant results--has detracted from more useful approaches to interpreting study results, such as estimation and confidence intervals. In medical studies investigators are usually interested in determining the size of difference of a measured outcome between groups, rather than a simple indication of whether or not it is statistically significant. Confidence intervals present a range of values, on the basis of the sample data, in which the population value for such a difference may lie. Some methods of calculating confidence intervals for means and differences between means are given, with similar information for proportions. The paper also gives suggestions for graphical display. Confidence intervals, if appropriate to the type of study, should be used for major findings in both the main text of a paper and its abstract.
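As a concrete illustration of the estimation approach advocated above, the sketch below computes a t-based 95% confidence interval for a difference in means (Welch form) and a large-sample Wald interval for a difference in proportions. These are standard textbook formulas, not necessarily the exact methods presented in the paper, and the summary statistics in the example calls are invented.

```python
# A sketch of interval estimation for a difference in means (Welch t-based)
# and a difference in proportions (large-sample Wald), using standard
# textbook formulas. Example summary statistics are invented.
import math
from scipy.stats import norm, t

def ci_diff_means(m1, s1, n1, m2, s2, n2, conf=0.95):
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (s1**2 / n1 + s2**2 / n2) ** 2 / (
        (s1**2 / n1) ** 2 / (n1 - 1) + (s2**2 / n2) ** 2 / (n2 - 1))
    half_width = t.ppf(1 - (1 - conf) / 2, df) * se
    diff = m1 - m2
    return diff - half_width, diff + half_width

def ci_diff_proportions(x1, n1, x2, n2, conf=0.95):
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    half_width = norm.ppf(1 - (1 - conf) / 2) * se
    diff = p1 - p2
    return diff - half_width, diff + half_width

print(ci_diff_means(10.2, 3.1, 40, 8.9, 2.8, 42))   # difference in means
print(ci_diff_proportions(30, 100, 18, 100))        # difference in proportions
```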
Article
It is not generally appreciated that the p value, as conceived by R. A. Fisher, is not compatible with the Neyman-Pearson hypothesis test in which it has become embedded. The p value was meant to be a flexible inferential measure, whereas the hypothesis test was a rule for behavior, not inference. The combination of the two methods has led to a reinterpretation of the p value simultaneously as an "observed error rate" and as a measure of evidence. Both of these interpretations are problematic, and their combination has obscured the important differences between Neyman and Fisher on the nature of the scientific method and inhibited our understanding of the philosophic implications of the basic methods in use today. An analysis using another method promoted by Fisher, mathematical likelihood, shows that the p value substantially overstates the evidence against the null hypothesis. Likelihood makes clearer the distinction between error rates and inferential evidence and is a quantitative tool for expressing evidential strength that is more appropriate for the purposes of epidemiology than the p value.
Article
The recent controversy over the increased risk of venous thrombosis with third generation oral contraceptives illustrates the public policy dilemma that can be created by relying on conventional statistical tests and estimates: case-control studies showed a significant increase in risk and forced a decision either to warn or not to warn. Conventional statistical tests are an improper basis for such decisions because they dichotomise results according to whether they are or are not significant and do not allow decision makers to take explicit account of additional evidence--for example, of biological plausibility or of biases in the studies. A Bayesian approach overcomes both these problems. A Bayesian analysis starts with a "prior" probability distribution for the value of interest (for example, a true relative risk)--based on previous knowledge--and adds the new evidence (via a model) to produce a "posterior" probability distribution. Because different experts will have different prior beliefs sensitivity analyses are important to assess the effects on the posterior distributions of these differences. Sensitivity analyses should also examine the effects of different assumptions about biases and about the model which links the data with the value of interest. One advantage of this method is that it allows such assumptions to be handled openly and explicitly. Data presented as a series of posterior probability distributions would be a much better guide to policy, reflecting the reality that degrees of belief are often continuous, not dichotomous, and often vary from one person to another in the face of inconclusive evidence.
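A minimal sketch of the prior-to-posterior updating described above is shown below, under the common simplifying assumption of a normal prior and a normal approximate likelihood on the log relative-risk scale; the prior and study numbers are illustrative, not those of the third-generation oral contraceptive studies.

```python
# A sketch of prior-to-posterior updating on the log relative-risk scale,
# assuming a normal prior and a normal approximate likelihood. The prior
# and study numbers are illustrative only.
import math

def posterior_log_rr(prior_mean, prior_sd, est_log_rr, est_se):
    w_prior, w_data = 1 / prior_sd**2, 1 / est_se**2   # precisions
    post_var = 1 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * est_log_rr)
    return post_mean, math.sqrt(post_var)

# Sceptical prior centred on RR = 1; hypothetical study estimating RR = 2.0
# with a standard error of 0.18 on the log scale (95% CI roughly 1.4 to 2.9).
m, s = posterior_log_rr(prior_mean=0.0, prior_sd=0.3,
                        est_log_rr=math.log(2.0), est_se=0.18)
print(f"Posterior median RR = {math.exp(m):.2f}, "
      f"95% interval {math.exp(m - 1.96 * s):.2f} to {math.exp(m + 1.96 * s):.2f}")
```

Repeating the calculation with different priors, as the abstract recommends, is the sensitivity analysis: a sceptical prior pulls the posterior towards no effect, while a vaguer prior leaves it close to the study estimate.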
Article
An important problem exists in the interpretation of modern medical research data: Biological understanding and previous research play little formal role in the interpretation of quantitative results. This phenomenon is manifest in the discussion sections of research articles and ultimately can affect the reliability of conclusions. The standard statistical approach has created this situation by promoting the illusion that conclusions can be produced with certain "error rates," without consideration of information from outside the experiment. This statistical approach, the key components of which are P values and hypothesis tests, is widely perceived as a mathematically coherent approach to inference. There is little appreciation in the medical community that the methodology is an amalgam of incompatible elements, whose utility for scientific inference has been the subject of intense debate among statisticians for almost 70 years. This article introduces some of the key elements of that debate and traces the appeal and adverse impact of this methodology to the P value fallacy, the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result. This argument is made as a prelude to the suggestion that another measure of evidence should be used--the Bayes factor, which properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings.
Article
Bayesian inference is usually presented as a method for determining how scientific belief should be modified by data. Although Bayesian methodology has been one of the most active areas of statistical development in the past 20 years, medical researchers have been reluctant to embrace what they perceive as a subjective approach to data analysis. It is little understood that Bayesian methods have a data-based core, which can be used as a calculus of evidence. This core is the Bayes factor, which in its simplest form is also called a likelihood ratio. The minimum Bayes factor is objective and can be used in lieu of the P value as a measure of the evidential strength. Unlike P values, Bayes factors have a sound theoretical foundation and an interpretation that allows their use in both inference and decision making. Bayes factors show that P values greatly overstate the evidence against the null hypothesis. Most important, Bayes factors require the addition of background knowledge to be transformed into inferences--probabilities that a given conclusion is right or wrong. They make the distinction clear between experimental evidence and inferential conclusions while providing a framework in which to combine prior with current evidence.
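The two-step logic described here, evidence first and inference only after a prior is supplied, can be sketched in a few lines using the familiar exp(−z²/2) form of the minimum Bayes factor for a z-statistic; the prior probabilities below are illustrative assumptions.

```python
# A sketch of the two-step logic described above: the minimum Bayes factor
# summarises the evidence in the data alone, and it becomes an inferential
# probability only after being combined with a prior. The exp(-z^2/2) form
# is the usual minimum Bayes factor for a z- (or large-sample t-) statistic;
# the prior probabilities are illustrative assumptions.
import math

def min_bayes_factor(z):
    return math.exp(-z * z / 2)          # strongest possible evidence against H0

def posterior_null(prior_null, bayes_factor):
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = bayes_factor * prior_odds   # Bayes' theorem in odds form
    return posterior_odds / (1 + posterior_odds)

bf = min_bayes_factor(1.96)              # roughly a two-sided p-value of 0.05
for prior in (0.25, 0.50, 0.75):
    print(f"prior P(H0) = {prior:.2f} -> posterior P(H0) >= {posterior_null(prior, bf):.2f}")
```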
Article
Genetic association studies for multigenetic diseases are like fishing for the truth in a sea of trillions of candidate analyses. Red herrings are unavoidably common, and bias might cause serious misconceptions. However, a sizeable proportion of identified genetic associations are probably true. Meta-analysis, a rigorous, comprehensive, quantitative synthesis of all the available data, might help us to separate the true from the false.
Article
Although statistical textbooks have for a long time asserted that "not significant" merely implies "not proven," investigators still display confusion regarding the interpretation of the verdict. This appears to be due to the ambiguity of the term "significance," to inadequate exposition, and especially to the behavior of textbook writers who in the analysis of data act as if "not significant" means "nonexistent" or "unimportant." Appropriate action after a verdict of "nonsignificance" depends on many circumstances and requires much thought. "Significance" tests often could be, and in some instances should be, avoided; then "nonsignificance" would cease to be a serious problem.