Article

A General Definition of Residuals

Authors: D. R. Cox and E. J. Snell

Abstract

Residuals are usually defined in connection with linear models. Here a more general definition is given and some asymptotic properties found. Some illustrative examples are discussed, including a regression problem involving exponentially distributed errors and some problems concerning Poisson and binomially distributed observations.
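The regression example with exponentially distributed errors mentioned in the abstract can be sketched numerically. The code below is a minimal illustration of the idea (a sketch, not the authors' computation): responses are simulated with conditional mean exp(x_i^T beta), the coefficients are estimated by maximum likelihood, and the residuals R_i = Y_i exp(-x_i^T beta_hat) are compared with a unit-exponential sample, which is how they should behave under a correctly specified model. All variable names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import kstest

rng = np.random.default_rng(0)

# Simulate an exponential regression: E(Y_i) = exp(b0 + b1 * x_i)
n = 500
x = rng.uniform(-1, 1, n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([0.5, 1.0])
y = rng.exponential(scale=np.exp(X @ beta_true))

# Maximum likelihood for the exponential regression with a log link
def neg_loglik(beta):
    mu = np.exp(X @ beta)
    return np.sum(np.log(mu) + y / mu)

beta_hat = minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x

# Generalized (Cox-Snell type) residuals: approximately i.i.d. Exp(1)
# when the model is correctly specified
resid = y * np.exp(-X @ beta_hat)
print("KS test against Exp(1):", kstest(resid, "expon"))
```

Plotting the ordered residuals against unit-exponential quantiles gives the corresponding graphical check.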

... , n be the outcome of interest and X_i be the set of covariates. Under a parametric model M, which synthesizes information including the distribution family and regressors, we denote r(Y_i | X_i) as the generalized model error formulated in Cox and Snell (1968). In a linear regression model, the error is ...
... where β represents the coefficients. Cox and Snell (1968) generalized the concept of model errors beyond normality by seeking independently identically distributed (i.i.d.) unobserved variables. For example, the generalized error for a continuous outcome, such as a gamma variable, can be defined as the uniformly distributed probability integral transform r(Y_i | X_i) = F(Y_i | X_i), wherein F is the conditional distribution of Y_i given X_i. ...
... For the same reason, Cox-Snell residuals (Cox and Snell 1968), which serve as a compelling diagnostic tool for continuous outcomes, lose effectiveness for discrete outcomes. ...
Preprint
Full-text available
The assessment of regression models with discrete outcomes is challenging and has many fundamental issues. With discrete outcomes, standard regression model assessment tools such as Pearson and deviance residuals do not follow the conventional reference distribution (normal) under the true model, calling into question the legitimacy of model assessment based on these tools. To fill this gap, we construct a new type of residuals for general discrete outcomes, including ordinal and count outcomes. The proposed residuals are based on two layers of probability integral transformation. When at least one continuous covariate is available, the proposed residuals closely follow a uniform distribution (a normal distribution after transformation) under the correctly specified model. One can construct visualizations such as QQ plots to check the overall fit of a model straightforwardly, and the shape of QQ plots can further help identify possible causes of misspecification such as overdispersion. We provide theoretical justification for the proposed residuals by establishing their asymptotic properties. Moreover, in order to assess the mean structure and identify potential covariates, we develop an ordered curve as a supplementary tool, which is based on the comparison between the partial sum of outcomes and of fitted means. Through simulation, we demonstrate empirically that the proposed tools outperform commonly used residuals for various model assessment tasks. We also illustrate the workflow of model assessment using the proposed tools in data analysis.
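For the continuous case quoted in the excerpts above, the generalized error r(Y_i | X_i) = F(Y_i | X_i) is the probability integral transform, which is uniform on (0, 1) under a correctly specified model. The sketch below illustrates only this single transform for a gamma response (it is not the preprint's two-layer construction for discrete outcomes); for brevity the true simulation parameters stand in for the fitted conditional distribution that would be used in practice.

```python
import numpy as np
from scipy.stats import gamma, kstest, norm

rng = np.random.default_rng(1)

# Gamma responses whose mean depends on a covariate (shape held fixed)
n, shape = 1000, 2.0
x = rng.uniform(0, 1, n)
scale = np.exp(0.3 + 0.8 * x) / shape      # so that E(Y | x) = exp(0.3 + 0.8 x)
y = rng.gamma(shape, scale)

# Probability-integral-transform residuals r_i = F(y_i | x_i);
# the true parameters stand in for fitted ones here
r = gamma.cdf(y, a=shape, scale=scale)
print("KS test against U(0,1):", kstest(r, "uniform"))

# Normal-scale version, often easier to inspect on a QQ plot
z = norm.ppf(r)
```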
... According to [57], the bias of θ̂_w (the MLE of θ_w), for independent observations that are not necessarily identically distributed, can be written as: ...
... In this section, we analyze three real datasets that can be better fitted using our proposed semiparametric approach, which are made available in Appendix 5. For each data set, we present several key metrics: the p-value derived from the Shapiro-Wilk normality test [66], the AIC and BIC criteria values (utilized for determining the optimal number of changepoints for the PE model), the Kaplan-Meier estimate of the reliability function [67], the Cox-Snell residuals [57], and the p-value derived from the Cramér-von Mises test [68,69], which evaluates the goodness-of-fit of our model. Additionally, we provide the estimated PCIs of the PE model, offering a comprehensive assessment of our approach's efficacy across diverse data sets. ...
Article
Full-text available
Piecewise models have gained popularity as a useful tool in reliability and quality control/monitoring, particularly when the process data deviates from a normal distribution. In this study, we develop maximum likelihood estimators (MLEs) for the process capability indices, denoted as C pk , C pm , C * pm and C pmk , using a semiparametric model. To remove the bias in the MLEs with small sample sizes, we propose a bias-correction approach to obtain improved estimates. Furthermore, we extend the proposed method to situations where the change-points in the density function are unknown. To estimate the model parameters efficiently, we employ the profiled maximum likelihood approach. Our simulation study reveals that the suggested method yields accurate estimates with low bias and mean squared error. Finally, we provide real-world data applications to demonstrate the superiority of the proposed procedure over existing ones.
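The first-order bias expression referred to in these excerpts is usually quoted in the following form (a standard restatement from the bias-correction literature rather than a quotation of the 1968 paper): for the MLE of the a-th parameter,

$$ b(\hat\theta_a) = \sum_{r,s,t} \kappa^{ar}\,\kappa^{st}\left(\tfrac{1}{2}\kappa_{rst} + \kappa_{rs,t}\right) + O(n^{-2}), $$

where $\kappa_{rst} = E(\partial^3 \ell / \partial\theta_r\,\partial\theta_s\,\partial\theta_t)$, $\kappa_{rs,t} = E\{(\partial^2 \ell / \partial\theta_r\,\partial\theta_s)(\partial \ell / \partial\theta_t)\}$, and $\kappa^{rs}$ is the $(r,s)$ element of the inverse of the expected information matrix $\{-\kappa_{rs}\}$. Subtracting an estimate of $b(\hat\theta)$ from $\hat\theta$ gives the second-order bias-corrected estimator used in several of the works listed here.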
... Previous studies reported different risk factors associated with mortality of stroke patients, like older age [17][18][19]; body temperature greater than 7.1 degrees centigrade, potassium level below 2 mmol/l, and creatinine level > 1.2 mg/dl [16]; type of stroke, diabetes, and severity of the stroke [18]; and atrial fibrillation and lower education status [20], all of which were associated with lower survival of stroke patients. Another study found that removing the feeding gastrostomy tube (FGT) at discharge from the rehabilitation hospital, as well as not aspirating during videofluoroscopic surgery (VSS), was linked to a longer survival time for stroke patients [21]. ...
... After a model is fitted, its adequacy needs to be assessed. The methods involved in this study are checking the adequacy of the parametric baselines and the Cox-Snell residuals [19]. ...
Article
Full-text available
Background Stroke is a life-threatening condition that occurs due to impaired blood flow to brain tissues. Every year, about 15 million people worldwide suffer from a stroke, with five million of them suffering from some form of permanent physical disability. Globally, stroke is the second-leading cause of death following ischemic heart disease. It is a public health burden for both developed and developing nations, including Ethiopia. Objectives This study is aimed at estimating the time to death among stroke patients at Jimma University Medical Center, Southwest Ethiopia. Methods A facility-based retrospective cohort study was conducted among 432 patients. The data were collected from stroke patients under follow-up at Jimma University Medical Center from January 1, 2016, to January 30, 2019. A log-rank test was used to compare the survival experiences of different categories of patients. The Cox proportional hazards model and the accelerated failure time model were used to analyze the survival of stroke patients using R software. Akaike's information criterion was used to compare the fitted models. Results Of the 432 stroke patients followed, 223 (51.6%) experienced the event of death. The median time to death among the patients was 15 days. According to the results of the Weibull accelerated failure time model, the age of patients, atrial fibrillation, alcohol consumption, types of stroke diagnosed, hypertension, and diabetes mellitus were found to be the significant prognostic factors that contribute to shorter survival times among stroke patients. Conclusion The Weibull accelerated failure time model better described the time to death of the stroke patients' data set than other distributions used in this study. Patients' age, atrial fibrillation, alcohol consumption, being diagnosed with hemorrhagic types of stroke, having hypertension, and having diabetes mellitus were found to be factors shortening survival time to death for stroke patients. Hence, healthcare professionals need to thoroughly follow the patients who have these risk factors. Moreover, patients need to be educated about lifestyle modifications.
... The MLEs for a statistical model are known to exhibit bias of the order O(n^{-1}) (see Cox and Snell [15]). The Cox-Snell methodology can be used to correct for the expected bias and to improve the precision of the estimator. ...
... Using this expression, Cox and Snell [15] proved that the bias of θ̂_m can be written as ...
Article
Full-text available
The power-law distribution plays a crucial role in complex networks as well as various applied sciences. Investigating whether the degree distribution of a network follows a power-law distribution x^(−α) , with 2 < α < 3 (scale-free hypothesis), is an important concern. The commonly used inferential methods for estimating the model parameters often yield biased estimates, which can lead to the rejection of the hypothesis that a model conforms to a power-law. In this paper, we discuss improved methods that utilize Bayesian inference to obtain accurate estimates and precise credibility intervals. The inferential methods are derived for both continuous and discrete distributions. These methods reveal that objective Bayesian approaches return nearly unbiased estimates for the parameters of both models. Notably, in the continuous case, we identify an explicit posterior distribution. This work enhances the power of goodness-of-fit tests, enabling us to accurately discern whether a network or any other dataset adheres to a power-law distribution. We apply the proposed approach to fit degree distributions for more than 5,000 synthetic networks and over 3,000 real networks. The results indicate that our method is more suitable in practice, as it yields a frequency of acceptance close to the specified nominal level.
... By using the convergence result for exchangeable random variables in Billingsley (1968), for 0 ≤ p ≤ 1, ... In contrast, Klein and Wu (2003) considered the following Cox model allowing an additional p-dimensional covariate vector Z ∈ ℝ^p: ... where * (∈ ℝ^p) denotes the regression coefficient and the symbol T denotes the transpose. They developed a test statistic based on the Cox and Snell (1968) ...
Article
Full-text available
Hypothesis testing for the regression coefficient associated with a dichotomized continuous covariate in a Cox proportional hazards model has been considered in clinical research. Although most existing testing methods do not allow covariates, except for a dichotomized continuous covariate, they have generally been applied. Through an analytic bias analysis and a numerical study, we show that the current practice is not free from an inflated type I error and a loss of power. To overcome this limitation, we develop a bootstrap-based test that allows additional covariates and dichotomizes two-dimensional covariates into a binary variable. In addition, we develop an efficient algorithm to speed up the calculation of the proposed test statistic. Our numerical study demonstrates that the proposed bootstrap-based test maintains the type I error well at the nominal level and exhibits higher power than other methods, as well as that the proposed efficient algorithm reduces computational costs.
... Cox-Snell residuals are another form of residuals derived from the following relationship: ê_i = −log[1 − G(y_i, β̂, μ̂_i)]. Detailed information on Cox-Snell residuals can be found in [27]. When the model fits the data appropriately, these residuals should follow an exponential distribution with a scale parameter of 1. ...
Preprint
Full-text available
In practical scenarios, data measurements like ratios and proportions often fall within the 0 to 1 range. Analyzing such bounded data introduces unique modeling challenges, prompting statisticians to explore new distributions that can effectively handle this context. Although beta and Kumaraswamy distributions, along with their related regression models, have gained popularity for examining the relationship between bounded response variables and covariates, several alternative models have shown superior performance compared to these two. However, there is still no agreement on the most effective alternative models. Consequently, this paper introduces a novel bounded probability distribution derived from transforming the Weibull distribution. Our investigation has revealed several interesting properties, including various moments and their generating function, entropies, quantile function, and a linear form of the proposed model. Additionally, we have developed the sequential probability ratio test (SPRT) for the proposed model. The maximum likelihood estimation method was employed to estimate the model parameters. A Monte Carlo simulation was conducted to evaluate the performance of parameter estimation for the model. Finally, we formulated a quantile regression model and applied it to data sets related to risk assessment and educational attainment, demonstrating its superior performance over alternative regression models. These results highlight the importance of our contributions to enhancing the statistical toolkit for analyzing bounded variables across different scientific fields.
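The unit-exponential behaviour of the residuals quoted above is just the probability integral transform: if $Y$ has continuous distribution function $G$, then $U = G(Y) \sim \mathrm{U}(0,1)$ and

$$ \hat e = -\log\{1 - G(Y)\} \sim \operatorname{Exp}(1), $$

since $P\{-\log(1-U) > t\} = P\{U > 1 - e^{-t}\} = e^{-t}$ for $t \ge 0$. In practice $G$ is replaced by its fitted counterpart, so the residuals are only approximately standard exponential.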
... We adopted the idea of the residuals for survival analysis pioneered in [39] using a transformation of the cumulative hazard function (CHF) associated with the PEW regression model. For the regression models normally employed in survival analysis, the Cox-Snell residuals should behave like a censored sample from a standard exponential distribution when the fit is adequate [38]. ...
Article
Full-text available
The use of cure-rate survival models has grown in recent years. Even so, proposals to perform the goodness of fit of these models have not been so frequent. However, residual analysis can be used to check the adequacy of a fitted regression model. In this context, we provide Cox–Snell residuals for Poisson-exponentiated Weibull regression with cure fraction. We developed several simulations under different scenarios for studying the distributions of these residuals. They were applied to a melanoma dataset for illustrative purposes.
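The graphical check described above can be sketched generically: compute r_i = H_hat(t_i | x_i) from the fitted model, treat (r_i, delta_i) as a possibly censored sample, estimate its cumulative hazard with the Nelson-Aalen estimator, and compare it with the 45-degree line expected for a unit exponential. This is a generic sketch, not the authors' implementation, and the inputs cox_snell and event below are placeholders for quantities produced by whatever model was fitted.

```python
import numpy as np

def nelson_aalen(time, event):
    """Nelson-Aalen estimate of the cumulative hazard for right-censored data."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    at_risk = len(time) - np.arange(len(time))   # subjects still at risk at each ordered time
    return time, np.cumsum(event / at_risk)      # ties handled crudely

# Placeholder inputs: Cox-Snell residuals and event indicators from a fitted model
rng = np.random.default_rng(2)
cox_snell = rng.exponential(size=200)            # stands in for H_hat(t_i | x_i)
event = rng.binomial(1, 0.8, size=200)           # 1 = observed event, 0 = censored

t, H = nelson_aalen(cox_snell, event)
# Under an adequate model the points (t, H) should lie close to the line H = t.
print(np.column_stack([t[:5], H[:5]]))
```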
... This simulation-based approach is facilitated in FRK v2 with the function simulate(). In particular, simulations generated with simulate() may be used with the R package DHARMa (Hartig 2022) which, given observed data and simulations from a fitted model, computes interpretable, simulation-based quantile residuals (Cox and Snell 1968; Dunn and Smyth 1996). Under the true model, these residuals always follow a standard uniform distribution, which greatly facilitates their interpretation. ...
Article
Full-text available
Non-Gaussian spatial and spatio-temporal data are becoming increasingly prevalent, and their analysis is needed in a variety of disciplines. FRK is an R package for spatial and spatio-temporal modeling and prediction with very large data sets that, to date, has only supported linear process models and Gaussian data models. In this paper, we describe a major upgrade to FRK that allows for non-Gaussian data to be analyzed in a generalized linear mixed model framework. These vastly more general spatial and spatio-temporal models are fitted using the Laplace approximation via the software TMB. The existing functionality of FRK is retained with this advance into non-Gaussian models; in particular, it allows for automatic basis-function construction, it can handle both point-referenced and areal data simultaneously, and it can predict process values at any spatial support from these data. This new version of FRK also allows for the use of a large number of basis functions when modeling the spatial process, and thus it is often able to achieve more accurate predictions than previous versions of the package in a Gaussian setting. We demonstrate innovative features in this new version of FRK, highlight its ease of use, and compare it to alternative packages using both simulated and real data sets.
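The simulation-based quantile residuals mentioned in the excerpt (in the spirit of DHARMa, not its actual code) can be obtained by simulating many replicates from the fitted model and locating each observation within its own simulated predictive distribution; under the true model the resulting values are approximately standard uniform. A generic numpy sketch, with a Poisson model standing in for the fitted model:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for a fitted model: Poisson with observation-specific fitted means
n, n_sim = 300, 1000
mu_hat = np.exp(rng.normal(0.5, 0.4, n))   # illustrative fitted means
y_obs = rng.poisson(mu_hat)                # observed data (here generated from the same model)

# Simulate n_sim replicates of each observation from the fitted model
y_sim = rng.poisson(mu_hat, size=(n_sim, n))

# Quantile residual: randomized position of the observation within its simulations;
# randomization breaks the ties that discrete data produce
below = (y_sim < y_obs).mean(axis=0)
equal = (y_sim == y_obs).mean(axis=0)
u = below + rng.uniform(0, 1, n) * equal
# Under a correctly specified model, u is approximately Uniform(0, 1)
print(u.min(), u.mean(), u.max())
```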
... This also implies that the model was fit to the data. The value of the Cox & Snell R Square (Cox & Snell, 1968) is 0.119, and the value of the Nagelkerke (1991) R Square is 0.459, indicating that between 11.9% and 45.9% of the variability in the outcome variable was explained by the predictor variables. We then examined the validity of the predicted probabilities based on the classification table. ...
Article
Full-text available
The purpose of this study was to predict the likelihood that teachers would consider the Geometric AR-based Pedagogical Module (GeAR-PM) as beneficial, and study the factors that influence their decision. The data for this study was gathered via a survey method. A sample of 202 Malaysian secondary school teachers was randomly selected as respondents for this study. Specifically, the respondents were asked to answer questions using an instrument which was developed based on the Unified Theory of Acceptance and Use of Technology (UTAUT) model. This study comprised four UTAUT variables: performance expectation, effort expectancy, social influence, and facilitating conditions, and one additional variable, self-efficacy. All variables showed high reliability with Cronbach Alpha’s values between 0.87 to 0.91. The data was analyzed using a binary logistic regression via Statistical Package for the Social Sciences (SPSS) version 27.0. The findings revealed that a teacher's belief that GeAR-PM is beneficial was significantly related to his/ her perception on effort expectancy, social influence, and self-efficacy. Hence, when developing the GeAR-PM, these three factors should be weighed up.
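The two pseudo-R-squared measures quoted above have simple closed forms. With $L_0$ the likelihood of the intercept-only model, $L_M$ the likelihood of the fitted model and $n$ the sample size,

$$ R^2_{CS} = 1 - \left(\frac{L_0}{L_M}\right)^{2/n}, \qquad R^2_{N} = \frac{R^2_{CS}}{1 - L_0^{2/n}}. $$

Because $R^2_{CS}$ cannot exceed $1 - L_0^{2/n}$, Nagelkerke's version rescales it to the full $[0, 1]$ range, which is why the two reported values (0.119 and 0.459) differ so markedly.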
... For GBcr-PE with q = 1 and r = 2, Figure 4 shows the QQ-plot for the quantile residuals (left panel) and the KM estimator for the Cox-Snell residuals [25] (right panel). On the other hand, we also applied some common normality tests to check the validity of the quantile residuals, such as the Kolmogorov-Smirnov (KS, [26]), Shapiro-Wilk (SW, [27]), Anderson-Darling (AD, [28]), and Cramér-von Mises (CVM, [29]). ...
Article
Full-text available
A novel cure rate model is introduced by considering, for the number of concurrent causes, the modified power series distribution and, for the time to event, the recently proposed power piecewise exponential distribution. This model includes a wide variety of cure rate models, such as binomial, Poisson, negative binomial, Haight, Borel, logarithmic, and restricted generalized Poisson. Some characteristics of the model are examined, and the estimation of parameters is performed using the Expectation–Maximization algorithm. A simulation study is presented to evaluate the performance of the estimators in finite samples. Finally, an application in a real medical dataset from a population-based study of incident cases of lobular carcinoma diagnosed in the state of São Paulo, Brazil, illustrates the advantages of the proposed model compared to other common cure rate models in the literature, particularly regarding the underestimation of the cure rate in other proposals and the improved precision in estimating the cure rate of our proposal.
... Let the two quantities denote the biases of the component MLEs and of the remaining MLE, respectively. The use of Cox and Snell [29]'s formula to obtain these biases is simplified, since the two sets of parameters are globally orthogonal, so the cross terms of the expected information vanish. We have that ...
... Cox-Snell residuals estimate the difference between the predicted and real (observed) value of the observation. To do so, it applies a negative logarithmic transformation to the survival probability of each observation (Cox and Snell, 1968). The Cox-Snell plots are characterized by the cumulative hazard on the y-axis, and the Cox-Snell residuals on the x-axis. ...
Article
The aim of the paper is to investigate the impact of the extension of a metro line on the survival of individual firms. An empirical analysis focuses on the relation between proximity to new stations and firm survival following the announcement of the extension of the orange line of the metro in Montréal (Canada) between 1996 and 2016. To do so, a Cox Proportional Hazards model is estimated using the simultaneous announcement of two potential extensions to define the treatment (new stations) and control groups (speculative stations) based on the distance to the closest stations. The model explicitly controls for anticipation and speculation effect by introducing three distinct treatment effects by period. The impact of the new metro stations appears to be mostly positive on firms' survival probability during the construction period and after the opening of the service. The results suggest that the metro extension does have a positive influence on the survival rates of individual firms within 250 to 1,250 m. of the closest station, especially for local activities.
... Survival heatmaps developed for the new packaging materials are further discussed in Section 3. Various metrics can be used to select the best suited models and to check their performance. In survival analysis, the most common metrics are the Akaike Information Criterion (AIC) (Akaike, 1974), the Bayesian Information Criterion (BIC) (Schwarz, 1978) and the Cox-Snell Residuals (Cox & Snell, 1968). The results presented in the next section were obtained using the AIC selection criterion, but a similar conclusion was obtained when the BIC criterion was used instead. ...
Article
Full-text available
Effective packaging solutions are pivotal in addressing sustainability challenges associated with food consumption. Sustainable packaging must not only reduce ecological impact but also ensure food quality preservation, minimize the risk of foodborne diseases, and contribute to mitigating the accumulation of plastic waste. While paper-based food packaging is environmentally friendly, it is susceptible to mold growth when exposed to specific humidity and temperatures, potentially leading to spoilage and safety concerns. This study introduces an innovative modeling approach employing survival analysis to estimate time to visible mold growth on paper-based materials used in food-contact packaging under varying environmental conditions. The objective is to establish an alternative statistical framework for assessing the impact of relative humidity and temperature on mold proliferation. This approach is particularly relevant to novel sustainable packaging materials in the food industry, especially when dealing with censored data. Survival heatmaps are developed to simplify the visualization of spoilage likelihood, measured by time to visible growth at a given confidence level. Although the survival model is proposed for paper-based materials, there are no theoretical limitations to its extension for use with other materials within and beyond the food industry.
... Figure 1 and Figure 2 show the trace plots of the parameters for the fitted models, and Figure 3 and Figure 4 show the posterior density plots of the parameters for the fitted models. For model diagnostics, we consider the residual plot of (16): a plot of the estimated cumulative hazard function (based on the Cox and Snell residuals and the censored data) versus the Cox and Snell residuals. If the model provides a good fit to the data, we expect a straight line through the origin with slope 1. ...
Article
Full-text available
Objectives: The objective of this study is to develop a new Proportional Odds frailty model by using a Weibull hazard function in the context of a Bayesian mechanism. Methods: Frailty models provide a convenient way to introduce random effects, association and unobserved heterogeneity into models for survival data. The Proportional Odds (PO) model is a widely pursued model in survival analysis which extends the concept of the odds ratio to lifetime data with covariates. Proportional Odds models can be derived under the frailty approach by introducing a frailty term for each individual in the exponent of the hazard function, which acts multiplicatively on the baseline hazard function. In this paper an attempt has been made to develop a new Proportional Odds frailty model by using a Weibull hazard function in the context of a Bayesian mechanism. Findings: The methodologies are applied to a real-life survival data set and the posterior inferences are drawn using Markov Chain Monte Carlo (MCMC) simulation methods; model comparison tools like the Deviance Information Criterion (DIC) and the Log Pseudo Marginal Likelihood (LPML) are also calculated, and a Cox-Snell residual plot is employed to check the fit of the model. The performance of the newly developed model is compared with an existing proportional odds model without the frailty term, and it is observed that the newly developed frailty model performs well compared to the traditional non-frailty model. Novelty: A new Proportional Odds frailty model using a Weibull hazard function under a Bayesian mechanism is developed in this paper, which is an added contribution to the field of Survival Analysis.
... Furthermore, the model_diagnostics() function facilitates diagnostic assessments through analysis of the residuals. It supports the calculation and visualization of martingale residuals, deviance residuals (Therneau et al. 1990), and Cox-Snell residuals (Cox and Snell 1968). ...
Article
Full-text available
Due to their flexibility and superior performance, machine learning models frequently complement and outperform traditional statistical survival models. However, their widespread adoption is hindered by a lack of user-friendly tools to explain their internal operations and prediction rationales. To tackle this issue, we introduce the survex R package, which provides a cohesive framework for explaining any survival model by applying explainable artificial intelligence techniques. The capabilities of the proposed software encompass understanding and diagnosing survival models, which can lead to their improvement. By revealing insights into the decision-making process, such as variable effects and importances, survex enables the assessment of model reliability and the detection of biases. Thus, transparency and responsibility may be promoted in sensitive areas, such as biomedical research and healthcare applications. Availability and Implementation survex is available under the GPL3 public license at https://github.com/modeloriented/survex and on CRAN with documentation available at https://modeloriented.github.io/survex.
... In general, one way to validate the adequacy of any fitted regression model is to check whether the residuals of the model are well behaved. In reliability studies, the Cox-Snell residual by Cox and Snell [32] is a kind of standardized residual that is widely used for assessing the goodness-of-fit of the fitted regression model, and it is given by ...
Article
Full-text available
A two-parameter unit distribution and its regression model plus its extension to 0 and 1 inflation is introduced and studied. The distribution is called the unit upper truncated Weibull (UUTW) distribution, while the inflated variant is called the 0−1 inflated unit upper truncated Weibull (ZOIUUTW) distribution. The UUTW distribution has an increasing and a J-shaped hazard rate function. The parameters of the proposed models are estimated by the method of maximum likelihood estimation. For the UUTW distribution, two practical examples involving household expenditure and maximum flood level data are used to show its flexibility and the proposed distribution demonstrates better fit tendencies than some of the competing unit distributions. Application of the proposed regression model demonstrates adequate capability in describing the real data set with better modeling proficiency than the existing competing models. Then, for the ZOIUUTW distribution, the CD34+ data involving cancer patients are analyzed to show the flexibility of the model in characterizing inflation at both endpoints of the unit interval.
... The MLEs are typically biased of order O(n^{-1}), and these errors diminish as the sample size increases (Cordeiro ...). By applying the corrective Cox-Snell methodology (Cox and Snell 1968), the expected bias can be accounted for and corrected to improve the precision of the estimator. The modification of the MLEs returned corrected estimators that are bias-free up to the second order. ...
Article
Full-text available
Piecewise models play a crucial role in statistical analysis as they allow the same pattern to be adjusted over different regions of the data, achieving a higher quality of fit than would be obtained by fitting them all at once. The standard piecewise linear distribution assumes that the hazard rate is constant between each change point. However, this assumption may be unrealistic in many applications. To address this issue, we introduce a piecewise distribution based on the power-law model. The proposed semi-parametric distribution boasts excellent properties and features a non-constant hazard function between change points. We discuss parameter estimates using the maximum likelihood estimators (MLEs), which yield closed-form expressions for the estimators and the Fisher information matrix for both complete and randomly censored data. Since MLEs can be biased for small samples, we derived bias-corrected MLEs that are unbiased up to the second order and also have closed-form expressions. We consider a profiled MLE approach to estimate change points and construct a hypothesis test to determine the number of change points. We apply our proposed model to analyze the survival pattern of monarchs in the Pharaoh dynasties. Our results indicate that the piecewise power-law distribution fits the data well, suggesting that the lifespans of pharaonic monarchs exhibit varied survival patterns.
... We used the proportional hazards model using the survival package of the R software (R Development Core Team 2014). The fit of the model was evaluated through the Cox-Snell residuals (Cox and Snell 1968), which made it possible to verify that there was no violation of the fitted model. The relationships between the reproductive efficiency of Nile tilapia females and the weight, total length, standard length, and height of females and males, as well as the number of eggs and egg mass weight, were expressed as a risk ratio (HRR). ...
Article
Full-text available
Survival analysis has proven to be a robust tool for studies in humans, and recently, it has been adopted in studies with animals, but very little in fish. The production of uniform fingerlings has been one challenge of aquaculture, influenced by the use of the efficient breeders. We aimed to present, for the first time in aquaculture, a survival analysis as a tool for decision-making by using the relationship between the main morphometric traits and the spawning time of Nile tilapia females up to 28 days after mating in an intensive system. We used 78 females and 26 males from which ten traits were evaluated. A check was made for the presence of eggs in the female mouth (spawning) every seven days until the final twenty-eight days. Confirmation of eggs was defined as uncensored data (C = 1) and absence of eggs as censored (C = 0). The Cox proportional hazards ratio and Kaplan–Meier models were adjusted to analyze the data. Females with a standard length of 19.19 cm (small group) and males with a weight of 259.14 g (small group) reproduced early, being more adapted to 1000 L tanks in a recirculation system. Less heavy egg masses were also related to early spawning. Therefore, the traits, standard length of the female, egg mass weight, and male weight, affected the presence of eggs up to 28 days and should be used as selection criteria for early breeders. Besides, the survival analysis was accurate as a tool for tilapia breeding management in an intensive system.
... The P-value of 0.008 indicates that the covariate effects are significant. Now, we check the overall fit of the model by using Cox-Snell residuals (Cox & Snell, 1968). Suppose that the AR model given in (1) is fitted to the data. ...
Article
Middle-censoring refers to data arising in situations where the exact lifetime of study subjects becomes unobservable if it happens to fall in a random censoring interval. In the present paper we propose a semiparametric additive risks regression model for analysing middle-censored lifetime data arising from an unknown population. We estimate the regression parameters and the unknown baseline survival function by two different methods. The first method uses the martingale-based theory and the second method is an iterative method. We report simulation studies to assess the finite sample behaviour of the estimators. Then, we illustrate the utility of the model with a real life data set. The paper ends with a conclusion.
... In the second "bridgehead" specification, newly invaded regions may themselves become source regions for further invasion, thus we allowed for J tk to grow over time, adding countries in which species k is newly discovered. We used robust standard errors clustered at the importing country-species level and Cox-Snell residuals to evaluate model fit (Cox & Snell, 1968). ...
Article
International trade continues to drive biological invasions. We investigate the drivers of global nonnative ant establishments over the last two centuries using a Cox proportional hazards model. We use country‐level discovery records for 36 of the most widespread nonnative ant species worldwide from 1827 to 2012. We find that climatic similarity combined with cumulative imports during the 20 years before a species discovery in any given year is an important predictor of establishment. Accounting for invasions from both the native and previously invaded “bridgehead” regions substantially improves the model's fit, highlighting the role of spatial spillovers. These results are valuable for targeting biosecurity efforts.
... Further, the survival models can be compared using Cox-Snell residual plots. The Cox-Snell residuals (r_j) for the cumulative hazard of the fitted model (H) with covariates X_j can be expressed by Equation 10 (Cox and Snell 1968). ...
... Regarding fire metrics, we considered fire recurrence (FIRE_rec), fire seasonality (FIRE_fs for fire-season fires and FIRE_nfs for non-seasonal fires), and the size of the largest fire that affected each burned patch (FIRE_fslf). We used the R software package rcompanion [122] to compute the Efron, McFadden, Cox and Snell, and Nagelkerke/Cragg and Uhler pseudo-R² measures [123][124][125] to assess model performance. ...
Article
Full-text available
Socio-demographic changes in recent decades and fire policies centered on fire suppression have substantially diminished the ability to maintain low fuel loads at the landscape scale in marginal lands. Currently, shepherds face many barriers to the use of fire for restoring pastures in shrub-encroached communities. The restrictions imposed are based on the lack of knowledge of their impacts on the landscape. We aim to contribute to this clarification. Therefore, we used a dataset of burned areas in the Alto Minho region for seasonal and unseasonal (pastoral) fires. We conducted statistical and spatial analyses to characterize the fire regime (2001-2018), the distribution of fuel types and their dynamics, and the effects of fire on such changes. Unseasonal fires are smaller and spread in different spatial contexts. Fuel types characteristic of maritime pine and eucalypts are selected by seasonal fires and avoided by unseasonal fires which, in turn, showed high preference for heterogeneous mosaics of herbaceous and shrub vegetation. The area covered by fuel types of broadleaved and eucalypt forest stands increased between 2000 and 2018 at the expense of the fuel type corresponding to maritime pine stands. Results emphasize the role of seasonal fires and fire recurrence in these changes, and the weak effect of unseasonal fires. An increase in the maritime pine fuel type was observed only in areas burned by unseasonal fires, after excluding the areas overlapping with seasonal fires.
... Here, r = y − E(Y |x) is the classical residual and Φ(·) is the CDF of the standard normal distribution. Our definition maps the classical residual r to a scale of (-1/2, 1/2), and it reduces to the Cox-Snell residual (Cox and Snell, 1968). Throughout the paper, we use an upper-case letter (e.g., R or S) to denote a random variable and a lower-case letter (e.g., r or s) an observation. ...
Preprint
Full-text available
This paper is motivated by the analysis of a survey study of college student wellbeing before and after the outbreak of the COVID-19 pandemic. A statistical challenge in well-being survey studies lies in that outcome variables are often recorded in different scales, be it continuous, binary, or ordinal. The presence of mixed data complicates the assessment of the associations between them while adjusting for covariates. In our study, of particular interest are the associations between college students' wellbeing and other mental health measures and how other risk factors moderate these associations during the pandemic. To this end, we propose a unifying framework for studying partial association between mixed data. This is achieved by defining a unified residual using the surrogate method. The idea is to map the residual randomness to the same continuous scale, regardless of the original scales of outcome variables. It applies to virtually all commonly used models for covariate adjustments. We demonstrate the validity of using such defined residuals to assess partial association. In particular, we develop a measure that generalizes classical Kendall's tau in the sense that it can size both partial and marginal associations. More importantly, our development advances the theory of the surrogate method developed in recent years by showing that it can be used without requiring outcome variables having a latent variable structure. The use of our method in the well-being survey analysis reveals (i) significant moderation effects (i.e., the difference between partial and marginal associations) of some key risk factors; and (ii) an elevated moderation effect of physical health, loneliness, and accommodation after the onset of COVID-19.
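One reading consistent with the excerpt above (an interpretation, not a quotation from the paper): for a continuous outcome with conditional distribution function $F(\cdot \mid x)$, the unified residual is the probability-integral-transformed value recentred at zero,

$$ R = F(y \mid x) - \tfrac{1}{2} \in \left(-\tfrac{1}{2}, \tfrac{1}{2}\right), $$

which for a normal linear model becomes $\Phi\{(y - E(Y\mid x))/\sigma\} - 1/2$ and is uniform on $(-1/2, 1/2)$ under the true model; up to the constant shift this is the Cox-Snell residual $F(y \mid x)$.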
Article
Since its invention HyperLogLog has become the standard algorithm for approximate distinct counting. Due to its space efficiency and suitability for distributed systems, it is widely used and also implemented in numerous databases. This work presents UltraLogLog, which shares the same practical properties as HyperLogLog. It is commutative, idempotent, mergeable, and has a fast guaranteed constant-time insert operation. At the same time, it requires 28% less space to encode the same amount of distinct count information, which can be extracted using the maximum likelihood method. Alternatively, a simpler and faster estimator is proposed, which still achieves a space reduction of 24%, but at an estimation speed comparable to that of HyperLogLog. In a non-distributed setting where martingale estimation can be used, UltraLogLog is able to reduce space by 17%. Moreover, its smaller entropy and its 8-bit registers lead to better compaction when using standard compression algorithms. All this is verified by experimental results that are in perfect agreement with the theoretical analysis which also outlines potential for even more space-efficient data structures. A production-ready Java implementation of UltraLogLog has been released as part of the open-source Hash4j library.
Article
Full-text available
Conditional cash transfer programs (CCTs) are increasingly common in Latin American countries and have become the main strategy for combating poverty and social inequality. Among many, noteworthy programs include Progresa-Oportunidades, later renamed Prospera, in Mexico in 1997; Programa Familias en Acción in Colombia; Chile Solidario in Chile in 2002; and Bolsa Família Program in Brazil in 2003. The central feature of the Bolsa Família Program (BFP) was that the receipt of monetary benefits by families was tied to the fulfillment of certain conditions, or conditionalities, involving healthcare and education for children and adolescents. The purpose of this design was, in the short term, to combat the negative effects of poverty on family well-being through the transfer of monetary resources, and in the long term, through the requirement of conditionalities, to break the poverty trap caused, in large part, by intergenerational transmission of income and education. The literature on conditional cash transfers has largely focused on entry conditions, paying less attention to exit conditions. This study aims to explore the trajectory of participant families in the Bolsa Família Program, using information obtained from the cohort of individuals born in the city of Pelotas in 2004. The research focuses on the socioeconomic characteristics of families at the time of their child’s birth in 2004 and the follow-up conducted in 2015. Survival analysis models are used to analyze the probability of exiting the program. The results reported by the survival analysis, Kaplan-Meier method, indicate that a BRL 1.00 (US$ 0.30) increase in income reduces, on average, the likelihood of successful departure from the BFP beneficiary family by 1%. Conversely, exercises conducted by family groups indicate that for families receiving benefits above the average, BRL 141.00 (US$ 42.32), the mother’s age and her employment status positively influence the chances of leaving the program. Overall, the results demonstrate that the values of the benefits received are subject to decreasing returns. For families receiving benefits below the average, a one-unit increase in the amount received reduces the probability of exiting the program by an average of 1%. Regarding successful exits, it was observed that having a white father increases the chances by 40%, and if the father is employed, it increases the likelihood of successfully exiting the program by 48%. These results indicate that the successful exit of families from the program is primarily associated with parental characteristics.
Article
We obtain new mathematical properties of the exponentiated odd log-logistic family of distributions, of its special case named the exponentiated odd log-logistic Weibull, and of its log-transformed version. A new location and scale regression model is constructed, and some simulations are carried out to verify the behavior of the maximum likelihood estimators and of the modified deviance-based residuals. The methodology is applied to the Japanese-Brazilian emigration data.
Article
Full-text available
Introduction To assess the rate of change in soluble fms‐like tyrosine kinase‐1/placental growth factor (sFlt‐1/PlGF) ratio and PlGF levels per week compared to a single sFlt‐1/PlGF ratio or PlGF level to predict preterm birth for pregnancies complicated by fetal growth restriction. Material and methods A prospective cohort study of pregnancies complicated by isolated fetal growth restriction. Maternal serum PlGF levels and the sFlt‐1/PlGF ratio were measured at 4‐weekly intervals from recruitment to delivery. We investigated the utility of PlGF levels, sFlt‐1/PlGF ratio, change in PlGF levels per week or sFlt‐1/PlGF ratio per week. Cox‐proportional hazard models and Harrell's C concordance statistic were used to evaluate the effect of biomarkers on time to preterm birth. Results The total study cohort was 158 pregnancies comprising 91 (57.6%) with fetal growth restriction and 67 (42.4%) with appropriate for gestational age controls. In the fetal growth restriction cohort, sFlt‐1/PlGF ratio and PlGF levels significantly affected time to preterm birth (Harrell's C: 0.85–0.76). The rate of increase per week of the sFlt‐1/PlGF ratio (hazard ratio [HR] 3.91, 95% confidence interval [CI]: 1.39–10.99, p = 0.01, Harrell's C: 0.74) was positively associated with preterm birth but change in PlGF levels per week was not (HR 0.65, 95% CI: 0.25–1.67, p = 0.37, Harrell's C: 0.68). Conclusions Both a high sFlt‐1/PlGF ratio and low PlGF levels are predictive of preterm birth in women with fetal growth restriction. Although the rate of increase of the sFlt‐1/PlGF ratio predicts preterm birth, it is not superior to either a single elevated sFlt‐1/PlGF ratio or low PlGF level.
Article
Semicontinuous outcomes commonly arise in a wide variety of fields, such as insurance claims, healthcare expenditures, rainfall amounts, and alcohol consumption. Regression models, including Tobit, Tweedie, and two-part models, are widely employed to understand the relationship between semicontinuous outcomes and covariates. Given the potential detrimental consequences of model misspecification, after fitting a regression model, it is of prime importance to check the adequacy of the model. However, due to the point mass at zero, standard diagnostic tools for regression models (eg, deviance and Pearson residuals) are not informative for semicontinuous data. To bridge this gap, we propose a new type of residuals for semicontinuous outcomes that is applicable to general regression models. Under the correctly specified model, the proposed residuals converge to being uniformly distributed, and when the model is misspecified, they significantly depart from this pattern. In addition to in-sample validation, the proposed methodology can also be employed to evaluate predictive distributions. We demonstrate the effectiveness of the proposed tool using health expenditure data from the US Medical Expenditure Panel Survey.
Article
Full-text available
Abstract One of the striking features of the field of artistic work is the income differential and its inequalities. Since the field is very heterogeneous, this article seeks to verify whether the differentials and inequalities described in the literature hold for musicians working in Belo Horizonte. To that end, in February, March, and April 2020, a survey was carried out to collect primary data in order to answer these questions through econometric and statistical applications. The applications suggest that the hypothesis of non-linearity between educational level and income cannot be rejected, and personal characteristics were not significant in this analysis. Regarding intra-group inequalities, the highest indices occur in some specific groups, such as musicians dedicated exclusively to music, those in the 30 to 36 age bracket, and those with a university degree in music.
Article
Even if the value of the risk factor evolves continuously in the extended Cox model, the inference may be somewhat biased because it employs a discrete approximation method. As a result, if the risk factor's value not only fluctuates constantly but also contains measurement error, a model that substitutes the average value rather than the measured value of the risk factor might be considered. Such a model is known as a joint model. After introducing the present-value model, the most widely used among joint models, this model was extended to various types of data. In addition, several residuals for model diagnosis were introduced, methods for predicting the probability of event occurrence and the value of the longitudinal risk factor by application of the estimation model were introduced, and an index that can evaluate how well the longitudinal risk factors divide patients into high-risk and low-risk groups was introduced. Finally, all the statistical inference methods introduced in this study were implemented using the JM R package and the source codes were supplied.
Article
We derive approximations to the bias and squared bias with errors of order o(1/n), where n is the sample size. Our results hold for a large class of estimators, including quantiles, transformations of unbiased estimators, maximum likelihood estimators in (possibly) incorrectly specified models and functions thereof. Furthermore, we use the approximations to derive estimators of the mean squared error (MSE) which are correct to order o(1/n). Since the variance of many estimators is of order O(1/n), this level of precision is needed for the mean squared error estimator to properly take the variance into account. We also formulate a new focused information criterion (FIC) for model selection based on the estimators of the squared bias. Lastly, we illustrate the methods on data containing the number of battle deaths in all major inter-state wars between 1823 and the present day. The application illustrates the potentially large impact of using a less accurate estimator of the squared bias.
Article
Full-text available
Motivation Technologies identifying single nucleotide polymorphisms (SNPs) in DNA sequencing yield an avalanche of data requiring analysis and interpretation. Standard methods may require many weeks of processing time. The use of statistical methods requiring data sorting, high-dimensional matrix inversions and replication in subsets of the data on multiple outcomes exacerbates these times. A method which reduces the computational time in problems with time-to-event outcomes and hundreds of thousands/millions of SNPs, using Cox-Snell residuals after fitting the Cox proportional hazards model (PH) to a fixed set of concomitant variables, is proposed. This yields coefficients for the SNP effect from a Cox-Snell adjusted Poisson model and shows a high concordance with the adjusted PH model. The method is illustrated with a sample of 10,000 SNPs from a genome-wide association study (GWAS) in a diabetic population. The gain in processing efficiency using the proposed method based on Poisson modelling can be as high as 61%. This could result in a saving of over three weeks of processing time if 5 million SNPs require analysis. The method involves only a single predictor variable (SNP), offering a simpler, computationally more stable approach to examining and identifying SNP patterns associated with the outcome(s), allowing for a faster development of genetic signatures. Use of deviance residuals from the PH model to screen SNPs demonstrates a large discordance rate at a 0.2% threshold of concordance. This rate is 15 times larger than that based on the Cox-Snell residuals from the Cox-Snell adjusted Poisson model.
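A hedged sketch of the kind of two-stage computation the abstract describes (an illustration, not the authors' code, with all names and data invented): Cox-Snell residuals from a covariate-adjusted null survival model are used as a Poisson offset, so each SNP requires only a cheap two-parameter GLM fit instead of a full proportional hazards fit.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 2000

# Illustrative data: baseline hazard 0.1, one clinical covariate, one SNP
clinical = rng.normal(size=n)
snp = rng.binomial(2, 0.3, size=n).astype(float)     # genotype coded 0/1/2
hazard = 0.1 * np.exp(0.5 * clinical + 0.3 * snp)
t_event = rng.exponential(1.0 / hazard)
t_cens = rng.exponential(10.0, size=n)
time = np.minimum(t_event, t_cens)
delta = (t_event <= t_cens).astype(float)            # event indicator

# Stage 1 (stand-in): Cox-Snell residuals from the covariate-adjusted null model.
# In practice these come from a PH model fitted to the clinical covariates only;
# here the known covariate-adjusted cumulative hazard plays that role.
cox_snell = 0.1 * np.exp(0.5 * clinical) * time

# Stage 2: Poisson GLM of the event indicator on the SNP with a log Cox-Snell offset
X = sm.add_constant(snp)
fit = sm.GLM(delta, X, family=sm.families.Poisson(),
             offset=np.log(cox_snell)).fit()
print(fit.params)   # slope should be close to the SNP log hazard ratio (0.3 here)
```

Repeating only Stage 2 across SNPs is what keeps the per-SNP cost small.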
Article
Full-text available
The depth of information collected in participants’ daily lives with active (e.g., experience sampling surveys) and passive (e.g., smartphone sensors) ambulatory measurement methods is immense. When measuring participants’ behaviors in daily life, the timing of particular events—such as social interactions—is often recorded. These data facilitate the investigation of new types of research questions about the timing of those events, including whether individuals’ affective state is associated with the rate of social interactions (binary event occurrence) and what types of social interactions are likely to occur (multicategory event occurrences, e.g., interactions with friends or family). Although survival analysis methods have been used to analyze time-to-event data in longitudinal settings for several decades, these methods have not yet been incorporated into ambulatory assessment research. This article illustrates how multilevel and multistate survival analysis methods can be used to model the social interaction dynamics captured in intensive longitudinal data, specifically when individuals exhibit particular categories of behavior. We provide an introduction to these models and a tutorial on how the timing and type of social interactions can be modeled using the R statistical programming language. Using event-contingent reports (N = 150, Nevents = 64,112) obtained in an ambulatory study of interpersonal interactions, we further exemplify an empirical application case. In sum, this article demonstrates how survival models can advance the understanding of (social interaction) dynamics that unfold in daily life.
Article
Full-text available
A new probability distribution is proposed in this paper. The new distribution has support on the unit interval and was obtained by transforming a random variable with an exponential distribution. The mode, quantile function, median and ordinary moments are derived, and it is shown that the density function belongs to the exponential family of distributions. The maximum likelihood method is used to obtain the parameter estimates. A regression model for the median of the distribution is also proposed. Closed-form expressions for the score vector and Fisher's information matrix are derived. A simulation study and an application to real data show the good performance of the proposed regression model.
Article
Full-text available
In this paper, an attempt has been made to derive three new trivariate Proportional Hazards models under the frailty approach with different baseline hazard functions. The proposed models are illustrated with a real-life survival data set and the posterior inferences are drawn using Markov Chain Monte Carlo (MCMC) simulation methods. For model comparison, two popular model choice criteria, the deviance information criterion (DIC) and the log pseudo marginal likelihood (LPML), are employed, and the Cox-Snell residual plot is used to check the fit of the models.
Chapter
Sir David R. Cox was a leading statistician, perhaps the foremost of his era. He made major contributions in diverse areas of statistics and probability and mentored many who subsequently became leaders of the profession. He published over 20 books and 350+ papers.
Article
Full-text available
In this paper, likelihood-based inference and bias correction based on Firth's approach are developed for the modified skew-t-normal (MStN) distribution. The latter model exhibits greater flexibility than the modified skew-normal (MSN) distribution since it is able to model heavily skewed data and thick tails. In addition, the tails are controlled by the shape parameter and the degrees of freedom. We provide the density of this new distribution and present some of its more important properties, including a general expression for the moments. The Fisher information matrix together with the observed matrix associated with the log-likelihood is also given. Furthermore, the non-singularity of the Fisher information matrix for the MStN model is demonstrated when the shape parameter is zero. As the MStN model presents an inferential problem in the shape parameter, Firth's method for bias reduction was applied for the scalar case and for the location and scale case.
Article
In this article we propose a two-step generalized method of moments (GMM) procedure for a Spatial Binary Probit Model. In particular, we propose a series of two-step estimators based on different choices of the weighting matrix for the moment conditions in the first step, and different estimators for the variance–covariance matrix of the estimated coefficients. In the context of a Monte Carlo experiment, we compare the properties of these estimators, a linearized version of the one-step GMM and the recursive importance sampler (RIS). Our findings reveal that there are benefits related both to the choice of the weight matrix for the moment conditions and to adopting a two-step procedure.
Article
In the analysis of data it is often assumed that observations y1, y2, …, yn are independently normally distributed with constant variance and with expectations specified by a model linear in a set of parameters θ. In this paper we make the less restrictive assumption that such a normal, homoscedastic, linear model is appropriate after some suitable transformation has been applied to the y's. Inferences about the transformation and about the parameters of the linear model are made by computing the likelihood function and the relevant posterior distribution. The contributions of normality, homoscedasticity and additivity to the transformation are separated. The relation of the present methods to earlier procedures for finding transformations is discussed. The methods are illustrated with examples.
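The family of power transformations analysed in this paper is usually written as

$$ y^{(\lambda)} = \begin{cases} (y^{\lambda} - 1)/\lambda, & \lambda \neq 0,\\ \log y, & \lambda = 0, \end{cases} $$

and $\lambda$ is chosen by maximising the profile log-likelihood of the normal linear model for $y^{(\lambda)}$, which includes the Jacobian term $(\lambda - 1)\sum_i \log y_i$.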
Article
For a distribution depending on a single parameter the first four sampling moments of the maximum-likelihood estimate to orders N^{-2}, N^{-3}, N^{-3} and N^{-4} respectively are given. Expressions for the measures of skewness γ_1 and γ_2 are also given. Several illustrative examples are included as a check on the heavy algebra. The paper extends earlier work by Haldane and Smith.
Article
Gunnar Blom. 'Transformations of the binomial, negative binomial, Poisson and χ² distributions', Biometrika (1954), 41, 302.
Article
In an earlier paper (Durbin & Watson, 1950) the authors investigated the problem of testing the error terms of a regression model for serial correlation. Test criteria were put forward, their moments calculated, and bounds to their distribution functions were obtained. In the present paper these bounds are tabulated and their use in practice is described. For cases in which the bounds do not settle the question of significance an approximate method is suggested. Expressions are given for the mean and variance of a test statistic for one- and two-way classifications and polynomial trends, leading to approximate tests for these cases. The procedures described should be capable of application by the practical worker without reference to the earlier paper (hereinafter referred to as Part I).
Article
The paper considers a number of problems arising from the test of serial correlation based on the d statistic proposed earlier by the authors (Durbin & Watson, 1950, 1951). Methods of computing the exact distribution of d are investigated and the exact distribution is compared with six approximations to it for four sets of published data. It is found that approximations suggested by Theil and Nagar and by Hannan are too inaccurate for practical use but that the beta approximation proposed in the 1950 and 1951 papers and a new approximation, called by us the a + b·d_U approximation and based, like the beta approximation, on the exact first two moments of d, both perform well. The power of the d test is compared with that of certain exact tests proposed by Theil, Durbin, Koerts and Abrahamse from the standpoint of invariance theory. It is shown that the d test is locally most powerful invariant but that the other tests are not. There are three appendices. The first gives an account of the exact distribution of d. The second derives the mean and variance to a second order of approximation of a modified maximum likelihood statistic closely related to d. The third sets out details of the computations required for the a + b·d_U approximation.
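For reference, the d statistic discussed in these two papers is computed from the least-squares residuals $e_1, \dots, e_n$ as

$$ d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} \approx 2\,(1 - \hat\rho_1), $$

where $\hat\rho_1$ is the lag-one sample autocorrelation of the residuals, so values near 2 indicate little serial correlation and values well below 2 suggest positive serial correlation.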
An analysis of transformations
  • Box G. E. P.