Article

A General Definition of Residuals

Authors: D. R. Cox and E. J. Snell

Abstract

Residuals are usually defined in connection with linear models. Here a more general definition is given and some asymptotic properties found. Some illustrative examples are discussed, including a regression problem involving exponentially distributed errors and some problems concerning Poisson and binomially distributed observations.
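The regression example with exponentially distributed errors mentioned in the abstract can be sketched numerically. The code below is a minimal illustration of the idea (a sketch, not the authors' computation): responses are simulated with conditional mean exp(x_i^T beta), the coefficients are estimated by maximum likelihood, and the residuals R_i = Y_i exp(-x_i^T beta_hat) are compared with a unit-exponential sample, which is how they should behave under a correctly specified model. All variable names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import kstest

rng = np.random.default_rng(0)

# Simulate an exponential regression: E(Y_i) = exp(b0 + b1 * x_i)
n = 500
x = rng.uniform(-1, 1, n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([0.5, 1.0])
y = rng.exponential(scale=np.exp(X @ beta_true))

# Maximum likelihood for the exponential regression with a log link
def neg_loglik(beta):
    mu = np.exp(X @ beta)
    return np.sum(np.log(mu) + y / mu)

beta_hat = minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x

# Generalized (Cox-Snell type) residuals: approximately i.i.d. Exp(1)
# when the model is correctly specified
resid = y * np.exp(-X @ beta_hat)
print("KS test against Exp(1):", kstest(resid, "expon"))
```

Plotting the ordered residuals against unit-exponential quantiles gives the corresponding graphical check.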

... , n be the outcome of interest and X_i be the set of covariates. Under a parametric model M, which synthesizes information including the distribution family and regressors, we denote r(Y_i | X_i) as the generalized model error formulated in Cox and Snell (1968). In a linear regression model, the error is ...
... where β represents the coefficients. Cox and Snell (1968) generalized the concept of model errors beyond normality by seeking independently identically distributed (i.i.d.) unobserved variables. For example, the generalized error for a continuous outcome, such as a gamma variable, can be defined as the uniformly distributed probability integral transform r(Y_i | X_i) = F(Y_i | X_i), wherein F is the conditional distribution of Y_i given X_i. ...
... For the same reason, Cox-Snell residuals (Cox and Snell 1968), which serve as a compelling diagnostic tool for continuous outcomes, lose effectiveness for discrete outcomes. ...
Preprint
Full-text available
The assessment of regression models with discrete outcomes is challenging and has many fundamental issues. With discrete outcomes, standard regression model assessment tools such as Pearson and deviance residuals do not follow the conventional reference distribution (normal) under the true model, calling into question the legitimacy of model assessment based on these tools. To fill this gap, we construct a new type of residuals for general discrete outcomes, including ordinal and count outcomes. The proposed residuals are based on two layers of probability integral transformation. When at least one continuous covariate is available, the proposed residuals closely follow a uniform distribution (a normal distribution after transformation) under the correctly specified model. One can construct visualizations such as QQ plots to check the overall fit of a model straightforwardly, and the shape of QQ plots can further help identify possible causes of misspecification such as overdispersion. We provide theoretical justification for the proposed residuals by establishing their asymptotic properties. Moreover, in order to assess the mean structure and identify potential covariates, we develop an ordered curve as a supplementary tool, which is based on the comparison between the partial sum of outcomes and of fitted means. Through simulation, we demonstrate empirically that the proposed tools outperform commonly used residuals for various model assessment tasks. We also illustrate the workflow of model assessment using the proposed tools in data analysis.
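For the continuous case quoted in the excerpts above, the generalized error r(Y_i | X_i) = F(Y_i | X_i) is the probability integral transform, which is uniform on (0, 1) under a correctly specified model. The sketch below illustrates only this single transform for a gamma response (it is not the preprint's two-layer construction for discrete outcomes); for brevity the true simulation parameters stand in for the fitted conditional distribution that would be used in practice.

```python
import numpy as np
from scipy.stats import gamma, kstest, norm

rng = np.random.default_rng(1)

# Gamma responses whose mean depends on a covariate (shape held fixed)
n, shape = 1000, 2.0
x = rng.uniform(0, 1, n)
scale = np.exp(0.3 + 0.8 * x) / shape      # so that E(Y | x) = exp(0.3 + 0.8 x)
y = rng.gamma(shape, scale)

# Probability-integral-transform residuals r_i = F(y_i | x_i);
# the true parameters stand in for fitted ones here
r = gamma.cdf(y, a=shape, scale=scale)
print("KS test against U(0,1):", kstest(r, "uniform"))

# Normal-scale version, often easier to inspect on a QQ plot
z = norm.ppf(r)
```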
... According to [57], the bias of θ̂_w (the MLE of θ_w), for independent observations that are not necessarily identically distributed, can be written as: ...
... In this section, we analyze three real datasets that can be better fitted using our proposed semiparametric approach, which are made available in Appendix 5. For each data set, we present several key metrics: the p-value derived from the Shapiro-Wilk normality test [66], the AIC and BIC criteria values (utilized for determining the optimal number of changepoints for the PE model), the Kaplan-Meier estimate of the reliability function [67], the Cox-Snell residuals [57], and the p-value derived from the Cramér-von Mises test [68,69], which evaluates the goodness-of-fit of our model. Additionally, we provide the estimated PCIs of the PE model, offering a comprehensive assessment of our approach's efficacy across diverse data sets. ...
Article
Full-text available
Piecewise models have gained popularity as a useful tool in reliability and quality control/monitoring, particularly when the process data deviates from a normal distribution. In this study, we develop maximum likelihood estimators (MLEs) for the process capability indices, denoted as C pk , C pm , C * pm and C pmk , using a semiparametric model. To remove the bias in the MLEs with small sample sizes, we propose a bias-correction approach to obtain improved estimates. Furthermore, we extend the proposed method to situations where the change-points in the density function are unknown. To estimate the model parameters efficiently, we employ the profiled maximum likelihood approach. Our simulation study reveals that the suggested method yields accurate estimates with low bias and mean squared error. Finally, we provide real-world data applications to demonstrate the superiority of the proposed procedure over existing ones.
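The first-order bias expression referred to in these excerpts is usually quoted in the following form (a standard restatement from the bias-correction literature rather than a quotation of the 1968 paper): for the MLE of the a-th parameter,

$$ b(\hat\theta_a) = \sum_{r,s,t} \kappa^{ar}\,\kappa^{st}\left(\tfrac{1}{2}\kappa_{rst} + \kappa_{rs,t}\right) + O(n^{-2}), $$

where $\kappa_{rst} = E(\partial^3 \ell / \partial\theta_r\,\partial\theta_s\,\partial\theta_t)$, $\kappa_{rs,t} = E\{(\partial^2 \ell / \partial\theta_r\,\partial\theta_s)(\partial \ell / \partial\theta_t)\}$, and $\kappa^{rs}$ is the $(r,s)$ element of the inverse of the expected information matrix $\{-\kappa_{rs}\}$. Subtracting an estimate of $b(\hat\theta)$ from $\hat\theta$ gives the second-order bias-corrected estimator used in several of the works listed here.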
... Previous studies reported different risk factors associated with mortality of stroke patients, like older age [17][18][19]; body temperature greater than 7.1 degrees centigrade, potassium level below 2 mmol/l, and creatinine level > 1.2 mg/dl [16]; type of stroke, diabetes, and severity of the stroke [18]; and atrial fibrillation and lower education status [20], all of which were associated with lower survival of stroke patients. Another study found that removing the feeding gastrostomy tube (FGT) at discharge from the rehabilitation hospital, as well as not aspirating during videofluoroscopic surgery (VSS), was linked to a longer survival time for stroke patients [21]. ...
... After a model is fitted, its adequacy needs to be assessed. The methods involved in this study are checking the adequacy of the parametric baselines and the Cox-Snell residuals [19]. ...
Article
Full-text available
Background Stroke is a life-threatening condition that occurs due to impaired blood flow to brain tissues. Every year, about 15 million people worldwide suffer from a stroke, with five million of them suffering from some form of permanent physical disability. Globally, stroke is the second-leading cause of death following ischemic heart disease. It is a public health burden for both developed and developing nations, including Ethiopia. Objectives This study is aimed at estimating the time to death among stroke patients at Jimma University Medical Center, Southwest Ethiopia. Methods A facility-based retrospective cohort study was conducted among 432 patients. The data were collected from stroke patients under follow-up at Jimma University Medical Center from January 1, 2016, to January 30, 2019. A log-rank test was used to compare the survival experiences of different categories of patients. The Cox proportional hazards model and the accelerated failure time model were used to analyze the survival of stroke patients using R software. Akaike's information criterion was used to compare the fitted models. Results Of the 432 stroke patients followed, 223 (51.6%) experienced the event of death. The median time to death among the patients was 15 days. According to the results of the Weibull accelerated failure time model, the age of patients, atrial fibrillation, alcohol consumption, types of stroke diagnosed, hypertension, and diabetes mellitus were found to be the significant prognostic factors that contribute to shorter survival times among stroke patients. Conclusion The Weibull accelerated failure time model better described the time to death of the stroke patients' data set than other distributions used in this study. Patients' age, atrial fibrillation, alcohol consumption, being diagnosed with hemorrhagic types of stroke, having hypertension, and having diabetes mellitus were found to be factors shortening survival time to death for stroke patients. Hence, healthcare professionals need to thoroughly follow the patients who have these risk factors. Moreover, patients need to be educated about lifestyle modifications.
... The MLEs for a statistical model are known to exhibit bias of the order O(n^{-1}) (see Cox and Snell [15]). The Cox-Snell methodology can be used to correct for the expected bias and to improve the precision of the estimator. ...
... Using this expression, Cox and Snell [15] proved that the bias of θ̂_m can be written as ...
Article
Full-text available
The power-law distribution plays a crucial role in complex networks as well as various applied sciences. Investigating whether the degree distribution of a network follows a power-law distribution x^(−α) , with 2 < α < 3 (scale-free hypothesis), is an important concern. The commonly used inferential methods for estimating the model parameters often yield biased estimates, which can lead to the rejection of the hypothesis that a model conforms to a power-law. In this paper, we discuss improved methods that utilize Bayesian inference to obtain accurate estimates and precise credibility intervals. The inferential methods are derived for both continuous and discrete distributions. These methods reveal that objective Bayesian approaches return nearly unbiased estimates for the parameters of both models. Notably, in the continuous case, we identify an explicit posterior distribution. This work enhances the power of goodness-of-fit tests, enabling us to accurately discern whether a network or any other dataset adheres to a power-law distribution. We apply the proposed approach to fit degree distributions for more than 5,000 synthetic networks and over 3,000 real networks. The results indicate that our method is more suitable in practice, as it yields a frequency of acceptance close to the specified nominal level.
... By using the convergence result for exchangeable random variables in Billingsley (1968), for 0 ≤ p ≤ 1, ... In contrast, Klein and Wu (2003) considered the following Cox model allowing an additional p-dimensional covariate vector Z ∈ ℝ^p: ... where * (∈ ℝ^p) denotes the regression coefficient and the symbol T denotes the transpose. They developed a test statistic based on the Cox and Snell (1968) ...
Article
Full-text available
Hypothesis testing for the regression coefficient associated with a dichotomized continuous covariate in a Cox proportional hazards model has been considered in clinical research. Although most existing testing methods do not allow covariates, except for a dichotomized continuous covariate, they have generally been applied. Through an analytic bias analysis and a numerical study, we show that the current practice is not free from an inflated type I error and a loss of power. To overcome this limitation, we develop a bootstrap-based test that allows additional covariates and dichotomizes two-dimensional covariates into a binary variable. In addition, we develop an efficient algorithm to speed up the calculation of the proposed test statistic. Our numerical study demonstrates that the proposed bootstrap-based test maintains the type I error well at the nominal level and exhibits higher power than other methods, as well as that the proposed efficient algorithm reduces computational costs.
... Cox-Snell residuals are another form of residuals derived from the following relationship: ê_i = −log[1 − G(y_i, β̂, μ̂_i)]. Detailed information on Cox-Snell residuals can be found in [27]. When the model fits the data appropriately, these residuals should follow an exponential distribution with a scale parameter of 1. ...
Preprint
Full-text available
In practical scenarios, data measurements like ratios and proportions often fall within the 0 to 1 range. Analyzing such bounded data introduces unique modeling challenges, prompting statisticians to explore new distributions that can effectively handle this context. Although beta and Kumaraswamy distributions, along with their related regression models, have gained popularity for examining the relationship between bounded response variables and covariates, several alternative models have shown superior performance compared to these two. However, there is still no agreement on the most effective alternative models. Consequently, this paper introduces a novel bounded probability distribution derived from transforming the Weibull distribution. Our investigation has revealed several interesting properties, including various moments and their generating function, entropies, quantile function, and a linear form of the proposed model. Additionally, we have developed the sequential probability ratio test (SPRT) for the proposed model. The maximum likelihood estimation method was employed to estimate the model parameters. A Monte Carlo simulation was conducted to evaluate the performance of parameter estimation for the model. Finally, we formulated a quantile regression model and applied it to data sets related to risk assessment and educational attainment, demonstrating its superior performance over alternative regression models. These results highlight the importance of our contributions to enhancing the statistical toolkit for analyzing bounded variables across different scientific fields.
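The unit-exponential behaviour of the residuals quoted above is just the probability integral transform: if $Y$ has continuous distribution function $G$, then $U = G(Y) \sim \mathrm{U}(0,1)$ and

$$ \hat e = -\log\{1 - G(Y)\} \sim \operatorname{Exp}(1), $$

since $P\{-\log(1-U) > t\} = P\{U > 1 - e^{-t}\} = e^{-t}$ for $t \ge 0$. In practice $G$ is replaced by its fitted counterpart, so the residuals are only approximately standard exponential.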
... We adopted the idea of the residuals for survival analysis pioneered in [39] using a transformation of the cumulative hazard function (CHF) associated with the PEW regression model. For the regression models normally employed in survival analysis, the Cox-Snell residuals should behave like a censored sample from a standard exponential distribution when the fit is adequate [38]. ...
Article
Full-text available
The use of cure-rate survival models has grown in recent years. Even so, proposals to perform the goodness of fit of these models have not been so frequent. However, residual analysis can be used to check the adequacy of a fitted regression model. In this context, we provide Cox–Snell residuals for Poisson-exponentiated Weibull regression with cure fraction. We developed several simulations under different scenarios for studying the distributions of these residuals. They were applied to a melanoma dataset for illustrative purposes.
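The graphical check described above can be sketched generically: compute r_i = H_hat(t_i | x_i) from the fitted model, treat (r_i, delta_i) as a possibly censored sample, estimate its cumulative hazard with the Nelson-Aalen estimator, and compare it with the 45-degree line expected for a unit exponential. This is a generic sketch, not the authors' implementation, and the inputs cox_snell and event below are placeholders for quantities produced by whatever model was fitted.

```python
import numpy as np

def nelson_aalen(time, event):
    """Nelson-Aalen estimate of the cumulative hazard for right-censored data."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    at_risk = len(time) - np.arange(len(time))   # subjects still at risk at each ordered time
    return time, np.cumsum(event / at_risk)      # ties handled crudely

# Placeholder inputs: Cox-Snell residuals and event indicators from a fitted model
rng = np.random.default_rng(2)
cox_snell = rng.exponential(size=200)            # stands in for H_hat(t_i | x_i)
event = rng.binomial(1, 0.8, size=200)           # 1 = observed event, 0 = censored

t, H = nelson_aalen(cox_snell, event)
# Under an adequate model the points (t, H) should lie close to the line H = t.
print(np.column_stack([t[:5], H[:5]]))
```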
... This simulation-based approach is facilitated in FRK v2 with the function simulate(). In particular, simulations generated with simulate() may be used with the R package DHARMa (Hartig 2022) which, given observed data and simulations from a fitted model, computes interpretable, simulation-based quantile residuals (Cox and Snell 1968; Dunn and Smyth 1996). Under the true model, these residuals always follow a standard uniform distribution, which greatly facilitates their interpretation. ...
Article
Full-text available
Non-Gaussian spatial and spatio-temporal data are becoming increasingly prevalent, and their analysis is needed in a variety of disciplines. FRK is an R package for spatial and spatio-temporal modeling and prediction with very large data sets that, to date, has only supported linear process models and Gaussian data models. In this paper, we describe a major upgrade to FRK that allows for non-Gaussian data to be analyzed in a generalized linear mixed model framework. These vastly more general spatial and spatio-temporal models are fitted using the Laplace approximation via the software TMB. The existing functionality of FRK is retained with this advance into non-Gaussian models; in particular, it allows for automatic basis-function construction, it can handle both point-referenced and areal data simultaneously, and it can predict process values at any spatial support from these data. This new version of FRK also allows for the use of a large number of basis functions when modeling the spatial process, and thus it is often able to achieve more accurate predictions than previous versions of the package in a Gaussian setting. We demonstrate innovative features in this new version of FRK, highlight its ease of use, and compare it to alternative packages using both simulated and real data sets.
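The simulation-based quantile residuals mentioned in the excerpt (in the spirit of DHARMa, not its actual code) can be obtained by simulating many replicates from the fitted model and locating each observation within its own simulated predictive distribution; under the true model the resulting values are approximately standard uniform. A generic numpy sketch, with a Poisson model standing in for the fitted model:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for a fitted model: Poisson with observation-specific fitted means
n, n_sim = 300, 1000
mu_hat = np.exp(rng.normal(0.5, 0.4, n))   # illustrative fitted means
y_obs = rng.poisson(mu_hat)                # observed data (here generated from the same model)

# Simulate n_sim replicates of each observation from the fitted model
y_sim = rng.poisson(mu_hat, size=(n_sim, n))

# Quantile residual: randomized position of the observation within its simulations;
# randomization breaks the ties that discrete data produce
below = (y_sim < y_obs).mean(axis=0)
equal = (y_sim == y_obs).mean(axis=0)
u = below + rng.uniform(0, 1, n) * equal
# Under a correctly specified model, u is approximately Uniform(0, 1)
print(u.min(), u.mean(), u.max())
```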
... This also implies that the model was fit to the data. The value of the Cox & Snell R Square (Cox & Snell, 1968) is 0.119, and the value of the Nagelkerke (1991) R Square is 0.459, indicating that between 11.9% and 45.9% of the variability in the outcome variable was explained by the predictor variables. We then examined the validity of the predicted probabilities based on the classification table. ...
Article
Full-text available
The purpose of this study was to predict the likelihood that teachers would consider the Geometric AR-based Pedagogical Module (GeAR-PM) as beneficial, and study the factors that influence their decision. The data for this study was gathered via a survey method. A sample of 202 Malaysian secondary school teachers was randomly selected as respondents for this study. Specifically, the respondents were asked to answer questions using an instrument which was developed based on the Unified Theory of Acceptance and Use of Technology (UTAUT) model. This study comprised four UTAUT variables: performance expectation, effort expectancy, social influence, and facilitating conditions, and one additional variable, self-efficacy. All variables showed high reliability with Cronbach Alpha’s values between 0.87 to 0.91. The data was analyzed using a binary logistic regression via Statistical Package for the Social Sciences (SPSS) version 27.0. The findings revealed that a teacher's belief that GeAR-PM is beneficial was significantly related to his/ her perception on effort expectancy, social influence, and self-efficacy. Hence, when developing the GeAR-PM, these three factors should be weighed up.
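The two pseudo-R-squared measures quoted above have simple closed forms. With $L_0$ the likelihood of the intercept-only model, $L_M$ the likelihood of the fitted model and $n$ the sample size,

$$ R^2_{CS} = 1 - \left(\frac{L_0}{L_M}\right)^{2/n}, \qquad R^2_{N} = \frac{R^2_{CS}}{1 - L_0^{2/n}}. $$

Because $R^2_{CS}$ cannot exceed $1 - L_0^{2/n}$, Nagelkerke's version rescales it to the full $[0, 1]$ range, which is why the two reported values (0.119 and 0.459) differ so markedly.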
... For GBcr-PE with q = 1 and r = 2, Figure 4 shows the QQ-plot for the quantile residuals (left panel) and the KM estimator for the Cox-Snell residuals [25] (right panel). On the other hand, we also applied some common normality tests to check the validity of the quantile residuals, such as the Kolmogorov-Smirnov (KS, [26]), Shapiro-Wilk (SW, [27]), Anderson-Darling (AD, [28]), and Cramér-von Mises (CVM, [29]). ...
Article
Full-text available
A novel cure rate model is introduced by considering, for the number of concurrent causes, the modified power series distribution and, for the time to event, the recently proposed power piecewise exponential distribution. This model includes a wide variety of cure rate models, such as binomial, Poisson, negative binomial, Haight, Borel, logarithmic, and restricted generalized Poisson. Some characteristics of the model are examined, and the estimation of parameters is performed using the Expectation–Maximization algorithm. A simulation study is presented to evaluate the performance of the estimators in finite samples. Finally, an application in a real medical dataset from a population-based study of incident cases of lobular carcinoma diagnosed in the state of São Paulo, Brazil, illustrates the advantages of the proposed model compared to other common cure rate models in the literature, particularly regarding the underestimation of the cure rate in other proposals and the improved precision in estimating the cure rate of our proposal.
... Let the two quantities denote the biases of the component MLEs and of the remaining MLE, respectively. The use of Cox and Snell [29]'s formula to obtain these biases is simplified, since the two sets of parameters are globally orthogonal, so the cross terms of the expected information vanish. We have that ...
... Cox-Snell residuals estimate the difference between the predicted and real (observed) value of the observation. To do so, it applies a negative logarithmic transformation to the survival probability of each observation (Cox and Snell, 1968). The Cox-Snell plots are characterized by the cumulative hazard on the y-axis, and the Cox-Snell residuals on the x-axis. ...
Article
The aim of the paper is to investigate the impact of the extension of a metro line on the survival of individual firms. An empirical analysis focuses on the relation between proximity to new stations and firm survival following the announcement of the extension of the orange line of the metro in Montréal (Canada) between 1996 and 2016. To do so, a Cox Proportional Hazards model is estimated using the simultaneous announcement of two potential extensions to define the treatment (new stations) and control groups (speculative stations) based on the distance to the closest stations. The model explicitly controls for anticipation and speculation effect by introducing three distinct treatment effects by period. The impact of the new metro stations appears to be mostly positive on firms' survival probability during the construction period and after the opening of the service. The results suggest that the metro extension does have a positive influence on the survival rates of individual firms within 250 to 1,250 m. of the closest station, especially for local activities.
... Survival heatmaps developed for the new packaging materials are further discussed in Section 3. Various metrics can be used to select the best suited models and to check their performance. In survival analysis, the most common metrics are the Akaike Information Criterion (AIC) (Akaike, 1974), the Bayesian Information Criterion (BIC) (Schwarz, 1978) and the Cox-Snell Residuals (Cox & Snell, 1968). The results presented in the next section were obtained using the AIC selection criterion, but a similar conclusion was obtained when the BIC criterion was used instead. ...
Article
Full-text available
Effective packaging solutions are pivotal in addressing sustainability challenges associated with food consumption. Sustainable packaging must not only reduce ecological impact but also ensure food quality preservation, minimize the risk of foodborne diseases, and contribute to mitigating the accumulation of plastic waste. While paper-based food packaging is environmentally friendly, it is susceptible to mold growth when exposed to specific humidity and temperatures, potentially leading to spoilage and safety concerns. This study introduces an innovative modeling approach employing survival analysis to estimate time to visible mold growth on paper-based materials used in food-contact packaging under varying environmental conditions. The objective is to establish an alternative statistical framework for assessing the impact of relative humidity and temperature on mold proliferation. This approach is particularly relevant to novel sustainable packaging materials in the food industry, especially when dealing with censored data. Survival heatmaps are developed to simplify the visualization of spoilage likelihood, measured by time to visible growth at a given confidence level. Although the survival model is proposed for paper-based materials, there are no theoretical limitations to its extension for use with other materials within and beyond the food industry.
... Figure 1 and Figure 2 show the trace plots of the parameters for the fitted models, and Figure 3 and Figure 4 show the posterior density plots of the parameters for the fitted models. For model diagnostics, we consider the residual plot of (16): a plot of the estimated cumulative hazard function (based on the Cox and Snell residuals and the censored data) versus the Cox and Snell residuals. If the model provides a good fit to the data, we expect a straight line through the origin with slope 1. ...
Article
Full-text available
Objectives: The objective of this study is to develop a new Proportional Odds frailty model by using a Weibull hazard function in the context of a Bayesian mechanism. Methods: Frailty models provide a convenient way to introduce random effects, association and unobserved heterogeneity into models for survival data. The Proportional Odds (PO) model is a widely pursued model in survival analysis which extends the concept of the odds ratio to lifetime data with covariates. Proportional Odds models can be derived under the frailty approach by introducing a frailty term for each individual in the exponent of the hazard function, which acts multiplicatively on the baseline hazard function. In this paper an attempt has been made to develop a new Proportional Odds frailty model by using a Weibull hazard function in the context of a Bayesian mechanism. Findings: The methodologies are applied to a real-life survival data set and the posterior inferences are drawn using Markov Chain Monte Carlo (MCMC) simulation methods; model comparison tools like the Deviance Information Criterion (DIC) and the Log Pseudo Marginal Likelihood (LPML) are also calculated, and a Cox-Snell residual plot is employed to check the fit of the model. The performance of the newly developed model is compared with an existing proportional odds model without the frailty term, and it is observed that the newly developed frailty model performs well compared to the traditional non-frailty model. Novelty: A new Proportional Odds frailty model using a Weibull hazard function under a Bayesian mechanism is developed in this paper, which is an added contribution to the field of Survival Analysis.
... Furthermore, the model_diagnostics() function facilitates diagnostic assessments through analysis of the residuals. It supports the calculation and visualization of martingale residuals, deviance residuals (Therneau et al. 1990), and Cox-Snell residuals (Cox and Snell 1968). ...
Article
Full-text available
Due to their flexibility and superior performance, machine learning models frequently complement and outperform traditional statistical survival models. However, their widespread adoption is hindered by a lack of user-friendly tools to explain their internal operations and prediction rationales. To tackle this issue, we introduce the survex R package, which provides a cohesive framework for explaining any survival model by applying explainable artificial intelligence techniques. The capabilities of the proposed software encompass understanding and diagnosing survival models, which can lead to their improvement. By revealing insights into the decision-making process, such as variable effects and importances, survex enables the assessment of model reliability and the detection of biases. Thus, transparency and responsibility may be promoted in sensitive areas, such as biomedical research and healthcare applications. Availability and Implementation survex is available under the GPL3 public license at https://github.com/modeloriented/survex and on CRAN with documentation available at https://modeloriented.github.io/survex.
... In general, one way to validate the adequacy of any fitted regression model is to check whether the residuals of the model are well behaved. In reliability studies, the Cox-Snell residual by Cox and Snell [32] is a kind of standardized residual that is widely used for assessing the goodness-of-fit of the fitted regression model, and it is given by ...
Article
Full-text available
A two-parameter unit distribution and its regression model plus its extension to 0 and 1 inflation is introduced and studied. The distribution is called the unit upper truncated Weibull (UUTW) distribution, while the inflated variant is called the 0−1 inflated unit upper truncated Weibull (ZOIUUTW) distribution. The UUTW distribution has an increasing and a J-shaped hazard rate function. The parameters of the proposed models are estimated by the method of maximum likelihood estimation. For the UUTW distribution, two practical examples involving household expenditure and maximum flood level data are used to show its flexibility and the proposed distribution demonstrates better fit tendencies than some of the competing unit distributions. Application of the proposed regression model demonstrates adequate capability in describing the real data set with better modeling proficiency than the existing competing models. Then, for the ZOIUUTW distribution, the CD34+ data involving cancer patients are analyzed to show the flexibility of the model in characterizing inflation at both endpoints of the unit interval.
... The MLEs are typically biased of order O(n^{-1}), and these errors diminish as the sample size increases (Cordeiro ...). By applying the corrective Cox-Snell methodology (Cox and Snell 1968), the expected bias can be accounted for and corrected to improve the precision of the estimator. The modification of the MLEs returned corrected estimators that are bias-free up to the second order. ...
Article
Full-text available
Piecewise models play a crucial role in statistical analysis as they allow the same pattern to be adjusted over different regions of the data, achieving a higher quality of fit than would be obtained by fitting them all at once. The standard piecewise linear distribution assumes that the hazard rate is constant between each change point. However, this assumption may be unrealistic in many applications. To address this issue, we introduce a piecewise distribution based on the power-law model. The proposed semi-parametric distribution boasts excellent properties and features a non-constant hazard function between change points. We discuss parameter estimates using the maximum likelihood estimators (MLEs), which yield closed-form expressions for the estimators and the Fisher information matrix for both complete and randomly censored data. Since MLEs can be biased for small samples, we derived bias-corrected MLEs that are unbiased up to the second order and also have closed-form expressions. We consider a profiled MLE approach to estimate change points and construct a hypothesis test to determine the number of change points. We apply our proposed model to analyze the survival pattern of monarchs in the Pharaoh dynasties. Our results indicate that the piecewise power-law distribution fits the data well, suggesting that the lifespans of pharaonic monarchs exhibit varied survival patterns.
... We used the proportional hazards model using the survival package of the R software (R Development Core Team 2014). The fit of the model was evaluated through the Cox-Snell residuals (Cox and Snell 1968), which made it possible to verify that there was no violation of the fitted model. The relationships between the reproductive efficiency of Nile tilapia females and the weight, total length, standard length, and height of females and males, as well as the number of eggs and egg mass weight, were expressed as a risk ratio (HRR). ...
Article
Full-text available
Survival analysis has proven to be a robust tool for studies in humans, and recently, it has been adopted in studies with animals, but very little in fish. The production of uniform fingerlings has been one challenge of aquaculture, influenced by the use of the efficient breeders. We aimed to present, for the first time in aquaculture, a survival analysis as a tool for decision-making by using the relationship between the main morphometric traits and the spawning time of Nile tilapia females up to 28 days after mating in an intensive system. We used 78 females and 26 males from which ten traits were evaluated. A check was made for the presence of eggs in the female mouth (spawning) every seven days until the final twenty-eight days. Confirmation of eggs was defined as uncensored data (C = 1) and absence of eggs as censored (C = 0). The Cox proportional hazards ratio and Kaplan–Meier models were adjusted to analyze the data. Females with a standard length of 19.19 cm (small group) and males with a weight of 259.14 g (small group) reproduced early, being more adapted to 1000 L tanks in a recirculation system. Less heavy egg masses were also related to early spawning. Therefore, the traits, standard length of the female, egg mass weight, and male weight, affected the presence of eggs up to 28 days and should be used as selection criteria for early breeders. Besides, the survival analysis was accurate as a tool for tilapia breeding management in an intensive system.
... The P-value of 0.008 indicates that the covariate effects are significant. Now, we check the overall fit of the model by using Cox-Snell residuals (Cox & Snell, 1968). Suppose that the AR model given in (1) is fitted to the data. ...
Article
Middle-censoring refers to data arising in situations where the exact lifetime of study subjects becomes unobservable if it happens to fall in a random censoring interval. In the present paper we propose a semiparametric additive risks regression model for analysing middle-censored lifetime data arising from an unknown population. We estimate the regression parameters and the unknown baseline survival function by two different methods. The first method uses the martingale-based theory and the second method is an iterative method. We report simulation studies to assess the finite sample behaviour of the estimators. Then, we illustrate the utility of the model with a real life data set. The paper ends with a conclusion.
... In the second "bridgehead" specification, newly invaded regions may themselves become source regions for further invasion, thus we allowed for J tk to grow over time, adding countries in which species k is newly discovered. We used robust standard errors clustered at the importing country-species level and Cox-Snell residuals to evaluate model fit (Cox & Snell, 1968). ...
Article
International trade continues to drive biological invasions. We investigate the drivers of global nonnative ant establishments over the last two centuries using a Cox proportional hazards model. We use country‐level discovery records for 36 of the most widespread nonnative ant species worldwide from 1827 to 2012. We find that climatic similarity combined with cumulative imports during the 20 years before a species discovery in any given year is an important predictor of establishment. Accounting for invasions from both the native and previously invaded “bridgehead” regions substantially improves the model's fit, highlighting the role of spatial spillovers. These results are valuable for targeting biosecurity efforts.
... Further, the survival models can be compared using Cox-Snell residual plots. The Cox-Snell residuals (r_j) for the cumulative hazard of the fitted model (H) with covariates X_j can be expressed by Equation 10 (Cox and Snell 1968). ...
... Regarding fire metrics, we considered fire recurrence (FIRE_rec), fire seasonality (FIRE_fs for fire-season fires and FIRE_nfs for non-seasonal fires), and the size of the largest fire that affected each burned patch (FIRE_fslf). We used the R software package rcompanion [122] to compute the Efron, McFadden, Cox and Snell, and Nagelkerke/Cragg and Uhler pseudo-R² measures [123][124][125] to assess model performance. ...
Article
Full-text available
Socio-demographic changes in recent decades and fire policies centered on fire suppression have substantially diminished the ability to maintain low fuel loads at the landscape scale in marginal lands. Currently, shepherds face many barriers to the use of fire for restoring pastures in shrub-encroached communities. The restrictions imposed are based on the lack of knowledge of their impacts on the landscape. We aim to contribute to this clarification. Therefore, we used a dataset of burned areas in the Alto Minho region for seasonal and unseasonal (pastoral) fires. We conducted statistical and spatial analyses to characterize the fire regime (2001-2018), the distribution of fuel types and their dynamics, and the effects of fire on such changes. Unseasonal fires are smaller and spread in different spatial contexts. Fuel types characteristic of maritime pine and eucalypts are selected by seasonal fires and avoided by unseasonal fires which, in turn, showed high preference for heterogeneous mosaics of herbaceous and shrub vegetation. The area covered by fuel types of broadleaved and eucalypt forest stands increased between 2000 and 2018 at the expense of the fuel type corresponding to maritime pine stands. Results emphasize the role of seasonal fires and fire recurrence in these changes, and the weak effect of unseasonal fires. An increase in the maritime pine fuel type was observed only in areas burned by unseasonal fires, after excluding the areas overlapping with seasonal fires.
... Here, r = y − E(Y |x) is the classical residual and Φ(·) is the CDF of the standard normal distribution. Our definition maps the classical residual r to a scale of (-1/2, 1/2), and it reduces to the Cox-Snell residual (Cox and Snell, 1968). Throughout the paper, we use an upper-case letter (e.g., R or S) to denote a random variable and a lower-case letter (e.g., r or s) an observation. ...
Preprint
Full-text available
This paper is motivated by the analysis of a survey study of college student wellbeing before and after the outbreak of the COVID-19 pandemic. A statistical challenge in well-being survey studies lies in that outcome variables are often recorded in different scales, be it continuous, binary, or ordinal. The presence of mixed data complicates the assessment of the associations between them while adjusting for covariates. In our study, of particular interest are the associations between college students' wellbeing and other mental health measures and how other risk factors moderate these associations during the pandemic. To this end, we propose a unifying framework for studying partial association between mixed data. This is achieved by defining a unified residual using the surrogate method. The idea is to map the residual randomness to the same continuous scale, regardless of the original scales of outcome variables. It applies to virtually all commonly used models for covariate adjustments. We demonstrate the validity of using such defined residuals to assess partial association. In particular, we develop a measure that generalizes classical Kendall's tau in the sense that it can size both partial and marginal associations. More importantly, our development advances the theory of the surrogate method developed in recent years by showing that it can be used without requiring outcome variables having a latent variable structure. The use of our method in the well-being survey analysis reveals (i) significant moderation effects (i.e., the difference between partial and marginal associations) of some key risk factors; and (ii) an elevated moderation effect of physical health, loneliness, and accommodation after the onset of COVID-19.
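One reading consistent with the excerpt above (an interpretation, not a quotation from the paper): for a continuous outcome with conditional distribution function $F(\cdot \mid x)$, the unified residual is the probability-integral-transformed value recentred at zero,

$$ R = F(y \mid x) - \tfrac{1}{2} \in \left(-\tfrac{1}{2}, \tfrac{1}{2}\right), $$

which for a normal linear model becomes $\Phi\{(y - E(Y\mid x))/\sigma\} - 1/2$ and is uniform on $(-1/2, 1/2)$ under the true model; up to the constant shift this is the Cox-Snell residual $F(y \mid x)$.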
Article
Since its invention HyperLogLog has become the standard algorithm for approximate distinct counting. Due to its space efficiency and suitability for distributed systems, it is widely used and also implemented in numerous databases. This work presents UltraLogLog, which shares the same practical properties as HyperLogLog. It is commutative, idempotent, mergeable, and has a fast guaranteed constant-time insert operation. At the same time, it requires 28% less space to encode the same amount of distinct count information, which can be extracted using the maximum likelihood method. Alternatively, a simpler and faster estimator is proposed, which still achieves a space reduction of 24%, but at an estimation speed comparable to that of HyperLogLog. In a non-distributed setting where martingale estimation can be used, UltraLogLog is able to reduce space by 17%. Moreover, its smaller entropy and its 8-bit registers lead to better compaction when using standard compression algorithms. All this is verified by experimental results that are in perfect agreement with the theoretical analysis which also outlines potential for even more space-efficient data structures. A production-ready Java implementation of UltraLogLog has been released as part of the open-source Hash4j library.
Article
Full-text available
Conditional cash transfer programs (CCTs) are increasingly common in Latin American countries and have become the main strategy for combating poverty and social inequality. Among many, noteworthy programs include Progresa-Oportunidades, later renamed Prospera, in Mexico in 1997; Programa Familias en Acción in Colombia; Chile Solidario in Chile in 2002; and Bolsa Família Program in Brazil in 2003. The central feature of the Bolsa Família Program (BFP) was that the receipt of monetary benefits by families was tied to the fulfillment of certain conditions, or conditionalities, involving healthcare and education for children and adolescents. The purpose of this design was, in the short term, to combat the negative effects of poverty on family well-being through the transfer of monetary resources, and in the long term, through the requirement of conditionalities, to break the poverty trap caused, in large part, by intergenerational transmission of income and education. The literature on conditional cash transfers has largely focused on entry conditions, paying less attention to exit conditions. This study aims to explore the trajectory of participant families in the Bolsa Família Program, using information obtained from the cohort of individuals born in the city of Pelotas in 2004. The research focuses on the socioeconomic characteristics of families at the time of their child’s birth in 2004 and the follow-up conducted in 2015. Survival analysis models are used to analyze the probability of exiting the program. The results reported by the survival analysis, Kaplan-Meier method, indicate that a BRL 1.00 (US$ 0.30) increase in income reduces, on average, the likelihood of successful departure from the BFP beneficiary family by 1%. Conversely, exercises conducted by family groups indicate that for families receiving benefits above the average, BRL 141.00 (US$ 42.32), the mother’s age and her employment status positively influence the chances of leaving the program. Overall, the results demonstrate that the values of the benefits received are subject to decreasing returns. For families receiving benefits below the average, a one-unit increase in the amount received reduces the probability of exiting the program by an average of 1%. Regarding successful exits, it was observed that having a white father increases the chances by 40%, and if the father is employed, it increases the likelihood of successfully exiting the program by 48%. These results indicate that the successful exit of families from the program is primarily associated with parental characteristics.
Article
We obtain new mathematical properties of the exponentiated odd log-logistic family of distributions, of its special case named the exponentiated odd log-logistic Weibull, and of its log-transformed version. A new location and scale regression model is constructed, and some simulations are carried out to verify the behavior of the maximum likelihood estimators and of the modified deviance-based residuals. The methodology is applied to the Japanese-Brazilian emigration data.
Article
Full-text available
Introduction To assess the rate of change in soluble fms‐like tyrosine kinase‐1/placental growth factor (sFlt‐1/PlGF) ratio and PlGF levels per week compared to a single sFlt‐1/PlGF ratio or PlGF level to predict preterm birth for pregnancies complicated by fetal growth restriction. Material and methods A prospective cohort study of pregnancies complicated by isolated fetal growth restriction. Maternal serum PlGF levels and the sFlt‐1/PlGF ratio were measured at 4‐weekly intervals from recruitment to delivery. We investigated the utility of PlGF levels, sFlt‐1/PlGF ratio, change in PlGF levels per week or sFlt‐1/PlGF ratio per week. Cox‐proportional hazard models and Harrell's C concordance statistic were used to evaluate the effect of biomarkers on time to preterm birth. Results The total study cohort was 158 pregnancies comprising 91 (57.6%) with fetal growth restriction and 67 (42.4%) with appropriate for gestational age controls. In the fetal growth restriction cohort, sFlt‐1/PlGF ratio and PlGF levels significantly affected time to preterm birth (Harrell's C: 0.85–0.76). The rate of increase per week of the sFlt‐1/PlGF ratio (hazard ratio [HR] 3.91, 95% confidence interval [CI]: 1.39–10.99, p = 0.01, Harrell's C: 0.74) was positively associated with preterm birth but change in PlGF levels per week was not (HR 0.65, 95% CI: 0.25–1.67, p = 0.37, Harrell's C: 0.68). Conclusions Both a high sFlt‐1/PlGF ratio and low PlGF levels are predictive of preterm birth in women with fetal growth restriction. Although the rate of increase of the sFlt‐1/PlGF ratio predicts preterm birth, it is not superior to either a single elevated sFlt‐1/PlGF ratio or low PlGF level.
Article
Semicontinuous outcomes commonly arise in a wide variety of fields, such as insurance claims, healthcare expenditures, rainfall amounts, and alcohol consumption. Regression models, including Tobit, Tweedie, and two-part models, are widely employed to understand the relationship between semicontinuous outcomes and covariates. Given the potential detrimental consequences of model misspecification, after fitting a regression model, it is of prime importance to check the adequacy of the model. However, due to the point mass at zero, standard diagnostic tools for regression models (eg, deviance and Pearson residuals) are not informative for semicontinuous data. To bridge this gap, we propose a new type of residuals for semicontinuous outcomes that is applicable to general regression models. Under the correctly specified model, the proposed residuals converge to being uniformly distributed, and when the model is misspecified, they significantly depart from this pattern. In addition to in-sample validation, the proposed methodology can also be employed to evaluate predictive distributions. We demonstrate the effectiveness of the proposed tool using health expenditure data from the US Medical Expenditure Panel Survey.
Article
Full-text available
Abstract One of the striking features of the field of artistic work is the income differential and its inequalities. Since the field is very heterogeneous, this article seeks to verify whether the differentials and inequalities described in the literature hold for musicians working in Belo Horizonte. To that end, in February, March, and April 2020, a survey was carried out to collect primary data in order to answer these questions through econometric and statistical applications. The applications suggest that the hypothesis of non-linearity between educational level and income cannot be rejected, and personal characteristics were not significant in this analysis. Regarding intra-group inequalities, the highest indices occur in some specific groups, such as musicians dedicated exclusively to music, those in the 30 to 36 age bracket, and those with a university degree in music.
Article
Even if the value of the risk factor evolves continuously in the extended Cox model, the inference may be somewhat biased because it employs a discrete approximation method. As a result, if the risk factor's value not only fluctuates constantly but also contains measurement error, a model that substitutes the average value rather than the measured value of the risk factor might be considered. Such a model is known as a joint model. After introducing the present-value model, the most widely used among joint models, this model was extended to various types of data. In addition, several residuals for model diagnosis were introduced, methods for predicting the probability of event occurrence and the value of the longitudinal risk factor by application of the estimation model were introduced, and an index that can evaluate how well the longitudinal risk factors divide patients into high-risk and low-risk groups was introduced. Finally, all the statistical inference methods introduced in this study were implemented using the JM R package and the source codes were supplied.
Article
We derive approximations to the bias and squared bias with errors of order o(1/n), where n is the sample size. Our results hold for a large class of estimators, including quantiles, transformations of unbiased estimators, maximum likelihood estimators in (possibly) incorrectly specified models and functions thereof. Furthermore, we use the approximations to derive estimators of the mean squared error (MSE) which are correct to order o(1/n). Since the variance of many estimators is of order O(1/n), this level of precision is needed for the mean squared error estimator to properly take the variance into account. We also formulate a new focused information criterion (FIC) for model selection based on the estimators of the squared bias. Lastly, we illustrate the methods on data containing the number of battle deaths in all major inter-state wars between 1823 and the present day. The application illustrates the potentially large impact of using a less accurate estimator of the squared bias.
Article
Full-text available
Motivation Technologies identifying single nucleotide polymorphisms (SNPs) in DNA sequencing yield an avalanche of data requiring analysis and interpretation. Standard methods may require many weeks of processing time. The use of statistical methods requiring data sorting, high-dimensional matrix inversions and replication in subsets of the data on multiple outcomes exacerbates these times. A method which reduces the computational time in problems with time-to-event outcomes and hundreds of thousands/millions of SNPs, using Cox-Snell residuals after fitting the Cox proportional hazards model (PH) to a fixed set of concomitant variables, is proposed. This yields coefficients for the SNP effect from a Cox-Snell adjusted Poisson model and shows a high concordance with the adjusted PH model. The method is illustrated with a sample of 10,000 SNPs from a genome-wide association study (GWAS) in a diabetic population. The gain in processing efficiency using the proposed method based on Poisson modelling can be as high as 61%. This could result in a saving of over three weeks of processing time if 5 million SNPs require analysis. The method involves only a single predictor variable (SNP), offering a simpler, computationally more stable approach to examining and identifying SNP patterns associated with the outcome(s), allowing for a faster development of genetic signatures. Use of deviance residuals from the PH model to screen SNPs demonstrates a large discordance rate at a 0.2% threshold of concordance. This rate is 15 times larger than that based on the Cox-Snell residuals from the Cox-Snell adjusted Poisson model.
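A hedged sketch of the kind of two-stage computation the abstract describes (an illustration, not the authors' code, with all names and data invented): Cox-Snell residuals from a covariate-adjusted null survival model are used as a Poisson offset, so each SNP requires only a cheap two-parameter GLM fit instead of a full proportional hazards fit.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 2000

# Illustrative data: baseline hazard 0.1, one clinical covariate, one SNP
clinical = rng.normal(size=n)
snp = rng.binomial(2, 0.3, size=n).astype(float)     # genotype coded 0/1/2
hazard = 0.1 * np.exp(0.5 * clinical + 0.3 * snp)
t_event = rng.exponential(1.0 / hazard)
t_cens = rng.exponential(10.0, size=n)
time = np.minimum(t_event, t_cens)
delta = (t_event <= t_cens).astype(float)            # event indicator

# Stage 1 (stand-in): Cox-Snell residuals from the covariate-adjusted null model.
# In practice these come from a PH model fitted to the clinical covariates only;
# here the known covariate-adjusted cumulative hazard plays that role.
cox_snell = 0.1 * np.exp(0.5 * clinical) * time

# Stage 2: Poisson GLM of the event indicator on the SNP with a log Cox-Snell offset
X = sm.add_constant(snp)
fit = sm.GLM(delta, X, family=sm.families.Poisson(),
             offset=np.log(cox_snell)).fit()
print(fit.params)   # slope should be close to the SNP log hazard ratio (0.3 here)
```

Repeating only Stage 2 across SNPs is what keeps the per-SNP cost small.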
Article
Full-text available
The depth of information collected in participants’ daily lives with active (e.g., experience sampling surveys) and passive (e.g., smartphone sensors) ambulatory measurement methods is immense. When measuring participants’ behaviors in daily life, the timing of particular events—such as social interactions—is often recorded. These data facilitate the investigation of new types of research questions about the timing of those events, including whether individuals’ affective state is associated with the rate of social interactions (binary event occurrence) and what types of social interactions are likely to occur (multicategory event occurrences, e.g., interactions with friends or family). Although survival analysis methods have been used to analyze time-to-event data in longitudinal settings for several decades, these methods have not yet been incorporated into ambulatory assessment research. This article illustrates how multilevel and multistate survival analysis methods can be used to model the social interaction dynamics captured in intensive longitudinal data, specifically when individuals exhibit particular categories of behavior. We provide an introduction to these models and a tutorial on how the timing and type of social interactions can be modeled using the R statistical programming language. Using event-contingent reports (N = 150, Nevents = 64,112) obtained in an ambulatory study of interpersonal interactions, we further exemplify an empirical application case. In sum, this article demonstrates how survival models can advance the understanding of (social interaction) dynamics that unfold in daily life.
Article
Full-text available
A new probability distribution is proposed in this paper. The new distribution has support on the unit interval and was obtained by transforming a random variable with an exponential distribution. The mode, quantile function, median and ordinary moments are derived, and it is shown that the density function belongs to the exponential family of distributions. The maximum likelihood method is used to obtain the parameter estimates. A regression model for the median of the distribution is also proposed. Closed-form expressions for the score vector and Fisher's information matrix are derived. A simulation study and an application to real data show the good performance of the proposed regression model.
Article
Full-text available
In this paper, an attempt has been made to derive three new trivariate Proportional Hazards models under the frailty approach with different baseline hazard functions. The proposed models are illustrated with a real-life survival data set and the posterior inferences are drawn using Markov Chain Monte Carlo (MCMC) simulation methods. For model comparison, two popular model choice criteria, the deviance information criterion (DIC) and the log pseudo marginal likelihood (LPML), are employed, and the Cox-Snell residual plot is used to check the fit of the models.
Chapter
Sir David R. Cox was a leading statistician, perhaps the foremost of his era. He made major contributions in diverse areas of statistics and probability and mentored many who subsequently became leaders of the profession. He published over 20 books and 350+ papers.
Article
Full-text available
In this paper, likelihood-based inference and bias correction based on Firth's approach are developed for the modified skew-t-normal (MStN) distribution. The latter model exhibits greater flexibility than the modified skew-normal (MSN) distribution since it is able to model heavily skewed data and thick tails. In addition, the tails are controlled by the shape parameter and the degrees of freedom. We provide the density of this new distribution and present some of its more important properties, including a general expression for the moments. The Fisher information matrix together with the observed matrix associated with the log-likelihood is also given. Furthermore, the non-singularity of the Fisher information matrix for the MStN model is demonstrated when the shape parameter is zero. As the MStN model presents an inferential problem in the shape parameter, Firth's method for bias reduction was applied for the scalar case and for the location and scale case.
Article
In this article we propose a two-step generalized method of moments (GMM) procedure for a Spatial Binary Probit Model. In particular, we propose a series of two-step estimators based on different choices of the weighting matrix for the moment conditions in the first step, and different estimators for the variance–covariance matrix of the estimated coefficients. In the context of a Monte Carlo experiment, we compare the properties of these estimators, a linearized version of the one-step GMM and the recursive importance sampler (RIS). Our findings reveal that there are benefits related both to the choice of the weight matrix for the moment conditions and to adopting a two-step procedure.
Article
In the analysis of data it is often assumed that observations y1, y2, …, yn are independently normally distributed with constant variance and with expectations specified by a model linear in a set of parameters θ. In this paper we make the less restrictive assumption that such a normal, homoscedastic, linear model is appropriate after some suitable transformation has been applied to the y's. Inferences about the transformation and about the parameters of the linear model are made by computing the likelihood function and the relevant posterior distribution. The contributions of normality, homoscedasticity and additivity to the transformation are separated. The relation of the present methods to earlier procedures for finding transformations is discussed. The methods are illustrated with examples.
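The family of power transformations analysed in this paper is usually written as

$$ y^{(\lambda)} = \begin{cases} (y^{\lambda} - 1)/\lambda, & \lambda \neq 0,\\ \log y, & \lambda = 0, \end{cases} $$

and $\lambda$ is chosen by maximising the profile log-likelihood of the normal linear model for $y^{(\lambda)}$, which includes the Jacobian term $(\lambda - 1)\sum_i \log y_i$.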
Article
For a distribution depending on a single parameter the first four sampling moments of the maximum-likelihood estimate to orders N^{-2}, N^{-3}, N^{-3} and N^{-4} respectively are given. Expressions for the measures of skewness γ_1 and γ_2 are also given. Several illustrative examples are included as a check on the heavy algebra. The paper extends earlier work by Haldane and Smith.
Article
Gunnar Blom. 'Transformations of the binomial, negative binomial, Poisson and χ² distributions', Biometrika (1954), 41, 302.
Article
In an earlier paper (Durbin & Watson, 1950) the authors investigated the problem of testing the error terms of a regression model for serial correlation. Test criteria were put forward, their moments calculated, and bounds to their distribution functions were obtained. In the present paper these bounds are tabulated and their use in practice is described. For cases in which the bounds do not settle the question of significance an approximate method is suggested. Expressions are given for the mean and variance of a test statistic for one- and two-way classifications and polynomial trends, leading to approximate tests for these cases. The procedures described should be capable of application by the practical worker without reference to the earlier paper (hereinafter referred to as Part I).
Article
The paper considers a number of problems arising from the test of serial correlation based on the d statistic proposed earlier by the authors (Durbin & Watson, 1950, 1951). Methods of computing the exact distribution of d are investigated and the exact distribution is compared with six approximations to it for four sets of published data. It is found that approximations suggested by Theil and Nagar and by Hannan are too inaccurate for practical use but that the beta approximation proposed in the 1950 and 1951 papers and a new approximation, called by us the a + b·d_U approximation and based, like the beta approximation, on the exact first two moments of d, both perform well. The power of the d test is compared with that of certain exact tests proposed by Theil, Durbin, Koerts and Abrahamse from the standpoint of invariance theory. It is shown that the d test is locally most powerful invariant but that the other tests are not. There are three appendices. The first gives an account of the exact distribution of d. The second derives the mean and variance to a second order of approximation of a modified maximum likelihood statistic closely related to d. The third sets out details of the computations required for the a + b·d_U approximation.
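For reference, the d statistic discussed in these two papers is computed from the least-squares residuals $e_1, \dots, e_n$ as

$$ d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} \approx 2\,(1 - \hat\rho_1), $$

where $\hat\rho_1$ is the lag-one sample autocorrelation of the residuals, so values near 2 indicate little serial correlation and values well below 2 suggest positive serial correlation.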
An analysis of transformations
  • Box G. E. P.