Article

Some Recent Work on Resampling Methods for Complex Surveys

... There exist bootstrap weight methods under complex survey sampling, but the proposed bootstrap method differs from them in the following aspects. Rao et al. (1992) proposed a bootstrap weight method to estimate the variance of a function of population total estimators under stratified random sampling, but did not investigate the theoretical properties of their bootstrap method. Chipperfield and Preston (2007) proposed a without-replacement scaled bootstrap to achieve the same goal as Rao et al. (1992) under stratified random sampling, but their method is only applicable when the parameter of interest is a smooth function of population totals. ...

... Rao et al. (1992) proposed a bootstrap weight method to estimate the variance of a function of population total estimators under stratified random sampling, but did not investigate the theoretical properties of their bootstrap method. Chipperfield and Preston (2007) proposed a without-replacement scaled bootstrap to achieve the same goal as Rao et al. (1992) under stratified random sampling, but their method is only applicable when the parameter of interest is a smooth function of population totals. Moreover, neither method is applicable to other complex survey sampling designs. ...
... There indeed exist some papers discussing bootstrap confidence intervals under survey sampling, but these do not apply to general hypothesis testing problems. For example, Rao et al. (1992) proposed a bootstrap-t confidence interval for the parameter of interest, but their bootstrap confidence interval does not apply when the parameter of interest is multi-dimensional. In addition, their bootstrap method is only valid under stratified simple random sampling. Beaumont and Patak (2012) and Bertail and Combris (1997) discussed bootstrap confidence intervals in their simulation studies, but it is not clear how the corresponding intervals are constructed. ...
Article
Full-text available
Standard statistical methods that do not take proper account of the complexity of a survey design can lead to erroneous inferences when applied to survey data, due to unequal selection probabilities, clustering, and other design features. In particular, the type I error rates of hypothesis tests using standard methods can be much larger than the nominal significance level. Methods incorporating design features in testing hypotheses have been proposed, including Wald tests and quasi-score tests that involve estimated covariance matrices of parameter estimates. In this paper, we present a unified approach to hypothesis testing that requires neither estimated covariance matrices nor design effects, by constructing bootstrap approximations to quasi-likelihood ratio statistics and quasi-score statistics and establishing their asymptotic validity. The proposed method can be easily implemented without specialized software designed for complex survey sampling. We also consider hypothesis testing for categorical data and present a bootstrap procedure for testing simple goodness of fit and independence in a two-way table. In simulation studies, the type I error rates of the proposed approach are much closer to the nominal significance level compared with the naive likelihood ratio test and quasi-score test. An application to an educational survey under a logistic regression model is also presented.
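To make the flavor of such a procedure concrete, here is a minimal sketch of a bootstrap-calibrated goodness-of-fit test. It is not the authors' exact algorithm: it assumes a set of rescaled (m = n − 1) bootstrap replicate weights is already available, and the data, function names, and replicate-weight construction are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def weighted_props(codes, w, k):
    """Design-weighted category proportions."""
    totals = np.bincount(codes, weights=w, minlength=k)
    return totals / totals.sum()

def bootstrap_gof_pvalue(codes, w, rep_weights, p0):
    """Bootstrap-calibrated goodness-of-fit test (sketch)."""
    k, n = len(p0), len(codes)
    p_hat = weighted_props(codes, w, k)
    # observed chi-square-type statistic under H0: p = p0
    t_obs = n * np.sum((p_hat - p0) ** 2 / p0)
    # bootstrap statistics centered at the full-sample estimate,
    # approximating the null distribution under the design
    t_boot = np.array([
        n * np.sum((weighted_props(codes, wb, k) - p_hat) ** 2 / p_hat)
        for wb in rep_weights
    ])
    return np.mean(t_boot >= t_obs)

# toy data with simple rescaled (m = n - 1) bootstrap replicate weights
n, k, B = 500, 3, 500
codes = rng.integers(0, k, size=n)
w = rng.uniform(0.5, 2.0, size=n)
rep_weights = np.array([
    w * rng.multinomial(n - 1, np.full(n, 1 / n)) * n / (n - 1)
    for _ in range(B)
])
print("bootstrap p-value:",
      bootstrap_gof_pvalue(codes, w, rep_weights, np.full(k, 1 / k)))
```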
... Most of Statistics Canada's surveys that implement the bootstrap method have a complex stratified two-stage or three-stage sampling design. The Rao-Wu-Yue bootstrap weights [1] are often computed in those surveys. The Rao-Wu-Yue bootstrap weights are applicable when the first-stage sample is drawn with replacement within strata. ...
... The first term on the right-hand side of (5), $V_1 = \mathrm{var}_p(\hat{\theta})$, is the variance under single-stage cluster sampling given in (1). The second term, $V_2 = E_p\big(\sum_{k \in s} w_{1k}^2 V_{2k}\big)$, reflects the increase in variance due to the second stage of sampling. ...
... Rao, Wu and Yue [1] did not provide bootstrap weights for this design. However, [9] showed that the bootstrap method of [7] can be implemented by using the following bootstrap weight adjustment: ...
Article
Full-text available
The bootstrap method is often used for variance estimation in sample surveys with a stratified multistage sampling design. It is typically implemented by producing a set of bootstrap weights that is made available to users and that accounts for the complexity of the sampling design. The Rao–Wu–Yue method is often used to produce the required bootstrap weights. It is valid under stratified with-replacement sampling at the first stage or fixed-size without-replacement sampling provided the first-stage sampling fractions are negligible. Some surveys use designs that do not satisfy these conditions. We propose a simple and unified bootstrap method that addresses this limitation of the Rao–Wu–Yue bootstrap weights. This method is applicable to any multistage sampling design as long as valid bootstrap weights can be produced for each distinct stage of sampling. Our method is also applicable to two-phase sampling designs provided that Poisson sampling is used at the second phase. We use this design to model survey nonresponse and derive bootstrap weights that account for nonresponse weighting. The properties of our bootstrap method are evaluated in three limited simulation studies.
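As a rough illustration of how such replicate weights arise in the simplest case the Rao–Wu–Yue method covers, the sketch below generates rescaled bootstrap weights for a stratified design with with-replacement sampling of PSUs at the first stage, using the common choice of m_h = n_h − 1 resamples per stratum; with that choice the general rescaling reduces to multiplying each weight by n_h/(n_h − 1) times the number of times the PSU is resampled. Names and toy data are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def rwy_bootstrap_weights(strata, w, B):
    """Rao-Wu-Yue-type rescaled bootstrap weights with m_h = n_h - 1.

    strata: stratum label per first-stage unit (PSU)
    w: design weight per PSU
    Returns a (B, n) array of bootstrap weights.
    """
    n = len(w)
    out = np.empty((B, n))
    for b in range(B):
        adj = np.zeros(n)
        for h in np.unique(strata):
            idx = np.flatnonzero(strata == h)
            n_h = len(idx)
            # resample m_h = n_h - 1 PSUs with replacement within the stratum
            counts = rng.multinomial(n_h - 1, np.full(n_h, 1 / n_h))
            # with m_h = n_h - 1, the general rescaling reduces to
            # w* = w * n_h / (n_h - 1) * (number of times selected)
            adj[idx] = counts * n_h / (n_h - 1)
        out[b] = w * adj
    return out

# bootstrap variance of a weighted total on toy data
strata = np.repeat([0, 1], [6, 8])
w = rng.uniform(10, 30, size=strata.size)
y = rng.normal(100, 15, size=strata.size)
totals = rwy_bootstrap_weights(strata, w, B=1000) @ y
print("bootstrap variance estimate:", totals.var(ddof=1))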
... Rao and Wu [2] applied a scale adjustment directly to the survey data values so as to recover the usual variance formulae. Rao et al. [4] presented a modification of the method of Rao and Wu [2], where the scale adjustment is applied to the survey weights rather than to the data values. The second group of procedures consists of first creating a pseudo-population from the original sample. ...
... Rao and Wu [2] showed that in the case of a population total, the above algorithm matches the standard variance estimator (3). Rao et al. [4] proposed a weighted version of the Rao-Wu method, whereby the rescaling is applied to the sampling weights rather than the y-values; see also [10]. The method of Rao et al. [4] is described in Section 4. ...
... Rao et al. [4] proposed a weighted version of the Rao-Wu method, whereby the rescaling is applied to the sampling weights rather than the y-values; see also [10]. The method of Rao et al. [4] is described in Section 4. ...
Article
Full-text available
Multi-stage sampling designs are often used in household surveys because a sampling frame of elements may not be available or for cost considerations when data collection involves face-to-face interviews. In this context, variance estimation is a complex task as it relies on the availability of second-order inclusion probabilities at each stage. To cope with this issue, several bootstrap algorithms have been proposed in the literature in the context of a two-stage sampling design. In this paper, we describe some of these algorithms and compare them empirically in terms of bias, stability, and coverage probability.
... of (14), where $\hat{\sigma}^2(\cdot)$ is an estimator of the design variance $\mathrm{var}_p(\cdot)$. Unfortunately, estimator (15) can be very unstable and take negative values for individual small domains. Therefore, the straightforward estimation of the optimal weights (13) is avoided. ...
... and [2]. However, this estimator has the same drawbacks as (15). Another general method is to assume that the estimator $\hat{\theta}^C_i$ defined by (12) approximates the optimal combination $\hat{\theta}^{\mathrm{opt}}_i = \hat{\theta}^C_i(\lambda^*_i)$ quite well and derive the approximation [3] ...
... Then we smooth these $\hat{\psi}^d_i$ to obtain $\hat{\psi}_i = \hat{\psi}^{sD}_i$ according to (8) and use the smoothed estimates in (6), (7), (10), (18), and in the synthetic parts of (22) and $\hat{\theta}^{\mathrm{opt}}_i$. We apply the bootstrap method of [15] to evaluate the estimators of the design variances in (17), (18), and (21). Denote by $\hat{\theta}_i$ any estimator for which we need to estimate the design variance. ...
Preprint
Full-text available
Traditional direct estimation methods are not efficient for domains of a survey population with small sample sizes. To estimate the domain proportions, we combine the direct estimators and the regression-synthetic estimators based on domain-level auxiliary information. For the case of small true proportions, we introduce the design-based linear combination that is a robust alternative to the empirical best linear unbiased predictor (EBLUP) based on the Fay–Herriot model. We also consider an adaptive procedure optimizing a sample-size-dependent composite estimator, which depends on a single parameter for all domains. We imitate the Lithuanian Labor Force Survey, where we estimate the proportions of the unemployed and employed in municipalities. We show where the considered design-based compositions and estimators of their mean square errors are competitive with EBLUP and its accuracy estimation.
... We propose a rescaled bootstrap method tailored for simple random sampling without replacement in each dimension, drawing inspiration from the approach proposed in Rao et al. (1992). The ...
... (1) Rao et al. (1992) ...
Preprint
Full-text available
We investigate the family of cross-classified sampling (CCS) designs across an arbitrary number of dimensions. We introduce a variance decomposition that enables the derivation of general asymptotic properties for these designs and the development of straightforward and asymptotically unbiased variance estimators. Additionally, we demonstrate the suitability of weighted bootstrap techniques for CCS, given the availability of a weighted bootstrap technique in each dimension. Our conclusions are supported by an extensive simulation study. Finally, we apply the proposed methods to a French longitudinal survey conducted among children.
... All analyses were weighted to account for the complex sampling design, nonresponse bias, population frame calibration, and age range, and used a resampling-based variance estimation employing the (n-1) rescaling bootstrap (RBS) with 200 replicate weights (Kolenikov, 2010; Rao et al., 1992). Separate weights were constructed for veterans and non-veterans. ...

... The proportions of veterans and non-veterans endorsing each study outcome were assessed using frequency statistics and were unadjusted (i.e., crude). To assess group differences among outcomes, Poisson regression analyses were conducted with the bootstrap-based method to produce robust standard errors (Kolenikov, 2010; Rao et al., 1992). Veteran status was included as the primary predictor, and analyses were stratified by sex. ...
Article
Full-text available
Background Prior research has examined how the post-military health and well-being of both the larger veteran population and earlier veteran cohorts differs from non-veterans. However, no study has yet provided a holistic examination of how the health, vocational, financial, and social well-being of the newest generation of post-9/11 U.S. military veterans compares with their non-veteran peers. This is a significant oversight, as accurate knowledge of the strengths and vulnerabilities of post-9/11 veterans is required to ensure that the needs of this population are adequately addressed, as well as to counter inaccurate veteran stereotypes. Methods Post-9/11 U.S. veterans (N = 15,160) and non-veterans (N = 4,533) reported on their health and broader well-being as part of a confidential web-based survey in 2018. Participants were drawn from probability-based sampling frames, and sex-stratified weighted logistic regressions were conducted to examine differences in veterans’ and non-veterans’ reports of health, vocational, financial, and social outcomes. Results Although both men and women post-9/11 veterans endorsed poorer health status than non-veterans, they reported greater engagement in a number of positive health behaviors (healthy eating and exercise) and were more likely to indicate having access to health care. Veterans also endorsed greater social well-being than non-veterans on several outcomes, whereas few differences were observed in vocational and financial well-being. Conclusion Despite their greater vulnerability to experiencing health conditions, the newest generation of post-9/11 U.S. veterans report experiencing similar or better outcomes than non-veterans in many aspects of their lives. Findings underscore the value of examining a wider range of health and well-being outcomes in veteran research and highlight a number of important directions for intervention, public health education, policy, and research related to the reintegration of military veterans within broader civilian society.
... bootstrap are properly rescaled, as well as in [5,6]; cf. also the review in [7]. In [8] a "rescaled bootstrap process" based on asymptotic arguments is proposed. ...
... where the $N^*_i$'s are integer-valued random variables with (joint) probability distribution $P_{\mathrm{pred}}$. In practice, Equation (7) means that $N^*_i I_i$ population units are predicted to have $y$-value equal to $y_i$ and $x$-value equal to $x_i$, for each sample unit $i$. ...
Article
Full-text available
In the present paper, resampling for finite populations under an iid sampling design is reviewed. Our attention is mainly focused on pseudo-population-based resampling due to its properties. A principled appraisal of the main theoretical foundations and results is given and discussed, together with important computational aspects. Finally, a discussion on open problems and research perspectives is provided.
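A minimal sketch of the pseudo-population idea under simple random sampling without replacement, assuming the design weight N/n is an integer so each sampled unit can be copied exactly N/n times (the methods reviewed above treat the fractional part more carefully); all names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def pseudo_pop_bootstrap(y, N, B):
    """Pseudo-population bootstrap for SRSWOR (sketch).

    Each sampled unit is copied N/n times (assumed integer here) to
    build a pseudo-population, from which B new SRSWOR samples of
    size n are drawn and the statistic is recomputed.
    """
    n = len(y)
    copies = N // n                 # integer copies of each sample unit
    pseudo = np.repeat(y, copies)   # the pseudo-population
    stats = np.empty(B)
    for b in range(B):
        s = rng.choice(pseudo, size=n, replace=False)
        stats[b] = s.mean()
    return stats

y = rng.normal(50, 10, size=100)
boot = pseudo_pop_bootstrap(y, N=2000, B=1000)
print("bootstrap variance of the mean:", boot.var(ddof=1))
```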
... To address the complex sampling design, we used main population weights and rescaling bootstrap and replication weights (Rao et al., 1992). First, we conducted chi-square tests to examine the unadjusted prevalence of experiencing PTEs and other stressors, the prevalence of self-reported lifetime diagnoses, and mental health treatment among LGBTQ+ veterans and LGBTQ+ nonveterans. ...
Article
Full-text available
Objective: The purpose of the study was to compare lesbian, gay, bisexual, transgender, queer+ (LGBTQ+) veterans’ and nonveterans’ prevalence of potentially traumatic events (PTEs) and other stressor exposures, mental health concerns, and mental health treatment. Method: A subsample of veterans and nonveterans who identified as LGBTQ+ (N = 1,291; 851 veterans; 440 nonveterans) was identified from a national cohort of post-9/11 veterans and matched nonveterans. The majority of the sample identified as White (59.7%), men (40.4%), and gay or lesbian (48.6%). Measures included PTEs and other stressors, depression, anxiety, posttraumatic stress disorder (PTSD), and receipt of mental health treatment. Logistic regressions compared the likelihood of experiencing PTEs and other stressors, self-reported mental health diagnoses, and mental health treatment between LGBTQ+ veterans and nonveterans. Results: Compared with LGBTQ+ nonveterans, LGBTQ+ veterans were more likely to report financial strain, divorce, discrimination, witnessing the sudden death of a friend or family member, and experiencing a serious accident or disaster. LGBTQ+ veterans reported greater depression, anxiety, and PTSD symptom severity than LGBTQ+ nonveterans. However, LGBTQ+ veterans were only more likely to receive psychotherapy for PTSD and did not differ from nonveterans in the likelihood of receiving any other types of mental health treatment. Conclusions: The study was the first to demonstrate that LGBTQ+ veterans have a greater prevalence of PTEs and other stressors and report worse mental health symptoms. These findings suggest that LGBTQ+ veterans may have unmet mental health treatment needs and need interventions to increase engagement in needed mental health services, especially for depression and anxiety.
... First, we draw an independent sample of households with replacement from the original sample in each Autonomous Community. Second, the cross-sectional weights are adjusted as Rao et al. (1992) and Rust and Rao (1996) proposed. For instance, the adjusted weight for household $i$ in Autonomous Community $j$, $w^*_{ij}$, is given by $w^*_{ij} = \frac{n_j}{n_j - 1} r_i w_{ij}$, where $w_{ij}$ is the original cross-sectional weight, $r_i$ is the number of times the $i$-th household in Autonomous Community $j$ is selected in the bootstrap sample, and $n_j$ is the original sample size of Autonomous Community $j$. ...
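Assuming the standard Rao et al. (1992) choice of n_j − 1 with-replacement draws per Autonomous Community (the number of draws is not stated in the excerpt, so this is an assumption that matches the adjustment above), the replicate weights for one Community can be generated in a few lines; this is an illustrative sketch, not the survey's production code.

```python
import numpy as np

rng = np.random.default_rng(21)

def bootstrap_weights_one_community(w_j):
    """One bootstrap replicate of adjusted weights within a Community.

    Draws n_j - 1 households with replacement (assumed) and applies
    w*_ij = n_j / (n_j - 1) * r_i * w_ij, with r_i the selection count.
    """
    n_j = len(w_j)
    r = rng.multinomial(n_j - 1, np.full(n_j, 1 / n_j))  # selection counts
    return w_j * r * n_j / (n_j - 1)

w_j = rng.uniform(100, 500, size=50)  # original cross-sectional weights
print(bootstrap_weights_one_community(w_j)[:5])
```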
Article
Full-text available
The AROPE rate is a multidimensional indicator to monitor poverty in the European Union which combines income, work intensity and material deprivation. However, it misses the possible relationship between its components. To overcome this drawback, some authors proposed to complement the AROPE rate with measures of the dependence between its dimensions, since higher dependence can exacerbate poverty. In this paper, we follow this approach and measure such dependence in the Spanish regions over the period 2008-2018 using three multivariate versions of Spearman’s rank correlation coefficient. Our results reveal an asymmetric effect of the economic cycle on the dependence between poverty dimensions, as this dependence, in many Spanish regions, substantially increased during the Great Recession but dropped little during the economic recovery. Moreover, regions with higher AROPE rates also tend to experience more dependence between their dimensions.
... All analyses were weighted using the main population weights (Rao et al., 1992) to account for the sampling design. Because age and race/ethnicity are related to negative mental health outcomes and mental health treatment seeking, we controlled for these variables in all adjusted analyses (Benjet et al., 2016;Roberts et al., 2011). ...
Article
Full-text available
Sexual minority veterans are at heightened risk for mental health conditions compared with their heterosexual peers. Subpopulations of the sexual minority community, including veterans, are at even greater risk for mental health conditions. Despite this heightened risk, little is known about mental health treatment seeking among sexual minority veterans, especially in under-researched sexual minority subpopulations (e.g., bisexual men and women). This study examined sexual orientation-based differences in mental health symptom severity and past-year mental health treatment among a national sample of post-9/11 veteran men and women (N = 14,968). Results indicated that bisexual veteran women had greater mental health symptom severity compared with lesbian/gay and heterosexual veteran women. Gay and bisexual veteran men had greater depression and anxiety symptom severity than heterosexual veteran men. However, among individuals who reported receiving a mental health diagnosis (posttraumatic stress disorder, depression, anxiety) there were no significant differences in odds of receiving mental health treatment between lesbian/gay and bisexual veteran men and women compared to their heterosexual counterparts. These results suggest the need for additional research on facilitators and barriers to accessing and engaging in mental health care among sexual minority veterans, especially bisexual veteran women who experience disproportionate psychological burden compared to their lesbian/gay and heterosexual peers.
... Estimated variances for these two surveys are computed via the Rao-Wu bootstrap procedure. This procedure constructs bootstrap weights that reflect the sample details: see Rao and Wu (1988) or Rao et al. (1992) for details on how the bootstrap weights are computed. ...
Article
Full-text available
Sampling variance smoothing is an important topic in small area estimation. In this article, we propose sampling variance smoothing methods for small area proportion estimation. In particular, we consider the generalized variance function and design effect methods for sampling variance smoothing. We evaluate and compare the smoothed sampling variances and small area estimates based on the smoothed variance estimates through analysis of survey data from Statistics Canada. The results from real data analysis and simulation study indicate that the proposed sampling variance smoothing methods perform very well for small area estimation.
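The paper's exact smoothing models are not reproduced in this listing, but one common generalized variance function (GVF) specification is log-linear in the point estimate. The sketch below fits that form by ordinary least squares; the functional form and all data are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

def gvf_smooth(theta_hat, v_direct):
    """Smooth direct sampling variances with a log-linear GVF.

    Fits log(v_i) = a + b * log(theta_i) by OLS and returns fitted
    (smoothed) variances; the cited paper may use a different form.
    """
    X = np.column_stack([np.ones_like(theta_hat), np.log(theta_hat)])
    coef, *_ = np.linalg.lstsq(X, np.log(v_direct), rcond=None)
    return np.exp(X @ coef)

theta = rng.uniform(0.05, 0.4, size=30)   # direct proportion estimates
v = theta * (1 - theta) / 100 * rng.uniform(0.5, 2.0, size=30)  # noisy variances
print(gvf_smooth(theta, v)[:5])
```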
... In order for estimator (8) to efficiently correct the selection bias, the propensity score model has to be well specified. The variance estimator $\hat{V}_{\mathrm{IPW}}$ for (8) may be obtained by using resampling methods, for example, the bootstrap procedure from [10]. ...
Article
Full-text available
We aim to find a way to effectively integrate a non-probability (voluntary) sample under the data framework, where the study variable is also observed in a probability sample of some statistical survey. The selection bias that arises from voluntary participation in the survey is corrected by estimating the probabilities of inclusion into the sample (propensity scores) for the units in the non-probability sample. The estimators for the propensity scores are constructed using a parametric logistic regression model. We consider two modeling scenarios: one assuming that the willingness to participate in the voluntary survey does not depend on the survey variable itself, and one in which that variable does contribute to whether the individual responds or not. The maximum likelihood method is applied in both scenarios to estimate the propensity scores. The estimators of the population mean based on the estimated propensity scores are linearly combined with the unbiased estimator using the probability sample data. We compare the constructed estimators in the simulation study, where we estimate the population proportions using data from the Population and Housing Census surveys.
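A stripped-down sketch of the first modeling scenario: stack the two samples, fit a logistic model for membership in the volunteer sample, and weight volunteers by their estimated odds of non-participation. This is a generic pseudo-weighting recipe with an unweighted reference sample and simulated data, not the paper's exact estimator; it assumes scikit-learn is available.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)

# x: auxiliary variable observed in both samples; y observed only in
# the non-probability (volunteer) sample
n_np, n_p = 800, 400
x_np = rng.normal(1.0, 1.0, size=n_np)   # volunteers skew high on x
x_p = rng.normal(0.0, 1.0, size=n_p)
y_np = 2.0 + 1.5 * x_np + rng.normal(0, 1, size=n_np)

# logistic model for inclusion in the non-probability sample,
# fit on the stacked samples (delta = 1 marks volunteer units)
X = np.concatenate([x_np, x_p]).reshape(-1, 1)
delta = np.concatenate([np.ones(n_np), np.zeros(n_p)])
fit = LogisticRegression().fit(X, delta)
p = fit.predict_proba(x_np.reshape(-1, 1))[:, 1]  # propensity scores

# inverse-propensity-style mean using odds weights, one simple choice
w = (1 - p) / p
print("pseudo-weighted mean:", np.sum(w * y_np) / np.sum(w))
```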
... Shao and Tu [37] presented the jackknife variance estimator for $\hat{\theta}$ when multistage sampling is used. Resampling techniques in the context of survey sampling have thus been widely studied, and the original idea of jackknife variance estimation has been extended to stratified and multistage sampling designs by a number of statisticians, such as Jones [17], Kish and Frankel [18], Krewski and Rao [20], Kovar et al. [19], Rao et al. [32], and Shao and Tu [37]. ...

... The two major classes of estimation approaches under this framework are Taylor linearization (sometimes called the infinitesimal jackknife) and replication methods. We describe each at a high level but point the interested reader to the rich literature on comparisons and variations of these approaches applied to survey-weighted estimating functions (see, for example, Binder, 1996; Rao et al., 1992, as a starting point). ...
Preprint
Full-text available
We present csSampling, an R package for estimation of Bayesian models for data collected from complex survey samples. csSampling combines functionality from the probabilistic programming language Stan (via the rstan and brms R packages) and the handling of complex survey data from the survey R package. Under this approach, the user creates a survey-weighted model in brms or provides a custom weighted model via rstan. Survey design information is provided via the svydesign function of the survey package. The cs_sampling function of csSampling estimates the weighted Stan model and provides an asymptotic covariance correction for model mis-specification due to using survey sampling weights as plug-in values in the likelihood. This is often known as a "design effect", which is the ratio between the variance from a complex survey sample and a simple random sample of the same size. The resulting adjusted posterior draws can then be used for the usual Bayesian inference while also achieving frequentist properties of asymptotic consistency and correct uncertainty (e.g. coverage).
... Alternatively, a with-replacement bootstrap variance estimation can also be used here [43]. To illustrate, we consider a single-stage probability proportional to size sampling with negligible sampling ratios. ...
... Alternatively, a with-replacement bootstrap variance estimation can also be used here [43]. To illustrate, we consider a single-stage probability proportional to size sampling with negligible sampling ratios. ...
Preprint
Full-text available
Multiple heterogeneous data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we develop a unified framework of the test-and-pool approach to general parameter estimation by combining gold-standard probability and non-probability samples. We focus on the case when the study variable is observed in both datasets for estimating the target parameters, and each contains other auxiliary variables. Utilizing the probability design, we conduct a pretest procedure to determine the comparability of the non-probability data with the probability data and decide whether or not to leverage the non-probability data in a pooled analysis. When the probability and non-probability data are comparable, our approach combines both data for efficient estimation. Otherwise, we retain only the probability data for estimation. We also characterize the asymptotic distribution of the proposed test-and-pool estimator under a local alternative and provide a data-adaptive procedure to select the critical tuning parameters that target the smallest mean square error of the test-and-pool estimator. Lastly, to deal with the non-regularity of the test-and-pool estimator, we construct a robust confidence interval that has a good finite-sample coverage property.
... Adjustments are thus required to apply these methods to finite populations (Quatember, 2015). For finite populations, the rescaled bootstrap technique (Rao et al., 1992) can be used for bias correction of a given empirical version of . This method has been used in many research studies (Berger and Muñoz, 2015; Moya et al., 2020; etc.) in many areas (see Yang et al., 2010; Muñoz et al., 2018; etc.). ...
Article
Full-text available
The Gini index is probably the most commonly used indicator to measure inequality. For continuous distributions, the Gini index can be computed using several equivalent formulations. However, this is not the case with discrete distributions, where controversy remains regarding the expression to be used to estimate the Gini index. We attempt to bring a better understanding of the underlying problem by regrouping and classifying the most common estimators of the Gini index proposed in both infinite and finite populations, and focusing on the biases. We use Monte Carlo simulation studies to analyse the bias of the various estimators under a wide range of scenarios. Extremely large biases are observed in heavy-tailed distributions with high Gini indices, and bias corrections are recommended in this situation. We propose the use of some (new and traditional) bootstrap-based and jackknife-based strategies to mitigate this bias problem. Results are based on continuous distributions often used in the modelling of income distributions. We describe a simulation-based criterion for deciding when to use bias corrections. Various real data sets are used to illustrate the practical application of the suggested bias corrected procedures.
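One generic version of the bootstrap-based strategies mentioned above is the standard additive bias correction, 2·(estimate) − mean(bootstrap replicates), applied to a plug-in Gini estimator. The sketch below shows only that basic pattern on simulated heavy-tailed incomes under iid resampling; the paper evaluates several, more refined corrections.

```python
import numpy as np

rng = np.random.default_rng(5)

def gini(y):
    """Plug-in Gini index for a sample of non-negative incomes."""
    y = np.sort(y)
    n = len(y)
    i = np.arange(1, n + 1)
    return 2 * np.sum(i * y) / (n * np.sum(y)) - (n + 1) / n

def gini_bias_corrected(y, B=1000):
    """Standard bootstrap bias correction: 2*estimate - mean(replicates)."""
    g = gini(y)
    reps = np.array([gini(rng.choice(y, size=len(y), replace=True))
                     for _ in range(B)])
    return 2 * g - reps.mean()

y = rng.pareto(2.0, size=300) + 1   # heavy-tailed simulated incomes
print("plug-in:", gini(y), "bias-corrected:", gini_bias_corrected(y))
```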
... We apply the bootstrap method of Rao et al. [24] to evaluate the estimators of the design variances used in (18), (19), and (22). Let us estimate the variance of any estimator $\hat{\theta}_i$. ...
Article
Full-text available
Traditional direct estimation methods are inefficient for domains of a survey population with small sample sizes. To estimate the domain proportions, we combine the direct estimators and the regression-synthetic estimators based on domain-level auxiliary information. For the case of small true proportions, we propose the design-based linear combination that is a robust alternative to the empirical best linear unbiased predictor (EBLUP) based on the Fay–Herriot model. We imitate the Lithuanian Labor Force Survey, where we estimate the proportions of the unemployed and employed in municipalities. We show where the proposed design-based composition and estimator of its mean square error are competitive with EBLUP and its accuracy estimation.
... The 95% confidence intervals for the NSUM prevalence estimates were produced using the rescaled bootstrap procedure (J. Rao et al., 1992; J. N. Rao & Wu, 1988; Rust & Rao, 1996) with 50,000 resamples, as these have been found to perform better, in terms of coverage rates, than those based on the usual NSUM standard error calculations (Feehan & Salganik, 2016a) and provide insights into the sampling distribution of the NSUM estimator. ...
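For reference, a percentile-type interval built from such replicate estimates can be computed as below. Whether the study used percentile endpoints or another interval form is not stated in the excerpt, so this is a hedged illustration with placeholder values.

```python
import numpy as np

def percentile_ci(replicates, level=0.95):
    """Percentile confidence interval from bootstrap replicate estimates."""
    alpha = 1 - level
    lo, hi = np.quantile(replicates, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# `replicates` would hold the NSUM prevalence estimate recomputed
# under each set of rescaled bootstrap weights; placeholder values here
rng = np.random.default_rng(9)
replicates = rng.normal(0.012, 0.002, size=50_000)
print(percentile_ci(replicates))
```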
Article
Full-text available
The goal of this paper is to compare a traditional survey method with the network scale-up method (NSUM) for the prevalence estimation of child trafficking in Sierra Leone in 2020. The traditional survey method involved a probability-based, stratified, and clustered multistage sampling design in which adult respondents in 3,070 households were interviewed about trafficking of children who reside in their household in three selected districts. This paper details the first attempt to estimate the prevalence of child trafficking using NSUM, which entailed questioning the same adult respondents about the trafficking-related activities of children in their personal networks. Findings and interpretation of these results are presented, along with implications and recommendations for future studies.
... Its idea is to assume the relation $\psi_i \approx K N_i^{\gamma}$ and then estimate the parameters $K > 0$ and $\gamma \in \mathbb{R}$ through a log-log regression model. We estimate the design variances of all synthetic and composite estimators using the rescaling bootstrap from Rao et al. (1992). Let $\hat{\mu}^{(r)}_i$, $r = 1, \ldots, R$, be the realizations of any estimator $\hat{\mu}_i$ of the parameter $\mu_i$, where $\mu_i$ is a proportion or the MSE of the estimator of the proportion. ...
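The log-log regression step admits a direct sketch: regress log ψ_i on log N_i by ordinary least squares and exponentiate the intercept to recover K. The data and names below are simulated and illustrative.

```python
import numpy as np

rng = np.random.default_rng(13)

def fit_power_law(N, psi):
    """Estimate K and gamma in psi_i ≈ K * N_i**gamma by log-log OLS."""
    X = np.column_stack([np.ones_like(N, dtype=float), np.log(N)])
    (logK, gamma), *_ = np.linalg.lstsq(X, np.log(psi), rcond=None)
    return np.exp(logK), gamma

N = rng.integers(200, 5000, size=40)                 # domain sizes
psi = 0.8 * N ** 0.6 * rng.lognormal(0, 0.2, size=40)  # noisy power law
K, gamma = fit_power_law(N, psi)
print("K:", K, "gamma:", gamma)
```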
Article
Small area estimation methods are used in surveys, where sample sizes are too small to get reliable direct estimates of parameters in some population domains. We consider design‐based linear combinations of direct and synthetic estimators and propose a two‐step procedure to approach the optimal combination. We construct the mean square error estimator suitable for this and any other linear composition that estimates the optimal one. We apply the theory to two design‐based compositions analogous to the empirical best linear unbiased predictors (EBLUPs) based on the basic area‐ and unit‐level models. The simulation study shows that the new methods are efficient compared to estimation using EBLUP.
... The presented approximate normal confidence intervals for postestimation predictions are based on a bootstrapping approach that accounts for the complex survey design through a scale adjustment applied to 1000 sets of replicate weights [25][26][27]. We used R version 4.0.4 for our statistical analysis. ...
Article
Background: Diabetes is a growing concern in South Asia but few nationally representative studies identify factors behind this rising disease burden. We studied the nationwide change in diabetes prevalence in Bangladesh, subpopulations disproportionately affected, and the contribution of rising unhealthy weight to the change in diabetes prevalence. Methods: Based on a sample of 13,959 adults aged 35 years and older with biomarker measurements from the 2011 and 2017/2018 Bangladesh Demographic and Health Surveys, we estimated how the prevalence of diabetes changed nationally and across socioeconomic/geographic groups. Using counterfactual decomposition, we assessed how much the prevalence of diabetes would have grown if BMI had not changed between 2011 and 2017. Results: Diabetes prevalence increased from 12.1 (11.1, 13.1) to 14.4% (13.3, 15.5) between 2011 and 2017/2018. Diabetes grew disproportionately quickly among population groups with higher household wealth, more education, and in three regions. Over this same period, mean BMI increased from 20.9 (20.8, 21.1) to 22.5 kg/m2 (22.4, 22.7) and overweight from 25.8 (24.4, 27.3) to 42.1% (40.4, 43.7). Under the counterfactual scenario of constant BMI, diabetes would have risen by only 1.0 (-0.4, 2.4) instead of 2.3 percentage points (0.8, 3.7) nationally, corresponding to a contribution of 58% (-106.3, 221.7). Similarly, group-specific trends were largely attributable to increasing BMI. Conclusions: Diabetes prevalence in Bangladesh has increased rapidly between 2011 and 2017/2018. Decomposition analysis estimates have wide confidence intervals but are consistent with the hypothesis that this change was driven by the dramatic rise in body weights.
... Thus, estimates from univariate and bivariable analyses involving these weights were partially adjusted via nonveteran-to-veteran standardization weighting. A resampling-based variance estimation approach, employing the (n-1) rescaling bootstrap method, was applied using 200 replicate weights (Kolenikov, 2010; Rao et al., 1992). Weights for the 127 persons who were removed were set to zero. ...
Article
Full-text available
Large-scale epidemiological studies suggest that veterans may have poorer physical health than nonveterans, but this has been largely unexamined in post-9/11 veterans despite research indicating their high levels of disability and healthcare utilization. Additionally, little investigation has been conducted on sex-based differences and interactions by veteran status. Notably, few studies have explored veteran physical health in relation to national health guidelines. Self-reported, weighted data were analyzed on post-9/11 U.S. veterans and nonveterans (n = 19,693; 6,992 women, 12,701 men; 15,160 veterans, 4,533 nonveterans). Prevalence was estimated for 24 physical health conditions classified by Healthy People 2020 targeted topic areas. Associations between physical health outcomes and veteran status were evaluated using bivariable and multivariable analyses. Back/neck pain was most reported by veterans (49.3 %), twice that of nonveterans (22.8 %)(p < 0.001). Adjusted odds ratios (AORs) for musculoskeletal and hearing disorders, traumatic brain injury, and chronic fatigue syndrome (CFS) were 3-6 times higher in veterans versus nonveterans (p < 0.001). Women versus men had the greatest adjusted odds for bladder infections (males:females, AOR = 0.08, 95 % CI:0.04-0.18)(p < 0.001), and greater odds than men for multiple sclerosis, CFS, cancer, irritable bowel syndrome/colitis, respiratory disease, some musculoskeletal disorders, and vision loss (p < 0.05). Cardiovascular-related conditions were most prominent for men (p < 0.001). Veteran status by sex interactions were found for obesity (p < 0.03; greater for male veterans) and migraine (p < 0.01; greater for females). Healthy People 2020 targeted topic areas exclude some important physical health conditions that are associated with being a veteran. National health guidelines for Americans should provide greater consideration of veterans in their design.
... We follow the same procedure here except the resampling procedure must account for the survey weights. To do so, we use the approach in Kolenikov (2010), which is based on the rescaling bootstrap procedure developed in Rao et al. (1992). Second, inference is conducted by using the same bootstrap approach to account for survey weights along with the Imbens-Manski (2004) correction to obtain 90% confidence intervals (CIs). ...
Article
We examine economic mobility in India while accounting for misclassification to better understand the welfare effects of the rise in inequality. To proceed, we extend recently developed methods on the partial identification of transition matrices. Allowing for modest misclassification, we find overall mobility has been remarkably low: at least 65% of poor households remained poor or at-risk of being poor between 2005 and 2012. We also find Muslims, lower caste groups, and rural households are in a more disadvantageous position compared to Hindus, upper caste groups, and urban households. These findings cast doubt on the conventional wisdom that marginalized households in India are catching up.
... All statistical analyses were performed using Stata 17 SE. To consider the CCHS sampling plan and protect the confidentiality of respondents, all results were computed using bootstraps and sampling weights [37,38]. ...
Article
Full-text available
Life course exposure to neighbourhood deprivation may have a previously unstudied relationship with health disparities. This study examined the association between neighbourhood deprivation trajectories (NDTs) and poor reported self-perceived health (SPH) among Quebec’s adult population. Data of 45,990 adults with complete residential address histories from the Care-Trajectories-Enriched Data cohort, which links Canadian Community Health Survey respondents to health administrative data, were used. Accordingly, participants were categorised into nine NDTs (T1 (Privileged Stable)–T9 (Deprived Stable)). Using multivariate logistic regression, the association between trajectory groups and poor SPH was estimated. Of the participants, 10.3% (95% confidence interval [CI]: 9.9–10.8) had poor SPH status. This proportion varied considerably across NDTs: from 6.4% (95% CI: 5.7–7.2) for Privileged Stable (most advantaged) to 16.4% (95% CI: 15.0–17.8) for Deprived Stable (most disadvantaged) trajectories. After adjustment, the likelihood of reporting poor SPH was significantly higher among participants assigned to a Deprived Upward (odds ratio [OR]: 1.77; 95% CI: 1.48–2.12), Average Downward (OR: 1.75; CI: 1.08–2.84) or Deprived trajectory (OR: 1.81; CI: 1.45–2.86), compared to the Privileged trajectory. Long-term exposure to neighbourhood deprivation may be a risk factor for poor SPH. Thus, NDT measures should be considered when selecting a target population for public-health-related interventions.
... Many resampling methods have been proposed to capture the variation in a probability sample (see, e.g., Rao et al. 1992). To flexibly incorporate researchers' understanding of the inclusion mechanism of the nonprobability sample and allow the possibility of considering dependency between the two samples, we implement a pseudo-population bootstrap. ...
Article
Full-text available
Nonprobability samples, for example observational studies, online opt-in surveys, or register data, do not come from a sampling design and therefore may suffer from selection bias. To correct for selection bias, Elliott and Valliant (EV) proposed a pseudo-weight estimation method that applies a two-sample setup for a probability sample and a nonprobability sample drawn from the same population, sharing some common auxiliary variables. By estimating the propensities of inclusion in the nonprobability sample given the two samples, we may correct the selection bias by (pseudo) design-based approaches. This paper expands the original method, allowing for large sampling fractions in either sample or for high expected overlap between selected units in each sample, conditions often present in administrative data sets and more frequently occurring with Big Data.
... Analyses were conducted using SAS Enterprise Guide 7.1 and SUDAAN 11.0.0 software. To account for the complex survey design, p-values, 95% confidence intervals, and coefficients of variation (CV) were estimated using the bootstrap technique with 22 degrees of freedom (Rao 1992; Rust and Rao 1996). Statistical significance was specified as a p-value of less than 0.05. ...
Article
Full-text available
Objective: To examine the association between individual and cumulative leisure noise exposure, in addition to acceptable yearly exposure (AYE), and hearing outcomes among a nationally representative sample of Canadians. Design: Audiometry, distortion-product otoacoustic emissions (DPOAEs) and in-person questionnaires were used to evaluate hearing and leisure noise exposure across age, sex, and household income/education level. High-risk cumulative leisure noise exposure was defined as 85 dBA or greater for 40 h or more per week, with AYE calculations also based on this occupational limit. Study sample: A randomised sample of 10,460 respondents, aged 6–79, completed questionnaires and hearing evaluations between 2012 and 2015. Results: Among 50–79 year olds, high-risk cumulative leisure noise was associated with increased odds of a notch, while high exposure to farming/construction equipment noise was associated with hearing loss, notches and absent DPOAEs. No associations with hearing loss were found; however, non-significant tendencies observed included higher mean hearing thresholds, notches and hearing loss odds. Conclusion: Educational outreach and monitoring of hearing among young and middle-aged populations exposed to hazardous leisure noise would be beneficial.
... Due to the complexity of the sampling designs, some modifications and adjustments to the jackknife are often required. Various versions of the jackknife have been developed and studied both theoretically and empirically, for example, by Krewski and Rao (1981), Rao and Wu (1985), Kovar, Rao, and Wu (1988), and Rao, Wu, and Yue (1992), in the context of stratified multistage sampling. For the case of unistage stratified sampling without replacement with unequal probabilities, Berger (2007) proposed a novel jackknife estimator for the variance of a point estimator that is a function of Hájek estimators. ...
Article
The generalized regression estimator (GREG) is a well-known procedure for using auxiliary data to estimate means or totals using a sample selected from a finite population. The GREG estimator is motivated by an assumed linear superpopulation model and it is known to be asymptotically unbiased regardless of whether the model is correctly specified or not. When the sample size is small and/or when the linear model does not fit the sample data well, the GREG estimator may have nonnegligible bias. In this paper, we use the jackknife procedure to correct the bias of the GREG. We evaluate, both theoretically and by simulation, the performance of the jackknife bias-corrected regression estimator (GREG-JK) under unistage sampling without replacement with unequal probabilities. A jackknife mean squared error estimator is proposed that naturally includes a finite population correction, which is usually absent in the standard jackknife methods for variance estimation. A simulation study shows that the empirical bias of GREG-JK is negligible for all sample sizes and generated populations. Furthermore, the proposed jackknife mean squared error estimator demonstrates improvements over the customary estimator.
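The basic delete-one jackknife bias correction that GREG-JK builds on can be sketched generically as below; the paper's estimator additionally handles unequal inclusion probabilities and a finite population correction, which this illustration omits. Names and the example statistic are illustrative.

```python
import numpy as np

def jackknife_bias_corrected(estimator, data):
    """Delete-one jackknife bias correction for a generic estimator.

    Returns theta_jk = n*theta_hat - (n - 1)*mean(delete-one estimates).
    The cited GREG-JK additionally accounts for unequal inclusion
    probabilities and a finite population correction.
    """
    n = len(data)
    theta = estimator(data)
    loo = np.array([estimator(np.delete(data, i, axis=0))
                    for i in range(n)])
    return n * theta - (n - 1) * loo.mean()

rng = np.random.default_rng(17)
y = rng.lognormal(0, 1, size=50)
# a biased, non-linear statistic: squared coefficient of variation
cv_squared = lambda d: d.var(ddof=1) / d.mean() ** 2
print(jackknife_bias_corrected(cv_squared, y))
```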
... From these three (sets of) models, we calculated the indirect effect of each social support indicator and the total indirect effect by summing the four indirect effects. The statistical significance of indirect effects was estimated in 10,000 bootstrap samples using Rao, Wu, and Yue's bootstrap weighting method for complex surveys as implemented in SAS [41]. Due to concerns about model convergence problems in at least some of the bootstrap samples (which were smaller than the original sample) for uncommon disorders or disorder groups, we examined possible mediation only for any mental disorder and for > 1 mental disorder. ...
Article
Full-text available
Purpose Lesbian, gay, and bisexual (LGB) individuals, and LB women specifically, have an increased risk for psychiatric morbidity, theorized to result from stigma-based discrimination. To date, no study has investigated the mental health disparities between LGB and heterosexual individuals in a large cross-national population-based comparison. The current study addresses this gap by examining differences between LGB and heterosexual participants in 13 cross-national surveys, and by exploring whether these disparities were associated with country-level LGBT acceptance. Since lower social support has been suggested as a mediator of sexual orientation-based differences in psychiatric morbidity, our secondary aim was to examine whether mental health disparities were partially explained by general social support from family and friends. Methods Twelve-month prevalence of DSM-IV anxiety, mood, eating, disruptive behavior, and substance disorders was assessed with the WHO Composite International Diagnostic Interview in a general population sample across 13 countries as part of the World Mental Health Surveys. Participants were 46,889 adults (19,887 males; 807 LGB-identified). Results Male and female LGB participants were more likely to report any 12-month disorder (OR 2.2, p < 0.001 and OR 2.7, p < 0.001, respectively) and most individual disorders than heterosexual participants. We found no evidence for an association between country-level LGBT acceptance and rates of psychiatric morbidity between LGB and heterosexual participants. However, among LB women, the increased risk for mental disorders was partially explained by lower general openness with family, although most of the increased risk remained unexplained. Conclusion These results provide cross-national evidence for an association between sexual minority status and psychiatric morbidity, and highlight that for women, but not men, this association was partially mediated by perceived openness with family. Future research into individual-level and cross-national sexual minority stressors is needed.
... Due to the scaling issue under the finite population setting, the naïve bootstrap technique (Efron, 1979) may not be able to give unbiased variance estimates. Therefore, to overcome this scaling problem, rescaling bootstrap with-replacement techniques (Rao and Wu, 1988; Rao et al., 1992) and rescaling bootstrap without-replacement techniques (Ahmad, 1997) have been developed to obtain unbiased variance estimation of the estimators of finite population parameters. Chen et al. (2004) and Modarres et al. (2006) developed bootstrapping techniques for the RSS design in an infinite population setting without the use of any rescaling and/or finite correction factor. ...
Article
Full-text available
McIntyre (1952) introduced Ranked Set Sampling (RSS) to advance upon Simple Random Sampling (SRS) for circumstances where a preliminary ranking of sampled units is possible for the variable of interest using visual inspection or some other means without physically measuring the units. Further, RSS was classified into three sampling protocols named Level-0, Level-1 and Level-2 (Deshpande et al., 2006). The Level-0 sampling protocol of RSS is considered in this article. Estimating the variance of the Level-0 RSS estimator under the finite population framework was found to be cumbersome. In this article, two distinct rescaling bootstrap with-replacement methods, known as the Strata-based rescaling bootstrap with-replacement (SRBWR) method and the Cluster-based rescaling bootstrap with-replacement (CRBWR) method, are proposed to unbiasedly estimate the variance of the Level-0 RSS estimator of the finite population mean. Rescaling factors are obtained for both proposed methods to estimate the variance of the Level-0 RSS estimator unbiasedly. The results of the simulation analysis, together with a real data application, support that the proposed methods are capable of estimating the variance of the Level-0 RSS estimator almost unbiasedly. The developed SRBWR method performs better than the CRBWR method in terms of relative stability (RS) and percentage relative bias (%RB) for various combinations of set size (m) and number of cycles (r).
... Also see Rao and Wu (1984) and Chipperfield and Preston (2007). Rao et al. (1992) proposed a rescaling bootstrap method to cover non-smooth statistics, but did not discuss second-order accuracy. Sitter (1992b) proposed a mirror-match bootstrap method for complex sample designs, including stratified random sampling and two-stage cluster sampling. ...
Article
Full-text available
Bootstrap is a useful computational tool for statistical inference, but it may lead to erroneous analysis under complex survey sampling. In this paper, we propose a unified bootstrap method for stratified multi‐stage cluster sampling, Poisson sampling, simple random sampling without replacement and probability proportional to size sampling with replacement. In the proposed bootstrap method, we first generate bootstrap finite populations, apply the same sampling design to each bootstrap population to get a bootstrap sample, and then apply studentization. The second‐order accuracy of the proposed bootstrap method is established by the Edgeworth expansion. Simulation studies confirm that the proposed bootstrap method outperforms the commonly used Wald‐type method in terms of coverage, especially when the sample size is not large.
Article
Our work was motivated by the question whether, and to what extent, well‐established risk factors mediate the racial disparity observed for colorectal cancer (CRC) incidence in the United States. Mediation analysis examines the relationships between an exposure, a mediator and an outcome. All available methods require access to a single complete data set with these three variables. However, because population‐based studies usually include few non‐White participants, these approaches have limited utility in answering our motivating question. Recently, we developed novel methods to integrate several data sets with incomplete information for mediation analysis. These methods have two limitations: (i) they only consider a single mediator and (ii) they require a data set containing individual‐level data on the mediator and exposure (and possibly confounders) obtained by independent and identically distributed sampling from the target population. Here, we propose a new method for mediation analysis with several different data sets that accommodates complex survey and registry data, and allows for multiple mediators. The proposed approach yields unbiased causal effects estimates and confidence intervals with nominal coverage in simulations. We apply our method to data from U.S. cancer registries, a U.S.‐population‐representative survey and summary level odds‐ratio estimates, to rigorously evaluate what proportion of the difference in CRC risk between non‐Hispanic Whites and Blacks is mediated by three potentially modifiable risk factors (CRC screening history, body mass index, and regular aspirin use).
Article
We present a practical approach for computing the sandwich variance estimator in two-stage regression model settings. As a motivating example for two-stage regression, we consider regression calibration, a popular approach for addressing covariate measurement error. The sandwich variance approach has rarely been applied in regression calibration, despite requiring less computation time than popular resampling approaches for variance estimation, specifically the bootstrap. This is likely due to the specialized statistical coding required. We first outline the steps needed to compute the sandwich variance estimator. We then develop a convenient method of computation in R for sandwich variance estimation, which leverages standard regression model outputs and existing R functions and can be applied in the case of a simple random sample or complex survey design. We use a simulation study to compare the sandwich to a resampling variance approach for both settings. Finally, we further compare these two variance estimation approaches for data examples from the Women’s Health Initiative (WHI) and Hispanic Community Health Study/Study of Latinos (HCHS/SOL). The sandwich variance estimator typically had good numerical performance, but simple Wald bootstrap confidence intervals were unstable or over-covered in certain settings, particularly when there was high correlation between covariates or large measurement error.
Article
Background: Population-based seroprevalence studies are crucial to understand community transmission of COVID-19 and guide responses to the pandemic. Seroprevalence is typically measured from diagnostic tests with imperfect sensitivity and specificity. Failing to account for measurement error can lead to biased estimates of seroprevalence. Methods to adjust seroprevalence estimates for the sensitivity and specificity of the diagnostic test have largely focused on estimation in the context of convenience sampling. Many existing methods are inappropriate when data are collected using a complex sample design. Methods: We present methods for seroprevalence point estimation and confidence interval construction that account for imperfect test performance for use with complex sample data. We apply these methods to data from the Chatham County COVID-19 Cohort (C4), a longitudinal seroprevalence study conducted in central North Carolina. Using simulations, we evaluate bias and confidence interval coverage for the proposed estimator compared with a standard estimator under a stratified, three-stage cluster sample design. Results: We obtained estimates of seroprevalence and corresponding confidence intervals for the C4 study. SARS-CoV-2 seroprevalence increased rapidly from 10.4% in January to 95.6% in July 2021 in Chatham County, North Carolina. In simulation, the proposed estimator demonstrates desirable confidence interval coverage and minimal bias under a wide range of scenarios. Conclusion: We propose a straightforward method for producing valid estimates and confidence intervals when data are based on a complex sample design. The method can be applied to estimate the prevalence of other infections when estimates of test sensitivity and specificity are available.
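The classical ingredient behind such adjustments is the Rogan–Gladen estimator, which maps the apparent prevalence to a misclassification-corrected one; the paper's contribution is to make this kind of estimator and its confidence intervals valid under complex sample designs, which this minimal sketch does not attempt.

```python
def rogan_gladen(p_obs, sensitivity, specificity):
    """Classical misclassification-adjusted prevalence (Rogan-Gladen).

    p_obs is the raw (apparent) seropositive proportion. The paper's
    estimator builds on this idea while handling complex sample designs.
    """
    p = (p_obs + specificity - 1) / (sensitivity + specificity - 1)
    return min(max(p, 0.0), 1.0)  # truncate to the [0, 1] range

print(rogan_gladen(p_obs=0.12, sensitivity=0.90, specificity=0.98))
```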
Article
Full-text available
In this short paper we sketch how survey sampling changed during the last 50 years. We describe the development and use of model-assisted survey sampling and model-assisted estimators, such as the generalized regression estimator. We also discuss the development of complex survey designs, in particular mixed-mode survey designs and adaptive survey designs. These latter two kinds of survey designs were mainly developed to increase response rates and decrease survey costs. A third topic that we discuss is the estimation of sampling variance. The increased computing power of computers has made it possible to estimate sampling variance of an estimator by means of replication methods, such as the bootstrap. Finally, we briefly discuss current and future developments in survey sampling, such as the increased interest in using nonprobability samples.
Article
Statistical inference in the presence of nuisance functionals with complex survey data is an important topic in social and economic studies. The Gini index, Lorenz curves and quantile shares are among the commonly encountered examples. The nuisance functionals are usually handled by a plug-in nonparametric estimator and the main inferential procedure can be carried out through a two-step generalized empirical likelihood method. Unfortunately, the resulting inference is not efficient and the nonparametric version of the Wilks’ theorem breaks down even under simple random sampling. We propose an augmented estimating equations method with nuisance functionals and complex surveys. The second-step augmented estimating functions obey the Neyman orthogonality condition and automatically handle the impact of the first-step plug-in estimator, and the resulting estimator of the main parameters of interest is invariant to the first step method. More importantly, the generalized empirical likelihood based Wilks’ theorem holds for the main parameters of interest under the design-based framework for commonly used survey designs, and the maximum generalized empirical likelihood estimators achieve the semiparametric efficiency bound. Performances of the proposed methods are demonstrated through simulation studies and an application using the dataset from the New York City Social Indicators Survey.
Article
The changing values of the indicators obtained from national labour force surveys provide analysts and planners with valuable information on fluctuations in a country's labour market. Labour force surveys in many countries follow the standards established by the International Labour Organization and, as a result, tend to be similar in various respects. Given these similarities, this paper examines the procedures used by the statistical organizations of Canada and the European Union to develop variance estimates for changes in the labour force indicators of Iran. While the survey in Iran and those in the countries under study have many similarities, they also differ in certain respects, namely in the periodicity of the survey, the rotation pattern and the unit of rotation, and the possible existence of non-response among the primary sampling units. First, the methodologies of Statistics Canada and Eurostat are modified and adapted to the particularities of the labour force survey in Iran; the results are then compared. Among the four methods examined, the bootstrap methodology of Statistics Canada, after some modifications and adaptations, is found to be especially suitable for the labour force survey of Iran and, perhaps, for other countries with similar conditions. The proposed methodology is particularly well suited to capturing the impact of the various weight-calculation steps on the variance estimates of change in the main labour force indicators.
Chapter
Survey data provide a key source for calculating point and variance estimates for the population of interest. In this chapter, we discuss several factors that guide the choice of an appropriate variance formula for measuring the precision of point estimates, especially for surveys of establishments. Specific examples are taken from establishment surveys conducted around the world for additional background. A critical factor is the protocol used to obtain the sample members: probability-based or nonprobability sampling. Variance estimation for probability surveys, where the sample inclusion probabilities are defined for all units on the sampling frame, relies on well-developed design-based theory that accounts for inclusion probabilities and design features such as stratification and clustering. Conversely, inclusion probabilities for nonprobability surveys are unknown, and design- or model-based variance estimation methods are used under a set of strict assumptions. Another important factor is the form of the point estimate. Design-based estimates are calculated with survey analysis weights, whereas model-based estimates rely on a set of strong model covariates. We provide an overview of different approaches to statistical inference and survey weighting for probability and nonprobability surveys, citing additional references where appropriate. For example, ratio point estimators, such as a mean, are a function of two (weighted) survey estimates, each with an associated measure of precision; unlike that of an estimated total, the variance formula for a ratio estimate does not have a closed form and must be approximated. Moreover, additional complexities must be addressed when the data include statistical imputation to treat missing values. Consequently, we discuss the pros and cons of variance estimation with linearization, replication, and model-based techniques for probability and nonprobability establishment surveys under a variety of analytic needs.
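As an illustration of the approximation mentioned above, the standard first-order Taylor (linearization) variance of a ratio estimator \(\hat{R} = \hat{Y}/\hat{X}\) of \(R = Y/X\) is

\[
\operatorname{Var}(\hat{R}) \;\approx\; \frac{1}{X^{2}}\left[\operatorname{Var}(\hat{Y}) + R^{2}\operatorname{Var}(\hat{X}) - 2R\operatorname{Cov}(\hat{Y},\hat{X})\right],
\]

with the design-based variances and covariance of the weighted totals \(\hat{Y}\) and \(\hat{X}\) plugged in from the survey design at hand.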
Article
In observational cohort studies, there is frequently interest in modeling longitudinal change in a biomarker (ie, physiological measure indicative of metabolic dysregulation or disease; eg, blood pressure) in the absence of treatment (ie, medication), and its association with modifiable risk factors expected to affect health (eg, body mass index). However, individuals may start treatment during the study period, and consequently biomarker values observed while on treatment may be different than those that would have been observed in the absence of treatment. If treated individuals are excluded from analysis, then effect estimates may be biased if treated individuals differ systematically from untreated individuals. We addressed this concern in the setting of the Hispanic Community Health Study/Study of Latinos (HCHS/SOL), an observational cohort study that employed a complex survey sampling design to enable inference to a finite target population. We considered biomarker values measured while on treatment to be missing data, and applied missing data methodology (inverse probability weighting (IPW) and doubly robust estimation) to this problem. The proposed methods leverage information collected between study visits on when individuals started treatment, by adapting IPW and doubly robust approaches to model the treatment mechanism using survival analysis methods. This methodology also incorporates sampling weights and uses a bootstrap approach to estimate standard errors accounting for the complex survey sampling design. We investigated variance estimation for these methods, conducted simulation studies to assess statistical performance in finite samples, and applied the methodology to model temporal change in blood pressure in HCHS/SOL.
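As a rough illustration of the inverse probability weighting step only (a minimal sketch: the authors model the treatment mechanism with survival-analysis methods and use a design-based bootstrap for standard errors, neither of which is shown here; all variable names and the toy data are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)                                   # modifiable risk factor (e.g., BMI)
treated = rng.binomial(1, 1 / (1 + np.exp(-0.5 * x)))    # treatment uptake depends on x
y = 120 + 2.0 * x + rng.normal(scale=5, size=n)          # off-treatment biomarker value
w_survey = rng.uniform(1, 3, size=n)                     # survey sampling weights

# Step 1: model each individual's probability of remaining untreated given covariates.
p_untreated = (LogisticRegression()
               .fit(x.reshape(-1, 1), 1 - treated)
               .predict_proba(x.reshape(-1, 1))[:, 1])

# Step 2: keep untreated records; combined weight = survey weight / P(untreated).
keep = treated == 0
w = w_survey[keep] / p_untreated[keep]

# Step 3: weighted least squares for the off-treatment biomarker model.
X = sm.add_constant(x[keep])
fit = sm.WLS(y[keep], X, weights=w).fit()
print(fit.params)
```

In the actual analysis the standard errors would come from a bootstrap that respects the complex sampling design rather than from the WLS output.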
Technical Report
Full-text available
This statistical dossier publishes the first results of the Household Survey carried out in the city of Rosario during the last quarter of 2021 by the Usina de Datos UNR. First, data are presented on household types, dwellings and tenure arrangements, and other socio-housing conditions of the households. In addition, the EHR provides, for the first time, valuable information on pet ownership. The report also contains data on the population by place of birth, on migrants, and on persons with long-term difficulties, and provides information on education, health, environment and sources of income. The methodological considerations concerning the methodology, sample design and fieldwork are then presented, followed by a glossary of the categories relevant to this dossier. Subsequent releases will publish further indicators collected by the EHR that deepen and extend the information presented here.
Article
Using five diet quality indices, we estimated the mortality and loss of life expectancy attributable to poor dietary patterns at the national level, which had previously been largely unknown. The Canadian Community Health Survey 2004 linked to vital statistics was used (n = 16,212 adults, representing 22,898,880 Canadians). After a median follow-up of 7.5 years, 1,722 mortality cases were recorded. Population attributable fractions were calculated to estimate the mortality burden of poor dietary patterns (Dietary Guidelines for Americans Adherence Index 2015, Dietary Approaches to Stop Hypertension, Healthy Eating Index, Alternative HEI, and Mediterranean Style Dietary Pattern Score). Better diet quality was associated with a 32-51% and 21-43% reduction in all-cause mortality among adults 45-80 years and ≥20 years, respectively. Projected life expectancy at 45 years was longer for Canadians adhering to a healthy dietary pattern (on average 5.2-8.0 years for males and 1.6-4.1 years for females). At the population level, 26.5-38.9% (males) and 8.9-22.9% (females) of deaths were attributable to poor dietary patterns. The survival benefit was greater for individuals with higher scores on all diet indices, even with relatively small intake differences. The large attributable burden likely stems from assessing overall dietary patterns instead of a limited range of foods and nutrients.
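The population attributable fractions mentioned above are conventionally computed with the standard multi-category formula (shown here for orientation; the paper's exact estimator may differ):

\[
\mathrm{PAF} \;=\; \frac{\sum_{i} p_i\,(\mathrm{RR}_i - 1)}{1 + \sum_{i} p_i\,(\mathrm{RR}_i - 1)},
\]

where \(p_i\) is the prevalence of exposure category \(i\) and \(\mathrm{RR}_i\) its relative risk of death.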
Article
The generalized regression estimator (GREG) uses auxiliary data available for the finite population to improve the efficiency of the estimator of a total (or mean). Estimators of the variance of the GREG proposed in the sampling literature include those based on Taylor linearization and jackknife techniques. Approximations based on Taylor expansions are reasonable for large samples; however, when the sample size is small, the Taylor-based variance estimator has a large negative bias, while jackknife variance estimators overestimate the variance of the GREG. We address these shortcomings with a bootstrap procedure for estimating the variance of the GREG. The method uses a bootstrap population constructed from the model underlying the GREG estimator. Repeated samples are selected from the bootstrap population according to the design used to select the initial sample, and the variability of these bootstrap samples is used to compute the proposed bootstrap variance estimator. Simulations show that the new bootstrap estimator has small bias for samples with few observations.
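A minimal sketch of the bootstrap-population idea, assuming simple random sampling and a no-intercept linear working model (the paper's procedure accommodates general designs; function and variable names are illustrative):

```python
import numpy as np

def greg_total(y, x, tx, N):
    """GREG estimator of the total of y given the known population total tx of x
    (simple random sampling; no-intercept linear working model)."""
    n = len(y)
    beta = np.sum(x * y) / np.sum(x * x)       # fitted slope of the working model
    ht_y = N / n * np.sum(y)                   # Horvitz-Thompson total of y
    ht_x = N / n * np.sum(x)                   # Horvitz-Thompson total of x
    return ht_y + beta * (tx - ht_x)           # regression adjustment

def greg_bootstrap_var(y, x, N, B=1000, seed=1):
    """Build a pseudo-population from the working model, then re-select SRS
    samples from it and recompute the GREG each time."""
    rng = np.random.default_rng(seed)
    n = len(y)
    beta = np.sum(x * y) / np.sum(x * x)
    resid = y - beta * x
    reps = max(N // n, 1)
    x_pop = np.tile(x, reps)                                  # pseudo-population x values
    y_pop = beta * x_pop + rng.choice(resid, size=reps * n)   # model fit + resampled residuals
    tx_pop = np.sum(x_pop)                                    # "true" x total in pseudo-population
    est = np.empty(B)
    for b in range(B):
        i = rng.choice(reps * n, size=n, replace=False)       # redraw by the original SRS design
        est[b] = greg_total(y_pop[i], x_pop[i], tx_pop, reps * n)
    return est.var(ddof=1)
```

For a stratified or clustered design, the resampling line would be replaced by a draw that mimics the original design within the pseudo-population.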
Article
Imputation is commonly used to deal with item nonresponse in surveys. Treating the imputed values as true observations can lead to serious underestimation of the variance of point estimators. In this article, we propose a new bootstrap method, within the rescaling bootstrap approach, for estimating the variance of an imputed estimator obtained after applying deterministic regression or random hot-deck imputation. A novel technique rescales the original data set by solving certain systems of linear equations. The proposed procedure can handle unequal response probabilities and large sampling fractions. Simulation studies demonstrate the strong performance of the proposed method in terms of relative bias, relative efficiency, and coverage probability, for both the population mean and the median.
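The rescaling bootstrap family this work builds on goes back to the Rao-Wu construction of bootstrap weights. A minimal sketch of the classical weight rescaling for stratified samples, with resample size n_h - 1 per stratum and ignoring finite-population corrections, is shown below; it illustrates the baseline technique only, not the paper's extension to imputed data, and the names are illustrative:

```python
import numpy as np

def rao_wu_weights(w, strata, B=500, seed=2):
    """Rao-Wu rescaling bootstrap weights (resample size n_h - 1 per stratum).
    Each stratum needs n_h >= 2. Returns an array of shape (B, n)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)
    boot = np.empty((B, len(w)))
    for h in np.unique(strata):
        idx = np.flatnonzero(strata == h)
        n_h = len(idx)
        # Multinomial counts: how often each sampled unit is redrawn (SRSWR, size n_h - 1).
        m = rng.multinomial(n_h - 1, np.full(n_h, 1 / n_h), size=B)
        # With resample size n_h - 1, the rescaled weight simplifies to w * n_h/(n_h-1) * m.
        boot[:, idx] = w[idx] * (n_h / (n_h - 1)) * m
    return boot

# Usage: bootstrap variance of a weighted mean from the replicate weights.
w = np.array([2.0, 2.0, 2.0, 3.0, 3.0, 3.0])
strata = np.array([1, 1, 1, 2, 2, 2])
y = np.array([5.0, 7.0, 6.0, 10.0, 12.0, 11.0])
bw = rao_wu_weights(w, strata)
means = (bw * y).sum(axis=1) / bw.sum(axis=1)
print(means.var(ddof=1))   # bootstrap variance estimate of the weighted mean
```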
Article
We propose two synthetic microdata approaches to generating private tabular survey data products for public release. We adapt to tabular survey products a pseudo posterior mechanism that downweights each record's likelihood contribution by a weight in [0, 1] determined by its identification disclosure risk. Our method, applied to an observed survey database, achieves an asymptotic global probabilistic differential privacy guarantee. Our two approaches synthesize the observed sample distribution of the outcome and the survey weights jointly, so that both quantities together possess a privacy guarantee. The privacy-protected outcome and survey weights are used to construct tabular cell estimates (with the cell inclusion indicators treated as known and public) and associated standard errors that correct for survey sampling bias. Through a real data application to the Survey of Doctorate Recipients public use file and simulation studies motivated by the application, we demonstrate that our two microdata synthesis approaches provide superior utility preservation compared with the additive-noise approach of the Laplace Mechanism. Moreover, our approaches allow the release of microdata to the public, enabling additional analyses at no extra privacy cost.
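The pseudo posterior mechanism referenced above takes the general form (a standard formulation in this literature, shown for orientation):

\[
p_{\alpha}(\theta \mid \mathbf{x}) \;\propto\; \left[\prod_{i=1}^{n} p(x_i \mid \theta)^{\alpha_i}\right]\pi(\theta), \qquad \alpha_i \in [0,1],
\]

where records with higher identification disclosure risk receive smaller weights \(\alpha_i\), shrinking their likelihood contributions toward the prior.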
Article
Single-cell RNA sequencing (scRNA-seq) data exhibit an unusual abundance of zero counts, with a considerable fraction due to dropout events, which complicates differential expression analysis. To correct biases in differential expression due to informative dropouts, an inverse non-dropout-probability weighting method is proposed, exploiting the fact that the dropout rate is negatively dependent on the underlying gene expression magnitude in scRNA-seq data. The weights are estimated by maximum likelihood, with dropout values integrated out using Gauss-Hermite quadrature. Linear, generalized linear, and mixed regressions with the estimated weights are fitted to original or transformed scRNA-seq data, and the variances of coefficient estimators from the weighted regressions are estimated using the jackknife method. Extensive simulation studies compare the proposed method with five cutting-edge methods (Limma, edgeR, MAST, ZIAQ, and scImpute); the proposed method performs among the best under all scenarios in terms of AUC, sensitivity, specificity, and FDR. The rate of detecting true positives is examined for all six methods using mouse embryonic stem cells and fibroblasts, where differentially expressed (DE) genes detected in bulk RNA-seq data on the same set of genes under the same conditions from an independent source serve as true positives; specificity is compared on true-negative data obtained by randomly splitting a real dataset. Furthermore, the proposed method is illustrated on a lineage study in which cells in the same embryo are correlated and genes differentially expressed between cell division lineages are identified.
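A generic sketch of the delete-one jackknife used for the coefficient variances (the standard jackknife applied to weighted least squares, not the authors' full weighted-regression pipeline; names are illustrative):

```python
import numpy as np

def wls(y, X, w):
    """Weighted least squares via the normal equations (W = diag(w))."""
    XtW = X.T * w                      # X^T W without forming the diagonal matrix
    return np.linalg.solve(XtW @ X, XtW @ y)

def jackknife_se(y, X, w):
    """Delete-one jackknife standard errors for WLS coefficients."""
    n = len(y)
    reps = np.array([wls(np.delete(y, i), np.delete(X, i, axis=0), np.delete(w, i))
                     for i in range(n)])
    # Jackknife variance: (n - 1)/n times the sum of squared deviations of the replicates.
    var = (n - 1) / n * ((reps - reps.mean(axis=0)) ** 2).sum(axis=0)
    return wls(y, X, w), np.sqrt(var)
```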
Article
Background: Hybrid methodologies have gained continuing interest as unique data reduction techniques for establishing a direct link between dietary exposures and clinical outcomes. Objectives: We aimed to compare partial least squares (PLS) and reduced rank regression (RRR) in identifying a dietary pattern associated with high cardiovascular disease (CVD) risk in Canadian adults, to construct PLS- and RRR-based simplified dietary patterns, and to assess associations between the four dietary pattern scores and CVD risk. Design: Data were collected from 24-hour dietary recalls of adult respondents in two cycles of the nationally representative Canadian Community Health Survey (CCHS)-Nutrition: CCHS 2004 linked to health administrative databases (n = 12,313) and CCHS 2015 (n = 14,020). Using 39 food groups, PLS and RRR were applied to identify an energy-dense (ED), high-saturated-fat (HSF), low-fiber-density (LFD) dietary pattern. Associations of the derived dietary pattern scores with lifestyle characteristics and CVD risk were examined using weighted multivariate regression and weighted multivariable-adjusted Cox proportional hazards models, respectively. Results: PLS and RRR identified highly similar ED, HSF, LFD dietary patterns, with common high positive loadings for fast food, carbonated drinks, salty snacks, and solid fats, and high negative loadings for fruit, dark green vegetables, red and orange vegetables, other vegetables, whole grains, legumes, and soy (≥|0.17|). The food groups with the highest loadings were summed to form simplified pattern scores. Although the dietary patterns were not significantly associated with CVD risk, they were positively associated with energy intake (402 kcal/d higher in the fourth quartile; P-trends <0.05) and with obesity risk in the fourth quartile [PLS (OR: 2.09; 95% CI: 1.62, 2.7) and RRR (OR: 1.76; 95% CI: 1.44, 2.17)] (P-trends <0.0001). Conclusion: PLS and RRR were equally effective for deriving a high-CVD-risk dietary pattern among Canadian adults. Further research is warranted on the role of major dietary components in cardiovascular health.
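A minimal sketch of the PLS step, assuming X is the n × 39 matrix of standardized food-group intakes and Y holds the three pattern-defining responses (energy density, saturated-fat share, fiber density); all data here are simulated and the names are illustrative:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=(n, 39))                   # 39 standardized food-group intakes
Y = X[:, :3] @ rng.normal(size=(3, 3)) + rng.normal(size=(n, 3))   # toy responses

pls = PLSRegression(n_components=1).fit(X, Y)
loadings = pls.x_loadings_[:, 0]               # food-group loadings on the first factor
scores = pls.transform(X)[:, 0]                # per-person dietary pattern score
top = np.argsort(np.abs(loadings))[::-1][:8]   # highest-|loading| food groups
print(top, loadings[top])
```

RRR differs only in the criterion optimized: it maximizes explained variation in the responses rather than the covariance between the X and Y factors, which is why the two methods can yield similar but not identical loadings.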
Article
Modelling survey data often requires knowledge of the design and weighting variables. With public-use survey data, some of these variables may be unavailable for confidentiality reasons. The proposed approach can be used in this situation, as long as calibrated weights and variables specifying the strata and primary sampling units are available. It gives consistent point estimation and a pivotal statistic for testing and confidence intervals. The proposed approach does not rely on with-replacement sampling, single-stage designs, negligible sampling fractions, or non-informative sampling. Adjustments based on design effects, eigenvalues, joint-inclusion probabilities, or the bootstrap are not needed. The inclusion probabilities and auxiliary variables do not have to be known. Multi-stage designs with unequal selection of primary sampling units are considered, and non-response can easily be accommodated if the calibrated weights include a re-weighting adjustment for non-response. We use an unconditional approach, in which the variables and the sample are random; the design can be informative.
Article
This study aimed to determine whether higher intakes of sodium, added sugars, and saturated fat are prospectively associated with all-cause mortality and with cardiovascular disease (CVD) incidence and mortality in a diverse population. The nationally representative Canadian Community Health Survey (CCHS)-Nutrition 2004 was linked with the Canadian Vital Statistics – Death Database and the Discharge Abstract Database (2004-2011). The outcomes were all-cause mortality and CVD incidence and mortality. There were 1,722 mortality cases within 115,566 person-years of follow-up (median (IQR) of 7.48 (7.22-7.70) years). There was no statistically significant association between sodium density or energy from saturated fat and all-cause mortality or CVD events for any of the models investigated. The association between the usual percentage of energy from added sugars and all-cause mortality was significant in the base model, with participants consuming 11.47% of energy from added sugars having 1.34 (95% CI: 1.01-1.77) times higher risk of all-cause mortality than those consuming 4.17% of energy from added sugars. Overall, our results did not show statistically significant associations between the three nutrients and the risk of all-cause mortality or CVD events at the population level in Canada. Large-scale linked national nutrition datasets may not have the discriminatory power to identify prospective impacts of nutrients on health measures.