Article

Nonexperimental Replications of Social Experiments: A Systematic Review


... A variety of techniques are employed to caliper matches (Glazerman, Levy, & Myers, 2002). These include "nearest neighbor," "kernel," and "forced" matching with and without replacement sampling. ...
... These methods are only beginning to penetrate social work, and depending on the estimation model, they produce a range of outcomes, some of which approximate treatment effects when random assignment is used (e.g., see Sosin, 2002). To be sure, the evidence is uneven, with about one half of the studies attempting to replicate experimental findings by using nonexperimental designs producing promising findings (Glazerman et al., 2002). Thus caution is warranted. ...
... The conditions under which a propensity score plus matching strategy seems more likely to replicate experimental findings are not yet clear. However, they appear to include (a) when the comparison group is drawn from within the agency or local community and (b) when preintervention measures are used in a selection bias equation (Glazerman et al., 2002). Even though these emerging methods tend to require larger samples, their potential is great, because, at the agency level, it is random assignment and not pre- and postmeasurement that often compromises intervention research. ...
Article
The purpose of this article is to review substantive and methodological advances in interventive research. Three substantive advances are discussed: (a) the growing use of a risk factor perspective, (b) the emergence of practice-relevant microsocial theories, and (c) the increased acceptance of structured treatment protocols and manuals. In addition, three methodological developments are discussed. They include new developments for dealing with attrition, for dealing with selection effects, and for decomposing complexities using text and numerical analyses. Arguing that intervention research holds the potential to unify research scholarship in social work, the conclusion discusses ongoing challenges associated with the implementation of new programs, variance in outcomes by method, reactivity to measurement, and construct validity in the context of culture.
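The caliper matching mentioned in the citing passage above can be illustrated with a minimal sketch. It assumes propensity scores have already been estimated. The function name `caliper_match`, the caliper width of 0.05, and the toy scores are hypothetical choices for illustration, not taken from any of the cited studies.

```python
def caliper_match(treated, controls, caliper=0.05, replacement=True):
    """Nearest-neighbor matching on propensity scores within a caliper.

    treated, controls: lists of propensity scores (hypothetical inputs).
    Returns (treated_index, control_index) pairs. A treated unit with no
    control inside the caliper stays unmatched.
    """
    available = set(range(len(controls)))
    pairs = []
    for i, score in enumerate(treated):
        # Keep only controls whose score lies within the caliper.
        candidates = [j for j in available if abs(controls[j] - score) <= caliper]
        if not candidates:
            continue
        # Pick the closest control; ties go to the lowest index.
        best = min(candidates, key=lambda j: (abs(controls[j] - score), j))
        pairs.append((i, best))
        if not replacement:
            # Without replacement, each control can be used only once.
            available.discard(best)
    return pairs


treated_scores = [0.30, 0.32, 0.80]
control_scores = [0.31, 0.50, 0.10]
with_repl = caliper_match(treated_scores, control_scores, replacement=True)
without_repl = caliper_match(treated_scores, control_scores, replacement=False)
```

In this toy example, matching with replacement reuses the same control for two treated units. Matching without replacement leaves the second treated unit unmatched, because no other control falls inside the caliper.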
... This review only included well-implemented experimental design studies because there is a strong body of research documenting the unreliability, and in some cases bias, in studies based on quasi-experimental studies (Guyat et al., 2000; Agodini & Dynarski, 2001; Weisburd, Lum, & Petrosino, 2001; Glazerman, Levy, & Myers, 2003). Furthermore, the study reports needed to provide a reasonable description of the methodology, program goals, and program activities. ...
... There is a strong body of research documenting the unreliability, and in some cases bias, in studies based on quasi-experimental studies, especially for studies of voluntary participation in programs (Guyat, DiCenso, Farewell, Willan, & Griffith, 2000; Agodini & Dynarski, 2001; Weisburd, Lum, & Petrosino, 2001; Glazerman, Levy, & Myers, 2003). Table 19 compares the methods and conclusions from prior reviews. ...
... However, using an experimental design, such as a randomised controlled trial (RCT), has many advantages since it eliminates some biases, for instance "selection bias" (pre-existing differences), "omitted variable bias" and partly "publication bias" (supporting results that are statistically significant) [39]. A systematic review study in developed countries suggests that the omitted variable bias is a major problem when non-experimental methods are used [40]. There is increased evidence that interventions to change (also dietary) behaviour are enhanced by applying theories of behaviour and behavioural change in their development, implementation and evaluation phase [41, 42]. ...
... However, using an experimental design, such as a randomised controlled trial (RCT), has many advantages since it eliminates some biases, for instance "selection bias" (pre-existing differences), "omitted variable bias" and partly "publication bias" (supporting results that are statistically significant) [39]. A systematic review study in developed countries suggests that the omitted variable bias is a major problem when non-experimental methods are used [40]. This research design assigns subjects randomly to either a study or a control group (Fig. 2). ...
Article
Full-text available
Background: Food, nutrition and health policy makers are faced with two pertinent issues more than any other: obesity and climate change. Consumer research has focused primarily on specific areas of sustainable food, such as organic food, local or traditional food, meat substitution and/or reduction. A more holistic view of sustainable healthy eating behaviour has received less attention, although more research is emerging in this area. Methods/design: This study protocol aims to investigate young consumers' attitudes and behaviour towards sustainable and healthy eating by applying a multidisciplinary approach, taking into account economic, marketing, public health and environmental issues. In order to achieve this goal, consumers' reactions to interactive tailored informational messages about sustainable (from a social, environmental and economic point of view) as well as healthy eating behaviour will be investigated in a group of young adults using a randomized controlled trial. To meet this objective, the empirical research is divided into three studies: 1) Qualitative longitudinal research to explore openness to adopting sustainable healthy eating behaviour; 2) Qualitative research with the objective to develop a sustainable healthy eating behaviour index; and 3) Randomised controlled trial to describe consumers' reactions to interactive tailored messages about sustainable healthy eating in young consumers. Discussion: To our knowledge, this is the first randomised controlled trial to test young adults' reactions to interactive tailor-made messages on sustainable healthy eating using a mobile smartphone app. Mobile applications designed to deliver interventions offer new possibilities to influence young adults' behaviour in relation to diet and sustainability. Therefore, the study will provide valuable insights into drivers of change towards more environmentally sustainable and healthy eating behaviours.
Trial registration: NCT02776410 registered May 16, 2016.
... First, teachers were not randomly assigned to condition. Thus, selection bias could have affected the results, and several metaanalytic studies do suggest that experimental studies give more accurate estimates of impact than do quasi-experimental studies (Glazerman, Levy, & Myers, 2003). Second, in controlling for effects of clustering in our third analysis, we increased the power of the study to detect significant effects but doing so limited its generalizability to the teachers in the sample. ...
... Although we found differences on pretest scores only for the Motion test and no significant measured differences between teachers in the two groups, unmeasured differences between groups could bias estimates of the impacts of the units. Past studies show that randomized control trials offer more unbiased estimates of effects than do quasi-experimental studies (Glazerman, Levy, & Myers, 2003). ...
... we adjusted the standard errors using robust variance estimation (Dehejia & Wahba, 2002). As the applied literature from quasi-experimental studies suggests, bias is lower when the comparison group is locally matched to treatment (Glazerman et al., 2002). In accordance, we conducted all matching of students within a single district context, CVESD, and within the same grade-level, 4, 5, 6, 7, and 8. ...
... According to the results of the descriptive statistics based on the data from the CHARLS, the main source of income for the elderly, excluding pension benefits, is the transfer income from their non-co-resident children or income sharing from their co-resident children. 10. Glazerman et al. (2003) also pointed out that PSM is a nonparametric statistical method that can significantly reduce the endogeneity issue, especially when combined with other methods such as the DID method. 11. ...
Chapter
Using data from the China Health and Retirement Longitudinal Survey of 2013 and 2015, this study investigates the effect of enrollment in public pensions and the amount of public pension benefits on the income transfer between the elderly and their children. The three conclusions are as follows. First, in general, there is a flattening and then rising trend in the relationship between enrollments in pensions and net transfer income; the effect of the amount of pension benefits on net transfer income is negative but not significant. Second, the amount of pension benefit of the New Rural Social Pension Insurance does not significantly affect the net transfer income, transfer income from children, or transfer income to children, while the pension benefit of the Employees’ Basic Pension Insurance has a significant positive effect on the transfer income to children. Third, the effects of pensions differ by the heterogeneous group. The need for high pension income is greater for the disadvantaged group, such as older women, single elderly, co-residence elderly, elderly with chronic diseases or disabilities, and elderly in the rural central and eastern regions.
Keywords: Pension benefit; Net transfer income; Transfer income to children; Transfer income from children; New Rural Social Pension Insurance (NRSPI)
... A strength of this econometric work was the careful consideration of the causal estimands of interest, including the definition of the target population, as well as the use of individual-level data to harmonize analyses. In economics, education, and other social sciences, a large literature on "within-study comparisons" between trial and observational analyses based on datasets formed by combining experimental and non-experimental data has continued this tradition (8-13) and has produced largely similar findings: disagreements between trial and observational analyses do occur and are often hard to predict (8, 14). A number of epidemiologic investigations have also attempted to benchmark the results of observational analyses against trial analyses, but until recently the majority of these comparisons used published information from studies addressing similar clinical questions, rather than using individual-level data (15-21). ...
Article
Comparisons between randomized trial analyses and observational analyses that attempt to address similar research questions have generated many controversies in epidemiology and the social sciences. There has been little consensus on when such comparisons are reasonable, what their implications are for the validity of observational analyses, or whether trial and observational analyses can be integrated to address effectiveness questions. Here, we consider methods for using observational analyses to complement trial analyses when assessing treatment effectiveness. First, we review the framework for designing observational analyses that emulate target trials and present an evidence map of its recent applications. We then review approaches for estimating the average treatment effect in the target population underlying the emulation: using observational analyses of the emulation data alone; and using transportability analyses to extend inferences from a trial to the target population. We explain how comparing treatment effect estimates from the emulation against those from the trial can provide evidence on whether observational analyses can be trusted to deliver valid estimates of effectiveness - a process we refer to as benchmarking - and, in some cases, allow the joint analysis of the trial and observational data. We illustrate different approaches using a simplified example of a pragmatic trial and its emulation in registry data. We conclude that synthesizing trial and observational data - in transportability, benchmarking, or joint analyses - can leverage their complementary strengths to enhance learning about comparative effectiveness, through a process combining quantitative methods and epidemiological judgements.
... Using this approach, outcomes between treatment and control groups are compared, after matching them with similar observable factors, followed by estimation by DiD [40-42]. Combining the PSM approach with DiD allows further elimination of any time-invariant differences between the treatment and control groups, and allows selection on observables and unobservables which are constant over time [40, 43]. Additionally, matching on the propensity score accounts for imbalances in the distribution of the covariates between the treatment and control groups [40]. ...
Article
Full-text available
Background: Health services research often relies on quasi-experimental study designs in the estimation of treatment effects of a policy change or an intervention. The aim of this study is to compare some of the commonly used non-experimental methods in estimating intervention effects, and to highlight their relative strengths and weaknesses. We estimate the effects of Activity-Based Funding, a hospital financing reform of Irish public hospitals, introduced in 2016. Methods: We estimate and compare four analytical methods: Interrupted time series analysis, Difference-in-Differences, Propensity Score Matching Difference-in-Differences and the Synthetic Control method. Specifically, we focus on the comparison between the control-treatment methods and the non-control-treatment approach, interrupted time series analysis. Our empirical example evaluated the length of stay impact post hip replacement surgery, following the introduction of Activity-Based Funding in Ireland. We also contribute to the very limited research reporting the impacts of Activity-Based Funding within the Irish context. Results: Interrupted time-series analysis produced statistically significant results different in interpretation, while the Difference-in-Differences, Propensity Score Matching Difference-in-Differences and Synthetic Control methods, incorporating control groups, suggested no statistically significant intervention effect on patient length of stay. Conclusion: Our analysis confirms that different analytical methods for estimating intervention effects provide different assessments of the intervention effects. It is crucial that researchers employ appropriate designs which incorporate a counterfactual framework. Such methods tend to be more robust and provide a stronger basis for evidence-based policy-making.
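The logic of combining propensity score matching with difference-in-differences, as described in the citing passage above, can be sketched as follows. The sketch assumes matching has already produced treated-control pairs with outcomes observed before and after the intervention. The function name `did_on_matched_pairs`, the dictionary keys, and the numbers are hypothetical and do not reproduce the cited study's data or estimator.

```python
from statistics import mean


def did_on_matched_pairs(pairs):
    """Difference-in-differences averaged over matched pairs.

    Each pair holds pre/post outcomes for a treated unit and its matched
    control. Subtracting the control's change from the treated unit's
    change removes any level difference that is constant over time.
    """
    diffs = [
        (p["treated_post"] - p["treated_pre"]) - (p["control_post"] - p["control_pre"])
        for p in pairs
    ]
    return mean(diffs)


# Hypothetical matched sample: treated units start at a higher level.
matched = [
    {"treated_pre": 10.0, "treated_post": 13.0, "control_pre": 8.0, "control_post": 9.0},
    {"treated_pre": 12.0, "treated_post": 14.0, "control_pre": 11.0, "control_post": 12.0},
]

# Naive post-period comparison mixes the effect with the baseline gap.
naive_gap = mean(p["treated_post"] - p["control_post"] for p in matched)
did_effect = did_on_matched_pairs(matched)
```

Here the naive post-period gap is 3.0, while the DiD estimate is 1.5, because the baseline difference between matched units is differenced out. Differences that change over time are not removed by this step.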
... As the average treatment effect estimated by the simple PSM model may still be biased (Dehejia, 2005), we use the PSM-DID method to eliminate the influence of unobservable variables, in particular that of time-invariant and time-variant factors. Glazerman et al. (2002) believe that PSM is a non-parametric statistical method that can significantly lower deviation, especially when it is used in combination with DID and other methods. 7 This is why we use the DID-based PSM method developed by De Loecker (2007): ...
Article
Full-text available
This study investigates the impact of the Internet on Chinese firms’ export and import performance by using China’s industrial enterprise and customs data and adopting a propensity score matching difference‐in‐differences method (PSM‐DID). The empirical results show that utilizing the Internet has positive effects on firms’ exports and imports, however, the effects are mainly concentrated in the first 2‐3 years. The positive effect on exports is larger than on domestic sales; thus, the Internet increases export intensity. Further, we investigate the effects of the Internet on the three margins of Chinese exports. First, we borrow the multi‐product multi‐destination firm exporting theory developed by Bernard et al. (2011) and find that the Internet improves not only the extensive margin between firms, but also the within‐firm extensive margin. We then investigate the effects of the Internet on product quality and find that the Internet has a negative effect since Chinese firms export more products to developing countries than to other countries, where the requirements for product quality are relatively low. Our findings empirically justify the “Internet Plus” strategy proposed by the Chinese government with regard to international trade.
... However, this remains a necessary and nonnegligible way of exploring the eco-environmental effects of ecological compensation programs from the perspective of investigating and assessing behavioral changes of environmental contributors based on questionnaire data (Zheng et al. 2013). The methods applied to evaluate the net policy effects in the field of resources and environmental economics include the selected natural experiment (Rosenbaum and Rubin 1983), difference-in-differences (Heckman et al. 1998), propensity score matching (PSM) (Dehejia 2005), and regression truncation (Glazerman et al. 2002). Among them, PSM can significantly reduce the estimation bias through matching samples from treatment and control groups (Heckman et al. 1998). ...
Article
Full-text available
Ecological compensation is an innovative and effective tool to explore the coordinated development of socioeconomic prosperity and ecological protection, especially for a watershed crossing different regions. It converts the externalities of ecosystem services into practical financial incentives for local stakeholders. This empirical study applies a quantitative policy evaluation approach to evaluate the environmental and economic effects of an ecological compensation policy, using the paddy land–to–dry land (PLDL) program implemented in China’s Miyun Reservoir watershed as an example. The study is based on responses to a 2017 questionnaire regarding agricultural production inputs and outputs administered to 269 households in Hebei Province, where the PLDL program has been operational for over 10 yr. The results show that the program has reduced nitrogen usage by 24% on average in 2017 and decreased the total nitrogen emission load by 16.98 tons for the entire case area, which accounts for approximately 18.6% of the total nitrogen load reduction of the Miyun Reservoir basin. However, the upstream households involved in this program have experienced agricultural income losses higher than that allowed for by the current compensation criterion. Therefore, this paper discusses the factors that should be considered in the process of determining ecological compensation criteria. In particular, the paper proposes a differential compensation scheme based on the environmental effect at the individual level to avoid a standard payment for all households irrespective of their different contributions. This differential compensation payment scheme facilitates the fair treatment of environmental contributors and maximizes environmental benefits through an equitable allocation of limited ecological compensation funds. This study serves as a theoretical and practical reference for further improvement of the current ecological compensation policy in China. 
The study also sheds light on practices for estimating ecological compensation criteria and formulating ecological compensation policies for other regions or countries in the future.
... We used these data on the number of books each student received in analytic models, which assess potential "dosage effects" of receiving more or less books from KRN. Finally, we included a dummy indicator of each student's grade and school for the propensity score matching, as the applied literature from quasi-experimental studies suggests that bias is lower when the comparison group is locally matched to treatment (Glazerman, Levy, & Myers, 2002). ...
Article
Full-text available
Drawing on administrative data and reading achievement data provided by two Midwestern school districts for three schools, we analyze the literacy impacts of a replicable summer reading program, Kids Read Now. The program includes both school-based and home-based components that together encourage students to remain engaged in reading high-quality books over the summer months. We apply propensity score matching methods to match participating Kids Read Now students with similar comparison students. Our results suggest that Kids Read Now participants outperformed comparison group students, with a mean effect size of d = .12. Additional model estimates of the impacts for those students who read more of the books provided by Kids Read Now revealed that those who received all 9 books realized an effect size of d = .18 relative to the outcomes for matched comparison students. We discuss how these results might be considered in light of prior findings on summer learning.
... One-to-one matching, K-nearest neighbour matching, radius matching and kernel matching were used to match the treatment and control groups. In this model, health is the amount of health change between 2012 and 2016 rather than the health status at a certain point in time [9]. The details about DID and PSM can be found in the appendix. ...
Article
Full-text available
Background: With the accelerated ageing of the population in China, the health problems of elderly people have attracted much attention. Although religious belief has been shown to be a key way to improve the health of elderly people in various studies, little is known about the causal relationship between these variables in China. This paper explores the effect of religious belief on the health of elderly people in China, which will provide an important reference for China to achieve healthy ageing. Methods: Balanced panel data collected between 2012 and 2016 from the China Family Panel Studies (CFPS) were used. Health was assessed using self-rated health, and religious belief was measured by whether the respondents believed in a religion. The DID+PSM method was employed to solve the endogeneity problem caused by self-selection and omitted variables. In addition, the CESD score (replacing self-rated health) and different matching methods (the method of PSM after DID method) were used to perform the robustness test. Results: The results show that religious belief has no significant effect on the health of elderly people. With the application of different matching methods (one-to-one matching, K-nearest neighbour matching, radius matching and kernel matching) and replacing the health indicator (the CESD score) with the above matching methods, the results are still robust. Conclusion: In China, religious belief plays a limited role in promoting "healthy ageing", and it is difficult to improve the health of elderly people only via religious belief. Therefore, except for focusing on the guidance of religion with regard to healthy lifestyles, multiple measures need to be taken to improve the health of elderly people.
... The second most rigorous study design, the QED, has demonstrated bias that can be partially, but not fully, met by well-matched comparison groups and controls for preintervention measures (Glazerman, Levy, and Myers, 2002). Several design features of the intervention or its context-such as a lack of existing data or a lack of a comparison group-may make a QED infeasible. ...
... Two key design characteristics had a particularly strong effect on effect sizes: Sample size (smaller studies produce inflated effects) (Slavin & Smith, 2009) and randomized and quasi-experimental studies. Previous reviews of research comparing effect sizes in randomized vs. quasi-experimental evaluations have found mixed effects (de Boer, Donker, & van der Werf, 2014; Glazerman, Levy, & Myers, 2002; Heinsman & Shadish, 1996; Li & Ma, 2010; Lipsey & Wilson, 1993; Rake, Valentine, McGatha, & Ronau, 2010; Shadish, Clark, & Steiner, 2008; Slavin, Lake, & Groff, 2009; Torgerson, 2007). However, Cheung and Slavin (2016) had a sufficient sample of high-quality studies to permit a test of the effects of randomized vs. quasi-experimental designs on effect sizes to be made with adequate power. ...
Article
Large-scale randomized studies provide the best means of evaluating practical, replicable approaches to improving educational outcomes. This article discusses the advantages, problems, and pitfalls of these evaluations, focusing on alternative methods of randomization, recruitment, ensuring high-quality implementation, dealing with attrition, and data analysis. It also discusses means of increasing the chances that large randomized experiments will find positive effects, and interpreting effect sizes.
... Firms 29 How well quasi-experimental methods perform compared with randomisation has been a subject of intense scrutiny since the seminal paper of Lalonde (1986), with largely inconclusive results. Glazerman et al (2003) found that quasi-experimental methods produced substantially biased results compared with experimental ones in 12 replication studies of welfare and employment programmes in the USA. Cook et al (2006) found less clear-cut results for education programmes. ...
... After that influential paper, various authors carried out similar comparisons and some of them challenge Lalonde's results. For example, Glazerman, Levy and Myers (2003) compare both methodologies through the analysis of twelve programs, finding similar results across both methodologies on only some occasions. ...
Research
Full-text available
This paper presents an impact evaluation of three nutritional programs implemented in Puebla, Mexico, run by SEDIF, a social assistance institution. The present study uses both a propensity score matching and weighting in order to balance the treatment and the control groups in terms of observable characteristics, and to estimate, later on, the causal effect of the programs on different areas: food support, food orientation, education, and health. This investigation adds strong empirical evidence about the beneficial effects of nutritional programs on growth indicators (i.e. on anthropometric variables). In addition, it provides some evidence about the favorable impact of this kind of programs on food orientation outcomes, such as eating habit changes or diet diversity, variety, and quality. However, this study unveils only marginal effects on food security and detrimental effects on educational outcomes (specifically on student's marks). Finally, it does not provide conclusive effects on health.
... Moreover, our method is more robust than the DID model because we construct more comparable treatment and control groups by conditioning on the propensity scores in the initial time period. As Glazerman et al. (2003) pointed out, the combination of PSM and the DID model can take advantage of both methods and reduce the selection bias to the minimal level. Specifically, we will take the following procedure to calculate the estimator in equation 2. ...
Article
This paper provides new empirical evidence on the health consequences of rural-to-urban migration in China. We use a panel dataset from 2003 to 2006 constructed by the Research Center on the Rural Economy at the Ministry of Agriculture in China to investigate the effects of short-term and medium-term migration on health status. By combining propensity-score matching and the difference-in-difference model, we attempt to overcome the migration endogeneity issue and estimate the average treatment effect on the treated. We find that the effect of short-term migration on health in China is significantly positive mostly because of the income effect. However, the effect of longer-term continuous migration on health is insignificant and close to zero. Our results are robust to several alternative estimation techniques and a series of robustness checks. Copyright © 2015 John Wiley & Sons, Ltd.
... In particular the results may be biased if regular participants are particularly motivated students compared to non-participants, even after controlling for past achievement and other observable characteristics. More generally the types of studies discussed above are all subject to the criticism that non-experimental results are often not replicated by careful experimental studies, which are considered to be far less likely to produce biased estimates (Michalopoulos 2005; Glazerman et al. 2003; Bloom et al. 2002; Agodini and Dynarski 2001; Wilde and Hollister 2002). ...
... 52 Furthermore, another comparative study in settings of welfare, job training, and employment service failed to find any approach that could remove the bias considerably. 53 Therefore, more well-designed comparative studies should be encouraged to assess the size and prevalence of selection biases arising from using non-experimental data and provide concrete guidance on how to choose the most robust methodology and interpret causality properly. ...
Article
Full-text available
This article proposes a critical but non-systematic review of recent health care system reforms in developing countries. The literature reports mixed results as to whether reforms improve the financial protection of the poor or not. We discuss the reasons for these differences by comparing three representative countries: Mexico, Vietnam, and China. First, the design of the health care system reform, as well as the summary of its evaluation, is briefly described for each country. Then, the discussion is developed along two lines: policy design and evaluation methodology. The review suggests that i) background differences, such as social development, poverty level, and population health should be considered when taking other countries as a model; ii) although demand-side reforms can be improved, more attention should be paid to supply-side reforms; and iii) the findings of empirical evaluation might be biased due to the evaluation design, the choice of outcome, data quality, and evaluation methodology, which should be borne in mind when designing health care system reforms.
... RCTs provide high internal validity and the ability to disentangle spurious effects from those of the targeted intervention (Berk 2005; Cook and Campbell 1979; Farrington 2003a; Farrington et al. 2002; Shadish et al. 2002). Even more compelling, research illustrates that nonrandomized studies not only yield different results from RCTs, but also larger effect sizes (Glazerman et al. 2002; Weisburd et al. 2001). A review of all crime prevention studies finds that RCTs produce results that are more valid since weaker study designs overestimate treatment effects. ...
Article
Full-text available
Objectives. This study expands upon Weisburd’s work (1993) by reexamining the relationship between sample size and statistical power in criminological experiments. This inquiry, now known as the Weisburd paradox, postulates that increasing the sample size of experiments does not always lead to increases in statistical power. The current research also begins to explore the potential sources of the Weisburd paradox. Methods. The effect sizes and statistical power are computed for the outcome measures (n=402) of all experiments (n=66) included in systematic reviews published by the Campbell Collaboration’s Crime and Justice Coordinating Group. The design sensitivity of these experiments is reviewed by sample size, as well as other factors that may explain the variation in effect sizes and statistical power across studies. Results. Effect sizes decline as the sample size of the experiment increases, whereas statistical power is unrelated to sample size but strongly associated with effect size. Disclosure of fidelity issues and publication bias is unrelated to statistical power and treatment effects. Variability in the dependent variable and sample demographics are significantly related to statistical power, but not to effect size. Conclusions. The study finds support for the Weisburd paradox, as the ability to manipulate statistical power by increasing sample size is not as strong as statistical theory would suggest, and experiments with larger sample sizes on the whole produce smaller effects. It is believed that a relationship was not observed between sample size and statistical power because the sensitivity gained from increasing sample size is offset by effect size simultaneously decreasing.
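The offsetting relationship described in the abstract above can be illustrated with a textbook normal approximation to the power of a two-sided, two-group comparison. This is a rough sketch under stated assumptions, not the computation used in the study: the function name `approx_power`, the equal group sizes, and the example values are hypothetical, and the approximation ignores the negligible rejection probability in the opposite tail.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power for a standardized mean difference (Cohen's d).

    Uses power ~ Phi(d * sqrt(n / 2) - z_crit), where z_crit is the
    two-sided critical value and n is the size of each group.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_size * sqrt(n_per_group / 2) - z_crit)


# Quadrupling the sample while the effect size halves leaves power unchanged.
small_study = approx_power(0.50, 64)
large_study = approx_power(0.25, 256)
```

Under this approximation both hypothetical studies have power of roughly 0.81, which shows how a larger sample can fail to raise power when the effect size shrinks at the same time.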
... However, potential selection bias was minimized as much as possible. Furthermore, the research literature describes a number of conditions to prevent selection bias in quasi-experimental studies (Cook et al. 2008; Glazerman et al. 2002). These are: ...
Article
Full-text available
Objectives: To estimate the incapacitation effect and the impact on post-release recidivism of a measure combining prolonged incarceration and rehabilitation, the ISD measure for high frequency offenders (HFOs) was compared to the standard practice of short-term imprisonment. Methods: We applied a quasi-experimental design with observational data to study the effects of ISD. The intervention group consisted of all HFOs released from ISD in the period 2004–2008. Two control groups were derived from the remaining population of HFOs who were released from a standard prison term. To form groups of controls, a combination of multiple imputation (MI) and propensity score matching (PSM) was used including a large number of covariates. In order to measure the incapacitation effect of ISD, the number of convictions and recorded offences in a criminal case of the controls were counted in the same period as their ISD counterfactuals were incarcerated. The impact on recidivism was measured by the prevalence and the frequency of reconvictions corrected for time at risk. Robustness of the results was checked by performing a combined PSM and difference-in-difference (DD) design. Results: The estimate of the incapacitation effect was on average 5.7 criminal cases and 9.2 offences per ISD measure. On average 2.5 convictions and 4 recorded offences per year per HFO are prevented. The HFOs released from ISD showed 12 to 16 % lower recidivism rates than their control HFOs released from prison (Cohen’s h = 0.3–0.4). The recidivists of the ISD group also showed a lower reconviction frequency than the control group recidivists (Cohen’s d = 0.2). Conclusions: The ISD measure seems to be effective in reducing recidivism and crime. The estimated incapacitation effect showed that a large portion of criminal cases and offences was prevented. DD analysis and sensitivity analyses confirmed the robustness of the PSM results. Due to the absence of actual treatment data, the effects found cannot be attributed separately to resocialization, imprisonment, or improvement of life circumstances.
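For readers unfamiliar with the effect size reported above, Cohen's h compares two proportions on an arcsine-transformed scale. The sketch below shows the standard formula; the recidivism rates used (70% and 55%) are hypothetical illustrations and are not figures from the study.

```python
import math


def cohens_h(p1, p2):
    """Cohen's h: difference between two arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Hypothetical rates, chosen only to illustrate the calculation.
control_rate = 0.70  # assumed recidivism rate in a comparison group
isd_rate = 0.55      # assumed recidivism rate in an intervention group
h = cohens_h(control_rate, isd_rate)
print(round(h, 2))
```

With these assumed rates, a 15 percentage point difference corresponds to an h of roughly 0.3, which is the same order of magnitude as the range the abstract reports.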
... The only quasi-experimental design shown to yield unbiased causal effects is the regression discontinuity design (Glazerman, Levy, & Myers, 2002;West, Biesanz, & Pitts, 2000). The key is that the researcher controls the assignment to the treatment or control conditions, using a cutoff score on any pretreatment variable, usually one expected to predict the outcome. ...
Article
As a result of an inherent selection bias, most longitudinal analyses are biased against corrective actions that parents use to address perceived child problems. This bias can lead to unjustified or even counterproductive recommendations about corrective parental actions. To overcome this bias, this article summarizes current scholarship on improving the validity of causal inferences. Enhancing research designs is preferred, using quasi-experimental design components and natural experiments. Comparing a typical longitudinal design with a one-group pre-post design shows how longitudinal designs can be improved to enhance causal validity. Perfect statistical controls for confounds or perfect instrumental variables to circumvent them could produce unbiased causal evidence. Strategies to approximate that ideal are summarized, as well as methods to check for the adequacy of those approximations.
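The regression discontinuity design mentioned in the excerpt above assigns treatment by a cutoff on a pretreatment score and estimates the effect as the jump in the outcome at that cutoff. The following is a minimal sketch of a sharp design with a separate straight line fitted on each side of the cutoff. The function names, the cutoff of 50, and the noise-free data are all hypothetical; a real analysis would also need bandwidth selection and standard errors.

```python
def ols_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sharp_rdd_effect(scores, outcomes, cutoff):
    """Estimate the outcome jump at the cutoff (scores >= cutoff are treated)."""
    centered = [s - cutoff for s in scores]
    below = [(c, y) for c, y in zip(centered, outcomes) if c < 0]
    above = [(c, y) for c, y in zip(centered, outcomes) if c >= 0]
    # After centering, each intercept is the fitted outcome at the cutoff.
    intercept_below, _ = ols_line(*zip(*below))
    intercept_above, _ = ols_line(*zip(*above))
    return intercept_above - intercept_below


# Hypothetical noise-free data with a built-in treatment effect of 2.0.
CUTOFF = 50
scores = list(range(30, 71))
outcomes = [1.0 + 0.1 * s + (2.0 if s >= CUTOFF else 0.0) for s in scores]
effect = sharp_rdd_effect(scores, outcomes, CUTOFF)
print(round(effect, 6))
```

Because assignment depends only on the observed score, the comparison at the cutoff does not rely on matching treated and untreated units on unobserved characteristics.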
... Controlling for pretests and other covariates greatly reduced, but did not eliminate, the differences. Glazerman, Levy, and Myers (2002) also found that use of powerful covariates could greatly reduce but not eliminate differences between randomized and matched studies. This was also the finding of a comparison of randomized and matched studies of dropout prevention programs (Agodini & Dynarski, 2004). ...
Article
Syntheses of research on educational programs have taken on increasing policy importance. Procedures for performing such syntheses must therefore produce reliable, unbiased, and meaningful information on the strength of evidence behind each program. Because evaluations of any given program are few in number, syntheses of program evaluations must focus on minimizing bias in reviews of each study. This article discusses key issues in the conduct of program evaluation syntheses: requirements for research design, sample size, adjustments for pretest differences, duration, and use of unbiased outcome measures. It also discusses the need to balance factors such as research designs, effect sizes, and numbers of studies in rating the overall strength of evidence supporting each program.
... Centre. Where such opportunities exist they are welcome. For an organisation like SCIE, with a brief to do more than simply collate evidence or knowledge, the EPPI model might provide a model for aspects of its work, that is, pulling together evidence of different kinds in an easily accessible way. ...
... It is important to recall that the current review and the one by Slavin and Lake (2008a) used stringent inclusion criteria for M studies, so these findings may not apply to all M studies. This finding reinforces conclusions made by Cook, Shadish, and Wong (2008), Torgerson (2006), Glazerman, Levy, and Myers (2002), and Slavin and Smith (2008c) that high-quality studies with well-matched control groups produce outcomes similar to those of REs. Randomization is still valuable in reducing the possibility of selection bias, but these findings suggest that reviewers of research on educational programs can include well-matched evaluations (for more on this, see Slavin, 2008;Slavin & Smith, 2008c;Cook et al., 2008). ...
Article
Full-text available
This article reviews research on the achievement outcomes of mathematics programs for middle and high schools. Study inclusion requirements include use of a randomized or matched control group, a study duration of at least 12 weeks, and equality at pretest. There were 100 qualifying studies, 26 of which used random assignment to treatments. Effect sizes were very small for mathematics curricula and for computer-assisted instruction. Positive effects were found for two cooperative learning programs. Outcomes were similar for disadvantaged and nondisadvantaged students and for students of different ethnicities. Consistent with an earlier review of elementary programs, this article concludes that programs that affect daily teaching practices and student interactions have more promise than those emphasizing textbooks or technology alone.
... Recall that the matched studies had to meet stringent methodological standards, so the similarity between randomized and matched outcomes reinforces the observation made by Glazerman, Levy, and Myers (2002) and Torgerson (2006) that high-quality studies with well-matched control groups produce outcomes similar to those of randomized experiments. ...
Article
Full-text available
This article reviews research on the achievement outcomes of three types of approaches to improving elementary mathematics: mathematics curricula, computer-assisted instruction (CAI), and instructional process programs. Study inclusion requirements included use of a randomized or matched control group, a study duration of at least 12 weeks, and achievement measures not inherent to the experimental treatment. Eighty-seven studies met these criteria, of which 36 used random assignment to treatments. There was limited evidence supporting differential effects of various mathematics textbooks. Effects of CAI were moderate. The strongest positive effects were found for instructional process approaches such as forms of cooperative learning, classroom management and motivation programs, and supplemental tutoring programs. The review concludes that programs designed to change daily teaching practices appear to have more promise than those that deal primarily with curriculum or technology alone.
... Second, there is substantial evidence of "method effects"; that is, results of studies of intervention effects can be greatly affected by research methods. Results of RCTs are not consistently approximated with other research designs (Glazerman, Levy, & Myers, 2002; Kunz, Vist, & Oxman, 2002). ...
Article
Full-text available
Objective: The goal of this study is to advance an approach to the assessment of the quality of studies considered for inclusion in systematic reviews of the effects of social-care interventions. Method: To achieve this objective, quality is defined in relation to the widely accepted validity typology; prominent approaches to study quality assessment are evaluated as to their adequacy. Results: Problems with these approaches are identified. Conclusion: A formal, yet explicit, multidimensional approach to assessment grounded in substantive issues relevant to the intervention and the broader context in which it is embedded is promoted. Uncritical and exclusive use of indicators of study quality such as publication status, reporting quality, and single summative quality scores is rejected.
... However, there is usually no way to know if the matching procedure has succeeded or not. Moreover, when it has been possible to test the success of matching procedures, the results have generally not been very encouraging (Bloom et al., 2002;Friedlander and Robins, 1995;Glazerman et al., 2002;Heckman et al., 1997). This is probably because it is not usually feasible to match on characteristics, such as drive and motivation, which are not readily measurable but nonetheless may influence programme participation decisions. ...
Article
The Employment Retention and Advancement (ERA) Demonstration programme is a major current welfare-to-work social experiment, the largest random allocation evaluation ever mounted in Great Britain. This article draws on experience gained in designing the ERA Demonstration to explore the strengths and limitations of social experimentation for policy evaluation and analysis. The focus of the discussion is on the reasons for the choice of random allocation as a means of estimating programme impacts, contrasting this approach with the alternatives. The weaknesses of random allocation designs are also examined in the light of the types of information policy-makers require from evaluations of labour market programmes and social policy demonstrations. The perennial ‘black box’ problem and the difficulties in generalizing from social experiments are given particular prominence.
... This review only included well-implemented experimental design studies because there is a strong body of research documenting the unreliability, and in some cases bias, in studies based on quasi-experimental studies (Guyat et al., 2000;Agodini & Dynarski, 2001;Weisburd, Lum, & Petrosino, 2001;Glazerman, Levy, & Myers, 2003). Furthermore, the study reports needed to provide a reasonable description of the methodology, program goals, and program activities. ...
Article
China's high-speed rail (HSR) has developed expeditiously since the beginning of the 21st century, exerting significant influence on many aspects of society, such as economic development and residents' travel mode. Improving carbon productivity is one of the necessary measures to realize China's carbon neutrality goal while sustaining economic growth. However, previous research has not dealt with the exact impact of HSR opening on carbon emission performance. This study seeks to fill this gap. Through the difference-in-differences (DID) model, we discover that the opening of HSR significantly improves the city's total-factor carbon productivity in China. In addition, the influencing mechanism and heterogeneity of the impact are discussed, and the polarization effect is also analyzed. Overall, this study strengthens the idea that HSR construction has positive environmental externalities. The insights gained from this study may help policymakers formulate HSR planning and construction policies more accurately in the future.
Article
Full-text available
Executive Summary
Background: Many systematic reviews incorporate nonrandomised studies of effects, sometimes called quasi‐experiments or natural experiments. However, the extent to which nonrandomised studies produce unbiased effect estimates is unclear in expectation or in practice. The usual way that systematic reviews quantify bias is through “risk of bias assessment” and indirect comparison of findings across studies using meta‐analysis. A more direct, practical way to quantify the bias in nonrandomised studies is through “internal replication research”, which compares the findings from nonrandomised studies with estimates from a benchmark randomised controlled trial conducted in the same population. Despite the existence of many risk of bias tools, none are conceptualised to assess comprehensively nonrandomised approaches with selection on unobservables, such as regression discontinuity designs (RDDs). The few that are conceptualised with these studies in mind do not draw on the extensive literature on internal replications (within‐study comparisons) of randomised trials.
Objectives: Our research objectives were as follows. Objective 1: to undertake a systematic review of nonrandomised internal study replications of international development interventions. Objective 2: to develop a risk of bias tool for RDDs, an increasingly common method used in social and economic programme evaluation.
Methods: We used the following methods to achieve our objectives. Objective 1: we searched systematically for nonrandomised internal study replications of benchmark randomised experiments of social and economic interventions in low‐ and middle‐income countries (L&MICs). We assessed the risk of bias in benchmark randomised experiments and synthesised evidence on the relative bias effect sizes produced by benchmark and nonrandomised comparison arms. Objective 2: we used document review and expert consultation to develop further a risk of bias tool for quasi‐experimental studies of interventions (ROBINS‐I) for RDDs.
Results: Objective 1: we located 10 nonrandomised internal study replications of randomised trials in L&MICs, six of which are of RDDs, while the remainder use a combination of statistical matching and regression techniques. We found that benchmark experiments used in internal replications in international development are in the main well‐conducted but have “some concerns” about threats to validity, usually arising due to the methods of outcomes data collection. Most internal replication studies report on a range of different specifications for both the benchmark estimate and the nonrandomised replication estimate. We extracted and standardised 604 bias coefficient effect sizes from these studies, and present average results narratively. Objective 2: RDDs are characterised by prospective assignment of participants based on a threshold variable. Our review of the literature indicated there are two main types of RDD. The most common type of RDD is designed retrospectively, in which the researcher identifies post‐hoc the relationship between outcomes and a threshold variable which determines assignment to intervention at pretest. These designs usually draw on routine data collection such as administrative records or household surveys. The other, less common, type is a prospective design where the researcher is also involved in allocating participants to treatment groups from the outset. We developed a risk of bias tool for RDDs.
Conclusions: Internal study replications provide the grounds on which bias assessment tools can be evidenced. We conclude that existing risk of bias tools need to be further developed for use by Campbell Collaboration authors, and there is a wide range of risk of bias tools and internal study replications to draw on in better designing these tools. We have suggested the development of a promising approach for RDD. Further work is needed on common methodologies in programme evaluation, for example on statistical matching approaches. We also highlight that broader efforts to identify all existing internal replication studies should consider more specialised systematic search strategies within particular literatures, so as to overcome a lack of systematic indexing of this evidence.
Article
The landscape of China’s rural land market has been changed by several significant land right reforms since the 1970s. It is always of great interest to both the government and the public to gauge the effectiveness of these reforms. We address this question by investigating the impact of a recent land use right reform, namely, the ‘Three Rights Separation Policy’, on agro-environmental sustainability. By separating land management right from land contracted management right, this new reform is believed to be a powerful tool to encourage land transfer, optimize land resource allocation, and increase the economy of scale in the agriculture sector. Using a PSM-DID model applied to panel data for the years 2008 and 2014, our study demonstrates that the new policy also increases the use of organic fertilizers by 48.641 kg/mu in total, which is a very important step to ensure agro-environmental sustainability in China. The new policy is more effective in encouraging the application of organic fertilizer when the issuing of land certificates is enforced and administrative barriers to land right transfers are removed. The findings add value to the growing literature on rural land right reforms in China and may also have significant implications in developing countries with similar rural land tenure systems and underdeveloped land and labor markets.
Article
Using 2003–2006 RCRE (Research Center for Rural Economy) panel data, we estimate the effect of parental migration on the health of children left behind, with a difference-in-differences and propensity score matching combined model. On average we do not find any significant effect on children's health; however, the effect varies among different groups. Children's health may improve as a result of parental migration in families with lower income in the base year and families with higher-income growth rates. Furthermore, children's health may deteriorate with maternal migration but improve with longer distance of paternal migration and longer time of paternal migration. We argue that parental migration affects children's health through complex mechanisms: income increase may have a positive impact while decreased parental care may have a negative effect. The two effects seem to offset each other in rural China.
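The difference-in-differences logic used in the abstract above can be sketched in a few lines. This is a minimal illustration of the basic two-group, two-period estimator only: it omits the propensity score matching step, the function names are hypothetical, and the numbers are made-up outcome scores, not data from any study listed here.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the comparison group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical outcome scores before and after an intervention.
treated_pre, treated_post = [10, 12, 11], [15, 17, 16]
control_pre, control_post = [9, 10, 11], [11, 12, 13]
did = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(did)
```

The estimate is only as credible as the assumption that both groups would have followed parallel trends without the intervention, which is why studies often pair it with matching on pre-intervention characteristics.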
Article
Full-text available
In the domain of mathematics education there has been a series of debates on lexical ambiguity in algebra, especially with the resurgence of mathematics educators’ awareness of the relevance of language in mathematics education. Therefore, this study investigated lexical ambiguity in algebra and method of teaching as determinants of grade 9 students’ academic performance in East London. A pre-test-post-test quasi-experimental group design was adopted in the study. A sample of 109 students was involved in the study. The instruments adopted and structured for the study were the Lexical Ambiguity Questionnaire (LAAQ), the Method of Instruction Questionnaire (MIQ), Problem Based Learning Strategies in two parts (PBLSa and PBLSb), and the Conventional Teaching Guide (C.T.G). They were tested at the .05 level of significance using a two-way (2 x 2) Analysis of Covariance (ANCOVA). The findings showed that students exposed to the PBLS achieved higher scores than their counterparts who were exposed to the Conventional method. Multiple Comparison Analysis and Tukey post-hoc were employed to detect the source of variation and the direction of significance. The findings also revealed that lexical ambiguity determines students’ academic performance (r=0.422; P<0.05); effect of the experiment on students’ post-test performance scores in lexical ambiguity (F (2,109) =.926; P< 0.05). Method of teaching is also said to be the determinant of students’ performance (r=0.764, P<0.05). Hence, there is a need for teachers to update their knowledge about the problem-solving skills that can be used as a remedy to mathematics phobia and ambiguities in algebra word problems; this should also be enshrined in the school curriculum. DOI: 10.5901/mjss.2014.v5n23p897
Article
Full-text available
Securing data on students’ academic achievement is typically one of the most important and costly aspects of conducting education experiments. As state assessment programs have become practically universal and more uniform in terms of grades and subjects tested, the relative appeal of using state tests as a source of study outcome measures has grown. However, the variation in state assessments—in both content and proficiency standards—complicates decisions about whether a particular state test is suitable for research purposes and poses difficulties when planning to combine results across multiple states or grades. This paper aims to help researchers evaluate and make decisions about whether and how to use state test data in education experiments. It outlines the issues that researchers should consider, including how to evaluate the validity and reliability of state tests relative to study purposes; factors influencing the feasibility of collecting state test data; how to analyze state test scores; and whether to combine results based on different tests. It also highlights best practices to help inform ongoing and future experimental studies. Many of the issues discussed are also relevant for nonexperimental studies.
Article
Full-text available
Securing data on students' academic achievement is typically one of the most important and costly aspects of conducting education experiments. As state assessment programs have become practically universal and more uniform in terms of grades and subjects tested, the relative appeal of using state tests as a source of study outcome measures has grown. However, the variation in state assessments--in both content and proficiency standards--complicates decisions about whether a particular state test is suitable for research purposes and poses difficulties when planning to combine results across multiple states or grades. This discussion paper aims to help researchers evaluate and make decisions about whether and how to use state test data in education experiments. It outlines the issues that researchers should consider, including how to evaluate the validity and reliability of state tests relative to study purposes; factors influencing the feasibility of collecting state test data; how to analyze state test scores; and whether to combine results based on different tests. It also highlights best practices to help inform ongoing and future experimental studies. Many of the issues discussed are also relevant for nonexperimental studies. Appendices include: (1) State Testing Programs Under NCLB; (2) How NCEE-Funded Evaluations Use State Test Data. (Contains 35 footnotes and 4 tables.)
Article
Thomas Jefferson recognized the value of reason and scientific experimentation in the eighteenth century. This chapter extends the idea in contemporary ways to standards that may be used to judge the ethical propriety of randomized trials and the dependability of evidence on effects of social interventions.
Article
Full-text available
Estimates of developmental models of processes involving contextual influences (e.g., child care arrangements, divorce, parenting, neighborhood location, peers) are subject to bias if, as is often the case, the contexts are influenced by the actions of either the individuals being studied or their parents or teachers. We assessed the nature of the endogeneity biases that may result, discuss the importance of such biases in practice, and suggest possible ways of avoiding them. Our primary recommendation is that developmentalists consider reorienting their data collection strategies to take advantage of real or "natural" experiments that produce exogenous variation in family and contextual variables of interest. Individuals' lives are shaped by a rich set of interactive genetic, social, structural, and historical forces and processes. Consequently, developmental science places high demands on the evidence needed to separate correlation from causation. Although social science theory can commonly be invoked to limit the scope of problems and isolate key variables, a developmental perspective often does just the opposite. Because a broad theoretical perspective holds great promise for advancing researchers' understanding of human development, developmental scientists should not be simplifying their theories for the sake of empirical tractability. Instead, they should devote themselves to ensuring that their empirical work does justice to the theory.
Article
Full-text available
In the social sciences, randomized experimentation is the optimal research design for establishing causation. However, for a number of practical reasons, researchers are sometimes unable to conduct experiments and must rely on observational data. In an effort to develop estimators that can approximate experimental results using observational data, scholars have given increasing attention to matching. In this article, we test the performance of matching by gauging the success with which matching approximates experimental results. The voter mobilization experiment presented here comprises a large number of observations (60,000 randomly assigned to the treatment group and nearly two million assigned to the control group) and a rich set of covariates. This study is analyzed in two ways. The first method, instrumental variables estimation, takes advantage of random assignment in order to produce consistent estimates. The second method, matching estimation, ignores random assignment and analyzes the data as though they were nonexperimental. Matching is found to produce biased results in this application because even a rich set of covariates is insufficient to control for preexisting differences between the treatment and control group. Matching, in fact, produces estimates that are no more accurate than those generated by ordinary least squares regression. The experimental findings show that brief paid get-out-the-vote phone calls do not increase turnout, while matching and regression show a large and significant effect.
Article
Full-text available
This paper asks whether personal financial management education is an effective mechanism for helping lower-income households accumulate financial assets and improve credit histories. The paper argues that the best existing studies of the effectiveness of financial literacy initiatives suggest that such initiatives might help lower-income households build savings and improve credit records, but the results are only suggestive due to the limitations of the studies. The paper concludes that a high research priority should be gathering more robust evidence on whether teaching personal financial management skills to lower-income households can be an effective means to improve their financial situations.
Article
Full-text available
The recent literature on evaluating manpower training programs demonstrates that alternative nonexperimental estimators of the same program produce an array of estimates of program impact. These findings have led to the call for experiments to be used to perform credible program evaluations. Missing in all of the recent pessimistic analyses of nonexperimental methods is any systematic discussion of how to choose among competing estimators. This paper explores the value of simple specification tests in selecting an appropriate nonexperimental estimator. A reanalysis of the National Supported Work Demonstration Data previously analyzed by proponents of social experiments reveals that a simple testing procedure eliminates the range of nonexperimental estimators that are at variance with the experimental estimates of program impact.
Characterizing Selection Bias
  • James J. Heckman
  • Hidehiko Ichimura
  • Jeffrey C. Smith
  • Petra Todd
Heckman, James J., Hidehiko Ichimura, Jeffrey C. Smith, and Petra Todd. “Characterizing Selection Bias.” Econometrica, vol. 66, no. 5, September 1998, pp. 1017-1098