Table 6 - uploaded by Adam Wyse
Percent Passing Impact Data for Different Examinee Ability

Source publication
Article
Full-text available
This article illustrates five different methods for estimating Angoff cut scores using item response theory (IRT) models. These include maximum likelihood (ML), expected a posteriori (EAP), modal a posteriori (MAP), and weighted maximum likelihood (WML) estimators, as well as the most commonly used approach based on translating ratings through the test cha...

Context in source publication

Context 1
... choice of estimator and the method of aggregating the data could clearly produce disparate cut scores for these data. [Table 5: IRT Cut Score Estimates for Five Different Methods for the 200-Item Test, Rounds 1 and 2, by rater and method (TS, ML, WML, EAP, MAP)] Table 6 shows the percentage of examinees that would be at or above the cut score if the cut score were calculated using the same method as was used to estimate examinee ability, compared to the TS method. Similar to the simulation study, one can see that when the TS method is used to estimate both ability and the cut score, there is no difference in the percentage of examinees at or above the cut score. ...
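As a rough illustration of the TS (test characteristic) approach described in this context, here is a minimal sketch, assuming a 3PL model and invented item parameters, panelist ratings, and examinee abilities (none of these come from the source study): the ratings are summed to an expected raw cut score, the test characteristic curve is inverted numerically to place the cut on the theta scale, and the percentage of simulated examinees at or above the cut is reported.

```python
# Minimal sketch of a TS (test characteristic) cut score and percent passing.
# All item parameters, ratings, and examinee abilities below are hypothetical.
import numpy as np

def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

rng = np.random.default_rng(0)
n_items = 200
a = rng.uniform(0.5, 2.0, n_items)        # discriminations (hypothetical)
b = rng.normal(0.0, 1.0, n_items)         # difficulties (hypothetical)
c = rng.uniform(0.0, 0.25, n_items)       # lower asymptotes (hypothetical)

# Hypothetical Angoff ratings centered near the curve for a borderline examinee.
ratings = np.clip(icc_3pl(0.5, a, b, c) + rng.normal(0, 0.05, n_items), 0.05, 0.95)
raw_cut = ratings.sum()                   # expected raw cut score (sum of ratings)

# Invert the test characteristic curve numerically to place the cut on theta.
grid = np.linspace(-4, 4, 8001)
tcc = np.array([icc_3pl(t, a, b, c).sum() for t in grid])
theta_cut = grid[np.argmin(np.abs(tcc - raw_cut))]

# Percent of a simulated examinee group at or above the cut score.
thetas = rng.normal(0.0, 1.0, 5000)
print(f"theta cut = {theta_cut:.3f}, percent passing = {100 * np.mean(thetas >= theta_cut):.1f}%")
```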

Similar publications

Article
Full-text available
In this article, a weighted model based on the Rayleigh distribution is proposed and the statistical and reliability properties of this model are presented. Some non-Bayesian and Bayesian methods are used to estimate the beta parameter of the proposed model. The Bayes estimators are obtained under the symmetric (squared error) and the asymmetric (linea...
Article
Full-text available
We investigate the seemingly ill-defined problem of extracting a ground-state mass from a lattice simulation where the extent of the lattice is not long enough to project out the ground-state properly. We regulate the problem using a Bayesian method. We show that controlling meta-parameters (overconfidence) can allow the data to overcome the input...
Article
Full-text available
In this paper, a new class of distributions called the odd log-logistic Lindley-G family is proposed. Several of its statistical and reliability properties are studied in detail. One member of the proposed family can have symmetrical, right-skewed, left-skewed and reversed-J shaped densities, and decreasing, increasing, bathtub, unimodal and reve...
Article
Full-text available
In the fields of education and psychology, nested data with small samples and imbalances are very common. Bauer et al. (2008) first proposed adjusting the traditional multilevel model to analyze the small sample imbalanced nested data (SSIND). In terms of parameter estimation, the Bayesian method shows the possibility of providing unbiased estimati...
Article
Full-text available
This paper extends the one-parameter Chris-Jerry distribution to the power size biased Chris-Jerry distribution, a lifetime distribution in the class of the Lindley distribution. We derive the rth moment and, in particular, estimate the parameters using six classical methods and the Bayesian method. Results of the two real data analyses show that th...

Citations

... However, concerns regarding the reliability, validity, and acceptability of these methods remain an issue (7). The differences in cutoff scores among different standard-setting methods may reduce the legal defensibility of these cutoffs, especially when they lead to differences in the pass/fail decision (8, 9). ...
Article
Full-text available
Any high-stakes assessment that leads to an important decision requires careful consideration in determining whether a student passes or fails. Despite implementation of many standard-setting methods in clinical examinations, concerns remain about the reliability of pass/fail decisions in high-stakes assessment, especially clinical assessment. This observational study proposes a defensible pass/fail decision based on the number of failed competencies. The study was conducted in Erbil, Iraq, in June 2018; results were obtained for 150 medical students on their final objective structured clinical examination. Cutoff scores and pass/fail decisions were calculated using the modified Angoff, borderline, borderline-regression and holistic methods. The results were compared with each other and with a new competency method using Cohen's kappa. Rasch analysis was used to compare the consistency of competency data with Rasch model estimates. The competency method resulted in 40 (26.7%) students failing, compared with 76 (50.6%), 37 (24.6%), 35 (23.3%) and 13 (8%) for the modified Angoff, borderline, borderline-regression and holistic methods, respectively. The competency method demonstrated a sufficient degree of fit to the Rasch model (mean outfit and infit statistics of 0.961 and 0.960, respectively). In conclusion, the competency method was more stringent in determining pass/fail, compared with other standard-setting methods, except for the modified Angoff method. The fit of competency data to the Rasch model provides evidence for the validity and reliability of pass/fail decisions. Keywords: pass/fail decision; competence-based; standard-setting; Rasch model. Graphical abstract: Competency method for pass/fail decision, with the number of stations for each competency represented by circle shapes.
... The direction of the change in bias and sampling variability would depend on how well the ratings fit the IRT model underlying the ICCs. Alternatively, different estimators could be applied (i.e., ML, weighted ML [WML]) to place the cut score on the MAPT scale score (Wyse, 2017). ...
Article
Setting cut scores on multistage-adaptive tests (MSTs) is difficult, particularly when the test spans several grade levels, and the selection of items from MST panels must reflect the operational test specifications. In this study, we describe, illustrate, and evaluate three methods for mapping panelists' Angoff ratings into cut scores on the scale underlying an MST. The results suggest the test characteristic function and item characteristic curve methods performed similarly, but the method based on dichotomizing panelists' ratings at a response probability of .67 was unacceptable. The study featured a rating booklet design that allowed us to systematically evaluate the validity of the Angoff ratings across test levels, contributing internal validity evidence for the cut scores; the cut scores were also evaluated using procedural and external validity evidence. The implications of the results for future standard-setting studies and research in this area are discussed.
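As a loose sketch of how the two mapping approaches named in this abstract can differ, the code below uses the Rasch model for simplicity with invented item difficulties and ratings (nothing here reproduces the MST study): the item characteristic curve (ICC) method inverts each item's ICC at the panelist's rating and averages the resulting thetas, while the test characteristic function (TCF) method sums the ratings and inverts the summed curve.

```python
# Hypothetical comparison of ICC-based and TCF-based mappings of Angoff ratings
# to a theta cut under the Rasch model. Difficulties and ratings are made up.
import numpy as np

rng = np.random.default_rng(1)
b = rng.normal(0.0, 1.0, 60)          # Rasch item difficulties (hypothetical)
ratings = np.clip(1 / (1 + np.exp(-(0.4 - b))) + rng.normal(0, 0.05, 60), 0.05, 0.95)

# ICC method: invert each item's ICC at its rating, then average the item thetas.
theta_icc = np.mean(b + np.log(ratings / (1 - ratings)))

# TCF method: sum the ratings and invert the test characteristic function.
grid = np.linspace(-4, 4, 8001)
tcf = np.array([np.sum(1 / (1 + np.exp(-(t - b)))) for t in grid])
theta_tcf = grid[np.argmin(np.abs(tcf - ratings.sum()))]

print(f"ICC cut = {theta_icc:.3f}, TCF cut = {theta_tcf:.3f}")
```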
... Among the methods for determining cut scores on large-scale assessments, the modified Angoff (1971) method, the Bookmark (Lewis, Mitzel, & Green, 1996; Lewis, Mitzel, Mercado, & Schulz, 2012; Mitzel, Lewis, Patz, & Green, 2001) method, or variations on these approaches (Impara & Plake, 1997; Plake & Cizek, 2012; Schulz & Mitzel, 2009; Wang, 2003; Wyse, 2013, 2017; Wyse & Reckase, 2012) represent some of the most commonly used methods (Brandon, 2004; Hurtz & Auerbach, 2003; Karatonis & Sireci, 2006; Plake & Cizek, 2012; Lewis et al., 2012). In the modified Angoff method, the panelists' task is to review each item in a test and estimate the probability that a minimally competent examinee would be able to answer the item correctly. ...
Article
A common belief is that the Bookmark method is a cognitively simpler standard-setting method than the modified Angoff method. However, a limited amount of research has investigated panelists' ability to perform the Bookmark method well, and whether some of the challenges panelists face with the Angoff method may also be present in the Bookmark method. This article presents results from three experiments where panelists were asked to give Bookmark-type ratings to separate items into groups based on item difficulty data. Results of the experiments showed, consistent with results often observed with the Angoff method, that panelists typically and paradoxically perceived hard items to be too easy and easy items to be too hard. These perceptions were reflected in panelists often placing their Bookmarks too early for hard items and often placing their Bookmarks too late for easy items. The article concludes with a discussion of what these results imply for educators and policymakers using the Bookmark standard-setting method.
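As a loose illustration of the Bookmark logic discussed in this abstract, here is a minimal sketch assuming a Rasch model, an invented ordered item booklet, and a hypothetical bookmark placement: the theta cut is the ability at which the bookmarked item is answered correctly with probability .67, which for the Rasch model is the item difficulty plus ln(.67/.33).

```python
# Hypothetical Bookmark placement: items are ordered by Rasch difficulty and the
# cut is the theta at which the bookmarked item has P(correct) = .67.
# The difficulties and the panelist's placement below are made up.
import numpy as np

rng = np.random.default_rng(2)
b = np.sort(rng.normal(0.0, 1.0, 50))    # ordered item booklet (easy to hard)

rp = 0.67                                # response probability criterion
bookmark_page = 30                       # hypothetical panelist placement (1-indexed)

# Under the Rasch model, P(correct) = rp when theta - b = ln(rp / (1 - rp)).
theta_cut = b[bookmark_page - 1] + np.log(rp / (1 - rp))
print(f"bookmarked item difficulty = {b[bookmark_page - 1]:.3f}, theta cut = {theta_cut:.3f}")
```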
... Test-centered methods generally involve panelists reviewing test items as part of determining cut scores. Examples of commonly used test-centered standard-setting methods include the modified Angoff method (Angoff, 1971), the Bookmark method (Lewis, Mitzel, & Green, 1996; Lewis, Mitzel, Mercado, & Schulz, 2012; Mitzel, Lewis, Patz, & Green, 2001), or variations on these approaches (Impara & Plake, 1997; Plake & Cizek, 2012; Schulz & Mitzel, 2009; Wang, 2003; Wyse & Reckase, 2012; Wyse, 2017). Examinee-centered methods have a different focus and involve panelists reviewing samples of candidate work or providing judgments of how they think individual examinees would perform on the exam. ...
Article
One commonly used compromise standard-setting method is the Beuk (1984) method. A key assumption of the Beuk method is that the emphasis given to the pass rate and the percent correct ratings should be proportional to the extent that the panelists agree on their ratings. However, whether the slope of the Beuk line reflects the emphasis that panelists believe should be assigned to the pass rate and the percentage correct ratings has not been fully tested. In this article, I evaluate this critical assumption of the Beuk method by asking panelists to assign importance weights to their percentage correct and pass rate judgments. I show that in several cases the emphasis suggested by the Beuk slope is noticeably different from what one would expect and is inconsistent with importance weight ratings. I also suggest two ways that the importance weights can be used to calculate alternate cut scores, and I show that one of the ways of calculating cut scores using the importance weights leads to larger potential differences in cut score estimates. I suggest that practitioners should consider collecting importance weights when the Beuk method is used for determining cut scores.
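A simplified sketch of the Beuk compromise described above, using entirely invented judgments and examinee scores: each panelist supplies a required percent-correct cut and a desired pass rate; a line is drawn through the mean judgments with slope equal to the ratio of the two standard deviations, and the cut is taken where that line meets (or comes closest to) the empirical pass-rate curve computed from the score distribution.

```python
# Hypothetical Beuk-style compromise cut score. All judgments and scores are invented.
import numpy as np

rng = np.random.default_rng(3)
pc_judgments = rng.normal(70, 5, 12)      # required percent-correct cut, per panelist
pr_judgments = rng.normal(50, 8, 12)      # desired pass rate (percent), per panelist
scores = rng.normal(72, 10, 2000)         # examinee percent-correct scores (hypothetical)

x_bar, y_bar = pc_judgments.mean(), pr_judgments.mean()
slope = pr_judgments.std(ddof=1) / pc_judgments.std(ddof=1)   # relative agreement

grid = np.linspace(0, 100, 1001)
pass_rate = np.array([100 * np.mean(scores >= x) for x in grid])  # empirical curve
beuk_line = y_bar - slope * (grid - x_bar)                         # decreasing line

cut = grid[np.argmin(np.abs(pass_rate - beuk_line))]
print(f"Beuk compromise cut = {cut:.1f} percent correct")
```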
... In the Angoff method, panelists are asked to review test items and provide item level probability judgments of how they think minimally competent examinees would perform on the items. These item level probability judgments are then analyzed and combined in some way to determine cut scores (see Hurtz & Jones, 2009;Wyse, 2017). Specific implementations of the Angoff method often differ in the number of rounds, the feedback discussed with panelists, the number of different minimally competent examinees for which ratings are collected, the type of items rated, and the rounding rules that panelists use when providing their ratings. ...
... These rating patterns are often observed with the Angoff method no matter the rounding rule applied. It is important to be aware of and investigate these rating patterns since they can influence cut score estimates (see Reckase, 2006b; Wyse, 2017) and the validity of Angoff standard-setting results. ...
Article
Full-text available
One common modification to the Angoff standard-setting method is to have panelists round their ratings to the nearest 0.05 or 0.10 instead of 0.01. Several reasons have been offered as to why it may make sense to have panelists round their ratings to the nearest 0.05 or 0.10. In this article, we examine one reason that has been suggested, which is that even if panelists are given the opportunity to provide ratings to the nearest 0.01 they often round their ratings to the nearest 0.05 or 0.10 anyway. Using data from four standard settings, we show that in many cases ratings ended in a 0 or 5 when panelists were given the option of using a scale from 0 to 100 in one-point increments and that only about 9% of all ratings ended in a digit other than a 0 or 5. We also examined the impact of different rounding rules and found that results were quite similar across rules. Additional analyses showed the common phenomenon of panelists giving ratings that are too high for hard items and too low for easy items in comparison to conditional p-values. It is suggested that rounding ratings to the nearest 0.05 or 0.10 represents a reasonable alternative to rounding ratings to the nearest 0.01.
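A small sketch of the kind of tally and rounding comparison this abstract describes, using simulated ratings rather than any real panel data (the rating distribution and the 0-100 scale convention are assumptions for illustration):

```python
# Hypothetical check of rating granularity and rounding rules for Angoff ratings
# given on a 0-100 scale. The ratings below are simulated, not real panel data.
import numpy as np

rng = np.random.default_rng(4)
raw = rng.uniform(20, 95, 500)
# Mimic panelists mostly choosing multiples of 5, occasionally finer values.
ratings = np.where(rng.random(500) < 0.9, 5 * np.round(raw / 5), np.round(raw))

share_0_or_5 = np.isin(ratings % 10, [0, 5]).mean()
print(f"percent of ratings ending in 0 or 5: {100 * share_0_or_5:.1f}%")

# Compare the mean rating (a simple raw Angoff aggregate) under rounding rules.
p = ratings / 100
for step in (0.01, 0.05, 0.10):
    rounded = np.round(p / step) * step
    print(f"rounded to nearest {step:.2f}: mean rating = {rounded.mean():.4f}")
```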
Article
This study evaluates various scoring methods including number-correct scoring, IRT theta scoring, and hybrid scoring in terms of scale-score stability over time. A simulation study was conducted to examine the relative performance of five scoring methods in terms of preserving the first two moments of scale scores for a population in a chain of linking with multiple test forms. Simulation factors included 1) the number of forms linked back to the initial form, 2) the pattern in mean shift, and 3) the proportion of common items. Results showed that scoring methods that operate with number-correct scores generally outperform those that are based on IRT proficiency estimators (θ) in terms of reproducing the mean and standard deviation of scale scores. Scoring methods performed differently as a function of the pattern of group proficiency change.
Article
A common practical challenge is how to assign ability estimates to all incorrect and all correct response patterns when using item response theory (IRT) models and maximum likelihood estimation (MLE) since ability estimates for these types of responses equal −∞ or +∞. This article uses a simulation study and data from an operational K-12 computerized adaptive test (CAT) to compare how well several alternatives – including Bayesian maximum a posteriori (MAP) estimators, various MLE-based methods, and assigning constants – work as strategies for computing ability estimates for extreme scores in vertically scaled fixed-length Rasch-based CATs. Results suggested that the MLE-based methods, MAP estimators with prior standard deviations of 4 and above, and assigning constants achieved the desired outcomes of producing finite ability estimates for all correct and all incorrect responses that were more extreme than the MLE values of students who got one item correct or one item incorrect, as well as being more extreme than the difficulty of the items students saw during the CAT. Additional analyses showed that it is possible for some methods to exhibit changes in how much they differ in magnitude and variability from the MLE comparison values or the b values of the CAT items for all correct versus all incorrect responses and across grades. Specific discussion is given to how one may select a strategy to assign ability estimates to extreme scores in vertically scaled fixed-length CATs that employ the Rasch model.
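As a rough illustration of one strategy mentioned in this abstract, here is a minimal sketch, assuming a Rasch model, invented item difficulties, and a simple grid search: it computes a MAP ability estimate for an all-correct response pattern (where the MLE diverges to +∞) and shows how the estimate moves as the prior standard deviation is widened.

```python
# Hypothetical MAP ability estimate for an all-correct response pattern under the
# Rasch model, where the MLE is +infinity. Item difficulties and priors are made up.
import numpy as np

def map_theta(responses, b, prior_sd, prior_mean=0.0):
    """Grid-search MAP estimate of theta from 0/1 responses and Rasch difficulties."""
    grid = np.linspace(-10, 10, 4001)
    p = 1 / (1 + np.exp(-(grid[:, None] - b[None, :])))
    loglik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    logprior = -0.5 * ((grid - prior_mean) / prior_sd) ** 2
    return grid[np.argmax(loglik + logprior)]

b = np.linspace(-2, 2, 20)            # hypothetical Rasch item difficulties
all_correct = np.ones_like(b)

for sd in (1, 2, 4, 8):
    print(f"prior SD = {sd}: MAP theta = {map_theta(all_correct, b, sd):.2f}")
```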
Preprint
Full-text available
Background: Any high-stakes assessment that leads to an important decision requires careful consideration in determining whether a student passes or fails. This observational study conducted in Erbil, Iraq, in June 2018 proposes a defensible pass/fail decision based on the number of failed competencies. Methods: Results were obtained for 150 medical students on their final objective structured clinical examination. Cutoff scores and pass/fail decisions were calculated using the modified Angoff, borderline, borderline-regression and holistic methods. The results were compared with each other and with a new competency method using Cohen's kappa. Rasch analysis was used to compare the consistency of competency data with Rasch model estimates. Results: The competency method resulted in 40 (26.7%) students failing, compared with 76 (50.6%), 37 (24.6%), 35 (23.3%) and 13 (8%) for the modified Angoff, borderline, borderline-regression and holistic methods, respectively. The competency method demonstrated a sufficient degree of fit to the Rasch model (mean outfit and infit statistics of 0.961 and 0.960, respectively). Conclusions: The competency method was more stringent in determining pass/fail, compared with other standard-setting methods, except for the modified Angoff method. The fit of competency data to the Rasch model provides evidence for the validity and reliability of pass/fail decisions.
Article
Full-text available
One common phenomenon in Angoff standard setting is that panelists regress their ratings in toward the middle of the probability scale. This study describes two indices based on taking ratios of standard deviations that can be utilized with a scatterplot of item ratings versus expected probabilities of success to identify whether ratings are regressed in toward the middle of the probability scale. Results from a simulation study show that the standard deviation ratio indices can successfully detect ratings for hard and easy items that are regressed in toward the middle of the probability scale in Angoff standard‐setting data, where previously proposed indices often do not work as well to detect these effects. Results from a real data set show that, while virtually all raters improve from Round 1 to Round 2 as measured by previously developed indices, the standard deviation ratios in conjunction with a scatterplot of item ratings versus expected probabilities of success can identify individuals who may still be regressing their ratings in toward the middle of the probability scale even after receiving feedback. The authors suggest using the scatterplot along with the standard deviation ratio indices and other statistics for measuring the quality of Angoff standard‐setting data.
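A minimal sketch, on simulated data, of the kind of standard-deviation-ratio index this abstract describes: a panelist who shrinks judgments toward the middle of the probability scale produces ratings whose spread is noticeably smaller than the spread of the expected probabilities of success, so the ratio of the two standard deviations falls well below 1. The expected probabilities, the shrinkage factor, and the noise level are all invented.

```python
# Hypothetical standard deviation ratio for detecting ratings regressed in toward
# the middle of the probability scale. All values below are simulated.
import numpy as np

rng = np.random.default_rng(5)
expected = np.clip(rng.beta(2, 2, 80), 0.05, 0.95)   # expected probabilities of success

# Simulate a panelist who pulls judgments toward 0.5 (regression to the middle).
shrink = 0.4
ratings = np.clip(0.5 + shrink * (expected - 0.5) + rng.normal(0, 0.03, 80), 0.05, 0.95)

sd_ratio = ratings.std(ddof=1) / expected.std(ddof=1)
print(f"SD(ratings) / SD(expected) = {sd_ratio:.2f}  (well below 1 suggests regression)")
```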
Article
Full-text available
An important consideration in standard setting is recruiting a group of panelists with different experiences and backgrounds to serve on the standard-setting panel. This study uses data from 14 different Angoff standard settings from a variety of medical imaging credentialing programs to examine whether people with different professional roles and test development experiences tended to recommend higher or lower cut scores or were more or less accurate in their standard-setting judgments. Results suggested that there were not any statistically significant differences for different types of panelists in terms of the cut scores they recommended or the accuracy of their judgments. Discussion of what these results may mean for panelist selection and recruitment is provided.