Table 6 - uploaded by Adam Wyse
Percent Passing Impact Data for Different Examinee Ability

Source publication
Article
Full-text available
This article illustrates five different methods for estimating Angoff cut scores using item response theory (IRT) models. These include maximum likelihood (ML), expected a posteriori (EAP), modal a posteriori (MAP), and weighted maximum likelihood (WML) estimators, as well as the most commonly used approach based on translating ratings through the test cha...

Context in source publication

Context 1
... choice of estimator and the method of aggregating the data could clearly produce disparate cut scores for these data. [Table 5: IRT Cut Score Estimates for Five Different Methods for the 200-Item Test, Rounds 1 and 2, by rater and method (TS, ML, WML, EAP, MAP)] Table 6 shows the percentage of examinees that would be at or above the cut score if the cut score were calculated using the same method as was used to estimate examinee ability, compared to the TS method. Similar to the simulation study, one can see that when the TS method is used to estimate both ability and the cut score, there is no difference in the percentage of examinees at or above the cut score. ...
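As a rough illustration of the TS (test characteristic) approach described in this context, here is a minimal sketch, assuming a 3PL model and invented item parameters, panelist ratings, and examinee abilities (none of these come from the source study): the ratings are summed to an expected raw cut score, the test characteristic curve is inverted numerically to place the cut on the theta scale, and the percentage of simulated examinees at or above the cut is reported.

```python
# Minimal sketch of a TS (test characteristic) cut score and percent passing.
# All item parameters, ratings, and examinee abilities below are hypothetical.
import numpy as np

def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

rng = np.random.default_rng(0)
n_items = 200
a = rng.uniform(0.5, 2.0, n_items)        # discriminations (hypothetical)
b = rng.normal(0.0, 1.0, n_items)         # difficulties (hypothetical)
c = rng.uniform(0.0, 0.25, n_items)       # lower asymptotes (hypothetical)

# Hypothetical Angoff ratings centered near the curve for a borderline examinee.
ratings = np.clip(icc_3pl(0.5, a, b, c) + rng.normal(0, 0.05, n_items), 0.05, 0.95)
raw_cut = ratings.sum()                   # expected raw cut score (sum of ratings)

# Invert the test characteristic curve numerically to place the cut on theta.
grid = np.linspace(-4, 4, 8001)
tcc = np.array([icc_3pl(t, a, b, c).sum() for t in grid])
theta_cut = grid[np.argmin(np.abs(tcc - raw_cut))]

# Percent of a simulated examinee group at or above the cut score.
thetas = rng.normal(0.0, 1.0, 5000)
print(f"theta cut = {theta_cut:.3f}, percent passing = {100 * np.mean(thetas >= theta_cut):.1f}%")
```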

Similar publications

Article
Full-text available
In this article, a weighted model based on the Rayleigh distribution is proposed and the statistical and reliability properties of this model are presented. Some non-Bayesian and Bayesian methods are used to estimate the beta parameter of the proposed model. The Bayes estimators are obtained under the symmetric (squared error) and the asymmetric (linea...
Article
Full-text available
We investigate the seemingly ill-defined problem of extracting a ground-state mass from a lattice simulation where the extent of the lattice is not long enough to project out the ground-state properly. We regulate the problem using a Bayesian method. We show that controlling meta-parameters (overconfidence) can allow the data to overcome the input...
Article
Full-text available
In this paper, a new class of distributions called the odd log-logistic Lindley-G family is proposed. Several of its statistical and reliability properties are studied in detail. One member of the proposed family can have symmetrical, right-skewed, left-skewed and reversed-J shaped densities, and decreasing, increasing, bathtub, unimodal and reve...
Article
Full-text available
In the fields of education and psychology, nested data with small samples and imbalances are very common. Bauer et al. (2008) first proposed adjusting the traditional multilevel model to analyze the small sample imbalanced nested data (SSIND). In terms of parameter estimation, the Bayesian method shows the possibility of providing unbiased estimati...
Article
Full-text available
This paper extends the one-parameter Chris-Jerry distribution to the power size biased Chris-Jerry distribution, a lifetime distribution in the class of the Lindley distribution. We derive the rth moment and, in particular, estimate the parameters using six classical methods and the Bayesian method. Results of the two real data analyses show that th...

Citations

... However, concerns regarding the reliability, validity, and acceptability of these methods remain an issue (7). The differences in cutoff scores among different standard-setting methods may reduce the legal defensibility of these cutoffs, especially when they lead to differences in the pass/fail decision (8, 9). ...
Article
Full-text available
Any high-stakes assessment that leads to an important decision requires careful consideration in determining whether a student passes or fails. Despite implementation of many standard-setting methods in clinical examinations, concerns remain about the reliability of pass/fail decisions in high-stakes assessment, especially clinical assessment. This observational study proposes a defensible pass/fail decision based on the number of failed competencies. The study was conducted in Erbil, Iraq, in June 2018; results were obtained for 150 medical students on their final objective structured clinical examination. Cutoff scores and pass/fail decisions were calculated using the modified Angoff, borderline, borderline-regression and holistic methods. The results were compared with each other and with a new competency method using Cohen's kappa. Rasch analysis was used to compare the consistency of competency data with Rasch model estimates. The competency method resulted in 40 (26.7%) students failing, compared with 76 (50.6%), 37 (24.6%), 35 (23.3%) and 13 (8%) for the modified Angoff, borderline, borderline-regression and holistic methods, respectively. The competency method demonstrated a sufficient degree of fit to the Rasch model (mean outfit and infit statistics of 0.961 and 0.960, respectively). In conclusion, the competency method was more stringent in determining pass/fail, compared with other standard-setting methods, except for the modified Angoff method. The fit of competency data to the Rasch model provides evidence for the validity and reliability of pass/fail decisions. Keywords: pass/fail decision; competence-based; standard-setting; Rasch model. Graphical abstract: Competency method for pass/fail decision, with the number of stations for each competency represented by circle shapes.
... The direction of the change in bias and sampling variability would depend on how well the ratings fit the IRT model underlying the ICCs. Alternatively, different estimators could be applied (i.e., ML, weighted ML [WML]) to place the cut score on the MAPT scale score (Wyse, 2017). ...
Article
Setting cut scores on multistage-adaptive tests (MSTs) is difficult, particularly when the test spans several grade levels, and the selection of items from MST panels must reflect the operational test specifications. In this study, we describe, illustrate, and evaluate three methods for mapping panelists' Angoff ratings into cut scores on the scale underlying an MST. The results suggest the test characteristic function and item characteristic curve methods performed similarly, but the method based on dichotomizing panelists' ratings at a response probability of .67 was unacceptable. The study featured a rating booklet design that allowed us to systematically evaluate the validity of the Angoff ratings across test levels, contributing internal validity evidence for the cut scores; the cut scores were also evaluated using procedural and external validity evidence. The implications of the results for future standard-setting studies and research in this area are discussed.
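As a loose sketch of how the two mapping approaches named in this abstract can differ, the code below uses the Rasch model for simplicity with invented item difficulties and ratings (nothing here reproduces the MST study): the item characteristic curve (ICC) method inverts each item's ICC at the panelist's rating and averages the resulting thetas, while the test characteristic function (TCF) method sums the ratings and inverts the summed curve.

```python
# Hypothetical comparison of ICC-based and TCF-based mappings of Angoff ratings
# to a theta cut under the Rasch model. Difficulties and ratings are made up.
import numpy as np

rng = np.random.default_rng(1)
b = rng.normal(0.0, 1.0, 60)          # Rasch item difficulties (hypothetical)
ratings = np.clip(1 / (1 + np.exp(-(0.4 - b))) + rng.normal(0, 0.05, 60), 0.05, 0.95)

# ICC method: invert each item's ICC at its rating, then average the item thetas.
theta_icc = np.mean(b + np.log(ratings / (1 - ratings)))

# TCF method: sum the ratings and invert the test characteristic function.
grid = np.linspace(-4, 4, 8001)
tcf = np.array([np.sum(1 / (1 + np.exp(-(t - b)))) for t in grid])
theta_tcf = grid[np.argmin(np.abs(tcf - ratings.sum()))]

print(f"ICC cut = {theta_icc:.3f}, TCF cut = {theta_tcf:.3f}")
```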
... Among the methods for determining cut scores on large-scale assessments, the modified Angoff (1971) method, the Bookmark (Lewis, Mitzel, & Green, 1996; Lewis, Mitzel, Mercado, & Schulz, 2012; Mitzel, Lewis, Patz, & Green, 2001) method, or variations on these approaches (Impara & Plake, 1997; Plake & Cizek, 2012; Schulz & Mitzel, 2009; Wang, 2003; Wyse, 2013, 2017; Wyse & Reckase, 2012) represent some of the most commonly used methods (Brandon, 2004; Hurtz & Auerbach, 2003; Karatonis & Sireci, 2006; Plake & Cizek, 2012; Lewis et al., 2012). In the modified Angoff method, the panelists' task is to review each item in a test and estimate the probability that a minimally competent examinee would be able to answer the item correctly. ...
Article
A common belief is that the Bookmark method is a cognitively simpler standard-setting method than the modified Angoff method. However, a limited amount of research has investigated panelists' ability to perform the Bookmark method well, and whether some of the challenges panelists face with the Angoff method may also be present in the Bookmark method. This article presents results from three experiments where panelists were asked to give Bookmark-type ratings to separate items into groups based on item difficulty data. Results of the experiments showed, consistent with results often observed with the Angoff method, that panelists typically and paradoxically perceived hard items to be too easy and easy items to be too hard. These perceptions were reflected in panelists often placing their Bookmarks too early for hard items and often placing their Bookmarks too late for easy items. The article concludes with a discussion of what these results imply for educators and policymakers using the Bookmark standard-setting method.
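As a loose illustration of the Bookmark logic discussed in this abstract, here is a minimal sketch assuming a Rasch model, an invented ordered item booklet, and a hypothetical bookmark placement: the theta cut is the ability at which the bookmarked item is answered correctly with probability .67, which for the Rasch model is the item difficulty plus ln(.67/.33).

```python
# Hypothetical Bookmark placement: items are ordered by Rasch difficulty and the
# cut is the theta at which the bookmarked item has P(correct) = .67.
# The difficulties and the panelist's placement below are made up.
import numpy as np

rng = np.random.default_rng(2)
b = np.sort(rng.normal(0.0, 1.0, 50))    # ordered item booklet (easy to hard)

rp = 0.67                                # response probability criterion
bookmark_page = 30                       # hypothetical panelist placement (1-indexed)

# Under the Rasch model, P(correct) = rp when theta - b = ln(rp / (1 - rp)).
theta_cut = b[bookmark_page - 1] + np.log(rp / (1 - rp))
print(f"bookmarked item difficulty = {b[bookmark_page - 1]:.3f}, theta cut = {theta_cut:.3f}")
```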
... Test-centered methods generally involve panelists reviewing test items as part of determining cut scores. Examples of commonly used test-centered standard-setting methods include the modified Angoff method (Angoff, 1971), the Bookmark method (Lewis, Mitzel, & Green, 1996; Lewis, Mitzel, Mercado, & Schulz, 2012; Mitzel, Lewis, Patz, & Green, 2001), or variations on these approaches (Impara & Plake, 1997; Plake & Cizek, 2012; Schulz & Mitzel, 2009; Wang, 2003; Wyse & Reckase, 2012; Wyse, 2017). Examinee-centered methods have a different focus and involve panelists reviewing samples of candidate work or providing judgments of how they think individual examinees would perform on the exam. ...
Article
One commonly used compromise standard-setting method is the Beuk (1984) method. A key assumption of the Beuk method is that the emphasis given to the pass rate and the percent correct ratings should be proportional to the extent that the panelists agree on their ratings. However, whether the slope of the Beuk line reflects the emphasis that panelists believe should be assigned to the pass rate and the percentage correct ratings has not been fully tested. In this article, I evaluate this critical assumption of the Beuk method by asking panelists to assign importance weights to their percentage correct and pass rate judgments. I show that in several cases the emphasis suggested by the Beuk slope is noticeably different from what one would expect and is inconsistent with importance weight ratings. I also suggest two ways that the importance weights can be used to calculate alternate cut scores, and I show that one of the ways of calculating cut scores using the importance weights leads to larger potential differences in cut score estimates. I suggest that practitioners should consider collecting importance weights when the Beuk method is used for determining cut scores.
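A simplified sketch of the Beuk compromise described above, using entirely invented judgments and examinee scores: each panelist supplies a required percent-correct cut and a desired pass rate; a line is drawn through the mean judgments with slope equal to the ratio of the two standard deviations, and the cut is taken where that line meets (or comes closest to) the empirical pass-rate curve computed from the score distribution.

```python
# Hypothetical Beuk-style compromise cut score. All judgments and scores are invented.
import numpy as np

rng = np.random.default_rng(3)
pc_judgments = rng.normal(70, 5, 12)      # required percent-correct cut, per panelist
pr_judgments = rng.normal(50, 8, 12)      # desired pass rate (percent), per panelist
scores = rng.normal(72, 10, 2000)         # examinee percent-correct scores (hypothetical)

x_bar, y_bar = pc_judgments.mean(), pr_judgments.mean()
slope = pr_judgments.std(ddof=1) / pc_judgments.std(ddof=1)   # relative agreement

grid = np.linspace(0, 100, 1001)
pass_rate = np.array([100 * np.mean(scores >= x) for x in grid])  # empirical curve
beuk_line = y_bar - slope * (grid - x_bar)                         # decreasing line

cut = grid[np.argmin(np.abs(pass_rate - beuk_line))]
print(f"Beuk compromise cut = {cut:.1f} percent correct")
```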
... In the Angoff method, panelists are asked to review test items and provide item level probability judgments of how they think minimally competent examinees would perform on the items. These item level probability judgments are then analyzed and combined in some way to determine cut scores (see Hurtz & Jones, 2009;Wyse, 2017). Specific implementations of the Angoff method often differ in the number of rounds, the feedback discussed with panelists, the number of different minimally competent examinees for which ratings are collected, the type of items rated, and the rounding rules that panelists use when providing their ratings. ...
... These rating patterns are often observed with the Angoff method no matter the rounding rule applied. It is important to be aware of and investigate these rating patterns since they can influence cut score estimates (see Reckase, 2006b; Wyse, 2017) and the validity of Angoff standard-setting results. ...
Article
Full-text available
One common modification to the Angoff standard-setting method is to have panelists round their ratings to the nearest 0.05 or 0.10 instead of 0.01. Several reasons have been offered as to why it may make sense to have panelists round their ratings to the nearest 0.05 or 0.10. In this article, we examine one reason that has been suggested, which is that even if panelists are given the opportunity to provide ratings to the nearest 0.01 they often round their ratings to the nearest 0.05 or 0.10 anyway. Using data from four standard settings, we show that in many cases ratings ended in a 0 or 5 when panelists were given the option of using a scale from 0 to 100 in one-point increments and that only about 9% of all ratings ended in a digit other than a 0 or 5. We also examined the impact of different rounding rules and found that results were quite similar across rules. Additional analyses showed the common phenomenon of panelists giving ratings that are too high for hard items and too low for easy items in comparison to conditional p-values. It is suggested that rounding ratings to the nearest 0.05 or 0.10 represents a reasonable alternative to rounding ratings to the nearest 0.01.
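A small sketch of the kind of tally and rounding comparison this abstract describes, using simulated ratings rather than any real panel data (the rating distribution and the 0-100 scale convention are assumptions for illustration):

```python
# Hypothetical check of rating granularity and rounding rules for Angoff ratings
# given on a 0-100 scale. The ratings below are simulated, not real panel data.
import numpy as np

rng = np.random.default_rng(4)
raw = rng.uniform(20, 95, 500)
# Mimic panelists mostly choosing multiples of 5, occasionally finer values.
ratings = np.where(rng.random(500) < 0.9, 5 * np.round(raw / 5), np.round(raw))

share_0_or_5 = np.isin(ratings % 10, [0, 5]).mean()
print(f"percent of ratings ending in 0 or 5: {100 * share_0_or_5:.1f}%")

# Compare the mean rating (a simple raw Angoff aggregate) under rounding rules.
p = ratings / 100
for step in (0.01, 0.05, 0.10):
    rounded = np.round(p / step) * step
    print(f"rounded to nearest {step:.2f}: mean rating = {rounded.mean():.4f}")
```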
Article
This study evaluates various scoring methods including number-correct scoring, IRT theta scoring, and hybrid scoring in terms of scale-score stability over time. A simulation study was conducted to examine the relative performance of five scoring methods in terms of preserving the first two moments of scale scores for a population in a chain of linking with multiple test forms. Simulation factors included 1) the number of forms linked back to the initial form, 2) the pattern in mean shift, and 3) the proportion of common items. Results showed that scoring methods that operate with number-correct scores generally outperform those that are based on IRT proficiency estimators (θ) in terms of reproducing the mean and standard deviation of scale scores. Scoring methods performed differently as a function of the pattern of group proficiency change.
Article
A common practical challenge is how to assign ability estimates to all incorrect and all correct response patterns when using item response theory (IRT) models and maximum likelihood estimation (MLE) since ability estimates for these types of responses equal −∞ or +∞. This article uses a simulation study and data from an operational K-12 computerized adaptive test (CAT) to compare how well several alternatives – including Bayesian maximum a posteriori (MAP) estimators, various MLE-based methods, and assigning constants – work as strategies for computing ability estimates for extreme scores in vertically scaled fixed-length Rasch-based CATs. Results suggested that the MLE-based methods, MAP estimators with prior standard deviations of 4 and above, and assigning constants achieved the desired outcomes of producing finite ability estimates for all correct and all incorrect responses that were more extreme than the MLE values of students who got one item correct or one item incorrect, as well as being more extreme than the difficulty of the items students saw during the CAT. Additional analyses showed that it is possible for some methods to exhibit changes in how much they differ in magnitude and variability from the MLE comparison values or the b values of the CAT items for all correct versus all incorrect responses and across grades. Specific discussion is given to how one may select a strategy to assign ability estimates to extreme scores in vertically scaled fixed-length CATs that employ the Rasch model.
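As a rough illustration of one strategy mentioned in this abstract, here is a minimal sketch, assuming a Rasch model, invented item difficulties, and a simple grid search: it computes a MAP ability estimate for an all-correct response pattern (where the MLE diverges to +∞) and shows how the estimate moves as the prior standard deviation is widened.

```python
# Hypothetical MAP ability estimate for an all-correct response pattern under the
# Rasch model, where the MLE is +infinity. Item difficulties and priors are made up.
import numpy as np

def map_theta(responses, b, prior_sd, prior_mean=0.0):
    """Grid-search MAP estimate of theta from 0/1 responses and Rasch difficulties."""
    grid = np.linspace(-10, 10, 4001)
    p = 1 / (1 + np.exp(-(grid[:, None] - b[None, :])))
    loglik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    logprior = -0.5 * ((grid - prior_mean) / prior_sd) ** 2
    return grid[np.argmax(loglik + logprior)]

b = np.linspace(-2, 2, 20)            # hypothetical Rasch item difficulties
all_correct = np.ones_like(b)

for sd in (1, 2, 4, 8):
    print(f"prior SD = {sd}: MAP theta = {map_theta(all_correct, b, sd):.2f}")
```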
Preprint
Full-text available
Background: Any high-stakes assessment that leads to an important decision requires careful consideration in determining whether a student passes or fails. This observational study conducted in Erbil, Iraq, in June 2018 proposes a defensible pass/fail decision based on the number of failed competencies. Methods: Results were obtained for 150 medical students on their final objective structured clinical examination. Cutoff scores and pass/fail decisions were calculated using the modified Angoff, borderline, borderline-regression and holistic methods. The results were compared with each other and with a new competency method using Cohen's kappa. Rasch analysis was used to compare the consistency of competency data with Rasch model estimates. Results: The competency method resulted in 40 (26.7%) students failing, compared with 76 (50.6%), 37 (24.6%), 35 (23.3%) and 13 (8%) for the modified Angoff, borderline, borderline-regression and holistic methods, respectively. The competency method demonstrated a sufficient degree of fit to the Rasch model (mean outfit and infit statistics of 0.961 and 0.960, respectively). Conclusions: The competency method was more stringent in determining pass/fail, compared with other standard-setting methods, except for the modified Angoff method. The fit of competency data to the Rasch model provides evidence for the validity and reliability of pass/fail decisions.
Article
Full-text available
One common phenomenon in Angoff standard setting is that panelists regress their ratings in toward the middle of the probability scale. This study describes two indices based on taking ratios of standard deviations that can be utilized with a scatterplot of item ratings versus expected probabilities of success to identify whether ratings are regressed in toward the middle of the probability scale. Results from a simulation study show that the standard deviation ratio indices can successfully detect ratings for hard and easy items that are regressed in toward the middle of the probability scale in Angoff standard‐setting data, where previously proposed indices often do not work as well to detect these effects. Results from a real data set show that, while virtually all raters improve from Round 1 to Round 2 as measured by previously developed indices, the standard deviation ratios in conjunction with a scatterplot of item ratings versus expected probabilities of success can identify individuals who may still be regressing their ratings in toward the middle of the probability scale even after receiving feedback. The authors suggest using the scatterplot along with the standard deviation ratio indices and other statistics for measuring the quality of Angoff standard‐setting data.
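A minimal sketch, on simulated data, of the kind of standard-deviation-ratio index this abstract describes: a panelist who shrinks judgments toward the middle of the probability scale produces ratings whose spread is noticeably smaller than the spread of the expected probabilities of success, so the ratio of the two standard deviations falls well below 1. The expected probabilities, the shrinkage factor, and the noise level are all invented.

```python
# Hypothetical standard deviation ratio for detecting ratings regressed in toward
# the middle of the probability scale. All values below are simulated.
import numpy as np

rng = np.random.default_rng(5)
expected = np.clip(rng.beta(2, 2, 80), 0.05, 0.95)   # expected probabilities of success

# Simulate a panelist who pulls judgments toward 0.5 (regression to the middle).
shrink = 0.4
ratings = np.clip(0.5 + shrink * (expected - 0.5) + rng.normal(0, 0.03, 80), 0.05, 0.95)

sd_ratio = ratings.std(ddof=1) / expected.std(ddof=1)
print(f"SD(ratings) / SD(expected) = {sd_ratio:.2f}  (well below 1 suggests regression)")
```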
Article
Full-text available
An important consideration in standard setting is recruiting a group of panelists with different experiences and backgrounds to serve on the standard-setting panel. This study uses data from 14 different Angoff standard settings from a variety of medical imaging credentialing programs to examine whether people with different professional roles and test development experiences tended to recommend higher or lower cut scores or were more or less accurate in their standard-setting judgments. Results suggested that there were not any statistically significant differences for different types of panelists in terms of the cut scores they recommended or the accuracy of their judgments. Discussion of what these results may mean for panelist selection and recruitment is provided.