Summary Statistics for SAT Verbal Raw-to-Scale Conversions

Source publication
Article
Full-text available
This study uses historical data to explore the consistency of SAT® I: Reasoning Test score conversions and to examine trends in scaled score means. During the period from April 1995 to December 2003, both Verbal (V) and Math (M) means display substantial seasonality, and a slight increasing trend for both is observed. SAT Math means increase more t...

Context in source publication

Context 1
... the whole, Tables 5 to 8 provide a rather favorable picture in terms of test stability. Except for very high or very low raw scores, raw-to-scale conversions exhibit quite limited variability. ...

Similar publications

Article
Full-text available
The objective of this study was to quantify the labour input per cow for different herd sizes. Data was collected from 98 and 73 spring-calving farms in years 1 and 2, respectively. Average annual total dairy labour input per cow was 49.7 h, 42.2 h and 29.3 h for small, medium and large herd-size farms, respectively. Maximum labour input levels were observ...

Citations

... First, most existing QA methods operate under either the assumption that the mean scores are expected to be stable over time or the assumption that variations in score trends can be largely explained by seasonal variations. Lee and von Davier (2013) have summarized a number of techniques to describe score trends and seasonal patterns, including linear ANOVA models (Haberman et al., 2008), regression with autoregressive moving-average (Li et al., 2009), harmonic regressions (Lee and Haberman, 2013), dynamic linear models (Wanjohi et al., 2013), and the Shewhart chart (Schafer et al., 2011). These methods, in combination with change detection methods such as changepoint models and hidden Markov models (Lee and von Davier, 2013) and cumulative sum (CUSUM) charts (Page, 1954), were found to be effective in monitoring the stability of the mean scores (Lee and von Davier, 2013). ...
Article
Full-text available
Digital-first assessments are a new generation of high-stakes assessments that can be taken anytime and anywhere in the world. The flexibility, complexity, and high-stakes nature of these assessments pose quality assurance challenges and require continuous data monitoring and the ability to promptly identify, interpret, and correct anomalous results. In this manuscript, we illustrate the development of a quality assurance system for anomaly detection for a new high-stakes digital-first assessment, for which the population of test takers is still in flux. Various control charts and models are applied to detect and flag any abnormal changes in the assessment statistics, which are then reviewed by experts. The procedure of determining the causes of a score anomaly is demonstrated with a real-world example. Several categories of statistics, including scores, test taker profiles, repeaters, item analysis and item exposure, are monitored to provide context and evidence for evaluating the score anomaly as well as assure the quality of the assessment. The monitoring results and alerts are programmed to be automatically updated and delivered via an interactive dashboard every day.
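One of the change-detection tools named in the context above, the cumulative sum (CUSUM) chart of Page (1954), is simple enough to sketch. The Python fragment below is an illustrative implementation applied to a simulated series of administration-level score means; the target value, reference value k, and decision threshold h are assumptions chosen for demonstration, not operational settings.

```python
# Minimal CUSUM sketch (Page, 1954) for monitoring a series of
# administration-level mean scaled scores. The target mean, tolerance (k)
# and decision threshold (h) are illustrative, not operational, values.
import numpy as np

def cusum(series, target, k=0.5, h=4.0):
    """Return upper/lower CUSUM statistics and indices of alarm points."""
    s_hi, s_lo = 0.0, 0.0
    hi, lo, alarms = [], [], []
    for i, x in enumerate(series):
        s_hi = max(0.0, s_hi + (x - target - k))   # drift above target
        s_lo = max(0.0, s_lo + (target - x - k))   # drift below target
        hi.append(s_hi)
        lo.append(s_lo)
        if s_hi > h or s_lo > h:
            alarms.append(i)
    return np.array(hi), np.array(lo), alarms

# Example: simulated standardized score means with a level shift at t = 30.
rng = np.random.default_rng(0)
means = np.concatenate([rng.normal(0, 1, 30), rng.normal(1.5, 1, 10)])
hi, lo, alarms = cusum(means, target=0.0)
print("first alarm at administration:", alarms[0] if alarms else None)
```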
... To better observe and monitor the pattern and trend of the many score means across different forms or administrations, researchers at Educational Testing Service (ETS) have used ANOVA and harmonic regression to check score mean fluctuations over time (Lee & von Davier, 2013; von Davier, 2012). For example, Haberman, Guo, Liu, and Dorans (2008) used the ANOVA method (Howell, 2002) to examine the stability of SAT® Math and Reading score means over a 9-year period. They found that the scales of SAT Math and Reading reporting scores were stable and the fluctuations in SAT score means were mainly due to seasonal effects. ...
Article
For educational tests, it is critical to maintain consistency of score scales and to understand the sources of variation in score means over time. This practice helps to ensure that interpretations about test takers' abilities are comparable from one administration (or one form) to another. This study examines the consistency of reported scores for the TOEIC® Speaking and Writing tests using statistical procedures. Specifically, the stability of the TOEIC Speaking score means from 431 forms administered in a 3-year period was evaluated using harmonic regression, and the stability of TOEIC Writing score means from 66 forms administered in a 3-year period was evaluated using analysis of variance. Results indicated that the fluctuations in the TOEIC Speaking or Writing score means mainly reflect changes in test takers' overall English speaking or writing ability levels instead of score inaccuracies. For both speaking and writing test scores, a large proportion of the variation in score means can be explained by seasonality (the rise or fall of score means associated with specific times of the year) and test takers' demographic information, which have been shown to be related to test-taker ability. As a result, this finding provides evidence for the consistency of the TOEIC Speaking and Writing score scales across forms.
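As a rough sketch of the harmonic-regression approach mentioned above, the fragment below fits simulated administration-level score means to a linear trend plus annual sine and cosine terms by ordinary least squares. The monthly spacing, single harmonic, and simulated data are assumptions made only for illustration.

```python
# Harmonic regression sketch: score means modeled as intercept + linear trend
# + one annual sine/cosine pair. All numbers are simulated for the example.
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(108) / 12.0                      # 9 years of monthly administrations
true = 500 + 2.0 * t + 6.0 * np.sin(2 * np.pi * t) + 3.0 * np.cos(2 * np.pi * t)
y = true + rng.normal(0, 2.5, t.size)          # observed score means

# Design matrix: intercept, trend, annual sine and cosine terms.
X = np.column_stack([np.ones_like(t), t,
                     np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

print("estimated trend per year:", round(beta[1], 2))
print("residual SD after removing trend and seasonality:",
      round(residuals.std(ddof=X.shape[1]), 2))
```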
... As the four-old-form linking plan has proved effective, producing very stable conversions (Haberman et al., 2008), and our purpose is to find a way to balance the needs for equating against the needs for pretesting and/or minimizing old-form exposure, we use the operational conversions that are based on the four-old-form equating as the criterion. ...
Article
Full-text available
Maintaining score interchangeability and scale consistency is crucial for any testing program that administers multiple forms across years. The use of a multiple linking design, which involves equating a new form to multiple old forms and averaging the conversions, has been proposed to control scale drift. However, the use of multiple linking often conflicts with the need to minimize old item/form exposure and the need for pretesting. This study tried to find a balance point where the needs for equating, item/form exposure control, and pretesting can all be satisfied. Three equating scenarios were examined using real data: equating to one old form, equating to two old forms, or equating to three old forms. The finding is that equating based on one old form produced persistent score drift and showed increased variability in score means and standard deviations over time. In contrast, equating back to two or three old forms produced much more stable conversions with less variation. Overall, equating based on multiple linking designs shows promise for producing more consistent results and preventing scale drift. We recommend that testing programs and practitioners consider the use of multiple linking whenever possible.
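The core of the multiple-linking design described in this abstract, equating a new form separately to several old forms and averaging the resulting conversions, can be shown with a toy computation. The conversion tables below are fabricated placeholders, not real SAT conversions.

```python
# Toy illustration of multiple linking: a new form is equated to each of
# several old forms, and the resulting raw-to-scale conversions are averaged.
# The conversion values are fabricated placeholders.
import numpy as np

raw_scores = np.arange(0, 11)

# Hypothetical raw-to-scale conversions from three old-form links.
conversions = np.array([
    200 + 30 * raw_scores,          # link to old form A
    205 + 29 * raw_scores,          # link to old form B
    198 + 31 * raw_scores,          # link to old form C
], dtype=float)

averaged = conversions.mean(axis=0)   # final conversion: average of the links
spread = conversions.std(axis=0)      # disagreement among links at each raw score

for r, c, s in zip(raw_scores, averaged, spread):
    print(f"raw {r:2d} -> scale {c:6.1f} (link SD {s:4.1f})")
```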
... Brennan also notes that over an extended period of time, even small year-to-year changes could add up to substantial differences between old and new forms. The study by Haberman, Guo, Liu, and Dorans (2008) is an example of scale monitoring over an extended period of time. SEA can be used in these scale drift contexts as well. ...
Article
We make a distinction between two types of test changes: inevitable deviations from specifications versus planned modifications of specifications. We describe how score equity assessment (SEA) can be used as a tool to assess a critical aspect of construct continuity, the equivalence of scores, whenever planned changes are introduced to testing programs. We also report on how SEA can be used as a quality control check to evaluate whether tests developed to a static set of specifications remain within acceptable tolerance levels with respect to equatability.
... However, for W, the zero line in Figure 7 is below the lower ASEE band, which may indicate that the scale drift in W is caused by sources other than random equating error. ... continuations of well-established and stable SAT-Mathematics and SAT-Verbal scales (Haberman, Guo, Liu, & Dorans, 2008). ...
... In order to help maintain scale stability and reduce scale drift, researchers have proposed many suggestions, such as constructing parallel test forms, using large equating sample sizes, using a braiding plan (test forms are interwoven to avoid the development of separate strains), and using multiple-linking equating designs (Guo, Liu, Dorans, & Feigenbaum, 2011; Haberman et al., 2008). However, scale drift occurs after a series of equatings even when best practices are followed. ...
Article
Full-text available
This study examines the stability of the SAT Reasoning Test™ score scales from 2005 to 2010. A 2005 old form (OF) was administered along with a 2010 new form (NF). A new conversion for OF was derived through direct equipercentile equating. A comparison of the newly derived and the original OF conversions showed that Critical Reading and Mathematics score scales have experienced, at most, a moderate upward scale drift (no greater than 5 points on average), and the drift may be explained by an accumulation of random equating errors. The Writing score scale has experienced a significant upward scale drift (11 points on average), which may be caused by sources other than random equating errors.
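The direct equipercentile equating used in studies like this one maps each old-form raw score to the new-form raw score holding the same percentile rank. The sketch below shows the bare mechanics on simulated data; presmoothing and other operational details are omitted, and the score distributions are assumed for illustration.

```python
# Bare-bones equipercentile equating sketch: old-form (OF) raw scores are
# mapped to the new-form (NF) raw-score scale by matching percentile ranks.
# Smoothing and continuization details are omitted; data are simulated.
import numpy as np

rng = np.random.default_rng(2)
of_scores = rng.binomial(60, 0.55, 5000)   # simulated OF raw scores
nf_scores = rng.binomial(60, 0.58, 5000)   # simulated NF raw scores

def percentile_rank(x, scores):
    """Percent of examinees scoring below x, plus half of those exactly at x."""
    return 100.0 * (np.mean(scores < x) + 0.5 * np.mean(scores == x))

nf_grid = np.arange(nf_scores.min(), nf_scores.max() + 1)
nf_pr = np.array([percentile_rank(s, nf_scores) for s in nf_grid])

def equate(of_raw):
    """Find the NF raw score whose percentile rank matches the OF score's."""
    pr = percentile_rank(of_raw, of_scores)
    return np.interp(pr, nf_pr, nf_grid)

for raw in (20, 30, 40, 50):
    print(f"OF raw {raw} -> NF raw {equate(raw):5.2f}")
```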
... For instance, the norms group used at the time the scale was established may not be appropriate over time, as mentioned above in the SAT recentering case. Also, equating is imperfect both due to violations of equating assumptions and due to use of finite samples to estimate parameters (Haberman, Guo, Liu, & Dorans, 2008). ...
... In such a case, differences in equating results may be difficult to interpret. Given all that, it is not surprising that the findings of this study are not entirely consistent with those of Haberman et al. (2008), who examined time series composed of mean scaled scores and raw-to-scale conversions for 54 SAT verbal and math forms administered from April 1995 to December 2003. They found that, on the whole, the data provide a picture of stability. ...
Article
This study examines the stability of the SAT® scale from 1994 to 2001. A 1994 form and a 2001 form were readministered in a 2005 SAT administration, and the 1994 form was equated to the 2001 form. The new conversion was compared to the old conversion. Both the verbal and math sections exhibit a similar degree of scale drift, but in opposite directions: the verbal scale has drifted upward, whereas the math scale has drifted downward. We suggest testing programs monitor the score scales periodically by building a testing form schedule that allows a systematic and periodic checking of scale stability.
Chapter
Computational psychometrics, a blend of theory-driven psychometrics and data-driven algorithms, provides the theoretical underpinnings for the design and analysis of the new generation of high-stakes, digital-first assessments that can be taken anytime and anywhere in the world, and their scores impact test takers’ lives. The unprecedented flexibility, complexity, and high-stakes nature of these digital-first assessments pose enormous quality assurance challenges. In order to ensure these assessments meet both “the contest and the measurement” requirements of high-stakes tests, it is necessary to conduct continuous pattern monitoring and to be able to promptly react when needed. In this paper, we illustrate the development of a quality assurance system for a high-stakes and digital-first assessment. To build the system, educational data from continuous administrations of the assessments are mined, modeled and monitored. In particular, five categories of statistics are monitored to assure the quality of the assessment, including scores, test taker profiles, repeaters, item analysis and item exposure. Various control charts and models were applied to detect and flag the abnormal changes in the assessment statistics. The monitoring results and alerts were communicated with the stakeholders via an interactive dashboard. The paper concludes with a discussion on how the automatic quality assurance system is combined with the human review process in real-world application.
Article
In recent years, harmonic regression models have been applied to implement quality control for educational assessment data consisting of multiple administrations and displaying seasonality. As with other types of regression models, it is imperative that model adequacy checking and model fit evaluation be conducted appropriately. However, there has been no literature on how to perform a comprehensive model adequacy evaluation when applying harmonic regression models to sequential data with seasonality in the educational assessment field. This paper is intended to fill this gap with an illustration using real data from an English language assessment. Two types of cross-validation, leave-one-out and out-of-sample, were designed to measure prediction errors and check model validity. Three types of R-squared statistics and various residual diagnostics were applied to check model adequacy and model fit.
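A minimal sketch of the leave-one-out cross-validation described above, under assumed data: each administration's score mean is held out in turn, the harmonic regression is refit, and the held-out value is predicted; the root-mean-square of these prediction errors summarizes out-of-sample fit.

```python
# Leave-one-out cross-validation for a harmonic regression fit to
# administration-level score means. The simulated series and the single
# annual harmonic are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(72) / 12.0                       # 6 years of monthly administrations
y = 100 + 1.5 * t + 4 * np.sin(2 * np.pi * t) + rng.normal(0, 1.5, t.size)

X = np.column_stack([np.ones_like(t), t,
                     np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])

loo_errors = []
for i in range(len(y)):
    mask = np.ones(len(y), dtype=bool)
    mask[i] = False                            # hold out administration i
    beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    loo_errors.append(y[i] - X[i] @ beta)      # prediction error for the held-out mean

loo_errors = np.array(loo_errors)
print("LOO RMSE:", round(np.sqrt(np.mean(loo_errors ** 2)), 3))
```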
Article
For assessments that use different forms in different administrations, equating methods are applied to ensure comparability of scores over time. Ideally, a score scale is well maintained throughout the life of a testing program. In reality, instability of a score scale can result from a variety of causes, some are expected while others may be unforeseen. The situation is more challenging for assessments that assemble many different forms and deliver frequent administrations per year. Harmonic regression, a seasonal‐adjustment method, has been found useful in achieving the goal of differentiating between possible known sources of variability and unknown sources so as to study score stability for such assessments. As an extension, this paper presents a family of three approaches that incorporate examinees' demographic data into harmonic regression in different ways. A generic evaluation method based on jackknifing is developed to compare the approaches within the family. The three approaches are compared using real data from an international language assessment. Results suggest that all approaches perform similarly and are effective in meeting the goal. The paper also discusses the properties and limitations of the three approaches, along with inferences about score (in)stability based on the harmonic regression results.
Article
Maintaining comparability of test scores is a major challenge faced by testing programs that have almost continuous administrations. Among the potential problems are scale drift and rapid accumulation of errors. Many standard quality control techniques for testing programs, which can effectively detect and address scale drift for small numbers of administrations yearly, are not always adequate to detect changes in a complex, rapid flow of scores. To address this issue, Educational Testing Service has been conducting research into applying data mining and quality control tools from manufacturing, biology, and text analysis to scaled scores and other relevant assessment variables. Data mining tools can identify patterns in the data and quality control techniques can detect trends. This type of data analysis of scaled scores is relatively new, and this paper gives a brief overview of the theoretical and practical implications of the issues. More in-depth analyses to refine the approaches for matching the type of data from educational assessments are needed.
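Among the manufacturing-style quality control tools alluded to here, a Shewhart-type individuals chart is the simplest: administrations whose mean scaled scores fall outside 3-sigma limits set from a baseline window are flagged for review. The baseline length, limits, and simulated data below are illustrative assumptions.

```python
# Illustrative Shewhart-style individuals chart for scaled-score means:
# points outside the 3-sigma control limits are flagged for expert review.
# Baseline window and simulated data are assumptions for demonstration.
import numpy as np

rng = np.random.default_rng(4)
means = np.concatenate([rng.normal(500, 3, 24), rng.normal(512, 3, 6)])

baseline = means[:24]                      # administrations used to set limits
center = baseline.mean()
sigma = baseline.std(ddof=1)
ucl, lcl = center + 3 * sigma, center - 3 * sigma

flags = np.where((means > ucl) | (means < lcl))[0]
print(f"center {center:.1f}, limits [{lcl:.1f}, {ucl:.1f}], "
      f"flagged administrations: {flags.tolist()}")
```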