Summary Statistics for SAT Verbal Raw-to-Scale Conversions

Source publication
Article
Full-text available
This study uses historical data to explore the consistency of SAT® I: Reasoning Test score conversions and to examine trends in scaled score means. During the period from April 1995 to December 2003, both Verbal (V) and Math (M) means display substantial seasonality, and a slight increasing trend for both is observed. SAT Math means increase more t...

Context in source publication

Context 1
... the whole, Tables 5 to 8 provide a rather favorable picture in terms of test stability. Except for very high or very low raw scores, raw-to-scale conversions exhibit quite limited variability. ...

Similar publications

Article
Full-text available
The objective of this study was to quantify the labour input per cow for different herd sizes. Data was collected from 98 and 73 spring-calving farms in years 1 and 2, respectively. Average annual total dairy labour input per cow was 49.7 h, 42.2 h and 29.3 h for small, medium and large herd-size farms, respectively. Maximum labour input levels were observ...

Citations

... First, most existing QA methods operate under either the assumption that the mean scores are expected to be stable over time or the assumption that variations in score trends can be largely explained by seasonal variations. Lee and von Davier (2013) have summarized a number of techniques to describe score trends and seasonal patterns, including linear ANOVA models (Haberman et al., 2008), regression with autoregressive moving-average (Li et al., 2009), harmonic regressions (Lee and Haberman, 2013), dynamic linear models (Wanjohi et al., 2013), and the Shewhart chart (Schafer et al., 2011). These methods, in combination with change detection methods such as changepoint models and hidden Markov models (Lee and von Davier, 2013) and cumulative sum (CUSUM) charts (Page, 1954), were found to be effective in monitoring the stability of the mean scores (Lee and von Davier, 2013). ...
Article
Full-text available
Digital-first assessments are a new generation of high-stakes assessments that can be taken anytime and anywhere in the world. The flexibility, complexity, and high-stakes nature of these assessments pose quality assurance challenges and require continuous data monitoring and the ability to promptly identify, interpret, and correct anomalous results. In this manuscript, we illustrate the development of a quality assurance system for anomaly detection for a new high-stakes digital-first assessment, for which the population of test takers is still in flux. Various control charts and models are applied to detect and flag any abnormal changes in the assessment statistics, which are then reviewed by experts. The procedure of determining the causes of a score anomaly is demonstrated with a real-world example. Several categories of statistics, including scores, test taker profiles, repeaters, item analysis and item exposure, are monitored to provide context and evidence for evaluating the score anomaly as well as assure the quality of the assessment. The monitoring results and alerts are programmed to be automatically updated and delivered via an interactive dashboard every day.
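One of the change-detection tools named in the context above, the cumulative sum (CUSUM) chart of Page (1954), is simple enough to sketch. The Python fragment below is an illustrative implementation applied to a simulated series of administration-level score means; the target value, reference value k, and decision threshold h are assumptions chosen for demonstration, not operational settings.

```python
# Minimal CUSUM sketch (Page, 1954) for monitoring a series of
# administration-level mean scaled scores. The target mean, tolerance (k)
# and decision threshold (h) are illustrative, not operational, values.
import numpy as np

def cusum(series, target, k=0.5, h=4.0):
    """Return upper/lower CUSUM statistics and indices of alarm points."""
    s_hi, s_lo = 0.0, 0.0
    hi, lo, alarms = [], [], []
    for i, x in enumerate(series):
        s_hi = max(0.0, s_hi + (x - target - k))   # drift above target
        s_lo = max(0.0, s_lo + (target - x - k))   # drift below target
        hi.append(s_hi)
        lo.append(s_lo)
        if s_hi > h or s_lo > h:
            alarms.append(i)
    return np.array(hi), np.array(lo), alarms

# Example: simulated standardized score means with a level shift at t = 30.
rng = np.random.default_rng(0)
means = np.concatenate([rng.normal(0, 1, 30), rng.normal(1.5, 1, 10)])
hi, lo, alarms = cusum(means, target=0.0)
print("first alarm at administration:", alarms[0] if alarms else None)
```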
... To better observe and monitor the pattern and trend of the many score means across different forms or administrations, researchers at Educational Testing Service (ETS) have used ANOVA and harmonic regression to check score mean fluctuations over time (Lee & von Davier, 2013; von Davier, 2012). For example, Haberman, Guo, Liu, and Dorans (2008) used the ANOVA method (Howell, 2002) to examine the stability of SAT® Math and Reading score means over a 9-year period. They found that the scales of SAT Math and Reading reporting scores were stable and the fluctuations in SAT score means were mainly due to seasonal effects. ...
Article
For educational tests, it is critical to maintain consistency of score scales and to understand the sources of variation in score means over time. This practice helps to ensure that interpretations about test takers' abilities are comparable from one administration (or one form) to another. This study examines the consistency of reported scores for the TOEIC® Speaking and Writing tests using statistical procedures. Specifically, the stability of the TOEIC Speaking score means from 431 forms administered in a 3-year period was evaluated using harmonic regression, and the stability of TOEIC Writing score means from 66 forms administered in a 3-year period was evaluated using analysis of variance. Results indicated that the fluctuations in the TOEIC Speaking or Writing score means mainly reflect changes in test takers' overall English speaking or writing ability levels instead of score inaccuracies. For both speaking and writing test scores, a large proportion of the variation in score means can be explained by seasonality (the rise or fall of score means associated with specific times of the year) and test takers' demographic information, which have been shown to be related to test-taker ability. As a result, this finding provides evidence for the consistency of the TOEIC Speaking and Writing score scales across forms.
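As a rough sketch of the harmonic-regression approach mentioned above, the fragment below fits simulated administration-level score means to a linear trend plus annual sine and cosine terms by ordinary least squares. The monthly spacing, single harmonic, and simulated data are assumptions made only for illustration.

```python
# Harmonic regression sketch: score means modeled as intercept + linear trend
# + one annual sine/cosine pair. All numbers are simulated for the example.
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(108) / 12.0                      # 9 years of monthly administrations
true = 500 + 2.0 * t + 6.0 * np.sin(2 * np.pi * t) + 3.0 * np.cos(2 * np.pi * t)
y = true + rng.normal(0, 2.5, t.size)          # observed score means

# Design matrix: intercept, trend, annual sine and cosine terms.
X = np.column_stack([np.ones_like(t), t,
                     np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

print("estimated trend per year:", round(beta[1], 2))
print("residual SD after removing trend and seasonality:",
      round(residuals.std(ddof=X.shape[1]), 2))
```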
... As the four-old-form linking plan has proved effective, producing very stable conversions (Haberman et al., 2008), and our purpose is to find a way to balance the needs for equating against the needs for pretesting and/or minimizing old-form exposure, we use the operational conversions that are based on the four-old-form equating as the criterion. ...
Article
Full-text available
Maintaining score interchangeability and scale consistency is crucial for any testing program that administers multiple forms across years. The use of a multiple linking design, which involves equating a new form to multiple old forms and averaging the conversions, has been proposed to control scale drift. However, the use of multiple linking often conflicts with the need to minimize old item/form exposure and the need for pretesting. This study tried to find a balance point where the needs for equating, item/form exposure control, and pretesting can all be satisfied. Three equating scenarios were examined using real data: equating to one old form, equating to two old forms, or equating to three old forms. The finding is that equating based on one old form produced persistent score drift and showed increased variability in score means and standard deviations over time. In contrast, equating back to two or three old forms produced much more stable conversions with less variation. Overall, equating based on multiple linking designs shows promise for producing more consistent results and preventing scale drift. We recommend that testing programs and practitioners consider the use of multiple linking whenever possible.
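The core of the multiple-linking design described in this abstract, equating a new form separately to several old forms and averaging the resulting conversions, can be shown with a toy computation. The conversion tables below are fabricated placeholders, not real SAT conversions.

```python
# Toy illustration of multiple linking: a new form is equated to each of
# several old forms, and the resulting raw-to-scale conversions are averaged.
# The conversion values are fabricated placeholders.
import numpy as np

raw_scores = np.arange(0, 11)

# Hypothetical raw-to-scale conversions from three old-form links.
conversions = np.array([
    200 + 30 * raw_scores,          # link to old form A
    205 + 29 * raw_scores,          # link to old form B
    198 + 31 * raw_scores,          # link to old form C
], dtype=float)

averaged = conversions.mean(axis=0)   # final conversion: average of the links
spread = conversions.std(axis=0)      # disagreement among links at each raw score

for r, c, s in zip(raw_scores, averaged, spread):
    print(f"raw {r:2d} -> scale {c:6.1f} (link SD {s:4.1f})")
```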
... Brennan also notes that over an extended period of time, even small year-to-year changes could add up to substantial differences between old and new forms. The study by Haberman, Guo, Liu, and Dorans (2008) is an example of scale monitoring over an extended period of time. SEA can be used in these scale drift contexts as well. ...
Article
We make a distinction between two types of test changes: inevitable deviations from specifications versus planned modifications of specifications. We describe how score equity assessment (SEA) can be used as a tool to assess a critical aspect of construct continuity, the equivalence of scores, whenever planned changes are introduced to testing programs. We also report on how SEA can be used as a quality control check to evaluate whether tests developed to a static set of specifications remain within acceptable tolerance levels with respect to equatability.
... However, for W, the zero line in Figure 7 is below the lower ASEE band, which may indicate that the scale drift in W is caused by sources other than random equating error. ... continuations of well-established and stable SAT-Mathematics and SAT-Verbal scales (Haberman, Guo, Liu, & Dorans, 2008). ...
... In order to help maintain scale stability and reduce scale drift, researchers have proposed many suggestions, such as constructing parallel test forms, using large equating sample sizes, using a braiding plan (test forms are interwoven to avoid the development of separate strains), and using multiple-linking equating designs (Guo, Liu, Dorans, & Feigenbaum, 2011; Haberman et al., 2008). However, scale drift occurs after a series of equatings even when best practices are followed. ...
Article
Full-text available
This study examines the stability of the SAT Reasoning Test™ score scales from 2005 to 2010. A 2005 old form (OF) was administered along with a 2010 new form (NF). A new conversion for OF was derived through direct equipercentile equating. A comparison of the newly derived and the original OF conversions showed that Critical Reading and Mathematics score scales have experienced, at most, a moderate upward scale drift (no greater than 5 points on average), and the drift may be explained by an accumulation of random equating errors. The Writing score scale has experienced a significant upward scale drift (11 points on average), which may be caused by sources other than random equating errors.
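The direct equipercentile equating used in studies like this one maps each old-form raw score to the new-form raw score holding the same percentile rank. The sketch below shows the bare mechanics on simulated data; presmoothing and other operational details are omitted, and the score distributions are assumed for illustration.

```python
# Bare-bones equipercentile equating sketch: old-form (OF) raw scores are
# mapped to the new-form (NF) raw-score scale by matching percentile ranks.
# Smoothing and continuization details are omitted; data are simulated.
import numpy as np

rng = np.random.default_rng(2)
of_scores = rng.binomial(60, 0.55, 5000)   # simulated OF raw scores
nf_scores = rng.binomial(60, 0.58, 5000)   # simulated NF raw scores

def percentile_rank(x, scores):
    """Percent of examinees scoring below x, plus half of those exactly at x."""
    return 100.0 * (np.mean(scores < x) + 0.5 * np.mean(scores == x))

nf_grid = np.arange(nf_scores.min(), nf_scores.max() + 1)
nf_pr = np.array([percentile_rank(s, nf_scores) for s in nf_grid])

def equate(of_raw):
    """Find the NF raw score whose percentile rank matches the OF score's."""
    pr = percentile_rank(of_raw, of_scores)
    return np.interp(pr, nf_pr, nf_grid)

for raw in (20, 30, 40, 50):
    print(f"OF raw {raw} -> NF raw {equate(raw):5.2f}")
```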
... For instance, the norms group used at the time the scale was established may not be appropriate over time, as mentioned above in the SAT recentering case. Also, equating is imperfect both due to violations of equating assumptions and due to use of finite samples to estimate parameters (Haberman, Guo, Liu, & Dorans, 2008). ...
... In such a case, differences in equating results may be difficult to interpret. Given all that, it is not surprising that the findings of this study are not entirely consistent with those of Haberman et al. (2008), who examined time series composed of mean scaled scores and raw-to-scale conversions for 54 SAT verbal and math forms administered from April 1995 to December 2003. They found that, on the whole, the data provide a picture of stability. ...
Article
This study examines the stability of the SAT® scale from 1994 to 2001. A 1994 form and a 2001 form were readministered in a 2005 SAT administration, and the 1994 form was equated to the 2001 form. The new conversion was compared to the old conversion. Both the verbal and math sections exhibit a similar degree of scale drift, but in opposite directions: the verbal scale has drifted upward, whereas the math scale has drifted downward. We suggest testing programs monitor the score scales periodically by building a testing form schedule that allows a systematic and periodic checking of scale stability.
Chapter
Computational psychometrics, a blend of theory-driven psychometrics and data-driven algorithms, provides the theoretical underpinnings for the design and analysis of the new generation of high-stakes, digital-first assessments that can be taken anytime and anywhere in the world, and their scores impact test takers’ lives. The unprecedented flexibility, complexity, and high-stakes nature of these digital-first assessments pose enormous quality assurance challenges. In order to ensure these assessments meet both “the contest and the measurement” requirements of high-stakes tests, it is necessary to conduct continuous pattern monitoring and to be able to promptly react when needed. In this paper, we illustrate the development of a quality assurance system for a high-stakes and digital-first assessment. To build the system, educational data from continuous administrations of the assessments are mined, modeled and monitored. In particular, five categories of statistics are monitored to assure the quality of the assessment, including scores, test taker profiles, repeaters, item analysis and item exposure. Various control charts and models were applied to detect and flag the abnormal changes in the assessment statistics. The monitoring results and alerts were communicated with the stakeholders via an interactive dashboard. The paper concludes with a discussion on how the automatic quality assurance system is combined with the human review process in real-world application.
Article
In recent years, harmonic regression models have been applied to implement quality control for educational assessment data consisting of multiple administrations and displaying seasonality. As with other types of regression models, it is imperative that model adequacy checking and model fit evaluation be conducted appropriately. However, there has been no literature on how to perform a comprehensive model adequacy evaluation when applying harmonic regression models to sequential data with seasonality in the educational assessment field. This paper is intended to fill this gap with an illustration using real data from an English language assessment. Two types of cross-validation, leave-one-out and out-of-sample, were designed to measure prediction errors and check model validity. Three types of R-squared statistics and various residual diagnostics were applied to check model adequacy and model fit.
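A minimal sketch of the leave-one-out cross-validation described above, under assumed data: each administration's score mean is held out in turn, the harmonic regression is refit, and the held-out value is predicted; the root-mean-square of these prediction errors summarizes out-of-sample fit.

```python
# Leave-one-out cross-validation for a harmonic regression fit to
# administration-level score means. The simulated series and the single
# annual harmonic are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(72) / 12.0                       # 6 years of monthly administrations
y = 100 + 1.5 * t + 4 * np.sin(2 * np.pi * t) + rng.normal(0, 1.5, t.size)

X = np.column_stack([np.ones_like(t), t,
                     np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])

loo_errors = []
for i in range(len(y)):
    mask = np.ones(len(y), dtype=bool)
    mask[i] = False                            # hold out administration i
    beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    loo_errors.append(y[i] - X[i] @ beta)      # prediction error for the held-out mean

loo_errors = np.array(loo_errors)
print("LOO RMSE:", round(np.sqrt(np.mean(loo_errors ** 2)), 3))
```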
Article
For assessments that use different forms in different administrations, equating methods are applied to ensure comparability of scores over time. Ideally, a score scale is well maintained throughout the life of a testing program. In reality, instability of a score scale can result from a variety of causes, some are expected while others may be unforeseen. The situation is more challenging for assessments that assemble many different forms and deliver frequent administrations per year. Harmonic regression, a seasonal‐adjustment method, has been found useful in achieving the goal of differentiating between possible known sources of variability and unknown sources so as to study score stability for such assessments. As an extension, this paper presents a family of three approaches that incorporate examinees' demographic data into harmonic regression in different ways. A generic evaluation method based on jackknifing is developed to compare the approaches within the family. The three approaches are compared using real data from an international language assessment. Results suggest that all approaches perform similarly and are effective in meeting the goal. The paper also discusses the properties and limitations of the three approaches, along with inferences about score (in)stability based on the harmonic regression results.
Article
Maintaining comparability of test scores is a major challenge faced by testing programs that have almost continuous administrations. Among the potential problems are scale drift and rapid accumulation of errors. Many standard quality control techniques for testing programs, which can effectively detect and address scale drift for small numbers of administrations yearly, are not always adequate to detect changes in a complex, rapid flow of scores. To address this issue, Educational Testing Service has been conducting research into applying data mining and quality control tools from manufacturing, biology, and text analysis to scaled scores and other relevant assessment variables. Data mining tools can identify patterns in the data and quality control techniques can detect trends. This type of data analysis of scaled scores is relatively new, and this paper gives a brief overview of the theoretical and practical implications of the issues. More in-depth analyses to refine the approaches for matching the type of data from educational assessments are needed.
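Among the manufacturing-style quality control tools alluded to here, a Shewhart-type individuals chart is the simplest: administrations whose mean scaled scores fall outside 3-sigma limits set from a baseline window are flagged for review. The baseline length, limits, and simulated data below are illustrative assumptions.

```python
# Illustrative Shewhart-style individuals chart for scaled-score means:
# points outside the 3-sigma control limits are flagged for expert review.
# Baseline window and simulated data are assumptions for demonstration.
import numpy as np

rng = np.random.default_rng(4)
means = np.concatenate([rng.normal(500, 3, 24), rng.normal(512, 3, 6)])

baseline = means[:24]                      # administrations used to set limits
center = baseline.mean()
sigma = baseline.std(ddof=1)
ucl, lcl = center + 3 * sigma, center - 3 * sigma

flags = np.where((means > ucl) | (means < lcl))[0]
print(f"center {center:.1f}, limits [{lcl:.1f}, {ucl:.1f}], "
      f"flagged administrations: {flags.tolist()}")
```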