Figure 4. Response time distribution and proportion correct by response time

Source publication
Technical Report
Full-text available
In this study, we investigated how empirical indicators of test-taking engagement can be defined, empirically validated, and used to describe group differences in the context of the Programme for the International Assessment of Adult Competencies (PIAAC). The approach was to distinguish between disengaged and engaged response behavior by means of respons...

Contexts in source publication

Context 1
... method requires disengaged test-taking behavior to give rise to a bimodal item response time distribution (cf. Figure 4). Typically, there is a high-frequency spike during the initial seconds after the item is administered, followed by a region of low frequency, followed by another strong increase in frequency that finally decreases (cf. ...
Context 2
... upper part of Figure 4 shows a response time distribution which clearly has two modes and a clear-cut gap in between, which can be interpreted as a threshold separating disengaged and engaged response behavior. For the sample item Literacy C313410, the threshold was rated to be 12 s. ...
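The thresholding step described in these excerpts is straightforward to apply once item-level response times are available. The sketch below is a minimal illustration, not code from the report: the 12 s threshold for item C313410 is taken from the excerpt above, while the DataFrame and its column names are made up for the example.

```python
# Minimal sketch: inspect the item-level response time distribution and flag
# responses below a visually determined threshold (12 s for sample item
# Literacy C313410, as in the excerpt above). Column names are illustrative,
# not PIAAC variable names.
import pandas as pd

def flag_disengaged(rt_seconds: pd.Series, threshold: float) -> pd.Series:
    """True where the response time falls below the item's threshold."""
    return rt_seconds < threshold

df = pd.DataFrame({"item_id": ["C313410"] * 6,
                   "rt": [2.1, 3.5, 8.0, 45.2, 60.3, 120.7]})
df["disengaged"] = flag_disengaged(df["rt"], threshold=12.0)
print(df["disengaged"].mean())          # proportion of disengaged responses
# df["rt"].plot(kind="hist", bins=50)   # check for the bimodal shape and gap
```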

Citations

... Examinees who engage in more rapid guessing behavior perform lower than their peers; thus, their individual scores may underestimate their achievement levels and invalidate test outcomes. Consequently, aggregate scores, such as those at the country level, may be biased downwards, and if there are cultural differences in test-taking effort, the validity of country-level comparisons may be harmed (Debeer et al., 2014; Goldhammer et al., 2016). ...
... In the context of cross-country comparisons, studies have documented variations in test-taking effort across countries when examinees self-report the effort they invest on a test in surveys of the Trends in International Mathematics and Science Study (TIMSS; Eklöf et al., 2014) and the Programme for International Student Assessment (PISA; Eklöf, 2015). When item response times, which are considered less susceptible to response biases, are employed as measures of effort, engagement, or rapid guessing, country differences have also been reported in the Programme for the International Assessment of Adult Competencies (Goldhammer et al., 2016) and PISA (Azzolini et al., 2019; Guo & Ercikan, 2020; Michaelides & Ivanova, 2022). Many of these studies have additionally highlighted heterogeneous associations between the measure of effort used and test performance across country samples. ...
Article
Full-text available
International large-scale assessments are low-stakes tests for examinees and their motivation to perform at their best may not be high. Thus, these programs are criticized as invalid for accurately depicting individual and aggregate achievement levels. In this paper, we examine whether filtering out examinees who rapid-guess impacts country score averages and rankings. Building on an earlier analysis that identified rapid guessers using two different methods, we re-estimated country average scores and rankings in three subject tests of PISA 2015 (Science, Mathematics, Reading) after filtering out rapid-guessing examinees. Results suggest that country mean scores increase for all countries after filtering, but in most conditions the change in rankings is minimal, if any. A few exceptions with considerable changes in rankings were observed in the Science and Reading tests with methods that were more liberal in identifying rapid guessing. Lack of engagement and effort is a validity concern for individual scores, but has a minor impact on aggregate scores and country rankings.
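The filtering-and-re-ranking analysis summarized in this abstract can be approximated with a few lines of data wrangling. The sketch below assumes a respondent-level table with hypothetical columns ("country", "score", "rapid_guesser"); an actual PISA replication would additionally use sampling weights and all plausible values.

```python
# Sketch: re-estimate country means and rankings after filtering rapid
# guessers. Column names are hypothetical; a real PISA analysis would also
# use sampling weights and all plausible values.
import pandas as pd

def country_ranking(df: pd.DataFrame, score_col: str = "score") -> pd.Series:
    """Rank countries by mean score (1 = highest mean)."""
    return df.groupby("country")[score_col].mean().rank(ascending=False)

def compare_filtering(df: pd.DataFrame) -> pd.DataFrame:
    unfiltered = country_ranking(df)
    filtered = country_ranking(df[~df["rapid_guesser"]])
    return pd.DataFrame({"rank_unfiltered": unfiltered,
                         "rank_filtered": filtered,
                         "rank_change": unfiltered - filtered})
```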
... A pilot study was conducted using the 3 CR items in the Core stage; a subsequent analysis comprised more items (i.e., 44 CR items from Stage I). Separate analyses were conducted for students who took the reading test in the first and in the second session of the assessment, to account for the item position effect on effort (Goldhammer et al., 2016). The magnitude of the effortless test-taking behavior in Stage I was also evaluated separately for the high and low testlet difficulty levels. ...
... As hypothesized, the validity results for effort measures were similar across sessions, even though items administered later in the test yielded more effortless responses than the same items administered in the first half of the assessment, especially in low-difficulty testlets. The difference in the level of disengagement across test sessions was in line with previous literature confirming the item position effect on engagement (e.g., Ivanova et al., 2020; Debeer et al., 2014; Goldhammer et al., 2016; Wise & Kingsbury, 2016). ...
Article
Full-text available
Research on methods for measuring examinee engagement with constructed-response items is limited. The present study used data from the PISA 2018 Reading domain to construct and compare indicators of test-taking effort on constructed-response items: response time, number of actions, the union (combining effortless responses detected by either response time or number of actions measures or both), and the intersection of response time and number of actions (responses identified as effortless by both response-time and number of actions measures). A 10% normative threshold identification method was used for both response time and number of actions. Pre-defined validation criteria were used to explore the validity of each of the four indicators. Number of actions yielded a similar number of effortless responses as the union measure. Response time and intersection measures also had similar results and were related to lower disengagement than the number of actions and union indicators. With the normative threshold identification method, number of actions and the union of the two process data may result in a higher level of response misclassifications on some constructed-response items than response time and the intersection measures. Response time appears to be a more valid indicator of test-taking effort on constructed-response items than number of actions.
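The four indicators compared in this abstract (response time, number of actions, their union, and their intersection) can be sketched as follows, assuming the normative threshold is 10% of the item-level mean; the study's exact thresholding rule may differ, and the column names are illustrative.

```python
# Sketch of the four effort indicators compared above, assuming the normative
# threshold is 10% of the item-level mean (the study's exact rule may differ).
# Columns "item_id", "rt", "n_actions" are illustrative.
import pandas as pd

def normative_threshold(values: pd.Series, fraction: float = 0.10) -> float:
    return fraction * values.mean()

def effortless_flags(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col, flag in [("rt", "low_rt"), ("n_actions", "low_actions")]:
        item_thresholds = df.groupby("item_id")[col].transform(normative_threshold)
        out[flag] = df[col] < item_thresholds
    out["union"] = out["low_rt"] | out["low_actions"]          # either measure
    out["intersection"] = out["low_rt"] & out["low_actions"]   # both measures
    return out
```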
... These are defined based on the assumption that the minimum time required to complete an item differs from item to item. While test-takers can quickly solve a simple arithmetic problem, reading, interpreting, and solving a complex problem-solving task take much more time (Goldhammer, Martens, Christoph, & Lüdtke, 2016). This means that the threshold is not the same for all items but can differ item by item, task by task. ...
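Written out, this item-by-item logic amounts to a simple indicator function; the notation below is a generic formalization with symbols of our own choosing, not taken from the cited paper.

```latex
% Generic item-level engagement indicator; tau_j is the item-specific
% threshold and t_ij the response time of person i on item j.
\Delta_{ij} =
\begin{cases}
1 & \text{if } t_{ij} \ge \tau_j \quad \text{(engaged, solution behavior)}\\
0 & \text{if } t_{ij} < \tau_j \quad \text{(disengaged, rapid response)}
\end{cases}
```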
... Log data-based methods provide an opportunity to measure test-taking efforts more than a few times and during each item. A decrease in test-taking effort has been supported by a number of log data-based and model-based studies (Attali, 2016;Goldhammer et al., 2016;Nuutila, Tapola, Tuominen, Molnár, & Niemivirta, 2021;Penk & Richter, 2017;Wise, Pastor, & Kong, 2009). ...
... In previous research, a number of log data-based methods have been developed to identify unmotivated responses. These methods produce different results on the same sample (Goldhammer et al., 2016). Generally, there is a positive correlation between test-taking effort and test performance, but the relationship is not so clear when examining clustered groups of test-takers (Lundgren & Eklöf, 2020). ...
Article
Full-text available
The present study investigates students' test-taking effort by integrating and comparing traditional self-report questionnaire data and students' test-taking behavior, based on log data analyses. Previous studies have shown that different methods often lead to different results. A computer-based measure of complex problem-solving in uncertain situations was used to minimize the influence of factual knowledge on test performance. K-means cluster analysis was used to build groups of students differing in test-taking effort, resulting in 3 distinct groups. The correlation between students' test-taking effort and test performance proved to be weaker based on the self-reported questionnaire data than on their actual test-taking behavior. Both the self-report questionnaire and the log data showed a decrease in test-taking effort during the test. The number of clicks played the largest role in predicting performance. Results suggest that (1) self-report questionnaire data are not consistent with students' actual test-taking behavior and (2) it is not necessary to make the maximum effort to obtain valid test results, but a certain level of effort is needed. Educational relevance statement: In the implementation of effective personalised, smart education, an increasingly important role is played by accurate, fast, and valid diagnosis of students' ability level. As for educational relevance, we stated that: (1) Self-reported data are not always consistent with students' actual test-taking behavior; therefore, log data-based methods are more appropriate than self-report questionnaires to investigate test-taking effort. For problem-solving tasks, the P+ > 0 % method performed better. (2) For problem-solving tasks, the number of clicks plays the largest role in predicting performance. Using the number of clicks may increase the validity of response time-based methods. (3) It is not necessary to make the maximum effort to obtain valid test results, but rather to reach a certain level of effort.
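The clustering step described in this abstract is standard k-means on a small set of effort-related features. The sketch below is a hedged illustration: the three-cluster solution follows the abstract, but the feature names and preprocessing are assumptions rather than the study's actual specification.

```python
# Sketch of grouping students by test-taking effort with k-means (3 clusters,
# as in the abstract above). Feature names are illustrative; features are
# standardized before clustering.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def effort_clusters(features: pd.DataFrame, k: int = 3) -> pd.Series:
    scaled = StandardScaler().fit_transform(features)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
    return pd.Series(labels, index=features.index, name="effort_cluster")

# Usage (hypothetical columns):
# features = students[["mean_rt", "n_clicks", "self_reported_effort"]]
# students["effort_cluster"] = effort_clusters(features)
```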
... They were also less likely to have a low education level and more likely to have a high education level. These findings correspond with previous results showing that older participants and participants with a low education level were less likely to participate in the PSTRE assessment (OECD, 2013, p.92; see also Goldhammer et al., 2016). Further country-specific information about the samples can be found in the supplementary material Description by Country. ...
... For the flimsy pattern, the result that adults tend to stick to the flimsy pattern across tasks speaks for the latter. Accordingly, it may indicate test-taking disengagement (see Goldhammer et al., 2016;OECD, 2019). Another validation strategy for such an interpretation might involve a microanalytic look at talk and gesture data (e.g., Maddox, 2017). ...
Article
Full-text available
Background. A priori assumptions about specific behavior in test items can be used to process log data in a rule-based fashion to identify the behavior of interest. In this study, we demonstrate such a top-down approach and create a process indicator to represent what type of information processing (flimsy, breadth-first, satisficing, sampling, laborious) adults exhibit when searching online for information. We examined how often the predefined patterns occurred for a particular task, how consistently they occurred within individuals, and whether they explained task success beyond individual background variables (age, educational attainment, gender) and information processing skills (reading and evaluation skills). Methods. We analyzed the result and log file data of ten countries that participated in the Programme for the International Assessment of Adult Competencies (PIAAC). The information processing behaviors were derived for two items that simulated a web search environment. Their explanatory value for task success was investigated with generalized linear mixed models. Results. The results showed item-specific differences in how frequently specific information processing patterns occurred, with a tendency of individuals not to settle on a single behavior across items. The patterns explained task success beyond reading and evaluation skills, with differences across items as to which patterns were most effective for solving a task correctly. The patterns even partially explained age-related differences. Conclusions. Rule-based process indicators have their strengths and weaknesses. Although dependent on the clarity and precision of a predefined rule, they allow for a targeted examination of behaviors of interest and can potentially support educational intervention during a test session. Concerning adults’ digital competencies, our study suggests that the effective use of online information is not inherently based on demographic factors but mediated by central skills of lifelong learning and information processing strategies.
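As a concrete (and deliberately toy) illustration of such a top-down, rule-based indicator, the sketch below classifies a simulated page-visit log from a web-search item into one of the five named patterns. The rules are simplified stand-ins invented for this example; the study's actual definitions are item-specific and more precise.

```python
# Toy rule-based classifier for information-processing patterns in a simulated
# web-search item log. The rules are simplified stand-ins for illustration,
# not the definitions used in the cited study.
from typing import List, Set

def classify_pattern(pages: List[str], time_on_task: float,
                     relevant_pages: Set[str]) -> str:
    n_unique = len(set(pages))
    n_relevant = len(set(pages) & relevant_pages)
    if n_unique <= 1 and time_on_task < 30:
        return "flimsy"         # barely any search activity
    if n_relevant >= 1 and n_unique <= 3:
        return "satisficing"    # settles on the first useful page
    if n_unique >= 5 and time_on_task < 120:
        return "breadth-first"  # skims many pages quickly
    if n_relevant >= 2:
        return "sampling"       # returns to several relevant pages
    return "laborious"          # extensive, time-consuming processing

print(classify_pattern(["page1"], 12.0, {"page3"}))  # -> "flimsy"
```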
... As the model is defined on one level only, the authors call it the dependent latent class single level item response theory (DLC-SL-IRT) model. As Equations (5) and (6) assume that persons do not differ regarding their response speed, it is also assumed that individual test-taking speed and ability are not correlated, contradicting several studies that suggest the opposite (e.g., Goldhammer et al., 2016;Lee & Jia, 2014;Nagy & Ulitzsch, 2022). ...
... Although response engagement and ability were repeatedly shown to be related, the size of the correlation is usually estimated to range between .25 and .70 (e.g., Goldhammer et al., 2016;Ulitzsch et al., 2021). This suggests rather limited evidence for individual differences in the threshold for response engagement in the present sample. ...
... First, for reasons of practicability, we chose a study with a rather homogenous sample and a rather short test. Because previous studies showed that age, educational attainment, and country can be associated with disengaged responding (e.g., Goldhammer et al., 2016;Lindner et al., 2019), more representative samples might identify more disengaged responses and, thus, stronger effects for the novel indicators. Second, the extended model included only one random effect for all four response engagement indicators. ...
Article
Disengaged responding poses a severe threat to the validity of educational large-scale assessments, because item responses from unmotivated test-takers do not reflect their actual ability. Existing identification approaches rely primarily on item response times, which bears the risk of misclassifying fast engaged or slow disengaged responses. Process data, with their rich pool of additional information on the test-taking process, could thus be used to improve existing identification approaches. In this study, three process data variables (text reread, item revisit, and answer change) were introduced as potential indicators of response engagement for multiple-choice items in a reading comprehension test. An extended latent class item response model for disengaged responding was developed by including the three new indicators as additional predictors of response engagement. In a sample of 1,932 German university students, the extended model indicated a better model fit than the baseline model, which included item response time as the only indicator of response engagement. In the extended model, both item response time and text reread were significant predictors of response engagement. However, graphical analyses revealed no systematic differences in the item and person parameter estimation or item response classification between the models. These results suggest only a marginal improvement of the identification of disengaged responding by the new indicators. Implications of these results for future research on disengaged responding with process data are discussed.
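The extended model in this abstract is a latent class item response model, which requires specialized estimation. As a much simpler stand-in (and explicitly not the authors' model), the sketch below screens the same four indicators against an existing engagement flag with an ordinary logistic regression; all column names are hypothetical.

```python
# Simplified stand-in, NOT the latent class IRT model from the abstract:
# screen the four candidate indicators (log response time, text reread,
# item revisit, answer change) against an existing engagement flag using
# logistic regression. Column names are hypothetical.
import pandas as pd
import statsmodels.api as sm

def screen_indicators(df: pd.DataFrame):
    predictors = ["log_rt", "text_reread", "item_revisit", "answer_change"]
    X = sm.add_constant(df[predictors])
    return sm.Logit(df["engaged"], X).fit(disp=0)

# print(screen_indicators(responses).summary())
```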
... In terms of test type, examinee effort was found to be lower for longer and more difficult tests (Barry & Finney, 2016), for Reading than for Mathematics tests (Wise et al., 2010), for Problem Solving and Literacy than for Numeracy tests (Goldhammer, Martens, Christoph, & Lüdtke, 2016), and for tests carrying low or no consequences for the test-takers (Wise, Kingsbury, Thomason, & Kong, 2004). Effort was not significantly influenced by the time of the year or by the day of the week a test is administered, unlike the time of the day the test is taken, with solution behavior occurring earlier in the day and diminishing as the day progresses, regardless of the student grade or the subject examined (Mathematics or Reading; Wise et al., 2010). ...
... A common concern in discussions of international large-scale assessments with performance rankings is that the low level of effort invested by examinees may invalidate country-level comparisons (Debeer et al., 2014; Goldhammer et al., 2016). Swedish 12th-graders in TIMSS 2008 Advanced demonstrated lower self-reported test-taking effort, poorer test performance, and a stronger relationship between test-taking effort and performance compared to Norwegian and Slovenian samples; no significant differences across the three countries were observed when comparing only students reporting a high level of test-taking effort (Eklöf, Pavešič, & Grønmo, 2014). ...
... Cross-national differences in effort may also be dependent on the subject examined. Larger cross-country discrepancies in test-taking engagement, measured via item response times, were found in problem solving than in Literacy and Numeracy in the Programme for the International Assessment of Adult Competencies (PIAAC; Goldhammer et al., 2016). However, cross-cultural findings should be cautiously interpreted because studies have reported cultural differences in response tendencies in self-report scales (van de Vijver & He, 2014) and in response time scales (Shin, Kerzabi, Joo, Robin, & Yamamoto, 2020). ...
Article
Full-text available
In low-stakes assessments, when test-takers do not invest adequate effort, test scores underestimate the individual’s true ability, and ignoring the impact of test-taking effort may harm the validity of test outcomes. The study examined the level of examinees’ test-taking effort and accuracy in the Programme for International Student Assessment (PISA) across countries and different item types. The 2015 PISA computerized assessment was administered in 59 jurisdictions. Behavioral measures of students’ test-taking effort were constructed for the Mathematics and Reading assessments by applying a fixed and a normative threshold on item response times to identify rapid guessing. The proportion of rapid guessers on each item was found to be small on average: about 3% with the normative threshold and 1% with a fixed five-second threshold. Rapid guessing was about twice as high in human-coded open-response items as in simple and complex multiple-choice items and computer-scored open-response items. Average performance for rapid guessers was much lower than for test-takers engaged in solution behavior for all types of items, and the gap was more pronounced in Reading than in Mathematics. Weighted response time effort indicators by country were very high and positively correlated with country mean PISA score. No other robust correlates were found with response time effort at the country level. Computerized test administrations facilitate the use of response time as a proxy for examinee test-taking effort. Programs may monitor this behavior to identify cross-country differences prior to comparisons of performance and for developing interventions to promote engagement with the assessment.
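The response time effort (RTE) logic referred to in this abstract, i.e., the proportion of an examinee's responses showing solution behavior, can be sketched as below. The column names, the fixed 5 s default, and the weighting scheme are illustrative assumptions, not the study's exact implementation.

```python
# Sketch: response time effort (RTE) per examinee, i.e., the proportion of
# items answered with solution behavior (RT at or above the threshold), and a
# weighted country-level aggregate. Column names and defaults are illustrative.
from typing import Optional
import pandas as pd

def rte_per_examinee(df: pd.DataFrame,
                     item_thresholds: Optional[dict] = None,
                     fixed_threshold: float = 5.0) -> pd.Series:
    thr = df["item_id"].map(item_thresholds) if item_thresholds else fixed_threshold
    solution = df["rt"] >= thr
    return solution.groupby(df["student_id"]).mean()

def weighted_country_rte(df: pd.DataFrame, rte: pd.Series) -> pd.Series:
    students = df.drop_duplicates("student_id").set_index("student_id")
    w = students["weight"]
    weighted_sum = (rte * w).groupby(students["country"]).sum()
    return weighted_sum / w.groupby(students["country"]).sum()
```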
... The present study also presents certain limitations. First, the methodology used herein represents one among several alternative approaches to modeling response times (e.g., Pokropek, 2016; Man et al., 2018), each with its own advantages and disadvantages (Goldhammer et al., 2016). Second, although the examination of response times provides insights into participant behaviors, those conclusions are correlational in nature. ...
Article
Full-text available
The goal of the present study was to evaluate the roles of response times in the achievement of students in the following latent ability domains: (a) verbal, (b) math and spatial reasoning, (c) mental flexibility, and (d) scientific and mechanical reasoning. Participants were 869 students who took the Multiple Mental Aptitude Scale. A mixture item response model was implemented to evaluate the roles of response times in performance by modeling ability and non-ability classes. Results after applying this model to the data across domains indicated the presence of several behaviors related to rapid responding that covaried with low achievement, likely representing unsuccessful guessing attempts.
... Concretely, when many fast disengaged responses are present, the correlation between speed and ability will likely be more negative than it would be if those disengaged responses were excluded from the analysis. While RTs may provide relevant information for determining disengaged responding (e.g., see Goldhammer et al. 2016; Nagy & Ulitzsch 2021), it is unlikely that any method will succeed in detecting disengaged responses with such a degree of accuracy that their presence no longer biases the estimate of the correlation between speed and ability. [Footnote: It is important to stress here that the estimated ability effectively just summarizes the observed performance on the test, meaning that it only captures "effective ability."] Consequently, there remains ...
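The direction of the bias described here can be illustrated with a small simulation: once a share of respondents answers very fast at guessing-level accuracy, the observed speed-ability correlation turns negative even when the engaged respondents' speed and ability are unrelated. All generating values below are arbitrary.

```python
# Small simulation of the point above: mixing in fast, low-accuracy
# disengaged respondents pushes the observed speed-ability correlation in
# the negative direction. All generating values are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
ability = rng.normal(size=n)
speed = rng.normal(size=n)                       # uncorrelated with ability here
prop_correct = 1 / (1 + np.exp(-ability)) + rng.normal(0, 0.05, n)
mean_log_rt = 3.0 - 0.5 * speed + rng.normal(0, 0.2, n)

print(np.corrcoef(-mean_log_rt, prop_correct)[0, 1])   # roughly zero

# Make 20% of respondents disengaged: very fast, guessing-level accuracy.
mask = rng.random(n) < 0.20
prop_correct[mask] = rng.uniform(0.10, 0.30, mask.sum())
mean_log_rt[mask] = rng.normal(0.5, 0.2, mask.sum())

print(np.corrcoef(-mean_log_rt, prop_correct)[0, 1])   # clearly negative
```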
Chapter
Full-text available
With the advance of computerized testing in educational and psychological measurement, the availability of response time data is becoming commonplace, and practitioners are faced with the question of whether and how they should incorporate this information into their measurement models. For this purpose, the use of the hierarchical model is often considered, which promises to improve the precision of measurement and has various other appealing properties. However, practitioners also need to be aware of the several limitations and risks involved when using this model, which have been covered less extensively in the literature. This chapter covers both the advantages and disadvantages of using the hierarchical model, to allow practitioners to form a balanced assessment of the potential use of the hierarchical model for their testing application.
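For orientation, one common formulation of the hierarchical model combines a lognormal model for response times with an item response model for accuracy, linked through correlated person parameters (speed and ability); the notation below follows the usual conventions associated with van der Linden (2007) rather than this chapter specifically.

```latex
% Level 1: lognormal response time model and a 2PL response model.
\log t_{ij} \sim \mathcal{N}\!\left(\beta_j - \tau_i,\; \alpha_j^{-2}\right),
\qquad
P(X_{ij}=1 \mid \theta_i) = \frac{\exp\{a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}}

% Level 2: correlated person parameters (ability theta, speed tau).
(\theta_i, \tau_i)^{\top} \sim \mathcal{MVN}(\boldsymbol{\mu}_P, \boldsymbol{\Sigma}_P),
\qquad
\rho_{\theta\tau} = \frac{\sigma_{\theta\tau}}{\sigma_{\theta}\,\sigma_{\tau}}
```

The second-level correlation between ability and speed is the quantity that fast disengaged responding can distort, as noted in the excerpt above.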
... The finding that students from higher grades show higher rates of disengagement is consistent with earlier research with MAP Growth (Wise et al., 2010) and results from international achievement assessments (e.g., Eklöf & Knekta, 2014). Additional research has reported test-taking disengagement increasing with age (e.g., Goldhammer et al., 2016). Although the reasons for the trend are not entirely clear, the grade range investigated in the current study corresponds to a transition from childhood into adolescence, and the increase may reflect an emerging desire for independence that tends to develop during that time period. ...
Article
The arrival of the COVID-19 pandemic had a profound effect on K-12 education. Most schools transitioned to remote instruction, and some used remote testing to assess student learning. Remote testing, however, is less controlled than in-school testing, leading to concerns regarding test-taking engagement. This study compared the disengagement of students remotely administered an adaptive interim assessment in spring 2020 with their disengagement on the assessment administered in-school during fall 2019. Results showed that disengagement gradually increased across grade level. This pattern was not meaningfully different between the two testing contexts, with the exception of results for American Indian/Alaska Native students, who showed higher disengagement under remote testing. In addition, the test’s engagement feature – which automatically paused the test event of a disengaged student and notified the test proctor – had a consistently positive impact whether the proctor was in the same room as the student or proctoring was done remotely.
... While RG is a possible outcome for all examinees that perceive their test performance to have minimal consequences, its occurrence has been found to differ between subgroups. For instance, previous researchers have found differential rates of RG among examinee subgroups that vary in gender (DeMars et al., 2013), age (Kuhfeld & Soland, 2020), ethnicity (Soland, 2018), primary language (native vs. non-native; Goldhammer et al., 2016) and nationality (Rios & Guo, 2020). Given these differential rates, researchers have recently called for score users to evaluate how differential RG behavior can influence perceptions of measurement invariance, a statistical property that ensures an instrument measures the same construct equally well across examinee subgroups (Milfont & Fischer, 2010). ...
... On low-stakes assessments, examinee effort tends to decrease as the test progresses (Goldhammer et al., 2016;Pastor et al., 2019). To reflect this, each test form was divided into five bins with 12 items each. ...
Article
Full-text available
When there are no personal consequences associated with test performance for examinees, rapid guessing (RG) is a concern and can differ between subgroups. To date, the impact of differential RG on item-level measurement invariance has received minimal attention. To that end, a simulation study was conducted to examine the robustness of the Mantel-Haenszel (MH), standardization index (STD), and logistic regression (LR) differential item functioning (DIF) procedures to type I error in the presence of differential RG. Sample size, test difficulty, group impact, and differential RG rates were manipulated. Findings revealed that the LR procedure was completely robust to type I errors, while slightly elevated false positive rates (< 1%) were observed for the MH and STD procedures. An applied analysis examining data from the Programme for International Student Assessment showed minimal differences in DIF classifications when comparing data in which RG responses were unfiltered and filtered. These results suggest that large rates of differences in RG rates between subgroups are unassociated with false positive classifications of DIF.
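The logistic regression DIF procedure evaluated in this abstract is typically implemented as a nested model comparison for each item. The sketch below is a bare-bones version with hypothetical column names; operational analyses would add purification of the matching score and effect size criteria.

```python
# Bare-bones sketch of a logistic regression (LR) DIF check for one item:
# compare a model with only the matching score against one that adds group
# and score-by-group terms. Column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

def lr_dif_pvalue(item_data: pd.DataFrame) -> float:
    base = smf.logit("correct ~ total_score", data=item_data).fit(disp=0)
    full = smf.logit("correct ~ total_score * group", data=item_data).fit(disp=0)
    lr_stat = 2 * (full.llf - base.llf)
    df_diff = full.df_model - base.df_model        # group + interaction terms
    return stats.chi2.sf(lr_stat, df_diff)         # small p-value flags DIF
```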