Article

Item Response Theory: Parameter Estimation Techniques

... In this sense, we assume the graded item response approach combined with the ability and item information functions proposed by [8,81]. The quality scales have been developed adopting the Item Response Theory (IRT) model [7]. ...
... In IRT, the evaluation information is defined in terms of item information functions I_i(θ), which measure how well responses in that category estimate the examinee's ability [8]. ...
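For the two-parameter logistic (2PL) model, the item information function mentioned above has the closed form I_j(θ) = a_j² P_j(θ)(1 − P_j(θ)). A minimal sketch (the item parameters below are illustrative, not taken from the cited study):

```python
import math

def icc_2pl(theta, a, b):
    """Item characteristic curve: probability of a correct response
    under the two-parameter logistic (2PL) model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: I(theta) = a^2 * P * (1 - P).
    Information peaks where theta equals the difficulty b."""
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

print(item_information(0.0, a=1.5, b=0.0))  # 0.5625, the maximum for this item
print(item_information(2.0, a=1.5, b=0.0))  # smaller, away from b
```

Summing I_j(θ) over the items of a test gives the test information function, whose inverse approximates the squared standard error of the ability estimate at θ.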
... On a graded scale from 1 (strongly disagree) to 5 (strongly agree), the assessment items M1 to M7 received values above 3.5 (average 3.8). Despite the small sample of specialists, all reported having good experience with key concepts of awareness (D1), collaboration (D2), and HCI (D3), corroborating the quality of the responses. ...
Article
Full-text available
Awareness has been a valuable concept in collaborative systems since its formation, an essential part of groupware. Awareness research followed the evolution of the whole field over the last decades. We can see the progress in a mutual understanding of awareness and developing concepts and technology of awareness support. An efficient awareness mechanism ensures a better understanding and, consequently, a better projection of future actions; in contrast, the lack of these mechanisms undermines comprehension and prevents participants from projecting their work accordingly. Few works present methods or processes that assist in providing awareness in groupware systems; most common strategies focus on the design/development stages or are ad-hoc evaluation models. Furthermore, there are no standardized tests for awareness assessment; thus, measures must be established to assess awareness and identify the criteria for achieving awareness indicators. This work establishes an assessment model for collaborative interfaces by analyzing the awareness mechanisms provided from the participant’s viewpoint. In this model, we consider the participant’s skill in understanding the awareness and the difficulty involved, providing advances toward designing, developing, and evaluating groupware systems. The proposed assessment model allows us to measure the awareness support provided considering the collaboration, workspace, and contextual awareness perspectives. Assuming a plural collaborative environment, where different participants with different skills, knowledge, and wisdom meet and interact, this model seeks to build a more faithful representation of these existing profiles across a broad spectrum of individual abilities.
... Inclusion of person covariates in the 2PCMPM would offer the advantages of a one-step procedure, e.g., regarding uncertainty quantification. IRT models are often used to investigate reliability as a function of the latent trait (Baker & Kim, 2004). The inclusion of person covariates in the 2PCMPM would allow doing so while controlling for construct-irrelevant person covariates, such as typing speed in the context of DT or verbal fluency (Forthmann et al., 2017). ...
... In Equation (1), a_j denotes the slope and d_j denotes the intercept in this slope-intercept parameterization (as opposed to the also commonly used discrimination-difficulty parameterization, λ_ij = exp(a_j(θ_i − b_j)), obtainable from Equation (1) with discrimination a_j and difficulty b_j = −d_j/a_j). The slope-intercept parameterization is often used in IRT method development and software implementations (Baker & Kim, 2004). We found this parameterization helpful for including item and person covariates. ...
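The mapping between the two parameterizations can be sketched directly; the function names and example values below are ours, chosen for illustration:

```python
def slope_intercept_to_irt(a, d):
    """Convert slope-intercept parameters (logit = a*theta + d) to the
    discrimination-difficulty form a*(theta - b), with b = -d / a."""
    return a, -d / a

def irt_to_slope_intercept(a, b):
    """Inverse conversion: intercept d = -a * b."""
    return a, -a * b

disc, diff = slope_intercept_to_irt(1.2, -0.6)
print(diff)  # 0.5: an intercept of -0.6 with slope 1.2 is a difficulty of 0.5
```

The round trip is exact, which is why software can estimate in slope-intercept form and report discrimination-difficulty parameters afterwards.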
... Even for binary data, which has received considerably more attention in IRT research than count data, the best known explanatory models (i.e., the Log-Linear Test Model, LLTM; Fischer, 1973; for item covariates and the Latent Regression Model, LRM; Zwinderman, 1991; for person covariates) are also based on a one-parameter (Rasch) model (but note that explanatory versions are available for other IRT models, such as for the 2PL model; see De Boeck & Wilson, 2004, for more details). However, empirical data are often better described by (at least) two-parameter models (Baker & Kim, 2004). Thus, understanding item differences between more than difficulty parameters (i.e., between discriminations and dispersions) is very relevant. ...
Article
In psychology and education, tests (e.g., reading tests) and self-reports (e.g., clinical questionnaires) generate counts, but corresponding Item Response Theory (IRT) methods are underdeveloped compared to binary data. Recent advances include the Two-Parameter Conway-Maxwell-Poisson model (2PCMPM), generalizing Rasch’s Poisson Counts Model, with item-specific difficulty, discrimination, and dispersion parameters. Explaining differences in model parameters informs item construction and selection but has received little attention. We introduce two 2PCMPM-based explanatory count IRT models: The Distributional Regression Test Model for item covariates, and the Count Latent Regression Model for (categorical) person covariates. Estimation methods are provided and satisfactory statistical properties are observed in simulations. Two examples illustrate how the models help understand tests and underlying constructs.
... Demographic changes and political economic conditions have intensified the need and demand for more efficient health care operations, including a call to reduce elective surgery wait-times. For example, Health Quality Ontario, an organization established to advise the province regarding the performance of its $55 billion annual health care expenditures, maintains an up-to-date public Internet dashboard listing surgical wait-times for six key categories of procedures, not only at the provincial level but also by region and individual hospital [1]. More recently, the province of Ontario has been trying to find ways, including the privatization of healthcare, to reduce high and chronic surgical wait-times, which have worsened considerably over the years and during the pandemic [2,3]. ...
... Item response theory (IRT; [1]) is the statistical analysis of test items in education, psychology, and other fields of social sciences. Typically, a number of test items are administered to test-takers. ...
... Some identification constraints on item parameters γ_i or distribution parameters α must be imposed to ensure model identification [9]. Once the IRT model (1) has been estimated, individual abilities θ can be estimated by maximizing the log-likelihood function l, which yields the most likely value of θ given a vector of item responses x. The log-likelihood function is given by (see [1]) ...
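Although the excerpt's formula is cut off, the usual binary-response log-likelihood and its maximization can be sketched for the 2PL case. The item parameters are illustrative, and a grid search stands in for the Newton-Raphson iteration a production implementation would use:

```python
import math

def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response vector under the 2PL model;
    `items` is a list of (discrimination a, difficulty b) pairs."""
    ll = 0.0
    for x, (a, b) in zip(responses, items):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        ll += math.log(p) if x == 1 else math.log(1.0 - p)
    return ll

def mle_ability(responses, items):
    """Maximum-likelihood ability estimate via a simple grid search
    over [-4, 4]."""
    grid = [i / 100.0 for i in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]
print(mle_ability([1, 1, 0], items))
```

For all-correct or all-incorrect response patterns the likelihood has no interior maximum, which is one reason the Bayesian alternatives discussed in other excerpts here are often preferred.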
Article
Full-text available
In a series of papers, Dimitrov suggested the classical D-scoring rule for scoring items that give difficult items a higher weight while easier items receive a lower weight. The latent D-scoring model has been proposed to serve as a latent mirror of the classical D-scoring model. However, the item weights implied by this latent D-scoring model are typically only weakly related to the weights in the classical D-scoring model. To this end, this article proposes an alternative item response model, the modified Ramsay quotient model, that is better-suited as a latent mirror of the classical D-scoring model. The reasoning is based on analytical arguments and numerical illustrations.
... Parameter estimation has been a major concern in the application of IRT models. Estimation may be performed in four ways: joint maximum likelihood, conditional maximum likelihood, marginal maximum likelihood, and Bayesian estimation with a Markov chain Monte Carlo (MCMC) algorithm [13]. In addition, two newer algorithms have gained researchers' attention in recent years. ...
... where f(·) is the corresponding probability distribution function of the augmented variable L_ik, I(·) denotes the indicator function, and π(·) denotes the prior distribution. A natural prior for the latent trait θ_i is the standard normal distribution, N(0, 1), which has been commonly used for calculating the posterior mode or mean of the latent trait in IRT-based scoring [13]. Because the item discrimination parameters are usually restricted to be positive, a natural prior for α_k is a truncated normal distribution, e.g. ...
... From Figure 8, we can see that for the small and large total-score parts, all the methods fit the data well and perform similarly; for the medium total-score part (i.e., scores of 10–20), the three methods slightly underestimate the observed score frequency. ...
Article
In the context of both cognitive and affective tests, items are usually designed to involve more than two responses, for which polytomous models are applicable. The purpose of this paper is to propose a highly effective Pólya-Gamma Gibbs sampling algorithm based on auxiliary variables to estimate the multidimensional graded response model that has been widely used in psychological, educational, and health-related assessment. The strategy is based on the Pólya-Gamma family of distributions, which provides a closed-form posterior distribution for logistic-based models. With the introduction of the two latent variables, the full conditional distributions are tractable, and consequently the Gibbs sampling is easy to implement. Desirable features, including the empirical performance of the proposed methodology, are demonstrated by simulation studies. Finally, two empirical data sets were analysed to demonstrate the efficiency and utility of the proposed method.
... From an empirical perspective, we show that the proposed algorithm substantially improves the cost-accuracy trade-offs compared with the baselines on several real-world datasets from various domains collected at Meta. We also propose a third variant, POAKI, which reduces the number of latent variables by incorporating item response theory (IRT) (Baker and Kim 2004). ...
... where d_j, b_j, and p0_j correspond to item j's difficulty, separation, and base rate, and P_IRT is the probability that answer x_ij is correct (Baker and Kim 2004). We adopt the concept and the formula but use it in two unusual ways: ...
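The excerpt does not reproduce the cited formula itself. A standard three-parameter form consistent with these parameter names treats the base rate p0_j as a lower asymptote; this specific functional form is our assumption for illustration, not necessarily the one used in the paper:

```python
import math

def p_irt(theta, d, b, p0):
    """3PL-style response curve: difficulty d, separation (slope) b,
    and base rate p0 acting as the lower asymptote."""
    return p0 + (1.0 - p0) / (1.0 + math.exp(-b * (theta - d)))

print(p_irt(0.0, d=0.0, b=1.0, p0=0.25))  # 0.625: halfway between p0 and 1
```

As theta falls far below the difficulty, the probability approaches the base rate p0 rather than zero, which is what makes a base-rate parameter useful for modeling labelers who are right by chance.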
Article
Crowdsourcing platforms use various truth discovery algorithms to aggregate annotations from multiple labelers. In an online setting, however, the main challenge is to decide whether to ask for more annotations for each item to efficiently trade off cost (i.e., the number of annotations) for quality of the aggregated annotations. In this paper, we propose a novel approach for general complex annotation (such as bounding boxes and taxonomy paths), that works in an online crowdsourcing setting. We prove that the expected average similarity of a labeler is linear in their accuracy conditional on the reported label. This enables us to infer reported label accuracy in a broad range of scenarios. We conduct extensive evaluations on real-world crowdsourcing data from Meta and show the effectiveness of our proposed online algorithms in improving the cost-quality trade-off.
... The most adopted method for estimating IRT parameters is marginal maximum likelihood (MML) based on the work of Bock and Lieberman (1970) and Bock and Aitkin (1981). Other estimation methods have been reviewed by Baker (1992) and Ackerman (1991). Existing literature focuses on technical and theoretical comparison of IRT with classical models. ...
... The most adopted method for estimating IRT parameters is marginal maximum likelihood (MML), based on the work of Bock and Lieberman (1970) and Bock and Aitkin (1981). Other estimation methods have been reviewed by Baker (1992), and an MCMC approach has been described by Patz and Junker (1999). Thomas and Cyr (2002) used the three-parameter logistic model and discussed various IRT issues, including point and variance estimates of item parameters, the potential for bias due to ignoring survey weights, biases in the distribution of ability predictors, and the dependence of this bias on test length. ...
Article
Economic efficiency demands an accurate assessment of individual ability for selection purposes. This study investigates Classical Test Theory (CTT) and Item Response Theory (IRT) for estimating true ability and ranking individuals. Two Monte Carlo simulations and real data analyses were conducted. Results suggest a slight advantage for IRT, but ability estimates from both methods were highly correlated (r=0.95), indicating similar outcomes. The Logistic two-parameter IRT model emerged as the most reliable and rigorous approach.
... 35 Only a few items had partially missing difficulty threshold β, and respondents had no choice of extremely low values. 36 This might be related to the fact that respondents were all thoracic surgery nurses with relevant work experience, and low-value options were not consistent with the actual situations of these nurses. In future research, the expression of these options can be improved. ...
... The present questionnaire was developed by referring to the relevant literature and consulting thoracic surgery experts from hospitals and universities in many locations across China, and it was based on KAP theory. 36,37 The team members have rich research experience in the field, and the team was comprehensive and extensive, representing the development level of thoracic surgery in Mainland China. This rendered the questionnaire items highly relevant to both theory and practice. ...
Article
Full-text available
Objective This study aims to develop and validate a suitable scale for assessing the level of nurses' knowledge and practice of perioperative pulmonary rehabilitation. Methods We divided the study into two phases: scale development and validation. In Phase 1, the initial items were generated through a literature review. In Phase 2, a cross-sectional survey was conducted involving 603 thoracic nurses to evaluate the scale's validity, reliability, and the difficulty and differentiation of its items. Item and exploratory factor analyses were performed for item reduction. Thereafter, validity, reliability, and the difficulty and differentiation of items were assessed using Cronbach's α coefficient, retest reliability, content validity, and item response theory (IRT). Results The final questionnaire comprised 34 items, and exploratory factor analysis revealed 3 common dimensions with internal consistency coefficients of 0.950, 0.959, and 0.965. The overall internal consistency of the scale was 0.966, with a split-half reliability of 0.779 and a retest reliability Pearson's correlation coefficient of 0.936. The content validity of the scale was excellent (item-level content validity index = 0.875–1.000, scale-level content validity index = 0.978). The difficulty and differentiation of the items under item response theory were verified to a certain extent (average value = 2.391; threshold β values = −1.393–0.820). Conclusions The knowledge–attitudes–practices questionnaire for nurses can be used as a tool to evaluate knowledge, attitudes, and practices among nurses regarding perioperative pulmonary rehabilitation for patients with lung cancer.
... Latent trait models, also known as item response theory (IRT) models, have gained widespread application in educational testing and psychological measurement (Lord and Novick, 1968;van der Linden and Hambleton, 1997;Embretson and Reise, 2000;Baker and Kim, 2004). These models utilize the probability of a response to establish the interaction between an examinee's "ability" and the characteristics of the test items, such as difficulty and guessing. ...
... The normal ogive IRT model (Lord, 1980; van der Linden and Hambleton, 1997; Embretson and Reise, 2000; Baker and Kim, 2004), also known as the one-parameter normal ogive model, is a mathematical model used in psychometrics to relate the latent ability of an examinee to the probability of a correct response on a test item. This model, as a component of IRT, facilitates the design, analysis, and scoring of tests, questionnaires, and comparable instruments intended for the measurement of abilities, attitudes, or other variables. ...
Article
Full-text available
This paper primarily analyzes the one-parameter generalized logistic (1PGlogit) model, which is a generalized model containing other one-parameter item response theory (IRT) models. The essence of the 1PGlogit model is the introduction of a generalized link function that includes the probit, logit, and complementary log-log functions. By transforming different parameters, the 1PGlogit model can flexibly adjust the speed at which the item characteristic curve (ICC) approaches the upper and lower asymptote, breaking the previous constraints in one-parameter IRT models where the ICC curves were either all symmetric or all asymmetric. This allows for a more flexible way to fit data and achieve better fitting performance. We present three simulation studies, specifically designed to validate the accuracy of parameter estimation for a variety of one-parameter IRT models using the Stan program, illustrate the advantages of the 1PGlogit model over other one-parameter IRT models from a model fitting perspective, and demonstrate the effective fit of the 1PGlogit model with the three-parameter logistic (3PL) and four-parameter logistic (4PL) models. Finally, we demonstrate the good fitting performance of the 1PGlogit model through an analysis of real data.
... Typically, the MC items are scored dichotomously and the CR items are scored polytomously. Item response theory (IRT) models are often used for calibrating dichotomous or polytomous test data (Baker & Kim, 2004; Nering & Ostini, 2011). Tests with such mixed item formats are calibrated by modeling both item formats in a single analysis using combinations of different IRT models, such as a combination of the three-parameter logistic model (3PLM; Birnbaum, 1968) and the graded response model (GRM; Samejima, 1969). ...
... Bayesian and maximum likelihood estimation are two common methods for ability parameter estimation. Although the latter requires less computation time, it may not be as stable as Bayesian estimation (Baker & Kim, 2004). Bayesian solutions are often employed to estimate IRT ability parameters, such as the expected a posteriori (EAP; Bock & Mislevy, 1982) for either dichotomous or polytomous items alone. ...
Article
Full-text available
Large-scale tests often contain mixed-format items, such as when multiple-choice (MC) items and constructed-response (CR) items are both contained in the same test. Although previous research has analyzed both types of items simultaneously, this may not always provide the best estimate of ability. In this paper, a two-step sequential Bayesian (SB) analytic method under the concept of empirical Bayes is explored for mixed item response models. This method integrates ability estimates from different item formats. Unlike the empirical Bayes method, the SB method estimates examinees’ posterior ability parameters with individual-level sample-dependent prior distributions estimated from the MC items. Simulations were used to evaluate the accuracy of recovery of ability and item parameters over four factors: the type of the ability distribution, sample size, test length (number of items for each item type), and person/item parameter estimation method. The SB method was compared with a traditional concurrent Bayesian (CB) calibration method, EAPsum, that uses scaled scores for summed scores to estimate parameters from the MC and CR items simultaneously in one estimation step. From the simulation results, the SB method showed more accurate and reliable ability estimation than the CB method, especially when the sample size was small (150 and 500). Both methods presented similar recovery results for MC item parameters, but the CB method yielded a bit better recovery of the CR item parameters. The empirical example suggested that posterior ability estimated by the proposed SB method had higher reliability than the CB method.
... The option of using proficiency estimates (theta, θ) modeled with Item Response Theory (IRT; Andrade et al., 2021; Baker & Kim, 2004), for example, is harder to explain to a lay audience. Nevertheless, efforts to translate these concepts for that audience should be made, given the advantages such methods provide (e.g., item parameters are independent of the subjects, error estimates are available at each proficiency level, etc.). ...
... IRT, however, assumes the parameter-invariance property, considered its main distinction from Classical Test Theory (CTT). This principle states that, when a full set of items fits an IRT model well, the items' psychometric parameters do not depend on the examinees' ability, and that ability can be estimated independently of the difficulty of the test used (Baker & Kim, 2004). Research by Conde and Laros (2007) found a negative relationship between the IRT invariance property and the degree to which a measure lacks unidimensionality. ...
Article
Full-text available
Psychological assessment (PA) and educational assessment (EA) are among the most important contributions of cognitive and behavioral sciences to modern society. They provide important information about individuals and groups that are a part of the society. The aim of this article is to present guidelines for researchers regarding PA and large-scale learning assessments (LSLAs). The paths of PA in a world in health crisis due to the Covid-19 pandemic are discussed. In the context of LSLAs, we discuss theoretical, methodological and analytical aspects that must be considered by evaluators and researchers in the area. We conclude that PA and LSLAs are related to the extent that both fulfill the social function of identifying gaps that deserve attention, as well as functional aspects that must be maintained and encouraged. Another important characteristic is the requirement for constant technical improvement by both evaluators and researchers.
... Item calibration estimates the item parameters from response data by marginalizing the examinee ability θ from the likelihood in order to ensure the asymptotic consistency of the item parameter estimates. Specifically, marginal maximum likelihood (MML) estimation using an expectationmaximization (EM) algorithm has been widely used for item calibration (Baker and Kim, 2004). Given calibrated item parameters, the ability estimation phase calculates the examinee's ability θ. ...
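A toy version of the MML-EM item calibration described above can be sketched for the Rasch model. Equally spaced nodes with normal-density weights stand in for a proper Gauss-Hermite quadrature, the data are invented, and the inner Newton loop is deliberately minimal:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rasch_mml_em(data, n_iters=50):
    """Minimal MML-EM item calibration for the Rasch model.
    `data` is a list of 0/1 response vectors (one per examinee).
    Ability is integrated out over fixed quadrature nodes under a
    standard normal prior; the M-step runs a few Newton updates per
    item difficulty. Illustrative, not production quality."""
    n_items = len(data[0])
    nodes = [i / 2.0 for i in range(-8, 9)]           # theta grid on [-4, 4]
    prior = [math.exp(-t * t / 2.0) for t in nodes]   # normal-density weights
    s = sum(prior)
    prior = [w / s for w in prior]
    b = [0.0] * n_items                               # item difficulties

    for _ in range(n_iters):
        # E-step: expected examinee counts nk and correct counts rjk per node
        nk = [0.0] * len(nodes)
        rjk = [[0.0] * len(nodes) for _ in range(n_items)]
        for resp in data:
            post = []
            for k, t in enumerate(nodes):
                like = prior[k]
                for j, x in enumerate(resp):
                    p = sigmoid(t - b[j])
                    like *= p if x else (1.0 - p)
                post.append(like)
            tot = sum(post)
            for k in range(len(nodes)):
                w = post[k] / tot
                nk[k] += w
                for j, x in enumerate(resp):
                    if x:
                        rjk[j][k] += w
        # M-step: Newton updates of each difficulty
        for j in range(n_items):
            for _ in range(5):
                g = h = 0.0
                for k, t in enumerate(nodes):
                    p = sigmoid(t - b[j])
                    g += rjk[j][k] - nk[k] * p
                    h += nk[k] * p * (1.0 - p)
                b[j] -= g / h
    return b

data = [[1, 0], [1, 0], [1, 1], [0, 0], [1, 0]]
diffs = rasch_mml_em(data)
print(diffs[0] < diffs[1])  # True: the first item was answered correctly more often
```

Marginalizing over the quadrature nodes in the E-step is what removes the examinee abilities from the likelihood, which is the point of the MML approach mentioned in the excerpt.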
... We linearly transformed the difficulty values estimated on the real-value scale (-3.96, -1.82, -0.26, 0.88, 2.01, 3.60) to positive integer values (1, 29, 49, 64, 79, 100) to make it easier for the language models to understand the numerical inputs. Table 1 shows the ability estimates θ̂ for the five QA systems, where the abilities were estimated by EAP estimation using a Gaussian quadrature (Baker and Kim, 2004), given the calibrated item-difficulty parameters. The table shows that the abilities of the five QA systems differ greatly. ...
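The EAP scoring mentioned in this excerpt is a posterior mean over quadrature nodes. In the sketch below, equally spaced nodes with normal-density weights stand in for a proper Gauss-Hermite rule, and the 2PL item parameters are illustrative:

```python
import math

def eap_ability(responses, items, n_nodes=41):
    """Expected a posteriori (EAP) ability estimate: the posterior mean
    of theta under a standard normal prior, given calibrated 2PL item
    parameters (a, b) and a 0/1 response vector."""
    nodes = [-4.0 + 8.0 * k / (n_nodes - 1) for k in range(n_nodes)]
    num = den = 0.0
    for t in nodes:
        w = math.exp(-t * t / 2.0)  # unnormalized normal prior weight
        for x, (a, b) in zip(responses, items):
            p = 1.0 / (1.0 + math.exp(-a * (t - b)))
            w *= p if x else (1.0 - p)
        num += t * w
        den += w
    return num / den

items = [(1.0, -0.5), (1.2, 0.0), (0.9, 0.5)]
print(eap_ability([1, 1, 1], items))  # positive yet finite, despite a perfect score
print(eap_ability([0, 0, 0], items))  # negative
```

Unlike maximum likelihood, the EAP estimate remains finite for perfect and zero scores because the prior pulls the posterior mean toward 0.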
Conference Paper
Full-text available
Question generation (QG) for reading comprehension, a technology for automatically generating questions related to given reading passages, has been used in various applications, including in education. Recently, QG methods based on deep neural networks have succeeded in generating fluent questions that are pertinent to given reading passages. One example of how QG can be applied in education is a reading tutor that automatically offers reading comprehension questions related to various reading materials. In such an application, QG methods should provide questions with difficulty levels appropriate for each learner's reading ability in order to improve learning efficiency. Several difficulty-controllable QG methods have been proposed for doing so. However, conventional methods focus only on generating questions and cannot generate answers to them. Furthermore, they ignore the relation between question difficulty and learner ability, making it hard to determine an appropriate difficulty for each learner. To resolve these problems, we propose a new method for generating question--answer pairs that considers their difficulty, estimated using item response theory. The proposed difficulty-controllable generation is realized by extending two pre-trained transformer models: BERT and GPT-2.
... The awareness mechanisms measurement allows us to assess the general awareness quality of the collaborative environment, its presented design elements, goals, and awareness dimensions, through the estimate of the examinee's ability. In this sense, we assume the graded item response approach combined with the ability and item information functions proposed by [Samejima 1969] and [Baker and Kim 2004]. ...
... In IRT, the evaluation information is defined in terms of item information functions I_i(θ), which measure how well responses in that category estimate the examinee's ability [Baker and Kim 2004]. In our model, we assume the graded item response approach, where each item has been divided into n ordered response categories. ...
Conference Paper
[Context] Awareness has been a valuable concept in Collaborative Systems since its formation, being an essential part of groupware. An efficient awareness mechanism ensures a better understanding and, consequently, a better projection of future actions; in contrast, the lack of these mechanisms undermines comprehension and prevents participants from projecting their work accordingly. [Problem] This is a multi-factorial problem, and finding a good starting point in the literature can be challenging for novice groupware designers; they must reinvent awareness from their own experience of what it is, how it works, and how it is used. [Goal] This work establishes an assessment model for collaborative interfaces by analyzing the awareness mechanisms provided from the participant’s viewpoint. Our awareness assessment model was developed using the statistical technique Item Response Theory (IRT) and considers the participant’s skill in understanding the awareness and the difficulty involved. [Results] The proposed assessment model allows us to measure the awareness support provided considering the collaboration, workspace, and contextual awareness perspectives. The results obtained were translated into an awareness support scale, and three quality levels were defined.
... Currently, there are two main methods for estimating the examinee's ability in CAT: maximum likelihood estimation (MLE) (Baker & Kim, 2004) and Bayesian estimation methods (Bock & Mislevy, 1982; Baker & Kim, 2004). The Bayesian estimation methods are further divided into maximum a posteriori (MAP) and expected a posteriori (EAP). ...
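The MAP variant mentioned above maximizes the posterior density rather than averaging over it. A minimal sketch with a standard normal prior and illustrative 2PL item parameters; a grid search stands in for the usual Newton iteration:

```python
import math

def map_ability(responses, items):
    """Maximum a posteriori (MAP) ability estimate: maximizes the
    log-likelihood plus the standard normal log-prior. Unlike MLE,
    it stays finite for all-correct or all-incorrect patterns."""
    def log_posterior(t):
        lp = -t * t / 2.0  # N(0, 1) log-prior, up to a constant
        for x, (a, b) in zip(responses, items):
            p = 1.0 / (1.0 + math.exp(-a * (t - b)))
            lp += math.log(p) if x else math.log(1.0 - p)
        return lp
    grid = [i / 100.0 for i in range(-600, 601)]
    return max(grid, key=log_posterior)

items = [(1.0, -0.5), (1.2, 0.0), (0.9, 0.5)]
print(map_ability([1, 1, 1], items))  # finite despite a perfect score
```

MAP returns the posterior mode, while EAP returns the posterior mean; for short adaptive tests the two can differ noticeably because the posterior is skewed.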
Article
Computerized Adaptive Testing (CAT) is a new testing mode that utilizes the adaptive measurement concept of "tailored to fit." Compared with traditional paper-and-pencil testing, CAT has the advantages of improving measurement accuracy, reducing test length, and ensuring test security. Therefore, it is highly regarded by researchers and practitioners both domestically and internationally. However, the platform construction of CAT involves complex statistical measurement theory and tedious numerical calculations, which hinder the application and promotion of CAT in practice. This article mainly introduces the development platform of computerized adaptive testing - flexCAT. Users can quickly build their own CAT system using the convenient human-computer interactive interface provided by the flexCAT platform. This article will introduce the first web-based computerized adaptive testing development platform in China - flexCAT, from the perspectives of its advantages, basic theory, module functions, etc. The aim is to provide free adaptive testing platform development services for research and application personnel in the fields of education, psychology, and further promote the development of psychological and educational measurement theory and technology in China. The URL for the flexCAT platform is: http://www.psychometrics-studio.cn/app/cat_demo/index.html?Id=false&Block=false.
... The basic notions of IRT rely on the individual items of a test rather than on a certain aggregate of item responses (e.g., score indicator; Baker and Kim 2004). Therefore, in this study, we use an IRT model to estimate an occupant's energy-saving behavior score; the model considers both the difficulty of given ecological behaviors and the household's ability to perform them. ...
... Therefore, in this study, we use an IRT model to estimate an occupant's energy-saving behavior score; the model considers both the difficulty of given ecological behaviors and the household's ability to perform them. IRT considers a class of latent variable models that link dichotomous and polytomous response variables (i.e., manifest factors) to a single latent factor (Baker and Kim 2004). This method models the fundamental relation between the respondent's IRT measured construct, often denoted as θ, and their probability of managing an item. ...
Article
In addition to scrutinizing the decision process behind energy efficiency investment, this study investigates its association with energy-saving behavior. Its conceptual underpinnings are based on the intersection of behavioral change and "energy efficiency paradox" theories. Based upon a rich, disaggregated dataset representative of the French housing sector, it develops an energy-saving score based on the item response theory model, which considers household attributes and ability levels. Then this score is used as an independent factor of a multivariate probit model to examine the drivers of household investment decisions for various energy performance solutions. The results highlight that: (i) contextual and attitudinal attributes are two major drivers of energy efficiency investments, and (ii) depending on the energy solution considered, there is a significant inverse relationship between energy-savings behavior and energy efficiency investments. This reveals that environmental awareness is not necessarily a driving factor behind energy efficiency investments and emphasizes the so-called "rebound effect" issue. The results support the view that promoting energy-saving behaviors and energy efficiency investments necessitate differentiated public policies that consider both individual preferences and housing stock heterogeneity. The analysis offers valuable policy guidance and research agenda outlining future energy efficiency research priorities.
... We often take g(θ) to be a standard normal distribution. The EM algorithm [3] is usually used in such a case [2]. Then, the students' abilities are obtained by maximizing the corresponding likelihood function. ...
... Then, the students' abilities are obtained by maximizing the corresponding likelihood function. To circumvent ill-conditioned cases, in which all items are answered correctly or all incorrectly, the Bayes technique is applied [2]. However, we sometimes meet other choices for g(θ), such as the uniform distribution. ...
... One of the most established models of estimating the difficulty of questions and learning problems is Item Response Theory (IRT) [40]. This model is actively used in intelligent tutoring systems for modeling the difficulty of exercise items for students [41]. ...
Article
Full-text available
Modern advances in creating shared banks of learning problems and automatic question and problem generation have led to the creation of large question banks in which human teachers cannot view every question. These questions are classified according to the knowledge necessary to solve them and the question difficulties. Constructing tests and assignments on the fly at the teacher’s request eliminates the possibility of cheating by sharing solutions because each student receives a unique set of questions. However, the random generation of predictable and effective assignments from a set of problems is a non-trivial task. In this article, an algorithm for generating assignments based on teachers’ requests for their content is proposed. The algorithm is evaluated on a bank of expression-evaluation questions containing more than 5000 questions. The evaluation shows that the proposed algorithm can guarantee the minimum expected number of target concepts (rules) in an exercise for any settings. The available bank and exercise difficulty chiefly determine the difficulty of the found questions. It is almost independent of the number of target concepts per item in the exercise: teaching more rules is achieved by rotating them among the exercise items on lower difficulty settings. An ablation study shows that all the principal components of the algorithm contribute to its performance. The proposed algorithm can be used to reliably generate individual exercises from large, automatically generated question banks according to teachers’ requests, which is important in massive open online courses.
... This was done to guarantee the independence of the two results that we intend to compare. For the IRT analysis, we fed the data into the graded model (Samejima, 1969; Samejima, 2010) from the ltm package in R (Baker and Kim, 2004; Rizopoulos, 2007), from which we extracted the IC curve parameters and means. Finally, we correlated the means of the curves with the nodes' positions. ...
Article
Full-text available
Belief network analysis (BNA) refers to a class of methods designed to detect and outline structural organizations of complex attitude systems. BNA can be used to analyze attitude-structures of abstract concepts such as ideologies, worldviews, and norm systems that inform how people perceive and navigate the world. The present manuscript presents a formal specification of the Response-Item Network (or ResIN), a new methodological approach that advances BNA in at least two important ways. First, ResIN allows for the detection of attitude asymmetries between different groups, improving the applicability and validity of BNA in research contexts that focus on intergroup differences and/or relationships. Second, ResIN’s networks include a spatial component that is directly connected to item response theory (IRT). This allows for access to latent space information in which each attitude (i.e. each response option across items in a survey) is positioned in relation to the core dimension(s) of group structure, revealing non-linearities and allowing for a more contextual and holistic interpretation of the attitudes network. To validate the effectiveness of ResIN, we develop a mathematical model and apply ResIN to both simulated and real data. Furthermore, we compare these results to existing methods of BNA and IRT. When used to analyze partisan belief-networks in the US-American political context, ResIN was able to reliably distinguish Democrat and Republican attitudes, even in highly asymmetrical attitude systems. These results demonstrate the utility of ResIN as a powerful tool for the analysis of complex attitude systems and contribute to the advancement of BNA.
... Item response theory (IRT; [1][2][3][4][5]) modeling is a class of statistical models that analyze discrete multivariate data. In these models, a vector X = (X 1 , . . . ...
Article
Full-text available
Item response theory (IRT) models are frequently used to analyze multivariate categorical data from questionnaires or cognitive test data. In order to reduce the model complexity in item response models, regularized estimation is now widely applied, adding a nondifferentiable penalty function such as the LASSO or the SCAD penalty to the log-likelihood function. In most applications, regularized estimation repeatedly estimates the IRT model on a grid of regularization parameters λ. The final model is selected for the parameter that minimizes the Akaike or Bayesian information criterion (AIC or BIC). In recent work, it has been proposed to directly minimize a smooth approximation of the AIC or the BIC for regularized estimation. This approach circumvents the repeated estimation of the IRT model. As a result, the computation time is substantially reduced. The adequacy of the new approach is demonstrated by three simulation studies focusing on regularized estimation for IRT models with differential item functioning, multidimensional IRT models with cross-loadings, and the mixed Rasch/two-parameter logistic IRT model. It was found from the simulation studies that the computationally less demanding direct optimization based on the smooth variants of AIC and BIC had comparable or improved performance compared to the ordinarily employed repeated regularized estimation based on AIC or BIC.
... These tests attempt to measure one or several hypothetical constructs that are typically unobservable, known as latent traits. According to Baker and Kim (2004), examples of latent traits include intelligence and arithmetic ability. It is essential that tests measure latent traits consistently; this property is called reliability. ...
Article
Full-text available
This study examines the comparability of item statistics generated from the frameworks of classical test theory (CTT) and the 2-parameter model of item response theory (IRT). A 40-item Physics Achievement Test was developed and administered to 600 senior secondary two students, who were randomly selected from 12 senior secondary schools in Taraba State, Nigeria. Results showed that item statistics obtained from both frameworks were relatively similar. However, item statistics obtained from the IRT 2-parameter model appeared more balanced than those from CTT. In addition, in the item selection process, the IRT 2-parameter model retained more items than the CTT model. This result implies that test developers and public examining bodies should integrate the IRT model into their test development processes. Through the IRT model, test constructors would be able to generate more stable items than with the CTT model used at present, and, in the end, the test scores of examinees will be estimated more reliably.
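The CTT side of such a comparison is straightforward to reproduce. A sketch of the two classical item statistics — proportion-correct difficulty and corrected item-total correlation (discrimination) — computed on a made-up 0/1 response matrix:

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ctt_item_stats(matrix):
    """CTT difficulty (proportion correct) and discrimination (corrected
    item-total correlation) for a 0/1 matrix, rows = examinees,
    columns = items. The 'corrected' total excludes the item itself."""
    n_items = len(matrix[0])
    stats = []
    for j in range(n_items):
        item = [row[j] for row in matrix]
        rest = [sum(row) - row[j] for row in matrix]
        stats.append((sum(item) / len(item), pearson(item, rest)))
    return stats

responses = [  # hypothetical data: 6 examinees x 3 items
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
]
stats = ctt_item_stats(responses)
```

Unlike the IRT discrimination parameter, the item-total correlation depends on the observed total scores, which is exactly the dependence the cited study contrasts against the 2PL model.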
... • Parametric estimation: This method uses mathematical models to estimate the effort required for a project. It is based on the project's size, complexity, and other factors [10]. Estimation is crucial in project management, as inaccuracies can lead to poor project performance, potentially resulting in project failure. ...
Article
Full-text available
Effort estimation is a crucial aspect of software development, as it helps project managers plan, control, and schedule the development of software systems. This research study compares various machine learning techniques for estimating effort in software development, focusing on the most widely used and recent methods. The paper begins by highlighting the significance of effort estimation and its associated difficulties. It then presents a comprehensive overview of the different categories of effort estimation techniques, including algorithmic, model-based, and expert-based methods. The study concludes by comparing methods for a given software development project. Random Forest Regression algorithm performs well on the given dataset tested along with various Regression algorithms, including Support Vector, Linear, and Decision Tree Regression. Additionally, the research identifies areas for future investigation in software effort estimation, including the requirement for more accurate and reliable methods and the need to address the inherent complexity and uncertainty in software development projects. This paper provides a comprehensive examination of the current state-of-the-art in software effort estimation, serving as a resource for researchers in the field of software engineering.
... IRT models use individual items as the unit of measurement to obtain latent trait/ability scores [4]. A wide variety of parametric and nonparametric IRT models have been developed to describe how individuals respond to items. ...
Article
Full-text available
Likert scales are the most common psychometric response scales in the social and behavioral sciences. Likert items are typically used to measure individuals' attitudes, perceptions, knowledge, and behavioral changes. To analyze the psychometric properties of individual Likert-type items and overall Likert scales, mostly methods based on classical test theory (CTT) are used, including corrected item-total correlations and reliability indices. CTT methods heavily rely on the total scale scores, making it challenging to directly examine the performance of items and response options across varying levels of the trait. In this study, Kernel Smoothing Item Response Theory (KS-IRT) is introduced as a graphical nonparametric IRT approach for the evaluation of Likert items. Unlike parametric IRT models, nonparametric IRT models do not involve strong assumptions regarding the form of item response functions (IRFs). KS-IRT provides graphics for detecting peculiar patterns in items across different levels of a latent trait. Differential item functioning (DIF) can also be examined by applying KS-IRT. Using empirical data, we illustrate the application of KS-IRT to the examination of Likert items on a psychological scale.
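The kernel-smoothing step behind KS-IRT can be sketched as a Nadaraya-Watson estimate: the item response function at each trait value is a Gaussian-weighted average of the observed responses. The rank-based trait proxy and all parameter values below are illustrative assumptions, not the cited paper's implementation.

```python
import math

def ks_irf(trait, responses, theta_grid, h=0.75):
    """Nadaraya-Watson kernel estimate of an item response function.
    'trait' holds standardized trait proxies (e.g., rank-based rest
    scores), 'responses' the matching 0/1 item responses, and h the
    kernel bandwidth; all values are illustrative."""
    curve = []
    for t in theta_grid:
        w = [math.exp(-0.5 * ((t - z) / h) ** 2) for z in trait]
        curve.append(sum(wi * u for wi, u in zip(w, responses)) / sum(w))
    return curve

z = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical trait proxies
u = [0, 0, 1, 1, 1]               # hypothetical responses to one item
curve = ks_irf(z, u, theta_grid=[-2.0, 0.0, 2.0])
```

Because no parametric form is imposed, a non-monotonic or flat estimated curve directly flags a problematic item — the kind of "peculiar pattern" the abstract refers to.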
... Lack of stability of the estimates of the 3PL model has also been discussed by Patz and Junker (1999), DeMars (2001), and Pelton (2002). As a consequence of the difficulty in estimating the lower asymptote of the 3PL, the 4PL model is often considered even more problematic to estimate (see, e.g., Embretson and Reise, 2000; Baker and Kim, 2004). Nonetheless, it is worth mentioning that there has recently been a renewed interest in 4-parameter models (see, e.g., Hessen, 2005; Loken and Rulison, 2010; Ogasawara, 2012; Culpepper, 2016). ...
Article
Full-text available
The present work aims at showing that the identification problems (here meant as both issues of empirical indistinguishability and unidentifiability) of some item response theory models are related to the notion of identifiability in knowledge space theory. Specifically, the identification problems of the 3- and 4-parameter models are related to the more general issues of forward- and backward-gradedness in all items of the power set, which is the knowledge structure associated with IRT models under the assumption of local independence. As a consequence, the identifiability problem of a 4-parameter model is split into two parts: a first one, which is the result of a trade-off between the left-side added parameters and the remainder of the Item Response Function, e.g., a 2-parameter model, and a second one, which is the already well-known identifiability issue of the 2-parameter model itself. Application of the results to the logistic case appears to provide both a confirmation and a generalization of the current findings in the literature for both fixed- and random-effects IRT logistic models.
... The CAT scores were obtained by running the adaptive test algorithm (Appendix) on the available data for the calibration sample. We used Fisher's information index to select the next item in the CAT [39] and an "expected a posteriori estimation" to estimate the CAT scores [40]. We assumed that in community-based PCH, 30 items are the maximum number feasible, and therefore limited the number of items to be used in this CAT to 30. ...
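The item-selection rule named in the snippet above is easy to sketch for the 2PL model, where an item's Fisher information at ability θ is I(θ) = a²·p·(1−p): at each step the CAT administers the unanswered item with maximum information at the current ability estimate. The item pool below is hypothetical.

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a positive response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def next_item(theta_hat, pool, administered):
    """Index of the unadministered item with maximum Fisher information
    I(theta) = a^2 * p * (1 - p) at the current ability estimate."""
    best_idx, best_info = None, -1.0
    for idx, (a, b) in enumerate(pool):
        if idx in administered:
            continue
        p = p_2pl(theta_hat, a, b)
        info = a * a * p * (1.0 - p)
        if info > best_info:
            best_idx, best_info = idx, info
    return best_idx

pool = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]  # (a, b) pairs, made up
first = next_item(0.0, pool, set())   # the item matched to theta ~ 0
second = next_item(2.0, pool, {first})
```

With equal discriminations, this rule reduces to picking the item whose difficulty is closest to the current ability estimate, which is why a CAT can reach a stable score in far fewer items than a fixed form.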
Article
Full-text available
Questionnaires to detect emotional and behavioral (EB) problems in preventive child healthcare (PCH) should be short; this potentially affects their validity and reliability. Computerized adaptive testing (CAT) could overcome this weakness. The aim of this study was to (1) develop a CAT to measure EB problems among pre-school children and (2) assess the efficiency and validity of this CAT. We used a Dutch national dataset obtained from parents of pre-school children undergoing a well-child care assessment by PCH (n = 2192, response 70%). Data regarded 197 items on EB problems, based on four questionnaires, the Strengths and Difficulties Questionnaire (SDQ), the Child Behavior Checklist (CBCL), the Ages and Stages Questionnaire: Social Emotional (ASQ:SE), and the Brief Infant–Toddler Social and Emotional Assessment (BITSEA). Using 80% of the sample, we calculated item parameters necessary for a CAT and defined a cutoff for EB problems. With the remaining part of the sample, we used simulation techniques to determine the validity and efficiency of this CAT, using as criterion a total clinical score on the CBCL. Item criteria were met by 193 items. This CAT needed, on average, 16 items to identify children with EB problems. Sensitivity and specificity compared to a clinical score on the CBCL were 0.89 and 0.91, respectively, for total problems; 0.80 and 0.93 for emotional problems; and 0.94 and 0.91 for behavioral problems. Conclusion: A CAT is very promising for the identification of EB problems in pre-school children, as it seems to yield an efficient, yet high-quality identification. This conclusion should be confirmed by real-life administration of this CAT. What is Known: • Studies indicate the validity of using computerized adaptive test (CAT) applications to identify emotional and behavioral problems in school-aged children. • Evidence is as yet limited on whether CAT applications can also be used with pre-school children. 
What is New: • The results of this study show that a computerized adaptive test is very promising for the identification of emotional and behavior problems in pre-school children, as it appears to yield an efficient and high-quality identification.
... Furthermore, the second and first derivatives ∂²E_j(M_js, γ_js | Y, ψ^(t))/∂γ_js² and ∂E_j(M_js, γ_js | Y, ψ^(t))/∂γ_js, both evaluated at γ_js = γ_js^(t), in Eq. (18) can also be approximated based on "artificial data". One can refer to Baker and Kim (2004) ...
Article
Full-text available
In this paper, we propose a generalized expectation model selection (GEMS) algorithm for latent variable selection in multidimensional item response theory models which are commonly used for identifying the relationships between the latent traits and test items. Under some mild assumptions, we prove the numerical convergence of GEMS for model selection by minimizing the generalized information criteria of observed data in the presence of missing data. For latent variable selection in the multidimensional two-parameter logistic (M2PL) models, we present an efficient implementation of GEMS to minimize the Bayesian information criterion. To ensure parameter identifiability, the variances of all latent traits are assumed to be unity and each latent trait is required to have an item exclusively associated with it. The convergence of GEMS for the M2PL models is verified. Simulation studies show that GEMS is computationally more efficient than the expectation model selection (EMS) algorithm and the expectation maximization based L1-penalized method (EML1), and it yields better correct rate of latent variable selection and mean squared error of parameter estimates than the EMS and EML1. The GEMS algorithm is illustrated by analyzing a real dataset related to the Eysenck Personality Questionnaire.
... In this case, Molenaar (2015) recommends fixing 0i = 1 for all items. Although the strategy of fixing the lowest two thresholds has been used previously for the polytomous model (Falk 2020; Molenaar et al. 2012; Rodriguez 2017), another possible identification strategy for the polytomous model is to fix 0i and constrain the average ic to equal zero, a strategy that is in line with the identification constraints sometimes used for other polytomous item response models (Baker and Kim 2004). We will further explore and compare these identification strategies in our empirical illustrations. ...
Article
Full-text available
The residual heteroscedasticity (RH) model is a recently popularized asymmetric model that aims to model complex item response behavior. In this paper, we probe the conditions under which the existing form of the RH model does not guarantee monotonic boundary response functions (BRFs), a necessary condition for ensuring at least ordinal-level measurement. We derive the conditions under which RH BRFs are not monotonic and we use this result to propose a Bayesian computational strategy that enforces monotonicity. Through real and simulated data illustrations, we demonstrate that failures of monotonicity occur in real data and that our proposed computational solution effectively enforces monotonicity and yields accurate item parameter estimates. Finally, we demonstrate that any IRT model developed by specifying a residual variance function is likely to encounter similar issues with monotonicity. We recommend our reparameterization for both data generation in simulation studies and for fitting the RH model to real data.
... Most earlier studies of automated test assembly have evaluated the measurement accuracies of parallel test forms (such as [1], [2], [8], [10]–[12], [23]) using Item Response Theory (IRT) [24], [25]. It is noteworthy that IRT can measure the ability of examinees on the same scale, even when the examinees have taken different tests. ...
Article
Full-text available
Recently, with progress in computer science, automated test assemblies of parallel test forms, for which each form has equivalent measurement accuracy but with a different set of items, have emerged as a new standard tool. An important goal for automated test assembly is to assemble as many parallel test forms as possible. Although many automated test assembly methods exist, maximum clique using the integer programming method is known to be able to assemble the greatest number of assembled test forms with the highest measurement accuracy. Nevertheless, because of the high time complexity of integer programming, the method requires a month or more to assemble 300,000 tests. This study proposes a new automated test assembly using Zero-suppressed Binary Decision Diagrams (ZDD): a graphical representation for a set of item combinations. This representation is derived by reducing a binary decision tree. According to the proposed method, each node in the binary decision tree corresponds to an item of an item pool, which is a test item database. Each node has two edges, each signifying that the corresponding item is included in a test form or not. Furthermore, all equivalent nodes are shared, provided that they have equal measurement accuracy and equal test length. Numerical experiments demonstrate that the proposed method can assemble 1,500,000 test forms within 24 hr, although earlier methods have been capable of assembling only 300,000 test forms in a week or more.
... It is possible to notice unordered item-step difficulty parameters for the majority of items, mainly related to the high imbalance between the first category (no knowledge) and the remaining ones (see Figure 2). Indeed, considering the item response category characteristic curves (IRCCCs; [22]), the item-step difficulty parameters represent the point on the latent continuum where two consecutive IRCCCs cross each other [17]. Thus, from the observed result, we can derive that the characteristic curve of the first category dominates the remaining ones. ...
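The category curves behind that observation come from Samejima's graded model: each item step has a cumulative boundary curve, and a category's probability is the difference between adjacent boundaries. A minimal sketch with illustrative parameters:

```python
import math

def grm_category_probs(theta, a, b_steps):
    """Category response probabilities under Samejima's graded model.
    a is the item discrimination and b_steps the ordered item-step
    difficulties; all numeric values here are illustrative."""
    # Cumulative boundary curves: P*(k) = P(response in category k or above).
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in b_steps] + [0.0]
    # Category probabilities are differences of adjacent boundaries.
    return [cum[k] - cum[k + 1] for k in range(len(b_steps) + 1)]

probs = grm_category_probs(0.0, a=1.5, b_steps=[-1.0, 0.0, 1.0])
```

Plotting these probabilities over a θ grid gives exactly the IRCCCs discussed in the snippet; when one category (such as "no knowledge") dominates, its curve sits above the others over a wide stretch of the latent continuum and the estimated step difficulties can come out unordered.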
Article
Full-text available
Introduction Modern FinTech tools (e.g., instant payments, blockchain, roboadvisor) represent the new frontier of digital finance. Consequently, the evaluation of the knowledge level of the population about these topics is a crucial concern. In this context, several exogenous factors may influence individual differences in financial literacy. In particular, the territorial characteristics can have an impact on FinTech. In this work, we investigate individual heterogeneity in subjective financial knowledge in Italy, specifically focusing on modern FinTech tools, and exploring the differences at the individual and regional levels. Methods A sample of 598 Italian individuals from 10 different Italian regions was involved. A multilevel IRT model is performed to evaluate the level of FinTech individual knowledge and the differences according to Italian regions to account for the hierarchical structure of the data. Results Results reported a weak regional effect, revealing that heterogeneity in financial knowledge can be mainly attributed to individual characteristics. At the individual level, age, economic condition, knowledge of traditional financial objects and numeracy showed a significant effect. In addition, a scientific field of study and work have an impact on respondents' knowledge level. Discussion What is shown and discussed in this contribution can inspire policymakers' actions to increase financial literacy in the population. In particular, the obtained results imply that policymakers should improve the population's awareness of less popular FinTech tools and foster individuals' literacy about numbers and traditional financial tools, which proved to have a great influence in explaining FinTech knowledge differences.
... Regarding the person parameters, we have to distinguish the CML- and the MML-based methods [50], with eRm and psychotools supporting the former and ltm, mirt, and TAM the latter. In the CML context, no person parameter estimates can be obtained for perfect and zero scores as they tend to add or subtract infinity, respectively. (The same applies to the item parameter estimation; i.e., items with zero or perfect scores will also require special treatment, but this is already handled in the originating packages.) ...
Article
Full-text available
A constituting feature of item response models is that item and person parameters share a latent scale and are therefore comparable. The Person–Item Map is a useful graphical tool to visualize the alignment of the two parameter sets. However, the “classical” variant has some shortcomings, which are overcome by the new RMX package (Rasch models—eXtended). The package provides the RMX::plotPIccc() function, which creates an extended version of the classical PI Map, termed “PIccc”. It juxtaposes the person parameter distribution to various item-related functions, like category and item characteristic curves and category, item, and test information curves. The function supports many item response models and processes the return objects of five major R packages for IRT analysis. It returns the used parameters in a unified form, thus allowing for their further processing.
... It has extensions with applications in agriculture, health care studies and in research in marketing (Mendes et al., 2020; Bezruczko, 2005; Bechtel, 1985). Estimating the parameters in the Rasch and other item response models has been a difficult issue and there is much work on developing expectation-maximization (EM) algorithms for parameter estimation (Dempster et al., 1977; Baker and Kim, 2004; Liu et al., 2018). The Bock-Aitkin algorithm is a variant of the EM algorithm and is one of the most popular algorithms for estimating parameters in the Rasch models (Bock and Aitkin, 1981). ...
Preprint
Full-text available
Nature-inspired metaheuristic algorithms are important components of artificial intelligence, and are increasingly used across disciplines to tackle various types of challenging optimization problems. We apply a newly proposed nature-inspired metaheuristic algorithm called competitive swarm optimizer with mutated agents (CSO-MA) and demonstrate its flexibility and out-performance relative to its competitors in a variety of optimization problems in the statistical sciences. In particular, we show the algorithm is efficient and can incorporate various cost structures or multiple user-specified nonlinear constraints. Our applications include (i) finding maximum likelihood estimates of parameters in a single cell generalized trend model to study pseudotime in bioinformatics, (ii) estimating parameters in a commonly used Rasch model in education research, (iii) finding M-estimates for a Cox regression in a Markov renewal model and (iv) matrix completion to impute missing values in a two compartment model. In addition we discuss applications to (v) select variables optimally in an ecology problem and (vi) design a car refueling experiment for the auto industry using a logistic model with multiple interacting factors.
... Unlike the other three scales, the interval of scores with acceptable precision covered almost the entire range of medium-low and medium-high scores. These two score categories must be estimated with low error levels because they generally cover more than half of the population, since in IRT models without equating the scores are estimated assuming a standard normal distribution (Baker & Kim, 2004). ...
Article
Full-text available
This article aims to analyze the psychometric properties of the German Test Anxiety Inventory adapted to Costa Rica (GTAI-CR), based on the graded response model. For this purpose, the instrument was administered to 184 people (101 men, 82 women, and 1 person not identifying with the previous categories). Each of the four GTAI subscales was evaluated independently. The subscales, both globally and at the level of individual items, showed acceptable fit to the model. The characteristic curves of each category of each item were plausible for representative population groups. In addition, for each subscale, the range in which the estimates of the latent scores showed acceptable precision was computed. Finally, recommendations are presented so that the GTAI-CR scales can improve the precision of the scores for which they provide low information.
... The higher the difficulty parameter, the higher the anxiety level a respondent needed before endorsing the item. According to the rules suggested by Baker and Kim [23], an item with a discrimination parameter of >0.65 was considered to have moderate or good discrimination power and would be retained. IRT provides two useful measures, difficulty and discrimination, both of which are technical properties of ICCs. ...
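Both properties come straight off the 2PL item characteristic curve, and the retention rule is a one-line filter. A sketch with made-up parameter estimates (the 0.65 cutoff is the guideline cited in the snippet):

```python
import math

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: probability of endorsing an item
    at trait level theta, given discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def retain_items(discriminations, threshold=0.65):
    """Indices of items whose estimated discrimination exceeds the
    cutoff. The parameter values below are hypothetical."""
    return [i for i, a in enumerate(discriminations) if a > threshold]

kept = retain_items([0.4, 0.9, 1.7, 0.65, 2.1])
p_mid = icc_2pl(0.0, a=1.7, b=0.0)  # probability at theta == difficulty
```

Note that the ICC always passes through 0.5 at θ = b, while a controls how steeply it rises there, which is why the two parameters separate cleanly in the screening decision.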
Article
Full-text available
Objective: This study aimed to evaluate the structural reliability and validity of generalized anxiety disorder 7-item (GAD-7) scale in early pregnant women. Methods: In this cross-sectional study, 30,823 patients in early pregnancy registered in the Obstetrics and Gynecology Hospital of Fudan University completed the GAD-7 scale and patient health questionnaire-9 item (PHQ-9). The discriminative ability, reliability, construct validity, and criterion validity were assessed to evaluate the psychometric properties and factor structures. Items with a discrimination parameter (α) of <0.65, factor loading of <0.30, or cross loading of >0.40 in two or more factors simultaneously were deleted from the scale. Results: All GAD-7 scale items exhibited a high discrimination power. The reliability of the GAD-7 scale was good (Cronbach's alpha coefficient = 0.891). Exploratory factor analysis extracted one factor with eigenvalues of greater than 1.0, which explained 61.930% of the common variance. Confirmatory factor analysis confirmed that the one-factor structure fitted the data well. The correlation coefficient with the PHQ-9 was 0.639. Conclusion: The Chinese version of the GAD-7 scale can be used as a screening tool for early pregnant women. It performs well in terms of discriminative ability, reliability, construct validity, and criterion validity. Pregnant women who screen positive may require more attention and investigation to confirm the presence of generalized anxiety disorder.
... Given that IRT is modeled by distinct sets of parameters, a primary concern in IRT research has been parameter estimation, which offers the basis for the theoretical advantages of IRT. One major issue concerns the statistical complexities that often arise when item and person parameters are estimated simultaneously (see [1,[20][21][22]). More recent attention has focused on fully Bayesian estimation, where Markov chain Monte Carlo (MCMC, [23,24]) simulation techniques are used. ...
Article
Full-text available
Item response theory (IRT) is a popular approach for addressing large-scale assessment problems in psychometrics and other areas of applied research. An emergent research direction that integrates it with machine learning techniques has made IRT applicable to a wide range of fields. The fully Bayesian approach for estimating IRT models is computationally expensive due to the large number of iterations, which require a large amount of memory to store massive amounts of data. This limits the use of the procedure in many applications using traditional CPU architecture. In an effort to overcome such restrictions, previous studies focused on utilizing high performance computing using either distributed memory-based Message Passing Interface (MPI) or massive threads compute unified device architecture (CUDA) to achieve certain speedups with a simple IRT model. This study focuses on this model and aims at demonstrating the scalability of parallel algorithms integrating CUDA into the MPI computing paradigm.
... We observe that the estimated discrimination (α) of each item is above 0.65. According to previous guidelines (Baker & Kim, 2004), this indicates that all ten items of the C-SPS-10 have good measurement efficiency. The item characteristic curves (ICC) are displayed in Fig. 2a. ...
Article
Full-text available
This study performed a cross-cultural validation of the Chinese version of the 10-item Social Provisions Scale (C-SPS-10) in Chinese populations. Study 1 examined the factor structure, internal reliability, discrimination, criterion validity, and network structure of the C-SPS-10 by utilizing a sample of disaster victims in the 2021 Henan floods. Study 2 substantiated the findings of Study 1 in a general population sample. Measurement invariances between populations and between sexes in terms of the C-SPS-10 were also tested using the network approach. Study 3 used three samples to examine the test-retest reliability of the C-SPS-10 over three different time periods. The general results showed that the C-SPS-10 has excellent factor structure, internal reliability, discrimination, and criterion validity. The C-SPS-10 was confirmed to have good psychometric properties. Although the full scale functions well, problems may exist at a domain level. Moreover, the full scale of the C-SPS-10 was verified as a useful tool to capture trait-like characteristics of individuals’ perceptions of social support for the general population.
... Item response theory (IRT) models are an essential tool in educational assessment and psychological measurement where study outcomes consist of dichotomous or discrete responses (Baker and Kim, 2004). The response data often come with certain kinds of missingness, especially in large-scale assessments. ...
Article
Full-text available
Missingness due to not-reached items and omitted items has received much attention in the recent psychometric literature. Such missingness, if not handled properly, would lead to biased parameter estimation and inaccurate inference about examinees, and would further erode the validity of the test. This paper reviews some commonly used IRT-based models allowing missingness, followed by three popular examinee scoring methods: maximum likelihood estimation, maximum a posteriori, and expected a posteriori. Simulation studies were conducted to compare these examinee scoring methods across these commonly used models in the presence of missingness. Results showed that all the methods could infer examinees' ability accurately when the missingness is ignorable. If the missingness is nonignorable, incorporating those missing responses would improve the precision in estimating abilities for examinees with missingness, especially when the test length is short. In terms of examinee scoring methods, the expected a posteriori method performed better for evaluating latent traits under models allowing missingness. An empirical study based on the PISA 2015 Science Test was further performed.
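The scoring methods compared here share the same ingredients: an item response model, a prior, and a likelihood over the observed responses. A minimal sketch of EAP scoring under a 2PL model, where missing responses are simply dropped from the likelihood, i.e., treated as ignorable (the grid, the standard normal prior, and the item parameterization are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

def eap_score(resp, a, b, grid=np.linspace(-4, 4, 81)):
    """Expected a posteriori ability estimate under the 2PL with a standard
    normal prior. np.nan entries in resp mark missing responses, which are
    excluded from the likelihood (ignorable missingness)."""
    prior = np.exp(-grid ** 2 / 2)                      # unnormalized N(0, 1)
    p = 1.0 / (1.0 + np.exp(-a[None, :] * (grid[:, None] - b[None, :])))
    mask = ~np.isnan(resp)                              # observed items only
    like = np.prod(np.where(mask, np.where(resp == 1, p, 1 - p), 1.0), axis=1)
    post = prior * like
    return np.sum(grid * post) / np.sum(post)           # posterior mean
```

With no observed responses the estimate falls back to the prior mean, which illustrates why EAP remains well defined even under heavy missingness, unlike unconstrained maximum likelihood.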
Article
With the growing attention on large-scale educational testing and assessment, the ability to process substantial volumes of response data becomes crucial. Current estimation methods within item response theory (IRT), despite their high precision, often pose considerable computational burdens with large-scale data, leading to reduced computational speed. This study introduces a novel "divide-and-conquer" parallel algorithm built on the Wasserstein posterior approximation concept, aiming to enhance computational speed while maintaining accurate parameter estimation. This algorithm enables drawing parameters from segmented data subsets in parallel, followed by an amalgamation of these parameters via Wasserstein posterior approximation. Theoretical support for the algorithm is established through asymptotic optimality under certain regularity assumptions. Practical validation is demonstrated using real-world data from the Programme for International Student Assessment. Ultimately, this research proposes a transformative approach to managing educational big data, offering a scalable, efficient, and precise alternative that promises to redefine traditional practices in educational assessments.
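The amalgamation step has a particularly simple form when each parameter's subset posterior is one-dimensional: the 2-Wasserstein barycenter of one-dimensional distributions is obtained by averaging their quantile functions. A sketch of that closed form (the quantile grid is an arbitrary choice here, and the paper's actual algorithm may differ in detail):

```python
import numpy as np

def wasserstein_barycenter_1d(subset_draws, n_quantiles=200):
    """Approximate the 2-Wasserstein barycenter of one-dimensional posteriors.

    subset_draws: list of 1-D arrays of posterior draws, one per data shard.
    For 1-D distributions the barycenter's quantile function is the average
    of the shards' quantile functions, so we average empirical quantiles.
    Returns an array of values representing the combined posterior.
    """
    qs = np.linspace(0.005, 0.995, n_quantiles)
    quantiles = np.array([np.quantile(d, qs) for d in subset_draws])
    return quantiles.mean(axis=0)
```

Each shard's MCMC run is fully independent, so the expensive sampling parallelizes perfectly; only the cheap quantile-averaging step touches all shards.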
Article
This article introduces conditional maximum-likelihood (CML) item parameter estimation in multistage designs based on probabilities \(p^{[b]}(x_{+}^{[b]})\) for choosing a particular module \({\textbf {m}}^{[b+1]}\) conditional on a raw score \(x_{+}^{[b]}\) in a previous module \({\textbf {m}}^{[b]}\). This type of multistage design is applied to ensure a minimum exposure rate for all items, for example, in international large-scale assessments (ILSAs). For the item parameter estimation, various likelihood-based methods are available. While the marginal maximum-likelihood method (MML) provides consistent estimates in multistage designs, the CML method in its original formulation leads to biased item parameter estimates. In this contribution, we propose a modification of the common CML method for probabilistic routing strategies, based on the approach for deterministic routing strategies (Zwitser & Maris, 2015, Psychometrika), that provides practically unbiased item parameter estimates for the Rasch model. A simulation study shows that this modified CML estimation method also provides practically unbiased item parameter estimates in probabilistic multistage designs.
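CML estimation for the Rasch model rests on the fact that the probability of a response pattern given its raw score does not involve the ability parameter; those conditional probabilities are built from elementary symmetric functions of the item parameters. A minimal sketch of that building block (the summation recursion shown is a standard textbook algorithm, not the modified estimator proposed here):

```python
import numpy as np

def elementary_symmetric(eps):
    """Elementary symmetric functions gamma_0..gamma_n of the item easiness
    parameters eps_i = exp(-b_i), via the standard summation recursion."""
    g = np.zeros(len(eps) + 1)
    g[0] = 1.0
    for e in eps:
        # gamma_r <- gamma_r + e * gamma_{r-1}, for all r at once
        g[1:] = g[1:] + e * g[:-1]
    return g

def conditional_pattern_prob(x, b):
    """Probability of a 0/1 response pattern x given its raw score under the
    Rasch model with difficulties b. The ability parameter cancels, which is
    the basis of CML estimation."""
    eps = np.exp(-np.asarray(b))
    r = int(np.sum(x))
    g = elementary_symmetric(eps)
    return np.prod(eps ** x) / g[r]
```

The CML estimating equations maximize the product of such conditional probabilities over examinees; the multistage complication addressed by the paper is that module routing makes the conditioning event depend on interim raw scores.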
Article
Full-text available
Background Health-related quality of life (Hr-QoL) scales provide crucial information on neurodegenerative disease progression, help improve patient care and constitute a meaningful endpoint for therapeutic research. However, Hr-QoL progression is usually poorly documented, as for multiple system atrophy (MSA), a rare and rapidly progressing alpha-synucleinopathy. This work aimed to describe Hr-QoL progression during the natural course of MSA, explore disparities between patients and identify informative items using a four-step statistical strategy. Methods We leveraged the data of the French MSA cohort comprising annual assessments with the MSA-QoL questionnaire for more than 500 patients over up to 11 years. A four-step strategy (1) determined the subdimensions of Hr-QoL, (2) modelled the subdimension trajectories over time, (3) mapped item impairments with disease stages and (4) identified most informative items. Results Four dimensions were identified. In addition to the original motor, non-motor and emotional domains, an oropharyngeal component was highlighted. While the motor and oropharyngeal domains deteriorated rapidly, the non-motor and emotional aspects were already impaired at cohort entry and deteriorated slowly over the disease course. Impairments were associated with sex, diagnosis subtype and delay since symptom onset. Except for the emotional domain, each dimension was driven by key identified items. Conclusion The multidimensional Hr-QoL deteriorates progressively over the course of MSA and brings essential knowledge for improving patient care. As exemplified with MSA, the thorough description of Hr-QoL over time using the four-step strategy can provide perspectives on neurodegenerative diseases’ management to ultimately deliver better support focused on the patient’s perspective.
Conference Paper
In the realm of educational assessment, accurate measurement of students’ knowledge and abilities is crucial for effective teaching and learning. Traditional assessment methods often fall short in providing precise and meaningful insights into students’ aptitudes. However, Item Response Theory (IRT), a psychometric framework, offers a powerful toolset to address these limitations. This article proposes an exploration of IRT’s models and their potential to enhance educational assessment practices.
Article
The study of coup-proofing holds significant importance in political science as it offers insights into critical topics such as military coups, authoritarian governance, and international conflicts. However, due to the multifaceted nature of coup-proofing and empirical inconsistencies with existing indicators, there is a need for a more profound understanding and a new measurement methodology. We propose a new measure of the extent of coup-proofing, utilizing a Bayesian item response theory model. We estimate the extent of coup-proofing using a sample of 76 countries between 1965 and 2005 and theoretically relevant observed indicators. The findings from the estimation demonstrate that the extent of coup-proofing varies across regime type, country, and time. Furthermore, we verify the construct validity of our measurement.
Article
Full-text available
Understanding and accurately measuring resilience among Chinese civil aviation pilots is imperative, especially concerning the psychological impact of distressing events on their well-being and aviation safety. Despite the necessity, a validated and tailored measurement tool specific to this demographic is absent. Addressing this gap, this study built on the widely used CD-RISC-25 to analyze and modify its applicability to Chinese civil aviation pilots. Utilizing CD-RISC-25 survey data from 231 Chinese pilots, correlational and differential analyses identified items 3 and 20 as incongruent with this population's resilience profile. Subsequently, factor analysis derived a distinct two-factor resilience psychological framework labeled "Decisiveness" and "Adaptability", which diverged from the structure found in American female pilots and the broader Chinese populace. Additionally, to further accurately identify the measurement characteristics of this two-factor measurement model, this study introduced Generalizability Theory and Item Response Theory, two modern measurement analysis theories, to comprehensively analyze the overall reliability of the measurement and issues with individual items. Results showed that the two-factor model exhibited high reliability, with a generalizability coefficient reaching 0.89503 and a dependability coefficient reaching 0.88496, indicating the two-factor measurement questionnaire can be effectively utilized for relative and absolute comparison of Chinese civil aviation pilot resilience. However, items in Factor 2 provided less information and leave more room for optimization than those in Factor 1, implying item option redesign may be beneficial. Consequently, this study culminates in the creation of a more accurate and reliable two-factor psychological resilience measurement tool tailored for Chinese civil aviation pilots, while exploring directions for optimization. By facilitating early identification of individuals with lower resilience and enabling the evaluation of intervention efficacy, this tool aims to positively impact pilot psychological health and aviation safety in the context of grief and trauma following distressing events.
Article
Knowledge is defined as a multi-faceted latent variable that is not directly measurable but through manifest variables, i.e., items. Latent variable models are, therefore, widely used in this context to analyze latent traits from items, usually expressed by ordinal variables. Finding homogeneous groups of units according to their knowledge levels is helpful to policymakers and to anyone else who must make decisions in this domain. As a result, latent variable models are combined within integrated approaches to find homogeneous groups. The present work proposes a coordinated strategy combining item response theory (IRT) models with archetypal analysis (AA). The proposed method is applied to a data set of 625 Italian respondents to a survey conducted within the European project "Fintech and Artificial Intelligence in Finance". Empirical evidence demonstrates that the proposed method is an effective and helpful tool to obtain homogeneous groups and their respective profiles according to the knowledge levels of the respondents based on their responses to the survey.
Article
Full-text available
In the human-to-human Collaborative Problem Solving (CPS) test, students' problem-solving process reflects the interdependency among partners. The high interdependency in CPS makes it very sensitive to group composition. For example, the group outcome might be driven by a highly competent group member, so it does not reflect all the individual performances, especially for a low-ability member. As a result, how to effectively assess individuals' performances has become a challenging issue in educational measurement. This research aims to construct a measurement model to estimate an individual's collaborative problem-solving ability and correct for the impact of partners' abilities. First, 175 dyads of eighth graders were divided into six cooperative groups with different levels of problem-solving (PS) ability combinations (i.e., high-high, high-medium, high-low, medium-medium, medium-low, and low-low). Then, they participated in the test of three CPS tasks, and the log data of the dyads were recorded. We applied Multidimensional Item Response Theory (MIRT) measurement models to estimate an individual's CPS ability and proposed a mean correction method to correct for the impact of group composition on individual ability. Results show that (1) the multidimensional IRT model fits the data better than the multidimensional IRT model with the testlet effect; (2) the mean correction method significantly reduced the impact of group composition on obtained individual ability. This study not only successfully increased the validity of individuals' CPS ability measurement but also provided useful guidelines in educational settings to enhance individuals' CPS ability and promote an individualized learning environment.
Article
Full-text available
Parents of children with Autism Spectrum Disorder (ASD) may experience increased stress in their social and professional activities due to the challenges of raising a child with ASD. The present study developed a scale to measure the Social and Professional Stress (SPS) experienced daily by these parents. The study sample consisted of 255 parents residing in Brazil aged between 21 and 61 years (mean = 38, SD = 6.0). Item Response Theory (IRT) was used to develop the SPS-Scale, which showed good psychometric properties. Our findings indicated a higher level of SPS among mothers who are primary caregivers and who have children with symptoms of ASD at medium or severe levels. The child's age and the interviewee's marital status also showed an association with the SPS experienced by the parents. Overall, the SPS-Scale proved to be a valid instrument to measure the SPS experienced daily by parents of children or adolescents diagnosed with ASD.
Article
Full-text available
This paper investigates the performance of item response theory based on distance criteria rather than likelihood criteria. For this purpose, the estimated item response matrix is introduced. This matrix is a reconstruction of the item response matrix using maximum likelihood estimates of the parameters in item response theory. Then the distance between the observed and estimated matrices can be determined using the Frobenius matrix norm. An approximated low-rank matrix can be generated from the observed item response matrix by singular value decomposition, and the distance between the observed and low-rank matrices can be obtained in the same way. By comparing these two distances, we can evaluate whether the performance of the estimated item response matrix is comparable to the performance of an approximated low-rank matrix. Applying this comparison to actual examination data, it is found that the rank of the approximated low-rank matrix that is equivalent to the estimated item response matrix is very low when the matrices are used as training data. However, using test data, the predictive ability of item response theory seems high enough, since the minimum distance between the approximated low-rank matrix and the observed item response matrix is approximately equal to or slightly less than the distance between the estimated item response matrix and the observed item response matrix. This finding was first obtained by utilizing the estimated item response matrix defined here.
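The comparison described here is straightforward to reproduce in outline: by the Eckart-Young theorem, the best rank-k approximation in Frobenius norm comes from a truncated SVD, and one can search for the smallest rank whose approximation matches the IRT reconstruction's distance to the observed matrix. A sketch under that reading (function names are illustrative, not from the paper):

```python
import numpy as np

def lowrank_approx(X, k):
    """Best rank-k approximation of X in Frobenius norm (Eckart-Young),
    obtained by truncating the singular value decomposition."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def frobenius_distance(A, B):
    """Frobenius-norm distance between two matrices of the same shape."""
    return np.linalg.norm(A - B, "fro")

def equivalent_rank(X, X_irt):
    """Smallest rank k whose best rank-k approximation of X is at least as
    close to X as the IRT-estimated response matrix X_irt is."""
    d_irt = frobenius_distance(X, X_irt)
    for k in range(1, min(X.shape) + 1):
        if frobenius_distance(X, lowrank_approx(X, k)) <= d_irt:
            return k
    return min(X.shape)
```

A small equivalent rank on training data mirrors the paper's observation: the IRT reconstruction compresses the observed matrix about as well as a very low-rank SVD approximation does.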
Preprint
Full-text available
Background Physical activity plays an integral role in promoting health and well-being. Despite its importance, comprehensive studies exploring the influences of socio-demographic factors on physical activity in the Chinese context are relatively scarce. This study aims to investigate the relationship between physical activity and socio-demographic factors such as gender, age, and socioeconomic status, using data from the 2018 China Family Panel Studies (CFPS). Methods Data was derived from the 2018 CFPS, resulting in a final sample size of 21,854 adults, with physical activity as the dependent variable. The International Socio-Economic Index of Occupational Status (ISEI) was used to gauge socioeconomic status. Other incorporated variables included gender, age, community type, marital status, physical health, and mental health. The study employed a logistic regression model considering the dichotomous nature of the dependent variable. Results Significant correlations were found between physical activity and gender, age, and socioeconomic status. Men were found to be more likely to engage in physical activity than women, and the likelihood of physical activity increased with age and socioeconomic status. Further, the influence of socioeconomic status on physical activity was found to vary significantly across different genders and age groups, with complex intersections noted among these factors. Conclusion The study underscores the need for public health interventions that are mindful of the complex interplay between gender, age, and socioeconomic status in influencing physical activity. Efforts to promote physical activity should focus on bridging the disparities arising from these socio-demographic factors, especially targeting women and individuals from lower socioeconomic classes. 
Future research should delve into the mechanisms through which these factors intersect and explore other potential influential elements to enhance our understanding of physical activity behavior.
Article
Computerized adaptive testing (CAT) is a new mode of testing that adopts the adaptive measurement principle of tailoring the test to each examinee. Compared with traditional paper-and-pencil tests, it improves measurement precision, shortens test length, and enhances test security, and is therefore highly regarded by researchers and practitioners at home and abroad. However, building a CAT platform involves complex statistical measurement theory and tedious numerical computation, which has hindered the adoption of CAT in practice. This article introduces flexCAT, a development platform for computerized adaptive testing; with its convenient interactive interface, users can quickly build their own CAT system. The article presents flexCAT, the first web-based CAT development platform in China, in terms of its advantages, underlying theory, and module functions, aiming to provide researchers and practitioners in education, psychology, and related fields with a free adaptive testing platform development service and to further advance the theory and technology of psychological and educational measurement in China. The flexCAT platform is available at: http://www.psychometrics-studio.cn/app/cat_demo/index.html?Id=false&Block=false
Book
Full-text available
D-scoring Method of Measurement (DSM) presents a unified framework of classical and latent measurement. Provided are detailed descriptions of DSM procedures and illustrative examples of how to apply the DSM in various scenarios of measurement. The DSM is designed to combine merits of the traditional CTT and IRT for the purpose of transparency, ease of interpretations, computational simplicity of test scoring and scaling, and practical efficiency. This book shows how practical applications of DSM procedures are facilitated by the inclusion of operationalized guidance for their execution that can be readily translated into computer source codes for popular software packages such as R.
raw score.................................................... 137 raw test score................................................ 6, 65 row marginals................................................ 138 scale of measurement............................................ 5 screening test......................................... 158, 162, 166 standard error of estimate....................................... 120 symmetric............................................... 113, 116 test calibration............................................ 133, 141 test characteristic curve......................................... 142 test constructor................................... 107, 133, 156, 157 test equating........................................... 55, 150, 157 test information............................................... 109 test information function................... 110, 115, 117, 148, 154, 164 three-parameter model.......................................... 28 two-parameter model........................................ 24, 30