Article

A Comparative Study of Measures of Partial Knowledge in Multiple-Choice Tests

Authors: Ben-Simon, Budescu, and Nevo

Abstract

A common belief among many test experts is that measurements obtained from multiple-choice (MC) tests can be improved by using evidence about partial knowledge. A large number of methods designed to extract such information from direct reports provided by examinees have been developed over the last 50 years. Most methods require modifications in test instructions, response modes, and scoring rules. These testing methods are reviewed and the results of a large-scale empirical study of the most promising among them are reported. Seven testing methods were applied to MC tests from four different content areas using a between-persons design. To identify the most efficient methods and the optimal conditions for their application, the results were analyzed with respect to six different criteria. The results showed a surprisingly large tendency on the part of the examinees to take advantage of the special features of the alternative methods and indicated that, on average, high ability examinees were better judges of their level of knowledge and, consequently, could benefit more from these methods. Systematic interactions were found between the testing method and the test content, indicating that no method was uniformly superior.


... It does so by measuring different types of learning outcomes in the areas of knowledge, comprehension, application, analysis and synthesis. Today, multiple-choice tests are the most highly regarded and widely used type of objective test for the measurement of knowledge, ability, or achievement (Ben-Simon, Budescu, & Nevo, 2009; Lee & Winke, 2013). ...
... However, a number of scoring procedures have been implemented (Roja, 2012). Examples of these scoring methods include, among others, negative marking, the partial-credit method, retrospective correction for guessing, number right, logical-choice weighting, and confidence scoring. Ben-Simon et al. (2009) noted that scoring multiple-choice items with the nonconventional partial-credit scoring (PCS) method allows a more accurate measurement of student knowledge. PCS is a method that captures information about a student's level of knowledge with respect to each choice presented for a test item. ...
... (c) Confidence weighting (CW): in CW, students indicate what they believe is the correct answer and how confident they are in their choice. Ben-Simon et al. (2009) compared seven different scoring methods that award partial credit. However, none of the approaches could be regarded as the best method in terms of validity and reliability. ...
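To make the contrast between these scoring families concrete, here is a minimal sketch (in Python) of how a single four-option item might be scored under number right, negative marking, and a simple confidence-weighted rule. The specific credit values and the 0-1 confidence scale are illustrative assumptions, not the rules used in any of the studies cited above.

    # Illustrative per-item scoring under three common rule families.
    # Credit values below are assumptions chosen for illustration only.

    def number_right(is_correct):
        """Dichotomous scoring: 1 for the keyed answer, 0 otherwise."""
        return 1.0 if is_correct else 0.0

    def negative_marking(is_correct, omitted, k=4):
        """Formula scoring: a wrong answer costs 1/(k-1); an omission scores 0."""
        if omitted:
            return 0.0
        return 1.0 if is_correct else -1.0 / (k - 1)

    def confidence_weighted(is_correct, confidence):
        """Confidence weighting: credit (or penalty) scales with the examinee's
        stated confidence in the chosen option, here on a 0-1 scale."""
        return confidence if is_correct else -confidence

    print(number_right(True), negative_marking(False, omitted=False), confidence_weighted(False, 0.7))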
Article
Full-text available
This study investigated the comparison of the psychometric properties of a multiple-choice test using confidence and number right scoring among senior secondary school students in Ibadan metropolis. The study adopted a descriptive design of the survey type. The population for the study consisted of Senior Secondary School two (SSS II) students in Ibadan Metropolis, Oyo state, Nigeria. A sample of 400 Agricultural Science students was selected across 4 Local Government Areas in Ibadan metropolis, using purposive (mainly Agricultural Science students) and random sampling techniques. The instrument used for the study was an Agricultural Science Multiple-choice Test. The 50-item, 4-option Agricultural Science test was administered to the students. Data collected were analyzed using the paired-samples t-test, Kuder-Richardson (KR-21), Cronbach alpha, and Fisher z-test. The results obtained revealed that a significant difference existed in the difficulty indices between Number Right (NR) and the Confidence Scoring Method (CSM), with means of 55.42 and 44.01 respectively. Also, there was a significant difference between CSM and NR in the discrimination indices, with NR and CSM having means of 0.57 and 0.52 respectively. It was found that NR significantly improved the difficulty and discrimination indices. Furthermore, the findings revealed that there was no significant difference between NR and CSM in the reliability coefficient. Based on these findings, it was recommended that the number right scoring method should be used to assess Agricultural Science students' performance because it makes test items appear moderate in terms of difficulty level and is very easy for students to guess the items right. Keywords: Comparison, Psychometric Properties, Multiple Choice Test, Confidence and Number Right Scoring
... Students' ability to guess an answer correctly prevents the professor from determining students' level of competency. Therefore, correct answers are neither evidence of knowledge nor assurance of learning (Ben-Simon, Budescu, & Nevo, 1997). Consequently, alternative MCQ formats have been proposed, but none have been found to be dominant (Lesage et al., 2013), and NR remains the preferred format. ...
... Only students possess information on their individual state of knowledge at the time of the assessment. Ben-Simon et al. (1997) identify that students display five states of knowledge. ...
... Due to the NR feedback challenges, other MCQ formats have been proposed and studied in the literature. Table 1 shows the generic MCQ formats listed in Ben-Simon et al. (1997) and Lesage et al. (2013) and the states of knowledge for which each format allows assessment. In Table 1 the MCQ format options are grouped into three main categories. ...
Article
Full-text available
Management science professors who teach large classes often assess students with multiple‐choice questions (MCQs) because it is efficient. However, traditional MCQ formats are ill‐fitted for constructive feedback. We propose the reward for omission with confidence in knowledge (ROCK) format as an original formative assessment technique to help guide feedback associated with MCQs in an introductory undergraduate management science course. Our study contributes to theory by empirically showing that students can self‐assess their state of knowledge, signal it to the professor, and use proper answering options. In practice, ROCK is an easily implementable MCQ format that allows professors to gain information on student learning based on answers selected. ROCK identifies lack of knowledge or misinformation at both individual and collective levels thus providing opportunities for better feedback in class and during office hours. Limitations of the application of ROCK are also discussed.
... Whereas MC scoring is most often dichotomous-correct or incorrect-CR items are typically scored in either continuous or polytomous scales that attempt to ascribe partial credit to the state of partial knowledge displayed by the test taker. For any given test item, nearly all students will possess some level of relevant partial knowledge, and thus scoring schemes that account for such partial knowledge should prove more reliable than ones that do not (Ben-Simon, Budescu, & Nevo, 1997;Hutchinson, 1982). Furthermore, students often view an opportunity to indicate or demonstrate their partial knowledge as more fair than situations where partial knowledge is unaccounted for (DiBattista, Gosse, Sinnige-Egger, Candale, & Sargeson, 2009). ...
... Several strategies for partial-credit scoring of MC tests have been developed, each with attendant benefits and drawbacks (Akeroyd, 1982;Ben-Simon et al., 1997;Bush, 2015;Frary, 1989). Examples of non-computerized techniques that are suitable for classroom testing include (a) subset selection (Bush, 2001;Dressel & Schmid, 1953), (b) confidence scoring (Gardner-Medwin, 1995), (c) elimination testing (Coombs, Milholland, & Womer, 1956), (d) option weighting (Guttman, 1941;Nedelsky, 1954;Serlin & Kaiser, 1978), and (e) option ordering (de Finetti, 1965;Poizner, Nicewander, & Gettys, 1978). ...
... Typically, a diminishing amount of partial credit is granted for an increasing number of selections made, with the specific scoring scheme at the discretion of the instructor. Thus, unlike many other partial-credit-granting MC techniques, the IFAT format provides straightforward partial-credit scoring that does not require students either to make introspective judgments (Ben-Simon et al., 1997) or to understand probabilities in order to make optimal selections. Rather, students' optimal test-taking strategy is simply to select the best available option, informed by their knowledge of the subject, and to continue doing so until the keyed option is revealed. ...
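As a concrete illustration of the diminishing-credit idea described in the excerpt above, the sketch below awards credit according to the attempt on which the keyed option is finally uncovered on a four-option answer-until-correct item. The particular credit schedule is an assumption; as the excerpt notes, the scheme is at the instructor's discretion.

    # Answer-until-correct (IF-AT style) partial credit: credit shrinks with each
    # additional selection needed to reach the keyed option (illustrative schedule).

    CREDIT_BY_ATTEMPT = {1: 1.0, 2: 0.5, 3: 0.25, 4: 0.0}

    def auc_item_score(attempt_of_keyed_option):
        """Return the credit earned when the keyed option is uncovered on the
        given (1-based) attempt of a four-option item."""
        return CREDIT_BY_ATTEMPT.get(attempt_of_keyed_option, 0.0)

    # A student who scratches the keyed option on the second attempt earns 0.5.
    print(auc_item_score(2))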
Article
Full-text available
The answer-until-correct (AUC) method of multiple-choice (MC) testing involves test respondents making selections until the keyed answer is identified. Despite attendant benefits that include improved learning, broad student adoption, and facile administration of partial credit, the use of AUC methods for classroom testing has been extremely limited. This study presents scoring properties and item analysis for 26 AUC university course examinations, administered using a commercial scratch-card response system. Here, we show that beyond the traditional pedagogical advantages of AUC, the availability of partial credit adds psychometric advantages by boosting both the mean item discrimination and overall test-score reliability, when compared to tests scored dichotomously upon initial response. Furthermore we also find a strong correlation between students’ initial-response successes and the likelihood that they would obtain partial credit when they make incorrect initial responses. Thus, partial credit is being granted based on partial knowledge that remains latent in traditional MC tests. The fact that these advantages are realized in real-life classroom tests may motivate further expansion of the use of AUC MC tests in higher education.
... Elimination testing [26][27][28] allows for assessing partial knowledge as students can indicate for each of the offered alternatives whether they consider it correct or not. When students want to indicate one alternative as the correct answer, they eliminate all but this alternative. ...
... Elimination testing with traditional scoring results in a maximum score of +1 and a maximum penalty of -1, with different scores for different levels of partial knowledge and partial misconception (Table 3; a simplified sketch of this scoring rule follows these excerpts). It has been shown that elimination testing with traditional scoring results in an increased average score compared with negative marking [22,28,31,32], and in [26] this held for general knowledge and mathematical reasoning questions, but not for figural reasoning and general reasoning. ...
... As a decrease in the total number of questions negatively affects test reliability, the increased time required by elimination testing with adapted scoring is of concern. Earlier studies of the reliability of elimination testing with traditional scoring showed mixed results: Hakstian and Kansup [8] and Jaradat and Tollefson [37] reported little, if any, improvement in reliability relative to traditional scoring methods, while Ben-Simon et al. [26] and Bradbard et al. [31] reported better reliability relative to negative marking. ...
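A minimal sketch of the traditional elimination-scoring rule as described in the excerpts above (maximum score +1, maximum penalty -1): each correctly eliminated distractor adds 1/(k-1), and eliminating the keyed answer subtracts 1. Published variants differ in detail, so treat the exact rule below as an assumption.

    # Elimination testing, traditional scoring (simplified sketch).
    # The examinee marks every option believed to be incorrect on a k-option item.

    def elimination_score(eliminated, key, options):
        k = len(options)
        distractors_eliminated = len(set(eliminated) - {key})
        score = distractors_eliminated / (k - 1)   # reward for partial knowledge
        if key in eliminated:
            score -= 1.0                           # penalty for (partial) misconception
        return score

    # Eliminating 2 of 3 distractors on a 4-option item earns 2/3;
    # eliminating only the keyed answer yields the maximum penalty of -1.
    print(elimination_score({"B", "C"}, key="A", options={"A", "B", "C", "D"}))
    print(elimination_score({"A"}, key="A", options={"A", "B", "C", "D"}))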
Article
Full-text available
Background and hypotheses This study is the first to offer an in-depth comparison of elimination testing with the scoring rule of Arnold & Arnold (hereafter referred to as elimination testing with adapted scoring) and negative marking. As such, this study is motivated by the search for an alternative to negative marking that still discourages guessing, but is less disadvantageous for non-relevant student characteristics such as risk aversion and does not result in grade inflation. The comparison is structured around seven hypotheses: in comparison with negative marking, elimination testing with adapted scoring leads to (1) a similar average score (no grade inflation); (2) students expressing their partial knowledge; (3) a decrease in the number of blank answers; (4) no gender bias in the number of blank answers; (5) a reduction in guessing; (6) a decrease in self-reported test anxiety; and finally (7) students preferring elimination testing with adapted scoring over negative marking. Methodology To investigate the above hypotheses, this study implemented elimination testing with adapted scoring and negative marking in real exam settings in two courses in a Faculty of Medicine at a large university. Due to changes in the Master of Medicine programme, the same two courses were taught to students of both the 1st and 2nd master's year in the same semester. Given that both student groups could take the same exam with different test instructions and scoring methods, a unique opportunity arose in which elimination testing with adapted scoring and negative marking could be compared in a high-stakes testing situation. After receiving their grades on the exams, students received a questionnaire to assess their experiences. Findings The statistical analysis, taking into account student ability and gender, showed that elimination testing with adapted scoring is a valuable alternative to negative marking when looking for a scoring method that discourages guessing. In contrast to traditional scoring of elimination testing, elimination testing with adapted scoring does not result in grade inflation in comparison with negative marking. This study showed that elimination testing with adapted scoring reduces blank answers and found strong indications of a reduction in guessing in comparison with negative marking. Finally, students preferred elimination testing with adapted scoring over negative marking and reported lower stress levels with elimination testing with adapted scoring than with negative marking.
... A review of various competence assessments that implemented different response formats shows two widely applied aggregation methods for polytomous variables. First, the All-or-Nothing scoring rule is very common: subjects receive full credit only if all answers on subtasks are correct (Ben-Simon, Budescu, & Nevo, 1997). If at least one subtask is answered incorrectly, the person receives no credit. ...
... This method makes use of dichotomous scoring and is implemented for CMC items in the study "Teacher Education and Development Study in Mathematics" (TEDS-M, see Blömeke, Kaiser, & Lehmann, 2010). Another established method of dealing with CMC items is the Number Correct (NC) scoring rule, which rewards partial knowledge, meaning that partial credit is given for each correctly solved subtask of a CMC item (see Ben-Simon et al., 1997). To apply the NC scoring rule, the subtasks of CMC items are combined into a composite score, and each of the categories receives partial credit according to the number of correctly answered subtasks (a short sketch of both aggregation rules follows these excerpts). ...
... In the field of CTT, different methods and principles for weighting items have been established (Ben-Simon et al., 1997; Kline, 2005; Stucky, 2009). Overall, the weighting of items is usually performed using a statistical or theoretical approach. ...
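The two aggregation rules described in the excerpts above can be written out in a few lines; the sketch below assumes a CMC item whose subtasks have already been scored dichotomously.

    # Aggregating the dichotomous subtasks of a complex multiple-choice (CMC) item.

    def all_or_nothing(subtask_correct):
        """Full credit only if every subtask is answered correctly."""
        return 1 if all(subtask_correct) else 0

    def number_correct(subtask_correct):
        """Partial credit: one category per correctly solved subtask."""
        return sum(subtask_correct)

    responses = [True, True, False, True]   # four "yes/no" subtasks
    print(all_or_nothing(responses))        # 0
    print(number_correct(responses))        # 3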
Chapter
In order to precisely assess the cognitive achievement and abilities of students, different types of items are often used in competence tests. In the National Educational Panel Study (NEPS), test instruments also consist of items with different response formats, mainly simple multiple choice (MC) items in which one answer out of four is correct and complex multiple choice (CMC) items comprising several dichotomous “yes/no” subtasks. The different subtasks of CMC items are usually aggregated to a polytomous variable and analyzed via a partial credit model. When developing an appropriate scaling model for the NEPS competence tests, different questions arose concerning the response formats in the partial credit model. Two relevant issues were how the response categories of polytomous CMC variables should be scored in the scaling model and how the different item formats should be weighted. In order to examine which aggregation of item response categories and which item format weighting best models the two response formats of CMC and MC items, different procedures of aggregating response categories and weighting item formats were analyzed in the NEPS, and the appropriateness of these procedures to model the data was evaluated using certain item fit and test fit indices. Results suggest that a differentiated scoring without an aggregation of categories of CMC items best discriminates between persons. Additionally, for the NEPS competence data, an item format weighting of one point for MC items and half a point for each subtask of CMC items yields the best item fit for both MC and CMC items. In this paper, we summarize important results of the research on the implementation of different response formats conducted in the NEPS.
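A small sketch of the item-format weighting the chapter reports as fitting the NEPS competence data best (one point per MC item and half a point per CMC subtask); the test composition in the example is invented.

    # Item-format weighting as described above for the NEPS competence data:
    # each MC item contributes 1 point, each CMC subtask contributes 0.5 points.

    def max_weighted_score(n_mc_items, cmc_subtask_counts):
        return n_mc_items * 1.0 + sum(0.5 * n for n in cmc_subtask_counts)

    # Hypothetical test: 20 MC items plus three CMC items with 4, 5 and 4 subtasks.
    print(max_weighted_score(20, [4, 5, 4]))   # 20 + 6.5 = 26.5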
... If the student is correct, they receive a fractional mark relative to their 'bet' for that question. One of the noted benefits of the subset type of test is that students can receive partial recognition for partial knowledge by eliminating at least one known incorrect choice (19). ...
... This result is complementary to previous studies (19,20). Bryman suggests that there are several key benefits of integrating qualitative and quantitative research (21). ...
Preprint
Full-text available
Background Traditional single best answer multiple-choice questions (MCQs) are a proven and ubiquitous assessment tool. By their very nature, MCQs prompt students to guess a correct outcome when unsure of the answer, which may lead to a reduced ability to reliably assay student knowledge. Moreover, the traditional Single Best Answer Test (SBAT) offers binary feedback (correct or incorrect) and therefore offers no feedback or enhancement of the student learning journey. Confidence-based Answer Tests (CBATs) are designed to improve reliability because participants are not forced to guess where they cannot choose between two or more alternative answers which they may favour equally. CBATs enable students to reflect on their knowledge and better appreciate where their mastery of a particular subject may be weaker. Although CBATs can provide richer feedback to students and improve the learning journey, their use may be limited if they significantly alter student scores or grades, which may be viewed negatively. The aim of this study was to compare performance across these test paradigms, to investigate if there are any systematic biases present. Methods Thirty-four first-year optometry students and 10 lecturers undertook a test comprising 40 questions. Each question was completed using two specified test paradigms: for the first, participants were allowed to weight their answers based on confidence (CBAT); for the second, they selected a single best answer (SBAT). Upon test completion, students undertook a survey comprising both Likert scale and open-ended responses regarding their experience and perspectives on the CBAT and SBAT multiple-choice test paradigms. These were analysed thematically. Results There was no significant difference between paradigms, with a median difference of 1.25% (p = 0.313, Kruskal-Wallis) in students and 3.33% (p = 0.437, Kruskal-Wallis) in staff. The survey indicated that students had no strong preference towards a particular method. Conclusions Since there was no significant difference between test paradigms, this validates implementation of the confidence-based paradigm as an equivalent and viable option to traditional MCQs, but with the added potential benefit that, if coupled with reflective practice, it can provide students with a richer learning experience. There is no inherent bias within one method over another.
... Before the mid-nineteenth century, oral examinations were the primary means of educational testing. Eventually, written tests in the form of essay questions were introduced to replace oral examinations (Ben-Simon et al., 1997). Studies done in the early part of the twentieth century showed that essay tests tended to be highly subjective and unreliable in measuring students' performance. ...
... Multiple choice tests were first used in 1917 for the selection and classification of military personnel for the United States Army (Ebel, 1979). Today, multiple choice (MC) tests are the most highly regarded and widely used objective test for measuring knowledge, ability, or performance (Ben-Simon et al., 1997). ...
... Multiple-choice tests, an assessment instrument used in the assessment and evaluation activities pursued in education, are among the most frequently used tools. It can be said that multiple-choice tests are one of the most objective techniques utilized in the assessment of variables such as knowledge, skill and success (Ben-Simon, Budescu, & Nevo, 1997). Apart from being objective, the following can be seen as the reasons why multiple-choice tests are preferred: their ease of administration and scoring, their effectiveness in assessment at most levels of learning in the cognitive and affective domains, the provision of reliable and valid results, the possibility of application to a large number of students at one and the same time, the possibility of predicting item and test characteristics, the ability to cover a wide range of content, and applicability with large groups of students (Kurz, 1999; Turgut, 1971). ...
... Among the measurement tools used in the measurement and evaluation activities carried out in education, multiple-choice tests are today among the most frequently used. It can be said that multiple-choice tests are one of the most objective techniques used in measuring variables such as knowledge, ability and achievement (Ben-Simon et al., 1997). In addition to being objective, their ease of administration and scoring, their ability to carry out measurement at most levels of learning in the cognitive and affective domains, their provision of reliable and valid results, their applicability to a large number of students at the same time, the possibility of estimating item and test characteristics, their coverage of a broad range of content, their applicability with large groups, etc. ...
Article
Full-text available
This study examines the effect of self-assessment-based correction for chance success on the psychometric characteristics of the test. First, the data were cleared of chance success by means of the correction-for-guessing formula and self-assessment, and then statistical analyses were conducted. Item discriminations showed an increase when the correction-for-guessing formula was used; when self-assessment was used, they showed variability. Test validity increased when the correction formula was used; when self-assessment was used, a slight decrease was observed. Besides, this study examined the effect of correction for chance success upon corrected self-assessment based on the IRT guessing parameter. It was observed that the data that were not corrected for chance scores had higher guessing parameters than those corrected in accordance with self-assessment. In addition, it was evident that the difference between the guessing parameters of the uncorrected data and the data cleared of chance scores by means of self-assessment was significant. It was also revealed that self-assessment-based correction for chance success has an advantage over the classical correction-for-guessing formula with respect to the psychometric characteristics of the test.
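For reference, the classical correction-for-guessing (formula-scoring) rule used in studies such as this one subtracts a fraction of the wrong answers from the number right: corrected score = R - W/(k-1), with R right answers, W wrong answers and k options per item. A minimal sketch:

    # Classical correction-for-guessing (formula scoring).
    # Omitted items are not counted; k is the number of options per item.

    def corrected_score(num_right, num_wrong, k):
        return num_right - num_wrong / (k - 1)

    # Example: 30 right and 12 wrong on a 4-option test -> 30 - 12/3 = 26.
    print(corrected_score(30, 12, k=4))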
... Beyond traditional "Scantron" 6 -type formats that score MC questions dichotomously as either right or wrong, numerous alternative formats and scoring schemes have been devised over the past century to allow the granting of partial credit in an effort to better gauge the students' level of partial knowledge, 7−10 often improving test reliability. 9,11,12 Examples of such approaches include manipulation of the choices given to students so that options contain different combinations of primary responses, only some of which are true (complex multiple-choice, type K, true-false or type X, and multiple-response formats); 13 manipulation of the stems by asking students for predictive or evaluative assessments of a scenario rather than simply recounting knowledge; 13 confidence or probability weighting of options, 8 and the "multiple response format" in which multiple stages are created within each multiple-choice item, with scores weighted according to whether the reasoning is correct. 14 All these schemes suffer from, as Ben Simon et al. relate, 9 the challenge of (mis-)interpreting the intention and state of knowledge of the student. ...
Article
There are numerous benefits to answer-until-correct (AUC) approaches to multiple-choice testing, not the least of which is the straightforward allotment of partial credit. However, the benefits of granting partial credit can be tempered by the inevitable increase in test scores and by fears that such increases are further contaminated by a large random guessing component. We have measured the effects of using the immediate feedback assessment technique (IF-AT), a commercially available AUC response system, on the scores of a typical first-year chemistry multiple-choice test. We find that with a particular commonly used scoring scheme the test scores from IF-AT deployment are 6–7 percentage points higher than from Scantron deployment. This amount is less than that suggested by previous studies, where the mark increase was calculated in a purely post hoc manner and thus neglected affective changes of students’ behavior associated with the IF-AT technique. Furthermore, we have strong evidence that partial credit is awarded in a highly rational manner in accordance with the students’ level of understanding.
... [Flattened table excerpt listing equivalent names used in the literature for penalty-based ("correct minus incorrect") scoring rules, together with their sources: fair penalty, conventional formula scoring, conventional correction-for-guessing formula, 'neutral' counter-marking, CG scoring, negative marking, correction-for-chance formula, discouraging guessing, rights-minus-wrongs correction, classical score, mixed rule, correct-minus-incorrect (C-I) score, T-F formula, guessing penalty, logical marking, penal guessing formula, 1 right minus 1 wrong, and 1 right minus 2 wrong, among others.] ...
Article
Full-text available
Background: Single-choice items (eg, best-answer items, alternate-choice items, single true-false items) are 1 type of multiple-choice items and have been used in examinations for over 100 years. At the end of every examination, the examinees’ responses have to be analyzed and scored to derive information about examinees’ true knowledge. Objective: The aim of this paper is to compile scoring methods for individual single-choice items described in the literature. Furthermore, the metric expected chance score and the relation between examinees’ true knowledge and expected scoring results (averaged percentage score) are analyzed. Besides, implications for potential pass marks to be used in examinations to test examinees for a predefined level of true knowledge are derived. Methods: Scoring methods for individual single-choice items were extracted from various databases (ERIC, PsycInfo, Embase via Ovid, MEDLINE via PubMed) in September 2020. Eligible sources reported on scoring methods for individual single-choice items in written examinations including but not limited to medical education. Separately for items with n=2 answer options (eg, alternate-choice items, single true-false items) and best-answer items with n=5 answer options (eg, Type A items) and for each identified scoring method, the metric expected chance score and the expected scoring results as a function of examinees’ true knowledge using fictitious examinations with 100 single-choice items were calculated. Results: A total of 21 different scoring methods were identified from the 258 included sources, with varying consideration of correctly marked, omitted, and incorrectly marked items. Resulting credit varied between –3 and +1 credit points per item. For items with n=2 answer options, expected chance scores from random guessing ranged between –1 and +0.75 credit points. For items with n=5 answer options, expected chance scores ranged between –2.2 and +0.84 credit points. All scoring methods showed a linear relation between examinees’ true knowledge and the expected scoring results. Depending on the scoring method used, examination results differed considerably: Expected scoring results from examinees with 50% true knowledge ranged between 0.0% (95% CI 0% to 0%) and 87.5% (95% CI 81.0% to 94.0%) for items with n=2 and between –60.0% (95% CI –60% to –60%) and 92.0% (95% CI 86.7% to 97.3%) for items with n=5. Conclusions: In examinations with single-choice items, the scoring result is not always equivalent to examinees’ true knowledge. When interpreting examination scores and setting pass marks, the number of answer options per item must usually be taken into account in addition to the scoring method used.
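The two quantities analyzed in this review can be reproduced in a few lines: the expected chance score of blind guessing on an n-option item, and the expected percentage score of an examinee with a given proportion of true knowledge who, in this simplified sketch, is assumed to guess at random on every unknown item. The per-item credits passed in are placeholders; the paper compares 21 different scoring methods.

    # Expected chance score and expected percentage score under a generic
    # per-item scoring rule (credit for a correct mark, credit for a wrong mark).

    def expected_chance_score(n_options, credit_correct, credit_wrong):
        """Expected credit from blindly guessing on one n-option item."""
        p = 1.0 / n_options
        return p * credit_correct + (1 - p) * credit_wrong

    def expected_percentage_score(true_knowledge, n_options, credit_correct, credit_wrong):
        """Expected score (% of maximum) when a proportion of items is known
        and the remaining items are answered by random guessing."""
        per_item = (true_knowledge * credit_correct
                    + (1 - true_knowledge)
                    * expected_chance_score(n_options, credit_correct, credit_wrong))
        return 100 * per_item / credit_correct

    # Number-right scoring (+1 / 0) on 5-option items with 50% true knowledge:
    print(expected_percentage_score(0.5, 5, 1.0, 0.0))   # 60.0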
... It has also been suggested [13, 3, 1, 11, 8, 44, 32, 20, 18, 22, 37, among others] that the space of possible responses to a multiple-choice test item can be scored in a much more nuanced way by assigning "partial credit" to test-takers who indicate correctly that they know some options are wrong, rather than hazard a guess on the right answer. The added complexity of choice does not seem to pose any significant problems for the test-takers [4]. Besides the greater discriminatory power that the test is supposed to achieve like this by extending the effective scoring range, it is also an instrument to penalize blind guessing by rewarding the expression of "partial knowledge". ...
Article
Full-text available
In multiple-choice tests, guessing is a source of test error which can be suppressed if its expected score is made negative by either penalizing wrong answers or rewarding expressions of partial knowledge. Starting from the most general formulation of the necessary and sufficient scoring conditions for guessing to lead to an expected loss beyond the test-taker’s knowledge, we formulate a class of optimal scoring functions, including the proposal by Zapechelnyuk (Econ. Lett. 132, 24–27 (2015)) as a special case. We then consider an arbitrary multiple-choice test taken by a rational test-taker whose knowledge of a test item is defined by the fraction of the answer options which can be ruled out. For this model, we study the statistical properties of the obtained score for both standard marking (where guessing is not penalized), and marking where guessing is suppressed either by expensive score penalties for incorrect answers or by different marking schemes that reward partial knowledge.
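The core scoring condition the abstract refers to can be stated compactly for a simple right/wrong marking scheme: a blind guess among the k options a test-taker cannot rule out has expected score (1/k)*s_right + (1 - 1/k)*s_wrong, and suppressing guessing requires this to be non-positive for every k the test-taker might face. The check below is a simplified sketch of that idea, not the general conditions derived in the paper.

    # Does a right/wrong marking scheme make guessing unprofitable for every
    # possible number k of options the test-taker cannot rule out?

    def guessing_unprofitable(s_right, s_wrong, n_options):
        return all((1 / k) * s_right + (1 - 1 / k) * s_wrong <= 0
                   for k in range(2, n_options + 1))

    # With s_right = 1 on 4-option items, the usual -1/3 penalty only makes a
    # fully blind guess (k = 4) break even; once one option can be ruled out,
    # guessing pays, so a harsher penalty is needed to suppress it entirely.
    print(guessing_unprofitable(1.0, -1/3, 4))   # False
    print(guessing_unprofitable(1.0, -1.0, 4))   # True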
... Among many solutions, developing partial-credit MC items is expected to improve science test reliability and validity (Fulmer et al., 2014). Though the prior work on PCS shows some promising findings, the use of such partial-credit MC items is limited for various reasons, such as time and administration cost (Ben-Simon et al., 1997). One main criticism is that few of the proposed PCS approaches directly and explicitly link the partial credits to students' cognitive proficiency. ...
... The proposed HERA measurement model was designed to assess knowledge when feedback and hints were provided. Examples of previous work in this area are assessing partial knowledge (Ben-Simon et al., 1997), assessing knowledge when feedback and multiple attempts are provided (Attali and Powers, 2010; Attali, 2011), and assessing knowledge/ability when a hint is used (Bolsinova and Tijmstra, 2019). In addition, ACT is considering the application of learning models; these include Bayesian Knowledge Tracing applied to a subset of (correct/incorrect) responses and the Elo algorithm. ...
Article
Full-text available
In the past few years, our lives have changed due to the COVID-19 pandemic; many of these changes resulted in pivoting our activities to a virtual environment, forcing many of us out of traditional face-to-face activities into digital environments. Digital-first learning and assessment systems (LAS) are delivered online, anytime, and anywhere at scale, contributing to greater access and more equitable educational opportunities. These systems focus on the learner or test-taker experience while adhering to the psychometric, pedagogical, and validity standards for high-stakes learning and assessment systems. Digital-first LAS leverage human-in-the-loop artificial intelligence to enable personalized experience, feedback, and adaptation; automated content generation; and automated scoring of text, speech, and video. Digital-first LAS are a product of an ecosystem of integrated theoretical learning and assessment frameworks that align theory and application of design and measurement practices with technology and data management, while being end-to-end digital. To illustrate, we present two examples—a digital-first learning tool with an embedded assessment, the Holistic Educational Resources and Assessment (HERA) Science, and a digital-first assessment, the Duolingo English Test.
... It may also be that it is unreasonable to expect CA to have a measurable impact on a distant outcome, such as overall mathematics attainment in a summative assessment. The CA process could also be contrary to equity if, as is plausible, high-attaining students are more likely to be better judges of their level of knowledge, and therefore more able to benefit (Ben-Simon et al., 1997). It may also be that teacher professional development is necessary in order to see students' attainment improve measurably from CA. ...
Article
Full-text available
Confidence assessment (CA) involves students stating alongside each of their answers a confidence rating (e.g. 0 low to 10 high) to express how certain they are that their answer is correct. Each student’s score is calculated as the sum of the confidence ratings on the items that they answered correctly, minus the sum of the confidence ratings on the items that they answered incorrectly; this scoring system is designed to incentivize students to give truthful confidence ratings. Previous research found that secondary-school mathematics students readily understood the negative-marking feature of a CA instrument used during one lesson, and that they were generally positive about the CA approach. This paper reports on a quasi-experimental trial of CA in four secondary-school mathematics lessons (N = 475 students) across time periods ranging from 3 weeks up to one academic year, compared to business-as-usual controls. A meta-analysis of the effect sizes across the four schools gave an aggregated Cohen’s d of –0.02 [95% CI –0.22, 0.19] and an overall Bayes Factor B01 of 8.48. This indicated substantial evidence for the null hypothesis that there was no difference between the attainment gains of the intervention group and the control group, relative to the alternative hypothesis that the gains were different. I conclude that incorporating confidence assessment into low-stakes classroom mathematics formative assessments does not appear to be detrimental to students’ attainment, and I suggest reasons why a clear positive outcome was not obtained.
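The CA scoring rule described in the abstract is simple to state: a student's score is the sum of their confidence ratings on correctly answered items minus the sum on incorrectly answered items, as in the short sketch below (the example data are invented).

    # Confidence assessment (CA) scoring: sum of confidence ratings (0-10) on
    # correct answers minus the sum of ratings on incorrect answers.

    def ca_score(ratings, correct):
        return sum(r if ok else -r for r, ok in zip(ratings, correct))

    # Three items rated 8, 3 and 10, with the last answer incorrect:
    print(ca_score([8, 3, 10], [True, True, False]))   # 8 + 3 - 10 = 1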
... Among many solutions, developing partial-credit MC items is expected to improve science test reliability and validity (Fulmer et al., 2014). Though the prior work on PCS shows some promising findings, the use of such partial-credit MC items is limited for various reasons, such as time and administration cost (Ben-Simon et al., 1997). One main criticism is that few of the proposed PCS approaches directly and explicitly link the partial credits to students' cognitive proficiency. ...
Article
Full-text available
This study provides a partial credit scoring (PCS) approach to awarding students’ performance on multiple-choice items in science education. The approach is built on fundamental ideas, the critical pieces of students’ understanding and knowledge to solve science problems. We link each option of the items to several specific fundamental ideas to capture their mastery patterns when an option is selected. Using these mastery patterns to order the options of items and accordingly assign credits to students, we measure students’ cognitive proficiency without including additional measures (e.g., times of trial) to the test or requiring extra support (e.g., technology). Using many-facet Rasch analysis, we find that the ordered options students selected were aligned with their ability measures. The PCS yields robust psychometric quality; compared to the dichotomous scoring of multiple-choice items, it generates better item fit and separation parameters. Besides, this PCS approach helps address construct validity by modelling student responses at the option level to reflect students’ mastery of fundamental ideas.
... There is a large body of work asking people about their level of confidence in answering a question (see Wright & Ayton, 1994). While the confidence level people provide is a window into their subjective feeling of knowing, research has shown that people are often miscalibrated; typically, confidence judgments are found to be systematically biased, reflecting overconfidence (e.g., Ben-Simon, Budescu, & Nevo, 1997; Koriat & Lieblich, 1977; Koriat, Lichtenstein, & Fischhoff, 1980; Lichtenstein, Fischhoff & Phillips, 1982; Wallsten & Budescu, 1983; Wright & Ayton, 1994). In an SAT setting with feedback about the correctness of responses, TTs might learn to become better calibrated as the test progresses, and choices may be changed as a result. ...
... With essays and other types of constructed-response test items, it is common practice to reward students' partial knowledge by awarding partial credit. However, the awarding of partial credit in the context of MC testing is quite uncommon, and although several methods have been developed for this purpose, these tend to be cumbersome and to have a variety of shortcomings (Ben-Simon, Budescu, & Nevo, 1997). In contrast, the IFAT's answer-until-correct format makes it a simple matter for the instructor to award partial credit on MC items. ...
Article
The Immediate Feedback Assessment Technique (IFAT) is a new multiple-choice response form that has advantages over more commonly used response techniques. The IFAT, which is commercially available at reasonable cost and can be used conveniently with large classes, has an answer-until-correct format that provides students with immediate, corrective, item-by-item feedback. Advantages of this learner-centered response form are that it: (a) actively promotes learning; (b) allows students’ partial knowledge to be rewarded with partial credit; (c) is strongly preferred by students over other response techniques; and (d) lets instructors more easily maintain the security of multiple choice (MC) items so that they can be reused from one semester to the next. The IFAT’s major shortcoming is that grading must be done manually because it does not yet have a compatible optical scanning device. Helpful suggestions are presented for instructors who may be considering using the IFAT for the first time.
... The questions for Part 2 and Part 3 were developed based on multiple-choice questions because this is the most common and widely used assessment tool for the measurement of knowledge, ability and complex learning outcomes (Gronlund, 1993; Ben-Simon, Budescu and Nevo, 1997) and is extensively used (Merwe, 2015; Teck et al., 2015). ...
... Each item type has its own advantages and disadvantages. The advantages that multiple choice items provide can be listed as follows: (a) scoring the items and analysing these scores easily (Bible, Simkin, & Kuechler, 2008), (b) preventing students from losing points based on language deficiencies such as grammar, writing or punctuation errors (Zeidner, 1987), (c) being free of scorer bias (Walstad, 1998), (d) being able to construct the test with empirical proof (item analyses etc.) (Ben-Simon, Budescu, & Nevo, 1997), (e) being able to collect data on a large scale in an effective and easy way (Dufresne, Leonard, & Gerace, 2002). However, it also has disadvantages, such as the requirement of a large sample size to develop an MC test with a high degree of reliability (Bacon, 2003), the possibility of arriving at the correct answer by eliminating the other alternatives (Bush, 2001; Hobson & Ghoshal, 1996), the difficulty of preparing the test when there is no test bank at hand (Brown, Bull, & Pendlebury, 1997) and the fact that items of MC tests that are not written in accordance with test-writing principles conceal students' knowledge rather than disclose it (Dufresne et al., 2002). ...
Article
Full-text available
The aim of the present study was to reveal the psychometric properties of the items in the cognitive test of PISA 2015 assessing scientific literacy according to different item types and to examine scientific literacy in relation to different independent variables. From the PISA 2015 Turkey sample of 5895 students, 175 students were included in the study. Descriptive statistics and various hypothesis tests were used to obtain the findings of the research. When scientific literacy and item difficulty averages for all three item types were examined, students with a high level of scientific literacy were more successful at responding to constructed response (CR) items, while students with a low level of scientific literacy were more successful at answering multiple choice (MC) items. Male students were more successful than female students in responding to MC and complex multiple choice (CMC) items, while female students were more successful than male students in answering CR items. It was found that students with a high level of economic, social and cultural status were more successful than those with a low level of economic, social and cultural status.
... Sometimes, subjects who have partial knowledge omit items with positive expected reward when they are penalized for wrong answers, while they might respond under a reward formula scoring rule. According to Ben-Simon, Budescu, and Nevo (1997), the correction for guessing formula ignores the partial knowledge of students in many cases. Therefore, by providing a method that allows students to choose the scoring rule in an MCQ exam, we contribute to the debate in the literature on the educational measurement of knowledge in relation to the appropriate method of correction (Bar-Hillel, Budescu, & Attali, 2005;Diamond & Evans, 1973;Frary, 1988). ...
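The point made in this excerpt can be illustrated with a short expected-value calculation: under the standard correction-for-guessing penalty of -1/(k-1), an examinee who can eliminate even one distractor already has a positive expected score from answering, so omitting (scored zero) forfeits credit for partial knowledge. The numbers below are an illustrative case, not data from any of the cited studies.

    # Expected score of answering vs. omitting under formula scoring, for an
    # examinee who can rule out `eliminated` of the k-1 distractors.

    def expected_answer_score(k, eliminated):
        remaining = k - eliminated            # options still considered plausible
        p_correct = 1.0 / remaining
        penalty = -1.0 / (k - 1)              # standard correction-for-guessing
        return p_correct * 1.0 + (1 - p_correct) * penalty

    # 4-option item with two distractors ruled out:
    # E[answer] = 0.5 - 0.5/3 = 1/3 > 0, while an omission scores 0.
    print(expected_answer_score(k=4, eliminated=2))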
Article
This article presents a novel experimental methodology in which groups of students were offered the option to choose between two equivalent scoring rules to assess a multiple‐choice test. The effect of choosing the scoring rule on marks is tested. Two major contributions arise from this research. First, it contributes to the literature on the value of choice. Second, it also contributes to the literature on the educational measurement of knowledge. The results suggest that choice could positively affect students' scores. However, students need to learn to choose the assessment method. Moreover, women seem to obtain greater benefits from the option of choosing the scoring rule.
... Safety on construction sites is of utmost importance, which makes it fundamental to establish the level of competency, training and knowledge among Malaysia's construction personnel. This is necessary because incompetency and inadequacy in these areas among construction personnel may increase the risk factors in the occurrence of accidents, incidents, injuries, fatalities and loss of property on construction sites [1][2][3][4]. It is worth mentioning that a total of 763 construction accidents were filed in Malaysia between 2007 and 2012, and 422 of them, a concerning 55%, involved fatalities, as recorded by the Department of Occupational Safety and Health Malaysia, Ministry of Human Resources [5]. ...
Article
Full-text available
Competency in safety is important for construction personnel, and it is compulsory for all construction personnel in Malaysia to attend safety training/courses. A literature review of the recommended safety module revealed gaps in evaluating the effectiveness of safety cognition among construction personnel. Therefore, this paper investigates safety cognition based on safety education. A structured, self-administered questionnaire was designed and used to assess the level of safety cognition in safety education. The results show that the safety cognition of construction personnel is still at a moderate level and that there are differences in the level of safety cognition among construction personnel on the Occupational Safety module.
... Multiple choice questions can be difficult to write, especially if a teacher wants his/her students to go beyond recall of information, but the exams are easier to grade than essay or short-answer exams [5]. Multiple choice tests are the most common and perhaps the best tool for objective measurement of knowledge, ability or achievement because of their objectivity, simplicity, and automatic scoring, as well as the possibility of modifying a test based on empirical evidence [6]. Additionally, multiple choice-based tests create a lower level of anxiety among students in comparison to essay-type tests because the options on multiple choice tests are made available to students [7][8]. ...
Article
Full-text available
In the field of education, students' assessment provides important data on the knowledge, skills, attitudes, and beliefs of students, which are used to refine programs and improve student learning. With the help of assessment data, a teacher can track students' progress, plan his/her lessons more effectively and can also motivate his/her students by providing an accurate measure of his/her progress. Assessment can be done by using many methods such as multiple-choice question-based tests and essay or short answer type tests, etc. Whichever assessment method is used, it should enhance student learning in relation to educational goals. Assessment should correctly evaluate students' performance against curricular goals. The present research paper analyzes two popular methods of assessment in India, which are essay/short answer type assessment and multiple-choice question-based assessment. A small study was conducted on 108 students of Bachelor of Technology 8th Semester (IV Year) in order to investigate the two methods. The group of students was examined by using the essay type method and the MCQ type method in TCIE-1 and TCIE-2 respectively for the same subject, which is power system design. The Wilcoxon signed rank test was applied using MATLAB to test the hypothesis of a significant difference between scores of students in these two assessments. Moreover, the correlation coefficient was also calculated to show that students require a different skill set to secure good marks in these two methods. So, these two methods should be used to address different assessment purposes.
... Second, our participants did not receive any specific scoring instructions. Test-taker responses are known to be influenced by factors such as how incorrect selections and omissions are scored (Ben-Simon, Budescu, & Nevo, 1997;Lord, 1975). The results on affirmative selections may have been different had we provided explicit scoring instructions. ...
Article
The current study investigated how item formats and their inherent affordances influence test‐takers’ cognition under uncertainty. Adult participants solved content‐equivalent math items in multiple‐selection multiple‐choice and four alternative grid formats. The results indicated that participants’ affirmative response tendency (i.e., judge the given information as True) was affected by the presence of a grid, type of grid options, and their visual layouts. The item formats further affected the test scores obtained from the alternatives keyed True and the alternatives keyed False, and their psychometric properties. The current results suggest that the affordances rendered by item design can lead to markedly different test‐taker behaviors and can potentially influence test outcomes. They emphasize that a better understanding of the cognitive implications of item formats could potentially facilitate item design decisions for large‐scale educational assessments.
... Given their widespread use, there are ample opportunities for integrating assessments of student uncertainties into multiple-choice questions, which can be particularly useful in large engineering classes. Of course, the consideration of student uncertainties, self-confidence, partial knowledge, and guessing on multiple choice tests is not new and has been studied in the educational literature (e.g., Burton 2005;Bereby-Meyer et al. 2003;Burton 2001;Burton & Miller 1999;Ben-Simon et al. 1997;Hassmén & Hunt 1994). ...
... It is also useful in providing both examinees and examiners with more differentiated feedback on the kinds of misconceptions or problems examinees have, and in facilitating remedial instruction for future learning. Ben-Simon, Budescu, and Nevo [13] concluded in their comparative study of several scoring methods in MC tests that no response method was uniformly best across criteria and content domains, but the current study shows that elimination scoring can be more neutral (with respect to risk aversion) and, hence, a good alternative to correction for guessing in MC tests. ...
Chapter
Administering multiple-choice questions with correction for guessing fails to take partial knowledge into account and may introduce a bias, as examinees may differ in their willingness to risk guessing the correct answer when they do not have full knowledge. In the latter case, elimination scoring gives examinees the opportunity to express their partial knowledge, since this alternative scoring procedure requires examinees to eliminate all the response alternatives they consider to be incorrect. The current simulation study investigates how these two scoring procedures affect the response behaviors of examinees who differ not only in ability but also in their attitude toward risk. Combining a psychometric model accounting for ability and item difficulty with decision theory accounting for individual differences in risk aversion, a two-step response-generating model is proposed to predict the expected answering patterns on given multiple-choice questions. The results of the simulations show that, overall, there are no substantial differences in the answering patterns of examinees at both ends of the ability continuum under the two scoring procedures, suggesting that ability has a predominant effect on the response patterns. Compared to correction for guessing, elimination scoring leads to fewer full-score responses and more demonstrations of partial knowledge, especially for examinees with intermediate success probabilities on the items. Only for those examinees does risk aversion have a decisive impact on the expected answering patterns.
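For readers unfamiliar with the two scoring procedures being compared, the equations below give the standard correction-for-guessing (formula scoring) rule and one common form of elimination scoring for a k-option item. The exact constants vary across implementations, so these should be read as representative forms rather than as the specific rules used in the chapter above.

```latex
% Correction for guessing (formula scoring) for a k-option item,
% with R = number right and W = number wrong:
S_{\text{CFG}} = R - \frac{W}{k - 1}

% A common elimination-scoring rule (after Coombs): the examinee crosses out
% every option judged incorrect; each correctly eliminated distractor earns
% one point, while eliminating the keyed answer incurs the maximal penalty.
S_{\text{elim}} =
\begin{cases}
  e, & \text{keyed answer not eliminated, } e \text{ distractors crossed out},\\
  -(k - 1), & \text{keyed answer eliminated.}
\end{cases}
```

Under the elimination rule, an examinee with partial knowledge can earn intermediate credit by eliminating only the distractors they can rule out, which is exactly the behavior the simulation study examines under different risk attitudes.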
... In this scoring method, identifying the correct answer is taken as Full Knowledge, and an incorrect answer is taken as Absence of Knowledge [3]. However, using the NR scoring method in an MC test fails to provide a "true" estimate of a student's knowledge [4] because of guessing [5] and the failure to credit partial knowledge [6]. Thus, Coombs [7] proposed Elimination Testing (ET), where partial knowledge is introduced as one of the levels of knowledge. ...
Article
Full-text available
This study proposed a new scoring method that integrates a confidence level into Number Right Elimination Testing (NRET), called Confidence-weighted Number Right Elimination Testing (CWNRET). The paper investigated how comprehensively the CWNRET scoring method can determine students' levels of knowledge in a multiple-choice (MC) test. Results showed that CWNRET scores were equivalent to NRET and Elimination Testing (ET) scores. They also showed that not all students classified as having Full Knowledge under NRET are the same students classified as having Full Knowledge under CWNRET; some of these students have only Partial Knowledge. Likewise, not all students with Full Misconception under NRET also have Full Misconception under CWNRET; some of them fall between Partial Knowledge and Partial Misconception. Furthermore, the findings revealed that students with mastery usually show Full Knowledge and sometimes Partial Knowledge, students who doubt their responses commonly show Partial Knowledge, misinformed students show Partial Misconception, and uninformed students usually lie between Partial Knowledge and Partial Misconception. This shows that CWNRET can detect the quality of knowledge and a more comprehensive level of knowledge than the NRET scoring method. Hence, the CWNRET scoring method is an effective tool for accurately evaluating the level and quality of students' knowledge in an MC test.
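As an illustration only, the sketch below shows how a response that records an answer choice, a set of eliminated options, and a confidence rating could be mapped onto knowledge-level labels of the kind discussed above. The rubric, thresholds, and function name are hypothetical and are not taken from the CWNRET paper.

```python
# Hypothetical rubric (not the published CWNRET scoring rules): classify one
# response that records the selected option, the options eliminated, and a
# self-reported confidence on a k-option item.
def classify_response(selected, eliminated, confident, key, k=4):
    """Return an illustrative knowledge-level label for one item."""
    correct = (selected == key)
    key_eliminated = key in eliminated
    distractors_removed = len([o for o in eliminated if o != key])

    if correct and distractors_removed == k - 1 and confident:
        return "Full Knowledge"
    if correct:
        return "Partial Knowledge"
    if key_eliminated and confident:
        return "Full Misconception"
    if key_eliminated:
        return "Partial Misconception"
    return "Partial Knowledge/Partial Misconception"

# Example: the key is option 'C' on a 4-option item
print(classify_response("C", {"A", "B", "D"}, True, key="C"))  # Full Knowledge
print(classify_response("B", {"A", "C"}, False, key="C"))      # Partial Misconception
```

The point of such a rubric is that the same number-right score can correspond to very different knowledge states once eliminations and confidence are taken into account, which is the distinction the CWNRET study aims to capture.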
... However, applying metacognitive strategies may pose the other serious assessment problem: if students can eliminate some responses based on critical analysis, they can get the correct answer with partial guessing, the level of which is often difficult to assess correctly (Ben-Simon, Budescu, & Nevo, 1997;Kubinger, Holocher-Ertl, Reif, Hohensinn, & Frebort, 2010). An extensive body of literature puts forward different scoring procedures to examine partial guessing (Arnold & Arnold, 1970;Bereby-Meyer, Meyer, & Budescu, 2003;Espinosa & Gardeazabal, 2010;Lord, 1980). ...
Article
Collaborative learning is a promising avenue in education research. Learning from others and with others can foster deeper learning on a multiple-choice assignment, but it is hard to control the level of students' pure guessing. This paper addresses the problem of promoting collaborative learning through the regulation of guessing when students use clickers to answer multiple-choice questions of various levels of difficulty. The study aims to identify how task difficulty and students' levels of knowledge influence the degree of partial guessing. To answer this research question, we developed two research models and validated them by testing 84 students with regard to their level of knowledge and the penalty announcement. The findings reveal that: (a) the announcement of a penalty has a negative effect on promoting collaborative learning even though it reduces pure guessing in test results; (b) questions that require higher-order thinking skills promote collaborative learning to a greater extent; and (c) creating mixed-level groups of students seems advisable to enhance learning from collaboration and thus to decrease the degree of pure guessing.
... To address the perceived drawbacks of multiple-choice testing a number of variants have been introduced; specifically in order to assess complex cognitive processes and/or to reward partial knowledge. These include manipulating the choices given to students so that options contain different combinations of primary responses only some of which are true (complex multiple choice, type K, true-false or type X, and multiple-response formats) (Berk, 1996), manipulating the stems by asking students for predictive or evaluative assessments of a scenario rather than simply recounting knowledge (Berk, 1996), confidence or probability weighting of options (Ben-Simon, Budescu, & Nevo, 1997), and the "multiple response format" in which multiple stages are created within each multiple-choice item, with scores weighted according to whether the reasoning is correct (Wilcox & Pollock, 2014). Interpretive exercises consist of a series of items based on a common set of information/data/tables, with each item requiring students to demonstrate a particular interpretive skill to be measured (Linn & Miller, 2005). ...
Article
Full-text available
Integrated testlets are a new assessment tool that encompasses the procedural benefits of multiple-choice testing, the pedagogical advantages of free-response-based tests, and the collaborative aspects of a viva voce or defence examination format. The result is a robust assessment tool that provides a significant formative aspect for students. Integrated testlets utilize an answer-until-correct response format within a scaffolded set of multiple-choice items, each of which provides immediate confirmatory or corrective feedback while also allowing for the granting of partial credit. We posit here that this testing format constitutes a form of expert-student collaboration, expand on its significance, and discuss possible extensions to the approach.
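A small sketch of one common way to award partial credit in an answer-until-correct format, where credit decreases with each additional attempt needed to reach the keyed answer; the linear credit schedule shown is an assumption for illustration, not necessarily the rule used in the integrated-testlet study.

```python
# Illustrative answer-until-correct partial credit: with k options, full credit
# for a first-attempt success, decreasing linearly to zero if every option had
# to be tried. The linear schedule is an assumption, not the paper's rule.
def auc_partial_credit(attempts_needed: int, k: int) -> float:
    """Credit in [0, 1] given the attempt on which the keyed answer was found."""
    if not 1 <= attempts_needed <= k:
        raise ValueError("attempts_needed must be between 1 and k")
    return (k - attempts_needed) / (k - 1)

# Example for a 4-option item: 1st attempt -> 1.0, 2nd -> 0.67, 4th -> 0.0
print([round(auc_partial_credit(a, 4), 2) for a in range(1, 5)])
```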
Article
Full-text available
The aim of this study is to determine the item characteristics that affect the difficulty index of reading-skill items. To this end, the effects of item format, the cognitive-domain level of the item, and the interaction of these two variables on item difficulty were examined. The study group consists of 2,418 students who responded to the reading-skill subtest in the PISA 2015 administration in Turkey. The analyses were carried out with explanatory IRT models, a multilevel approach. The results show that open-ended items are significantly more difficult than multiple-choice items, and that items in the comprehension-and-interpretation cognitive domain are significantly more difficult than items at the knowledge and evaluation levels. When the interaction of item format and cognitive domain is examined, it was found that presenting comprehension-and-interpretation items in an open-ended format makes the items easier, whereas presenting knowledge-level items in an open-ended format makes them harder.
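To make the modeling approach in the abstract above concrete, an explanatory IRT (LLTM-type) specification regresses item difficulty on item properties. The formulation below is a generic sketch with hypothetical indicator names, not the exact model estimated in the study.

```latex
% Person p answers item i correctly with probability
\operatorname{logit} P(y_{pi} = 1) = \theta_p - \beta_i,
\qquad \theta_p \sim N(0, \sigma^2_\theta),

% and the item difficulty is decomposed into item properties:
\beta_i = \beta_0 + \beta_1\,\mathrm{OpenEnded}_i + \beta_2\,\mathrm{Interpret}_i
        + \beta_3\,(\mathrm{OpenEnded}_i \times \mathrm{Interpret}_i)
```

Here OpenEnded and Interpret are 0/1 indicators for item format and cognitive domain, and the interaction coefficient captures the format-by-domain effect reported in the abstract.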
Research
Full-text available
Abstract: The present study aims to identify the psychometric properties of a multiple-choice achievement test under different item-scoring methods (the conventional method, the quadratic method, and the reward-and-penalty method). To this end, the researcher constructed an achievement test in biology for the fifth preparatory grade (biology stream) consisting of 50 multiple-choice items, each with four alternatives, one correct and the others incorrect. To verify the logical soundness of the items and establish the face validity of the test, it was presented to 19 referees specializing in educational and psychological sciences and biology teaching methods; in light of their comments some items were modified and others were reworded, and no item was excluded, since every item reached the required agreement rate of 100%. To check the clarity of the test instructions and items, the test was administered to a sample of 40 male and female students selected at random from fifth preparatory (biology) students in the education directorates of Baghdad governorate, and the instructions and items proved to be clear. To determine the psychometric properties of the items and of the total test, the test was administered to a sample of 400 male and female fifth-grade (biology) students selected by multistage random sampling from the preparatory stage, with classical test theory used to derive the psychometric properties. To achieve the aim of the study, the researcher administered the test to several samples of fifth preparatory (biology) students in the Baghdad education directorates, as follows: first, the test was administered to a sample of 200 students, after which a statistical item analysis was carried out according to the quadratic method.
Article
Accuracy in estimating knowledge with multiple-choice quizzes largely depends on the distractor discrepancy. The order and duration of distractor views provide significant information to itemize knowledge estimates and detect cheating. To date, a precise and accurate method for segmenting time spent for a single quiz item has not been developed. This work proposes process mining tools for test-taking strategy classification by extracting informative trajectories of interaction with quiz elements. The efficiency of the method was verified in the real learning environment where the difficult knowledge test items were mixed with simple control items. The proposed method can be used for segmenting the quiz-related thinking process for detailed knowledge examination.
Chapter
Full-text available
Many countries are currently making the transition from CAD modeling to BIM modeling and project management. Integrating these new methodologies poses considerable challenges for a society, in an industry that changes rapidly, involves many professional disciplines, and whose economic contribution is relevant to a nation's growth. Several authors worldwide have documented the advantages of implementing the methodology, but this does not mean that implementation is a simple process. In this regard, universities' undergraduate programs play a fundamental role. This document describes an exercise carried out to identify the process that would be required to implement the BIM methodology in the current Civil Engineering program of the Catholic University of Colombia, and it reflects on the most relevant aspects to consider when approaching BIM from academia, particularly from engineering.
Article
We investigated whether and to what extent different scoring instructions, timing conditions, and direct feedback affect performance and speed. An experimental study manipulating these factors was designed to address these research questions. According to the factorial design, participants were randomly assigned to one of twelve study conditions. We collected data from 2,484 participants on 20 quantitative reasoning items obtained from an admissions test for graduate and professional schools. The results showed that there were significant differences in both performance and speed between the conditions. Both item time limits and feedback led to faster but less accurate responses. The results for scoring instructions with an emphasis on speed and test time limits were mixed with respect to accuracy, but the responses in these conditions were generally faster. Notwithstanding these experimental effects, measurement invariance held for models fitted to response accuracy and response time, which means that the manipulations could reasonably be summarized through impact on structural parameters (latent means and variances) of the studied models. This finding is supported by the lack of differences between conditions in the correlations with an external measure of quantitative reasoning.
Article
When a response to a multiple-choice item consists of selecting a single-best answer, it is not possible for examiners to differentiate between a response that is a product of knowledge and one that is largely a product of uncertainty. Certainty-based marking (CBM) is one testing format that requires examinees to express their degree of certainty on the response option they have selected, leading to an item score that depends both on the correctness of an answer and the certainty expressed. The expected score is maximized if examinees truthfully report their level of certainty. However, prospect theory states that people do not always make rational choices of the optimal outcome due to varying risk attitudes. By integrating a psychometric model and a decision-making perspective, the present study looks into the response behaviors of 334 first-year students of physiotherapy on six multiple-choice examinations with CBM in a case study. We used item response theory to model the objective probability of students giving a correct response to an item, and cumulative prospect theory to estimate their risk attitudes when students choose to report their certainty. The results showed that with the given CBM scoring matrix, students’ choices of a certainty level were affected by their risk attitudes. Students were generally risk averse and loss averse when they had a high success probability on an item, leading to an under-reporting of their certainty. Meanwhile, they were risk seeking in case of small success probabilities on the items, resulting in the over-reporting of certainty.
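For reference, the cumulative prospect theory components commonly combined with an IRT success probability in this kind of analysis are the value function and the probability weighting function of Tversky and Kahneman (1992); whether the study used exactly this parameterization is not stated in the abstract.

```latex
v(x) =
\begin{cases}
  x^{\alpha}, & x \ge 0,\\
  -\lambda(-x)^{\beta}, & x < 0,
\end{cases}
\qquad
w(p) = \frac{p^{\gamma}}{\left(p^{\gamma} + (1-p)^{\gamma}\right)^{1/\gamma}}
```

With alpha and beta below 1 the value function shows diminishing sensitivity, lambda above 1 captures loss aversion, and gamma below 1 produces the characteristic overweighting of small probabilities and underweighting of large ones, which is consistent with the over- and under-reporting of certainty described in the abstract.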
Article
Full-text available
This study aims to describe: (1) the characteristics of the 2007/2008 National Examination (UN) mathematics test at the junior secondary (SLTP) level, (2) the characteristics of the true-score distributions estimated under several scoring models, (3) the relationship of ability scores and observed scores with true scores, and (4) the implications of applying the scoring models for true-score estimation. The data consist of SMP/MTs students' responses to the 2007/2008 UN mathematics test in West Nusa Tenggara Province. The analysis used a quantitative approach. The results show that the 2007/2008 UN mathematics test for SMP/MTs was in the difficult category, with good average item discrimination but a poor average pseudo-guessing index. The highest mean true score was obtained under the true number-right scoring model, while the lowest mean occurred under the correction-for-guessing scoring model. The relationship between the ability score (θ) and the true score showed a positive correlation with a very high correlation coefficient. The mean true-score estimates from the three scoring models differed significantly. Keywords: observed score, ability score, true score, scoring model
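The quantities referred to above (ability score, pseudo-guessing, true score) come from the three-parameter logistic model, under which the test true score at a given ability is the sum of the item response probabilities. The following is a generic statement assuming the standard 3PL parameterization, not a study-specific equation.

```latex
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + \exp\!\left[-a_i(\theta - b_i)\right]},
\qquad
\tau(\theta) = \sum_{i=1}^{n} P_i(\theta)
```

Here a_i, b_i, and c_i are the discrimination, difficulty, and pseudo-guessing parameters of item i, and tau(theta) is the true number-right score; scoring models such as correction for guessing transform the observed score and therefore shift the estimated true-score distribution, which is the comparison the study reports.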
Article
Full-text available
This study aimed to examine how latent class analysis performs under different scoring conditions. For this purpose, data were collected from 595 students at the Faculty of Education of Mersin University using a test composed of multiple-choice items whose options could be scored dichotomously, with weights based on expert judgment, or with empirical weights. The students' responses to the test items were scored separately in each of these three ways, and latent class analysis was performed on the resulting data sets. The findings showed that the same number of classes was reached in the latent class analysis for dichotomous scoring and for expert-judgment weighted scoring. The smallest number of classes, with the fewest parameter estimates, was reached for empirical weighted scoring; for dichotomous and expert-judgment weighted scoring, both the number of classes obtained and the number of estimated parameters were higher than those reached with empirical weighted scoring. In latent class analysis, the most appropriate model is the one that achieves model-data fit with the fewest latent classes and the fewest parameter estimates. It can therefore be stated that the model obtained with the empirical weighted scoring method is the most appropriate model for latent class analysis.
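For readers less familiar with latent class analysis, the model compared across the three scoring schemes has the general form below, where the solution with the fewest classes and parameters that still fits the data is preferred. This is the standard unrestricted LCA likelihood, not a study-specific equation.

```latex
P(\mathbf{y}) = \sum_{c=1}^{C} \pi_c \prod_{j=1}^{J} P(y_j \mid c),
\qquad \sum_{c=1}^{C} \pi_c = 1
```

The pi_c are class proportions and P(y_j | c) are the class-conditional item response probabilities; because the number of free parameters grows with both the number of classes and the number of response categories per item, dichotomous and weighted scorings of the same responses can lead to different preferred models, as the study reports.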
Article
Single‐best answers to multiple‐choice items are commonly dichotomized into correct and incorrect responses, and modeled using either a dichotomous item response theory (IRT) model or a polytomous one if differences among all response options are to be retained. The current study presents an alternative IRT‐based modeling approach to multiple‐choice items administered with the procedure of elimination testing, which asks test‐takers to eliminate all the response options they consider to be incorrect. The partial credit model is derived for the obtained responses. By extracting more information pertaining to test‐takers’ partial knowledge on the items, the proposed approach has the advantage of providing more accurate estimation of the latent ability. In addition, it may shed some light on the possible answering processes of test‐takers on the items. As an illustration, the proposed approach is applied to a classroom examination of an undergraduate course in engineering science.
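The partial credit model derived in that study for elimination-testing responses has the standard Masters (1982) form; the notation below is the generic PCM, with x the score category of person n on item i and delta_ij the step parameters, rather than the paper's specific parameterization.

```latex
P(X_{ni} = x) =
\frac{\exp\!\left[\sum_{j=0}^{x} (\theta_n - \delta_{ij})\right]}
     {\sum_{h=0}^{m_i} \exp\!\left[\sum_{j=0}^{h} (\theta_n - \delta_{ij})\right]},
\qquad x = 0, 1, \ldots, m_i,
\qquad \sum_{j=0}^{0} (\theta_n - \delta_{i0}) \equiv 0
```

Treating the number of correctly eliminated distractors as the ordered score category is what allows partial knowledge to contribute directly to the ability estimate.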
Article
When diagnostic assessments are administered to examinees, the mastery status of each examinee on a set of specified cognitive skills or attributes can be directly evaluated using cognitive diagnosis models (CDMs). Under certain circumstances, allowing examinees at least one more opportunity to answer the questions correctly in assessments with repeated attempts on the items provides many potential benefits. A sequential process model can be extended to model repeated attempts in diagnostic assessments. Two formulations of the sequential generalized deterministic-input noisy-"and"-gate (G-DINA) model were developed in this study. The first extension uses the latent transition analysis (LTA) approach to model changes in the attributes over attempts, and the second constructs a higher-order structure of latent continuous variables and latent attributes to account for the dependences of the attributes over attempts. Accurate model parameter estimation and correct classification of attributes were observed in a series of simulations using Bayesian estimation. The effectiveness of the developed sequential G-DINA model was demonstrated by fitting real data from a longitudinal mathematics test to the developed model and to the longitudinal G-DINA model using the LTA approach. Finally, this article closes by discussing several important issues associated with the developed models and providing suggestions for future directions.
Article
Full-text available
Multiple-choice tests are among the most widely used test formats throughout the world due to their ease of administration and other advantages; however, one of the shortcomings of this test format is the role of guessing inherent in it. To solve this problem, different scoring methods have been proposed. Confidence-based scoring is a scoring method that both removes uninformed guesses from multiple-choice tests and takes partial knowledge into account. This scoring method, however, has been criticized for being biased against gender and specific personality traits. The present study was an attempt to examine the self-esteem and gender bias of confidence-based scoring compared to number-right scoring while testing English grammar. The participants were forty-nine freshman students who were taking their English grammar course. At the end of the semester, they were given an eighty-item multiple-choice test based on the content of the course. The test was scored in two ways: confidence-based scoring and number-right scoring. The participants were also given the Self-Esteem Inventory. To examine the self-esteem bias of these two scoring methods, Pearson correlation was used. To investigate their gender bias, the means of scores in male and female participants were calculated, and the significance of the difference was tested by independent-samples t-test. The results showed that both confidence-based scoring and number-right scoring were biased against self-esteem. In other words, confidence-based scores were no more biased than number-right scores against self-esteem. The results also showed that neither confidence-based scores nor number-right scores were biased against gender.
Article
Cognitive diagnosis models (CDMs) allow for the extraction of fine-grained, multidimensional diagnostic information from appropriately designed tests. In recent years, interest in such models has grown as formative assessment grows in popularity. Many dichotomous as well as several polytomous CDMs have been proposed in the last two decades, but there has been only one continuous-response model, the C-DINA, proposed to date. The C-DINA model offers a promising first step at modeling continuous process data, but the application of a model with a strong conjunctive assumption may be limited. Thus, the generalized version of the C-DINA model is proposed, and its viability is demonstrated with both a simulation study and a real data example.
Article
Full-text available
Test-wiseness has long been regarded as an important factor influencing the results of multiple-choice tests. People with knowledge of test-wiseness appear to perform better on multiple-choice tests than people without this knowledge, and it has also been shown that test-wiseness can be trained and learned. Although test-wiseness appears to be of decisive importance for performance on a multiple-choice test, the existing findings stem almost exclusively from the English-speaking (American) literature; for German-speaking countries, there are hardly any findings on test-wiseness. Against this background, a German-language test was developed on the basis of an English test version. The aim of the present study was to examine the influence of topical knowledge and of training in test-wiseness on the test result. To this end, 252 students answered 24 German-language multiple-choice items assessing their knowledge of test-wiseness, in a 2 (with vs. without expertise) x 2 (with vs. without training) design. The results show that persons with topical knowledge achieve better results than persons without topical knowledge, and persons with training achieve better results than persons without training. Overall, the findings make clear that a German-language test for measuring test-wiseness has been created that matches the quality of existing international tests, and that greater attention should be paid to controlling for test-wiseness in German-speaking countries in the future.
Article
Nondichotomous response models have been of greater interest in recent years due to the increasing use of different scoring methods and various performance measures. As an important alternative to dichotomous scoring, the use of continuous response formats has been found in the literature. To assess finer-grained skills or attributes and to extract information with diagnostic value from continuous response data, a multidimensional skills diagnosis model for continuous response is proposed. An expectation-maximization implementation of marginal maximum likelihood estimation is developed to estimate its parameters. The viability of the proposed model is shown via a simulation study and a real data example. The proposed model is also shown to provide a substantial improvement in attribute classification when compared to a model based on dichotomized continuous responses.
Article
Full-text available
This study examined the interaction between gender and risk-taking in the test performance of Iranian EFL learners. The research was conducted with 120 male and female EFL learners from the Islamic Azad University of Isfahan (Khorasgan). The participants completed the Venturesomeness subscale of Eysenck's IVE questionnaire, rating each item on a 5-point Likert scale; the total score for this questionnaire ranges from 16 to 80. Students who scored lower than 30 were considered low risk-takers, those who scored higher than 70 high risk-takers, and those between 30 and 70 moderate risk-takers. A week later, a complete TOEFL PBT test comprising 140 multiple-choice items was administered as the second instrument. The results revealed that the female EFL students were lower risk-takers and left questions unanswered and skipped questions far more frequently than their male counterparts. Finally, it was found that low risk-takers answered the fewest questions in comparison to high and moderate risk-takers and, consequently, left the most questions unanswered, which had a negative effect on their test performance.
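A minimal sketch of the scoring and grouping rule described in the abstract (16 items rated on a 5-point scale, totals from 16 to 80, cut-offs at 30 and 70); the function name and example data are illustrative, not taken from the study's materials.

```python
# Illustrative grouping based on the cut-offs reported in the abstract:
# totals below 30 -> low risk-takers, above 70 -> high, otherwise moderate.
def risk_group(item_ratings):
    """item_ratings: 16 Likert ratings, each from 1 to 5."""
    total = sum(item_ratings)
    if not 16 <= total <= 80:
        raise ValueError("total must lie between 16 and 80")
    if total < 30:
        return "low"
    if total > 70:
        return "high"
    return "moderate"

print(risk_group([2] * 16))  # total 32 -> 'moderate'
print(risk_group([1] * 16))  # total 16 -> 'low'
```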
Article
Full-text available
Different scoring methods have been introduced over time whose objective is to address partial knowledge in multiple-choice (MC) tests. One of these is Number Right Elimination Testing (NRET). This study investigated the effectiveness of the NRET scoring method in a pen-and-paper MC test in evaluating students' levels of knowledge and in minimizing guessing. Results showed that NRET scores were not similar to Number Right (NR) scores, while NRET and Elimination Testing (ET) scores were equivalent to each other. They also showed that not all correct answers under NR scoring were based on Full Knowledge, and not all incorrect answers were based on Lack of Knowledge or Full Misconception. Furthermore, the NRET scoring method reduced guessing in the low-ability group. Hence, the NRET scoring method is an effective tool for evaluating a more comprehensive level of student knowledge in a pen-and-paper MC test.
Article
Full-text available
Answer-until-correct (AUC) tests have been in use for some time. Pressey (1950) pointed to their advantages in enhancing learning, and Brown (1965) proposed a scoring procedure for AUC tests that appears to increase reliability (Gilman & Ferry, 1972; Hanna, 1975). This paper describes a new scoring procedure for AUC tests that (1) makes it possible to determine whether guessing is at random, (2) gives a measure of how "far away" guessing is from being random, (3) corrects observed test scores for partial information, and (4) yields a measure of how well an item reveals whether an examinee knows or does not know the correct response. In addition, the paper derives the optimal linear estimate (under squared-error loss) of true score that is corrected for partial information, as well as another formula score under the assumption that the Dirichlet-multinomial model holds. Once certain parameters are estimated, the latter formula score makes it possible to correct for partial information using only the examinee's usual number-correct observed score. The importance of this formula score is discussed. Finally, various statistical techniques are described that can be used to check the assumptions underlying the proposed scoring procedure.
Article
Full-text available
Reviews 55 studies in which self-evaluations of ability were compared with measures of performance to show a low mean validity coefficient (mean r = .29) with high variability (SD = .25). A meta-analysis by the procedures of J. E. Hunter et al (1982) calculated sample-size-weighted estimates of the mean r and of SDr and estimated the appropriate adjustments of these values for sampling error and unreliability. Among person variables, high intelligence, high achievement status, and internal locus of control were associated with more accurate evaluations. Much of the variability in the validity coefficients (R = .64) could be accounted for by 9 specific conditions of measurement, notably (a) the rater's expectation that the self-evaluation would be compared with criterion measures, (b) the rater's previous experience with self-evaluation, (c) instructions guaranteeing anonymity of the self-evaluation, and (d) self-evaluation instructions emphasizing comparison with others. It is hypothesized that conditions increasing self-awareness would increase the validity of self-evaluation. (84 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
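The sample-size-weighted estimates referred to above follow the Hunter and Schmidt meta-analytic procedure; in generic form, with r_i the validity coefficient and N_i the sample size in study i, they can be written as below. This is the standard textbook form, not a formula quoted from the review itself.

```latex
\bar{r} = \frac{\sum_i N_i r_i}{\sum_i N_i},
\qquad
S_r^2 = \frac{\sum_i N_i (r_i - \bar{r})^2}{\sum_i N_i},
\qquad
\sigma_e^2 \approx \frac{(1 - \bar{r}^2)^2}{\bar{N} - 1}
```

The sampling-error variance sigma_e^2 (with N-bar the mean sample size) is subtracted from the observed variance S_r^2 before interpreting how much of the variability in validity coefficients reflects real moderators such as the measurement conditions listed in the abstract.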
Article
Full-text available
Despite the common reliance on numerical probability estimates in decision research and decision analysis, there is considerable interest in the use of verbal probability expressions to communicate opinion. A method is proposed for obtaining and quantitatively evaluating verbal judgments in which each analyst uses a limited vocabulary that he or she has individually selected and scaled. An experiment compared this method to standard numerical responding under three different payoff conditions. Response mode and payoff never interacted. Probability scores and their components were virtually identical for the two response modes and for all payoff groups. Also, judgments of complementary events were essentially additive under all conditions. The two response modes differed in that the central response category was used more frequently in the numerical than the verbal case, while overconfidence was greater verbally than numerically. Response distributions and degrees of overconfidence were also affected by payoffs. Practical and theoretical implications are discussed.
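The "probability scores and their components" mentioned above are typically the mean probability (Brier) score and its Murphy decomposition; the statement below is the generic form, not an equation taken from this experiment.

```latex
\overline{PS} = \frac{1}{N}\sum_{i=1}^{N} (f_i - o_i)^2
= \bar{o}(1-\bar{o})
\;-\; \frac{1}{N}\sum_{t} n_t(\bar{o}_t - \bar{o})^2
\;+\; \frac{1}{N}\sum_{t} n_t(f_t - \bar{o}_t)^2
```

Here f_i is the stated probability, o_i the 0/1 outcome, and the responses are grouped into categories t with n_t judgments each; the three terms are outcome uncertainty, resolution (subtracted), and calibration, which is where verbal and numerical response modes can differ in overconfidence.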
Article
This paper presents the development of scoring functions for use in conjunction with standard multiple-choice items. In addition to the usual indication of the correct alternative, the examinee is to indicate his personal probability of the correctness of his response. Both linear and quadratic polynomial scoring functions are examined for suitability, and a unique scoring function is found such that a score of zero is assigned when complete uncertainty is indicated and such that the examinee can expect to do best if he reports his personal probability accurately. A table of simple integer approximations to the scoring function is supplied.
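A quadratic scoring function of the kind described above, proper and anchored so that a uniform report over the k alternatives scores zero, can be written, up to the particular constants chosen in the paper, as follows.

```latex
S(\mathbf{r}, c) = a\left(2 r_c - \sum_{j=1}^{k} r_j^2\right) + b,
\qquad a > 0, \quad b = -\frac{a}{k}
```

Here r is the reported probability vector and c indexes the keyed alternative; with b = -a/k the uniform report r_j = 1/k receives a score of zero, and because any positive affine transformation of the quadratic (Brier-type) rule remains proper, the examinee maximizes expected score only by reporting personal probabilities honestly.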
Article
Theories are put forward that attempt to answer the practical question of ‘How should we correct for guessing in multiple choice tests?’ and the theoretical question of ‘How can we mathematically describe partial knowledge so as to predict behaviour in tasks which enable it to be shown?’. Empirical findings relating to performance on variants of the multiple choice task are reviewed, and compared to the predictions of the theories.
Article
This review covers multiple-choice response and scoring methods that attempt to capture information about an examinee's degree or level of knowledge with respect to each item and use this information to produce a total test score. The period covered is mainly from the early 1970s onward; earlier reviews are summarized. It is concluded that there is little to be gained from the complex responding and scoring schemes that have been investigated. Although some of them have confirmed potential to increase internal-consistency reliability, this outcome is often obtained only at the expense of validity. Also, the extra responding time required by some methods would permit lengthening a conventional multiple-choice test sufficiently to obtain the same reliability improvement. Partial-credit response and scoring methods that continue to be used will probably earn this status due to secondary characteristics such as providing feedback to enhance learning.
Article
Six response/scoring methods for multiple-choice tests are analyzed with respect to expected item scores under various levels of information and misinformation. It is shown that misinformation always and necessarily results in expected item scores lower than those associated with complete ignorance. Moreover, it is shown that some response/scoring methods penalize all conditions of misinformation equally, and others have varying penalties according to the number of wrong choices the misinformed examinee has categorized with the correct choice. One method exacts the greatest penalty when a specific wrong choice is believed correct; two other methods provide the maximum penalty when the examinee is confident only that the correct choice is incorrect. Partial information is shown to yield substantially different expected item scores from one method to another. Guessing is analyzed under the assumption that examinees guess whenever it is advantageous to do so under the scoring method used and that these conditions would be made clear to the examinee. Additional guessing is shown to have no effect on expected item scores in some cases, though in others it is shown to lower the expected item score. These outcomes are discussed with respect to validity and reliability of resulting total scores and also with respect to test content and examinee characteristics.
Article
An important but usually neglected aspect of the training of teachers is instruction in the art of writing good classroom tests. Such training should emphasize various forms of objective items (e.g., multiple-choice, master list, matching, greater-less-same, best-worst answer, and matrix format). The proper formulation and accurate grading of essay items should be included, as should the use of various types of free-answer items (e.g., the brief answer, interlinear, and "fill in the blanks in the following paragraph" forms). For courses involving laboratory work, such as science, machine shop, and home economics, performance and identification tests based on the laboratory work should be used. A second point is that organizations developing aptitude tests for nonacademic areas, such as police work, fire fighting, and licensing tests, should emphasize the use by the client of a valid, reliable, and unbiased criterion. Organizations developing academic aptitude tests should also (1) be alert to the accuracy of criterion measures, grades, rank in class, and so forth; (2) call teachers' attention to defects in grading; and (3) help guide teachers and schools in improving these procedures. In recent decades, there have been few instances in which a testing organization has apprised teachers of the fact that their criteria, among others, grades on tests and student papers, are often quite unreliable based on characteristics such as work habits and attitude in class, and could be improved by using better tests to evaluate student performance. Characteristics of the group used for determining validity are also critical.
Article
Binary, probability, and ordinal scoring procedures for multiple-choice items were examined. In a situation where true scores were experimentally controlled by the manipulation of partial information, it was found that both the probability and ordinal scoring systems were more reliable than the binary scoring method. A second experiment using vocabulary items and standard reliability estimation procedures also showed higher reliability for the two partial-information scoring methods relative to binary scoring.
Article
This study compares various item option scoring methods with respect to coefficient alpha and a concurrent validity coefficient. The scoring methods under consideration were: (1) formula scoring, (2) a priori scoring, (3) empirical scoring with an internal criterion, and (4) two modifications of formula scoring. The study indicates a clear superiority of the empirically determined scoring system with respect to both coefficient alpha and the concurrent validity.
Article
The validity of a confidence scored vocabulary test was investigated by demonstrating an increase in its reliability without changing the relative difficulty of the test items and without detecting any personality bias in the confidence scoring system. The reliability estimate of the vocabulary test increased from .57 using a traditional scoring system to .85 using a confidence scoring system. No significant interaction was found between the difficulty of the test items and the type of scoring system. Three personality measures failed to correlate significantly with the confidence scores of the vocabulary test.
Article
This study compared the reliability and validity indexes of randomly parallel tests administered under inclusion, exclusion, and correction for guessing directions. It also compared the criterion-referenced grading decisions based on the different scoring methods. Inclusion and exclusion scores were not so highly correlated as theory would predict. There were no significant differences in the reliability and validity indices for the three scoring methods. However, the scoring methods differed substantially in the proportion of students assigned to different grade categories.
Article
Sixty-three graduate students, taking an elementary statistics course in education and having previous experience with confidence weighting, utilized confidence weighting in recording their test responses to the objective, 65-item final examination, administered on the first of two consecutive final examination days. On the second day of testing, a short-answer examination covering the same material was given. Examinees were directed to attempt all items on both tests. Since response styles and chance had virtually no opportunity to affect performance on the short-answer final examination, it served as the criterion. The observed reliability of the scores using confidence weighting was slightly higher than that of the scores from conventional scoring (.91 vs. .88). The validity coefficient for the confidence-weighted scores was lower than for the conventional scores (.67 vs. .70), but the differences did not attain statistical significance. The findings suggest that the added reliable variance often observed in confidence-weighting studies may be irrelevant response-style variance and does not increase validity; in fact, it may actually diminish validity.
Article
The literature on a priori and empirical weighting of test items and test-item options is reviewed. While multiple regression is the best known technique for deriving fixed empirical weights for component variables (such as tests and test items), other methods allow one to derive weights which equalize the effective weights of the component variables (their individual contributions to the variance of the composite). Fixed weighting is most effective, in general, when there are few variables in the composite, and when these variables are not highly correlated. Variable weighting methods are those in which there is no nominal weight, constant over subjects, applied to a single item or response option. Of most interest are variable response-weighting methods such as those recently suggested by de Finetti (1965) and others. To be effective, such weighting methods require that the subject be able to maximize his expected score only if he reports his subjective probabilities honestly. Variable response-weighting methods, perhaps in conjunction with fixed response-weighting methods, show promise for increasing the reliability and validity of test scores. (Author/CJ)
Article
In a confidence weighting situation, the examinee is asked to indicate the correct answer and how certain he or she is of the correctness of that answer. This paper reviews the bases for confidence marking, its validity and accuracy in evaluating students, and its use in research. (BW)
Article
2 forms of the Dominion Vocabulary Test were administered to 667 9th graders in either AB or BA orders. The directions for the 1st administration were reward for omitting otherwise guessed replies (PR), penalty for guessing (P), no penalty for guessing (G), and no reference to cues (NR). Instructions were randomized by blocks. When tests were scored for corrects only, P and PR attained lower scores than G and NR. PR, however, had significantly fewer wrong answers than the others. Interform r for the G group was .93; for PR, .92; for NR, .90; and for P, .89. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Cites highlights of the history of item weighting. Although recent history suggests that on the whole item weighting affects validity to a small degree, Birnbaum's model for differential weighting on the basis of ability and de Finetti's personal probability approach to item alternatives may prove to be exceptions. (43 ref.) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Administered 3 sets of scoring instructions (1 promising a small reward for omitted questions, a 2nd threatening a small penalty for wrong answers, and a 3rd encouraging the examinee to guess) to 1,091 8th grade Canadian children to test the effect of instructions with speededness varied. Measures of risk taking, test anxiety, need achievement, intelligence, and school achievement also were available. Analysis of variance yielded significant differences between scoring instructions, speededness, and sex. Estimates of reliability and validity under varying conditions are provided and discussed, along with correlations of the tests with the personality variables. Results support the assertion that the reward instruction more effectively encourages omissive behavior than the penalty instruction, and also the assertion that the reward instruction yields scores with higher reliability and criterion validity than the penalty instruction. (19 ref.) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
The material in this book is based on my several years' experience in construction and evaluation of examinations, first as a member of the Board of Examinations of the University of Chicago, later as director of a war research project developing aptitude and achievement tests for the Bureau of Naval Personnel, and at present as research adviser for the Educational Testing Service. During this time I have become aware of the necessity for a firm grounding in test theory for work in test development. When this book was begun the material on test theory was available in numerous articles scattered through the literature and in books written some time ago, and therefore not presenting recent developments. It seemed desirable to me to bring the technical developments in test theory of the last fifty years together in one readily available source. Although this book is written primarily for those working in test development, it is interesting to note that the techniques presented here are applicable in many fields other than test construction. Many of the difficulties that have been encountered and solved in the testing field also confront workers in other areas, such as measurement of attitudes or opinions, appraisal of personality, and clinical diagnosis. The major part of this book is designed for readers with the following preparation: (1) A knowledge of elementary algebra, including such topics as the binomial expansion, the solution of simultaneous linear equations, and the use of tables of logarithms; (2) Some familiarity with analytical geometry, emphasizing primarily the equation of the straight line, although some use is made of the equations for the circle, ellipse, hyperbola, and parabola; and (3) A knowledge of elementary statistics, including such topics as the computation and interpretation of means, standard deviations, correlations, errors of estimate, and the constants of the equation of the regression line. It is assumed that the students know how to make and to interpret frequency diagrams of various sorts, including the histogram, frequency polygon, normal curve, cumulative frequency curve, and the correlation scatter diagram. Familiarity with tables of the normal curve and with significance tests is also assumed. In textbook fashion, each chapter concludes with problems and exercises. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
The purpose of this paper was to provide an empirical test of the hypothesis that elimination scores are more reliable and valid than classical corrected-for-guessing scores or weighted-choice scores. The evidence presented supports the hypothesized superiority of elimination scoring.
Article
In order to review the empirical literature on subjective probability encoding from a psychological and psychometric perspective, it is first suggested that the usual encoding techniques can be regarded as instances of the general methods used to scale psychological variables. It is then shown that well-established concepts and theories from measurement and psychometric theory can provide a general framework for evaluating and assessing subjective probability encoding. The actual review of the literature distinguishes between studies conducted with nonexperts and with experts. In the former class, findings related to the reliability, internal consistency, and external validity of the judgments are critically discussed. The latter class reviews work relevant to some of these characteristics separately for several fields of expertise. In the final section of the paper the results from these two classes of studies are summarized and related to a view of vague subjective probabilities. Problems deserving additional attention and research are identified.
Article
An approximate statistical test is derived for the hypothesis that the reliability coefficients (Cronbach's α) associated with two measurement procedures are equal. Control of Type I error is investigated by comparing empirical sampling distributions of the test statistic with the theoretical model derived for it. The effect of platykurtosis in the test-score distribution on the test statistic is also considered.
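One widely used form of such a test, Feldt's procedure for two independent samples, is stated below in generic notation; the abstract does not give the exact statistic or degrees of freedom used, so this should be read as the standard reference form rather than the paper's own derivation.

```latex
W = \frac{1 - \hat{\alpha}_1}{1 - \hat{\alpha}_2}
\;\sim\; F(N_1 - 1,\; N_2 - 1)
\quad \text{approximately, under } H_0 : \alpha_1 = \alpha_2
```

Here alpha-hat_1 and alpha-hat_2 are the sample coefficient alphas from samples of sizes N_1 and N_2; values of W far from 1 in either tail lead to rejection of the hypothesis of equal reliabilities.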
Article
In a world characterized by uncertainty, the study of how people assess probabilities carries both theoretical and practical implications. Much of the research efforts in this area, especially in psychology, has focused on calibration studies (Lichtenstein, Fischhoff and Phillips 1982). The present article offers an extensive review of conceptual and methodological issues involved in the study of calibration and probability assessments. It is claimed that most calibration studies have focused on technical formal issues and are in this respect a-theoretical. The reason for this state of affairs is the adoption of a strict perspective which assumes that uncertainty is a reflection of the external world, and relies heavily on normative and formal considerations. Several unresolved problems within this strict outlook are pointed out. The present paper assumes that calibration (and assessments of subjective probabilities in general) is not a characteristic of the event(s), but rather of the assessor (Lad 1984), and advocates a more loose perspective, which is broader and more descriptive in nature. Possible discrepancies between a strict and a more loose perspective, as well as reconciliation attempts, are presented.
Article
On a multiple-choice test in which each item has k alternative responses, the test taker is permitted to choose any subset which he believes contains the one correct answer. A scoring system is devised that depends on the size of the subset and on whether or not the correct answer is eliminated. The mean and variance of the score per item are obtained. Methods are derived for determining the total number of items that should be included on the test so that the average score on all items can be regarded as a good measure of the subject's knowledge. Efficiency comparisons between conventional and the subset selection scoring procedures are made. The analogous problem of r > 1 correct answers for each item (with r fixed and known) is also considered.
Article
The earlier two-sample procedure of Feldt [1969] for comparing independent alpha reliability coefficients is extended to the case of K ≥ 2 independent samples. Details of a normalization of the statistic under consideration are presented, leading to computational procedures for the overall K-group significance test and accompanying multiple comparisons. Results based on computer simulation methods are presented, demonstrating that the procedures control Type I error adequately. The results of a power comparison of the case of K = 2 with Feldt's [1969] F test are also presented. The differences in power were negligible. Some final observations, along with suggestions for further research, are noted.
Article
Admissible probability measurement procedures utilize scoring systems with a very special property that guarantees that any student, at whatever level of knowledge or skill, can maximize his expected score if and only if he honestly reflects his degree-of-belief probabilities. Section 1 introduces the notion of a scoring system with the reproducing property and derives the necessary and sufficient condition for the case of a test item with just two possible answers. A method is given for generating a virtually inexhaustible number of scoring systems, both symmetric and asymmetric, with the reproducing property. A negative result concerning the existence of a certain subclass of reproducing scoring systems for the case of more than two possible answers is obtained. Whereas Section 1 is concerned with those instances in which the possible answers to a query are stated in the test itself, Section 2 is concerned with those instances in which the student himself must provide the possible answer(s). In this case, it is shown that a certain minor modification of a scoring system with the reproducing property yields the desired admissible probability measurement procedure.
Article
Two multiple-choice tests, one with five alternatives for each question and one with four alternatives for each question, were scored as a Three-decision Multiple-choice Test and as a conventional multiple-choice test. In addition, the five-alternative test was scored as a modified conventional multiple-choice test by giving half marks if the correct alternative was picked as the second choice. The different scoring systems were evaluated by correlating the scores with the average mark obtained by each student in all his courses during the year. The results indicated that the conventional multiple-choice test was not improved by scoring methods which gave credit for partial knowledge.
Confidence testing: Is it reliable? Paper presented at the annual meeting of the National Council on Measurement in Education
  • R J Armstrong
  • R F Mooney
Armstrong, R. J., & Mooney, R. F. (1969, February). Confidence testing: Is it reliable? Paper presented at the annual meeting of the National Council on Measurement in Education, Los Angeles.
Educational and psychological testing: The test-taker's outlook
  • D V Budescu
Budescu, D. V. (1993). Self-evaluation of success in psychological testing. In B. Nevo & R. S. Jager (Eds.), Educational and psychological testing: The test-taker's outlook (pp. 153-176). Gottingen, Germany: Hogrefe & Huber Publishers.
A new method for administering and scoring multiple choice tests: Theoretical and empirical considerations. Unpublished manuscript
  • L H Cross
  • N F Thayer
Cross, L. H., & Thayer, N. F. (1979). A new method for administering and scoring multiple choice tests: Theoretical and empirical considerations. Unpublished manuscript, Virginia Polytechnic Institute and State University, Blacksburg.
Empirically based polychotomous scoring of multiple choice test items: A review
  • T M Haladyna
Haladyna, T. M. (1988, April). Empirically based polychotomous scoring of multiple choice test items: A review. Paper presented at the annual meeting of the American Educational Research Association, New Orleans.
The subset selection technique for multiple choice tests: An empirical inquiry
  • O Jaradat
  • S Swagad
Jaradat, O., & Swagad, S. (1986). The subset selection technique for multiple choice tests: An empirical inquiry. Journal of Educational Measurement, 23, 369-376.
Validity of self-evaluation of ability: A review and meta-analysis
  • P A Mabe
  • S G West
Mabe, P. A., & West, S. G. (1982). Validity of self-evaluation of ability: A review and meta-analysis. Journal of Applied Psychology, 67, 280-296.
Effect of examinee certainty on probabilistic test scores and a comparison of scoring methods for probabilistic responses
  • D Suhadolnik
  • D J Weiss
Suhadolnik, D., & Weiss, D. J. (1983). Effect of examinee certainty on probabilistic test scores and a comparison of scoring methods for probabilistic responses (Research Report 83-3). University of Minnesota, Department of Psychology, Psychometric Methods Program, Computerized Adaptive Testing Laboratory, Minneapolis.