Education and Information Technologies
https://doi.org/10.1007/s10639-024-12771-3
Exploring quality criteria and evaluation methods in automated question generation: A comprehensive survey
GuherGorgun1 · OkanBulut2
Received: 14 December 2023 / Accepted: 7 May 2024
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature
2024
Abstract
In light of the widespread adoption of technology-enhanced learning and assess-
ment platforms, there is a growing demand for innovative, high-quality, and diverse
assessment questions. Automatic Question Generation (AQG) has emerged as a valu-
able solution, enabling educators and assessment developers to efficiently produce
a large volume of test items, questions, or assessments within a short timeframe.
AQG leverages computer algorithms to automatically generate questions, streamlin-
ing the question-generation process. Despite the efficiency gains, significant gaps in
the question-generation pipeline hinder the seamless integration of AQG systems into
the assessment process. Notably, the absence of a standardized evaluation framework
poses a substantial challenge in assessing the quality and usability of automatically
generated questions. This study addresses this gap by conducting a comprehensive
survey of existing question evaluation methods, a crucial step in refining the ques-
tion generation pipeline. Subsequently, we present a taxonomy for these evaluation
methods, shedding light on their respective advantages and limitations within the
AQG context. The study concludes by offering recommendations for future research
to enhance the effectiveness of AQG systems in educational assessments.
Keywords: Automatic question generation · Human evaluators · Question quality · Post-hoc evaluations · Metric-based evaluations
* Guher Gorgun
gorgun@ualberta.ca
Okan Bulut
bulut@ualberta.ca
1 Measurement, Evaluation, and Data Science, Faculty of Education, University of Alberta, 6-110 Education Centre North, 11210 87 Ave NW, Edmonton, AB T6G 2G5, Canada
2 Centre for Research in Applied Measurement and Evaluation, Faculty of Education, University of Alberta, 6-110 Education Centre North, 11210 87 Ave NW, Edmonton, AB T6G 2G5, Canada
Assessment is a cornerstone of education that allows researchers, educators, and policymakers to gauge learners’ knowledge and skills while providing evidence about the effectiveness of educational practices (Ewell, 2008; Heubert et al., 1999;
Linn, 2003; Nagy, 2000; Newton, 2007; Zilberberg et al., 2013). Creating high-quality assessments is a complex process because assessment developers must ensure both that the building blocks of the assessment (i.e., questions) are of high quality and that, holistically, the assessment measures what it intends to measure with high consistency and accuracy (Darling-Hammond et al., 2013). Developing high-quality questions
has been a major challenge for educators because it requires content and assessment
expertise, time, and resources (e.g., Tarrant et al., 2006). Various shortcomings in the question development process may slow down the transition to personalized and adaptive teaching and learning, which itself necessitates a large question bank (Vie et al., 2017).
Automatic question generation (AQG) has emerged as an efficient and practical
solution to streamline the question generation process, allowing the rapid generation
of a large number of questions through computer algorithms. Nonetheless, the fast
and cost-effective nature of this process does not ensure the suitability of the auto-
matically generated questions for operational use in educational settings. Each ques-
tion generated through AQG systems must still go through a comprehensive evalu-
ation process to discern the quality, relevance, and effectiveness of these questions
within the context of operational educational settings. Therefore, a robust evaluation
process becomes the linchpin in sifting through the generated question pool, identi-
fying questions that align with the intended educational outcomes.
Although evaluating question quality is imperative for understanding the utility of an
AQG system and the usability of questions generated, evaluation methods and quality
criteria used in AQG are often neglected. This aspect is perhaps the most fundamental
reason why question generation systems have not been fully integrated into educational
or assessment settings. Typically, AQG researchers introduce novel question generation systems for educational use by employing state-of-the-art natural language processing and machine learning methods, yet these systems often lack an essential component of the question generation pipeline, namely question evaluation, which prevents them from being fully integrated into real-life practice.
This paper, to the best of our knowledge, is the first study attempting to summarize and
categorize evaluation methods used by traditional item developers and computer scien-
tists relying on computer algorithms to generate questions. Through a comprehensive
survey of evaluation methods and quality criteria used in AQG, we aim to: 1) provide
an exhaustive list of evaluation methods and quality criteria used by the AQG systems;
2) identify the strengths, limitations, and gaps in each evaluation method; 3) highlight
the essential role of evaluation methods in AQG research; 4) bridge the theoretical and
practical gap between traditional psychometric and computer science methods in ques-
tion evaluation methods and quality criteria; and finally 5) create a taxonomy for cat-
egorizing existing evaluation methods used by AQG research to inform future studies
on selecting the best evaluation method given the resources and design of the study.
In the subsequent sections of this study, our focus shifts toward a comprehensive
exploration of the quality criteria employed by both assessment developers and AQG
systems. Our objective is to elucidate the shared aspects and distinctions that character-
ize these criteria. This comparative analysis serves as a crucial step in understanding
the convergence and divergence in the quality standards applied by human assessment
developers and AQG systems. Following the examination of quality criteria, we
introduce a taxonomy rooted in the evaluation methods employed by AQG systems.
Through this taxonomy, we categorize and classify the diverse approaches to evalua-
tion, shedding light on the strengths, limitations, and existing gaps within each method.
This systematic breakdown aims to offer a comprehensive understanding of the intrica-
cies involved in assessing the quality of automatically generated questions. The study
concludes with a discussion of recommendations and future directions for enhancing
the efficiency and scalability of AQG. By identifying areas for improvement and pro-
posing actionable suggestions, we aim to contribute to the ongoing evolution of AQG
systems, ensuring their alignment with the evolving needs of educational assessments.
1 Quality criteria for evaluating questions
1.1 Quality criteria used in traditional test development
Test developers and psychometricians typically refer to questions, exercises, prompts,
or statements in an assessment as items (American Educational Research Associa-
tion etal., 2014; Nelson, 2004). Thus, the process during which the properties of
items (e.g., structural characteristics and quality) are investigated is called item anal-
ysis (e.g., Bandalos, 2018; Lane etal., 2016; Osterlind, 1989). Item analysis is an
umbrella term that encompasses statistical approaches (e.g., Ashraf, 2020; Clauser
& Hambleton, 2011; French, 2001; Rezigalla, 2022) and judgment-based approaches
(e.g., Gierl et al., 2021, 2022; Osterlind, 1989) used for evaluating the quality of
questions created. Below, we dissect each approach to item analysis to explain the
processes and tools used for understanding the quality of questions created.
1.2 Statistical approaches for item analysis
Statistical approaches for item analysis have been considered a cornerstone for inves-
tigating question quality because empirical data about learners are collected to ana-
lyze item properties. The most frequently analyzed item properties include difficulty,
discrimination, distractors, and differential item functioning. Depending on the ques-
tion format (e.g., multiple-choice, cloze, essay), some of these statistical approaches
could be redundant (e.g., distractor analysis can only be used when multiple response
options are present). Typically, the empirical data collected during a test develop-
ment stage for statistical analysis are referred to as field testing. During field testing,
efforts have been made to ensure that the sample is representative of the target popu-
lation because the quality indices can typically only be generalized for the popula-
tion of interest. For instance, in large-scale international assessments, such as the Pro-
gramme for International Student Assessment (PISA), a sampling strategy has been
employed to adequately represent the schools and the students (OECD, 2020).
The first index, item difficulty, aids test developers in quantifying the level of dif-
ficulty of a question for a given sample. Based on the test theory (e.g., Classical Test
Theory; CTT or Item Response Theory; IRT) that assessment experts adopt, item
difficulty is operationalized slightly differently (for a more comprehensive discussion of test theories, please refer to Suen (2012)). CTT asserts that item difficulty, denoted as p, is the proportion of examinees answering the question correctly (Anastasi & Urbina, 2004; Clauser & Hambleton, 2011; Haladyna & Rodriguez, 2021), which can be expressed as:

$$p = \frac{\#\ \text{of examinees who answered the question correctly}}{\#\ \text{of examinees who attempted the question}} \tag{1}$$

Equation 1 indicates that as the number of students answering a question correctly increases, p also increases, suggesting that the question is easier.
Unlike CTT, which derives the difficulty of each question directly from raw responses, IRT postulates a probability function that places examinees’ ability levels and item difficulty on the same scale. This function expresses item difficulty through the probability of correctly answering a question given the ability level of the examinee. Formally, the simplest IRT model (i.e., the Rasch model) can be expressed as:

$$P(X = 1 \mid \theta, b) = \frac{e^{(\theta - b)}}{1 + e^{(\theta - b)}}, \tag{2}$$
where θ is the ability level of an examinee, and b is the difficulty of the question.
Since item difficulty and ability are positioned on the same continuum, when dif-
ficulty and ability overlap, there is a 50% probability for the examinee to cor-
rectly answer the question (Osterlind & Wang, 2017). If the examinee’s ability is
greater than a given item’s difficulty, the probability of answering an item correctly
increases (> 50%).
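To make these operationalizations concrete, the following Python sketch computes a CTT p-value from a vector of scored (0/1) responses and evaluates the Rasch probability in Equation 2; the response vector and the parameter values are illustrative rather than drawn from any study.

import numpy as np

# Scored responses to one item (1 = correct, 0 = incorrect); illustrative data.
responses = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# CTT item difficulty (Equation 1): proportion of examinees answering correctly.
p_value = responses.mean()

def rasch_probability(theta, b):
    """Rasch model (Equation 2): probability of a correct response."""
    return np.exp(theta - b) / (1 + np.exp(theta - b))

print(f"CTT p-value: {p_value:.2f}")
print(f"P(correct | theta=0.1, b=0.1): {rasch_probability(0.1, 0.1):.2f}")  # 0.50 when theta matches b
print(f"P(correct | theta=1.0, b=0.1): {rasch_probability(1.0, 0.1):.2f}")  # > 0.50 when ability exceeds difficulty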
The second quality criterion that has been inseparable from the item difficulty
index is item discrimination, which indicates a question’s capacity to distinguish high-
performing examinees (i.e., those who know the content well) from low-performing
ones (i.e., those who struggle with the content) (French, 2001). Similar to the difficulty
index, the operationalization of discrimination depends on the test theory adopted by
assessment experts. Under CTT, experts have three alternatives for evaluating item discrimination. Using the first approach, experts may evaluate the difference in p values (i.e., item difficulty) between high performers and low performers:

$$\text{discrimination index} = p_{\text{high-performing}} - p_{\text{low-performing}}, \tag{3}$$
where $p_{\text{high-performing}}$ is found by taking the upper-performing 25% or 27% of examinees and $p_{\text{low-performing}}$ is found by taking the lower-performing 25% or 27% of examinees based on the total test score (Cohen et al., 1996; Jenkins & Michael, 1986;
Kehoe, 1995). Using the second approach, experts may calculate a point-biserial
correlation that relies on finding Pearson’s product-moment correlation between the
total score and item score (Ebel & Frisbie, 1986; Kim et al., 2021). Formally, the point-biserial correlation coefficient ($r_{pbi}$) is given as

$$r_{pbi} = \frac{M_p - M_q}{s}\sqrt{pq}, \tag{4}$$
where $M_p$ is the mean total test score for examinees answering the question correctly, $M_q$ is the mean total test score for examinees answering the question incorrectly, $s$ is the standard deviation of the total test scores, $p$ is the proportion of examinees answering the question correctly, and $q$ is the proportion of examinees answering the question incorrectly. The final
approach to item discrimination under the CTT framework has emerged recently and
is referred to as a multi-serial index. The multi-serial index is based on a multiple
correlation and is used for calculating item discrimination while considering distrac-
tor discrimination (Haladyna & Rodriguez, 2021). The multi-serial index (MSI) is
calculated by

$$\text{MSI} = \frac{SS_{\text{between}}}{SS_{\text{total}}}, \tag{5}$$
where SS is the sum of squares, and it quantifies the amount of test score variance
accounted for by the item response across all options (Haladyna & Rodriguez,
2021).
When experts adopt IRT, the discrimination index is the rate at which the probability of selecting a correct response changes given the examinee’s ability level. Similar to CTT, as the value of the discrimination index increases, the question’s ability to differentiate high-performing students from low-performing ones increases (Ashraf, 2020). Under the IRT framework, item discrimination is expressed as

$$P(X = 1 \mid \theta, b, a) = \frac{e^{a(\theta - b)}}{1 + e^{a(\theta - b)}}, \tag{6}$$
where θ is the ability level of an examinee, b is the difficulty, and a is the discrimina-
tion of the question.
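As an illustration, the Python sketch below computes the upper-lower discrimination index (Equation 3) and the point-biserial correlation (Equation 4) for a single item from a small, made-up response matrix; the 27% grouping rule and the data are assumptions for demonstration only.

import numpy as np

# Rows = examinees, columns = items (1 = correct, 0 = incorrect); illustrative data.
scores = np.array([
    [1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 0],
    [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0],
])
item = scores[:, 0]          # responses to the item under review
total = scores.sum(axis=1)   # total test scores

# Upper-lower discrimination index (Equation 3) using the top and bottom 27%.
n_group = max(1, int(round(0.27 * len(total))))
order = np.argsort(total)
p_low = item[order[:n_group]].mean()
p_high = item[order[-n_group:]].mean()
discrimination = p_high - p_low

# Point-biserial correlation (Equation 4).
m_p, m_q = total[item == 1].mean(), total[item == 0].mean()
s = total.std()              # standard deviation of the total scores
p = item.mean()
r_pbi = (m_p - m_q) / s * np.sqrt(p * (1 - p))

print(f"Upper-lower index: {discrimination:.2f}, point-biserial: {r_pbi:.2f}")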
The third statistical approach widely used by traditional test developers is dis-
tractor analysis—examining the functioning of incorrect response options. Distrac-
tor analysis helps test developers evaluate whether all distractors have been used,
whether the distractors are more likely to be selected by low-performing examinees, and whether the keyed response is correctly identified (Gierl et al., 2016; Haladyna et al., 2002). Ideally, distractors should tap examinees’ misconceptions and hence
should attract examinees with a lack of knowledge or misconception about a given
topic (Wind etal., 2019). Using CTT, experts may look at the frequency distribu-
tion for each response option and analyze which performance level is more likely to
select distractors. Endorsing the IRT framework, experts may assess the probability
of each response option being selected by the examinees. Either approach provides a
picture of how the distractors function and whether they are more attractive to low-
performing examinees.
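A minimal version of such a distractor analysis can be obtained by cross-tabulating the selected options against performance groups, as in the sketch below; the option labels, the median-based high/low split, and the response data are hypothetical.

import numpy as np
import pandas as pd

# Selected options for one multiple-choice item and total test scores; illustrative data.
choices = np.array(["A", "B", "A", "C", "D", "B", "A", "C", "B", "A"])
totals = np.array([18, 9, 20, 7, 5, 11, 17, 6, 8, 19])
key = "A"  # keyed (correct) response

# Split examinees into high- and low-performing halves by total score.
group = np.where(totals >= np.median(totals), "high", "low")

# Frequency of each response option by performance group.
table = pd.crosstab(pd.Series(choices, name="option"), pd.Series(group, name="group"))
print(table)

# Ideally, the keyed response attracts the high group and distractors attract the low group.
print("Keyed response selected by high performers:", table.loc[key, "high"])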
The final statistical approach, which enables test developers to examine whether a question functions differently, typically based on group membership, is referred to as differential item functioning (DIF). Test developers may also investigate the presence of DIF across different time points. When DIF is present in a
question, a group affiliation (e.g., race, gender, or region) impacts the probabil-
ity of answering the question correctly (Clauser & Hambleton, 2011), creating a
bias toward a particular group. Assessment results obtained from questions with
DIF may lead to unwarranted inferences about the ability levels of examinees.
To perform DIF analysis, experts start by identifying focal and reference groups
and then investigate whether the probability of correctly answering the question
changes between similar members of these groups (Livingston, 2013). Similar
members of the focal and reference groups are determined based on their total
test scores. DIF is presumed to exist for a given question if the probability of giv-
ing a correct response in the focal group is smaller than the probability of giving
a correct response in the reference group when the ability levels are matched.
There are various statistical methods (e.g., the Mantel–Haenszel method, logistic
regression, and SIBTEST) developed for assessing DIF in a question (Bulut &
Suh, 2017; Osterlind & Everson, 2009).
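As a rough illustration of the logic behind these methods, the sketch below computes the Mantel-Haenszel common odds ratio for one item after stratifying examinees on a matching score; the simulated data and the use of raw scores as strata are simplifying assumptions, not a full operational DIF procedure.

import numpy as np

rng = np.random.default_rng(0)

# Simulated scored responses (1/0) for one item, group labels, and matching scores.
n = 400
group = np.repeat(["reference", "focal"], n // 2)
matching_score = rng.integers(0, 11, size=n)            # total score on the remaining items
p_correct = 1 / (1 + np.exp(-(matching_score - 5)))     # same ability-response link for both groups
item = rng.binomial(1, p_correct)                       # no DIF is simulated here

# Mantel-Haenszel common odds ratio across score strata.
num, den = 0.0, 0.0
for k in np.unique(matching_score):
    mask = matching_score == k
    ref, foc = mask & (group == "reference"), mask & (group == "focal")
    a, b = item[ref].sum(), (1 - item[ref]).sum()   # reference group: correct / incorrect
    c, d = item[foc].sum(), (1 - item[foc]).sum()   # focal group: correct / incorrect
    n_k = mask.sum()
    num += a * d / n_k
    den += b * c / n_k

alpha_mh = num / den
print(f"MH common odds ratio: {alpha_mh:.2f}")  # values near 1 suggest no DIF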
Guidelines for statistical approaches Depending on the test theory adopted, the
guidelines differ for item difficulty and discrimination indices. Under the CTT
framework, questions with p-values smaller than 0.30 are considered difficult items.
Questions with a p-value between 0.30 and 0.70 have moderate difficulty. Finally,
questions with a p-value greater than 0.70 are considered easy items (Adegoke,
2013; Bichi, 2016; Henning, 1987). Concerning the interpretation of item discrimi-
nation when CTT is used, questions with negative discrimination (≤ 0) should be
flagged and scrutinized to evaluate whether there are any issues with the question
(e.g., the correct response is not properly identified) because the negative discrimi-
nation index indicates that lower-performing examinees are more likely to answer
the question correctly compared to higher-performing ones. Questions with a dis-
crimination index smaller than 0.20 should be revised or eliminated because they are
not good at differentiating higher-performing examinees from the lower-performing
ones. Questions with a discrimination index greater than 0.40 are considered good
questions, while those with a discrimination index between 0.20 and 0.40 could be
revised or scarcely used (Bichi, 2016; Ebel & Frisbie, 1986; Towns, 2014).
Interpreting item difficulty and discrimination is not as straightforward in IRT because examinees and questions are placed on the same continuum, and whether a question is difficult depends on the examinee’s ability level. Theoretically, item difficulty ranges from −∞ to +∞, placing the difficulty of each item on a continuum. Thus, item difficulties can be compared to one another, yet whether an item is difficult still depends on an examinee’s ability level. That is, item difficulties and examinee abilities are placed on the same continuum, allowing the test developer to match item difficulty with examinee ability. Assume that a question has a difficulty parameter of b = 0.1; then an examinee with an ability level of θ = 0.1 has a 50% probability of answering the question correctly (DeMars, 2010). Hence, item difficulty is relative to the ability level of the examinee, and interpreting whether a question is easy or difficult depends on that ability level.
Concerning discrimination in IRT, a negative discrimination index indicates that the question is problematic and, just like in CTT, should be removed. Theoretically, similar to the difficulty index, the discrimination index ranges from −∞ to +∞; however, it typically does not exceed 2. Questions with a discrimination index greater than 0.4 are considered good (DeMars, 2010), and questions with a discrimination index higher than 0.65 are regarded as having high discrimination (Baker, 2001). Questions with a higher discrimination index should be preferred because they are minimally affected by guessing behavior and are associated with higher internal consistency (Kim & Feldt, 2010).
1.3 Judgment-based approaches for item analysis
Judgment-based approaches for item analysis rely on employing subject-matter
experts to assess question quality. A rating scale or a rubric-like evaluation tool is
often used for assessing the quality of questions based on criteria such as the degree
of alignment between the content and question coverage, fairness of the questions,
ambiguity in questions, the difficulty levels of the questions, formatting issues, cog-
nitive load, and readability of questions (Chalifour & Powers, 1989; Engelhard etal.,
1999; Gierl etal., 2021; Gierl etal., 2016; Osterlind, 1989; Wauters etal., 2012).
For instance, Osterlind (1989) suggested using a 3-point scale for evaluating con-
tent alignment using the scales of high congruence, medium congruence, and low
congruence (pp. 267–268). Similarly, Gierl etal. (2016) proposed a 4-point rating
scale composed of accept, accept-minor revision, reject-major revision, and reject
for assessing the question quality based on the criteria of content, logic, and presen-
tation of questions created.
There is less uniformity in judgment-based approaches in terms of criteria and
rating scales used for item analysis, suggesting more idiosyncrasy during the evalu-
ation process. Below, we compare statistical and judgment-based approaches as well
as enumerate the strengths and limitations of both approaches traditionally used by
test developers and psychometricians.
1.4 Comparisons between statistical and judgment-based approaches
Judgment-based approaches require a rating scale and a training process involving
raters to establish reliability and consistency among those raters. Thus, it requires
extensive time and resources to train experts to assess the quality of questions con-
sistently using the rating scale. Nonetheless, studies indicated that experts could
vary a lot and might be poor judges of item quality, especially when it comes to
evaluating item difficulty (Engelhard etal., 1999; Impara & Plake, 1998; Wauters
etal., 2012). Due to the challenges inherent in judgment-based approaches, statisti-
cal approaches are considered the golden standard for evaluating the quality of ques-
tions because they provide empirical data about the question’s quality.
Nonetheless,statistical approaches also embody certain limitations. These limi-
tations restrict the use of statistical approaches and the generalizability of quality
indices. Concerning the first limitation of the use of statistical approaches, if a large
number of questions exist, thenall questions may not be field-tested, and field test-
ing is a costly and time-consuming process. Concerning the second limitation of
Education and Information Technologies
1 3
generalizability of the statistical approach, we highlight two fundamental assump-
tions in statistical approaches: 1) The sample selected is representative of the target
population, and 2) the real assessment administration conditions can be reproduced
during the field-testing process. Especially under the CTT framework, the quality
indices (i.e., item difficulty, item discrimination, and distractor analysis) heavily
depend on the sample and question characteristics. Distortions in sample representa-
tiveness and administration conditions may introduce construct-irrelevant variance,
contaminating the validity and reliability of statistical indices (e.g., Anastasi &
Urbina, 2004; Hambleton etal., 1991; Livingston, 2013). The construct-irrelevant
variance may involve random guessing behavior, test speededness, not-reached
items, and omitted items and can contaminate item parameters (e.g., Gorgun and
Bulut, 2022), test administration processes (Gorgun & Bulut, 2023), or scoring pro-
cedures (e.g., Gorgun & Bulut, 2021).
1.5 Quality criteria in automatic question generation
In AQG, questions are generated through the assistance of computer algorithms,
significantly reducing the reliance on human intervention throughout the question
generation process. While this automated approach minimizes human involvement,
it brings forth unique considerations for quality that were not as prominent in tradi-
tional question development where human input played a central role. Consequently,
the conventional evaluation methodologies may not thoroughly examine facets asso-
ciated with AQG, potentially limiting their effectiveness in addressing the distinctive
challenges presented by this automated process. Thus, the objective of this section is
to amalgamate traditional evaluation criteria with those employed by AQG research-
ers, bridging the gap between established evaluation practices and the evolving land-
scape of AQG.
Some of the quality criteria in AQG have exclusively focused on the linguistic
aspects of the generated questions. These linguistic criteria include grammaticality,
fluency, relevancy, semantic correctness, syntax clarity, meaning, spelling, natural-
ness, specificity, and coherence (Amidei etal., 2018; Gatt & Krahmer, 2018; Kurdi
etal., 2020; Mulla & Gharpure, 2023). These criteria are typically assessed using a
judgment-based approach, such asby employing a rating scale (e.g., Becker etal.,
2012; Heilman, 2011; Heilman & Smith, 2010; Mostow etal., 2017). Yet, the level
of scrutiny varies widely across the AQG systems. Some researchers have used a
rubric with multiple criteria (e.g., Becker etal., 2012; Maurya and Desarkar, 2020;
Niraula and Rus, 2015), whereas others evaluated the quality of questions gener-
ated using a binary scale (e.g., Wang etal., 2021). For example, Heilman and Smith
(2010) evaluated the quality of generated questions using the criteria of grammati-
cality, incorrect information, vagueness, and awkwardness/other using the rating
scale levels ofgood, acceptable, borderline, unacceptable, and bad (Heilman, 2011,
pp. 183–184). On the other hand, several have rated whether the generated ques-
tions satisfy the quality criteria of fluency (i.e., whether the question is coherent
and grammatically correct) and relevancy (i.e., whether the item is relevant given
the input sentence). In addition to the linguistic criteria, AQG systems were also
assessed based on the accuracy of cognitive models underlying question generation
(e.g., Gierl etal., 2022, 2016), question engagement (e.g., Van Campenhout etal.,
2022), differential child item functioning (e.g., Fu etal., 2022), and pedagogical use-
fulness (e.g., Jouault etal., 2016; Tamura etal., 2015; Zhang & VanLehn, 2016).
There are also similarities between AQG quality criteria and traditional quality
criteria. These include item difficulty, distractor analysis, domain relevance, and educational usefulness. Note that, in AQG research, the definitions of these criteria might vary across studies compared with the traditional ones. For example, by
employing CTT, Gierl etal. (2016) and Van Campenhout etal. (2022) estimated
item difficulty as a p-value, while others conceptualized item difficulty as cogni-
tive complexity (McCarthy etal., 2021; Settles etal., 2020; Venktesh etal., 2022).
Likewise, several studies conceptualized item difficulty as the similarity between the
keyed response and distractors (Lin etal., 2015; Seyler etal., 2017).
2 A taxonomy of evaluation methods used in AQG
In this section, we propose a coherent taxonomy for organizing the evaluation meth-
ods employed in AQG that addresses the quality criteria discussed above. We con-
tend that providing a comprehensive list of evaluation methods and a coherent tax-
onomy may help AQG researchers identify the suitable evaluation methods while
understanding their advantages and limitations. As such, AQG researchers can pick
the most suitable evaluation method given the design and approach employed dur-
ing the question generation process. Note that AQG studies may combine multiple
evaluation methods (Amidei etal., 2018), and therefore, the taxonomy proposed in
this study reflects the potential overlaps between the evaluation methods employed.
The proposed taxonomy (Fig. 1) was created by considering aspects of quality criteria, benchmarks, resources, and input used during the evaluation process. In Fig. 2, we display the decision tree used to categorize evaluation methods for creating the proposed taxonomy. We first considered the aspect of the question qual-
ity criteria and the existence of benchmarks. Specifically, we considered whether
the questions are evaluated on an individual basis or whether the question-gener-
ation system is evaluated holistically using benchmarks. Benchmarks here refer to
whether there is a baseline during the evaluation process to be considered as a point
of reference and allow test developers to evaluate the quality of the question-genera-
tion system holistically. That is, when the goal is to evaluate the question-generation
system holistically, we categorize these queries as comparison-based methods.
On the other hand, individually evaluated items can be categorized under the
umbrella terms of statistical approaches or judgment-based approaches based on the
quality criteria employed. If human judgment is the main instrument of evaluation, we
labelled this method as human evaluators. We further refined statistical approaches
based on input and resources. Input refers to which information has been used during
the evaluation process. For example, do AQG researchers evaluate questions directly or
do they rely on empirical data obtained after the generated questions are administered
to a group of examinees? When items are field-tested, we distinguish this approach as
post-hoc evaluation methods. Resources refer to the availability of certain information
Fig. 1 The taxonomy for evaluation methods in automatic question generation
Fig. 2 The decision tree used for categorizing the evaluation methods in AQG
in the dataset (e.g., reference questions, item ratings). When auxiliary information is
available in the dataset and statistical approaches are used, we refer to this family of evaluation methods as metric-based methods. Below, we elaborate on each category of the taxonomy and illustrate example studies from the literature.
2.1 Metric‑based evaluations
Metric refers to a standard of measurement, or to an art, process, or science of measuring (Merriam-Webster, 2023). Metric-based evaluations encompass
standardized metrics or indices that can be used to automatically evaluate the perfor-
mance of the AQG system. Typically, the performance of the AQG system is com-
pared against human-authored questions or reference questions (e.g., Gao etal., 2019;
Kumar etal., 2018; Marrese-Taylor etal., 2018; Wang etal., 2018). Table1 presents
several examples of previous AQG studies that used metric-based evaluations. Below,
we explain these evaluation indices, describe how they can be implemented, and finally
discuss their limitations.
The first group of metric-based evaluations is widely used in machine-translation
tasks that involve comparing a machine-translated sentence with a reference sentence.
Among these indices, the most frequently used ones in AQG are BiLingual Evaluation
Understudy (BLEU; Papineni etal., 2002), Metric for Evaluation of Translation with
Explicit ORdering (METEOR; Banerjee & Lavie, 2005), and Recall-Oriented Under-
study for Gisting Evaluation (ROUGE; Lin, 2004). These metrics emerged in machine
translation because employing human evaluations is costly (Hovy, 1999) and time-
consuming (Papineni etal., 2002). These methods have facilitated the evaluation of
machine-translation systems while quantifying the magnitude of the closeness between
human translation and machine translation.
The first metric, BLEU (Papineni et al., 2002), counts the number of matching n-grams between the machine translation and the reference translation. Note that BLEU does not consider the position of the matched n-grams. The BLEU score ranges between 0 and 1, and a higher BLEU score indicates a better machine translation; thus, a score closer to 1 indicates that the machine translation is close to the human translation. The BLEU score is calculated by estimating a precision score based on n-gram matches and a penalty function. The precision score ($p_n$) is given as

$$p_n = \frac{\sum_{C \in \{\text{Candidates}\}} \; \sum_{\text{n-gram} \in C} \text{Count}_{\text{clip}}(\text{n-gram})}{\sum_{C' \in \{\text{Candidates}\}} \; \sum_{\text{n-gram}' \in C'} \text{Count}(\text{n-gram}')}, \tag{7}$$

where $\text{Count}_{\text{clip}}$ truncates the count of each candidate n-gram at its maximum count in the reference translation. The penalty function takes sentence length into account (i.e., the brevity penalty, BP) and is expressed as

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r, \end{cases} \tag{8}$$
where r is the reference corpus length and c is the total length of the candidate translation corpus. Finally, the BLEU score is obtained by

$$\text{BLEU} = BP \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right), \tag{9}$$

where $w_n = 1/N$ and N is the maximum n-gram order considered (typically 4).
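To show how Equations 7-9 fit together, the sketch below implements a bare-bones sentence-level BLEU with clipped n-gram precision and the brevity penalty; it uses a single reference and no smoothing, so it is an illustration rather than a standard toolkit implementation.

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU (Equations 7-9): clipped n-gram precision x brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(clipped / total) if clipped else float("-inf"))
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

generated = "what is the capital of france"
reference = "what is the capital city of france"
print(f"BLEU: {bleu(generated, reference):.3f}")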
The second machine-translation metric employed for evaluating questions in AQG is ROUGE (Lin, 2004). Originally, four variations of the ROUGE metric were proposed (ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S); however, in AQG, ROUGE-L is used most frequently. ROUGE-L is based on estimating the longest common subsequence (LCS) between a machine translation and a reference translation. Similar to BLEU, ROUGE-L computes the similarity between the machine translation and the reference translation by calculating an LCS-based F-measure (Lin, 2004). ROUGE-L is thus calculated by finding the LCS-based recall

$$R_{LCS} = \frac{LCS(X, Y)}{m} \tag{10}$$

and the LCS-based precision

$$P_{LCS} = \frac{LCS(X, Y)}{n}, \tag{11}$$

where X and Y are the machine and reference translations with lengths m and n, respectively. Finally, the ROUGE-L score is found by

$$F_{LCS} = \frac{(1 + \beta^2)\, R_{LCS} P_{LCS}}{R_{LCS} + \beta^2 P_{LCS}}, \tag{12}$$

where β controls the relative weighting of precision and recall (Lin, 2004). This generates a score between 0 and 1, and larger values indicate more similarity between the machine and reference translations.
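The sketch below mirrors Equations 10-12 using a dynamic-programming LCS; fixing β at 1 is a simplifying assumption made here for illustration.

def lcs_length(x, y):
    """Length of the longest common subsequence between token lists x and y."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if xi == yj else max(
                table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]

def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-measure (Equations 10-12), following the notation above."""
    x, y = candidate.split(), reference.split()
    lcs = lcs_length(x, y)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(x), lcs / len(y)
    return (1 + beta**2) * recall * precision / (recall + beta**2 * precision)

print(f"ROUGE-L: {rouge_l('what is the capital of france', 'what is the capital city of france'):.3f}")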
The final machine-translation metric used in AQG is METEOR (Banerjee & Lavie, 2005). METEOR, similar to ROUGE-L, also uses precision and recall scores but additionally takes the order of matched words into account (Banerjee & Lavie, 2005). METEOR is computed as

$$\text{Score} = F_{mean} (1 - \text{Penalty}), \tag{13}$$

where $F_{mean}$ is a combination of unigram recall (R) and unigram precision (P),

$$F_{mean} = \frac{10PR}{R + 9P}, \tag{14}$$

and Penalty is defined as

$$\text{Penalty} = 0.5 \left(\frac{\#\,\text{chunks}}{\#\,\text{unigrams matched}}\right)^{3}. \tag{15}$$
METEOR intends to overcome the limitations inherent in BLEU (i.e., neglect-
ing the position of n-gram pairs) by introducing the penalty function defined above.
Specifically, the penalty function considers the unigrams in adjacent positions in
candidate translation (i.e., chunks) and the unigrams matched. METEOR provides
a score between 0 and 1, and the larger values indicate more similarity between the
candidate translation and the reference translation.
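For intuition, the short fragment below evaluates Equations 13-15 for hypothetical matching statistics (the counts are invented).

# Hypothetical unigram matching statistics between a generated and a reference question.
matched, candidate_len, reference_len, chunks = 6, 7, 8, 2

precision = matched / candidate_len                            # P
recall = matched / reference_len                               # R
f_mean = 10 * precision * recall / (recall + 9 * precision)    # Equation 14
penalty = 0.5 * (chunks / matched) ** 3                        # Equation 15
score = f_mean * (1 - penalty)                                 # Equation 13
print(f"METEOR-style score: {score:.3f}")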
In addition to the indices borrowed from the machine translation literature, other
standardized metrics have also been used to evaluate the quality of AQG. These
include F1, perplexity, the Python language tool, toxicity analysis, embedding simi-
larity, and specialized metrics that were developed by the AQG researchers (Amidei
etal., 2018). To exemplify specialized metrics, Wang and colleagues (Wang etal.,
2021) developed a math word problem question-generation system where equa-
tions are used to represent the math questions. They developed a metric, ACC-eq,
to assess the similarity between the equation describing the generated math word
problem and the input equation.
2.2 Implementation
The metric-based evaluation methods necessitate a point of reference to carry out
the evaluation of automatically generated questions. Using BLEU, ROUGE-L,
and METEOR, researchers can evaluate the similarity between machine-generated
and human-authored questions. In some studies, researchers have also compared
machine-generated distractors with human-authored ones. For example, using
BLEU, METEOR, and ROUGE-L, the researchers have evaluated the similarity
between the questions available in the dataset (e.g., SQuAD; Rajpurkar etal., 2016)
and generated questions (e.g., Gao etal., 2019; Kumar etal., 2018; Maurya and
Desarkar, 2020). Similarity-based metrics can also be used to compare the semantic
distance between machine-generated questions or distractors and human-authored
questions or distractors. For example, Rodriguez-Torrealba etal. (2022) and Maurya
and Desarkar (2020) obtained vector embeddings (i.e., numerical representations of
individual words) for the questions by using linguistic models such as Global Vec-
tors for Word Representation (GloVe; Pennington et al., 2014) and Bidirectional
Encoder Representations from Transformers (BERT; Devlin etal., 2018), calculated
cosine similarity based on the embeddings, and evaluated the semantic similarity
between the generated questions and the reference questions.
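One common way to implement such a comparison is sketched below using the sentence-transformers package; the specific model name and the example questions are assumptions, and any sentence encoder that produces fixed-length vectors could be substituted.

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed available

# Any pretrained sentence encoder can be substituted here.
model = SentenceTransformer("all-MiniLM-L6-v2")

generated = "What is the main function of mitochondria in a cell?"
reference = "Which organelle is responsible for producing energy in the cell?"

emb_gen, emb_ref = model.encode([generated, reference])

# Cosine similarity between the two question embeddings.
similarity = np.dot(emb_gen, emb_ref) / (np.linalg.norm(emb_gen) * np.linalg.norm(emb_ref))
print(f"Semantic similarity: {similarity:.2f}")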
Various metrics have been introduced to facilitate an automated evaluation of
multiple dimensions related to question quality. These metrics offer a comprehen-
sive assessment, encompassing aspects such as question diversity, textual coher-
ence, grammatical accuracy, the average count of n-grams, and even the evalua-
tion of toxicity within the generated questions (as explored by Wang etal., 2022).
By employing these metrics, a nuanced understanding of the quality of questions
can be derived, enabling a more robust evaluation process that goes beyond mere
correctness and delves into the intricacies of linguistic expression and ethical
considerations.
Ultimately, researchers can leverage labeled data acquired through human
evaluators to facilitate the training of classifiers capable of automatically assess-
ing the quality of generated questions. Noteworthy examples in literature, such
as the work by Becker etal. (2012) and Heilman (2011), showcase the training of
machine learning classifiers specifically designed to predict the fluency and gram-
maticality of questions. A variety of well-established classifiers find application in
automated question quality detection, including logistic regression, random forest,
LambdaMart, and rankSVM, as highlighted in studies by Becker etal. (2012), Liang
etal. (2018), and Liu etal. (2017). These machine learning approaches, informed by
human-labeled data, contribute significantly to the advancement of automated sys-
tems for evaluating the linguistic and structural aspects of generated questions.
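A minimal version of this idea, assuming a set of generated questions with binary acceptability labels from human raters, might look like the following scikit-learn sketch; the features, labels, and model choice are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Generated questions with human-assigned acceptability labels (illustrative data).
questions = [
    "What is the capital of France?",
    "Capital France what of is the?",
    "Which organelle produces energy in the cell?",
    "Energy which cell the in organelle?",
] * 10
labels = [1, 0, 1, 0] * 10  # 1 = acceptable, 0 = unacceptable

X_train, X_test, y_train, y_test = train_test_split(
    questions, labels, test_size=0.25, random_state=0)

# Word uni- and bigram TF-IDF features give a crude signal about fluency and word order.
classifier = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
classifier.fit(X_train, y_train)
print(f"Held-out accuracy: {classifier.score(X_test, y_test):.2f}")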
2.3 Limitations of metric-based methods
While metric-based evaluations offer efficiency and simplicity in assessing automat-
ically generated questions through standardized measures such as BLEU or toxicity
analysis, their applicability hinges on the availability of reference questions, distrac-
tors, or labeled data. Therefore, the necessity of ground truth or reference questions
turns into a major roadblock for the utilization of these metrics. Moreover, given
that these metrics primarily assess the similarity between a reference question and
an automatically generated question, a question deemed acceptable but semantically
dissimilar may receive a low score (Kurdi etal., 2020). Consequently, the evaluation
process may inadvertently disqualify high-quality questions due to dissimilarities in
linguistic structure. Conversely, metric-based evaluations may flag generated questions as acceptable due to their similarity in linguistic structure even when those questions are, in fact, useless for educational or pedagogical purposes.
2.4 Human evaluators
Human evaluators refer to employing manual coding and rating scales for evaluating
the quality of automatically generated questions. As such, human evaluators typi-
cally employ the quality criteria discussed under judgment-based approaches. The
criteria include language fluency (e.g., Mostow etal., 2017; Song & Zhao, 2017),
grammaticality (e.g., Chughtai etal., 2022; Heilman, 2011), distractability of ques-
tions (e.g., Maurya and Desarkar, 2020), the complexity of questions (e.g., Chung &
Hsiao, 2022), the acceptability of questions (Gierl etal., 2016; Liang etal., 2017),
the difficulty of questions (e.g., Rodriguez-Torrealba etal., 2022), or domain rel-
evance (e.g., Chughtai etal., 2022; Dugan etal., 2022). The labeled data obtained
from human evaluators may serve as ground truth to be used in subsequent analysis
(e.g., training classifiers; Becker etal., 2012; Heilman & Smith, 2010).
Human evaluators comprise various groups with different types of expertise,
such as subject matter experts (e.g., Gierl etal., 2016), researchers (e.g., Dugan
etal., 2022), students (e.g., Panda etal., 2022), teachers (e.g., Chung & Hsiao,
2022), and crowdsource workers (e.g., Lin et al., 2015). An advantage of employing crowdsourcing is that crowdsource workers are relatively less expensive than other evaluators, and many of them can be recruited to evaluate the vast number of questions generated by AQG. This provides an efficient and inexpensive solution to the question-evaluation process. Nonetheless, human evaluators may exhibit different levels of expertise for question evaluation, which may cast suspicion on the validity of the question quality labels assigned. Examples of studies employing human evaluators are provided in Table 2.

Table 2  Examples of AQG systems evaluated using human evaluators

Authors | Generated Item Type | Context | AQG Method | Evaluation Method
--- | --- | --- | --- | ---
Attali et al., 2022 | Multiple-choice | Reading comprehension | GPT-3 | Experts
Becker et al., 2012 | Cloze | Generic | Parse-trees | Crowdsource workers
Chughtai et al., 2022 | Multiple-choice | Engineering | T-5, sense2vec | Experts
Chung & Hsiao, 2022 | Constructed response | Programming | Template-based | Teachers
Dugan et al., 2022 | Constructed response | Generic | T-5 | Researchers
Gierl et al., 2016 | Multiple-choice | Medicine | Template-based | Experts
Liang et al., 2017 | Distractor | Biology, Math, Physics | Generative adversarial neural nets | Experts
Lin et al., 2015 | Multiple-choice | Wildlife | Hybrid semantic similarity | Crowdsource workers
Maurya and Desarkar, 2020 | Distractor | Reading comprehension | Hierarchical multi-decoder network | Students
Mostow et al., 2017 | Multiple-choice | Reading comprehension | Parse-trees, n-grams | Students
Olney, 2021 | Cloze items | Science | Deep learning summarization | Experts, Students
Panda et al., 2022 | Distractor generation, cloze item | Language | Neural machine translation, round-trip machine translation | Students
Rodriguez-Torrealba et al., 2022 | Multiple-choice, answer, distractor | Generic | T-5 | Professionals
Song & Zhao, 2017 | Constructed response | Generic | Neural machine translation | Human (unknown category)
von Davier, 2018 | Survey | Personality scale | Recurrent neural network, long short-term memory | Crowdsource workers
Wang et al., 2022 | Constructed response | Biology | GPT-3, prompt engineering | Experts
2.5 Implementation
Typically, a rating scale or scoring rubric composed of multiple criteria has
been used by human evaluators to assess the quality of generated questions (e.g.,
Becker etal., 2012; Mostow etal., 2017; Rodriguez-Torrealba etal., 2022; von
Davier, 2018). Evaluators may undergo a training process to prevent idiosyncratic
rating scale interpretation and to achieve standardization during question evalua-
tion using the rating scale. In addition, the questions generated can be assessed by
multiple evaluators, and the interrater reliability among the evaluators could be
analyzed to examine the extent to which evaluators agree and are consistent with
one another.
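Interrater agreement of this kind can be quantified, for example, with Cohen's kappa, as in the short sketch below; the two raters' labels are invented.

from sklearn.metrics import cohen_kappa_score

# Acceptability ratings assigned by two evaluators to the same ten questions (illustrative).
rater_1 = ["accept", "accept", "revise", "reject", "accept", "revise", "accept", "reject", "accept", "revise"]
rater_2 = ["accept", "revise", "revise", "reject", "accept", "accept", "accept", "reject", "accept", "revise"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance-level agreement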
2.6 Limitations of human evaluators
Human evaluators are perhaps the most frequently used evaluation method in AQG
research (e.g., Kurdi etal., 2020). Nonetheless, the lack of reporting practices may
encumber the appraisal of the quality of human evaluations. Although AQG studies
have involved different numbers of evaluators, ranging from 1 to 364 (Amidei etal.,
2018), they have rarely reported the rate of agreement between the evaluators. Fur-
thermore, training practices and measures adopted to ensure agreement and consist-
ency among the evaluators are usually unknown (Kurdi etal., 2020). Most impor-
tantly, previous studies often fail to provide a detailed description of the evaluators
and evaluation criteria used. For example, researchers simply report that evaluators
are native English speakers, lacking information on the educational background or
demographic characteristics of evaluators (e.g., Maurya and Desarkar, 2020; Song
& Zhao, 2017), or they only indicate that evaluators assessed the grammatically of
questions without necessarily providing the reader with information on how gram-
matically is defined. As such, it is impossible to appraise the quality of human eval-
uators. To optimize the use of human evaluators in AQG, researchers should pro-
vide a detailed description of the evaluators, the recruitment process, training, rating
scale development, and tools used (Lin etal., 2015). A third limitation is that evalu-
ating questions by employing human evaluators is typically an expensive and time-
consuming process, and given the scale of all generated questions, human evaluators
may not be an optimal solution for evaluating all questions generated.
2.7 Post‑hoc evaluations
Post-hoc evaluations refer to administering automatically generated questions to a
representative sample and evaluating the quality of the questions after the adminis-
tration is completed. As such, post-hoc evaluations typically incorporate statistical
approaches for item analysis. Post-hoc evaluations include experimental designs and
psychometric analysis. The former may compare the impact of generated questions
with human-authored questions on learner engagement and performance. Alterna-
tively, experimental studies may also include control and experimental groups in
which the effectiveness of automatically generated items on learner performance is
assessed. Psychometric analysis, on the other hand, typically starts with picking a
test theory (i.e., CTT or IRT) and running item analysis that may focus on item diffi-
culty, item discrimination, or distractor analysis. Table 3 provides examples of studies that used post-hoc evaluations to assess the quality of the generated questions.

Table 3  Examples of AQG systems evaluated using post-hoc methods

Authors | Item Type | Context | AQG Method | Evaluation Method
--- | --- | --- | --- | ---
Attali et al., 2022 | Multiple-choice | Reading comprehension | GPT-3 | Psychometric properties
Gierl et al., 2016 | Multiple-choice | Medicine | Template-based | Psychometric properties
Gierl & Lai, 2012 | Multiple-choice | Medicine | Template-based | Psychometric properties
Hommel et al., 2022 | Survey | Personality | Recurrent neural network, long short-term memory, GPT-2 | Psychometric properties
Van Campenhout et al., 2022 | Matching, cloze | Psychology | Rule-based | Experimental
Yang et al., 2021 | Cloze items | Reading comprehension | BERT | Experimental
2.8 Implementation
Previous studies used post-hoc evaluations when automatically generated ques-
tions could be administered to a representative sample of examinees to obtain sta-
tistical indices about the questions generated. For instance, Van Campenhout and
colleagues (2022) aimed to understand the influence of automatically generated
questions on student engagement and persistence by comparing generated questions
with human-authored ones. They found that both questions functioned similarly. In
a similar study, Yang and colleagues (Yang etal., 2021) investigated the impact of
automatically generated questions on students’ reading engagement and reading per-
formance. They found that those who practiced the content using automatically gen-
erated questions had better course performance (Yang etal., 2021).
Beyond simply assessing the influence of automatically generated questions on
learner performance, researchers have also delved into a comprehensive evaluation
of the psychometric properties associated with such questions. A notable instance
of this approach is found in the work of Gierl and colleagues (Gierl etal., 2016)
who meticulously appraised the quality of automatically generated medical ques-
tions. Their evaluation extended beyond mere performance outcomes and involved
administering these questions to a representative sample. The criteria that Gierl
etal. (2016) employed to gauge quality included item difficulty, a thorough analy-
sis of distractors, and an examination of keyed response functioning. Similarly, in a
study conducted by Attali etal. (2022), researchers took a multifaceted approach to
assess the psychometric properties of generated questions. This investigation went
beyond traditional metrics, incorporating an examination of item difficulty, an analy-
sis of local independence within the questions, and a scrutiny of response times.
Such comprehensive evaluations not only provide insights into the impact of gener-
ated questions on learner performance but also offer a nuanced understanding of the
inherent qualities that contribute to the effectiveness and reliability of these educa-
tional assessment tools.
2.9 Limitations of post-hoc evaluations
Post-hoc evaluations, anchored in data-driven methodologies, rely on empirical
evidence that demonstrates the quality of generated questions. These assessments
commonly integrate statistical approaches for item analysis to gauge question qual-
ity. However, as highlighted earlier, the limitations associated with relying solely on
statistical methods for item analysis have been explicitly articulated, including con-
cerns related to the generalizability of indices derived from a specific sample.
Another caveat of post-hoc evaluations is that the question quality is assessed in a
retrospective manner. That is, we have limited information about the quality of ques-
tions generated prior to administering them. This inherent characteristic may result
in unintended repercussions, contingent on the testing conditions. On one hand, if
the generated questions are tested in a real assessment setting, poor questions could
potentially induce learner confusion and frustration. Conversely, in a field-testing
scenario, the testing conditions may wield substantial influence over the conclusions
drawn regarding the quality of the questions.
Finally, it is worth noting that researchers in AQG often produce a substantial
volume of questions. In practical terms, it becomes unfeasible to administer all
generated questions in a field-testing or experimental context. Consequently, while
we can form a reliable estimation of the quality of the questions that were actually
administered, a significant number of leftover questions remain untested, prevent-
ing the acquisition of comprehensive item statistics. This surplus of unadministered
questions introduces an additional challenge in the adoption of automatically gener-
ated questions for operational assessment and learning environments. The post-hoc
evaluations, by their very nature, contribute to a bottleneck in the seamless imple-
mentation of these questions, raising practical concerns about their widespread
applicability and integration into educational settings.
2.10 Comparison‑based evaluations
So far, we have discussed methods used to evaluate questions on an individual basis.
However, test developers might be interested in evaluating the question-generation system holistically to understand the contributions of certain components of the question-generation pipeline. This evaluation method could have been subsumed under metric-based evaluations (see Fig. 1) because the degradation in model performance is typically estimated by using metrics such as BLEU, ROUGE, Recall, Precision, or F1 (Wang et al., 2021). However, comparison-based methods deviate from metric-based evaluations because, when comparison-based evaluations are used, the system is evaluated holistically, whereas each generated question is evaluated individually in metric-based evaluations.
Although rarely employed, comparison-based evaluations have two branches:
ablation studies and comparing the AQG system with previous question generation
systems. Ablation studies are evaluation methods that involve removing a compo-
nent of the AQG system and assessing the degree of degradation in the question
generation pipeline. Here, degradation in the model refers to a decrease in model
performance (e.g., Precision, Recall, BLEU, or F1 values) when one or more com-
ponents of the question generation system are removed. Thus, the system is expected
to perform worse if the removed component is essential to the AQG system. The
second branch involves comparing the AQG system with the baseline or previous
system and assessing how much improvement is achieved with the new modifica-
tions. This second branch is perhaps the least frequently used method in comparison
to other methods because AQG researchers require an existing system or a base-
line model in order to assess the performance of the newly developed system. None-
theless, a few studies follow this method for evaluating the AQG system. Table4
includes several examples of studies using comparison-based evaluations.
2.11 Implementation
There are several AQG systems that employ ablation studies to assess question qual-
ity. For instance, Wang and colleagues (Wang etal., 2021) removed several compo-
nents from their proposed AQG system to assess the degree of degradation in the
AQG system. Specifically, they compared several context keyword selection meth-
ods including term frequency-inverse document frequency, nouns and pronouns
when generating math questions. Using BLEU and a customized evaluation metric,
Wang and colleagues (Wang etal., 2021) compared different models’ performance.
Thus, this process served as baseline models for Wang etal. (2021) to justify the
contribution of the components to the question-generation process. In addition, a
few studies considered previous AQG systems as the benchmark and assessed the
degree of improvement observed in the newer AQG systems (e.g., Huang & He,
2016; Mostow etal., 2017).
2.12 Limitations ofcomparison‑based evaluations
Comparison-based evaluations necessitate access to previous AQG systems or to
questions generated by those systems. Because of the limited availability of AQG
data, this evaluation method is rarely used in practice. Ablation studies, on the
other hand, assess question quality indirectly, through the degree of degradation
observed in the AQG system when a component is removed. Consequently, they offer
limited insight into the overall quality of the AQG system and of the individual
questions it generates.
3 Discussion
In this study, we provided a comprehensive overview of the evaluation criteria and
methods used by AQG system developers. Our survey highlighted that AQG researchers
may evaluate the AQG system holistically by removing components of the
question-generation system or by comparing the system's performance against
previous baseline models. While this approach allows researchers to compare the
contributions of several preprocessing or modeling decisions within the
question-generation system (e.g., Wang et al., 2021), the quality of individual
questions remains unknown. As such, comparison-based methods are not entirely
sufficient for deploying the generated questions in learning environments and
assessments.
Methods relying on statistical and judgment-based approaches allow researchers
and practitioners to evaluate each generated question (e.g., Attali et al., 2022;
Becker et al., 2012; Dugan et al., 2022; Gierl et al., 2022). Nonetheless, these
evaluation methods have several significant pitfalls that limit their
generalizability and efficiency when generated questions are implemented in
real-world educational settings. For instance, employing human evaluators to judge
generated questions violates the most fundamental assumption of AQG, namely that
questions can be generated quickly and efficiently. Human evaluators need to go
through each individual question and assign a quality score using a rating scale.
Although questions are generated instantly, human evaluators slow down the
deployment process (Kurdi et al., 2020), and without knowing question quality, the
efficiency and swiftness of question generation are futile because the generated
questions cannot be used directly in educational environments.
Similar to the concerns about employing human evaluators, post-hoc methods can be
quite limited and resource-intensive for evaluating the quality of generated
questions. Post-hoc methods may thus undermine another fundamental assumption of
AQG, namely that a high volume of questions can be generated (e.g., Attali et al.,
2022; Panda et al., 2022). When post-hoc methods are employed, only a subset of
questions can be administered, yielding item-quality information for just a
fraction of the generated questions. The quality of the remaining questions
remains unknown, restricting their optimal use.
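As an illustration of what such a post-hoc, psychometric check involves, the sketch below computes classical item difficulty and item-rest point-biserial discrimination from a small, hypothetical matrix of pilot responses; real post-hoc analyses would rely on much larger samples and often on item response theory models.

```python
# Hypothetical post-hoc (psychometric) check: classical item difficulty
# (proportion correct) and item-rest point-biserial discrimination computed
# from pilot responses to a handful of generated items.
# The 0/1 response matrix is a placeholder (rows = examinees, cols = items).
import numpy as np
from scipy.stats import pointbiserialr

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

total_scores = responses.sum(axis=1)
for j in range(responses.shape[1]):
    difficulty = responses[:, j].mean()           # p-value of the item
    rest_score = total_scores - responses[:, j]   # rest score avoids self-correlation
    discrimination, _ = pointbiserialr(responses[:, j], rest_score)
    print(f"Item {j + 1}: difficulty={difficulty:.2f}, discrimination={discrimination:.2f}")
```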
Metric-based methods have emerged as a promising solution for evaluating all
generated items instantly and efficiently. This family of methods enables the
evaluation of every generated question with ease, yet most of these metrics
require reference questions or ground truth about item quality (e.g., Kurdi
et al., 2020), which is an unrealistic expectation for many question-generation
systems. For these reasons, AQG researchers should not only focus on enhancing the
performance of question-generation systems but also introduce novel evaluation
methods that assess the quality of all generated questions efficiently and
feasibly.
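For readers unfamiliar with metric-based evaluation, the following sketch shows the reference-based comparison these metrics rely on: each generated question is scored against a human-written reference, here with ROUGE via the rouge-score package. The question pairs are placeholders, and the choice of metric and library is an assumption made for illustration.

```python
# Minimal sketch of a reference-based, metric-style check: each generated
# question is compared with a human-written reference question using ROUGE.
# The question pairs below are placeholders.
from rouge_score import rouge_scorer

pairs = [
    ("What causes rain to form?", "What causes rain to fall from clouds?"),
    ("Who wrote the novel?",      "Who is the author of the novel?"),
]

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
for reference, generated in pairs:
    scores = scorer.score(reference, generated)
    print(generated)
    print(f"  ROUGE-1 F1: {scores['rouge1'].fmeasure:.2f}, "
          f"ROUGE-L F1: {scores['rougeL'].fmeasure:.2f}")
```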
3.1 Recommendations forfuture research
Beyond presenting an overview of current evaluation practices, we aim to support
AQG researchers by offering recommendations and suggestions that can strengthen
the question-generation pipeline.
3.2 Availability ofdatasets
There are many proposed AQG systems, yet very few have shared the generated
questions and the evaluation metrics publicly (e.g., Becker et al., 2012). This
encumbers the progress and comparison of AQG systems. Datasets containing
automatically generated questions together with their evaluation results are
needed to compare the feasibility, scalability, and overlap among evaluation
methods and to assess their coherence and consistency. In particular, questions
evaluated with multiple methods, such as human evaluators and metric-based
approaches, are essential for revealing the interrelationships between evaluation
methods and quality criteria.

Table 4  Examples of AQG systems evaluated using comparison-based methods

Authors | Item Type | Context | AQG Method | Evaluation Method
Huang & He, 2016 | Constructed response | Reading comprehension | Paraphrasing | Previous AQG system
Huang & Mostow, 2015 | Multiple-choice | Reading comprehension | N-grams | Previous AQG system
Liang et al., 2017 | Distractor | Biology, Math, Physics | Generative adversarial neural nets | Previous AQG system
Mostow et al., 2017 | Multiple-choice | Reading comprehension | Parse trees and n-grams | Previous AQG system
Wang et al., 2021 | Constructed response | Math | Pre-trained large language models | Ablation study

Such comparisons could offer new insights into current question evaluation
methods, advancing both traditional psychometric and computer science approaches
to question development. The availability of such datasets could also help develop
automatic evaluation methods that bridge the gap between question generation and
the deployment of generated questions in real assessment and learning settings,
and it would support developing and validating novel metric-based evaluations that
judge questions instantly and efficiently, rendering the implementation of
generated questions in educational settings possible. Furthermore, AQG researchers
may benefit from existing datasets to develop automated detectors for question
quality. Therefore, we recommend that AQG researchers generate questions
automatically, assess their quality using various quality criteria and evaluation
methods, and share the resulting datasets publicly.
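As a purely hypothetical illustration of what a shared record in such a dataset might contain, the snippet below pairs one generated question with the evaluation evidence discussed in this survey; the field names and values are examples, not a proposed standard.

```python
# Hypothetical record layout for a publicly shared AQG evaluation dataset,
# pairing each generated question with the evaluation evidence collected for it.
# Field names and values are illustrative only.
example_record = {
    "question_id": "q-000123",
    "source_passage_id": "passage-042",
    "question_text": "What causes rain to form?",
    "item_type": "multiple-choice",
    "generation_method": "fine-tuned transformer",   # placeholder label
    "human_ratings": {"fluency": 3, "relevance": 2, "answerability": 3},
    "metric_scores": {"bleu": 0.41, "rougeL_f1": 0.55},
    "psychometric_properties": {"difficulty": 0.62, "discrimination": 0.31},
    "deployed": True,
}
print(example_record["question_text"])
```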
3.3 Standardized quality criteria
We highlighted that quality criteria may be defined quite differently across
studies, creating challenges and limitations when it comes to comparing different
question-generation systems (e.g., Gierl et al., 2016; Heilman & Smith, 2010;
Rodriguez-Torrealba et al., 2022). For instance, researchers may use the same
criteria, such as item difficulty or fluency, yet the operationalization or the
tools used for evaluation could be quite different (e.g., Heilman, 2011; Mostow
et al., 2017). Thus, what seems comparable on the surface may in fact be
incommensurable, introducing challenges for comparing systems and enhancing AQG
methods. Therefore, we recommend that standardized quality criteria be established
to render studies more comparable and transferable. This could especially support
judgment-based approaches and human evaluators by establishing a standardized
evaluation process. Such standardized quality criteria may enhance the robustness,
systematicity, and interpretability of AQG systems.
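A machine-readable rubric is one possible form such standardized criteria could take. The sketch below is a hypothetical example: the criteria, scale, and descriptors are ours for illustration and are not drawn from any existing standard.

```python
# Hypothetical, machine-readable rubric for judgment-based evaluation.
# Criteria, scale, and descriptors are illustrative, not an established standard.
QUALITY_RUBRIC = {
    "scale": [1, 2, 3],  # 1 = unacceptable, 2 = needs revision, 3 = acceptable
    "criteria": {
        "fluency":       "The question is grammatical and reads naturally.",
        "relevance":     "The question is answerable from the source content.",
        "answerability": "The intended key is the single defensible answer.",
        "difficulty":    "The estimated difficulty matches the target level.",
    },
}

def is_complete(rating: dict) -> bool:
    """Check that a rater scored every criterion on the defined scale."""
    return all(rating.get(c) in QUALITY_RUBRIC["scale"]
               for c in QUALITY_RUBRIC["criteria"])

print(is_complete({"fluency": 3, "relevance": 2, "answerability": 3, "difficulty": 1}))
```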
3.4 Better reporting practices
Many AQG studies to date have failed to report crucial aspects of the
question-generation and evaluation processes, limiting the appraisal of
question-generation systems. Better reporting practices, including information on
the question-generation and evaluation pipeline, should be an integral part of
standardized reporting (Amidei et al., 2018; Kurdi et al., 2020). For instance,
the rating scales used by evaluators or the implementation details of post-hoc
evaluations may help AQG researchers design better evaluation processes for the
question-generation pipeline. We recommend that researchers follow detailed
reporting practices and include information about the purpose of question
generation, the question-generation process, evaluation practices (how questions
are evaluated and which criteria are used), reliability and validity indicators
for the evaluation process, and the limitations and challenges experienced during
question generation and evaluation. This should not only be an indispensable part
of best practice in AQG research but can also help future researchers build
reproducible and replicable AQG systems and evaluations. We further recommend that
researchers report this vital information in research repositories or as
supplemental materials if the publishing venue's format does not allow full
details of the generation process. In this way, more transparency and
interpretability across AQG systems could be achieved.
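One lightweight way to operationalize such reporting is a checklist that study metadata can be validated against before submission or archiving. The sketch below is a hypothetical example based on the elements listed above, not an established reporting standard.

```python
# Hypothetical reporting checklist echoing the elements recommended above;
# a study's metadata can be checked against it before submission or archiving.
REQUIRED_REPORT_FIELDS = [
    "purpose_of_generation",
    "generation_process",
    "evaluation_method",
    "quality_criteria",
    "reliability_evidence",
    "validity_evidence",
    "limitations",
]

def missing_fields(report: dict) -> list:
    """Return the recommended fields that the report leaves unaddressed."""
    return [f for f in REQUIRED_REPORT_FIELDS if not report.get(f)]

draft_report = {"purpose_of_generation": "formative practice items",
                "generation_process": "template-based generation",
                "evaluation_method": "human review with a 3-point rubric"}
print(missing_fields(draft_report))
```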
3.5 Automated evaluation metrics
While numerous studies in AQG underscore the role of AQG systems in enhancing the
efficacy of educational assessments (Kurdi et al., 2020), the evaluation of
automatically generated questions remains an overlooked yet crucial aspect of
question generation. The inherent promise of any AQG system lies in its ability to
generate a large number of questions swiftly, efficiently, and cost-effectively.
Optimizing the evaluation of such a multitude of questions, however, requires
automated methods. Automated approaches not only enhance scalability but also
contribute to efficiency and cost-effectiveness in the evaluation of generated
questions. Consequently, we advocate that AQG researchers establish automated
evaluation methods that rely on minimal resources, such as reference questions for
metric-based methods or human judgments for training classifiers for question
evaluation. This aligns with the overarching goal of streamlining and
strengthening the evaluation processes that are integral to the effectiveness of
AQG systems in educational settings.
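As a minimal sketch of the kind of automated detector envisioned here, the example below trains a simple text classifier on hypothetical human acceptability judgments and uses it to score a new question. The tiny labeled set, the TF-IDF features, and the logistic regression model are illustrative assumptions; a usable detector would require substantially more labeled data and richer features.

```python
# Minimal sketch of an automated quality detector trained on human judgments:
# a TF-IDF + logistic regression classifier labels generated questions as
# acceptable (1) or not (0). The labeled examples are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

questions = [
    "What causes rain to form in clouds?",
    "Which process converts sunlight into chemical energy?",
    "What the rain cloud it?",
    "Sunlight energy which is?",
]
human_labels = [1, 1, 0, 0]  # 1 = judged acceptable by human raters

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(questions, human_labels)

new_question = "Which gas do plants absorb during photosynthesis?"
print(detector.predict([new_question]))        # predicted label
print(detector.predict_proba([new_question]))  # confidence estimate
```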
4 Conclusion
This study offers a comprehensive survey of the diverse evaluation methods and
quality criteria employed by researchers in AQG. To the best of our knowledge,
this is the first attempt to categorize the evaluation methods employed by AQG
researchers and to point out the strengths and limitations of each method.
Specifically, by introducing a novel taxonomy, we categorize the evaluation
methods based on key aspects of AQG systems, including input, resources,
benchmark, and quality criteria. Through the lens of this taxonomy, we examine the
strengths, limitations, and challenges inherent in each evaluation method. We
expect this taxonomy to serve as a valuable tool for AQG researchers, aiding them
in identifying optimal and efficient evaluation methods, along with quality
criteria suitable for assessing both system performance and the quality of the
generated questions.
Author contributions GG: Conceptualization, methodology, formal analysis, writing—original draft
preparation. OB: Conceptualization, supervision, writing—review and editing.
Funding This research did not receive any specific grant from funding agencies in the public, commer-
cial, or not-for-profit sectors.
Data availability The manuscript has no associated data.
Declarations
Consent for publication All authors read and approved the final manuscript.
Competing interests The authors have no conflicts of interest to declare that are relevant to the content
of this article.
References
Adegoke, B. A. (2013). Comparison of item statistics of physics achievement test using classical test and
item response theory frameworks. Journal of Education and Practice, 4(22), 87–96.
American Educational Research Association, American Psychological Association, National Council on
Measurement in Education. (2014). Standards for educational and psychological testing. Ameri-
can Educational Research Association.
Amidei, J., Piwek, P., & Willis, A. (2018). Evaluation methodologies in automatic question generation
2013-2018. Proceedings of The 11th International Natural Language Generation Conference (pp.
307–317). https:// doi. org/ 10. 18653/ v1/ W18- 6537
Anastasi, A., & Urbina, S. (2004). Psychological testing (7th ed.). Pearson.
Ashraf, Z. A. (2020). Classical and modern methods in item analysis of test tools. International Journal
of Research and Review, 7(5), 397–403.
Attali, Y., Runge, A., LaFlair, G. T., Yancey, K., Goodwin, S., Park, Y., & von Davier, A. A. (2022). The
interactive reading task: Transformer-based automatic item generation. Frontiers in Artificial Intel-
ligence, 5, 903077. https:// doi. org/ 10. 3389/ frai. 2022. 903077.
Baker, F. B. (2001). The basics of item response theory (2nd ed.). ERIC Clearinghouse on Assessment
and Evaluation.
Bandalos, D. L. (2018). Measurement theory and applications for the social sciences. Guilford
Publications.
Banerjee, S., & Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments. Proceedings of the ACL Workshop onIntrinsic and Extrinsic
Evaluation Measures for Machine Translation and/or Summarization, pp 65–72.
Becker, L., Basu, S., & Vanderwende, L. (2012). Mind the gap: Learning to choose gaps for question gen-
eration. Proceedings of the 2012 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, 742–751.
Bichi, A. A. (2016). Classical Test Theory: An introduction to linear modeling approach to test and item
analysis. International Journal for Social Studies, 2(9), 27–33.
Bulut, O., & Suh, Y. (2017). Detecting DIF in multidimensional assessments with the MIMIC model, the
IRT likelihood ratio test, and logistic regression. Frontiers in Education, 2(51), 1–14. https:// doi.
org/ 10. 3389/ feduc. 2017. 00051.
Chalifour, C. L., & Powers, D. E. (1989). The relationship of content characteristics of GRE analyti-
cal reasoning items to their difficulties and discriminations. Journal of Educational Measurement,
26(2), 120–132. https:// doi. org/ 10. 1111/j. 1745- 3984. 1989. tb003 23.x.
Chughtai, R., Azam, F., Anwar, M. W., Haider But, W., & Farooq, M. U. (2022). A lecture-centric auto-
mated distractor generation for post-graduate software engineering courses. International Confer-
ence on Frontiers of Information Technology (FIT), 2022, 100–105. https:// doi. org/ 10. 1109/ FIT57
066. 2022. 00028.
Chung, C.-Y., & Hsiao, I.-H. (2022). Programming Question Generation by a Semantic Network: A Pre-
liminary User Study with Experienced Instructors. In M. M. Rodrigo, N. Matsuda, A. I. Cristea,
& V. Dimitrova (Eds.), Artificial Intelligence in Education. Posters and Late Breaking Results,
Workshops and Tutorials, Industry and Innovation Tracks, Practitioners’ and Doctoral Consortium
(Vol. 13356, pp. 463–466). Springer International Publishing. https:// doi. org/ 10. 1007/ 978-3- 031-
11647-6_ 93.
Clauser, J. C., & Hambleton, R. K. (2011). Item analysis procedures for classroom assessments in higher
education. In C. Secolsky & D. B. Denison (Eds.), Handbook on Measurement, Assessment, and
Evaluation in Higher Education (pp. 296–309). Routledge.
Cohen, R. J., Swerdlik, M. E., & Phillips, S. M. (1996). Psychological testing and assessment: An intro-
duction to tests and measurement (3rd ed.). Mayfield Publishing Co.
Darling-Hammond, L., Herman, J., Pellegrino, J., Abedi, J., Aber, J. L., Baker, E., … & Steele, C. M.
(2013). Criteria for high-quality assessment. Stanford Center for Opportunity Policy in Education,
2, 171–192.
DeMars, C. (2010). Item response theory. Oxford University Press. https:// doi. org/ 10. 1093/ acprof: oso/
97801 95377 033. 001. 0001.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805. https:// doi. org/ 10.
48550/ arXiv. 1810. 04805
Dugan, L., Miltsakaki, E., Upadhyay, S., Ginsberg, E., Gonzalez, H., Choi, D., Yuan, C., & Callison-
Burch, C. (2022). A feasibility study of answer-agnostic question generation for education. Find-
ings of the Association for Computational Linguistics: ACL, 2022, 1919–1926.
Ebel, R. L., & Frisbie, D. A. (1986). Using test and item analysis to evaluate and improve test quality.
Essentials of educational measurement (Vol. 4, pp. 223–242). Prentice-Hall.
Engelhard, G., Jr., Davis, M., & Hansche, L. (1999). Evaluating the accuracy of judgments obtained from
item review committees. Applied Measurement in Education, 12(2), 199–210. https:// doi. org/ 10.
1207/ s1532 4818a me1202_6.
Ewell, P. T. (2008). Assessment and accountability in America today: Background and context. New
Directions for Institutional Research, 2008(S1), 7–17. https:// doi. org/ 10. 1002/ ir. 258.
French, C. L. (2001). A review of classical methods of item analysis [Paper presentation]. Annual meet-
ing of the Southwest Educational Research Association, New Orleans, LA, USA.
Fu, Y., Choe, E. M., Lim, H., & Choi, J. (2022). An Evaluation of Automatic Item Generation: A Case
Study of Weak Theory Approach. Educational Measurement: Issues and Practice, 41(4), 10–22.
https:// doi. org/ 10. 1111/ emip. 12529.
Gao, Y., Bing, L., Chen, W., Lyu, M. R., & King, I. (2019). Difficulty controllable generation of reading
comprehension questions. arXiv. http:// arxiv. org/ abs/ 1807. 03586. Accessed04/04/2023.
Gatt, A., & Krahmer, E. (2018). Survey of the state of the art in natural language generation: Core tasks,
applications and evaluation. Journal of Artificial Intelligence Research, 61, 65–170. https:// doi. org/
10. 1613/ jair. 5477.
Gierl, M. J., & Lai, H. (2012). The role of item models in automatic item generation. International Jour-
nal of Testing, 12(3), 273–298. https:// doi. org/ 10. 1080/ 15305 058. 2011. 635830
Gierl, M. J., Lai, H., & Tanygin, V. (2021). Methods for validating generated items: A focus on model-
level outcomes. In Advanced Methods in Automatic Item Generation (1st ed., pp. 120–143). Rout-
ledge. https:// doi. org/ 10. 4324/ 97810 03025 634.
Gierl, M. J., Lai, H., Pugh, D., Touchie, C., Boulais, A.-P., & De Champlain, A. (2016). Evaluating the
psychometric characteristics of generated multiple-choice test items. Applied Measurement in Edu-
cation, 29(3), 196–210. https:// doi. org/ 10. 1080/ 08957 347. 2016. 11717 68.
Gierl, M. J., Swygert, K., Matovinovic, D., Kulesher, A., & Lai, H. (2022). Three sources of validation
evidence are needed to evaluate the quality of generated test items for medical licensure. Teaching
and Learning in Medicine, 1–11. https:// doi. org/ 10. 1080/ 10401 334. 2022. 21195 69.
Gorgun, G., & Bulut, O. (2021). A polytomous scoring approach to handle not-reached items in low-
stakes assessments. Educational and Psychological Measurement, 81(5), 847–871. https:// doi. org/
10. 1177/ 00131 64421 991211.
Gorgun, G., & Bulut, O. (2022). Considering disengaged responses in Bayesian and deep knowledge trac-
ing. In M. M. Rodrigo, N. Matsuda, A. I. Cristea, & V. Dimitrova (Eds.), Artificial intelligence
in education. Posters and late-breaking results, workshops and tutorials, industry and innovation
Tracks, practitioners’ and doctoral consortium (pp. 591–594). Lecture Notes in Computer Science,
vol 13356. Springer. https:// doi. org/ 10. 1007/ 978-3- 031- 11647-6_ 122.
Gorgun, G., & Bulut, O. (2023). Incorporating test-taking engagement into the item selection algorithm
in low-stakes computerized adaptive tests. Large-Scale Assessments in Education, 11(1), 27.
https:// doi. org/ 10. 1186/ s40536- 023- 00177-5
Ha, L. A., & Yaneva, V. (2018). Automatic distractor suggestion for multiple-choice tests using con-
cept embeddings and information retrieval. Proceedings of the Thirteenth Workshop on Innova-
tive Use of NLP for Building Educational Applications, pp 389–398. https:// doi. org/ 10. 18653/ v1/
W18- 0548.
Haladyna, T. M., & Rodriguez, M. C. (2021). Using full-information item analysis to improve item qual-
ity. Educational Assessment, 26(3), 198–211. https:// doi. org/ 10. 1080/ 10627 197. 2021. 19463 90.
Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item writing
guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309–333. https://
doi. org/ 10. 1207/ S1532 4818A ME1503_5.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory.
Sage.
Heilman, M. (2011). Automatic factual question generation from text [Ph. D.]. Carnegie Mellon
University.
Heilman, M., & Smith, N. A. (2010). Good question! Statistical ranking for question generation. Human
Language Technologies: The 2010 Annual Conference of the North American Chapter of the Asso-
ciation for Computational Linguistics, pp 609–617.
Henning, G. (1987). A guide to language testing: Development, evaluation, research. Newberry House
Publishers.
Heubert, J. P., & Hauser, R. M. (Eds.). (1999). High stakes: Testing for tracking, promotion, and gradua-
tion. National Academy Press.
Hommel, B. E., Wollang, F.-J.M., Kotova, V., Zacher, H., & Schmukle, S. C. (2022). Transformer-based
deep neural language modeling for construct-specific automatic item generation. Psychometrika,
87(2), 749–772. https:// doi. org/ 10. 1007/ s11336- 021- 09823-9.
Hovy, E. (1999). Toward finely differentiated evaluation metrics for machine translation. Proceedings
of the EAGLES Workshop on Standards and Evaluation, Pisa, Italy. https://cir.nii.ac.jp/crid/1571417125255458048.
Huang, Y., & He, L. (2016). Automatic generation of short answer questions for reading comprehension
assessment. Natural Language Engineering, 22(3), 457–489. https:// doi. org/ 10. 1017/ S1351 32491
50004 55.
Huang, Y. T., & Mostow, J. (2015). Evaluating human and automated generation of distractors for diag-
nostic multiple-choice cloze questions to assess children’s reading comprehension. In C. Conati,
N. Heffernan, A. Mitrovic, & M. Verdejo (Eds.), Artificial Intelligence in Education. AIED 2015.
Lecture Notes in Computer Science. (Vol. 9112). Cham: Springer. https:// doi. org/ 10. 1007/ 978-3-
319- 19773-9_ 16
Impara, J. C., & Plake, B. S. (1998). Teachers’ ability to estimate item difficulty: A test of the assump-
tions in the Angoff standard setting method. Journal of Educational Measurement, 35(1), 69–81.
https:// doi. org/ 10. 1111/j. 1745- 3984. 1998. tb005 28.x.
Jenkins, H. M., & Michael, M. M. (1986). Using and interpreting item analysis data. Nurse Educator,
11(1), 10.
Jouault, C., Seta, K., & Hayashi, Y. (2016). Content-dependent question generation using LOD for his-
tory learning in open learning space. New Generation Computing, 34(4), 367–394. https:// doi. org/
10. 1007/ s00354- 016- 0404-x.
Kehoe, J. (1995). Basic item analysis for multiple-choice tests. Practical Assessment, Research, and Eval-
uation, 4(10), 1–3. https:// doi. org/ 10. 7275/ 07zg- h235.
Kim, S.-H., Cohen, A. S., & Eom, H. J. (2021). A note on the three methods of item analysis. Behavior-
metrika, 48(2), 345–367. https:// doi. org/ 10. 1007/ s41237- 021- 00131-1.
Kim, S., & Feldt, L. S. (2010). The estimation of the IRT reliability coefficient and its lower and upper
bounds, with comparisons to CTT reliability statistics. Asia Pacific Education Review, 11, 179–
188. https:// doi. org/ 10. 1007/ s12564- 009- 9062-8.
Kumar, V., Boorla, K., Meena, Y., Ramakrishnan, G., & Li, Y.-F. (2018). Automating reading compre-
hension by generating question and answer pairs (arXiv: 1803. 03664). arXiv. http:// arxiv. org/ abs/
1803. 03664.
Kurdi, G., Leo, J., Parsia, B., Sattler, U., & Al-Emari, S. (2020). A systematic review of automatic ques-
tion generation for educational purposes. International Journal of Artificial Intelligence in Educa-
tion, 30(1), 121–204. https:// doi. org/ 10. 1007/ s40593- 019- 00186-y.
Lane, S., Raymond, M. R., & Haladyna, T. M. (Eds.). (2016). Handbook of test development (2nd ed.).
Routledge.
Liang, C., Yang, X., Dave, N., Wham, D., Pursel, B., & Giles, C. L. (2018). Distractor generation for
multiple choice questions using learning to rank. Proceedings of the thirteenth workshop on inno-
vative use of NLP for building educational applications, pp. 284–290.
Liang, C., Yang, X., Wham, D., Pursel, B., Passonneaur, R., & Giles, C. L. (2017). Distractor generation
with generative adversarial nets for automatically creating fill-in-the-blank questions. Proceedings
of the Knowledge Capture Conference, 1–4. https:// doi. org/ 10. 1145/ 31480 11. 315446.
Lin, C., Liu, D., Pang, W., & Apeh, E. (2015). Automatically predicting quiz difficulty level using simi-
larity measures. Proceedings of the 8th International Conference on Knowledge Capture, 1–8.
https:// doi. org/ 10. 1145/ 28158 33. 28158 42.
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text Summarization
Branches Out, 74–81.
Linn, R. L. (2003). Accountability: Responsibility and reasonable expectations. Educational Researcher,
32(7), 3–13. https:// doi. org/ 10. 3102/ 00131 89X03 20070 03.
Liu, M., Rus, V., & Liu, L. (2017). Automatic Chinese factual question generation. IEEE Transactions on
Learning Technologies, 10(2), 194–204.
Livingston, S. A. (2013). Item analysis. Routledge. https:// doi. org/ 10. 4324/ 97802 03874 776. ch19.
Marrese-Taylor, E., Nakajima, A., Matsuo, Y., & Yuichi, O. (2018). Learning to automatically generate
fill-in-the-blank quizzes. arXiv. http:// arxiv. org/ abs/ 1806. 04524.
Maurya, K. K., & Desarkar, M. S. (2020). Learning to distract: A hierarchical multi-decoder network for
automated generation of long distractors for multiple-choice questions for reading comprehension.
Proceedings of the 29th ACM International Conference on Information & Knowledge Manage-
ment, 1115–1124. https:// doi. org/ 10. 1145/ 33405 31. 34119 97.
McCarthy, A. D., Yancey, K. P., LaFlair, G. T., Egbert, J., Liao, M., & Settles, B. (2021). Jump-starting
item parameters for adaptive language tests. Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing, 883–899. https:// doi. org/ 10. 18653/ v1/ 2021. emnlp-
main. 67.
Merriam-Webster. (2023). Metric. In Merriam-Webster.com dictionary. Retrieved November 3, 2023,
from https:// www. merri am- webst er. com/ dicti onary/ metric. Accessed 18 Sept 2023.
Mostow, J., Huang, Y.-T., Jang, H., Weinstein, A., Valeri, J., & Gates, D. (2017). Developing, evaluating,
and refining an automatic generator of diagnostic multiple-choice cloze questions to assess chil-
dren’s comprehension while reading. Natural Language Engineering, 23(2), 245–294. https:// doi.
org/ 10. 1017/ S1351 32491 60000 24.
Mulla, N., & Gharpure, P. (2023). Automatic question generation: A review of methodologies, datasets,
evaluation metrics, and applications. Progress in Artificial Intelligence, 12(1), 1–32. https:// doi.
org/ 10. 1007/ s13748- 023- 00295-9.
Nagy, P. (2000). The three roles of assessment: Gatekeeping, accountability, and instructional diagnosis.
Canadian Journal of Education / Revue Canadienne De L’éducation, 25(4), 262–279. https:// doi.
org/ 10. 2307/ 15858 50.
Nelson, D. (2004). The penguin dictionary of statistics. Penguin Books.
Newton, P. E. (2007). Clarifying the purposes of educational assessment. Assessment in Education: Prin-
ciples, Policy & Practice, 14(2), 149–170. https:// doi. org/ 10. 1080/ 09695 94070 14783 21.
Niraula, N. B., & Rus, V. (2015). Judging the quality of automatically generated gap-fill questions using
active learning. Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educa-
tional Applications, 196–206. https:// doi. org/ 10. 3115/ v1/ W15- 0623.
OECD. (2020). PISA 2022 technical standards. OECD Publishing.
Olney, A. M. (2021). Sentence selection for cloze item creation: A standardized task and preliminary
results. Joint Proceedings of the Workshops at the 14th International Conference on Educational
Data Mining, pp 1–5.
Osterlind, S. J. (1989). Judging the quality of test items: Item analysis. In S. J. Osterlind (Ed.), Construct-
ing Test Items (pp. 259–310). Springer. https:// doi. org/ 10. 1007/ 978- 94- 009- 1071-3_7.
Osterlind, S. J., & Everson, H. T. (2009). Differential item functioning. Sage Publications.
Osterlind, S. J., & Wang, Z. (2017). Item response theory in measurement, assessment, and evaluation for
higher education. In C. Secolsky & D. B. Denison (Eds.), Handbook on measurement, assessment,
and evaluation in higher education (pp. 191–200). Routledge.
Panda, S., Palma Gomez, F., Flor, M., & Rozovskaya, A. (2022). Automatic generation of distractors
for fill-in-the-blank exercises with round-trip neural machine translation. Proceedings of the 60th
Annual Meeting of the Association for Computational Linguistics: Student Research Workshop,
391–401.
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: A Method for Automatic Evaluation of
Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computa-
tional Linguistics, 311–318. https:// doi. org/ 10. 3115/ 10730 83. 10731 35.
Pennington, J., Socher, R., & Manning, D. (2014, October). Glove: Global vectors for word representa-
tion. Proceedings of the 2014 conference on empirical methods in natural language processing
(EMNLP), pp 1532–1543.
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine com-
prehension of text. arXiv preprint arXiv:1606.05250. https:// doi. org/ 10. 48550/ arXiv. 1606. 05250
Rezigalla, A. A. (2022). Item analysis: Concept and application. In M. S. Firstenberg & S. P. Stawicki
(Eds.), Medical education for the 21st century. IntechOpen. https:// doi. org/ 10. 5772/ intec hopen.
100138.
Rodriguez-Torrealba, R., Garcia-Lopez, E., & Garcia-Cabot, A. (2022). End-to-end generation of multi-
ple-choice questions using text-to-text transfer transformer models. Expert Systems with Applica-
tions, 208, 118258. https:// doi. org/ 10. 1016/j. eswa. 2022. 118258.
Settles, B., LaFlair, T. G., & Hagiwara, M. (2020). Machine learning–driven language assessment. Trans-
actions of the Association for Computational Linguistics, 8, 247–263. https:// doi. org/ 10. 1162/
tacl_a_ 00310.
Seyler, D., Yahya, M., & Berberich, K. (2017). Knowledge questions from knowledge graphs. Proceed-
ings of the ACM SIGIR International Conference on Theory of Information Retrieval, 11–18.
https:// doi. org/ 10. 1145/ 31210 50. 31210 73.
Song, L., & Zhao, L. (2017). Question generation from a knowledge base with web exploration. arXiv.
http:// arxiv. org/ abs/ 1610. 03807.
Suen, H. K. (2012). Principles of test theories. Routledge.
Tamura, Y., Takase, Y., Hayashi, Y., & Nakano, Y. I. (2015). Generating quizzes for history learning
based on Wikipedia articles. In P. Zaphiris & A. Ioannou (Eds.), Learning and Collaboration
Technologies (pp. 337–346). Springer International Publishing. https:// doi. org/ 10. 1007/ 978-3- 319-
20609-7_ 32.
Tarrant, M., Knierim, A., Hayes, S. K., & Ware, J. (2006). The frequency of item writing flaws in mul-
tiple-choice questions used in high-stakes nursing assessments. Nurse Education Today, 26(8),
662–671.
Towns, M. H. (2014). Guide to developing high-quality, reliable, and valid multiple-choice assessments.
Journal of Chemical Education, 91(9), 1426–1431. https:// doi. org/ 10. 1021/ ed500 076x.
Van Campenhout, R., Hubertz, M., & Johnson, B. G. (2022). Evaluating AI-generated questions: A
mixed-methods analysis using question data and student perceptions. In M. M. Rodrigo, N. Mat-
suda, A. I. Cristea, & V. Dimitrova (Eds.), Artificial Intelligence in Education (Vol. 13355, pp.
344–353). Springer International Publishing. https://doi.org/10.1007/978-3-031-11644-5_28.
Venktesh, V., Akhtar, Md. S., Mohania, M., & Goyal, V. (2022). Auxiliary task guided interactive atten-
tion model for question difficulty prediction. In M. M. Rodrigo, N. Matsuda, A. I. Cristea, & V.
Dimitrova (Eds.), Artificial Intelligence in Education (Vol. 13355, pp. 477–489). Springer Interna-
tional Publishing. https:// doi. org/ 10. 1007/ 978-3- 031- 11644-5_ 39.
Vie, J. J., Popineau, F., Bruillard, É., Bourda, Y. (2017). A review of recent advances in adap-
tive assessment. In: Peña-Ayala, A. (Ed.), Learning analytics: Fundaments, applications, and
trends. Studies in systems, decision, and control (113–142). Springer. https:// doi. org/ 10. 1007/
978-3- 319- 52977-6_4.
von Davier, M. (2018). Automated item generation with recurrent neural networks. Psychometrika, 83(4),
847–857. https:// doi. org/ 10. 1007/ s11336- 018- 9608-y.
Wang, Z., Lan, A. S., & Baraniuk, R. G. (2021). Math word problem generation with mathemati-
cal consistency and problem context constraints. arXiv. http:// arxiv. org/ abs/ 2109. 04546.
Accessed04/04/2023.
Wang, Z., Lan, A. S., Nie, W., Waters, A. E., Grimaldi, P. J., & Baraniuk, R. G. (2018). QG-net: A data-
driven question generation model for educational content. Proceedings of the Fifth Annual ACM
Conference on Learning at Scale, 1–10. https:// doi. org/ 10. 1145/ 32316 44. 32316 54.
Wang, Z., Valdez, J., Basu Mallick, D., & Baraniuk, R. G. (2022). Towards human-Like educational
question generation with large language models. In M. M. Rodrigo, N. Matsuda, A. I. Cristea, & V.
Dimitrova (Eds.), Artificial Intelligence in Education (Vol. 13355, pp. 153–166). Springer Interna-
tional Publishing. https:// doi. org/ 10. 1007/ 978-3- 031- 11644-5_ 13.
Wauters, K., Desmet, P., & Van Den Noortgate, W. (2012). Item difficulty estimation: An auspicious col-
laboration between data and judgment. Computers & Education, 58(4), 1183–1193.
Wind, S. A., Alemdar, M., Lingle, J. A., Moore, R., & Asilkalkan, A. (2019). Exploring student under-
standing of the engineering design process using distractor analysis. International Journal of
STEM Education, 6(1), 1–18. https:// doi. org/ 10. 1186/ s40594- 018- 0156-x.
Yang, A. C. M., Chen, I. Y. L., Flanagan, B., & Ogata, H. (2021). Automatic generation of cloze items
for repeated testing to improve reading comprehension. Educational Technology & Society, 24(3),
147–158.
Zhang, L., & VanLehn, K. (2016). How do machine-generated questions compare to human-generated
questions? Research and Practice in Technology Enhanced Learning, 11(1), 7. https:// doi. org/ 10.
1186/ s41039- 016- 0031-7.
Zilberberg, A., Anderson, R. D., Finney, S. J., & Marsh, K. R. (2013). American college students’ atti-
tudes toward institutional accountability testing: Developing measures. Educational Assessment,
18(3), 208–234. https:// doi. org/ 10. 1080/ 10627 197. 2013. 817153.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and
applicable law.