Education and Information Technologies
https://doi.org/10.1007/s10639-024-12771-3
Exploring quality criteria and evaluation methods in automated question generation: A comprehensive survey
GuherGorgun1 · OkanBulut2
Received: 14 December 2023 / Accepted: 7 May 2024
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature
2024
Abstract
In light of the widespread adoption of technology-enhanced learning and assess-
ment platforms, there is a growing demand for innovative, high-quality, and diverse
assessment questions. Automatic Question Generation (AQG) has emerged as a valu-
able solution, enabling educators and assessment developers to efficiently produce
a large volume of test items, questions, or assessments within a short timeframe.
AQG leverages computer algorithms to automatically generate questions, streamlin-
ing the question-generation process. Despite the efficiency gains, significant gaps in
the question-generation pipeline hinder the seamless integration of AQG systems into
the assessment process. Notably, the absence of a standardized evaluation framework
poses a substantial challenge in assessing the quality and usability of automatically
generated questions. This study addresses this gap by conducting a comprehensive
survey of existing question evaluation methods, a crucial step in refining the ques-
tion generation pipeline. Subsequently, we present a taxonomy for these evaluation
methods, shedding light on their respective advantages and limitations within the
AQG context. The study concludes by offering recommendations for future research
to enhance the effectiveness of AQG systems in educational assessments.
Keywords: Automatic question generation · Human evaluators · Question quality · Post-hoc evaluations · Metric-based evaluations
* Guher Gorgun
gorgun@ualberta.ca
Okan Bulut
bulut@ualberta.ca
1 Measurement, Evaluation, and Data Science, Faculty of Education, University of Alberta, 6-110 Education Centre North, 11210 87 Ave NW, Edmonton, AB T6G 2G5, Canada
2 Centre for Research in Applied Measurement and Evaluation, Faculty of Education, University of Alberta, 6-110 Education Centre North, 11210 87 Ave NW, Edmonton, AB T6G 2G5, Canada
Assessment is a cornerstone of education that allows researchers, educators, and policymakers to gauge learners’ knowledge and skills while providing evidence about the effectiveness of educational practices (Ewell, 2008; Heubert et al., 1999;
Linn, 2003; Nagy, 2000; Newton, 2007; Zilberberg et al., 2013). Creating high-quality assessments is a complex process because assessment developers must ensure both that the building blocks of the assessment (i.e., questions) are of high quality and that, holistically, the assessment measures what it intends to measure with high consistency and accuracy (Darling-Hammond et al., 2013). Developing high-quality questions
has been a major challenge for educators because it requires content and assessment
expertise, time, and resources (e.g., Tarrant et al., 2006). Various shortcomings in the question development process may slow down the transition to personalized and adaptive teaching and learning, which itself necessitates a large question bank (Vie et al., 2017).
Automatic question generation (AQG) has emerged as an efficient and practical
solution to streamline the question generation process, allowing the rapid generation
of a large number of questions through computer algorithms. Nonetheless, the fast
and cost-effective nature of this process does not ensure the suitability of the auto-
matically generated questions for operational use in educational settings. Each ques-
tion generated through AQG systems must still go through a comprehensive evalu-
ation process to discern the quality, relevance, and effectiveness of these questions
within the context of operational educational settings. Therefore, a robust evaluation
process becomes the linchpin in sifting through the generated question pool, identi-
fying questions that align with the intended educational outcomes.
Although evaluating question quality is imperative for understanding the utility of an
AQG system and the usability of questions generated, evaluation methods and quality
criteria used in AQG are often neglected. This aspect is perhaps the most fundamental
reason why question generation systems have not been fully integrated into educational
or assessment settings. Typically, AQG researchers introduce novel question generation systems for educational use by employing state-of-the-art natural language processing and machine learning methods, yet these systems often lack an essential component of the question generation pipeline, namely question evaluation, which prevents them from being fully integrated into real-life practice.
This paper, to the best of our knowledge, is the first study attempting to summarize and
categorize evaluation methods used by traditional item developers and computer scien-
tists relying on computer algorithms to generate questions. Through a comprehensive
survey of evaluation methods and quality criteria used in AQG, we aim to: 1) provide
an exhaustive list of evaluation methods and quality criteria used by the AQG systems;
2) identify the strengths, limitations, and gaps in each evaluation method; 3) highlight
the essential role of evaluation methods in AQG research; 4) bridge the theoretical and
practical gap between traditional psychometric and computer science methods in ques-
tion evaluation methods and quality criteria; and finally 5) create a taxonomy for cat-
egorizing existing evaluation methods used by AQG research to inform future studies
on selecting the best evaluation method given the resources and design of the study.
In the subsequent sections of this study, our focus shifts toward a comprehensive
exploration of the quality criteria employed by both assessment developers and AQG
systems. Our objective is to elucidate the shared aspects and distinctions that character-
ize these criteria. This comparative analysis serves as a crucial step in understanding
the convergence and divergence in the quality standards applied by human assessment
developers and AQG systems. Following the examination of quality criteria, we
introduce a taxonomy rooted in the evaluation methods employed by AQG systems.
Through this taxonomy, we categorize and classify the diverse approaches to evalua-
tion, shedding light on the strengths, limitations, and existing gaps within each method.
This systematic breakdown aims to offer a comprehensive understanding of the intrica-
cies involved in assessing the quality of automatically generated questions. The study
concludes with a discussion of recommendations and future directions for enhancing
the efficiency and scalability of AQG. By identifying areas for improvement and pro-
posing actionable suggestions, we aim to contribute to the ongoing evolution of AQG
systems, ensuring their alignment with the evolving needs of educational assessments.
1 Quality criteria for evaluating questions
1.1 Quality criteria used in traditional test development
Test developers and psychometricians typically refer to questions, exercises, prompts,
or statements in an assessment as items (American Educational Research Associa-
tion etal., 2014; Nelson, 2004). Thus, the process during which the properties of
items (e.g., structural characteristics and quality) are investigated is called item anal-
ysis (e.g., Bandalos, 2018; Lane etal., 2016; Osterlind, 1989). Item analysis is an
umbrella term that encompasses statistical approaches (e.g., Ashraf, 2020; Clauser
& Hambleton, 2011; French, 2001; Rezigalla, 2022) and judgment-based approaches
(e.g., Gierl et al., 2021, 2022; Osterlind, 1989) used for evaluating the quality of
questions created. Below, we dissect each approach to item analysis to explain the
processes and tools used for understanding the quality of questions created.
1.2 Statistical approaches for item analysis
Statistical approaches for item analysis have been considered a cornerstone for inves-
tigating question quality because empirical data about learners are collected to ana-
lyze item properties. The most frequently analyzed item properties include difficulty,
discrimination, distractors, and differential item functioning. Depending on the ques-
tion format (e.g., multiple-choice, cloze, essay), some of these statistical approaches
could be redundant (e.g., distractor analysis can only be used when multiple response
options are present). Typically, the empirical data collected during a test develop-
ment stage for statistical analysis are referred to as field testing. During field testing,
efforts have been made to ensure that the sample is representative of the target popu-
lation because the quality indices can typically only be generalized for the popula-
tion of interest. For instance, in large-scale international assessments, such as the Pro-
gramme for International Student Assessment (PISA), a sampling strategy has been
employed to adequately represent the schools and the students (OECD, 2020).
The first index, item difficulty, aids test developers in quantifying the level of dif-
ficulty of a question for a given sample. Based on the test theory (e.g., Classical Test
Theory; CTT or Item Response Theory; IRT) that assessment experts adopt, item
difficulty is operationalized slightly differently (for a more comprehensive discussion of test theories, please refer to Suen (2012)). CTT asserts that item difficulty, denoted as p, is the proportion of examinees answering the question correctly (Anastasi & Urbina, 2004; Clauser & Hambleton, 2011; Haladyna & Rodriguez, 2021), which can be expressed as:

$$p = \frac{\#\ \text{of examinees who answered the question correctly}}{\#\ \text{of examinees who attempted the question}} \tag{1}$$

Equation 1 indicates that as the number of students answering a question correctly increases, p also increases, suggesting that the question is easier.
Unlike CTT, which derives the difficulty of each question directly from raw responses, IRT postulates a probability function that places examinees’ ability levels and item difficulty on the same scale. This function expresses item difficulty through the probability of correctly answering a question given the ability level of the examinee. Formally, the simplest IRT model (i.e., the Rasch model) can be expressed as:

$$P(X = 1 \mid \theta, b) = \frac{e^{(\theta - b)}}{1 + e^{(\theta - b)}}, \tag{2}$$
where θ is the ability level of an examinee, and b is the difficulty of the question.
Since item difficulty and ability are positioned on the same continuum, when dif-
ficulty and ability overlap, there is a 50% probability for the examinee to cor-
rectly answer the question (Osterlind & Wang, 2017). If the examinee’s ability is
greater than a given item’s difficulty, the probability of answering an item correctly
increases (> 50%).
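To make these operationalizations concrete, the following Python sketch computes a CTT p-value from a vector of scored (0/1) responses and evaluates the Rasch probability in Equation 2; the response vector and the parameter values are illustrative rather than drawn from any study.

import numpy as np

# Scored responses to one item (1 = correct, 0 = incorrect); illustrative data.
responses = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# CTT item difficulty (Equation 1): proportion of examinees answering correctly.
p_value = responses.mean()

def rasch_probability(theta, b):
    """Rasch model (Equation 2): probability of a correct response."""
    return np.exp(theta - b) / (1 + np.exp(theta - b))

print(f"CTT p-value: {p_value:.2f}")
print(f"P(correct | theta=0.1, b=0.1): {rasch_probability(0.1, 0.1):.2f}")  # 0.50 when theta matches b
print(f"P(correct | theta=1.0, b=0.1): {rasch_probability(1.0, 0.1):.2f}")  # > 0.50 when ability exceeds difficulty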
The second quality criterion that has been inseparable from the item difficulty
index is item discrimination, which indicates a question’s capacity to distinguish high-
performing examinees (i.e., those who know the content well) from low-performing
ones (i.e., those who struggle with the content) (French, 2001). Similar to the difficulty
index, the operationalization of discrimination depends on the test theory adopted by
assessment experts. Under CTT, experts have three alternatives for evaluating item discrimination. Using the first approach, experts may evaluate the difference in p values (i.e., item difficulty) between high performers and low performers:

$$\text{discrimination index} = p_{\text{high-performing}} - p_{\text{low-performing}}, \tag{3}$$
where $p_{\text{high-performing}}$ is found by taking the upper-performing 25% or 27% of examinees and $p_{\text{low-performing}}$ is found by taking the lower-performing 25% or 27% of examinees based on the total test score (Cohen et al., 1996; Jenkins & Michael, 1986;
Kehoe, 1995). Using the second approach, experts may calculate a point-biserial
correlation that relies on finding Pearson’s product-moment correlation between the
total score and item score (Ebel & Frisbie, 1986; Kim et al., 2021). Formally, the point-biserial correlation coefficient ($r_{pbi}$) is given as

$$r_{pbi} = \frac{M_p - M_q}{s}\sqrt{pq}, \tag{4}$$
where $M_p$ is the mean total test score for examinees answering the question correctly, $M_q$ is the mean total test score for examinees answering the question incorrectly, $s$ is the standard deviation of the total test scores, $p$ is the proportion of examinees answering the question correctly, and $q$ is the proportion of examinees answering the question incorrectly. The final
approach to item discrimination under the CTT framework has emerged recently and
is referred to as a multi-serial index. The multi-serial index is based on a multiple
correlation and is used for calculating item discrimination while considering distrac-
tor discrimination (Haladyna & Rodriguez, 2021). The multi-serial index (MSI) is
calculated by

$$\text{MSI} = \frac{SS_{\text{between}}}{SS_{\text{total}}}, \tag{5}$$
where SS is the sum of squares, and it quantifies the amount of test score variance
accounted for by the item response across all options (Haladyna & Rodriguez,
2021).
When experts adopt IRT, the discrimination index is the rate at which the probability of selecting a correct response changes given the examinee’s ability level. Similar to CTT, as the value of the discrimination index increases, the question’s ability to differentiate high-performing students from low-performing ones increases (Ashraf, 2020). Under the IRT framework, item discrimination is expressed as

$$P(X = 1 \mid \theta, b, a) = \frac{e^{a(\theta - b)}}{1 + e^{a(\theta - b)}}, \tag{6}$$
where θ is the ability level of an examinee, b is the difficulty, and a is the discrimina-
tion of the question.
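As an illustration, the Python sketch below computes the upper-lower discrimination index (Equation 3) and the point-biserial correlation (Equation 4) for a single item from a small, made-up response matrix; the 27% grouping rule and the data are assumptions for demonstration only.

import numpy as np

# Rows = examinees, columns = items (1 = correct, 0 = incorrect); illustrative data.
scores = np.array([
    [1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 0],
    [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0],
])
item = scores[:, 0]          # responses to the item under review
total = scores.sum(axis=1)   # total test scores

# Upper-lower discrimination index (Equation 3) using the top and bottom 27%.
n_group = max(1, int(round(0.27 * len(total))))
order = np.argsort(total)
p_low = item[order[:n_group]].mean()
p_high = item[order[-n_group:]].mean()
discrimination = p_high - p_low

# Point-biserial correlation (Equation 4).
m_p, m_q = total[item == 1].mean(), total[item == 0].mean()
s = total.std()              # standard deviation of the total scores
p = item.mean()
r_pbi = (m_p - m_q) / s * np.sqrt(p * (1 - p))

print(f"Upper-lower index: {discrimination:.2f}, point-biserial: {r_pbi:.2f}")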
The third statistical approach widely used by traditional test developers is dis-
tractor analysis—examining the functioning of incorrect response options. Distrac-
tor analysis helps test developers evaluate whether all distractors have been used,
whether the distractors are more likely to be selected by low-performing examinees, and whether the keyed response is correctly identified (Gierl et al., 2016; Haladyna et al., 2002). Ideally, distractors should tap examinees’ misconceptions and hence
should attract examinees with a lack of knowledge or misconception about a given
topic (Wind etal., 2019). Using CTT, experts may look at the frequency distribu-
tion for each response option and analyze which performance level is more likely to
select distractors. Endorsing the IRT framework, experts may assess the probability
of each response option being selected by the examinees. Either approach provides a
picture of how the distractors function and whether they are more attractive to low-
performing examinees.
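A minimal version of such a distractor analysis can be obtained by cross-tabulating the selected options against performance groups, as in the sketch below; the option labels, the median-based high/low split, and the response data are hypothetical.

import numpy as np
import pandas as pd

# Selected options for one multiple-choice item and total test scores; illustrative data.
choices = np.array(["A", "B", "A", "C", "D", "B", "A", "C", "B", "A"])
totals = np.array([18, 9, 20, 7, 5, 11, 17, 6, 8, 19])
key = "A"  # keyed (correct) response

# Split examinees into high- and low-performing halves by total score.
group = np.where(totals >= np.median(totals), "high", "low")

# Frequency of each response option by performance group.
table = pd.crosstab(pd.Series(choices, name="option"), pd.Series(group, name="group"))
print(table)

# Ideally, the keyed response attracts the high group and distractors attract the low group.
print("Keyed response selected by high performers:", table.loc[key, "high"])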
The final statistical approach, which enables test developers to examine whether a question functions differently, typically based on group membership, is referred to as differential item functioning (DIF). Test developers may also investigate the presence of DIF across different time points. When DIF is present in a
question, a group affiliation (e.g., race, gender, or region) impacts the probabil-
ity of answering the question correctly (Clauser & Hambleton, 2011), creating a
bias toward a particular group. Assessment results obtained from questions with
DIF may lead to unwarranted inferences about the ability levels of examinees.
To perform DIF analysis, experts start by identifying focal and reference groups
and then investigate whether the probability of correctly answering the question
changes between similar members of these groups (Livingston, 2013). Similar
members of the focal and reference groups are determined based on their total
test scores. DIF is presumed to exist for a given question if the probability of giv-
ing a correct response in the focal group is smaller than the probability of giving
a correct response in the reference group when the ability levels are matched.
There are various statistical methods (e.g., the Mantel–Haenszel method, logistic
regression, and SIBTEST) developed for assessing DIF in a question (Bulut &
Suh, 2017; Osterlind & Everson, 2009).
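As a rough illustration of the logic behind these methods, the sketch below computes the Mantel-Haenszel common odds ratio for one item after stratifying examinees on a matching score; the simulated data and the use of raw scores as strata are simplifying assumptions, not a full operational DIF procedure.

import numpy as np

rng = np.random.default_rng(0)

# Simulated scored responses (1/0) for one item, group labels, and matching scores.
n = 400
group = np.repeat(["reference", "focal"], n // 2)
matching_score = rng.integers(0, 11, size=n)            # total score on the remaining items
p_correct = 1 / (1 + np.exp(-(matching_score - 5)))     # same ability-response link for both groups
item = rng.binomial(1, p_correct)                       # no DIF is simulated here

# Mantel-Haenszel common odds ratio across score strata.
num, den = 0.0, 0.0
for k in np.unique(matching_score):
    mask = matching_score == k
    ref, foc = mask & (group == "reference"), mask & (group == "focal")
    a, b = item[ref].sum(), (1 - item[ref]).sum()   # reference group: correct / incorrect
    c, d = item[foc].sum(), (1 - item[foc]).sum()   # focal group: correct / incorrect
    n_k = mask.sum()
    num += a * d / n_k
    den += b * c / n_k

alpha_mh = num / den
print(f"MH common odds ratio: {alpha_mh:.2f}")  # values near 1 suggest no DIF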
Guidelines for statistical approaches Depending on the test theory adopted, the
guidelines differ for item difficulty and discrimination indices. Under the CTT
framework, questions with p-values smaller than 0.30 are considered difficult items.
Questions with a p-value between 0.30 and 0.70 have moderate difficulty. Finally,
questions with a p-value greater than 0.70 are considered easy items (Adegoke,
2013; Bichi, 2016; Henning, 1987). Concerning the interpretation of item discrimi-
nation when CTT is used, questions with negative discrimination (≤ 0) should be
flagged and scrutinized to evaluate whether there are any issues with the question
(e.g., the correct response is not properly identified) because the negative discrimi-
nation index indicates that lower-performing examinees are more likely to answer
the question correctly compared to higher-performing ones. Questions with a dis-
crimination index smaller than 0.20 should be revised or eliminated because they are
not good at differentiating higher-performing examinees from the lower-performing
ones. Questions with a discrimination index greater than 0.40 are considered good
questions, while those with a discrimination index between 0.20 and 0.40 could be
revised or scarcely used (Bichi, 2016; Ebel & Frisbie, 1986; Towns, 2014).
Interpreting item difficulty and discrimination is not as straightforward in IRT because examinees and questions are placed on the same continuum, and whether a question is difficult depends on the examinee’s ability level. Theoretically, item difficulty ranges from −∞ to +∞, placing the difficulty of each item on a continuum. Thus, item difficulties can be compared to one another, yet whether an item is difficult still depends on an examinee’s ability level. That is, item difficulties and examinee abilities are placed on the same continuum, allowing the test developer to match item difficulty with examinee ability. Assume that a question has a difficulty parameter of b = 0.1; then an examinee with an ability level of θ = 0.1 has a 50% probability of answering the question correctly (DeMars, 2010). Hence, item difficulty is relative to the ability level of the examinee, and interpreting whether a question is easy or difficult depends on that ability level.
Concerning discrimination in IRT, a negative discrimination index indicates that the question is problematic and, just like in CTT, should be removed. Theoretically, similar to the difficulty index, the discrimination index ranges from −∞ to +∞; however, it typically does not exceed 2. Questions with a discrimination index greater than 0.4 are considered good (DeMars, 2010), and questions with a discrimination index higher than 0.65 are regarded as having high discrimination (Baker, 2001). Questions with a higher discrimination index should be preferred because they are minimally affected by guessing behavior and are associated with higher internal consistency (Kim & Feldt, 2010).
1.3 Judgment-based approaches for item analysis
Judgment-based approaches for item analysis rely on employing subject-matter
experts to assess question quality. A rating scale or a rubric-like evaluation tool is
often used for assessing the quality of questions based on criteria such as the degree
of alignment between the content and question coverage, fairness of the questions,
ambiguity in questions, the difficulty levels of the questions, formatting issues, cog-
nitive load, and readability of questions (Chalifour & Powers, 1989; Engelhard etal.,
1999; Gierl etal., 2021; Gierl etal., 2016; Osterlind, 1989; Wauters etal., 2012).
For instance, Osterlind (1989) suggested using a 3-point scale for evaluating con-
tent alignment using the scales of high congruence, medium congruence, and low
congruence (pp. 267–268). Similarly, Gierl etal. (2016) proposed a 4-point rating
scale composed of accept, accept-minor revision, reject-major revision, and reject
for assessing the question quality based on the criteria of content, logic, and presen-
tation of questions created.
There is less uniformity in judgment-based approaches in terms of criteria and
rating scales used for item analysis, suggesting more idiosyncrasy during the evalu-
ation process. Below, we compare statistical and judgment-based approaches as well
as enumerate the strengths and limitations of both approaches traditionally used by
test developers and psychometricians.
1.4 Comparisons between statistical and judgment-based approaches
Judgment-based approaches require a rating scale and a training process involving
raters to establish reliability and consistency among those raters. Thus, it requires
extensive time and resources to train experts to assess the quality of questions con-
sistently using the rating scale. Nonetheless, studies indicated that experts could
vary a lot and might be poor judges of item quality, especially when it comes to
evaluating item difficulty (Engelhard etal., 1999; Impara & Plake, 1998; Wauters
etal., 2012). Due to the challenges inherent in judgment-based approaches, statisti-
cal approaches are considered the golden standard for evaluating the quality of ques-
tions because they provide empirical data about the question’s quality.
Nonetheless,statistical approaches also embody certain limitations. These limi-
tations restrict the use of statistical approaches and the generalizability of quality
indices. Concerning the first limitation of the use of statistical approaches, if a large
number of questions exist, thenall questions may not be field-tested, and field test-
ing is a costly and time-consuming process. Concerning the second limitation of
Education and Information Technologies
1 3
generalizability of the statistical approach, we highlight two fundamental assump-
tions in statistical approaches: 1) The sample selected is representative of the target
population, and 2) the real assessment administration conditions can be reproduced
during the field-testing process. Especially under the CTT framework, the quality
indices (i.e., item difficulty, item discrimination, and distractor analysis) heavily
depend on the sample and question characteristics. Distortions in sample representa-
tiveness and administration conditions may introduce construct-irrelevant variance,
contaminating the validity and reliability of statistical indices (e.g., Anastasi &
Urbina, 2004; Hambleton etal., 1991; Livingston, 2013). The construct-irrelevant
variance may involve random guessing behavior, test speededness, not-reached
items, and omitted items and can contaminate item parameters (e.g., Gorgun and
Bulut, 2022), test administration processes (Gorgun & Bulut, 2023), or scoring pro-
cedures (e.g., Gorgun & Bulut, 2021).
1.5 Quality criteria in automatic question generation
In AQG, questions are generated through the assistance of computer algorithms,
significantly reducing the reliance on human intervention throughout the question
generation process. While this automated approach minimizes human involvement,
it brings forth unique considerations for quality that were not as prominent in tradi-
tional question development where human input played a central role. Consequently,
the conventional evaluation methodologies may not thoroughly examine facets asso-
ciated with AQG, potentially limiting their effectiveness in addressing the distinctive
challenges presented by this automated process. Thus, the objective of this section is
to amalgamate traditional evaluation criteria with those employed by AQG research-
ers, bridging the gap between established evaluation practices and the evolving land-
scape of AQG.
Some of the quality criteria in AQG have exclusively focused on the linguistic
aspects of the generated questions. These linguistic criteria include grammaticality,
fluency, relevancy, semantic correctness, syntax clarity, meaning, spelling, natural-
ness, specificity, and coherence (Amidei etal., 2018; Gatt & Krahmer, 2018; Kurdi
etal., 2020; Mulla & Gharpure, 2023). These criteria are typically assessed using a
judgment-based approach, such asby employing a rating scale (e.g., Becker etal.,
2012; Heilman, 2011; Heilman & Smith, 2010; Mostow etal., 2017). Yet, the level
of scrutiny varies widely across the AQG systems. Some researchers have used a
rubric with multiple criteria (e.g., Becker etal., 2012; Maurya and Desarkar, 2020;
Niraula and Rus, 2015), whereas others evaluated the quality of questions gener-
ated using a binary scale (e.g., Wang etal., 2021). For example, Heilman and Smith
(2010) evaluated the quality of generated questions using the criteria of grammati-
cality, incorrect information, vagueness, and awkwardness/other using the rating
scale levels ofgood, acceptable, borderline, unacceptable, and bad (Heilman, 2011,
pp. 183–184). On the other hand, several have rated whether the generated ques-
tions satisfy the quality criteria of fluency (i.e., whether the question is coherent
and grammatically correct) and relevancy (i.e., whether the item is relevant given
the input sentence). In addition to the linguistic criteria, AQG systems were also
assessed based on the accuracy of cognitive models underlying question generation
(e.g., Gierl etal., 2022, 2016), question engagement (e.g., Van Campenhout etal.,
2022), differential child item functioning (e.g., Fu etal., 2022), and pedagogical use-
fulness (e.g., Jouault etal., 2016; Tamura etal., 2015; Zhang & VanLehn, 2016).
There are also similarities between AQG quality criteria and traditional quality
criteria. These include item difficulty, distractor analysis, domain relevance, and educational usefulness. Note that, in AQG research, the definitions of these criteria might vary across studies compared with the traditional ones. For example, by
employing CTT, Gierl etal. (2016) and Van Campenhout etal. (2022) estimated
item difficulty as a p-value, while others conceptualized item difficulty as cogni-
tive complexity (McCarthy etal., 2021; Settles etal., 2020; Venktesh etal., 2022).
Likewise, several studies conceptualized item difficulty as the similarity between the
keyed response and distractors (Lin etal., 2015; Seyler etal., 2017).
2 A taxonomy of evaluation methods used in AQG
In this section, we propose a coherent taxonomy for organizing the evaluation meth-
ods employed in AQG that addresses the quality criteria discussed above. We con-
tend that providing a comprehensive list of evaluation methods and a coherent tax-
onomy may help AQG researchers identify the suitable evaluation methods while
understanding their advantages and limitations. As such, AQG researchers can pick
the most suitable evaluation method given the design and approach employed dur-
ing the question generation process. Note that AQG studies may combine multiple
evaluation methods (Amidei etal., 2018), and therefore, the taxonomy proposed in
this study reflects the potential overlaps between the evaluation methods employed.
The proposed taxonomy (Fig. 1) was created by considering aspects of quality criteria, benchmarks, resources, and input used during the evaluation process. In Fig. 2, we display the decision tree used to categorize evaluation methods for creating the proposed taxonomy. We first considered the aspect of the question qual-
ity criteria and the existence of benchmarks. Specifically, we considered whether
the questions are evaluated on an individual basis or whether the question-gener-
ation system is evaluated holistically using benchmarks. Benchmarks here refer to
whether there is a baseline during the evaluation process to be considered as a point
of reference and allow test developers to evaluate the quality of the question-genera-
tion system holistically. That is, when the goal is to evaluate the question-generation
system holistically, we categorize these queries as comparison-based methods.
On the other hand, individually evaluated items can be categorized under the
umbrella terms of statistical approaches or judgment-based approaches based on the
quality criteria employed. If human judgment is the main instrument of evaluation, we
labelled this method as human evaluators. We further refined statistical approaches
based on input and resources. Input refers to which information has been used during
the evaluation process. For example, do AQG researchers evaluate questions directly or
do they rely on empirical data obtained after the generated questions are administered
to a group of examinees? When items are field-tested, we distinguish this approach as
post-hoc evaluation methods. Resources refer to the availability of certain information
Fig. 1 The taxonomy for evaluation methods in automatic question generation
Fig. 2 The decision tree used for categorizing the evaluation methods in AQG
in the dataset (e.g., reference questions, item ratings). When auxiliary information is
available in the dataset and statistical approaches are used, we refer to this family of evaluation methods as metric-based methods. Below, we elaborate on each category of the taxonomy and illustrate example studies from the literature.
2.1 Metric‑based evaluations
Metric refers to a standard of measurement, or to an art, process, or science of measuring (Merriam-Webster, 2023). Metric-based evaluations encompass
standardized metrics or indices that can be used to automatically evaluate the perfor-
mance of the AQG system. Typically, the performance of the AQG system is com-
pared against human-authored questions or reference questions (e.g., Gao etal., 2019;
Kumar etal., 2018; Marrese-Taylor etal., 2018; Wang etal., 2018). Table1 presents
several examples of previous AQG studies that used metric-based evaluations. Below,
we explain these evaluation indices, describe how they can be implemented, and finally
discuss their limitations.
The first group of metric-based evaluations is widely used in machine-translation
tasks that involve comparing a machine-translated sentence with a reference sentence.
Among these indices, the most frequently used ones in AQG are BiLingual Evaluation
Understudy (BLEU; Papineni etal., 2002), Metric for Evaluation of Translation with
Explicit ORdering (METEOR; Banerjee & Lavie, 2005), and Recall-Oriented Under-
study for Gisting Evaluation (ROUGE; Lin, 2004). These metrics emerged in machine
translation because employing human evaluations is costly (Hovy, 1999) and time-
consuming (Papineni etal., 2002). These methods have facilitated the evaluation of
machine-translation systems while quantifying the magnitude of the closeness between
human translation and machine translation.
The first metric, BLEU (Papineni et al., 2002), counts the number of matching n-grams between the machine translation and the reference translation. Note that BLEU does not consider the position of the matched n-grams. The BLEU score ranges between 0 and 1, and a higher BLEU score indicates a better machine translation; thus, a score closer to 1 indicates that the machine translation is close to the human translation. The BLEU score is calculated by estimating a precision score based on n-gram matches and a penalty function. The precision score ($p_n$) is given as

$$p_n = \frac{\sum_{C \in \{\text{Candidates}\}} \; \sum_{\text{n-gram} \in C} \text{Count}_{\text{clip}}(\text{n-gram})}{\sum_{C' \in \{\text{Candidates}\}} \; \sum_{\text{n-gram}' \in C'} \text{Count}(\text{n-gram}')}, \tag{7}$$

where $\text{Count}_{\text{clip}}$ truncates the count of each candidate n-gram at its maximum count in the reference translation. The penalty function takes sentence length into account (i.e., the brevity penalty, BP) and is expressed as

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r, \end{cases} \tag{8}$$
where r is the reference corpus length and c is the total length of the candidate translation corpus. Finally, the BLEU score is obtained by

$$\text{BLEU} = BP \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right), \tag{9}$$

where $w_n = 1/N$ and N is the maximum n-gram order considered (typically 4).
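To show how Equations 7-9 fit together, the sketch below implements a bare-bones sentence-level BLEU with clipped n-gram precision and the brevity penalty; it uses a single reference and no smoothing, so it is an illustration rather than a standard toolkit implementation.

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU (Equations 7-9): clipped n-gram precision x brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(clipped / total) if clipped else float("-inf"))
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

generated = "what is the capital of france"
reference = "what is the capital city of france"
print(f"BLEU: {bleu(generated, reference):.3f}")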
The second machine-translation metric employed for evaluating questions in AQG is ROUGE (Lin, 2004). Originally, four variations of the ROUGE metric were proposed (ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S); however, in AQG, ROUGE-L is used most frequently. ROUGE-L is based on estimating the longest common subsequence (LCS) between a machine translation and a reference translation. Similar to BLEU, ROUGE-L computes the similarity between the machine translation and the reference translation by calculating an LCS-based F-measure (Lin, 2004). ROUGE-L is thus calculated by finding the LCS-based recall

$$R_{LCS} = \frac{LCS(X, Y)}{m} \tag{10}$$

and the LCS-based precision

$$P_{LCS} = \frac{LCS(X, Y)}{n}, \tag{11}$$

where X and Y are the machine and reference translations with lengths m and n, respectively. Finally, the ROUGE-L score is found by

$$F_{LCS} = \frac{(1 + \beta^2)\, R_{LCS} P_{LCS}}{R_{LCS} + \beta^2 P_{LCS}}, \tag{12}$$

where β controls the relative weighting of precision and recall (Lin, 2004). This generates a score between 0 and 1, and larger values indicate more similarity between the machine and reference translations.
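The sketch below mirrors Equations 10-12 using a dynamic-programming LCS; fixing β at 1 is a simplifying assumption made here for illustration.

def lcs_length(x, y):
    """Length of the longest common subsequence between token lists x and y."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if xi == yj else max(
                table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]

def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-measure (Equations 10-12), following the notation above."""
    x, y = candidate.split(), reference.split()
    lcs = lcs_length(x, y)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(x), lcs / len(y)
    return (1 + beta**2) * recall * precision / (recall + beta**2 * precision)

print(f"ROUGE-L: {rouge_l('what is the capital of france', 'what is the capital city of france'):.3f}")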
The final machine-translation metric used in AQG is METEOR (Banerjee & Lavie, 2005). METEOR, similar to ROUGE-L, also uses precision and recall scores but additionally takes the order of matched words into account (Banerjee & Lavie, 2005). METEOR is computed as

$$\text{Score} = F_{mean} (1 - \text{Penalty}), \tag{13}$$

where $F_{mean}$ is a combination of unigram recall (R) and unigram precision (P),

$$F_{mean} = \frac{10PR}{R + 9P}, \tag{14}$$

and Penalty is defined as

$$\text{Penalty} = 0.5 \left(\frac{\#\,\text{chunks}}{\#\,\text{unigrams matched}}\right)^{3}. \tag{15}$$
METEOR intends to overcome the limitations inherent in BLEU (i.e., neglect-
ing the position of n-gram pairs) by introducing the penalty function defined above.
Specifically, the penalty function considers the unigrams in adjacent positions in
candidate translation (i.e., chunks) and the unigrams matched. METEOR provides
a score between 0 and 1, and the larger values indicate more similarity between the
candidate translation and the reference translation.
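For intuition, the short fragment below evaluates Equations 13-15 for hypothetical matching statistics (the counts are invented).

# Hypothetical unigram matching statistics between a generated and a reference question.
matched, candidate_len, reference_len, chunks = 6, 7, 8, 2

precision = matched / candidate_len                            # P
recall = matched / reference_len                               # R
f_mean = 10 * precision * recall / (recall + 9 * precision)    # Equation 14
penalty = 0.5 * (chunks / matched) ** 3                        # Equation 15
score = f_mean * (1 - penalty)                                 # Equation 13
print(f"METEOR-style score: {score:.3f}")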
In addition to the indices borrowed from the machine translation literature, other
standardized metrics have also been used to evaluate the quality of AQG. These
include F1, perplexity, the Python language tool, toxicity analysis, embedding simi-
larity, and specialized metrics that were developed by the AQG researchers (Amidei
etal., 2018). To exemplify specialized metrics, Wang and colleagues (Wang etal.,
2021) developed a math word problem question-generation system where equa-
tions are used to represent the math questions. They developed a metric, ACC-eq,
to assess the similarity between the equation describing the generated math word
problem and the input equation.
2.2 Implementation
The metric-based evaluation methods necessitate a point of reference to carry out
the evaluation of automatically generated questions. Using BLEU, ROUGE-L,
and METEOR, researchers can evaluate the similarity between machine-generated
and human-authored questions. In some studies, researchers have also compared
machine-generated distractors with human-authored ones. For example, using
BLEU, METEOR, and ROUGE-L, the researchers have evaluated the similarity
between the questions available in the dataset (e.g., SQuAD; Rajpurkar etal., 2016)
and generated questions (e.g., Gao etal., 2019; Kumar etal., 2018; Maurya and
Desarkar, 2020). Similarity-based metrics can also be used to compare the semantic
distance between machine-generated questions or distractors and human-authored
questions or distractors. For example, Rodriguez-Torrealba etal. (2022) and Maurya
and Desarkar (2020) obtained vector embeddings (i.e., numerical representations of
individual words) for the questions by using linguistic models such as Global Vec-
tors for Word Representation (GloVe; Pennington et al., 2014) and Bidirectional
Encoder Representations from Transformers (BERT; Devlin etal., 2018), calculated
cosine similarity based on the embeddings, and evaluated the semantic similarity
between the generated questions and the reference questions.
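One common way to implement such a comparison is sketched below using the sentence-transformers package; the specific model name and the example questions are assumptions, and any sentence encoder that produces fixed-length vectors could be substituted.

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed available

# Any pretrained sentence encoder can be substituted here.
model = SentenceTransformer("all-MiniLM-L6-v2")

generated = "What is the main function of mitochondria in a cell?"
reference = "Which organelle is responsible for producing energy in the cell?"

emb_gen, emb_ref = model.encode([generated, reference])

# Cosine similarity between the two question embeddings.
similarity = np.dot(emb_gen, emb_ref) / (np.linalg.norm(emb_gen) * np.linalg.norm(emb_ref))
print(f"Semantic similarity: {similarity:.2f}")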
Various metrics have been introduced to facilitate an automated evaluation of
multiple dimensions related to question quality. These metrics offer a comprehen-
sive assessment, encompassing aspects such as question diversity, textual coher-
ence, grammatical accuracy, the average count of n-grams, and even the evalua-
tion of toxicity within the generated questions (as explored by Wang etal., 2022).
By employing these metrics, a nuanced understanding of the quality of questions
can be derived, enabling a more robust evaluation process that goes beyond mere
correctness and delves into the intricacies of linguistic expression and ethical
considerations.
Ultimately, researchers can leverage labeled data acquired through human
evaluators to facilitate the training of classifiers capable of automatically assess-
ing the quality of generated questions. Noteworthy examples in literature, such
as the work by Becker etal. (2012) and Heilman (2011), showcase the training of
machine learning classifiers specifically designed to predict the fluency and gram-
maticality of questions. A variety of well-established classifiers find application in
automated question quality detection, including logistic regression, random forest,
LambdaMart, and rankSVM, as highlighted in studies by Becker etal. (2012), Liang
etal. (2018), and Liu etal. (2017). These machine learning approaches, informed by
human-labeled data, contribute significantly to the advancement of automated sys-
tems for evaluating the linguistic and structural aspects of generated questions.
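A minimal version of this idea, assuming a set of generated questions with binary acceptability labels from human raters, might look like the following scikit-learn sketch; the features, labels, and model choice are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Generated questions with human-assigned acceptability labels (illustrative data).
questions = [
    "What is the capital of France?",
    "Capital France what of is the?",
    "Which organelle produces energy in the cell?",
    "Energy which cell the in organelle?",
] * 10
labels = [1, 0, 1, 0] * 10  # 1 = acceptable, 0 = unacceptable

X_train, X_test, y_train, y_test = train_test_split(
    questions, labels, test_size=0.25, random_state=0)

# Word uni- and bigram TF-IDF features give a crude signal about fluency and word order.
classifier = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
classifier.fit(X_train, y_train)
print(f"Held-out accuracy: {classifier.score(X_test, y_test):.2f}")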
2.3 Limitations of metric-based methods
While metric-based evaluations offer efficiency and simplicity in assessing automat-
ically generated questions through standardized measures such as BLEU or toxicity
analysis, their applicability hinges on the availability of reference questions, distrac-
tors, or labeled data. Therefore, the necessity of ground truth or reference questions
turns into a major roadblock for the utilization of these metrics. Moreover, given
that these metrics primarily assess the similarity between a reference question and
an automatically generated question, a question deemed acceptable but semantically
dissimilar may receive a low score (Kurdi etal., 2020). Consequently, the evaluation
process may inadvertently disqualify high-quality questions due to dissimilarities in
linguistic structure. Conversely, metric-based evaluations may flag generated questions as acceptable due to their similarity in linguistic structure even when those questions are, in fact, useless for educational or pedagogical purposes.
2.4 Human evaluators
Human evaluators refer to employing manual coding and rating scales for evaluating
the quality of automatically generated questions. As such, human evaluators typi-
cally employ the quality criteria discussed under judgment-based approaches. The
criteria include language fluency (e.g., Mostow etal., 2017; Song & Zhao, 2017),
grammaticality (e.g., Chughtai etal., 2022; Heilman, 2011), distractability of ques-
tions (e.g., Maurya and Desarkar, 2020), the complexity of questions (e.g., Chung &
Hsiao, 2022), the acceptability of questions (Gierl etal., 2016; Liang etal., 2017),
the difficulty of questions (e.g., Rodriguez-Torrealba etal., 2022), or domain rel-
evance (e.g., Chughtai etal., 2022; Dugan etal., 2022). The labeled data obtained
from human evaluators may serve as ground truth to be used in subsequent analysis
(e.g., training classifiers; Becker etal., 2012; Heilman & Smith, 2010).
Human evaluators comprise various groups with different types of expertise,
such as subject matter experts (e.g., Gierl etal., 2016), researchers (e.g., Dugan
etal., 2022), students (e.g., Panda etal., 2022), teachers (e.g., Chung & Hsiao,
2022), and crowdsource workers (e.g., Lin et al., 2015). An advantage of employing crowdsourcing is that crowdsource workers are relatively less expensive than other evaluators, and many of them can be recruited to evaluate the vast number of questions generated by AQG. This provides an efficient and inexpensive solution to the question-evaluation process. Nonetheless, human evaluators may exhibit different levels of expertise for question evaluation, which may cast suspicion on the validity of the question quality labels assigned. Examples of studies employing human evaluators are provided in Table 2.

Table 2  Examples of AQG systems evaluated using human evaluators

Authors | Generated Item Type | Context | AQG Method | Evaluation Method
--- | --- | --- | --- | ---
Attali et al., 2022 | Multiple-choice | Reading comprehension | GPT-3 | Experts
Becker et al., 2012 | Cloze | Generic | Parse-trees | Crowdsource workers
Chughtai et al., 2022 | Multiple-choice | Engineering | T-5, sense2vec | Experts
Chung & Hsiao, 2022 | Constructed response | Programming | Template-based | Teachers
Dugan et al., 2022 | Constructed response | Generic | T-5 | Researchers
Gierl et al., 2016 | Multiple-choice | Medicine | Template-based | Experts
Liang et al., 2017 | Distractor | Biology, Math, Physics | Generative adversarial neural nets | Experts
Lin et al., 2015 | Multiple-choice | Wildlife | Hybrid semantic similarity | Crowdsource workers
Maurya and Desarkar, 2020 | Distractor | Reading comprehension | Hierarchical multi-decoder network | Students
Mostow et al., 2017 | Multiple-choice | Reading comprehension | Parse-trees, n-grams | Students
Olney, 2021 | Cloze items | Science | Deep learning summarization | Experts, Students
Panda et al., 2022 | Distractor generation, cloze item | Language | Neural machine translation, round-trip machine translation | Students
Rodriguez-Torrealba et al., 2022 | Multiple-choice, answer, distractor | Generic | T-5 | Professionals
Song & Zhao, 2017 | Constructed response | Generic | Neural machine translation | Human (unknown category)
von Davier, 2018 | Survey | Personality scale | Recurrent neural network, long short-term memory | Crowdsource workers
Wang et al., 2022 | Constructed response | Biology | GPT-3, prompt engineering | Experts
2.5 Implementation
Typically, a rating scale or scoring rubric composed of multiple criteria has
been used by human evaluators to assess the quality of generated questions (e.g.,
Becker etal., 2012; Mostow etal., 2017; Rodriguez-Torrealba etal., 2022; von
Davier, 2018). Evaluators may undergo a training process to prevent idiosyncratic
rating scale interpretation and to achieve standardization during question evalua-
tion using the rating scale. In addition, the questions generated can be assessed by
multiple evaluators, and the interrater reliability among the evaluators could be
analyzed to examine the extent to which evaluators agree and are consistent with
one another.
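Interrater agreement of this kind can be quantified, for example, with Cohen's kappa, as in the short sketch below; the two raters' labels are invented.

from sklearn.metrics import cohen_kappa_score

# Acceptability ratings assigned by two evaluators to the same ten questions (illustrative).
rater_1 = ["accept", "accept", "revise", "reject", "accept", "revise", "accept", "reject", "accept", "revise"]
rater_2 = ["accept", "revise", "revise", "reject", "accept", "accept", "accept", "reject", "accept", "revise"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance-level agreement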
2.6 Limitations of human evaluators
Human evaluators are perhaps the most frequently used evaluation method in AQG
research (e.g., Kurdi etal., 2020). Nonetheless, the lack of reporting practices may
encumber the appraisal of the quality of human evaluations. Although AQG studies
have involved different numbers of evaluators, ranging from 1 to 364 (Amidei etal.,
2018), they have rarely reported the rate of agreement between the evaluators. Fur-
thermore, training practices and measures adopted to ensure agreement and consist-
ency among the evaluators are usually unknown (Kurdi etal., 2020). Most impor-
tantly, previous studies often fail to provide a detailed description of the evaluators
and evaluation criteria used. For example, researchers simply report that evaluators
are native English speakers, lacking information on the educational background or
demographic characteristics of evaluators (e.g., Maurya and Desarkar, 2020; Song
& Zhao, 2017), or they only indicate that evaluators assessed the grammatically of
questions without necessarily providing the reader with information on how gram-
matically is defined. As such, it is impossible to appraise the quality of human eval-
uators. To optimize the use of human evaluators in AQG, researchers should pro-
vide a detailed description of the evaluators, the recruitment process, training, rating
scale development, and tools used (Lin etal., 2015). A third limitation is that evalu-
ating questions by employing human evaluators is typically an expensive and time-
consuming process, and given the scale of all generated questions, human evaluators
may not be an optimal solution for evaluating all questions generated.
2.7 Post‑hoc evaluations
Post-hoc evaluations refer to administering automatically generated questions to a
representative sample and evaluating the quality of the questions after the adminis-
tration is completed. As such, post-hoc evaluations typically incorporate statistical
approaches for item analysis. Post-hoc evaluations include experimental designs and
psychometric analysis. The former may compare the impact of generated questions
with human-authored questions on learner engagement and performance. Alterna-
tively, experimental studies may also include control and experimental groups in
which the effectiveness of automatically generated items on learner performance is
assessed. Psychometric analysis, on the other hand, typically starts with picking a
test theory (i.e., CTT or IRT) and running item analysis that may focus on item diffi-
culty, item discrimination, or distractor analysis. Table 3 provides examples of studies that used post-hoc evaluations to assess the quality of the generated questions.

Table 3  Examples of AQG systems evaluated using post-hoc methods

Authors | Item Type | Context | AQG Method | Evaluation Method
--- | --- | --- | --- | ---
Attali et al., 2022 | Multiple-choice | Reading comprehension | GPT-3 | Psychometric properties
Gierl et al., 2016 | Multiple-choice | Medicine | Template-based | Psychometric properties
Gierl & Lai, 2012 | Multiple-choice | Medicine | Template-based | Psychometric properties
Hommel et al., 2022 | Survey | Personality | Recurrent neural network, long short-term memory, GPT-2 | Psychometric properties
Van Campenhout et al., 2022 | Matching, cloze | Psychology | Rule-based | Experimental
Yang et al., 2021 | Cloze items | Reading comprehension | BERT | Experimental
2.8 Implementation
Previous studies used post-hoc evaluations when automatically generated ques-
tions could be administered to a representative sample of examinees to obtain sta-
tistical indices about the questions generated. For instance, Van Campenhout and
colleagues (2022) aimed to understand the influence of automatically generated
questions on student engagement and persistence by comparing generated questions
with human-authored ones. They found that both questions functioned similarly. In
a similar study, Yang and colleagues (Yang etal., 2021) investigated the impact of
automatically generated questions on students’ reading engagement and reading per-
formance. They found that those who practiced the content using automatically gen-
erated questions had better course performance (Yang etal., 2021).
Beyond simply assessing the influence of automatically generated questions on
learner performance, researchers have also delved into a comprehensive evaluation
of the psychometric properties associated with such questions. A notable instance
of this approach is found in the work of Gierl and colleagues (Gierl etal., 2016)
who meticulously appraised the quality of automatically generated medical ques-
tions. Their evaluation extended beyond mere performance outcomes and involved
administering these questions to a representative sample. The criteria that Gierl
etal. (2016) employed to gauge quality included item difficulty, a thorough analy-
sis of distractors, and an examination of keyed response functioning. Similarly, in a
study conducted by Attali etal. (2022), researchers took a multifaceted approach to
assess the psychometric properties of generated questions. This investigation went
beyond traditional metrics, incorporating an examination of item difficulty, an analy-
sis of local independence within the questions, and a scrutiny of response times.
Such comprehensive evaluations not only provide insights into the impact of gener-
ated questions on learner performance but also offer a nuanced understanding of the
inherent qualities that contribute to the effectiveness and reliability of these educa-
tional assessment tools.
2.9 Limitations of post-hoc evaluations
Post-hoc evaluations, anchored in data-driven methodologies, rely on empirical
evidence that demonstrates the quality of generated questions. These assessments
commonly integrate statistical approaches for item analysis to gauge question qual-
ity. However, as highlighted earlier, the limitations associated with relying solely on
statistical methods for item analysis have been explicitly articulated, including con-
cerns related to the generalizability of indices derived from a specific sample.
Another caveat of post-hoc evaluations is that the question quality is assessed in a
retrospective manner. That is, we have limited information about the quality of ques-
tions generated prior to administering them. This inherent characteristic may result
in unintended repercussions, contingent on the testing conditions. On one hand, if
the generated questions are tested in a real assessment setting, poor questions could
potentially induce learner confusion and frustration. Conversely, in a field-testing
scenario, the testing conditions may wield substantial influence over the conclusions
drawn regarding the quality of the questions.
Finally, it is worth noting that researchers in AQG often produce a substantial
volume of questions. In practical terms, it becomes unfeasible to administer all
generated questions in a field-testing or experimental context. Consequently, while
we can form a reliable estimation of the quality of the questions that were actually
administered, a significant number of leftover questions remain untested, prevent-
ing the acquisition of comprehensive item statistics. This surplus of unadministered
questions introduces an additional challenge in the adoption of automatically gener-
ated questions for operational assessment and learning environments. The post-hoc
evaluations, by their very nature, contribute to a bottleneck in the seamless imple-
mentation of these questions, raising practical concerns about their widespread
applicability and integration into educational settings.
2.10 Comparison‑based evaluations
So far, we have discussed methods used to evaluate questions on an individual basis.
However, test developers might be interested in evaluating the question-generation system holistically to understand the contributions of certain components of the question-generation pipeline. This evaluation method could have been subsumed under metric-based evaluations (see Fig. 1) because the degradation in model performance is typically estimated by using metrics such as BLEU, ROUGE, Recall, Precision, or F1 (Wang et al., 2021). However, comparison-based methods deviate from metric-based evaluations because, when comparison-based evaluations are used, the system is evaluated holistically, whereas each generated question is evaluated individually in metric-based evaluations.
Although rarely employed, comparison-based evaluations have two branches:
ablation studies and comparing the AQG system with previous question generation
systems. Ablation studies are evaluation methods that involve removing a compo-
nent of the AQG system and assessing the degree of degradation in the question
generation pipeline. Here, degradation in the model refers to a decrease in model
performance (e.g., Precision, Recall, BLEU, or F1 values) when one or more com-
ponents of the question generation system are removed. Thus, the system is expected
to perform worse if the removed component is essential to the AQG system. The
second branch involves comparing the AQG system with the baseline or previous
system and assessing how much improvement is achieved with the new modifica-
tions. This second branch is perhaps the least frequently used method in comparison
to other methods because AQG researchers require an existing system or a base-
line model in order to assess the performance of the newly developed system. None-
theless, a few studies follow this method for evaluating the AQG system. Table4
includes several examples of studies using comparison-based evaluations.
2.11 Implementation
There are several AQG systems that employ ablation studies to assess question qual-
ity. For instance, Wang and colleagues (Wang etal., 2021) removed several compo-
nents from their proposed AQG system to assess the degree of degradation in the
AQG system. Specifically, they compared several context keyword selection meth-
ods including term frequency-inverse document frequency, nouns and pronouns
when generating math questions. Using BLEU and a customized evaluation metric,
Wang and colleagues (Wang etal., 2021) compared different models’ performance.
Thus, this process served as baseline models for Wang etal. (2021) to justify the
contribution of the components to the question-generation process. In addition, a
few studies considered previous AQG systems as the benchmark and assessed the
degree of improvement observed in the newer AQG systems (e.g., Huang & He,
2016; Mostow etal., 2017).
2.12 Limitations ofcomparison‑based evaluations
Comparison-based evaluations necessitate access to previous AQG systems or to
questions generated by those systems. Because of the limited availability of AQG
data, this evaluation method is rarely used in practice. Ablation studies, on the
other hand, assess question quality indirectly, through the degree of degradation
observed in the AQG system when a component is removed. Consequently, they offer
limited insight into the overall quality of the AQG system and of the individual
questions it generates.
3 Discussion
In this study, we provided a comprehensive overview of the evaluation criteria and
methods used by AQG system developers. Our survey highlighted that AQG researchers
may evaluate the AQG system holistically by removing components of the
question-generation system or by comparing the system's performance against
previous baseline models. While this approach allows researchers to compare the
contributions of several preprocessing or modeling decisions within the
question-generation system (e.g., Wang et al., 2021), the quality of individual
questions remains unknown. As such, comparison-based methods are not entirely
sufficient for deploying the generated questions in learning environments and
assessments.
Methods relying on statistical and judgment-based approaches allow researchers
and practitioners to evaluate each generated question (e.g., Attali et al., 2022;
Becker et al., 2012; Dugan et al., 2022; Gierl et al., 2022). Nonetheless, these
evaluation methods have several significant pitfalls that limit their
generalizability and efficiency when generated questions are implemented in
real-world educational settings. For instance, employing human evaluators to judge
generated questions violates the most fundamental assumption of AQG, namely that
questions can be generated quickly and efficiently. Human evaluators need to go
through each individual question and assign a quality score using a rating scale.
Although questions are generated instantly, human evaluators slow down the
deployment process (Kurdi et al., 2020), and without knowing question quality, the
efficiency and swiftness of question generation are futile because the generated
questions cannot be used directly in educational environments.
Similar to the concerns about employing human evaluators, post-hoc methods can be
quite limited and resource-intensive for evaluating the quality of generated
questions. Post-hoc methods may thus undermine another fundamental assumption of
AQG, namely that a high volume of questions can be generated (e.g., Attali et al.,
2022; Panda et al., 2022). When post-hoc methods are employed, only a subset of
questions can be administered, yielding item-quality information for just a
fraction of the generated questions. The quality of the remaining questions
remains unknown, restricting their optimal use.
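As an illustration of what such a post-hoc, psychometric check involves, the sketch below computes classical item difficulty and item-rest point-biserial discrimination from a small, hypothetical matrix of pilot responses; real post-hoc analyses would rely on much larger samples and often on item response theory models.

```python
# Hypothetical post-hoc (psychometric) check: classical item difficulty
# (proportion correct) and item-rest point-biserial discrimination computed
# from pilot responses to a handful of generated items.
# The 0/1 response matrix is a placeholder (rows = examinees, cols = items).
import numpy as np
from scipy.stats import pointbiserialr

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

total_scores = responses.sum(axis=1)
for j in range(responses.shape[1]):
    difficulty = responses[:, j].mean()           # p-value of the item
    rest_score = total_scores - responses[:, j]   # rest score avoids self-correlation
    discrimination, _ = pointbiserialr(responses[:, j], rest_score)
    print(f"Item {j + 1}: difficulty={difficulty:.2f}, discrimination={discrimination:.2f}")
```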
Metric-based methods have emerged as a promising solution for evaluating all
generated items instantly and efficiently. This family of methods enables the
evaluation of every generated question with ease, yet most of these metrics
require reference questions or ground truth about item quality (e.g., Kurdi
et al., 2020), which is an unrealistic expectation for many question-generation
systems. For these reasons, AQG researchers should not only focus on enhancing the
performance of question-generation systems but also introduce novel evaluation
methods that assess the quality of all generated questions efficiently and
feasibly.
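For readers unfamiliar with metric-based evaluation, the following sketch shows the reference-based comparison these metrics rely on: each generated question is scored against a human-written reference, here with ROUGE via the rouge-score package. The question pairs are placeholders, and the choice of metric and library is an assumption made for illustration.

```python
# Minimal sketch of a reference-based, metric-style check: each generated
# question is compared with a human-written reference question using ROUGE.
# The question pairs below are placeholders.
from rouge_score import rouge_scorer

pairs = [
    ("What causes rain to form?", "What causes rain to fall from clouds?"),
    ("Who wrote the novel?",      "Who is the author of the novel?"),
]

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
for reference, generated in pairs:
    scores = scorer.score(reference, generated)
    print(generated)
    print(f"  ROUGE-1 F1: {scores['rouge1'].fmeasure:.2f}, "
          f"ROUGE-L F1: {scores['rougeL'].fmeasure:.2f}")
```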
3.1 Recommendations forfuture research
Beyond presenting an overview of current evaluation practices, we aim to support
AQG researchers by offering recommendations and suggestions that can strengthen
the question-generation pipeline.
3.2 Availability ofdatasets
There are many proposed AQG systems, yet very few have shared the generated
questions and the evaluation metrics publicly (e.g., Becker et al., 2012). This
encumbers the progress and comparison of AQG systems. Datasets containing
automatically generated questions together with their evaluation results are
needed to compare the feasibility, scalability, and overlap among evaluation
methods and to assess their coherence and consistency. In particular, questions
evaluated with multiple methods, such as human evaluators and metric-based
approaches, are essential for revealing the interrelationships between evaluation
methods and quality criteria.

Table 4  Examples of AQG systems evaluated using comparison-based methods

Authors | Item Type | Context | AQG Method | Evaluation Method
Huang & He, 2016 | Constructed response | Reading comprehension | Paraphrasing | Previous AQG system
Huang & Mostow, 2015 | Multiple-choice | Reading comprehension | N-grams | Previous AQG system
Liang et al., 2017 | Distractor | Biology, Math, Physics | Generative adversarial neural nets | Previous AQG system
Mostow et al., 2017 | Multiple-choice | Reading comprehension | Parse trees and n-grams | Previous AQG system
Wang et al., 2021 | Constructed response | Math | Pre-trained large language models | Ablation study

Such comparisons could offer new insights into current question evaluation
methods, advancing both traditional psychometric and computer science approaches
to question development. The availability of such datasets could also help develop
automatic evaluation methods that bridge the gap between question generation and
the deployment of generated questions in real assessment and learning settings,
and it would support developing and validating novel metric-based evaluations that
judge questions instantly and efficiently, rendering the implementation of
generated questions in educational settings possible. Furthermore, AQG researchers
may benefit from existing datasets to develop automated detectors for question
quality. Therefore, we recommend that AQG researchers generate questions
automatically, assess their quality using various quality criteria and evaluation
methods, and share the resulting datasets publicly.
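As a purely hypothetical illustration of what a shared record in such a dataset might contain, the snippet below pairs one generated question with the evaluation evidence discussed in this survey; the field names and values are examples, not a proposed standard.

```python
# Hypothetical record layout for a publicly shared AQG evaluation dataset,
# pairing each generated question with the evaluation evidence collected for it.
# Field names and values are illustrative only.
example_record = {
    "question_id": "q-000123",
    "source_passage_id": "passage-042",
    "question_text": "What causes rain to form?",
    "item_type": "multiple-choice",
    "generation_method": "fine-tuned transformer",   # placeholder label
    "human_ratings": {"fluency": 3, "relevance": 2, "answerability": 3},
    "metric_scores": {"bleu": 0.41, "rougeL_f1": 0.55},
    "psychometric_properties": {"difficulty": 0.62, "discrimination": 0.31},
    "deployed": True,
}
print(example_record["question_text"])
```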
3.3 Standardized quality criteria
We highlighted that quality criteria may be defined quite differently across
studies, creating challenges and limitations when it comes to comparing different
question-generation systems (e.g., Gierl et al., 2016; Heilman & Smith, 2010;
Rodriguez-Torrealba et al., 2022). For instance, researchers may use the same
criteria, such as item difficulty or fluency, yet the operationalization or the
tools used for evaluation could be quite different (e.g., Heilman, 2011; Mostow
et al., 2017). Thus, what seems comparable on the surface may in fact be
incommensurable, introducing challenges for comparing systems and enhancing AQG
methods. Therefore, we recommend that standardized quality criteria be established
to render studies more comparable and transferable. This could especially support
judgment-based approaches and human evaluators by establishing a standardized
evaluation process. Such standardized quality criteria may enhance the robustness,
systematicity, and interpretability of AQG systems.
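A machine-readable rubric is one possible form such standardized criteria could take. The sketch below is a hypothetical example: the criteria, scale, and descriptors are ours for illustration and are not drawn from any existing standard.

```python
# Hypothetical, machine-readable rubric for judgment-based evaluation.
# Criteria, scale, and descriptors are illustrative, not an established standard.
QUALITY_RUBRIC = {
    "scale": [1, 2, 3],  # 1 = unacceptable, 2 = needs revision, 3 = acceptable
    "criteria": {
        "fluency":       "The question is grammatical and reads naturally.",
        "relevance":     "The question is answerable from the source content.",
        "answerability": "The intended key is the single defensible answer.",
        "difficulty":    "The estimated difficulty matches the target level.",
    },
}

def is_complete(rating: dict) -> bool:
    """Check that a rater scored every criterion on the defined scale."""
    return all(rating.get(c) in QUALITY_RUBRIC["scale"]
               for c in QUALITY_RUBRIC["criteria"])

print(is_complete({"fluency": 3, "relevance": 2, "answerability": 3, "difficulty": 1}))
```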
3.4 Better reporting practices
Many AQG studies to date have failed to report crucial aspects of the
question-generation and evaluation processes, limiting the appraisal of
question-generation systems. Better reporting practices, including information on
the question-generation and evaluation pipeline, should be an integral part of
standardized reporting (Amidei et al., 2018; Kurdi et al., 2020). For instance,
the rating scales used by evaluators or the implementation details of post-hoc
evaluations may help AQG researchers design better evaluation processes for the
question-generation pipeline. We recommend that researchers follow detailed
reporting practices and include information about the purpose of question
generation, the question-generation process, evaluation practices (how questions
are evaluated and which criteria are used), reliability and validity indicators
for the evaluation process, and the limitations and challenges experienced during
question generation and evaluation. This should not only be an indispensable part
of best practice in AQG research but can also help future researchers build
reproducible and replicable AQG systems and evaluations. We further recommend that
researchers report this vital information in research repositories or as
supplemental materials if the publishing venue's format does not allow full
details of the generation process. In this way, more transparency and
interpretability across AQG systems could be achieved.
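One lightweight way to operationalize such reporting is a checklist that study metadata can be validated against before submission or archiving. The sketch below is a hypothetical example based on the elements listed above, not an established reporting standard.

```python
# Hypothetical reporting checklist echoing the elements recommended above;
# a study's metadata can be checked against it before submission or archiving.
REQUIRED_REPORT_FIELDS = [
    "purpose_of_generation",
    "generation_process",
    "evaluation_method",
    "quality_criteria",
    "reliability_evidence",
    "validity_evidence",
    "limitations",
]

def missing_fields(report: dict) -> list:
    """Return the recommended fields that the report leaves unaddressed."""
    return [f for f in REQUIRED_REPORT_FIELDS if not report.get(f)]

draft_report = {"purpose_of_generation": "formative practice items",
                "generation_process": "template-based generation",
                "evaluation_method": "human review with a 3-point rubric"}
print(missing_fields(draft_report))
```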
3.5 Automated evaluation metrics
While numerous studies in AQG underscore the role of AQG systems in enhancing the
efficacy of educational assessments (Kurdi et al., 2020), the evaluation of
automatically generated questions remains an overlooked yet crucial aspect of
question generation. The inherent promise of any AQG system lies in its ability to
generate a large number of questions swiftly, efficiently, and cost-effectively.
Optimizing the evaluation of such a multitude of questions, however, requires
automated methods. Automated approaches not only enhance scalability but also
contribute to efficiency and cost-effectiveness in the evaluation of generated
questions. Consequently, we advocate that AQG researchers establish automated
evaluation methods that rely on minimal resources, such as reference questions for
metric-based methods or human judgments for training classifiers for question
evaluation. This aligns with the overarching goal of streamlining and
strengthening the evaluation processes that are integral to the effectiveness of
AQG systems in educational settings.
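As a minimal sketch of the kind of automated detector envisioned here, the example below trains a simple text classifier on hypothetical human acceptability judgments and uses it to score a new question. The tiny labeled set, the TF-IDF features, and the logistic regression model are illustrative assumptions; a usable detector would require substantially more labeled data and richer features.

```python
# Minimal sketch of an automated quality detector trained on human judgments:
# a TF-IDF + logistic regression classifier labels generated questions as
# acceptable (1) or not (0). The labeled examples are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

questions = [
    "What causes rain to form in clouds?",
    "Which process converts sunlight into chemical energy?",
    "What the rain cloud it?",
    "Sunlight energy which is?",
]
human_labels = [1, 1, 0, 0]  # 1 = judged acceptable by human raters

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(questions, human_labels)

new_question = "Which gas do plants absorb during photosynthesis?"
print(detector.predict([new_question]))        # predicted label
print(detector.predict_proba([new_question]))  # confidence estimate
```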
4 Conclusion
This study offers a comprehensive survey of the diverse evaluation methods and
quality criteria employed by researchers in AQG. To the best of our knowledge,
this is the first attempt to categorize the evaluation methods employed by AQG
researchers and to point out the strengths and limitations of each method.
Specifically, by introducing a novel taxonomy, we categorize the evaluation
methods based on key aspects of AQG systems, including input, resources,
benchmark, and quality criteria. Through the lens of this taxonomy, we examine the
strengths, limitations, and challenges inherent in each evaluation method. We
expect this taxonomy to serve as a valuable tool for AQG researchers, aiding them
in identifying optimal and efficient evaluation methods, along with quality
criteria suitable for assessing both system performance and the quality of the
generated questions.
Author contributions GG: Conceptualization, methodology, formal analysis, writing—original draft
preparation. OB: Conceptualization, supervision, writing—review and editing.
Funding This research did not receive any specific grant from funding agencies in the public, commer-
cial, or not-for-profit sectors.
Data availability The manuscript has no associated data.
Declarations
Consent for publication All authors read and approved the final manuscript.
Competing interests The authors have no conflicts of interest to declare that are relevant to the content
of this article.
References
Adegoke, B. A. (2013). Comparison of item statistics of physics achievement test using classical test and
item response theory frameworks. Journal of Education and Practice, 4(22), 87–96.
American Educational Research Association, American Psychological Association, National Council on
Measurement in Education. (2014). Standards for educational and psychological testing. Ameri-
can Educational Research Association.
Amidei, J., Piwek, P., & Willis, A. (2018). Evaluation methodologies in automatic question generation
2013-2018. Proceedings of The 11th International Natural Language Generation Conference (pp.
307–317). https:// doi. org/ 10. 18653/ v1/ W18- 6537
Anastasi, A., & Urbina, S. (2004). Psychological testing (7th ed.). Pearson.
Ashraf, Z. A. (2020). Classical and modern methods in item analysis of test tools. International Journal
of Research and Review, 7(5), 397–403.
Attali, Y., Runge, A., LaFlair, G. T., Yancey, K., Goodwin, S., Park, Y., & von Davier, A. A. (2022). The
interactive reading task: Transformer-based automatic item generation. Frontiers in Artificial Intel-
ligence, 5, 903077. https:// doi. org/ 10. 3389/ frai. 2022. 903077.
Baker, F. B. (2001). The basics of item response theory (2nd ed.). ERIC Clearinghouse on Assessment
and Evaluation.
Bandalos, D. L. (2018). Measurement theory and applications for the social sciences. Guilford
Publications.
Banerjee, S., & Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments. Proceedings of the ACL Workshop onIntrinsic and Extrinsic
Evaluation Measures for Machine Translation and/or Summarization, pp 65–72.
Becker, L., Basu, S., & Vanderwende, L. (2012). Mind the gap: Learning to choose gaps for question gen-
eration. Proceedings of the 2012 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, 742–751.
Bichi, A. A. (2016). Classical Test Theory: An introduction to linear modeling approach to test and item
analysis. International Journal for Social Studies, 2(9), 27–33.
Bulut, O., & Suh, Y. (2017). Detecting DIF in multidimensional assessments with the MIMIC model, the
IRT likelihood ratio test, and logistic regression. Frontiers in Education, 2(51), 1–14. https:// doi.
org/ 10. 3389/ feduc. 2017. 00051.
Chalifour, C. L., & Powers, D. E. (1989). The relationship of content characteristics of GRE analyti-
cal reasoning items to their difficulties and discriminations. Journal of Educational Measurement,
26(2), 120–132. https:// doi. org/ 10. 1111/j. 1745- 3984. 1989. tb003 23.x.
Chughtai, R., Azam, F., Anwar, M. W., Haider But, W., & Farooq, M. U. (2022). A lecture-centric auto-
mated distractor generation for post-graduate software engineering courses. International Confer-
ence on Frontiers of Information Technology (FIT), 2022, 100–105. https:// doi. org/ 10. 1109/ FIT57
066. 2022. 00028.
Chung, C.-Y., & Hsiao, I.-H. (2022). Programming Question Generation by a Semantic Network: A Pre-
liminary User Study with Experienced Instructors. In M. M. Rodrigo, N. Matsuda, A. I. Cristea,
& V. Dimitrova (Eds.), Artificial Intelligence in Education. Posters and Late Breaking Results,
Workshops and Tutorials, Industry and Innovation Tracks, Practitioners’ and Doctoral Consortium
(Vol. 13356, pp. 463–466). Springer International Publishing. https:// doi. org/ 10. 1007/ 978-3- 031-
11647-6_ 93.
Clauser, J. C., & Hambleton, R. K. (2011). Item analysis procedures for classroom assessments in higher
education. In C. Secolsky & D. B. Denison (Eds.), Handbook on Measurement, Assessment, and
Evaluation in Higher Education (pp. 296–309). Routledge.
Cohen, R. J., Swerdlik, M. E., & Phillips, S. M. (1996). Psychological testing and assessment: An intro-
duction to tests and measurement (3rd ed.). Mayfield Publishing Co.
Darling-Hammond, L., Herman, J., Pellegrino, J., Abedi, J., Aber, J. L., Baker, E., … & Steele, C. M.
(2013). Criteria for high-quality assessment. Stanford Center for Opportunity Policy in Education,
2, 171–192.
DeMars, C. (2010). Item response theory. Oxford University Press. https:// doi. org/ 10. 1093/ acprof: oso/
97801 95377 033. 001. 0001.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805. https:// doi. org/ 10.
48550/ arXiv. 1810. 04805
Dugan, L., Miltsakaki, E., Upadhyay, S., Ginsberg, E., Gonzalez, H., Choi, D., Yuan, C., & Callison-
Burch, C. (2022). A feasibility study of answer-agnostic question generation for education. Find-
ings of the Association for Computational Linguistics: ACL, 2022, 1919–1926.
Ebel, R. L., & Frisbie, D. A. (1986). Using test and item analysis to evaluate and improve test quality.
Essentials of educational measurement (Vol. 4, pp. 223–242). Prentice-Hall.
Engelhard, G., Jr., Davis, M., & Hansche, L. (1999). Evaluating the accuracy of judgments obtained from
item review committees. Applied Measurement in Education, 12(2), 199–210. https:// doi. org/ 10.
1207/ s1532 4818a me1202_6.
Ewell, P. T. (2008). Assessment and accountability in America today: Background and context. New
Directions for Institutional Research, 2008(S1), 7–17. https:// doi. org/ 10. 1002/ ir. 258.
French, C. L. (2001). A review of classical methods of item analysis [Paper presentation]. Annual meet-
ing of the Southwest Educational Research Association, New Orleans, LA, USA.
Fu, Y., Choe, E. M., Lim, H., & Choi, J. (2022). An Evaluation of Automatic Item Generation: A Case
Study of Weak Theory Approach. Educational Measurement: Issues and Practice, 41(4), 10–22.
https:// doi. org/ 10. 1111/ emip. 12529.
Gao, Y., Bing, L., Chen, W., Lyu, M. R., & King, I. (2019). Difficulty controllable generation of reading
comprehension questions. arXiv. http:// arxiv. org/ abs/ 1807. 03586. Accessed04/04/2023.
Gatt, A., & Krahmer, E. (2018). Survey of the state of the art in natural language generation: Core tasks,
applications and evaluation. Journal of Artificial Intelligence Research, 61, 65–170. https:// doi. org/
10. 1613/ jair. 5477.
Gierl, M. J., & Lai, H. (2012). The role of item models in automatic item generation. International Jour-
nal of Testing, 12(3), 273–298. https:// doi. org/ 10. 1080/ 15305 058. 2011. 635830
Gierl, M. J., Lai, H., & Tanygin, V. (2021). Methods for validating generated items: A focus on model-
level outcomes. In Advanced Methods in Automatic Item Generation (1st ed., pp. 120–143). Rout-
ledge. https:// doi. org/ 10. 4324/ 97810 03025 634.
Gierl, M. J., Lai, H., Pugh, D., Touchie, C., Boulais, A.-P., & De Champlain, A. (2016). Evaluating the
psychometric characteristics of generated multiple-choice test items. Applied Measurement in Edu-
cation, 29(3), 196–210. https:// doi. org/ 10. 1080/ 08957 347. 2016. 11717 68.
Gierl, M. J., Swygert, K., Matovinovic, D., Kulesher, A., & Lai, H. (2022). Three sources of validation
evidence are needed to evaluate the quality of generated test items for medical licensure. Teaching
and Learning in Medicine, 1–11. https:// doi. org/ 10. 1080/ 10401 334. 2022. 21195 69.
Gorgun, G., & Bulut, O. (2021). A polytomous scoring approach to handle not-reached items in low-
stakes assessments. Educational and Psychological Measurement, 81(5), 847–871. https:// doi. org/
10. 1177/ 00131 64421 991211.
Gorgun, G., & Bulut, O. (2022). Considering disengaged responses in Bayesian and deep knowledge trac-
ing. In M. M. Rodrigo, N. Matsuda, A. I. Cristea, & V. Dimitrova (Eds.), Artificial intelligence
in education. Posters and late-breaking results, workshops and tutorials, industry and innovation
Tracks, practitioners’ and doctoral consortium (pp. 591–594). Lecture Notes in Computer Science,
vol 13356. Springer. https:// doi. org/ 10. 1007/ 978-3- 031- 11647-6_ 122.
Gorgun, G., & Bulut, O. (2023). Incorporating test-taking engagement into the item selection algorithm
in low-stakes computerized adaptive tests. Large-Scale Assessments in Education, 11(1), 27.
https:// doi. org/ 10. 1186/ s40536- 023- 00177-5
Ha, L. A., & Yaneva, V. (2018). Automatic distractor suggestion for multiple-choice tests using con-
cept embeddings and information retrieval. Proceedings of the Thirteenth Workshop on Innova-
tive Use of NLP for Building Educational Applications, pp 389–398. https:// doi. org/ 10. 18653/ v1/
W18- 0548.
Haladyna, T. M., & Rodriguez, M. C. (2021). Using full-information item analysis to improve item qual-
ity. Educational Assessment, 26(3), 198–211. https:// doi. org/ 10. 1080/ 10627 197. 2021. 19463 90.
Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item writing
guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309–333. https://
doi. org/ 10. 1207/ S1532 4818A ME1503_5.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory.
Sage.
Heilman, M. (2011). Automatic factual question generation from text [Ph. D.]. Carnegie Mellon
University.
Heilman, M., & Smith, N. A. (2010). Good question! Statistical ranking for question generation. Human
Language Technologies: The 2010 Annual Conference of the North American Chapter of the Asso-
ciation for Computational Linguistics, pp 609–617.
Henning, G. (1987). A guide to language testing: Development, evaluation, research. Newberry House
Publishers.
Heubert, J. P., & Hauser, R. M. (Eds.). (1999). High stakes: Testing for tracking, promotion, and gradua-
tion. National Academy Press.
Hommel, B. E., Wollang, F.-J.M., Kotova, V., Zacher, H., & Schmukle, S. C. (2022). Transformer-based
deep neural language modeling for construct-specific automatic item generation. Psychometrika,
87(2), 749–772. https:// doi. org/ 10. 1007/ s11336- 021- 09823-9.
Hovy, E. (1999). Toward finely differentiated evaluation metrics for machine translation. Proceedings
of the EAGLES Workshop on Standards and Evaluation, Pisa, Italy. https://cir.nii.ac.jp/crid/1571417125255458048.
Huang, Y., & He, L. (2016). Automatic generation of short answer questions for reading comprehension
assessment. Natural Language Engineering, 22(3), 457–489. https:// doi. org/ 10. 1017/ S1351 32491
50004 55.
Huang, Y. T., & Mostow, J. (2015). Evaluating human and automated generation of distractors for diag-
nostic multiple-choice cloze questions to assess children’s reading comprehension. In C. Conati,
N. Heffernan, A. Mitrovic, & M. Verdejo (Eds.), Artificial Intelligence in Education. AIED 2015.
Lecture Notes in Computer Science. (Vol. 9112). Cham: Springer. https:// doi. org/ 10. 1007/ 978-3-
319- 19773-9_ 16
Impara, J. C., & Plake, B. S. (1998). Teachers’ ability to estimate item difficulty: A test of the assump-
tions in the Angoff standard setting method. Journal of Educational Measurement, 35(1), 69–81.
https:// doi. org/ 10. 1111/j. 1745- 3984. 1998. tb005 28.x.
Jenkins, H. M., & Michael, M. M. (1986). Using and interpreting item analysis data. Nurse Educator,
11(1), 10.
Jouault, C., Seta, K., & Hayashi, Y. (2016). Content-dependent question generation using LOD for his-
tory learning in open learning space. New Generation Computing, 34(4), 367–394. https:// doi. org/
10. 1007/ s00354- 016- 0404-x.
Kehoe, J. (1995). Basic item analysis for multiple-choice tests. Practical Assessment, Research, and Eval-
uation, 4(10), 1–3. https:// doi. org/ 10. 7275/ 07zg- h235.
Kim, S.-H., Cohen, A. S., & Eom, H. J. (2021). A note on the three methods of item analysis. Behavior-
metrika, 48(2), 345–367. https:// doi. org/ 10. 1007/ s41237- 021- 00131-1.
Kim, S., & Feldt, L. S. (2010). The estimation of the IRT reliability coefficient and its lower and upper
bounds, with comparisons to CTT reliability statistics. Asia Pacific Education Review, 11, 179–
188. https:// doi. org/ 10. 1007/ s12564- 009- 9062-8.
Kumar, V., Boorla, K., Meena, Y., Ramakrishnan, G., & Li, Y.-F. (2018). Automating reading compre-
hension by generating question and answer pairs (arXiv: 1803. 03664). arXiv. http:// arxiv. org/ abs/
1803. 03664.
Kurdi, G., Leo, J., Parsia, B., Sattler, U., & Al-Emari, S. (2020). A systematic review of automatic ques-
tion generation for educational purposes. International Journal of Artificial Intelligence in Educa-
tion, 30(1), 121–204. https:// doi. org/ 10. 1007/ s40593- 019- 00186-y.
Lane, S., Raymond, M. R., & Haladyna, T. M. (Eds.). (2016). Handbook of test development (2nd ed.).
Routledge.
Liang, C., Yang, X., Dave, N., Wham, D., Pursel, B., & Giles, C. L. (2018). Distractor generation for
multiple choice questions using learning to rank. Proceedings of the thirteenth workshop on inno-
vative use of NLP for building educational applications, pp. 284–290.
Liang, C., Yang, X., Wham, D., Pursel, B., Passonneaur, R., & Giles, C. L. (2017). Distractor generation
with generative adversarial nets for automatically creating fill-in-the-blank questions. Proceedings
of the Knowledge Capture Conference, 1–4. https:// doi. org/ 10. 1145/ 31480 11. 315446.
Lin, C., Liu, D., Pang, W., & Apeh, E. (2015). Automatically predicting quiz difficulty level using simi-
larity measures. Proceedings of the 8th International Conference on Knowledge Capture, 1–8.
https:// doi. org/ 10. 1145/ 28158 33. 28158 42.
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text Summarization
Branches Out, 74–81.
Linn, R. L. (2003). Accountability: Responsibility and reasonable expectations. Educational Researcher,
32(7), 3–13. https:// doi. org/ 10. 3102/ 00131 89X03 20070 03.
Liu, M., Rus, V., & Liu, L. (2017). Automatic Chinese factual question generation. IEEE Transactions on
Learning Technologies, 10(2), 194–204.
Livingston, S. A. (2013). Item analysis. Routledge. https:// doi. org/ 10. 4324/ 97802 03874 776. ch19.
Marrese-Taylor, E., Nakajima, A., Matsuo, Y., & Yuichi, O. (2018). Learning to automatically generate
fill-in-the-blank quizzes. arXiv. http:// arxiv. org/ abs/ 1806. 04524.
Maurya, K. K., & Desarkar, M. S. (2020). Learning to distract: A hierarchical multi-decoder network for
automated generation of long distractors for multiple-choice questions for reading comprehension.
Proceedings of the 29th ACM International Conference on Information & Knowledge Manage-
ment, 1115–1124. https:// doi. org/ 10. 1145/ 33405 31. 34119 97.
McCarthy, A. D., Yancey, K. P., LaFlair, G. T., Egbert, J., Liao, M., & Settles, B. (2021). Jump-starting
item parameters for adaptive language tests. Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing, 883–899. https:// doi. org/ 10. 18653/ v1/ 2021. emnlp-
main. 67.
Merriam-Webster. (2023). Metric. In Merriam-Webster.com dictionary. Retrieved November 3, 2023,
from https:// www. merri am- webst er. com/ dicti onary/ metric. Accessed 18 Sept 2023.
Mostow, J., Huang, Y.-T., Jang, H., Weinstein, A., Valeri, J., & Gates, D. (2017). Developing, evaluating,
and refining an automatic generator of diagnostic multiple-choice cloze questions to assess chil-
dren’s comprehension while reading. Natural Language Engineering, 23(2), 245–294. https:// doi.
org/ 10. 1017/ S1351 32491 60000 24.
Mulla, N., & Gharpure, P. (2023). Automatic question generation: A review of methodologies, datasets,
evaluation metrics, and applications. Progress in Artificial Intelligence, 12(1), 1–32. https:// doi.
org/ 10. 1007/ s13748- 023- 00295-9.
Nagy, P. (2000). The three roles of assessment: Gatekeeping, accountability, and instructional diagnosis.
Canadian Journal of Education / Revue Canadienne De L’éducation, 25(4), 262–279. https:// doi.
org/ 10. 2307/ 15858 50.
Nelson, D. (2004). The penguin dictionary of statistics. Penguin Books.
Newton, P. E. (2007). Clarifying the purposes of educational assessment. Assessment in Education: Prin-
ciples, Policy & Practice, 14(2), 149–170. https:// doi. org/ 10. 1080/ 09695 94070 14783 21.
Niraula, N. B., & Rus, V. (2015). Judging the quality of automatically generated gap-fill questions using
active learning. Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educa-
tional Applications, 196–206. https:// doi. org/ 10. 3115/ v1/ W15- 0623.
OECD. (2020). PISA 2022 technical standards. OECD Publishing.
Olney, A. M. (2021). Sentence selection for cloze item creation: A standardized task and preliminary
results. Joint Proceedings of the Workshops at the 14th International Conference on Educational
Data Mining, pp 1–5.
Osterlind, S. J. (1989). Judging the quality of test items: Item analysis. In S. J. Osterlind (Ed.), Construct-
ing Test Items (pp. 259–310). Springer. https:// doi. org/ 10. 1007/ 978- 94- 009- 1071-3_7.
Osterlind, S. J., & Everson, H. T. (2009). Differential item functioning. Sage Publications.
Osterlind, S. J., & Wang, Z. (2017). Item response theory in measurement, assessment, and evaluation for
higher education. In C. Secolsky & D. B. Denison (Eds.), Handbook on measurement, assessment,
and evaluation in higher education (pp. 191–200). Routledge.
Panda, S., Palma Gomez, F., Flor, M., & Rozovskaya, A. (2022). Automatic generation of distractors
for fill-in-the-blank exercises with round-trip neural machine translation. Proceedings of the 60th
Annual Meeting of the Association for Computational Linguistics: Student Research Workshop,
391–401.
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: A Method for Automatic Evaluation of
Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computa-
tional Linguistics, 311–318. https:// doi. org/ 10. 3115/ 10730 83. 10731 35.
Pennington, J., Socher, R., & Manning, D. (2014, October). Glove: Global vectors for word representa-
tion. Proceedings of the 2014 conference on empirical methods in natural language processing
(EMNLP), pp 1532–1543.
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine com-
prehension of text. arXiv preprint arXiv:1606.05250. https:// doi. org/ 10. 48550/ arXiv. 1606. 05250
Rezigalla, A. A. (2022). Item analysis: Concept and application. In M. S. Firstenberg & S. P. Stawicki
(Eds.), Medical education for the 21st century. IntechOpen. https:// doi. org/ 10. 5772/ intec hopen.
100138.
Rodriguez-Torrealba, R., Garcia-Lopez, E., & Garcia-Cabot, A. (2022). End-to-end generation of multi-
ple-choice questions using text-to-text transfer transformer models. Expert Systems with Applica-
tions, 208, 118258. https:// doi. org/ 10. 1016/j. eswa. 2022. 118258.
Settles, B., LaFlair, T. G., & Hagiwara, M. (2020). Machine learning–driven language assessment. Trans-
actions of the Association for Computational Linguistics, 8, 247–263. https:// doi. org/ 10. 1162/
tacl_a_ 00310.
Seyler, D., Yahya, M., & Berberich, K. (2017). Knowledge questions from knowledge graphs. Proceed-
ings of the ACM SIGIR International Conference on Theory of Information Retrieval, 11–18.
https:// doi. org/ 10. 1145/ 31210 50. 31210 73.
Song, L., & Zhao, L. (2017). Question generation from a knowledge base with web exploration. arXiv.
http:// arxiv. org/ abs/ 1610. 03807.
Suen, H. K. (2012). Principles of test theories. Routledge.
Tamura, Y., Takase, Y., Hayashi, Y., & Nakano, Y. I. (2015). Generating quizzes for history learning
based on Wikipedia articles. In P. Zaphiris & A. Ioannou (Eds.), Learning and Collaboration
Technologies (pp. 337–346). Springer International Publishing. https:// doi. org/ 10. 1007/ 978-3- 319-
20609-7_ 32.
Tarrant, M., Knierim, A., Hayes, S. K., & Ware, J. (2006). The frequency of item writing flaws in mul-
tiple-choice questions used in high-stakes nursing assessments. Nurse Education Today, 26(8),
662–671.
Towns, M. H. (2014). Guide to developing high-quality, reliable, and valid multiple-choice assessments.
Journal of Chemical Education, 91(9), 1426–1431. https:// doi. org/ 10. 1021/ ed500 076x.
Van Campenhout, R., Hubertz, M., & Johnson, B. G. (2022). Evaluating AI-generated questions: A
mixed-methods analysis using question data and student perceptions. In M. M. Rodrigo, N. Mat-
suda, A. I. Cristea, & V. Dimitrova (Eds.), Artificial Intelligence in Education (Vol. 13355, pp.
344–353). Springer International Publishing. https://doi.org/10.1007/978-3-031-11644-5_28.
Venktesh, V., Akhtar, Md. S., Mohania, M., & Goyal, V. (2022). Auxiliary task guided interactive atten-
tion model for question difficulty prediction. In M. M. Rodrigo, N. Matsuda, A. I. Cristea, & V.
Dimitrova (Eds.), Artificial Intelligence in Education (Vol. 13355, pp. 477–489). Springer Interna-
tional Publishing. https:// doi. org/ 10. 1007/ 978-3- 031- 11644-5_ 39.
Vie, J. J., Popineau, F., Bruillard, É., Bourda, Y. (2017). A review of recent advances in adap-
tive assessment. In: Peña-Ayala, A. (Ed.), Learning analytics: Fundaments, applications, and
trends. Studies in systems, decision, and control (113–142). Springer. https:// doi. org/ 10. 1007/
978-3- 319- 52977-6_4.
von Davier, M. (2018). Automated item generation with recurrent neural networks. Psychometrika, 83(4),
847–857. https:// doi. org/ 10. 1007/ s11336- 018- 9608-y.
Wang, Z., Lan, A. S., & Baraniuk, R. G. (2021). Math word problem generation with mathemati-
cal consistency and problem context constraints. arXiv. http:// arxiv. org/ abs/ 2109. 04546.
Accessed04/04/2023.
Wang, Z., Lan, A. S., Nie, W., Waters, A. E., Grimaldi, P. J., & Baraniuk, R. G. (2018). QG-net: A data-
driven question generation model for educational content. Proceedings of the Fifth Annual ACM
Conference on Learning at Scale, 1–10. https:// doi. org/ 10. 1145/ 32316 44. 32316 54.
Wang, Z., Valdez, J., Basu Mallick, D., & Baraniuk, R. G. (2022). Towards human-Like educational
question generation with large language models. In M. M. Rodrigo, N. Matsuda, A. I. Cristea, & V.
Dimitrova (Eds.), Artificial Intelligence in Education (Vol. 13355, pp. 153–166). Springer Interna-
tional Publishing. https:// doi. org/ 10. 1007/ 978-3- 031- 11644-5_ 13.
Wauters, K., Desmet, P., & Van Den Noortgate, W. (2012). Item difficulty estimation: An auspicious col-
laboration between data and judgment. Computers & Education, 58(4), 1183–1193.
Wind, S. A., Alemdar, M., Lingle, J. A., Moore, R., & Asilkalkan, A. (2019). Exploring student under-
standing of the engineering design process using distractor analysis. International Journal of
STEM Education, 6(1), 1–18. https:// doi. org/ 10. 1186/ s40594- 018- 0156-x.
Yang, A. C. M., Chen, I. Y. L., Flanagan, B., & Ogata, H. (2021). Automatic generation of cloze items
for repeated testing to improve reading comprehension. Educational Technology & Society, 24(3),
147–158.
Zhang, L., & VanLehn, K. (2016). How do machine-generated questions compare to human-generated
questions? Research and Practice in Technology Enhanced Learning, 11(1), 7. https:// doi. org/ 10.
1186/ s41039- 016- 0031-7.
Zilberberg, A., Anderson, R. D., Finney, S. J., & Marsh, K. R. (2013). American college students’ atti-
tudes toward institutional accountability testing: Developing measures. Educational Assessment,
18(3), 208–234. https:// doi. org/ 10. 1080/ 10627 197. 2013. 817153.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and
applicable law.