Technological Innovations
in Large-Scale Assessment
April L. Zenisky and Stephen G. Sireci
Center for Educational Assessment, School of Education
University of Massachusetts Amherst
Computers have had a tremendous impact on assessment practices over the past half
century. Advances in computer technology have substantially influenced the ways in
which tests are made, administered, scored, and reported to examinees. These
changes are particularly evident in computer-based testing, where the use of comput-
ers has allowed test developers to re-envision what test items look like and how they
are scored. By integrating technology into assessments, it is increasingly possible to
create test items that can sample as broad or as narrow a range of behaviors as needed
while preserving a great deal of fidelity to the construct of interest. In this article we
review and illustrate some of the current technological developments in com-
puter-based testing, focusing on novel item formats and automated scoring method-
ologies. Our review indicates that a number of technological innovations in perfor-
mance assessment are increasingly being researched and implemented by testing
programs. In some cases, complex psychometric and operational issues have suc-
cessfully been dealt with, but a variety of substantial measurement concerns associ-
ated with novel item types and other technological aspects impede more widespread
use. Given emerging research, however, there appears to be vast potential for expand-
ing the use of more computerized constructed-response type items in a variety of test-
ing contexts.
The rapid evolution of computers and computing technology has played a critical
role in defining current measurement practices. Many tests are now administered
on a computer, and a number of psychometric software programs are widely used
to facilitate all aspects of test development and analysis. Such technological ad-
vances provide testing programs with many new tools with which to build tests and
understand assessment data from a variety of perspectives and traditions. The po-
tential for computers to influence testing is not yet exhausted, however, as the
needs and interests of testing programs continue to evolve.
Test users increasingly express interest in assessing skills that can be difficult to
fully tap using traditional paper-and-pencil tests. As the potential for integrating
technology into task presentation and response collection has become more of a
practical reality, a variety of innovative computerized constructed-response item
types have emerged. Many of these new item types call for a reconceptualization of what
examinee responses look like, how they are entered into the computer, and how
they are scored (Bennett, 1998). This is good news for test users in all testing con-
texts, as a greater selection of item types may allow test developers to increase the
extent to which tasks on a test approximate the knowledge, skills, and abilities of
interest.
The purpose of this article is to review the current advances in computer-based
assessment, including innovative item types, response technologies, and scoring
methodologies. Each of these topics defines an area where applications of technol-
ogy are rapidly evolving. As research and practical implementation continue,
these emerging assessment methods are likely to significantly alter measurement
practices. In this article we provide an overview of the recent developments in task
presentation and response scoring algorithms that are currently used or have the
potential for use in large-scale testing.
INNOVATIONS IN TASK PRESENTATION
Response Actions
In developing test items for computerized performance assessment, one critical
component for test developers to think about is the format of the response an
examinee is to provide. It may at first seem backward to think about examinee re-
sponse formats before item stems, but how answers to test questions are structured
has an obvious bearing on the nature of the information being collected. Thus, as
the process of designing assessment tasks gets underway, some reflection on the
inferences to be made on the basis of test scores and how best to collect that data is
essential.
To this end, an understanding of what Parshall, Davey, and Pashley (2000)
termed response action and what Bennett, Morley and Quardt (2000) described as
response type may be helpful. Prior to actually constructing test items, some con-
sideration of the type of responses desired from examinees and the method by
which the responses could be entered can help a test developer to discern the kinds
of item types that might provide the most useful and construct-relevant informa-
tion about an examinee (Mislevy, Steinberg, Breyer, Almond, & Johnson, 1999).
In a computer-based testing (CBT) environment, the precise nature of the informa-
tion that test developers would like examinees to provide might be best expressed
in one of several ways. For example, examinees could be required to type text-
based responses, enter numerical answers via a keyboard or by clicking onscreen
buttons, or manipulate or synthesize information on a computer screen in some
way (e.g., use a mouse to direct an onscreen cursor to click on text boxes,
pull-down menus, or audio or video prompts). The mouse can also be used to draw
onscreen images as well as to “drag-and-drop” objects.
The keyboard and mouse are the input devices most familiar to examinees and
are the ones overwhelmingly implemented in current computerized testing appli-
cations, but response actions in a computerized context are not exclusively limited
to the keyboard and mouse. Pending further research, some additional input de-
vices by which examinees’ constructed responses could one day be collected in-
clude touch screens, light pens, joysticks, trackballs, speech recognition software
and microphones, and pressure-feedback (haptic) devices (Parshall, Davey, &
Pashley, 2000). These emerging methods represent inventive ways by
which test developers and users can gather different pieces of information about
examinee skills. However, at this time these alternate data collection mechanisms
are largely in experimental stages and are not yet implemented as part of many (if
any) novel item types. For this reason, we focus on emerging item types that use
the keyboard, mouse, or both for collecting responses from examinees.
Novel Item Types
For many testing programs, the familiar item types currently in use such as multi-
ple-choice and essay items provide sufficient measurement information for the
kinds of decisions being made on the basis of test scores. However, a substantial
number of increasingly interactive item types that may increase measurement in-
formation are now available, and some are being used operationally. This prolifer-
ation of item types has largely come about in response to requests from test con-
sumers for assessments aligned more closely with the constructs or skills being
assessed. Although many of these newer item types were developed for specific
testing applications such as licensure, certification, or graduate admissions testing,
it is possible to envision each of these item types being adapted in countless ways
to access different constructs as needed by a particular testing program.
Numerous computerized item types have emerged over the past decade, so it is
virtually impossible to illustrate and describe them all in a single article. Neverthe-
less, we conducted a review of the psychometric literature and of test publishers’
Web sites and selected several promising item types for presentation and discus-
sion. A nonexhaustive list of 21 of these item types is presented in Table 1, along
with a brief description of each type and some relevant references. Some of the
item types listed in Table 1 are being used operationally, whereas others have been
only proposed for use.
TABLE 1
Computerized Performance Assessment Item Types
Each entry gives the item format, a brief description, and selected citation(s).

Drag-and-drop (select-and-place): Given scenario or problem, examinees click and drag an object to the center of the appropriate answer field (see Figure 1). [Fitzgerald (2001); Luecht (2001); Microsoft Corporation (1998)]

Graphical modeling: Examinees use line and curve tools to sketch a given situation on a grid. [Bennett, Morley, & Quardt (2000); Bennett, Morley, Quardt, & Rock (2000)]

Move figure or symbols in/into pictographs: Examinees manipulate elements of chart or graph to represent certain situations or adjust or complete image as necessary (e.g., extending bars in a bar chart, see Figure 2). [Educational Testing Service (1993); French & Godwin (1996); Martinez (1991)]

Drag and connect, specifying relationships: Given presented objects, examinees identify the relationship(s) that exist between pairs of objects (see Figure 3). [Fitzgerald (2001); Luecht (2001)]

Concept mapping: Examinees demonstrate knowledge of interrelationships between data points by graphically representing onscreen images and text using links and nodes. [Chung, O’Neil, & Herl (1999); Klein, O’Neil, & Baker (1998)]

Sorting task: Given prototypes, examinees look for surface or deep structural similarities between presented items and prototypes and match items with prototype categories. [Bennett & Sebrechts (1997); Glaser (1991)]

Ordering information (create-a-tree): Examinees sequence events as required by item stem (e.g., largest to smallest, most to least probable cause of event, if–then; see Figure 4). [Educational Testing Service (1993); Fitzgerald (2001); Luecht (2001); Walker & Crandall (1999)]

Inserting text: Examinees drag and drop text into passage as directed by item stem (e.g., where it makes sense, serves as example of observation). [Educational Testing Service (1993); Taylor, Jamieson, Eignor, & Kirsch (1998)]

Passage editing: Examinees edit a short onscreen passage by moving the cursor to various points in a passage and selecting sentence rewrites from a list of alternatives on a drop-down menu. [Breland (1999); Davey, Godwin, & Mittelholtz (1997)]

Highlighting text: Examinees read a passage and select specific sentence(s) in the passage (e.g., main idea, particular piece of information). [Carey (2001); Taylor, Jamieson, Eignor, & Kirsch (1998); Walker & Crandall (1999)]

Capturing or selecting frames/Shading: Given directions or parameters, examinees use mouse to select portion of picture, map, or graph. [Hambleton (1997); O’Neil & Folk (1996)]

Mathematical expressions: Examinees generate and type in unique expression to represent mathematical relationship. [Bennett, Morley, & Quardt (2000); Bennett et al. (1997); Educational Testing Service (1993); Martinez & Bennett (1992)]

Numerical equations: Examinees complete numerical sentences by entering numbers and mathematical symbols in text box. [Hambleton (1997)]

Multiple numerical response: Examinees type in more than one numerical answer (e.g., complete tax form, insert numbers into a spreadsheet). [Hambleton (1997)]

Multiple selection: Examinees are presented with a stimulus (visual, audio, text) and select answer(s) from list (answers may be used more than once in series of questions). [Ackerman, Evans, Park, Tamassia, & Turner (1999); Mills (2000)]

Analyzing situations: Examinees are provided with visual/audio clips and short informational text and are asked to make diagnosis/decision. Response could be free-text entry or extended matching. [Ackerman, Evans, Park, Tamassia, & Turner (1999)]

Generating examples: Examinees create examples given certain situations or constraints; there is more than one correct answer. Response is free-text entry. [Bennett, Morley, & Quardt (2000); Bennett et al. (1999); Enright, Rock, & Bennett (1998); Nhouyvanisvong, Katz, & Singley (1997)]

Generating multiple solutions/Formulating hypotheses: Given situation, examinees generate plausible solutions or explanations. Response is free-text entry (see Figure 5). [Bennett & Rock (1995); Kaplan & Bennett (1994)]

Essay/Short answer: May be restricted or extended length. [Burstein et al. (1998); Rizavi & Sireci (1999)]

Problem-solving vignettes: Problem-solving situations (vignettes) are presented to examinees, who are graded on features of a product. [Bejar (1991); Fitzgerald (2001); Luecht (2001); Williamson, Bejar, & Hone (1999); Williamson, Hone, Miller, & Bejar (1998)]

Sequential problem solving/Role play: Examinees provide a series of responses as dynamic situation unfolds. Scoring attends to process and product. [Clauser, Harik, & Clyman (2000); Clauser et al. (1997)]
Our discussion of novel item types begins with those items requiring use of the
mouse for different onscreen actions as methods for data collection. Some of these
item types bear greater resemblance to traditional selected response item types and
are easier to score mechanically, whereas others integrate technology in more in-
ventive ways. After introducing these item types, we turn to those item types in-
volving text-based responses. Last, we focus on items with more complex ex-
aminee responses that expand the concept of what responses to test items look like
in fundamental ways and pose more difficult challenges for automated scoring.
Item types requiring use of a mouse.
Many of the emerging computer-
based item types take advantage of the way in which people interact with a com-
puter, specifically via a keyboard and mouse. The mouse and onscreen cursor pro-
vide a flexible mechanism by which items can be manipulated. Using a mouse,
pull-down menus, and arrow keys, examinees can highlight text, drag-and-drop
text and other stimuli, create drawings or graphics, or point to critical features of an
item. An example of a drag-and-drop item (also called a select-and-place item) is
presented in Figure 1.
FIGURE 1 Example of drag-and-drop item type.
This item type is used on a number of the Microsoft certifi-
cation examinations (Fitzgerald, 2001; Microsoft Corporation, 1998). These items
can be scored right/wrong or using partial credit.
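To make the scoring options concrete, the short sketch below (in Python) illustrates how a drag-and-drop response might be compared to an answer key under either rule; the object and target labels are hypothetical and do not come from any operational examination.

```python
# Hypothetical sketch: scoring a drag-and-drop (select-and-place) response.
# The key maps each draggable object to the target field where it belongs.

def score_drag_and_drop(response, key, partial_credit=True):
    """Return a score for a drag-and-drop item.

    response: dict mapping object id -> target id chosen by the examinee
    key:      dict mapping object id -> correct target id
    """
    n_correct = sum(1 for obj, target in key.items()
                    if response.get(obj) == target)
    if partial_credit:
        # Polytomous score: number of correctly placed objects.
        return n_correct
    # Dichotomous score: all placements must be correct.
    return int(n_correct == len(key))


key = {"router": "network_layer", "switch": "data_link_layer", "hub": "physical_layer"}
response = {"router": "network_layer", "switch": "physical_layer", "hub": "physical_layer"}

print(score_drag_and_drop(response, key))                        # 2 (partial credit)
print(score_drag_and_drop(response, key, partial_credit=False))  # 0 (right/wrong)
```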
The graphical modeling item type also uses the drag-and-drop capability of a
computer. This item type requires examinees to sketch out situations graphically
using onscreen line, curve tools, or both (Bennett, Morley, & Quardt, 2000;
Bennett, Morley, Quardt, & Rock, 2000). A similar item type using drag-and-drop
technology is the move figures or symbols into pictographs item type, which is pre-
sented in Figure 2. This item type requires examinees to drag a shape and position
it on a grid given certain parameters or constraints in the item stem (French &
Godwin, 1996; Martinez, 1991).
A variation of the drag-and-drop item is the drag-and-connect item type. This
item type presents examinees with several movable objects that can be arranged in
several different target locations onscreen. For example, when all the objects are
correctly assembled (items are sequenced or organized accurately), a network
would function correctly or a flowchart would appropriately illustrate a network pro-
tocol. An extension of this item type is the specifying relationships item type in
which examinees move objects around onscreen and link them in a flowchart by
way of clicking relationships such as “one to one,” “many to one,” or “one to zero”
(Fitzgerald, 2001; Luecht, 2001). An example of this item type is presented in Fig-
ure 3.
FIGURE 2 Example of moving figures or symbols in/into a pictograph item type.
Another item type that can be used to assess the understanding of relation-
ships is the concept map item type. Having an examinee create onscreen concept
maps can allow for relationships between items or pieces of information to be il-
lustrated (Chung, O’Neil, & Herl, 1999; Herl, O’Neil, Chung, & Schacter, 1999;
Klein, O’Neil, & Baker, 1998).
Items that delve into assessing ordering and sorting information are increas-
ingly using drag-and-drop actions. Bennett and Sebrechts’s (1997) sorting task
item type (also studied by Glaser, 1991) gives examinees the chance to communi-
cate knowledge about underlying relationships between items by dragging and
dropping focal items onto the target prototypes to which they best align according to
some surface or deep structural feature.
The ordering information item type, also referred to in the literature as a cre-
ate-a-tree item, requires examinees to use the mouse to exhibit understanding of
the material tested. The stem of this item type specifies the way in which the
examinee should arrange elements in a process. The examinee clicks on a focal ob-
ject and then places it into a target location by dragging and dropping or by click-
ing on onscreen radio buttons to move the item as needed (Fitzgerald, 2001;
Luecht, 2001; Walker & Crandall, 1999).
FIGURE 3 Example of specifying relationships item type.
Some sequences in which these focal
items might be arranged include largest to smallest, most-to-least probable cause
of an event, or following an if–then framework. Figure 4 presents an example of an
ordering information/create-a-tree item type.
Some computer-based item types are used specifically to assess verbal
communication and comprehension skills. One, the inserting text item
type, presents examinees with a sentence that must be dragged and dropped into
the appropriate place in a passage (Carey, 2001; Jamieson, Taylor, Kirsch, &
Eignor, 1998). A similar item type is the passage editing item; the examinees move
the cursor to various onscreen “hot spots” where a drop-down menu appears with a
list of potential sentence or phrase rewrites, and the examinee must select the best
alternative from the list (Breland, 1999; Davey, Godwin, & Mittelholtz, 1997). The
alternatives could range from radical changes to no change at all, with the different
options being scored correct/incorrect or using a graded scale.
FIGURE 4 Example of ordering information/create-a-tree item type.
Another objectively scorable item type that uses mouse manipulation is the
highlighting information item type where the examinee uses the cursor to select a
target phrase or sentence within a passage. Examples of this item type include
identifying the main idea of a paragraph or antecedents of pronouns (Carey, 2001;
Taylor, Jamieson, Eignor, & Kirsch, 1998; Walker & Crandall, 1999). A similar
item type is the capturing/selecting frames item type (sometimes also referred to
as shading) that directs the examinee to click on portions of a graphic as needed
(Hambleton, 1997; O’Neil & Folk, 1996).
The item types discussed thus far could be described as “click-on” item types,
as clicking on one or more objects is required. For all of these item types
examinees must select or highlight information as directed by the item stem by
moving the mouse, which correspondingly moves an onscreen cursor. Many of
these item types might be considered by some as little more than extended multi-
ple-choice items, but generally with such items the objects that could be selected
are so numerous they require skills above and beyond the test-taking skills helpful
for success on traditional selected-response items. The multiple selection item type
is a good example of an item with this property, in that the examinee is expected to
select text or onscreen items using the mouse given instructions such as “choose all
that apply” or “select three” (Ackerman, Evans, Park, Tamassia, & Turner, 1999;
Mills, 2000). Obviously, such items reduce the chance of answering the item cor-
rectly by guessing, relative to a traditional multiple-choice item.
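The reduction in guessing can be illustrated with simple arithmetic; the sketch below uses hypothetical option counts (eight options, three keyed) rather than parameters from any actual test.

```python
# Illustrative arithmetic only: hypothetical option counts, not from any real item.
from math import comb

options, keyed = 8, 3
p_multiple_selection = 1 / comb(options, keyed)  # choose exactly the 3 keyed options
p_multiple_choice = 1 / 4                        # conventional four-option item

print(f"P(guess multiple selection) = {p_multiple_selection:.4f}")  # 0.0179 (1 in 56)
print(f"P(guess multiple choice)    = {p_multiple_choice:.4f}")     # 0.2500
```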
Innovations in items with text-based responses.
Moving from the mouse
to the keyboard as the mechanism for examinees to enter responses, a number of
both novel and more familiar item types become increasingly useful in large-scale
performance assessment. Although having an examinee write a short answer or an
essay in a text box is not particularly novel, collecting such answers via computer
can be effective for data management purposes and is increasingly likely to be the
preferred method of evaluating writing skills (to the extent that typing is accepted
as a skill directly relevant to the construct(s) of interest). Similarly, in the mathe-
matical expressions, numerical equations, and multiple numerical response item
types, the examinee can type answers into free-response text boxes (Bennett,
Morley, & Quardt, 2000; Bennett, Steffen, Singley, Morley, & Jacquemin, 1997).
Although these item types may not be especially innovative in and of themselves,
the responses can be surprisingly complex to complete, manage, and score because
there are multiple ways in which any mathematical or text-based response could be
expressed.
One novel item type in which examinees respond by means of a text box is the
generating examples item type. Problems and constraints are presented to the
examinee, whose task it is to pose one or several solutions that are feasible under
such parameters (Bennett, Morley, & Quardt, 2000; Bennett et al., 1999; Enright,
Rock, & Bennett, 1998; Nhouyvanisvong, Katz, & Singley, 1997). In some appli-
cations of this item type, the responses are numerical. In fact, this item type was orig-
inally designed in part to broaden the measurement of the construct “quantitative
reasoning” on the Graduate Record Examination, or GRE. Generating examples,
as an item type, is a variant of the generating solutions/formulating hypotheses
item type (Bennett & Rock, 1995; Kaplan & Bennett, 1994), an example of which
is presented in Figure 5. In the formulating hypotheses item type, an examinee is
first presented with a situation of some kind. The task then is to generate as many
plausible explanations or causes for the situation as possible.
Complex constructed-response items for CBT.
Some of the most intrigu-
ing advances in CBT are the problem solving vignettes used on professional li-
censure exams. Typically, the vignettes presented to licensure candidates reflect
real-world problems, and the computer simulates real-world responses. An example
of this increased fidelity in measurement is the problem solving vignette item type
found on the Architectural Registration Examination (ARE). On the ARE, ex-
aminees are asked to complete several design tasks (such as design a building or lay
out a parking lot) using a variety of onscreen drawing tools (Bejar, 1991; William-
son, Bejar, & Hone, 1999; Williamson, Hone, Miller, & Bejar, 1998).
FIGURE 5 Example of formulating hypotheses/generating solutions item type. Source: The formulating-hypotheses item type. (n.d.). Retrieved February 26, 2001, from http://www.ets.org/research/rsc/alcadia.html Copyright © Educational Testing Service. Used with permission.
By and large,
these items seem highly authentic to examinees and give test users data about
examinee ability in relation to actual, standardized architectural design tasks. In this
case, the generalizability from the test item to job performance as described by Kane
(1992) is high.
Closely related to such problem-solving vignettes are dynamic problem solving
tasks, sometimes referred to as role-play exercises. In measurement, we typically
think of adaptive testing as dynamic between items, but advances in computer
hardware and software now allow some testing programs to create tests that are
adaptive within an item, where an item may be defined as an extended role-playing
task. The computerized case-based simulations used by the National Board of
Medical Examiners incorporate the idea of the simulated patient whose symptoms
and vital statistics change over time in response to the actions (or nonactions) taken
by the candidate (Clauser, Harik, & Clyman, 2000; Clauser, Margolis, Clyman, &
Ross, 1997). As the examinee manages a case, new symptoms may emerge, the
clock is ticking, and the prospective doctor’s actions have the potential to harm as
well as help the simulated patient. As the case progresses, the examinee may have
to deal with unintended medical side effects as well as the patient’s original medi-
cal condition. Each examinee is scored on the sequence of response actions they
enter into the computer, such as requesting tests on the patient and writing treatment
orders, as well as on their diagnostic acuity. Similar dynamic simulation tasks are used for
aviation selection and training.
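A heavily simplified sketch of such within-item adaptivity is shown below; the patient states, available actions, and scoring weights are invented for illustration and are not taken from the NBME simulations.

```python
# Toy sketch of a dynamic case simulation: the simulated patient's state
# changes over time and in response to the actions the examinee enters.
# All states, actions, and weights are hypothetical.

class PatientCase:
    def __init__(self):
        self.vitals = {"bp": 90, "temp": 101.5}   # deteriorating patient
        self.elapsed_minutes = 0
        self.score = 0.0

    def advance_clock(self, minutes):
        self.elapsed_minutes += minutes
        self.vitals["bp"] -= 0.1 * minutes        # untreated, the condition worsens

    def take_action(self, action):
        # Each keyed-in action is weighted as beneficial, neutral, or risky.
        weights = {"order_blood_culture": +1.0,
                   "start_iv_fluids": +2.0,
                   "order_unneeded_surgery": -3.0}
        self.score += weights.get(action, 0.0)
        if action == "start_iv_fluids":
            self.vitals["bp"] += 10               # treatment changes the patient state

case = PatientCase()
case.take_action("order_blood_culture")
case.advance_clock(30)
case.take_action("start_iv_fluids")
print(case.vitals, case.score)
```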
Media in Item Stems
A further emerging dimension of novel item types relates to what Parshall, Davey,
and Pashley (2000) referred to as media inclusion: the use of graphics, video, and
audio within an item or set of items. Multimedia can be used at various points in
the item stem for a variety of purposes: to better illustrate a particular situation, to
allow examinees to visualize a problem, or to better assess a specified construct
(e.g., music-listening aptitude).
Audio prompts in large-scale, noncomputerized testing have been largely con-
fined to music and language tests, with partial success in those areas. However,
Parshall and Balizet (2001) defined a framework for considering four uses of an
audio component in CBTs, including speech audio for listening tests, nonspeech
audio (e.g., music) for listening tests, speech audio for alternative assessment
(such as accommodating tests for limited-English proficient, reading disabled, or
visually disabled examinees), and speech and nonspeech audio incorporated into
the user interface. From a measurement perspective, as Vispoel (1999) noted,
when tests of music listening or aptitude are administered to a group in a non-CBT
format, compromises in administrative efficiency and measurement accuracy of-
ten leave examinee scores on such tests with questionable reliability. The differ-
ence in a computer-based setting is that the test can be administered individually
and this format of test administration permits examinees to proceed at their own
pace (Parshall & Balizet, 2001). An example of the successful use of audio
prompts in large-scale computer-based assessment is Educational Testing Ser-
vice’s (ETS’s) Test of English as a Foreign Language, which incorporates such
features.
Graphics, on the other hand, have generally enjoyed more extensive use in com-
puterized assessment. For example, the presentation of digitized pictures has been
used successfully by Ackerman, Evans, Park, Tamassia, and Turner (1999) in a test
of dermatological skin disorders. Examinees can use a zoom feature to get a better
look at the picture before selecting the correct diagnosis from a list of 110 alpha-
betized disorders. Although this item type would be strictly classified as a se-
lected-response item, rather than constructed-response, the emulation of the diag-
nostic processes of professional dermatologists allows examinees to demonstrate a
higher order grasp of the information and reduces the likelihood of guessing. On
other tests, onscreen images can be rotated, resized, selected, clicked on, and
dragged to form a meaningful image, depending on the item type (Klein, O’Neil,
& Baker, 1998). Some of the items described in Table 1, including graphical mod-
eling, concept mapping, and moving a figure into a graph, are examples of tasks
where the graphical manipulations compose the constructed responses.
Furthermore, most desktop computers now have video capabilities that make
the inclusion of video prompts in performance assessment highly feasible. Interac-
tive video assessment has been used operationally with the Workplace Situations
test at IBM (Desmarais et al., 1992), the Allstate Multimedia In-Basket (Ashworth
& Joyce, 1994), and the Conflict Resolution Skills Assessment (Drasgow, Olson-
Buchanan, & Moberg, 1999). In the Conflict Resolution Skills Assessment, for
example, an examinee views a conflict scene of approximately 2 minutes’ dura-
tion, which stops at a critical point and asks the examinee to select one of four re-
sponse options. Based on this selection, the action branches and continues until a
second critical point is reached, and so forth.
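One simple way to represent this kind of branching is as a script of scenes keyed by the option chosen at each critical point; the scene labels and branch structure in the sketch below are hypothetical.

```python
# Hypothetical branching structure for an interactive video item:
# each scene stops at a critical point, and the selected option
# determines which scene plays next.

branching_script = {
    "scene_1": {"prompt": "A coworker disputes your decision. What do you do?",
                "options": {"A": "scene_2a", "B": "scene_2b",
                            "C": "scene_2a", "D": "scene_2b"}},
    "scene_2a": {"prompt": "The coworker escalates to a manager. Now what?",
                 "options": {"A": "end", "B": "end", "C": "end", "D": "end"}},
    "scene_2b": {"prompt": "The coworker agrees to talk it over. Now what?",
                 "options": {"A": "end", "B": "end", "C": "end", "D": "end"}},
}

def play(script, responses, start="scene_1"):
    """Walk the branching script given a sequence of selected options."""
    path, scene = [], start
    for choice in responses:
        path.append((scene, choice))
        scene = script[scene]["options"][choice]
        if scene == "end":
            break
    return path

print(play(branching_script, ["B", "A"]))
```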
Selected-response formats may be easier to implement because they sidestep the
essentially unlimited range of possible solutions that constructed responses allow.
Even so, it is possible to envision a day in the near future when advanced computing
technology such as virtual reality is used to model interactive situations more
dynamically and complex software applications are developed to score the resulting
responses successfully.
Use of Reference Materials
One additional new facet of how examinees complete items is the extent to which
examinees may use reference materials as they go through the test. On some as-
sessments, examinees can use calculators or other reference materials. Indeed, cal-
culators may be used on the SAT I (College Board, 2000b) and the ACT Mathe-
matics test (American College Testing Program, 2000) and are required on two of
the mathematics sections of the SAT II (College Board, 2000a). In terms of a
credentialing/licensure test, one section of the mathematics assessment of the
Praxis teacher certification test specifically prohibits calculators, one section al-
lows them, and other sections require them (ETS, 2000). The ARE also requires
examinees to supply their own calculators.
Additional examples of auxiliary information examinees may access are found
on the credentialing examinations administered for Novell software certification
and on the ARE. One of Novell’s certification exams measures candidates’ ability
to quickly navigate two reference CDs to locate information necessary to complete
a task (D. Foster, personal communication, April 11, 2000). One CD contains tech-
nical product information, and the other contains a technical library detailing in-
formation about cables, hard drives, monitors, and CPUs. The tasks tap the candi-
dates’ ability to “research” these CDs to locate the content necessary to solve a
problem. The ARE also allows candidates to access resource material via the com-
puter. Candidates can retrieve certain subject specific information about building
code requirements, program constraints, and vignette specifications on demand as
they design structures in accordance with the presented task directions (National
Council of Architectural Registration Boards, 2000).
Novell certification exams have an additional interactive feature: Candidates
who take a non-English version of an exam can access the English-language ver-
sion of each item. If they so desire, candidates can click on a button to switch back
and forth between the language in which they are taking the test and English (Fos-
ter, Olsen, Ford, & Sireci, 1997).
INNOVATIONS IN SCORING COMPLEX
CONSTRUCTED-RESPONSES
Computerized-adaptive testing (CAT) is increasingly attractive to test developers
as a way to increase the amount of information examinee responses provide about
ability. However, as Parshall, Davey, and Pashley (2000) pointed out, an adaptive test
works best if the computer can score examinee responses to the test items automat-
ically and instantaneously. This has not historically been a problem when the items
being used are selected-response, as the computer can easily compare the sequence
of responses from each examinee to the programmed answer key. The traditional
multiple-choice item, scored with a dichotomous item response model, is the item
type principally used in CAT, but some variations on the multiple-choice item,
such as multiple numerical response, graphical modeling, drag-and-drop, multiple
selection, and ordering information might also be scored fairly easily using a
polytomous item response model.
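As an illustration, category probabilities for such a polytomously scored item could be computed under a model such as the generalized partial credit model; the discrimination and step parameters in the sketch below are hypothetical.

```python
# Minimal sketch: category response probabilities under the generalized
# partial credit model (GPCM) for a polytomously scored innovative item.
# The discrimination and step parameters below are hypothetical.
import numpy as np

def gpcm_probs(theta, a, steps):
    """P(X = k | theta) for k = 0..m, with step parameters b_1..b_m."""
    # Cumulative sums of a * (theta - b_j); category 0 has an empty sum of 0.
    z = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(steps)))))
    expz = np.exp(z - z.max())        # stabilize before normalizing
    return expz / expz.sum()

theta = 0.5                            # examinee ability
a, steps = 1.2, [-0.8, 0.1, 1.3]       # hypothetical item parameters (4 categories)
print(np.round(gpcm_probs(theta, a, steps), 3))
```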
A problem in incorporating many of the innovative performance tasks in Table
1 into a CAT format or into a linear CBT, however, is that each examinee provides a
unique response for each item. Thus, the structure and nature of those answers may
vary widely across examinees, and scoring decisions cannot typically be made im-
mediately using mechanical application of limited, explicit criteria (Bennett,
Sebrechts, & Rock, 1991). Legitimate logistical difficulties in automatically scor-
ing these responses might at first seem to preclude the development of CAT perfor-
mance assessments, although a CBT format might be feasible (as scoring can occur at
a later date). However, current developments in psychology, computer science,
communication disorders, and artificial intelligence reveal several promising di-
rections for the future of computerized performance assessment.
Clearly, a consideration in terms of automated scoring is the level of constraint
desired in the constructed response. Among the various constructed-response item
types, it is possible to constrain any individual item in such a way that there are sev-
eral or infinite possible answers, just as an item can be written to ensure that there
is only one correct response. Take a graphical modeling problem, for example,
where the task is to model the growth of an interest-bearing account. This
could be highly specific, such that given a base amount of money, an interest rate,
and a length of time, an examinee would use the mouse to graphically represent an
outcome. Alternatively, given certain information the examinee could be asked to
extrapolate future outcomes from current data, and in this case (depending on how
an examinee synthesized the available information) there might be more than one
appropriate way to respond graphically.
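Under the more constrained version of the item, automated scoring could reduce to checking the plotted points against the keyed curve within some tolerance; the principal, interest rate, and tolerance in the sketch below are invented for illustration.

```python
# Hypothetical sketch: scoring a constrained graphical-modeling response in
# which the examinee plots the balance of an interest-bearing account.
# Principal, rate, and tolerance are invented for illustration.

def keyed_balance(year, principal=1000.0, rate=0.05):
    return principal * (1 + rate) ** year          # annual compounding

def score_plot(plotted_points, tolerance=25.0):
    """plotted_points: list of (year, balance) pairs read from the examinee's graph."""
    errors = [abs(balance - keyed_balance(year)) for year, balance in plotted_points]
    return int(all(err <= tolerance for err in errors))  # 1 = acceptable, 0 = not

response = [(0, 1000), (5, 1280), (10, 1630)]
print(score_plot(response))   # 1: each point is within $25 of the keyed curve
```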
Currently, much of the work in terms of automated scoring has focused on de-
veloping techniques for scoring essays, computer programs, simulated patient in-
teractions, and architectural designs online. The most prominent computer-based
scoring methods, described in the following sections, are summarized in Table 2.
These scoring methods fall into three general categories: essay scoring, expert sys-
tems analysis, and mental modeling.
Automated Scoring of Essays
Essays are perhaps the most common form of constructed responses used in
large-scale assessment, although their use is limited because many testing pro-
grams need two and sometimes three humans to read and evaluate essays accord-
ing to preestablished scoring rubrics. Other constructed-response text-based item
types typically found on paper-and-pencil tests, such as the accounting problems
on the Uniform Certified Public Accountants Examination, are scored in a simi-
larly labor-intensive fashion. Given the expense of the extensive reader training
required to establish score validity, computerized alternatives to human
readers are highly attractive. Research on automated scoring programs and meth-
ods has, for the most part, demonstrated the comparability of essay scoring across
human and computer graders (Burstein, Kukich, Wolff, Lu, & Chodorow, 1998;
Rizavi & Sireci, 1999; Yang, Buckendahl, Juszkiewicz, & Bhola, 2002). Currently,
several computerized essay scoring options and methods are available, including
project essay grade (PEG), e-rater, latent semantic analysis, text categorization,
and the constructed-response scoring tool.

TABLE 2
Summary of Automated Scoring Programs/Methods
Each entry gives the scoring method, a description, and citation(s).

Essays/Free-text answers:

Project essay grade: Uses a regression model where the independent variables are surface features of the text (document length, word length, and punctuation) and the dependent variable is the essay score. [Page (1994)]

E-rater: Evaluates numerous structural and linguistic features specified in a holistic scoring guide using natural language processing techniques. [Burstein & Chodorow (in press); Burstein et al. (1998)]

Latent semantic analysis: A theory and method for extracting and representing the contextual usage of words by statistical computations applied to a large corpus of text. [Foltz, Kintsch, & Landauer (1998); Landauer, Foltz, & Laham (1998); Landauer et al. (1997)]

Text categorization: Evaluates text using automated learning techniques to categorize text documents, where linguistic expressions and contexts extracted from the texts are used to classify texts. [Larkey (1998)]

Constructed free response scoring tool: Scores short verbal answers, where examinees key in responses that are pattern-matched to programmed correct responses. [Martinez & Bennett (1992)]

Expert systems: Examinee’s completed response is compared to a problem-specific knowledge base encoded within the computer’s memory banks. The knowledge base is constructed from human content-expert responses that have been coded in a machine-usable form. [Bennett & Sebrechts (1996); Braun et al. (1990); Martinez & Bennett (1992); Sebrechts, Bennett, & Rock (1991)]

Mental modeling: Elements of the final product can be evaluated against the universe of all possible variations using a process that mimics the scoring processing of committees and requires an analysis of the way experts evaluate solutions. Scores can be compared to the results obtained from human raters to assess agreement. [Bejar (1991); Clauser (2000); Clauser, Harik, & Clyman (2000); Clauser et al. (1997); Martinez & Bennett (1992); Williamson, Bejar, & Hone (1999); Williamson et al. (1998)]
Project essay grade.
PEG, developed by Ellis Page in the mid-1960s, was
the first automated essay grading system; the current version evolved from his ear-
lier work (Page, 1994; Page & Petersen, 1995). Like most computerized essay
scoring programs, the specifics of how PEG works are proprietary. However, de-
scriptions of the program suggest that it uses multiple regression to determine the
optimal combination of the surface features of an essay (e.g., average word length,
essay word length, number of uncommon words, number of commas) as well as
complex structure of the essay (e.g., soundness of sentence structure) to best pre-
dict the score that would be assigned by a human grader (Page, 1994; Page &
Petersen, 1995). By assigning weights to these surface and intrinsic features, the
computer attempts to mimic human scoring. Although it is unclear whether PEG is
currently being used in large-scale assessment, it is clear that it set the stage for
other developments in the computerized scoring of essays.
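The general logic, though not the proprietary details, can be sketched as a regression of human scores on a handful of surface features; the features, training essays, and scores below are invented and are not drawn from PEG itself.

```python
# Sketch of PEG-style scoring logic (not the proprietary system): regress
# human holistic scores on surface features of training essays, then apply
# the fitted weights to a new essay. Feature set and data are hypothetical.
import numpy as np

def surface_features(text):
    words = text.split()
    return np.array([len(words),                               # essay length in words
                     np.mean([len(w) for w in words]),         # average word length
                     text.count(","),                          # comma count
                     1.0])                                     # intercept term

train_essays = ["Short essay with few ideas.",
                "A considerably longer essay, with commas, more elaborate wording, "
                "and noticeably richer vocabulary throughout the response.",
                "An essay of middling length, showing some development of ideas."]
human_scores = np.array([2.0, 5.0, 3.0])

X = np.vstack([surface_features(t) for t in train_essays])
weights, *_ = np.linalg.lstsq(X, human_scores, rcond=None)     # least-squares fit

new_essay = "A response of moderate length, with one comma, and clear wording."
print(round(float(surface_features(new_essay) @ weights), 2))  # predicted score
```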
E-rater.
E-rater (Burstein et al., 1998; Burstein & Chodorow, in press) is the
essay scoring system developed by ETS for the essay portion of the Graduate Man-
agement Admission Test (GMAT). It is designed to evaluate numerous structural
and linguistic features specified in a holistic scoring guide. On the GMAT, each
examinee responds to two essay questions, which are scored by both a trained hu-
man grader and an electronic reader. Currently, e-rater serves as the second reader.
If human and e-rater scores on a particular essay differ by more than one point, the
essay is sent to a second human expert, and finally, if consensus is still not reached,
to a final human referee. Thus, the GMAT scoring system provides an example of
how the computer can be used to increase the efficiency of essay scoring while
maintaining the validity of the final scores assigned to an essay.
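That resolution logic can be expressed compactly, as in the schematic below; the one-point threshold comes from the description above, whereas the averaging rule and function names are assumptions made for illustration.

```python
# Schematic of the human/e-rater score resolution described above: the
# machine serves as second reader, and discrepant essays are routed to
# additional human readers. Names and the averaging rule are illustrative.

def resolve_essay_score(human_1, machine, second_human=None, referee=None):
    if abs(human_1 - machine) <= 1:
        return (human_1 + machine) / 2          # scores agree closely enough
    if second_human is not None and abs(human_1 - second_human) <= 1:
        return (human_1 + second_human) / 2     # second human resolves the discrepancy
    return referee                              # final human referee decides

print(resolve_essay_score(4, 4))                             # 4.0
print(resolve_essay_score(5, 3, second_human=5))             # 5.0
print(resolve_essay_score(5, 3, second_human=2, referee=4))  # 4
```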
Latent semantic analysis.
Latent semantic analysis (LSA) is a theory and
method for extracting and representing the contextual usage of words by statisti-
cal computations applied to a large corpus of text (Foltz, Kintsch, & Landauer,
1998; Landauer, Foltz, & Laham, 1998; Rehder et al., 1998). The underlying
idea is that the aggregate of all the word contexts in which a given word does
and does not appear provides a set of mutual constraints that largely determines
the similarity of meaning of words and sets of words to each other. A possible
analogy for LSA could be the way multidimensional scaling allows relationships
between variables to be plotted in n dimensions. In LSA, words can be mapped
into semantic space and distances between words are derived from shadings of
meaning, which are obtained through context. LSA’s algorithm has a learning
component that “reads” through a text and develops an understanding of the sen-
tence or passage by evolving a meaning for each word in relation to all the other
words in the sentence or passage. The LSA system can be “trained” to work in
different content areas by having it electronically read texts relevant to the do-
main of interest. One possible caveat to the use of LSA is that, at this time, the algo-
rithm does not derive sentence meaning from word order, a limitation that examinees
could potentially exploit, although ongoing research addresses this point (Landauer,
Laham, Rehder, & Schreiner, 1997).
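A minimal sketch of the core computation is given below: build a term-by-document matrix, reduce it with a singular value decomposition, and compare a response to reference texts by cosine similarity in the reduced space. The toy corpus and the number of retained dimensions are illustrative only.

```python
# Toy latent semantic analysis sketch: term-document matrix -> truncated SVD
# -> cosine similarity between a response and reference texts in the reduced
# semantic space. Corpus and dimensionality are illustrative only.
import numpy as np

corpus = ["interest rates affect account growth",
          "compound interest makes an account grow over time",
          "the patient presented with fever and low blood pressure"]
vocab = sorted({w for doc in corpus for w in doc.split()})

def term_doc_matrix(docs):
    X = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for w in doc.split():
            if w in vocab:
                X[vocab.index(w), j] += 1
    return X

X = term_doc_matrix(corpus)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                           # retain k latent dimensions

def project(docs):
    return (U[:, :k].T @ term_doc_matrix(docs)).T

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

response = project(["interest helps the account grow"])[0]
for doc, ref in zip(corpus, project(corpus)):
    print(round(cosine(response, ref), 2), doc)
```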
Text categorization.
Text categorization is a method for evaluating text that
uses automated learning techniques to categorize documents, where linguistic ex-
pressions and contexts extracted from the texts are used to classify them (Larkey,
1998). This type of analysis is informed by work in areas of machine learning,
Bayesian networks, information retrieval, natural language processing, case-based
reasoning, language modeling, and speech recognition. A number of text categori-
zation algorithms have been developed, incorporating different schema for classi-
fying text. The sorting of verbal content may be done by topic, by specified levels
of quality, or perhaps by keywords. It is interesting that the evaluation of essays is
only one of the many situations in which text categorization techniques have been
applied. These algorithms are also used in sorting documents in databases in an in-
formation retrieval context such as in the code that powers Internet search engines.
One organization with ongoing research into text categorization is the Edinburgh
Language Technology Group (http://www.ltg.ed.ac.uk/papers/class.html). On this
Web site, this group details much of their work on multiple applications of text cat-
egorization methodology.
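One very simple member of this family of techniques is a nearest-centroid classifier over bag-of-words vectors, sketched below with invented training responses and score categories; operational systems are considerably more sophisticated.

```python
# Toy text-categorization sketch: classify a response into a score category
# by comparing its bag-of-words vector to per-category centroids learned
# from scored training responses. Training data are hypothetical.
from collections import Counter

training = {
    "high": ["clear thesis with supporting evidence and a reasoned conclusion",
             "well organized argument using relevant evidence"],
    "low":  ["it was good because it was good",
             "i think it is fine and stuff"],
}

def bow(text):
    return Counter(text.split())

def centroid(texts):
    total = Counter()
    for t in texts:
        total.update(bow(t))
    return {w: c / len(texts) for w, c in total.items()}

def similarity(vec_a, vec_b):
    return sum(vec_a[w] * vec_b.get(w, 0.0) for w in vec_a)

centroids = {label: centroid(texts) for label, texts in training.items()}

def categorize(text):
    v = bow(text)
    return max(centroids, key=lambda label: similarity(v, centroids[label]))

print(categorize("the argument uses evidence and a clear conclusion"))  # 'high'
```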
The Constructed Free Response Scoring Tool.
The Constructed Free Re-
sponse Scoring Tool (FRST) is an automated approach to scoring examinees’ con-
structed responses online (Martinez & Bennett, 1992). FRST is an algorithm devel-
oped to score short verbal answers, where examinees key in responses that are
pattern-matched to programmed correct responses (Martinez & Bennett, 1992).
FRST has a 100% congruence rate with human raters when examinee responses
range in length between 5 and 7 words, and an 88% congruence rate for responses be-
tween 12 and 15 words. In this case, congruence rate is defined as the rate at which
scores assigned to examinee responses by the computer and the human rater exactly
match.
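The pattern-matching idea can be illustrated with a short sketch that normalizes a typed answer and checks it against keyed patterns; the item, patterns, and normalization rules are hypothetical and are not the actual FRST algorithm.

```python
# Hypothetical sketch of pattern-matching a short typed response against
# keyed correct answers (in the spirit of, but not identical to, FRST).
import re

def normalize(text):
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)            # drop punctuation
    text = re.sub(r"\s+", " ", text)               # collapse whitespace
    return text

# Keyed patterns for a hypothetical item: "Name the organelle that produces ATP."
keyed_patterns = [r"(the )?mitochondria", r"(the )?mitochondrion"]

def score_short_answer(response):
    cleaned = normalize(response)
    return int(any(re.fullmatch(p, cleaned) for p in keyed_patterns))

print(score_short_answer("  The mitochondria. "))   # 1
print(score_short_answer("the nucleus"))            # 0
```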
Expert Systems Analysis
Expert systems analysis provides another example of the use of computers to score
complex examinee responses. Expert systems are computer programs designed to
emulate the scoring behaviors of human content specialists. Expert scoring has
been studied in a number of contexts, including computer programming and math-
ematics problems. For example, PROUST and MicroPROUST are two expert sys-
tems developed to automatically score computer programs that examinees write
using the Pascal language (Braun, Bennett, Frye, & Soloway, 1990; Martinez &
Bennett, 1992). Each system has knowledge to reason about programming prob-
lems within an intention-based analysis framework. Based on how humans reason about
computer programs, the expert system formulates deep-structure, goal,
and plan representations in the process of trying to identify nonsyntactical errors.
In terms of constructed-response quantitative items, the expert scoring system
known as GIDE produces a series of comments about errors present in examinees’
solutions and then incorporates that information into computation of partial-credit
scores (Bennett & Sebrechts, 1996; Martinez & Bennett, 1992; Sebrechts, Ben-
nett, & Rock, 1991). The expert systems program consults a problem-specific
knowledge base constructed from human content-expert responses that are coded
in a machine-usable form. The examinee responses are broken down into compo-
nent parts, and each piece is evaluated against multiple programmed alternatives.
Here, analysis has shown that reasonable machine-rater congruence can be ob-
tained (e.g., a .86 correlation between the scores assigned to examinee solutions by a machine
and by a human rater; Martinez & Bennett, 1992). Interestingly, research into
GIDE, PROUST, and MicroPROUST expert systems scoring mechanisms sug-
gests that although each is highly accurate at classifying examinee responses as
correct or incorrect, they are less able to provide specific diagnostic information
about examinee errors.
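The decomposition-and-matching idea can be sketched as follows; the solution steps, keyed alternatives, and point values are hypothetical and are not taken from GIDE.

```python
# Hypothetical sketch of expert-system style partial-credit scoring: an
# examinee's multi-step quantitative solution is broken into components,
# and each component is matched against programmed acceptable alternatives.

solution_key = [
    {"step": "set_up_equation", "acceptable": {"0.05*2000", "2000*0.05"}, "points": 1},
    {"step": "compute_interest", "acceptable": {"100"},                   "points": 1},
    {"step": "report_balance",   "acceptable": {"2100", "$2100"},         "points": 2},
]

def score_solution(components):
    """components: dict mapping step name -> the examinee's expression for that step."""
    earned, comments = 0, []
    for part in solution_key:
        answer = components.get(part["step"], "").replace(" ", "")
        if answer in part["acceptable"]:
            earned += part["points"]
        else:
            comments.append(f"error in {part['step']}: got '{answer}'")
    return earned, comments

examinee = {"set_up_equation": "2000 * 0.05", "compute_interest": "100",
            "report_balance": "2005"}
print(score_solution(examinee))   # (2, ["error in report_balance: got '2005'"])
```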
Mental Modeling
An additional approach to computerized scoring of complex performance tasks is
mental modeling, which is currently used to score portions of the ARE. The perfor-
mance tasks on the ARE are often cited as highly interactive and innovative exam-
ples of the possibilities that exist for automated scoring in computer-based testing.
Examinees are presented with architectural design tasks given various constraints
and, in effect, create blueprints for buildings during the testing session. Each prob-
lem is graded on four attributes (grammatical, code compliance, diagrammatic
compliance, and design logic and efficiency) using features extraction analysis
where elements of the final product can be evaluated against the universe of all
possible variations (Bejar, 1991; Martinez & Bennett, 1992). Various elements ex-
tracted from an examinee’s constructed response are compared to the universe of
possible variations where the components of possible responses are evaluated us-
ing a procedure that mimics the scoring processing of committees and requires an
analysis of the way experienced experts evaluate solutions (Williamson, Bejar, &
Hone, 1999). This “mental modeling” approach to scoring, done by computers,
can be compared to the results obtained from human raters to assess the extent to
which these methodologies agree on results.
In addition to being used on the ARE, the National Board of Medical Examiners
has incorporated the mental model algorithm into its patient care simulations
(Clauser, Harik, & Clyman, 2000; Clauser et al., 1997). Each action that exam-
inees key in is classified either as benefiting the simulated patient or as an in-
appropriate action carrying some level of risk. Features extraction analysis and
mental modeling may be applicable as a medium for the automated scoring of es-
says as well, where the features could be specified as components of an essay, such
as sentences.
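In outline, a feature-extraction scoring routine evaluates keyed features of the submitted product and aggregates them into attribute-level scores intended to mimic committee judgments; the features, thresholds, and aggregation rule below are invented for illustration.

```python
# Hypothetical feature-extraction sketch for a design vignette: extract keyed
# features from the submitted product, rate each as acceptable or not, and
# aggregate into attribute-level scores. All rules are invented.

submitted_design = {"exit_count": 2, "exit_min_width_in": 36,
                    "rooms_overlap": False, "parking_spaces": 18}

feature_rules = {
    "code_compliance": [("exit_count", lambda v: v >= 2),
                        ("exit_min_width_in", lambda v: v >= 32)],
    "design_logic":    [("rooms_overlap", lambda v: v is False),
                        ("parking_spaces", lambda v: 15 <= v <= 25)],
}

def score_design(design, rules):
    attribute_scores = {}
    for attribute, checks in rules.items():
        passed = sum(1 for feature, ok in checks if ok(design[feature]))
        attribute_scores[attribute] = passed / len(checks)   # proportion acceptable
    return attribute_scores

print(score_design(submitted_design, feature_rules))
# {'code_compliance': 1.0, 'design_logic': 1.0}
```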
The various algorithms for automatically scoring constructed responses repre-
sent an especially exciting direction for computerized assessment practices. Using
computers in this regard will help to improve the extent to which uniformity and
precision in scoring rules can be implemented (Clauser et al., 2000). As a result,
test users and examinees alike may develop greater confidence in the inferences
about domain proficiency made on the basis of test scores. Likewise, as Bennett
(1998) mentioned, delivery efficiency will improve. In consequence, performance
tasks that can be automatically evaluated will become more logistically and practi-
cally feasible for use in high-stakes credentialing.
DIRECTIONS FOR FUTURE RESEARCH
The incorporation of different response actions, task formats, multimedia prompts,
and reference materials into test items has the potential to substantially increase
the types of skills, abilities, and processes that can be measured. Likewise, the im-
plementation of automated scoring methods can greatly facilitate the processing of
examinee responses. The potential benefits of incorporating these innovations and
tasks to testing are great, but such benefits cannot fully materialize without further
research on a number of psychometric and operational concerns (particularly with
regard to many of the emerging item types, as found by Zenisky & Sireci, 2001).
Technology-related dimensions of test format, administration, and scoring cannot
be accepted without due psychometric scrutiny.
In terms of integrating innovations in task presentation into more large-scale as-
sessments, there are several critical directions for research. The intricacy of the
task and how it relates to the skill(s) being assessed in a given testing context is an
issue of central importance that must be rigorously evaluated (Crocker, 1997).
Examinees should not be overwhelmed with innovative item types and item fea-
tures that are extraneous to the task. To this end, the relative simplicity or complex-
ity of the user interface should remain a fundamental concern for test developers,
especially in light of the potential for gadgetry to eclipse real technological bene-
fits. Oftentimes, extended tutorials may be necessary to sufficiently familiarize
examinees with novel testing tasks, and thus use of these item types may require sub-
stantial testing time and development resource commitments from test developers.
Further research specific to different item types and task presentation variables can help
determine the kinds of preparation and tutorials necessary for this purpose.
Practical validity concerns in CBT include the adequacy of construct represen-
tation (Huff & Sireci, 2001; Kane, Crooks, & Cohen, 1999; Messick, 1995) and
task generalizability (Brennan & Johnson, 1995; Linn & Burton, 1994; Shavelson,
Baxter, & Pine, 1991). Issues of task specificity such as the relative number of
tasks and the extent to which examinee performance can be generalized from the
selected tasks (Guion, 1995) are additional concerns. Equally important are stud-
ies to determine potential sources of construct-irrelevant variance associated with
such item types (Huff & Sireci, 2001). Furthermore, work to evaluate adverse im-
pact for different subgroups of examinee populations has by and large not been
completed for most emerging item types. This problem needs to be addressed in
future research.
Automated scoring must be evaluated with respect to potential losses in score
validity, perhaps in the direction of multitrait–multimethod analyses (Clauser,
2000; Yang et al., 2002). The emerging area of multidimensional item response
theory (IRT) models may provide some interesting ways for scoring complex con-
structed responses (see Ackerman, 1994, and van der Linden & Hambleton, 1997,
for further information on multidimensional IRT). Preliminary research suggests
that compromises in reliability and information per minute of testing time may
occur when complex, computerized constructed-response item types are used
(Jodoin, 2001), so further research in the areas of reliability and test and item infor-
mation should be accelerated.
CONCLUSIONS
Technological advances in CBT represent positive future directions for the evalua-
tion of complex performances within large-scale testing programs, especially
given the escalating use of technology in many aspects of everyday life. Ex-
aminees welcome the opportunity to demonstrate what they know when tasks on a
test more faithfully relate to the skills necessary for a particular domain, and these
methods may provide test users with ways to acquire information about an
examinee’s proficiency on a given knowledge or skill area more directly (Kane,
1992). As psychometric research related to computerized performance assessment
is completed, the application of empirical results to fusions of technol-
ogy and measurement will continue to have a positive impact on assessment practices.
The overview of emerging technological innovations presented in this article pro-
vides a broad (although not exhaustive) description of numerous recent develop-
ments in computer-based assessment.
Many of these innovations have both strengths and weaknesses from practical and
psychometric perspectives, and thus enthusiasm for these emerging measurement
methods must be tempered by scientific wariness about their technical characteris-
tics. Still, although it is difficult to predict the future, many of these CBT innova-
tions are likely to dramatically change the testing experience for many examinees
who sit for assessments in a wide variety of testing contexts, including certification
and licensure, admissions, and achievement testing.
ACKNOWLEDGMENTS
Laboratory of Psychometric and Evaluative Research Report No. 383, School of
Education, University of Massachusetts, Amherst. This research was funded in
part by the American Institute of Certified Public Accountants (AICPA). We are
grateful for this support. The opinions expressed in this article are ours and do not
represent official positions of the AICPA.
REFERENCES
Ackerman, T. A. (1994). Using multidimensional item response theory to understand what items and
tests are measuring. Applied Measurement in Education, 7, 255–278.
Ackerman, T. A., Evans, J., Park, K., Tamassia, C., & Turner, R. (1999). Computer assessment using vi-
sual stimuli: A test of dermatological skin disorders. In F. Drasgow and J. B. Olson-Buchanan (Eds.),
Innovations in computerized assessment (pp. 137–150). Mahwah, NJ: Lawrence Erlbaum Associ-
ates, Inc.
American College Testing Program. (2000). Calculators and the ACT math test. Retrieved March 20,
2000, from http://www.act.org/aap/taking/calculator.html
Ashworth, S. D., & Joyce, T. M. (1994, April). Developing scoring protocols for a computerized multi-
media in-basket exercise. Paper presented at the Ninth Annual Conference of the Society for Indus-
trial and Organizational Psychology, Nashville, TN.
Bejar, I. (1991). A methodology for scoring open-ended architectural design problems. Journal of Ap-
plied Psychology, 76, 522–532.
Bennett, R. E. (1998). Reinventing assessment: Speculations on the future of large-scale educational
testing. Princeton, NJ: Educational Testing Service.
Bennett, R. E., Morley, M., & Quardt, D. (2000). Three response types for broadening the conception of
mathematical problem solving in computerized-adaptive tests. Applied Psychological Measurement,
24, 294–309.
Bennett, R. E., Morley, M., Quardt, D., & Rock, D. A. (2000). Graphical modeling: A new response
type for measuring the qualitative component of mathematical reasoning. Applied Measurement in
Education, 13, 303–322.
Bennett, R. E., Morley, M., Quardt, D., Rock, D. A., Singley, M. K., Katz, I. R., & Nhouyvanisvong, A.
(1999). Psychometric and cognitive functioning of an under-determined computer-based response
type for quantitative reasoning. Journal of Educational Measurement, 36, 233–252.
Bennett, R. E., & Rock, D. A. (1995). Generalizability, validity, and examinee perceptions of a com-
puter-delivered formulating-hypotheses test. Journal of Educational Measurement, 32, 19–36.
Bennett, R. E., & Sebrechts, M. M. (1996). The accuracy of expert-system diagnoses of mathematical
problem solutions. Applied Measurement in Education, 9, 133–150.
Bennett, R. E., & Sebrechts, M. M. (1997). A computer-based task for measuring the representational
component of quantitative proficiency. Journal of Educational Measurement, 34, 64–78.
Bennett, R. E., Sebrechts, M. M., & Rock, D. A. (1991). Expert system scores for complex con-
structed-response quantitative items: A study of convergent validity. Applied Psychological Mea-
surement, 15, 227–239.
Bennett, R. E., Steffen, M., Singley, M. K., Morley, M., & Jacquemin, D. (1997). Evaluating an auto-
matically scorable, open-ended response type for measuring mathematical reasoning in com-
puter-adaptive tests. Journal of Educational Measurement, 34, 162–176.
Braun, H. I., Bennett, R. E., Frye, D., & Soloway, E. (1990). Scoring constructed responses using expert
systems. Journal of Educational Measurement, 27, 93–108.
Breland, H. M. (1999). Exploration of an automated editing task as a GRE writing measure (RR–99–9).
Princeton, NJ: Educational Testing Service.
Brennan, R. L., & Johnson, E. G. (1995). Generalizability of performance assessments. Educational
Measurement: Issues and Practice, 14(4), 25–27.
Burstein, J., & Chodorow, M. (2002). Directions in automated essay scoring analysis. In R. Kaplan (Ed.),
Oxford handbook of applied linguistics (pp. 487–497). Oxford, England: Oxford University Press.
Burstein, J., Kukich, K., Wolff, S., Lu, C., & Chodorow, M. (1998, April). Computer analysis of essays.
Paper presented at the NCME Symposium on Automated Scoring, San Diego, CA.
Carey, P. (2001, April). Overview of current computer-based TOEFL. Paper presented at the annual
meeting of the National Council on Measurement in Education, Seattle, WA.
Chung, G. K. W. K., O’Neil, H. F., Jr., & Herl, H. E. (1999). The use of computer-based collaborative
knowledge mapping to measure team processes and outcomes. Computers in Human Behavior, 15,
463–494.
Clauser, B. E. (2000). Recurrent issues and recent advances in scoring performance assessments. Ap-
plied Psychological Measurement, 24, 310–324.
Clauser, B. E., Harik, P., & Clyman, S. G. (2000). The generalizability of scores for a performance as-
sessment scored with a computer-automated scoring system. Journal of Educational Measurement,
37, 245–261.
Clauser, B. E., Margolis, M. J., Clyman, S. G., & Ross, L. P. (1997). Development of automated scoring
algorithms for complex performance assessments: A comparison of two approaches. Journal of Edu-
cational Measurement, 34, 141–161.
College Board. (2000a). AP Calculus for the new century. Retrieved March 20, 2000, from http://www.collegeboard.org/index_this/ap/calculus/new_century/evolution.html
College Board. (2000b). Calculators. Retrieved March 20, 2000, from http://www.collegeboard.org/index_this/sat/center/html/counselors/prep009.html
Crocker, L. (1997). Assessing content representativeness of performance assessment exercises. Ap-
plied Measurement in Education, 10, 83–95.
Davey, T., Godwin, J., & Mittelholtz, D. (1997). Developing and scoring an innovative computerized
writing assessment. Journal of Educational Measurement, 34, 21–42.
Desmarais, L. B., Dyer, P. J., Midkiff, K. R., Barbera, K. M., Curtis, J. R., Esrig, F. H., & Masi, D. L.
(1992, May). Scientific uncertainties in the development of a multimedia test: Trade-offs and deci-
sions. Paper presented at the Seventh Annual Conference of the Society for Industrial and Organiza-
tional Psychology, Montreal, Quebec, Canada.
Drasgow, F., Olson-Buchanan, J. B., & Moberg, P. J. (1999). Development of an interactive video as-
sessment: Trials and tribulations. In F. Drasgow & J. B. Olson-Buchanan (Eds.), Innovations in com-
puterized assessment (pp. 197–219). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Educational Testing Service. (1993). Tests at a glance: Praxis I: Academic Skills Assessment. Prince-
ton, NJ: Author.
Educational Testing Service. (2000). The Praxis Series: Professional Assessments for Beginning
Teachers: Tests and test dates. Retrieved March 20, 2000, from http://www.teachingandlearning.org/licensure/praxis/prxtest.html
Enright, M. K., Rock, D. A., & Bennett, R. E. (1998). Improving measurement for graduate admissions.
Journal of Educational Measurement, 35, 250–267.
Fitzgerald, C. (2001, April). Rewards and challenges of implementing an innovative CBT certification
exam program. Paper presented at the annual meeting of the National Council on Measurement in
Education, Seattle, WA.
Foltz, P. W., Kintsch, W., & Landauer, T. K. (1998). The measurement of textual coherence with latent
semantic analysis. Discourse Processes, 25, 285–307.
Foster, D., Olsen, J. B., Ford, J., & Sireci, S. G. (1997, March). Administering computerized certifica-
tion exams in multiple languages: Lessons learned from the international marketplace. Paper pre-
sented at the meeting of the American Educational Research Association, Chicago.
French, A., & Godwin, J. (1996). Using multimedia technology to create innovative items. Paper pre-
sented at the annual meeting of the American Educational Research Association, New York.
Glaser, R. (1991). Expertise and assessment. In M. C. Wittrock & E. L. Baker (Eds.), Testing and cogni-
tion (pp. 17–30). Englewood Cliffs, NJ: Prentice Hall.
Guion, R. M. (1995). Comments on values and standards in performance assessments. Educational
Measurement: Issues and Practice, 14(4), 25–27.
Hambleton, R. K. (1997, October). Promising GMAT item formats for the 21st century. Invited presen-
tation at the international workshop on the GMAT, Paris, France.
Herl, H. E., O’Neil, H. F., Jr., Chung, G. K. W. K., & Schacter, J. (1999). Reliability and validity of a
computer-based knowledge mapping system to measure content understanding. Computers in Hu-
man Behavior, 15, 315–333.
Huff, K. L., & Sireci, S. G. (2001). Validity issues in computer-based testing. Educational Measure-
ment: Issues and Practice, 20(3), 16–25.
Jamieson, J., Taylor, C., Kirsch, I., & Eignor, D. (1998). Design and evaluation of a computer-based
TOEFL tutorial. System, 26, 485–513.
Jodoin, M. G. (2001, April). An empirical examination of IRT information for innovative item formats
in a computer-based certification testing program. Paper presented at the annual meeting of the Na-
tional Council on Measurement in Education, Seattle, WA.
Kane, M. T. (1992). The assessment of professional competence. Evaluation and the Health Profes-
sions, 15, 163–182.
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measure-
ment: Issues and Practice, 18(2), 5–17.
Kaplan, R. M., & Bennett, R. E. (1994). Using the free-response scoring tool to automatically score the
formulating hypotheses item (ETS Research Report No. 94–08). Princeton, NJ: Educational Testing
Service.
Klein, D. C. D., O’Neil, H. F., Jr., & Baker, E. L. (1998). A cognitive demands analysis of innovative
technologies (CSE Tech. Rep. No. 454). Los Angeles, CA: UCLA, National Center for Research on
Evaluation, Student Standards, and Testing.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Dis-
course Processes, 25, 259–284.
Landauer, T. K., Laham, D., Rehder, B., & Schreiner, M. E. (1997). How well can passage meaning be
derived without using word order? A comparison of latent semantic analysis and humans. In G.
Shafto & P. Langley (Eds.), Proceedings of the 19th annual meeting of the Cognitive Science Society
(pp. 412–417). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Larkey, L. S. (1998). Automatic essay grading using text categorization techniques. In Proceedings
of the 21st Annual International Conference of the Association for Computing Machinery Special
Interest Group on Information Retrieval (pp. 90–95). Melbourne, Australia.
Linn, R. L., & Burton, E. (1994). Performance-based assessment: Implications of task specificity. Edu-
cational Measurement: Issues and Practice, 13(1), 5–8, 15.
Luecht, R. M. (2001, April). Capturing, codifying, and scoring complex data for innovative, com-
puter-based items. Paper presented at the annual meeting of the National Council on Measurement in
Education, Seattle, WA.
Martinez, M. E. (1991). A comparison of multiple-choice and constructed figural response items. Jour-
nal of Educational Measurement, 28, 131–145.
Martinez, M. E., & Bennett, R. E. (1992). A review of automatically scorable constructed-response
item types for large-scale assessment. Applied Measurement in Education, 5, 151–169.
Messick, S. (1995). Standards of validity and the validity of standards in performance assessment. Edu-
cational Measurement: Issues and Practice, 14(4), 5–9.
Microsoft Corporation. (1998, September). Procedures and guidelines for writing Microsoft Certifica-
tion Exams. Redmond, WA: Author.
Mills, C. (2000, February). Unlocking the promise of CBT. Keynote address presented at a conference
of the Association of Test Publishers, Carmel Valley, CA.
Mislevy, R. J., Steinberg, L. S., Breyer, F. J., Almond, R. G., & Johnson, L. (1999, September 16–17).
Making sense of data from complex assessments. Paper presented at the 1999 CRESST Conference,
Los Angeles, CA.
National Council of Architectural Registration Boards. (2000). ARE practice program [Computer soft-
ware]. Retrieved March 20, 2000, from http://www.ncarb.org/are/tutorial2.html
Nhouyvanisvong, A., Katz, I. R., & Singley, M. K. (1997). Toward a unified model of problem solving
in well-determined and under-determined algebra word problems. Paper presented at the annual
meeting of the American Educational Research Association, Chicago.
O’Neil, K., & Folk, V. (1996, April). Innovative CBT item formats in a teacher licensing program. Paper
presented at the annual meeting of the National Council on Measurement in Education, New York.
Page, E. B. (1994). Computer grading of student prose, using modern concepts and software. Journal of
Experimental Education, 62(2), 127–142.
Page, E. B., & Peterson, N. S. (1995). The computer moves into essay grading: Updating the ancient
test. Phi Delta Kappan, 76, 561–565.
Parshall, C. G., & Balizet, S. (2001). Audio computer-based tests (CBTs): An initial framework for the
use of sound in computerized tests. Educational Measurement: Issues and Practice, 20(2), 5–15.
Parshall, C. G., Davey, T., & Pashley, P. (2000). Innovative item types for computerized testing. In W. J.
van der Linden & C. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 129–148).
Boston: Kluwer Academic.
Rehder, B., Schreiner, M. E., Wolfe, M. B., Laham, D., Landauer, T. K., & Kintsch, W. (1998). Using
latent semantic analysis to assess knowledge: Some technical considerations. Discourse Processes,
25, 337–354.
Rizavi, S., & Sireci, S. G. (1999). Comparing computerized and human scoring of WritePlacer Essays
(Laboratory of Psychometric and Evaluative Research Report No. 354). Amherst: School of Educa-
tion, University of Massachusetts.
Sebrechts, M. M., Bennett, R. E., & Rock, D. A. (1991). Agreement between expert system and human
raters’ scores on complex constructed-response quantitative items. Journal of Applied Psychology,
76, 856–862.
Shavelson, R. J., Baxter, G., & Pine, J. (1991). Performance assessment in science. Applied Measure-
ment in Education, 4, 347–362.
Taylor, C., Jamieson, J., Eignor, D., & Kirsch, I. (1998). The relationship between computer familiarity
and performance on computer-based TOEFL test tasks (ETS Research Rep. No. 98–08). Princeton,
NJ: Educational Testing Service.
van der Linden, W. J., & Hambleton, R. K. (Eds.). (1997). Handbook of modern item response theory.
New York: Springer.
Vispoel, W. P. (1999). Creating computerized adaptive tests of music aptitude: Problems, solutions, and
future directions. In F. Drasgow & J. B. Olson-Buchanan (Eds.), Innovations in computerized assess-
ment (pp. 151–176). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Walker, G., & Crandall, J. (1999, February). Value added by computer-based TOEFL test [TOEFL
briefing]. Princeton, NJ: Educational Testing Service.
Williamson, D. M., Bejar, I. I., & Hone, A. S. (1999). ‘Mental model’ comparisons of automated and
human scoring. Journal of Educational Measurement, 36, 158–184.
Williamson, D. M., Hone, A. S., Miller, S., & Bejar, I. I. (1998, April). Classification trees for quality
control processes in automated constructed response scoring. Paper presented at the annual meeting
of the National Council on Measurement in Education, San Diego, CA.
Yang, Y., Buckendahl, C. W., Juszkiewicz, P. I., & Bhola, D. S. (2002/this issue). A review of strategies
for validating computer automated scoring. Applied Measurement in Education, 15, 391–412.
Zenisky, A. L., & Sireci, S. G. (2001). Feasibility review of selected performance assessment item types
for the computerized Uniform CPA Exam (Laboratory of Psychometric and Evaluative Research
Rep. No. 405). Amherst: School of Education, University of Massachusetts.