Technological Innovations
in Large-Scale Assessment
April L. Zenisky and Stephen G. Sireci
Center for Educational Assessment, School of Education
University of Massachusetts Amherst
Computers have had a tremendous impact on assessment practices over the past half
century. Advances in computer technology have substantially influenced the ways in
which tests are made, administered, scored, and reported to examinees. These
changes are particularly evident in computer-based testing, where the use of comput-
ers has allowed test developers to re-envision what test items look like and how they
are scored. By integrating technology into assessments, it is increasingly possible to
create test items that can sample as broad or as narrow a range of behaviors as needed
while preserving a great deal of fidelity to the construct of interest. In this article we
review and illustrate some of the current technological developments in com-
puter-based testing, focusing on novel item formats and automated scoring method-
ologies. Our review indicates that a number of technological innovations in perfor-
mance assessment are increasingly being researched and implemented by testing
programs. In some cases, complex psychometric and operational issues have suc-
cessfully been dealt with, but a variety of substantial measurement concerns associ-
ated with novel item types and other technological aspects impede more widespread
use. Given emerging research, however, there appears to be vast potential for expand-
ing the use of more computerized constructed-response type items in a variety of test-
ing contexts.
The rapid evolution of computers and computing technology has played a critical
role in defining current measurement practices. Many tests are now administered
on a computer, and a number of psychometric software programs are widely used
to facilitate all aspects of test development and analysis. Such technological ad-
vances provide testing programs with many new tools with which to build tests and
understand assessment data from a variety of perspectives and traditions. The potential for computers to influence testing is not yet exhausted, however, as the needs and interests of testing programs continue to evolve.

APPLIED MEASUREMENT IN EDUCATION, 15(4), 337–362
Copyright © 2002, Lawrence Erlbaum Associates, Inc.
Requests for reprints should be sent to April L. Zenisky, School of Education, 156 Hills South, University of Massachusetts, Amherst, MA 01003–4140. E-mail: azenisky@educ.umass.edu
Test users increasingly express interest in assessing skills that can be difficult to
fully tap using traditional paper-and-pencil tests. As the potential for integrating
technology into task presentation and response collection has become more of a
practical reality, a variety of innovative computerized constructed-response item
types have emerged. Many of these new types call for reconceptualization of what
examinee responses look like, how they are entered into the computer, and how
they are scored (Bennett, 1998). This is good news for test users in all testing con-
texts, as a greater selection of item types may allow test developers to increase the
extent to which tasks on a test approximate the knowledge, skills, and abilities of
interest.
The purpose of this article is to review the current advances in computer-based
assessment, including innovative item types, response technologies, and scoring
methodologies. Each of these topics defines an area where applications of technol-
ogy are rapidly evolving. As research and practical implementation continue,
these emerging assessment methods are likely to significantly alter measurement
practices. In this article we provide an overview of the recent developments in task
presentation and response scoring algorithms that are currently used or have the
potential for use in large-scale testing.
INNOVATIONS IN TASK PRESENTATION
Response Actions
In developing test items for computerized performance assessment, one critical
component for test developers to think about is the format of the response an
examinee is to provide. It may at first seem backward to think about examinee re-
sponse formats before item stems, but how answers to test questions are structured
has an obvious bearing on the nature of the information being collected. Thus, as
the process of designing assessment tasks gets underway, some reflection on the
inferences to be made on the basis of test scores and how best to collect that data is
essential.
To this end, an understanding of what Parshall, Davey, and Pashley (2000)
termed response action and what Bennett, Morley and Quardt (2000) described as
response type may be helpful. Prior to actually constructing test items, some con-
sideration of the type of responses desired from examinees and the method by
which the responses could be entered can help a test developer to discern the kinds
of item types that might provide the most useful and construct-relevant informa-
tion about an examinee (Mislevy, Steinberg, Breyer, Almond, & Johnson, 1999).
In a computer-based testing (CBT) environment, the precise nature of the information that test developers would like examinees to provide might be best expressed
in one of several ways. For example, examinees could be required to type text-
based responses, enter numerical answers via a keyboard or by clicking onscreen
buttons, or manipulate or synthesize information on a computer screen in some
way (e.g., use a mouse to direct an onscreen cursor to click on text boxes,
pull-down menus, or audio or video prompts). The mouse can also be used to draw
onscreen images as well as to “drag-and-drop” objects.
The keyboard and mouse are the input devices most familiar to examinees and
are the ones overwhelmingly implemented in current computerized testing appli-
cations, but response actions in a computerized context are not exclusively limited
to the keyboard and mouse. Pending additional research, some additional input de-
vices by which examinees’ constructed responses could one day be collected in-
clude touch screens, light pens, joysticks, trackballs, speech recognition software
and microphones, and pressure-feedback (haptic) devices (Parshall, Davey, &
Pashley, 2000). Each of these emerging methods represents an inventive way by
which test developers and users can gather different pieces of information about
examinee skills. However, at this time these alternate data collection mechanisms
are largely in experimental stages and are not yet implemented as part of many (if
any) novel item types. For this reason, we focus on emerging item types that use
the keyboard, mouse, or both for collecting responses from examinees.
Novel Item Types
For many testing programs, the familiar item types currently in use such as multi-
ple-choice and essay items provide sufficient measurement information for the
kinds of decisions being made on the basis of test scores. However, a substantial
number of increasingly interactive item types that may increase measurement in-
formation are now available, and some are being used operationally. This prolifer-
ation of item types has largely come about in response to requests from test con-
sumers for assessments aligned more closely with the constructs or skills being
assessed. Although many of these newer item types were developed for specific
testing applications such as licensure, certification, or graduate admissions testing,
it is possible to envision each of these item types being adapted in countless ways
to assess different constructs as needed by a particular testing program.
Numerous computerized item types have emerged over the past decade, so it is
virtually impossible to illustrate and describe them all in a single article. Neverthe-
less, we conducted a review of the psychometric literature and of test publishers’
Web sites and selected several promising item types for presentation and discus-
sion. A nonexhaustive list of 21 of these item types is presented in Table 1, along
with a brief description of each type and some relevant references. Some of the
item types listed in Table 1 are being used operationally, whereas others have been
only proposed for use.
TABLE 1
Computerized Performance Assessment Item Types

Drag-and-drop (select-and-place): Given a scenario or problem, examinees click and drag an object to the center of the appropriate answer field (see Figure 1). [Fitzgerald (2001); Luecht (2001); Microsoft Corporation (1998)]

Graphical modeling: Examinees use line and curve tools to sketch a given situation on a grid. [Bennett, Morley, & Quardt (2000); Bennett, Morley, Quardt, & Rock (2000)]

Move figure or symbols in/into pictographs: Examinees manipulate elements of a chart or graph to represent certain situations or adjust or complete the image as necessary (e.g., extending bars in a bar chart; see Figure 2). [Educational Testing Service (1993); French & Godwin (1996); Martinez (1991)]

Drag and connect/Specifying relationships: Given presented objects, examinees identify the relationship(s) that exist between pairs of objects (see Figure 3). [Fitzgerald (2001); Luecht (2001)]

Concept mapping: Examinees demonstrate knowledge of interrelationships between data points by graphically representing onscreen images and text using links and nodes. [Chung, O’Neil, & Herl (1999); Klein, O’Neil, & Baker (1998)]

Sorting task: Given prototypes, examinees look for surface or deep structural similarities between presented items and prototypes and match items with prototype categories. [Bennett & Sebrechts (1997); Glaser (1991)]

Ordering information (create-a-tree): Examinees sequence events as required by the item stem (e.g., largest to smallest, most to least probable cause of event, if–then; see Figure 4). [Educational Testing Service (1993); Fitzgerald (2001); Luecht (2001); Walker & Crandall (1999)]

Inserting text: Examinees drag and drop text into a passage as directed by the item stem (e.g., where it makes sense, serves as example of observation). [Educational Testing Service (1993); Taylor, Jamieson, Eignor, & Kirsch (1998)]

Passage editing: Examinees edit a short onscreen passage by moving the cursor to various points in a passage and selecting sentence rewrites from a list of alternatives on a drop-down menu. [Breland (1999); Davey, Godwin, & Mittelholtz (1997)]

Highlighting text: Examinees read a passage and select specific sentence(s) in the passage (e.g., main idea, particular piece of information). [Carey (2001); Taylor, Jamieson, Eignor, & Kirsch (1998); Walker & Crandall (1999)]

Capturing or selecting frames/Shading: Given directions or parameters, examinees use the mouse to select a portion of a picture, map, or graph. [Hambleton (1997); O’Neil & Folk (1996)]

Mathematical expressions: Examinees generate and type in a unique expression to represent a mathematical relationship. [Bennett, Morley, & Quardt (2000); Bennett et al. (1997); Educational Testing Service (1993); Martinez & Bennett (1992)]

Numerical equations: Examinees complete numerical sentences by entering numbers and mathematical symbols in a text box. [Hambleton (1997)]

Multiple numerical response: Examinees type in more than one numerical answer (e.g., complete a tax form, insert numbers into a spreadsheet). [Hambleton (1997)]

Multiple selection: Examinees are presented with a stimulus (visual, audio, text) and select answer(s) from a list (answers may be used more than once in a series of questions). [Ackerman, Evans, Park, Tamassia, & Turner (1999); Mills (2000)]

Analyzing situations: Examinees are provided with visual/audio clips and short informational text and are asked to make a diagnosis/decision. Response could be free-text entry or extended matching. [Ackerman, Evans, Park, Tamassia, & Turner (1999)]

Generating examples: Examinees create examples given certain situations or constraints; there is more than one correct answer. Response is free-text entry. [Bennett, Morley, & Quardt (2000); Bennett et al. (1999); Enright, Rock, & Bennett (1998); Nhouyvanisvong, Katz, & Singley (1997)]

Generating multiple solutions/Formulating hypotheses: Given a situation, examinees generate plausible solutions or explanations. Response is free-text entry (see Figure 5). [Bennett & Rock (1995); Kaplan & Bennett (1994)]

Essay/Short answer: May be restricted or extended length. [Burstein et al. (1998); Rizavi & Sireci (1999)]

Problem-solving vignettes: Problem-solving situations (vignettes) are presented to examinees, who are graded on features of a product. [Bejar (1991); Fitzgerald (2001); Luecht (2001); Williamson, Bejar, & Hone (1999); Williamson, Hone, Miller, & Bejar (1998)]

Sequential problem solving/Role play: Examinees provide a series of responses as a dynamic situation unfolds. Scoring attends to process and product. [Clauser, Harik, & Clyman (2000); Clauser et al. (1997)]
Our discussion of novel item types begins with those items requiring use of the
mouse for different onscreen actions as methods for data collection. Some of these
item types bear greater resemblance to traditional selected response item types and
are easier to score mechanically, whereas others integrate technology in more in-
ventive ways. After introducing these item types, we turn to those item types in-
volving text-based responses. Last, we focus on items with more complex ex-
aminee responses that expand the concept of what responses to test items look like
in fundamental ways and pose more difficult challenges for automated scoring.
Item types requiring use of a mouse.
Many of the emerging computer-
based item types take advantage of the way in which people interact with a com-
puter, specifically via a keyboard and mouse. The mouse and onscreen cursor pro-
vide a flexible mechanism by which items can be manipulated. Using a mouse,
pull-down menus, and arrow keys, examinees can highlight text, drag-and-drop
text and other stimuli, create drawings or graphics, or point to critical features of an
item. An example of a drag-and-drop item (also called a select-and-place item) is
presented in Figure 1. This item type is used on a number of the Microsoft certification examinations (Fitzgerald, 2001; Microsoft Corporation, 1998). These items can be scored right/wrong or using partial credit.

FIGURE 1 Example of drag-and-drop item type.
The graphical modeling item type also uses the drag-and-drop capability of a
computer. This item type requires examinees to sketch out situations graphically
using onscreen line tools, curve tools, or both (Bennett, Morley, & Quardt, 2000;
Bennett, Morley, Quardt, & Rock, 2000). A similar item type using drag-and-drop
technology is the move figures or symbols into pictographs item type, which is pre-
sented in Figure 2. This item type requires examinees to drag a shape and position
it on a grid given certain parameters or constraints in the item stem (French &
Godwin, 1996; Martinez, 1991).
A variation of the drag-and-drop item is the drag-and-connect item type. This
item type presents examinees with several movable objects that can be arranged in
several different target locations onscreen. For example, when all the objects are correctly assembled (i.e., sequenced or organized accurately), a network would function correctly or a flowchart would appropriately illustrate a network protocol. An extension of this item type is the specifying relationships item type in
which examinees move objects around onscreen and link them in a flowchart by
way of clicking relationships such as “one to one,” “many to one,” or “one to zero”
(Fitzgerald, 2001; Luecht, 2001). An example of this item type is presented in Figure 3.

FIGURE 2 Example of moving figures or symbols in/into a pictograph item type.

Another item type that can be used to assess the understanding of relationships is the concept map item type. Having an examinee create onscreen concept
maps can allow for relationships between items or pieces of information to be il-
lustrated (Chung, O’Neil, & Herl, 1999; Herl, O’Neil, Chung, & Schacter, 1999;
Klein, O’Neil, & Baker, 1998).
Items that delve into assessing ordering and sorting information are increas-
ingly using drag-and-drop action. Bennett and Sebrechts’ (1997) sorting task item type (also studied by Glaser, 1991) gives examinees the chance to communicate knowledge about underlying relationships between items by dragging and dropping each focal item onto the target prototype with which it best aligns according to some surface or deep structural feature.
The ordering information item type, also referred to in the literature as a cre-
ate-a-tree item, requires examinees to use the mouse to exhibit understanding of
the material tested. The stem of this item type specifies the way in which the
examinee should arrange elements in a process. The examinee clicks on a focal object and then places it into a target location by dragging and dropping or by clicking on onscreen radio buttons to move the item as needed (Fitzgerald, 2001; Luecht, 2001; Walker & Crandall, 1999). Some sequences in which these focal items might be arranged include largest to smallest, most to least probable cause of an event, or following an if–then framework. Figure 4 presents an example of an ordering information/create-a-tree item type.

FIGURE 3 Example of specifying relationships item type.
Some computer-based item types are used specifically to assess verbal communication and comprehension skills. One, the inserting text item
type, presents examinees with a sentence that must be dragged and dropped into
the appropriate place in a passage (Carey, 2001; Jamieson, Taylor, Kirsch, &
Eignor, 1998). A similar item type is the passage editing item: the examinee moves the cursor to various onscreen “hot spots” where a drop-down menu appears with a list of potential sentence or phrase rewrites, and the examinee must select the best alternative from the list (Breland, 1999; Davey, Godwin, & Mittelholtz, 1997). The
alternatives could range from radical changes to no change at all, with the different
options being scored correct/incorrect or using a graded scale.
FIGURE 4 Example of ordering information/create-a-tree item type.
Another objectively scorable item type that uses mouse manipulation is the
highlighting information item type where the examinee uses the cursor to select a
target phrase or sentence within a passage. Examples of this item type include
identifying the main idea of a paragraph or antecedents of pronouns (Carey, 2001;
Taylor, Jamieson, Eignor, & Kirsch, 1998; Walker & Crandall, 1999). A similar
item type is the capturing/selecting frames item type (sometimes also referred to
as shading) that directs the examinee to click on portions of a graphic as needed
(Hambleton, 1997; O’Neil & Folk, 1996).
The item types described thus far could be described as “click-on item types,”
as clicking on one or more objects is required. For all of these item types
examinees must select or highlight information as directed by the item stem by
moving the mouse, which correspondingly moves an onscreen cursor. Many of
these item types might be considered by some as little more than extended multi-
ple-choice items, but generally with such items the objects that could be selected
are so numerous they require skills above and beyond the test-taking skills helpful
for success on traditional selected-response items. The multiple selection item type
is a good example of an item with this property, in that the examinee is expected to
select text or onscreen items using the mouse given instructions such as “choose all
that apply” or “select three” (Ackerman, Evans, Park, Tamassia, & Turner, 1999;
Mills, 2000). Obviously, such items reduce the chance of answering the item cor-
rectly by guessing, relative to a traditional multiple-choice item.
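The effect on guessing is easy to quantify: when a blind guesser must select exactly k of n options, the probability of success is one over the number of possible selections. A brief computational illustration (the option counts here are hypothetical):

```python
from math import comb  # binomial coefficient, Python 3.8+

def p_guess(n_options, n_required):
    """Probability of answering correctly by blind guessing when the
    examinee must select exactly n_required options out of n_options."""
    return 1 / comb(n_options, n_required)

p_guess(4, 1)  # traditional 4-option multiple choice -> 0.25
p_guess(8, 3)  # "select three" from eight options -> 1/56, about 0.018
```

Even a modest "select three" item thus cuts the chance of a lucky guess by more than an order of magnitude relative to a four-option multiple-choice item.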
Innovations in items with text-based responses.
Moving from the mouse
to the keyboard as the mechanism for examinees to enter responses, a number of
both novel and more familiar item types become increasingly useful in large-scale
performance assessment. Although having an examinee write a short answer or an
essay in a text box is not particularly novel, collecting such answers via computer
can be effective for data management purposes and is increasingly likely to be the
preferred method of evaluating writing skills (to the extent that typing is accepted
as a skill directly relevant to the construct(s) of interest). Similarly, in the mathematical expressions, numerical equations, and multiple numerical response item
types, the examinee can type answers into free-response text boxes (Bennett,
Morley, & Quardt, 2000; Bennett, Steffen, Singley, Morley, & Jacquemin, 1997).
Although these item types may not be especially innovative in and of themselves,
the responses can be surprisingly complex to complete, manage, and score because
there are multiple ways in which any mathematical or text-based response could be
expressed.
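To illustrate why such responses are complex to score, note that algebraically equivalent expressions can be typed in many surface forms (e.g., 2*(x + 1) versus 2*x + 2). A minimal Python sketch (a heuristic illustration only, assuming responses are typed in Python-style notation; this is not any program’s operational scoring engine) compares two typed expressions at randomly sampled points:

```python
import math
import random

def equivalent(expr_a, expr_b, var="x", trials=20, tol=1e-9):
    """Heuristically test whether two typed one-variable expressions
    agree at randomly sampled points (a proxy for algebraic equivalence)."""
    for _ in range(trials):
        value = random.uniform(-10, 10)
        env = {var: value, "math": math, "__builtins__": {}}
        try:
            a = eval(expr_a, env)
            b = eval(expr_b, env)
        except (ZeroDivisionError, ValueError):
            continue  # skip points outside the expressions' domains
        if abs(a - b) > tol * max(1.0, abs(a), abs(b)):
            return False
    return True

equivalent("2*(x + 1)", "2*x + 2")  # True: same function, different form
equivalent("2*(x + 1)", "2*x + 1")  # False: differs at every sampled point
```

A production system would instead use symbolic simplification and much stricter input validation; the point here is only that the answer key cannot be a single literal string.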
One novel item type in which examinees respond by means of a text box is the
generating examples item type. Problems and constraints are presented to the
examinee, whose task it is to pose one or several solutions that are feasible under
such parameters (Bennett, Morley, & Quardt, 2000; Bennett et al., 1999; Enright,
Rock, & Bennett, 1998; Nhouyvanisvong, Katz, & Singley, 1997). In some applications of this item type, the responses are numerical. In fact, the item type was orig-
inally designed in part to broaden the measurement of the construct “quantitative
reasoning” on the Graduate Record Examination, or GRE. Generating examples,
as an item type, is a variant of the generating solutions/formulating hypotheses
item type (Bennett & Rock, 1995; Kaplan & Bennett, 1994), an example of which
is presented in Figure 5. In the formulating hypotheses item type, an examinee is
first presented with a situation of some kind. The task then is to generate as many plausible explanations or causal reasons for the situation as possible.
Complex constructed-response items for CBT.
Some of the most intrigu-
ing advances in CBT are the problem solving vignettes used on professional li-
censure exams. Typically, the vignettes presented to licensure candidates reflect
real-world problems, and the computer simulates real-world responses. An example
of this increased fidelity in measurement is the problem solving vignette item type
found on the Architectural Registration Examination (ARE). On the ARE, ex-
aminees are asked to complete several design tasks (such as design a building or lay
out a parking lot) using a variety of onscreen drawing tools (Bejar, 1991; Williamson, Bejar, & Hone, 1999; Williamson, Hone, Miller, & Bejar, 1998).

FIGURE 5 Example of formulating hypotheses/generating solutions item type. Source: The formulating-hypotheses item type. (n.d.). Retrieved February 26, 2001, from http://www.ets.org/research/rsc/alcadia.html Copyright © Educational Testing Service. Used with permission.

By and large,
these items seem highly authentic to examinees and give test users data about
examinee ability in relation to actual, standardized architectural design tasks. In this
case, the generalizability of test item to job performance as described by Kane
(1992) is high.
Closely related to such problem-solving vignettes are dynamic problem solving
tasks, sometimes referred to as role-play exercises. In measurement, we typically
think of adaptive testing as dynamic between items, but advances in computer
hardware and software now allow some testing programs to create tests that are
adaptive within an item, where an item may be defined as an extended role-playing
task. The computerized case-based simulations used by the National Board of
Medical Examiners incorporate the idea of the simulated patient whose symptoms
and vital statistics change over time in response to the actions (or nonactions) taken
by the candidate (Clauser, Harik, & Clyman, 2000; Clauser, Margolis, Clyman, &
Ross, 1997). As the examinee manages a case, new symptoms may emerge, the
clock is ticking, and the prospective doctor’s actions have the potential to harm as
well as help the simulated patient. As the case progresses, the examinee may have
to deal with unintended medical side effects as well as the patient’s original medi-
cal condition. Each examinee is scored on the sequence of response actions he or she enters into the computer, such as requesting tests on the patient and writing treatment orders, as well as on diagnostic acuity. Similar dynamic simulation tasks are used for
aviation selection and training.
Media in Item Stems
A further emerging dimension of novel item types relates to what Parshall, Davey,
and Pashley (2000) referred to as media inclusion: the use of graphics, video, and
audio within an item or set of items. Multimedia can be used at various points in
the item stem for a variety of purposes: to better illustrate a particular situation, to
allow examinees to visualize a problem, or to better assess a specified construct
(e.g., music-listening aptitude).
Audio prompts in large-scale, noncomputerized testing have been largely con-
fined to music and language tests, with partial success in those areas. However,
Parshall and Balizet (2001) defined a framework for considering four uses of an
audio component in CBTs, including speech audio for listening tests, nonspeech
audio (e.g., music) for listening tests, speech audio for alternative assessment
(such as accommodating tests for limited-English proficient, reading disabled, or
visually disabled examinees), and speech and nonspeech audio incorporated into
the user interface. From a measurement perspective, as Vispoel (1999) noted,
when tests of music listening or aptitude are administered to a group in a non-CBT
format, compromises in administrative efficiency and measurement accuracy of-
ten leave examinee scores on such tests with questionable reliability. The difference in a computer-based setting is that the test can be administered individually
and this format of test administration permits examinees to proceed at their own
pace (Parshall & Balizet, 2001). An example of the successful use of audio
prompts in large-scale computer-based assessment is Educational Testing Ser-
vice’s (ETS’s) Test of English as a Foreign Language, which incorporates such
features.
Graphics, on the other hand, have generally enjoyed more extensive use in com-
puterized assessment. For example, the presentation of digitized pictures has been
used successfully by Ackerman, Evans, Park, Tamassia, and Turner (1999) in a test
of dermatological skin disorders. Examinees can use a zoom feature to get a better
look at the picture before selecting the correct diagnosis from a list of 110 alpha-
betized disorders. Although this item type would be strictly classified as a se-
lected-response item, rather than constructed-response, the emulation of the diag-
nostic processes of professional dermatologists allows examinees to demonstrate a
higher order grasp of the information and reduces the likelihood of guessing. On
other tests, onscreen images can be rotated, resized, selected, clicked on, and
dragged to form a meaningful image, depending on the item type (Klein, O’Neil,
& Baker, 1998). Some of the items described in Table 1, including graphical mod-
eling, concept mapping, and moving a figure into a graph, are examples of tasks
where the graphical manipulations compose the constructed responses.
Furthermore, most desktop computers now have video capabilities that make
the inclusion of video prompts in performance assessment highly feasible. Interac-
tive video assessment has been used operationally with the Workplace Situations
test at IBM (Desmarais et al., 1992), the Allstate Multimedia In-Basket (Ashworth
& Joyce, 1994), and the Conflict Resolution Skills Assessment (Drasgow, Olson-
Buchanan, & Moberg, 1999). In the Conflict Resolution Skills Assessment, for
example, an examinee views a conflict scene of approximately 2 minutes’ dura-
tion, which stops at a critical point and asks the examinee to select one of four re-
sponse options. Based on this selection, the action branches and continues until a second critical point is reached, and so forth.
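The branching logic of such an interactive video assessment can be represented as a simple directed graph. The sketch below is purely illustrative: the scene identifiers and response options are invented and do not describe the actual Conflict Resolution Skills Assessment.

```python
# Hypothetical branching structure: each scene maps the selected
# response option to the next scene (None terminates the assessment).
branches = {
    "scene_1":  {"A": "scene_2a", "B": "scene_2b", "C": "scene_2b", "D": None},
    "scene_2a": {"A": None, "B": "scene_3"},
    "scene_2b": {"A": "scene_3", "B": None},
    "scene_3":  {"A": None, "B": None},
}

def play(choices, start="scene_1"):
    """Follow an examinee's sequence of choices through the branch map,
    returning the path of scenes actually presented."""
    path, scene = [start], start
    for choice in choices:
        scene = branches[scene][choice]
        if scene is None:
            break
        path.append(scene)
    return path

play(["A", "B", "A"])  # -> ["scene_1", "scene_2a", "scene_3"]
```

Because different examinees traverse different paths, both the path itself and the choices made along it can enter into scoring.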
Although selected-response formats may be easier to implement because they avoid the virtually infinite number of possible solutions that constructed responses allow, it is possible to envision a near future in which advanced computing technology such as virtual reality is used to model interactive situations more dynamically and complex software applications are developed to score them successfully.
Use of Reference Materials
One additional new facet of how examinees complete items is the extent to which
examinees may use reference materials as they go through the test. On some as-
sessments, examinees can use calculators or other reference materials. Indeed, calculators may be used on the SAT I (College Board, 2000b) and the ACT Mathe-
matics test (American College Testing Program, 2000) and are required on two of
the mathematics sections of the SAT II (College Board, 2000a). In terms of a
credentialing/licensure test, one section of the mathematics assessment of the
Praxis teacher certification test specifically prohibits calculators, one section al-
lows them, and other sections require them (ETS, 2000). The ARE also requires
examinees to supply their own calculators.
Additional examples of auxiliary information examinees may access are found
on the credentialing examinations administered for Novell software certification
and on the ARE. One of Novell’s certification exams measures candidates’ ability
to quickly navigate two reference CDs to locate information necessary to complete
a task (D. Foster, personal communication, April 11, 2000). One CD contains tech-
nical product information, and the other contains a technical library detailing in-
formation about cables, hard drives, monitors, and CPUs. The tasks tap the candi-
dates’ ability to “research” these CDs to locate the content necessary to solve a
problem. The ARE also allows candidates to access resource material via the com-
puter. Candidates can retrieve certain subject specific information about building
code requirements, program constraints, and vignette specifications on demand as
they design structures in accordance with the presented task directions (National
Council of Architectural Registration Boards, 2000).
Novell certification exams have an additional interactive feature: Candidates
who take a non-English version of an exam can access the English-language ver-
sion of each item. If they so desire, candidates can click on a button to switch back
and forth between the language in which they are taking the test and English (Fos-
ter, Olsen, Ford, & Sireci, 1997).
INNOVATIONS IN SCORING COMPLEX
CONSTRUCTED-RESPONSES
Computerized-adaptive testing (CAT) is increasingly attractive to test developers
as a way to increase the amount of information examinee responses provide about
ability. However, as Parshall, Davey, and Pashley (2000) point out, an adaptive test
works best if the computer can score examinee responses to the test items automat-
ically and instantaneously. This has not historically been a problem when the items
being used are selected-response, as the computer can easily compare the sequence of responses from each examinee to the programmed answer key. The traditional
multiple-choice item, scored with a dichotomous item response model, is the item
type principally used in CAT, but some variations on the multiple-choice item,
such as multiple numerical response, graphical modeling, drag-and-drop, multiple
selection, and ordering information might also be scored fairly easily using a
polytomous item response model.
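To make the dichotomous/polytomous distinction concrete, the following Python sketch (ours, not drawn from any of the testing programs discussed here) computes score-category probabilities for a single polytomously scored item under the partial credit model; a dichotomous item is simply the special case with one step difficulty. The ability value and step difficulties are invented for illustration.

```python
import math

def pcm_probabilities(theta, step_difficulties):
    """Category probabilities for one item under the partial credit model.

    theta: examinee ability estimate
    step_difficulties: difficulties b_1..b_m of the m score steps
    Returns probabilities for scores 0..m (they sum to 1).
    """
    # Cumulative sums of (theta - b_j); score 0 corresponds to the empty sum.
    cumulative = [0.0]
    for b in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - b))
    numerators = [math.exp(c) for c in cumulative]
    total = sum(numerators)
    return [n / total for n in numerators]

# A four-category item (scores 0-3) for an examinee of average ability.
print(pcm_probabilities(theta=0.0, step_difficulties=[-1.0, 0.0, 1.0]))
# A dichotomous item reduces to the familiar Rasch probability.
print(pcm_probabilities(theta=0.0, step_difficulties=[0.0]))  # [0.5, 0.5]
```

Under this model, each of the variant formats listed above simply needs its response mapped to an ordered score category before the usual adaptive machinery takes over.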
350 ZENISKY AND SIRECI
A problem in incorporating many of the innovative performance tasks in Table
1 into a CAT format or into a linear CBT, however, is that each examinee provides a
unique response for each item. Thus, the structure and nature of those answers may
vary widely across examinees, and scoring decisions cannot typically be made im-
mediately using mechanical application of limited, explicit criteria (Bennett,
Sebrechts, & Rock, 1991). Legitimate logistical difficulties in automatically scor-
ing these responses might at first seem to preclude the development of CAT perfor-
mance assessments, although a CBT format might be feasible (as scoring occurs at
a later date). However, current developments in psychology, computer science,
communication disorders, and artificial intelligence reveal several promising di-
rections for the future of computerized performance assessment.
An important consideration for automated scoring is the level of constraint
desired in the constructed response. Among the various constructed-response item
types, it is possible to constrain any individual item in such a way that there are
several, or infinitely many, possible answers, just as an item can be written to ensure that there
is only one correct response. Take a graphical modeling problem, for example,
where the task is to model the growth of an interest-bearing account. This
could be highly specific, such that given a base amount of money, an interest rate,
and a length of time, an examinee would use the mouse to graphically represent an
outcome. Alternatively, given certain information the examinee could be asked to
extrapolate future outcomes from current data, and in this case (depending on how
an examinee synthesized the available information) there might be more than one
appropriate way to respond graphically.
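The highly constrained version of this hypothetical interest-account item could be machine-scored directly. The Python sketch below checks examinee-plotted points against the compound-interest curve; all amounts, rates, and the tolerance band are invented for illustration.

```python
def compound_balance(principal, rate, years):
    """Balance of an account with annual compounding: P * (1 + r)**t."""
    return principal * (1.0 + rate) ** years

def score_plot(points, principal, rate, tolerance=0.05):
    """Score examinee-plotted (year, balance) points against the true curve.

    A point counts as correct if its balance is within `tolerance`
    (relative error) of the compound-interest value for that year; the
    item score is the fraction of correctly placed points.
    """
    correct = sum(
        1 for year, balance in points
        if abs(balance - compound_balance(principal, rate, year))
        <= tolerance * compound_balance(principal, rate, year)
    )
    return correct / len(points)

# $1,000 at 5% annual interest; the examinee's year-2 point is slightly off
# but falls inside the 5% tolerance band.
examinee_points = [(1, 1050.0), (2, 1100.0), (3, 1158.0)]
print(score_plot(examinee_points, principal=1000.0, rate=0.05))  # 1.0
```

A less constrained version of the item, in which extrapolation is allowed, would instead need to accept any of several defensible curves, which is where the more elaborate scoring approaches described next become necessary.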
Currently, much of the work on automated scoring has focused on de-
veloping techniques for scoring essays, computer programs, simulated patient in-
teractions, and architectural designs online. The most prominent computer-based
scoring methods, described in the following sections, are summarized in Table 2.
These scoring methods fall into three general categories: essay scoring, expert sys-
tems analysis, and mental modeling.
Automated Scoring of Essays
Essays are perhaps the most common form of constructed responses used in
large-scale assessment, although their use is limited because many testing pro-
grams need two and sometimes three humans to read and evaluate essays accord-
ing to preestablished scoring rubrics. Other constructed-response text-based item
types typically found on paper-and-pencil tests, such as the accounting problems
on the Uniform Certified Public Accountants Examination, are scored in a simi-
larly labor-intensive fashion. Given the expense of extensively training the readers
required to establish score validity, computerized alternatives to human
readers are highly attractive. Research on automated scoring programs and meth-
ods for the most part has demonstrated the comparability of essay scoring across
TABLE 2
Summary of Automated Scoring Programs/Methods

Essays/free-text answers

Project essay grade: Uses a regression model in which the independent variables are surface features of the text (document length, word length, and punctuation) and the dependent variable is the essay score (Page, 1994).

E-rater: Evaluates numerous structural and linguistic features specified in a holistic scoring guide using natural language processing techniques (Burstein & Chodorow, 2002; Burstein et al., 1998).

Latent semantic analysis: A theory and method for extracting and representing the contextual usage of words by statistical computations applied to a large corpus of text (Foltz, Kintsch, & Landauer, 1998; Landauer, Foltz, & Laham, 1998; Landauer et al., 1997).

Text categorization: Evaluates text using automated learning techniques to categorize text documents, where linguistic expressions and contexts extracted from the texts are used to classify texts (Larkey, 1998).

Constructed free response scoring tool: Scores short verbal answers, where examinees key in responses that are pattern-matched to programmed correct responses (Martinez & Bennett, 1992).

Expert systems: The examinee's completed response is compared to a problem-specific knowledge base encoded within the computer's memory banks. The knowledge base is constructed from human content-expert responses that have been coded in a machine-usable form (Bennett & Sebrechts, 1996; Braun et al., 1990; Martinez & Bennett, 1992; Sebrechts, Bennett, & Rock, 1991).

Mental modeling: Elements of the final product are evaluated against the universe of all possible variations using a process that mimics the scoring processes of committees and requires an analysis of the way experts evaluate solutions. Scores can be compared to the results obtained from human raters to assess agreement (Bejar, 1991; Clauser, 2000; Clauser, Harik, & Clyman, 2000; Clauser et al., 1997; Martinez & Bennett, 1992; Williamson, Bejar, & Hone, 1999; Williamson et al., 1998).
human and computer graders (Burstein, Kukich, Wolff, Lu, & Chodorow, 1998;
Rizavi & Sireci, 1999; Yang, Buckendahl, Juszkiewicz, & Bhola, 2002). Currently,
several computerized essay scoring options and methods are available, including
project essay grade (PEG), e-rater, latent semantic analysis, text categorization,
and the constructed-response scoring tool.
Project essay grade.
PEG, developed by Ellis Page in the mid-1960s, was
the first automated essay grading system; the current version evolved from his ear-
lier work (Page, 1994; Page & Peterson, 1995). Like most computerized essay
scoring programs, the specifics of how PEG works are proprietary. However, de-
scriptions of the program suggest that it uses multiple regression to determine the
optimal combination of the surface features of an essay (e.g., average word length,
essay word length, number of uncommon words, number of commas) as well as
complex structure of the essay (e.g., soundness of sentence structure) to best pre-
dict the score that would be assigned by a human grader (Page, 1994; Page &
Peterson, 1995). By assigning weights to these surface and intrinsic features, the
computer attempts to mimic human scoring. Although it is unclear whether PEG is
currently being used in large-scale assessment, it is clear that it set the stage for
other developments in the computerized scoring of essays.
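Because PEG's actual feature set and weights are proprietary, the following Python fragment only sketches the general idea: a score is predicted as a weighted combination of surface features, with weights that would in practice be estimated by regressing human scores on those features. All feature names and coefficients here are invented.

```python
def extract_features(essay):
    """Surface features of the kind PEG reportedly uses."""
    words = essay.split()
    return {
        "length": len(words),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "commas": essay.count(","),
    }

# Invented regression weights; real weights would be fit to human scores.
WEIGHTS = {"length": 0.05, "avg_word_len": 0.4, "commas": 0.2}
INTERCEPT = 0.5

def peg_score(essay):
    """Predict a holistic score as a weighted sum of surface features,
    clamped to a 1-6 holistic scale."""
    features = extract_features(essay)
    raw = INTERCEPT + sum(WEIGHTS[k] * v for k, v in features.items())
    return max(1.0, min(6.0, raw))
```

The design choice is notable: nothing in the model reads the essay for meaning; the surface features serve purely as statistical proxies for the human judgment being predicted.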
E-rater.
E-rater (Burstein et al., 1998; Burstein & Chodorow, 2002) is the
essay scoring system developed by ETS for the essay portion of the Graduate Man-
agement Admission Test (GMAT). It is designed to evaluate numerous structural
and linguistic features specified in a holistic scoring guide. On the GMAT, each
examinee responds to two essay questions, which are scored by both a trained hu-
man grader and an electronic reader. Currently, e-rater serves as the second reader.
If human and e-rater scores on a particular essay differ by more than one point, the
essay is sent to a second human expert, and finally, if consensus is still not reached,
to a final human referee. Thus, the GMAT scoring system provides an example of
how the computer can be used to increase the efficiency of essay scoring while
maintaining the validity of the final scores assigned to an essay.
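The routing logic just described can be sketched as follows. The passage specifies when additional readers are consulted but not how agreeing scores are combined, so the averaging used below is our assumption, and the callables standing in for extra readers are purely illustrative.

```python
def resolve_essay_score(first_human, erater, second_human=None, referee=None):
    """Sketch of the GMAT-style adjudication workflow.

    second_human and referee are callables standing in for additional
    expert readings; they are consulted only when earlier scores
    disagree by more than one point.
    """
    if abs(first_human - erater) <= 1:
        return (first_human + erater) / 2          # scores agree closely
    second = second_human()                        # second human expert reads
    if abs(second - first_human) <= 1:
        return (second + first_human) / 2
    if abs(second - erater) <= 1:
        return (second + erater) / 2
    return referee()                               # final human referee decides

print(resolve_essay_score(4, 5))  # prints 4.5 -- no second reading needed
```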
Latent semantic analysis.
Latent semantic analysis (LSA) is a theory and
method for extracting and representing the contextual usage of words by statisti-
cal computations applied to a large corpus of text (Foltz, Kintsch, & Landauer,
1998; Landauer, Foltz, & Laham, 1998; Rehder et al., 1998). The underlying
idea is that the aggregate of all the word contexts in which a given word does
and does not appear provides a set of mutual constraints that largely determines
the similarity of meaning of words and sets of words to each other. A possible
analogy for LSA could be the way multidimensional scaling allows relationships
between variables to be plotted in n dimensions. In LSA, words can be mapped
into semantic space and distances between words are derived from shadings of
meaning, which are obtained through context. LSA’s algorithm has a learning
component that “reads” through a text and develops an understanding of the sen-
tence or passage by evolving a meaning for each word in relation to all the other
words in the sentence or passage. The LSA system can be “trained” to work in
different content areas by having it electronically read texts relevant to the do-
main of interest. One caveat to the use of LSA: At this time, the algorithm
does not derive sentence meaning from word order, a limitation that examinees
could potentially exploit, although ongoing research addresses this point (Landauer,
Laham, Rehder, & Schreiner, 1997).
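A minimal illustration of the word-by-context representation underlying LSA follows. Real LSA additionally applies singular value decomposition to reduce this matrix to a lower-dimensional semantic space; that defining step is omitted here for brevity, and the toy corpus is invented.

```python
import math

corpus = [
    "the patient showed symptoms of infection",
    "the doctor treated the infection with antibiotics",
    "the essay discussed themes of justice and law",
    "legal scholars debate justice in the law",
]

def context_vector(word):
    """A word's row in the word-by-document matrix: how often it occurs
    in each document. Real LSA would next reduce this matrix via SVD."""
    return [doc.split().count(word) for doc in corpus]

def cosine(u, v):
    """Cosine similarity between two context vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Words that share contexts lie close together in the semantic space.
print(cosine(context_vector("infection"), context_vector("antibiotics")))
print(cosine(context_vector("infection"), context_vector("justice")))  # 0.0
```

Even without the SVD step, the example shows the core intuition: "infection" and "antibiotics" end up similar because they co-occur in the same contexts, not because any dictionary links them.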
Text categorization.
Text categorization is a method for evaluating text that
uses automated learning techniques to categorize documents, where linguistic ex-
pressions and contexts extracted from the texts are used to classify them (Larkey,
1998). This type of analysis is informed by work in areas of machine learning,
Bayesian networks, information retrieval, natural language processing, case-based
reasoning, language modeling, and speech recognition. A number of text categori-
zation algorithms have been developed, incorporating different schema for classi-
fying text. The sorting of verbal content may be related to topic, to specified levels
of quality, or perhaps by keywords. It is interesting that the evaluation of essays is
only one of the many situations in which text categorization techniques have been
applied. These algorithms are also used in sorting documents in databases in an in-
formation retrieval context such as in the code that powers Internet search engines.
One organization with ongoing research into text categorization is the Edinburgh
Language Technology Group (http://www.ltg.ed.ac.uk/papers/class.html), whose
Web site details much of its work on multiple applications of text categorization
methodology.
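As one concrete instance of such an algorithm, the Python sketch below classifies short texts into quality levels with a multinomial naive Bayes model, one of several learning techniques drawn on in this literature; the training texts and labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Invented training set: short texts labeled by holistic quality level.
training = [
    ("clear thesis strong evidence good organization", "high"),
    ("strong argument clear structure good evidence", "high"),
    ("no thesis weak evidence poor organization", "low"),
    ("unclear argument weak structure poor grammar", "low"),
]

word_counts = defaultdict(Counter)
label_counts = Counter()
for text, label in training:
    label_counts[label] += 1
    word_counts[label].update(text.split())
vocab = {w for text, _ in training for w in text.split()}

def classify(text):
    """Pick the label with the highest log posterior under a multinomial
    naive Bayes model with add-one smoothing."""
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / len(training))
        total = sum(word_counts[label].values())
        for word in text.split():
            score += math.log((word_counts[label][word] + 1)
                              / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("clear thesis good evidence"))  # prints "high"
```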
The Constructed Free Response Scoring Tool.
The Constructed Free Re-
sponse Scoring Tool (FRST) is an automated approach to scoring examinees' con-
structed responses online (Martinez & Bennett, 1992). FRST is an algorithm devel-
oped to score short verbal answers, where examinees key in responses that are
pattern-matched to programmed correct responses (Martinez & Bennett, 1992).
FRST has demonstrated a 100% congruence rate with human raters when examinee responses
range in length between 5 and 7 words, and an 88% congruence rate for responses be-
tween 12 and 15 words. In this case, congruence rate is defined as the rate at which
scores assigned to examinee responses by the computer and the human rater exactly
match.
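The pattern-matching idea can be sketched as follows; FRST's actual matching rules are not described in detail in the sources cited, so the normalization and regular-expression matching here are stand-ins, and the item, patterns, and responses are invented.

```python
import re

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def frst_score(response, answer_patterns):
    """Score 1 if the normalized response matches any keyed pattern, else 0."""
    normalized = normalize(response)
    return int(any(re.fullmatch(p, normalized) for p in answer_patterns))

# Keyed variants for the short-answer prompt "What does CPU stand for?"
patterns = [r"central processing unit", r"(the )?cpu", r"central processor"]
print(frst_score("Central Processing Unit.", patterns))  # prints 1
print(frst_score("computer chip", patterns))             # prints 0
```

The reported drop in congruence for longer responses is intuitive under this scheme: the more words an answer contains, the more legitimate variants the keyed patterns must anticipate.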
Expert Systems Analysis
Expert systems analysis provides another example of the use of computers to score
complex examinee responses. Expert systems are computer programs designed to
emulate the scoring behaviors of human content specialists. Expert scoring has
been studied in a number of contexts, including computer programming and math-
ematics problems. For example, PROUST and MicroPROUST are two expert sys-
tems developed to automatically score computer programs that examinees write
using the Pascal language (Braun, Bennett, Frye, & Soloway, 1990; Martinez &
Bennett, 1992). Each system has the knowledge to reason about programming prob-
lems within an intention-based analysis framework. Based on how humans reason
about computer programs, the expert system formulates deep-structure, goal,
and plan representations in the process of trying to identify nonsyntactical errors.
In terms of constructed-response quantitative items, the expert scoring system
known as GIDE produces a series of comments about errors present in examinees’
solutions and then incorporates that information into computation of partial-credit
scores (Bennett & Sebrechts, 1996; Martinez & Bennett, 1992; Sebrechts, Ben-
nett, & Rock, 1991). The expert systems program consults a problem-specific
knowledge base constructed from human content-expert responses that are coded
in a machine-usable form. The examinee responses are broken down into compo-
nent parts, and each piece is evaluated against multiple programmed alternatives.
Here, analysis has shown that reasonable machine-rater congruence can be ob-
tained (e.g., a .86 correlation between the scores assigned to solutions by a machine
and by a human rater; Martinez & Bennett, 1992). Interestingly, research into
GIDE, PROUST, and MicroPROUST expert systems scoring mechanisms sug-
gests that although each is highly accurate at classifying examinee responses as
correct or incorrect, they are less able to provide specific diagnostic information
about examinee errors.
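A heavily simplified sketch of this kind of partial-credit scoring follows; the step decomposition and knowledge base shown are invented, and the real GIDE encoding is far richer than set membership.

```python
def partial_credit(solution_steps, knowledge_base):
    """Compare each step of a worked solution to its keyed alternatives.

    knowledge_base is a list (one entry per expected step) of sets of
    acceptable forms. Credit is the fraction of steps that match some
    keyed alternative; unmatched steps generate diagnostic comments.
    """
    earned = 0
    comments = []
    for step, alternatives in zip(solution_steps, knowledge_base):
        if step in alternatives:
            earned += 1
        else:
            comments.append(f"unexpected step: {step!r}")
    return earned / len(knowledge_base), comments

# Accounting-style problem keyed with acceptable forms for each step.
kb = [
    {"revenue - expenses", "expenses subtracted from revenue"},
    {"net_income * tax_rate"},
    {"net_income - tax"},
]
score, comments = partial_credit(
    ["revenue - expenses", "net_income * tax_rate", "net_income + tax"], kb)
print(score)     # 2 of 3 steps matched
print(comments)  # one diagnostic comment for the faulty last step
```

Note how the structure mirrors the finding above: deciding whether a step matches (for credit) is easy, while turning an unmatched step into a genuinely diagnostic comment is the hard part.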
Mental Modeling
An additional approach to computerized scoring of complex performance tasks is
mental modeling, which is currently used to score portions of the ARE. The perfor-
mance tasks on the ARE are often cited as highly interactive and innovative exam-
ples of the possibilities that exist for automated scoring in computer-based testing.
Examinees are presented with architectural design tasks given various constraints
and, in effect, create blueprints for buildings during the testing session. Each prob-
lem is graded on four attributes (grammatical, code compliance, diagrammatic
compliance, and design logic and efficiency) using feature-extraction analysis,
in which elements of the final product are evaluated against the universe of all
possible variations (Bejar, 1991; Martinez & Bennett, 1992). The elements ex-
tracted from an examinee's constructed response are compared with this universe
using a procedure that mimics the scoring processes of committees and requires an
analysis of the way experienced experts evaluate solutions (Williamson, Bejar, &
Hone, 1999). This “mental modeling” approach to scoring, done by computers,
can be compared to the results obtained from human raters to assess the extent to
which these methodologies agree on results.
In addition to being used on the ARE, the National Board of Medical Examiners
has incorporated the mental model algorithm into its patient care simulations
(Clauser, Harik, & Clyman, 2000; Clauser et al., 1997). Each action that exam-
inees key in is classified either as benefiting the simulated patient or as an inap-
propriate action carrying some level of risk. Feature-extraction analysis and
mental modeling may be applicable as a medium for the automated scoring of es-
says as well, where the features could be specified as components of an essay, such
as sentences.
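In outline, such a feature-extraction scorer evaluates each graded attribute with rules distilled from expert judgment. The Python sketch below is purely illustrative: the attribute names, features, and thresholds are invented, not those of the ARE.

```python
def mental_model_score(extracted_features, expert_rules):
    """Evaluate each graded attribute of a design with expert-derived rules.

    extracted_features: feature name -> value pulled from the final design
    expert_rules: attribute name -> list of predicates over the features
    An attribute passes only if every one of its rules is satisfied,
    mimicking a committee that must be convinced on each point.
    """
    return {attribute: all(rule(extracted_features) for rule in attribute_rules)
            for attribute, attribute_rules in expert_rules.items()}

# Invented rules and features -- not the ARE's actual criteria.
rules = {
    "code_compliance": [lambda f: f["exit_count"] >= 2,
                        lambda f: f["corridor_width_in"] >= 44],
    "design_logic": [lambda f: f["circulation_ratio"] <= 0.35],
}
design = {"exit_count": 2, "corridor_width_in": 48, "circulation_ratio": 0.30}
verdicts = mental_model_score(design, rules)
print(verdicts)  # both attributes judged acceptable
```

Agreement checking of the kind described above would then compare these attribute-level verdicts to those of human committees over many designs.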
The various algorithms for automatically scoring constructed responses repre-
sent an especially exciting direction for computerized assessment practices. Using
computers in this regard will help to improve the extent to which uniformity and
precision in scoring rules can be implemented (Clauser et al., 2000). As a result,
test users and examinees alike may develop greater confidence in the inferences
about domain proficiency made on the basis of test scores. Likewise, as Bennett
(1998) mentioned, delivery efficiency will improve. In consequence, performance
tasks that can be automatically evaluated will become more logistically and practi-
cally feasible for use in high-stakes credentialing.
DIRECTIONS FOR FUTURE RESEARCH
The incorporation of different response actions, task formats, multimedia prompts,
and reference materials into test items has the potential to substantially increase
the types of skills, abilities, and processes that can be measured. Likewise, the im-
plementation of automated scoring methods can greatly facilitate the processing of
examinee responses. The potential benefits of incorporating these innovations and
tasks to testing are great, but such benefits cannot fully materialize without further
research on a number of psychometric and operational concerns (particularly with
regard to many of the emerging item types, as found by Zenisky & Sireci, 2001).
Technology-related dimensions of test format, administration, and scoring cannot
be accepted without due psychometric scrutiny.
In terms of integrating innovations in task presentation into more large-scale as-
sessments, there are several critical directions for research. The intricacy of the
task and how it relates to the skill(s) being assessed in a given testing context is an
issue of central importance that must be rigorously evaluated (Crocker, 1997).
Examinees should not be overwhelmed with innovative item types and item fea-
tures that are extraneous to the task. To this end, the relative simplicity or complex-
ity of the user interface should remain a fundamental concern for test developers,
especially in light of the potential for gadgetry to eclipse real technological bene-
fits. Oftentimes, extended tutorials may be necessary to sufficiently familiarize
examinees with novel testing tasks, and thus use of these item types may require sub-
stantial testing time and development resource commitments from test developers.
Further research specific to different types and task presentation variables can help
determine the kinds of preparation and tutorials necessary for this purpose.
Practical validity concerns in CBT include the adequacy of construct represen-
tation (Huff & Sireci, 2001; Kane, Crooks, & Cohen, 1999; Messick, 1995) and
task generalizability (Brennan & Johnson, 1995; Linn & Burton, 1994; Shavelson,
Baxter, & Pine, 1991). Issues of task specificity such as the relative number of
tasks and the extent to which examinee performance can be generalized from the
selected tasks (Guion, 1995) are additional concerns. Equally important are stud-
ies to determine potential sources of construct-irrelevant variance associated with
such item types (Huff & Sireci, 2001). Furthermore, work to evaluate adverse im-
pact for different subgroups of examinee populations has by and large not been
completed for most emerging item types. This problem needs to be addressed in
future research.
Automated scoring must be evaluated with respect to potential losses in score
validity, perhaps in the direction of multitrait–multimethod analyses (Clauser,
2000; Yang et al., 2002). The emerging area of multidimensional item response
theory (IRT) models may provide some interesting ways for scoring complex con-
structed responses (see Ackerman, 1994, and van der Linden & Hambleton, 1997,
for further information on multidimensional IRT). Preliminary research suggests
that compromises in reliability and information per minute of testing time may
occur when complex, computerized constructed-response item types are used
(Jodoin, 2001), so further research in the areas of reliability and test and item infor-
mation should be accelerated.
CONCLUSIONS
Technological advances in CBT represent positive future directions for the evalua-
tion of complex performances within large-scale testing programs, especially
given the escalating use of technology in many aspects of everyday life. Ex-
aminees value the opportunity to demonstrate what they know when tasks on a
test more faithfully relate to the skills necessary for a particular domain, and these
methods may provide test users with ways to acquire information about an
examinee’s proficiency on a given knowledge or skill area more directly (Kane,
1992). As psychometric research related to computerized performance assessment
is completed, the application of empirical findings on the fusion of technology
and measurement will continue to have a positive impact on assessment practices.
The overview presented in this article provides a broad (although not exhaustive)
description of numerous recent technological developments in computer-based
assessment.
Many of these innovations have both strengths and weaknesses from practical and
psychometric perspectives, and thus enthusiasm for these emerging measurement
methods must be tempered by scientific wariness about their technical characteris-
tics. Still, although it is difficult to predict the future, many of these CBT innova-
tions are likely to dramatically change the testing experience for many examinees
who sit for assessments in a wide variety of testing contexts, including certification
and licensure, admissions, and achievement testing.
ACKNOWLEDGMENTS
Laboratory of Psychometric and Evaluative Research Report No. 383, School of
Education, University of Massachusetts, Amherst. This research was funded in
part by the American Institute of Certified Public Accountants (AICPA). We are
grateful for this support. The opinions expressed in this article are ours and do not
represent official positions of the AICPA.
REFERENCES
Ackerman, T. A. (1994). Using multidimensional item response theory to understand what items and
tests are measuring. Applied Measurement in Education, 7, 255–278.
Ackerman, T. A., Evans, J., Park, K., Tamassia, C., & Turner, R. (1999). Computer assessment using vi-
sual stimuli: A test of dermatological skin disorders. In F. Drasgow & J. B. Olson-Buchanan (Eds.),
Innovations in computerized assessment (pp. 137–150). Mahwah, NJ: Lawrence Erlbaum Associ-
ates, Inc.
American College Testing Program. (2000). Calculators and the ACT math test. Retrieved March 20,
2000, from http://www.act.org/aap/taking/calculator.html
Ashworth, S. D., & Joyce, T. M. (1994, April). Developing scoring protocols for a computerized multi-
media in-basket exercise. Paper presented at the Ninth Annual Conference of the Society for Indus-
trial and Organizational Psychology, Nashville, TN.
Bejar, I. (1991). A methodology for scoring open-ended architectural design problems. Journal of Ap-
plied Psychology, 76, 522–532.
Bennett, R. E. (1998). Reinventing assessment: Speculations on the future of large-scale educational
testing. Princeton, NJ: Educational Testing Service.
Bennett, R. E., Morley, M., & Quardt, D. (2000). Three response types for broadening the conception of
mathematical problem solving in computerized-adaptive tests. Applied Psychological Measurement,
24, 294–309.
Bennett, R. E., Morley, M., Quardt, D., & Rock, D. A. (2000). Graphical modeling: A new response
type for measuring the qualitative component of mathematical reasoning. Applied Measurement in
Education, 13, 303–322.
Bennett, R. E., Morley, M., Quardt, D., Rock, D. A., Singley, M. K., Katz, I. R., & Nhouyvanisvong, A.
(1999). Psychometric and cognitive functioning of an under-determined computer-based response
type for quantitative reasoning. Journal of Educational Measurement, 36, 233–252.
Bennett, R. E., & Rock, D. A. (1995). Generalizability, validity, and examinee perceptions of a com-
puter-delivered formulating-hypotheses test. Journal of Educational Measurement, 32, 19–36.
Bennett, R. E., & Sebrechts, M. M. (1996). The accuracy of expert-system diagnoses of mathematical
problem solutions. Applied Measurement in Education, 9, 133–150.
Bennett, R. E., & Sebrechts, M. M. (1997). A computer-based task for measuring the representational
component of quantitative proficiency. Journal of Educational Measurement, 34, 64–78.
Bennett, R. E., Sebrechts, M. M., & Rock, D. A. (1991). Expert system scores for complex con-
structed-response quantitative items: A study of convergent validity. Applied Psychological Mea-
surement, 15, 227–239.
Bennett, R. E., Steffen, M., Singley, M. K., Morley, M., & Jacquemin, D. (1997). Evaluating an auto-
matically scorable, open-ended response type for measuring mathematical reasoning in com-
puter-adaptive tests. Journal of Educational Measurement, 34, 162–176.
Braun, H. I., Bennett, R. E., Frye, D., & Soloway, E. (1990). Scoring constructed responses using expert
systems. Journal of Educational Measurement, 27, 93–108.
Breland, H. M. (1999). Exploration of an automated editing task as a GRE writing measure (RR–99–9).
Princeton, NJ: Educational Testing Service.
Brennan, R. L., & Johnson, E. G. (1995). Generalizability of performance assessments. Educational
Measurement: Issues and Practice, 14(4), 25–27.
Burstein, J., & Chodorow, M. (2002). Directions in automated essay scoring analysis. In R. Kaplan (Ed.),
Oxford handbook of applied linguistics (pp. 487–497). Oxford, England: Oxford University Press.
Burstein, J., Kukich, K., Wolff, S., Lu, C., & Chodorow, M. (1998, April). Computer analysis of essays.
Paper presented at the NCME Symposium on Automated Scoring, San Diego, CA.
Carey, P. (2001, April). Overview of current computer-based TOEFL. Paper presented at the annual
meeting of the National Council on Measurement in Education, Seattle, WA.
Chung, G. K. W. K., O’Neil, H. F., Jr., & Herl, H. E. (1999). The use of computer-based collaborative
knowledge mapping to measure team processes and outcomes. Computers in Human Behavior, 15,
463–494.
Clauser, B. E. (2000). Recurrent issues and recent advances in scoring performance assessments. Ap-
plied Psychological Measurement, 24, 310–324.
Clauser, B. E., Harik, P., & Clyman, S. G. (2000). The generalizability of scores for a performance as-
sessment scored with a computer-automated scoring system. Journal of Educational Measurement,
37, 245–261.
Clauser, B. E., Margolis, M. J., Clyman, S. G., & Ross, L. P. (1997). Development of automated scoring
algorithms for complex performance assessments: A comparison of two approaches. Journal of Edu-
cational Measurement, 34, 141–161.
College Board. (2000a). AP Calculus for the new century. Retrieved March 20, 2000, from http://www.collegeboard.org/index_this/ap/calculus/new_century/evolution.html
College Board. (2000b). Calculators. Retrieved March 20, 2000, from http://www.collegeboard.org/index_this/sat/center/html/counselors/prep009.html
Crocker, L. (1997). Assessing content representativeness of performance assessment exercises. Ap-
plied Measurement in Education, 10, 83–95.
Davey, T., Godwin, J., & Mittelholtz, D. (1997). Developing and scoring an innovative computerized
writing assessment. Journal of Educational Measurement, 34, 21–42.
Desmarais, L. B., Dyer, P. J., Midkiff, K. R., Barbera, K. M., Curtis, J. R., Esrig, F. H., & Masi, D. L.
(1992, May). Scientific uncertainties in the development of a multimedia test: Trade-offs and deci-
sions. Paper presented at the Seventh Annual Conference of the Society for Industrial and Organiza-
tional Psychology, Montreal, Quebec, Canada.
Drasgow, F., Olson-Buchanan, J. B., & Moberg, P. J. (1999). Development of an interactive video as-
sessment: Trials and tribulations. In F. Drasgow & J. B. Olson-Buchanan (Eds.), Innovations in com-
puterized assessment (pp. 197–219). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Educational Testing Service. (1993). Tests at a glance: Praxis I: Academic Skills Assessment. Prince-
ton, NJ: Author.
Educational Testing Service. (2000). The Praxis Series: Professional Assessments for Beginning Teachers: Tests and test dates. Retrieved March 20, 2000, from http://www.teachingandlearning.org/licensure/praxis/prxtest.html
Enright, M. K., Rock, D. A., & Bennett, R. E. (1998). Improving measurement for graduate admissions.
Journal of Educational Measurement, 35, 250–267.
Fitzgerald, C. (2001, April). Rewards and challenges of implementing an innovative CBT certification
exam program. Paper presented at the annual meeting of the National Council on Measurement in
Education, Seattle, WA.
Foltz, P. W., Kintsch, W., & Landauer, T. K. (1998). The measurement of textual coherence with latent
semantic analysis. Discourse Processes, 25, 285–307.
Foster, D., Olsen, J. B., Ford, J., & Sireci, S. G. (1997, March). Administering computerized certifica-
tion exams in multiple languages: Lessons learned from the international marketplace. Paper pre-
sented at the meeting of the American Educational Research Association, Chicago.
French, A., & Godwin, J. (1996). Using multimedia technology to create innovative items. Paper pre-
sented at the annual meeting of the American Educational Research Association, New York.
Glaser, R. (1991). Expertise and assessment. In M. C. Wittrock & E. L. Baker (Eds.), Testing and cogni-
tion (pp. 17–30). Englewood Cliffs, NJ: Prentice Hall.
Guion, R. M. (1995). Comments on values and standards in performance assessments. Educational
Measurement: Issues and Practice, 14(4), 25–27.
Hambleton, R. K. (1997, October). Promising GMAT item formats for the 21st century. Invited presen-
tation at the international workshop on the GMAT, Paris, France.
Herl, H. E., O’Neil, H. F., Jr., Chung, G. K. W. K., & Schacter, J. (1999). Reliability and validity of a
computer-based knowledge mapping system to measure content understanding. Computers in Hu-
man Behavior, 15, 315–333.
Huff, K. L., & Sireci, S. G. (2001). Validity issues in computer-based testing. Educational Measure-
ment: Issues and Practice, 20(3), 16–25.
Jamieson, J., Taylor, C., Kirsch, I., & Eignor, D. (1998). Design and evaluation of a computer-based
TOEFL tutorial. System, 26, 485–513.
Jodoin, M. G. (2001, April). An empirical examination of IRT information for innovative item formats
in a computer-based certification testing program. Paper presented at the annual meeting of the Na-
tional Council on Measurement in Education, Seattle, WA.
Kane, M. T. (1992). The assessment of professional competence. Evaluation and the Health Profes-
sions, 15, 163–182.
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measure-
ment: Issues and Practice, 18(2), 5–17.
Kaplan, R. M., & Bennett, R. E. (1994). Using the free-response scoring tool to automatically score the
formulating hypotheses item (ETS Research Report No. 94–08). Princeton, NJ: Educational Testing
Service.
Klein, D. C. D., O’Neil, H. F., Jr., & Baker, E. L. (1998). A cognitive demands analysis of innovative
technologies (CSE Tech. Rep. No. 454). Los Angeles, CA: UCLA, National Center for Research on
Evaluation, Student Standards, and Testing.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Dis-
course Processes, 25, 259–284.
Landauer, T. K., Laham, D., Rehder, B., & Schreiner, M. E. (1997). How well can passage meaning be
derived without using word order? A comparison of latent semantic analysis and humans. In G.
Shafto & P. Langley (Eds.), Proceedings of the 19th annual meeting of the Cognitive Science Society
(pp. 412–417). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Larkey, L. S. (1998). Automatic essay grading using text categorization techniques. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 90–95). Melbourne, Australia.
Linn, R. L., & Burton, E. (1994). Performance-based assessment: Implications of task specificity. Educational Measurement: Issues and Practice, 13(1), 5–8, 15.
Luecht, R. M. (2001, April). Capturing, codifying, and scoring complex data for innovative, computer-based items. Paper presented at the annual meeting of the National Council on Measurement in Education, Seattle, WA.
Martinez, M. E. (1991). A comparison of multiple-choice and constructed figural response items. Journal of Educational Measurement, 28, 131–145.
Martinez, M. E., & Bennett, R. E. (1992). A review of automatically scorable constructed-response item types for large-scale assessment. Applied Measurement in Education, 5, 151–169.
Messick, S. (1995). Standards of validity and the validity of standards in performance assessment. Educational Measurement: Issues and Practice, 14(4), 5–9.
Microsoft Corporation. (1998, September). Procedures and guidelines for writing Microsoft Certification Exams. Redmond, WA: Author.
Mills, C. (2000, February). Unlocking the promise of CBT. Keynote address presented at a conference of the Association of Test Publishers, Carmel Valley, CA.
Mislevy, R. J., Steinberg, L. S., Breyer, F. J., Almond, R. G., & Johnson, L. (1999, September 16–17). Making sense of data from complex assessments. Paper presented at the 1999 CRESST Conference, Los Angeles, CA.
National Council of Architectural Registration Boards. (2000). ARE practice program [Computer software]. Retrieved March 20, 2000, from http://www.ncarb.org/are/tutorial2.html
Nhouyvanisvong, A., Katz, I. R., & Singley, M. K. (1997). Toward a unified model of problem solving in well-determined and under-determined algebra word problems. Paper presented at the annual meeting of the American Educational Research Association, Chicago.
O'Neil, K., & Folk, V. (1996, April). Innovative CBT item formats in a teacher licensing program. Paper presented at the annual meeting of the National Council on Measurement in Education, New York.
Page, E. B. (1994). Computer grading of student prose, using modern concepts and software. Journal of Experimental Education, 62(2), 127–142.
Page, E. B., & Petersen, N. S. (1995). The computer moves into essay grading: Updating the ancient test. Phi Delta Kappan, 76, 561–565.
Parshall, C. G., & Balizet, S. (2001). Audio computer-based tests (CBTs): An initial framework for the use of sound in computerized tests. Educational Measurement: Issues and Practice, 20(2), 5–15.
Parshall, C. G., Davey, T., & Pashley, P. (2000). Innovative item types for computerized testing. In W. J. van der Linden & C. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 129–148). Boston: Kluwer Academic.
Rehder, B., Schreiner, M. E., Wolfe, M. B., Laham, D., Landauer, T. K., & Kintsch, W. (1998). Using latent semantic analysis to assess knowledge: Some technical considerations. Discourse Processes, 25, 337–354.
Rizavi, S., & Sireci, S. G. (1999). Comparing computerized and human scoring of WritePlacer essays (Laboratory of Psychometric and Evaluative Research Report No. 354). Amherst: School of Education, University of Massachusetts.
Sebrechts, M. M., Bennett, R. E., & Rock, D. A. (1991). Agreement between expert system and human raters' scores on complex constructed-response quantitative items. Journal of Applied Psychology, 76, 856–862.
Shavelson, R. J., Baxter, G., & Pine, J. (1991). Performance assessment in science. Applied Measurement in Education, 4, 347–362.
Taylor, C., Jamieson, J., Eignor, D., & Kirsch, I. (1998). The relationship between computer familiarity and performance on computer-based TOEFL test tasks (ETS Research Report No. 98–08). Princeton, NJ: Educational Testing Service.
van der Linden, W. J., & Hambleton, R. K. (Eds.). (1997). Handbook of modern item response theory. New York: Springer.
Vispoel, W. P. (1999). Creating computerized adaptive tests of music aptitude: Problems, solutions, and future directions. In F. Drasgow & J. B. Olson-Buchanan (Eds.), Innovations in computerized assessment (pp. 151–176). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Walker, G., & Crandall, J. (1999, February). Value added by computer-based TOEFL test [TOEFL briefing]. Princeton, NJ: Educational Testing Service.
Williamson, D. M., Bejar, I. I., & Hone, A. S. (1999). 'Mental model' comparisons of automated and human scoring. Journal of Educational Measurement, 36, 158–184.
Williamson, D. M., Hone, A. S., Miller, S., & Bejar, I. I. (1998, April). Classification trees for quality control processes in automated constructed response scoring. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
Yang, Y., Buckendahl, C. W., Juszkiewicz, P. I., & Bhola, D. S. (2002/this issue). A review of strategies for validating computer automated scoring. Applied Measurement in Education, 15, 391–412.
Zenisky, A. L., & Sireci, S. G. (2001). Feasibility review of selected performance assessment item types for the computerized Uniform CPA Exam (Laboratory of Psychometric and Evaluative Research Report No. 405). Amherst: School of Education, University of Massachusetts.