Technological Innovations
in Large-Scale Assessment
April L. Zenisky and Stephen G. Sireci
Center for Educational Assessment, School of Education
University of Massachusetts Amherst
Computers have had a tremendous impact on assessment practices over the past half
century. Advances in computer technology have substantially influenced the ways in
which tests are made, administered, scored, and reported to examinees. These
changes are particularly evident in computer-based testing, where the use of comput-
ers has allowed test developers to re-envision what test items look like and how they
are scored. By integrating technology into assessments, it is increasingly possible to
create test items that can sample as broad or as narrow a range of behaviors as needed
while preserving a great deal of fidelity to the construct of interest. In this article we
review and illustrate some of the current technological developments in com-
puter-based testing, focusing on novel item formats and automated scoring method-
ologies. Our review indicates that a number of technological innovations in perfor-
mance assessment are increasingly being researched and implemented by testing
programs. In some cases, complex psychometric and operational issues have suc-
cessfully been dealt with, but a variety of substantial measurement concerns associ-
ated with novel item types and other technological aspects impede more widespread
use. Given emerging research, however, there appears to be vast potential for expand-
ing the use of more computerized constructed-response type items in a variety of test-
ing contexts.
The rapid evolution of computers and computing technology has played a critical
role in defining current measurement practices. Many tests are now administered
on a computer, and a number of psychometric software programs are widely used
to facilitate all aspects of test development and analysis. Such technological ad-
vances provide testing programs with many new tools with which to build tests and
understand assessment data from a variety of perspectives and traditions. The potential for computers to influence testing is not yet exhausted, however, as the needs and interests of testing programs continue to evolve.

APPLIED MEASUREMENT IN EDUCATION, 15(4), 337–362
Copyright © 2002, Lawrence Erlbaum Associates, Inc.
Requests for reprints should be sent to April L. Zenisky, School of Education, 156 Hills South, University of Massachusetts, Amherst, MA 01003–4140. E-mail: azenisky@educ.umass.edu
Test users increasingly express interest in assessing skills that can be difficult to
fully tap using traditional paper-and-pencil tests. As the potential for integrating
technology into task presentation and response collection has become more of a
practical reality, a variety of innovative computerized constructed-response item
types have emerged. Many of these new types call for reconceptualization of what
examinee responses look like, how they are entered into the computer, and how
they are scored (Bennett, 1998). This is good news for test users in all testing con-
texts, as a greater selection of item types may allow test developers to increase the
extent to which tasks on a test approximate the knowledge, skills, and abilities of
interest.
The purpose of this article is to review the current advances in computer-based
assessment, including innovative item types, response technologies, and scoring
methodologies. Each of these topics defines an area where applications of technol-
ogy are rapidly evolving. As research and practical implementation continue,
these emerging assessment methods are likely to significantly alter measurement
practices. In this article we provide an overview of the recent developments in task
presentation and response scoring algorithms that are currently used or have the
potential for use in large-scale testing.
INNOVATIONS IN TASK PRESENTATION
Response Actions
In developing test items for computerized performance assessment, one critical
component for test developers to think about is the format of the response an
examinee is to provide. It may at first seem backward to think about examinee re-
sponse formats before item stems, but how answers to test questions are structured
has an obvious bearing on the nature of the information being collected. Thus, as
the process of designing assessment tasks gets underway, some reflection on the
inferences to be made on the basis of test scores and how best to collect that data is
essential.
To this end, an understanding of what Parshall, Davey, and Pashley (2000)
termed response action and what Bennett, Morley and Quardt (2000) described as
response type may be helpful. Prior to actually constructing test items, some con-
sideration of the type of responses desired from examinees and the method by
which the responses could be entered can help a test developer to discern the kinds
of item types that might provide the most useful and construct-relevant informa-
tion about an examinee (Mislevy, Steinberg, Breyer, Almond, & Johnson, 1999).
In a computer-based testing (CBT) environment, the precise nature of the information that test developers would like examinees to provide might be best expressed
in one of several ways. For example, examinees could be required to type text-
based responses, enter numerical answers via a keyboard or by clicking onscreen
buttons, or manipulate or synthesize information on a computer screen in some
way (e.g., use a mouse to direct an onscreen cursor to click on text boxes,
pull-down menus, or audio or video prompts). The mouse can also be used to draw
onscreen images as well as to “drag-and-drop” objects.
The keyboard and mouse are the input devices most familiar to examinees and
are the ones overwhelmingly implemented in current computerized testing appli-
cations, but response actions in a computerized context are not exclusively limited
to the keyboard and mouse. Pending additional research, some additional input de-
vices by which examinees’ constructed responses could one day be collected in-
clude touch screens, light pens, joysticks, trackballs, speech recognition software
and microphones, and pressure-feedback (haptic) devices (Parshall, Davey, &
Pashley, 2000). Each of these emerging methods represents an inventive way by
which test developers and users can gather different pieces of information about
examinee skills. However, at this time these alternate data collection mechanisms
are largely in experimental stages and are not yet implemented as part of many (if
any) novel item types. For this reason, we focus on emerging item types that use
the keyboard, mouse, or both for collecting responses from examinees.
Novel Item Types
For many testing programs, the familiar item types currently in use such as multi-
ple-choice and essay items provide sufficient measurement information for the
kinds of decisions being made on the basis of test scores. However, a substantial
number of increasingly interactive item types that may increase measurement in-
formation are now available, and some are being used operationally. This prolifer-
ation of item types has largely come about in response to requests from test con-
sumers for assessments aligned more closely with the constructs or skills being
assessed. Although many of these newer item types were developed for specific
testing applications such as licensure, certification, or graduate admissions testing,
it is possible to envision each of these item types being adapted in countless ways
to assess different constructs as needed by a particular testing program.
Numerous computerized item types have emerged over the past decade, so it is
virtually impossible to illustrate and describe them all in a single article. Neverthe-
less, we conducted a review of the psychometric literature and of test publishers’
Web sites and selected several promising item types for presentation and discus-
sion. A nonexhaustive list of 21 of these item types is presented in Table 1, along
with a brief description of each type and some relevant references. Some of the
item types listed in Table 1 are being used operationally, whereas others have been
only proposed for use.
TABLE 1
Computerized Performance Assessment Item Types

Drag-and-drop (select-and-place): Given a scenario or problem, examinees click and drag an object to the center of the appropriate answer field (see Figure 1). [Fitzgerald (2001); Luecht (2001); Microsoft Corporation (1998)]

Graphical modeling: Examinees use line and curve tools to sketch a given situation on a grid. [Bennett, Morley, & Quardt (2000); Bennett, Morley, Quardt, & Rock (2000)]

Move figure or symbols in/into pictographs: Examinees manipulate elements of a chart or graph to represent certain situations or adjust or complete the image as necessary (e.g., extending bars in a bar chart; see Figure 2). [Educational Testing Service (1993); French & Godwin (1996); Martinez (1991)]

Drag and connect/Specifying relationships: Given presented objects, examinees identify the relationship(s) that exist between pairs of objects (see Figure 3). [Fitzgerald (2001); Luecht (2001)]

Concept mapping: Examinees demonstrate knowledge of interrelationships between data points by graphically representing onscreen images and text using links and nodes. [Chung, O’Neil, & Herl (1999); Klein, O’Neil, & Baker (1998)]

Sorting task: Given prototypes, examinees look for surface or deep structural similarities between presented items and prototypes and match items with prototype categories. [Bennett & Sebrechts (1997); Glaser (1991)]

Ordering information (create-a-tree): Examinees sequence events as required by the item stem (e.g., largest to smallest, most to least probable cause of event, if–then; see Figure 4). [Educational Testing Service (1993); Fitzgerald (2001); Luecht (2001); Walker & Crandall (1999)]

Inserting text: Examinees drag and drop text into a passage as directed by the item stem (e.g., where it makes sense, serves as example of observation). [Educational Testing Service (1993); Taylor, Jamieson, Eignor, & Kirsch (1998)]

Passage editing: Examinees edit a short onscreen passage by moving the cursor to various points in a passage and selecting sentence rewrites from a list of alternatives on a drop-down menu. [Breland (1999); Davey, Godwin, & Mittelholtz (1997)]

Highlighting text: Examinees read a passage and select specific sentence(s) in the passage (e.g., main idea, particular piece of information). [Carey (2001); Taylor, Jamieson, Eignor, & Kirsch (1998); Walker & Crandall (1999)]

Capturing or selecting frames/Shading: Given directions or parameters, examinees use the mouse to select a portion of a picture, map, or graph. [Hambleton (1997); O’Neil & Folk (1996)]

Mathematical expressions: Examinees generate and type in a unique expression to represent a mathematical relationship. [Bennett, Morley, & Quardt (2000); Bennett et al. (1997); Educational Testing Service (1993); Martinez & Bennett (1992)]

Numerical equations: Examinees complete numerical sentences by entering numbers and mathematical symbols in a text box. [Hambleton (1997)]

Multiple numerical response: Examinees type in more than one numerical answer (e.g., complete a tax form, insert numbers into a spreadsheet). [Hambleton (1997)]

Multiple selection: Examinees are presented with a stimulus (visual, audio, text) and select answer(s) from a list (answers may be used more than once in a series of questions). [Ackerman, Evans, Park, Tamassia, & Turner (1999); Mills (2000)]

Analyzing situations: Examinees are provided with visual/audio clips and short informational text and are asked to make a diagnosis/decision. Response could be free-text entry or extended matching. [Ackerman, Evans, Park, Tamassia, & Turner (1999)]

Generating examples: Examinees create examples given certain situations or constraints; there is more than one correct answer. Response is free-text entry. [Bennett, Morley, & Quardt (2000); Bennett et al. (1999); Enright, Rock, & Bennett (1998); Nhouyvanisvong, Katz, & Singley (1997)]

Generating multiple solutions/Formulating hypotheses: Given a situation, examinees generate plausible solutions or explanations. Response is free-text entry (see Figure 5). [Bennett & Rock (1995); Kaplan & Bennett (1994)]

Essay/Short answer: May be restricted or extended length. [Burstein et al. (1998); Rizavi & Sireci (1999)]

Problem-solving vignettes: Problem-solving situations (vignettes) are presented to examinees, who are graded on features of a product. [Bejar (1991); Fitzgerald (2001); Luecht (2001); Williamson, Bejar, & Hone (1999); Williamson, Hone, Miller, & Bejar (1998)]

Sequential problem solving/Role play: Examinees provide a series of responses as a dynamic situation unfolds. Scoring attends to process and product. [Clauser, Harik, & Clyman (2000); Clauser et al. (1997)]
Our discussion of novel item types begins with those items requiring use of the
mouse for different onscreen actions as methods for data collection. Some of these
item types bear greater resemblance to traditional selected response item types and
are easier to score mechanically, whereas others integrate technology in more in-
ventive ways. After introducing these item types, we turn to those item types in-
volving text-based responses. Last, we focus on items with more complex ex-
aminee responses that expand the concept of what responses to test items look like
in fundamental ways and pose more difficult challenges for automated scoring.
Item types requiring use of a mouse.
Many of the emerging computer-
based item types take advantage of the way in which people interact with a com-
puter, specifically via a keyboard and mouse. The mouse and onscreen cursor pro-
vide a flexible mechanism by which items can be manipulated. Using a mouse,
pull-down menus, and arrow keys, examinees can highlight text, drag-and-drop
text and other stimuli, create drawings or graphics, or point to critical features of an
item. An example of a drag-and-drop item (also called a select-and-place item) is
presented in Figure 1. This item type is used on a number of the Microsoft certification examinations (Fitzgerald, 2001; Microsoft Corporation, 1998). These items can be scored right/wrong or using partial credit.

FIGURE 1 Example of drag-and-drop item type.
The graphical modeling item type also uses the drag-and-drop capability of a
computer. This item type requires examinees to sketch out situations graphically
using onscreen line tools, curve tools, or both (Bennett, Morley, & Quardt, 2000;
Bennett, Morley, Quardt, & Rock, 2000). A similar item type using drag-and-drop
technology is the move figures or symbols into pictographs item type, which is pre-
sented in Figure 2. This item type requires examinees to drag a shape and position
it on a grid given certain parameters or constraints in the item stem (French &
Godwin, 1996; Martinez, 1991).
A variation of the drag-and-drop item is the drag-and-connect item type. This
item type presents examinees with several movable objects that can be arranged in
several different target locations onscreen. For example, when all the objects are correctly assembled (i.e., sequenced or organized accurately), a network would function correctly or a flowchart would appropriately illustrate a network protocol. An extension of this item type is the specifying relationships item type in
which examinees move objects around onscreen and link them in a flowchart by
way of clicking relationships such as “one to one,” “many to one,” or “one to zero”
(Fitzgerald, 2001; Luecht, 2001). An example of this item type is presented in Figure 3.

FIGURE 2 Example of moving figures or symbols in/into a pictograph item type.

Another item type that can be used to assess the understanding of relationships is the concept map item type. Having an examinee create onscreen concept
maps can allow for relationships between items or pieces of information to be il-
lustrated (Chung, O’Neil, & Herl, 1999; Herl, O’Neil, Chung, & Schacter, 1999;
Klein, O’Neil, & Baker, 1998).
Items that delve into assessing ordering and sorting information are increas-
ingly using drag-and-drop action. Bennett and Sebrechts’ (1997) sorting task item type (also studied by Glaser, 1991) gives examinees the chance to communicate knowledge about underlying relationships between items by dragging and dropping each focal item onto the target prototype with which it best aligns according to some surface or deep structural feature.
The ordering information item type, also referred to in the literature as a cre-
ate-a-tree item, requires examinees to use the mouse to exhibit understanding of
the material tested. The stem of this item type specifies the way in which the
examinee should arrange elements in a process. The examinee clicks on a focal object and then places it into a target location by dragging and dropping or by clicking on onscreen radio buttons to move the item as needed (Fitzgerald, 2001; Luecht, 2001; Walker & Crandall, 1999). Some sequences in which these focal items might be arranged include largest to smallest, most to least probable cause of an event, or following an if–then framework. Figure 4 presents an example of an ordering information/create-a-tree item type.

FIGURE 3 Example of specifying relationships item type.
Some computer-based item types are used specifically to assess verbal communication and comprehension skills. One, the inserting text item
type, presents examinees with a sentence that must be dragged and dropped into
the appropriate place in a passage (Carey, 2001; Jamieson, Taylor, Kirsch, &
Eignor, 1998). A similar item type is the passage editing item: the examinee moves the cursor to various onscreen “hot spots” where a drop-down menu appears with a list of potential sentence or phrase rewrites, and the examinee must select the best alternative from the list (Breland, 1999; Davey, Godwin, & Mittelholtz, 1997). The
alternatives could range from radical changes to no change at all, with the different
options being scored correct/incorrect or using a graded scale.
FIGURE 4 Example of ordering information/create-a-tree item type.
Another objectively scorable item type that uses mouse manipulation is the
highlighting information item type where the examinee uses the cursor to select a
target phrase or sentence within a passage. Examples of this item type include
identifying the main idea of a paragraph or antecedents of pronouns (Carey, 2001;
Taylor, Jamieson, Eignor, & Kirsch, 1998; Walker & Crandall, 1999). A similar
item type is the capturing/selecting frames item type (sometimes also referred to
as shading) that directs the examinee to click on portions of a graphic as needed
(Hambleton, 1997; O’Neil & Folk, 1996).
The item types described thus far could be described as “click-on item types,”
as clicking on one or more objects is required. For all of these item types
examinees must select or highlight information as directed by the item stem by
moving the mouse, which correspondingly moves an onscreen cursor. Many of
these item types might be considered by some as little more than extended multi-
ple-choice items, but generally with such items the objects that could be selected
are so numerous they require skills above and beyond the test-taking skills helpful
for success on traditional selected-response items. The multiple selection item type
is a good example of an item with this property, in that the examinee is expected to
select text or onscreen items using the mouse given instructions such as “choose all
that apply” or “select three” (Ackerman, Evans, Park, Tamassia, & Turner, 1999;
Mills, 2000). Obviously, such items reduce the chance of answering the item cor-
rectly by guessing, relative to a traditional multiple-choice item.
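The effect on guessing is easy to quantify: when a blind guesser must select exactly k of n options, the probability of success is one over the number of possible selections. A brief computational illustration (the option counts here are hypothetical):

```python
from math import comb  # binomial coefficient, Python 3.8+

def p_guess(n_options, n_required):
    """Probability of answering correctly by blind guessing when the
    examinee must select exactly n_required options out of n_options."""
    return 1 / comb(n_options, n_required)

p_guess(4, 1)  # traditional 4-option multiple choice -> 0.25
p_guess(8, 3)  # "select three" from eight options -> 1/56, about 0.018
```

Even a modest "select three" item thus cuts the chance of a lucky guess by more than an order of magnitude relative to a four-option multiple-choice item.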
Innovations in items with text-based responses.
Moving from the mouse
to the keyboard as the mechanism for examinees to enter responses, a number of
both novel and more familiar item types become increasingly useful in large-scale
performance assessment. Although having an examinee write a short answer or an
essay in a text box is not particularly novel, collecting such answers via computer
can be effective for data management purposes and is increasingly likely to be the
preferred method of evaluating writing skills (to the extent that typing is accepted
as a skill directly relevant to the construct(s) of interest). Similarly, in the mathematical expressions, numerical equations, and multiple numerical response item
types, the examinee can type answers into free-response text boxes (Bennett,
Morley, & Quardt, 2000; Bennett, Steffen, Singley, Morley, & Jacquemin, 1997).
Although these item types may not be especially innovative in and of themselves,
the responses can be surprisingly complex to complete, manage, and score because
there are multiple ways in which any mathematical or text-based response could be
expressed.
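To illustrate why such responses are complex to score, note that algebraically equivalent expressions can be typed in many surface forms (e.g., 2*(x + 1) versus 2*x + 2). A minimal Python sketch (a heuristic illustration only, assuming responses are typed in Python-style notation; this is not any program’s operational scoring engine) compares two typed expressions at randomly sampled points:

```python
import math
import random

def equivalent(expr_a, expr_b, var="x", trials=20, tol=1e-9):
    """Heuristically test whether two typed one-variable expressions
    agree at randomly sampled points (a proxy for algebraic equivalence)."""
    for _ in range(trials):
        value = random.uniform(-10, 10)
        env = {var: value, "math": math, "__builtins__": {}}
        try:
            a = eval(expr_a, env)
            b = eval(expr_b, env)
        except (ZeroDivisionError, ValueError):
            continue  # skip points outside the expressions' domains
        if abs(a - b) > tol * max(1.0, abs(a), abs(b)):
            return False
    return True

equivalent("2*(x + 1)", "2*x + 2")  # True: same function, different form
equivalent("2*(x + 1)", "2*x + 1")  # False: differs at every sampled point
```

A production system would instead use symbolic simplification and much stricter input validation; the point here is only that the answer key cannot be a single literal string.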
One novel item type in which examinees respond by means of a text box is the
generating examples item type. Problems and constraints are presented to the
examinee, whose task it is to pose one or several solutions that are feasible under
such parameters (Bennett, Morley, & Quardt, 2000; Bennett et al., 1999; Enright,
Rock, & Bennett, 1998; Nhouyvanisvong, Katz, & Singley, 1997). In some applications of this item type, the responses are numerical. In fact, the item type was orig-
inally designed in part to broaden the measurement of the construct “quantitative
reasoning” on the Graduate Record Examination, or GRE. Generating examples,
as an item type, is a variant of the generating solutions/formulating hypotheses
item type (Bennett & Rock, 1995; Kaplan & Bennett, 1994), an example of which
is presented in Figure 5. In the formulating hypotheses item type, an examinee is
first presented with a situation of some kind. The task then is to generate as many plausible explanations or causal reasons for the situation as possible.
Complex constructed-response items for CBT.
Some of the most intrigu-
ing advances in CBT are the problem solving vignettes used on professional li-
censure exams. Typically, the vignettes presented to licensure candidates reflect
real-world problems, and the computer simulates real-world responses. An example
of this increased fidelity in measurement is the problem solving vignette item type
found on the Architectural Registration Examination (ARE). On the ARE, ex-
aminees are asked to complete several design tasks (such as design a building or lay
out a parking lot) using a variety of onscreen drawing tools (Bejar, 1991; Williamson, Bejar, & Hone, 1999; Williamson, Hone, Miller, & Bejar, 1998).

FIGURE 5 Example of formulating hypotheses/generating solutions item type. Source: The formulating-hypotheses item type. (n.d.). Retrieved February 26, 2001, from http://www.ets.org/research/rsc/alcadia.html Copyright © Educational Testing Service. Used with permission.

By and large,
these items seem highly authentic to examinees and give test users data about
examinee ability in relation to actual, standardized architectural design tasks. In this
case, the generalizability of test item to job performance as described by Kane
(1992) is high.
Closely related to such problem-solving vignettes are dynamic problem solving
tasks, sometimes referred to as role-play exercises. In measurement, we typically
think of adaptive testing as dynamic between items, but advances in computer
hardware and software now allow some testing programs to create tests that are
adaptive within an item, where an item may be defined as an extended role-playing
task. The computerized case-based simulations used by the National Board of
Medical Examiners incorporate the idea of the simulated patient whose symptoms
and vital statistics change over time in response to the actions (or nonactions) taken
by the candidate (Clauser, Harik, & Clyman, 2000; Clauser, Margolis, Clyman, &
Ross, 1997). As the examinee manages a case, new symptoms may emerge, the
clock is ticking, and the prospective doctor’s actions have the potential to harm as
well as help the simulated patient. As the case progresses, the examinee may have
to deal with unintended medical side effects as well as the patient’s original medi-
cal condition. Each examinee is scored on the sequence of response actions he or she enters into the computer, such as requesting tests on the patient and writing treatment orders, as well as on diagnostic acuity. Similar dynamic simulation tasks are used for
aviation selection and training.
Media in Item Stems
A further emerging dimension of novel item types relates to what Parshall, Davey,
and Pashley (2000) referred to as media inclusion: the use of graphics, video, and
audio within an item or set of items. Multimedia can be used at various points in
the item stem for a variety of purposes: to better illustrate a particular situation, to
allow examinees to visualize a problem, or to better assess a specified construct
(e.g., music-listening aptitude).
Audio prompts in large-scale, noncomputerized testing have been largely con-
fined to music and language tests, with partial success in those areas. However,
Parshall and Balizet (2001) defined a framework for considering four uses of an
audio component in CBTs, including speech audio for listening tests, nonspeech
audio (e.g., music) for listening tests, speech audio for alternative assessment
(such as accommodating tests for limited-English proficient, reading disabled, or
visually disabled examinees), and speech and nonspeech audio incorporated into
the user interface. From a measurement perspective, as Vispoel (1999) noted,
when tests of music listening or aptitude are administered to a group in a non-CBT
format, compromises in administrative efficiency and measurement accuracy of-
ten leave examinee scores on such tests with questionable reliability. The difference in a computer-based setting is that the test can be administered individually
and this format of test administration permits examinees to proceed at their own
pace (Parshall & Balizet, 2001). An example of the successful use of audio
prompts in large-scale computer-based assessment is Educational Testing Ser-
vice’s (ETS’s) Test of English as a Foreign Language, which incorporates such
features.
Graphics, on the other hand, have generally enjoyed more extensive use in com-
puterized assessment. For example, the presentation of digitized pictures has been
used successfully by Ackerman, Evans, Park, Tamassia, and Turner (1999) in a test
of dermatological skin disorders. Examinees can use a zoom feature to get a better
look at the picture before selecting the correct diagnosis from a list of 110 alpha-
betized disorders. Although this item type would be strictly classified as a se-
lected-response item, rather than constructed-response, the emulation of the diag-
nostic processes of professional dermatologists allows examinees to demonstrate a
higher order grasp of the information and reduces the likelihood of guessing. On
other tests, onscreen images can be rotated, resized, selected, clicked on, and
dragged to form a meaningful image, depending on the item type (Klein, O’Neil,
& Baker, 1998). Some of the items described in Table 1, including graphical mod-
eling, concept mapping, and moving a figure into a graph, are examples of tasks
where the graphical manipulations compose the constructed responses.
Furthermore, most desktop computers now have video capabilities that make
the inclusion of video prompts in performance assessment highly feasible. Interac-
tive video assessment has been used operationally with the Workplace Situations
test at IBM (Desmarais et al., 1992), the Allstate Multimedia In-Basket (Ashworth
& Joyce, 1994), and the Conflict Resolution Skills Assessment (Drasgow, Olson-
Buchanan, & Moberg, 1999). In the Conflict Resolution Skills Assessment, for
example, an examinee views a conflict scene of approximately 2 minutes’ dura-
tion, which stops at a critical point and asks the examinee to select one of four re-
sponse options. Based on this selection, the action branches and continues until a second critical point is reached, and so forth.
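The branching logic of such an interactive video assessment can be represented as a simple directed graph. The sketch below is purely illustrative: the scene identifiers and response options are invented and do not describe the actual Conflict Resolution Skills Assessment.

```python
# Hypothetical branching structure: each scene maps the selected
# response option to the next scene (None terminates the assessment).
branches = {
    "scene_1":  {"A": "scene_2a", "B": "scene_2b", "C": "scene_2b", "D": None},
    "scene_2a": {"A": None, "B": "scene_3"},
    "scene_2b": {"A": "scene_3", "B": None},
    "scene_3":  {"A": None, "B": None},
}

def play(choices, start="scene_1"):
    """Follow an examinee's sequence of choices through the branch map,
    returning the path of scenes actually presented."""
    path, scene = [start], start
    for choice in choices:
        scene = branches[scene][choice]
        if scene is None:
            break
        path.append(scene)
    return path

play(["A", "B", "A"])  # -> ["scene_1", "scene_2a", "scene_3"]
```

Because different examinees traverse different paths, both the path itself and the choices made along it can enter into scoring.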
Although selected-response formats may be easier to implement because they avoid the virtually infinite number of possible solutions that constructed responses allow, it is possible to envision a near future in which advanced computing technology such as virtual reality is used to model interactive situations more dynamically and complex software applications are developed to score them successfully.
Use of Reference Materials
One additional new facet of how examinees complete items is the extent to which
examinees may use reference materials as they go through the test. On some as-
sessments, examinees can use calculators or other reference materials. Indeed, calculators may be used on the SAT I (College Board, 2000b) and the ACT Mathe-
matics test (American College Testing Program, 2000) and are required on two of
the mathematics sections of the SAT II (College Board, 2000a). In terms of a
credentialing/licensure test, one section of the mathematics assessment of the
Praxis teacher certification test specifically prohibits calculators, one section al-
lows them, and other sections require them (ETS, 2000). The ARE also requires
examinees to supply their own calculators.
Additional examples of auxiliary information examinees may access are found
on the credentialing examinations administered for Novell software certification
and on the ARE. One of Novell’s certification exams measures candidates’ ability
to quickly navigate two reference CDs to locate information necessary to complete
a task (D. Foster, personal communication, April 11, 2000). One CD contains tech-
nical product information, and the other contains a technical library detailing in-
formation about cables, hard drives, monitors, and CPUs. The tasks tap the candi-
dates’ ability to “research” these CDs to locate the content necessary to solve a
problem. The ARE also allows candidates to access resource material via the com-
puter. Candidates can retrieve certain subject specific information about building
code requirements, program constraints, and vignette specifications on demand as
they design structures in accordance with the presented task directions (National
Council of Architectural Registration Boards, 2000).
Novell certification exams have an additional interactive feature: Candidates
who take a non-English version of an exam can access the English-language ver-
sion of each item. If they so desire, candidates can click on a button to switch back
and forth between the language in which they are taking the test and English (Fos-
ter, Olsen, Ford, & Sireci, 1997).
INNOVATIONS IN SCORING COMPLEX
CONSTRUCTED-RESPONSES
Computerized-adaptive testing (CAT) is increasingly attractive to test developers
as a way to increase the amount of information examinee responses provide about
ability. However, as Parshall, Davey, and Pashley (2000) point out, an adaptive test
works best if the computer can score examinee responses to the test items automat-
ically and instantaneously. This has not historically been a problem when the items
being used are selected-response, as the computer can easily compare the sequence of responses from each examinee to the programmed answer key. The traditional
multiple-choice item, scored with a dichotomous item response model, is the item
type principally used in CAT, but some variations on the multiple-choice item,
such as multiple numerical response, graphical modeling, drag-and-drop, multiple
selection, and ordering information might also be scored fairly easily using a
polytomous item response model.
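To make the dichotomous/polytomous distinction concrete, the following Python sketch (ours, not drawn from any of the testing programs discussed here) computes score-category probabilities for a single polytomously scored item under the partial credit model; a dichotomous item is simply the special case with one step difficulty. The ability value and step difficulties are invented for illustration.

```python
import math

def pcm_probabilities(theta, step_difficulties):
    """Category probabilities for one item under the partial credit model.

    theta: examinee ability estimate
    step_difficulties: difficulties b_1..b_m of the m score steps
    Returns probabilities for scores 0..m (they sum to 1).
    """
    # Cumulative sums of (theta - b_j); score 0 corresponds to the empty sum.
    cumulative = [0.0]
    for b in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - b))
    numerators = [math.exp(c) for c in cumulative]
    total = sum(numerators)
    return [n / total for n in numerators]

# A four-category item (scores 0-3) for an examinee of average ability.
print(pcm_probabilities(theta=0.0, step_difficulties=[-1.0, 0.0, 1.0]))
# A dichotomous item reduces to the familiar Rasch probability.
print(pcm_probabilities(theta=0.0, step_difficulties=[0.0]))  # [0.5, 0.5]
```

Under this model, each of the variant formats listed above simply needs its response mapped to an ordered score category before the usual adaptive machinery takes over.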
350 ZENISKY AND SIRECI
A problem in incorporating many of the innovative performance tasks in Table
1 into a CAT format or into a linear CBT, however, is that each examinee provides a
unique response for each item. Thus, the structure and nature of those answers may
vary widely across examinees, and scoring decisions cannot typically be made im-
mediately using mechanical application of limited, explicit criteria (Bennett,
Sebrechts, & Rock, 1991). Legitimate logistical difficulties in automatically scor-
ing these responses might at first seem to preclude the development of CAT perfor-
mance assessments, although a CBT format might be feasible (as scoring occurs at
a later date). However, current developments in psychology, computer science,
communication disorders, and artificial intelligence reveal several promising di-
rections for the future of computerized performance assessment.
An important consideration for automated scoring is the level of constraint
desired in the constructed response. Among the various constructed-response item
types, it is possible to constrain any individual item in such a way that there are
several, or infinitely many, possible answers, just as an item can be written to ensure that there
is only one correct response. Take a graphical modeling problem, for example,
where the task is to model the growth of an interest-bearing account. This
could be highly specific, such that given a base amount of money, an interest rate,
and a length of time, an examinee would use the mouse to graphically represent an
outcome. Alternatively, given certain information the examinee could be asked to
extrapolate future outcomes from current data, and in this case (depending on how
an examinee synthesized the available information) there might be more than one
appropriate way to respond graphically.
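The highly constrained version of this hypothetical interest-account item could be machine-scored directly. The Python sketch below checks examinee-plotted points against the compound-interest curve; all amounts, rates, and the tolerance band are invented for illustration.

```python
def compound_balance(principal, rate, years):
    """Balance of an account with annual compounding: P * (1 + r)**t."""
    return principal * (1.0 + rate) ** years

def score_plot(points, principal, rate, tolerance=0.05):
    """Score examinee-plotted (year, balance) points against the true curve.

    A point counts as correct if its balance is within `tolerance`
    (relative error) of the compound-interest value for that year; the
    item score is the fraction of correctly placed points.
    """
    correct = sum(
        1 for year, balance in points
        if abs(balance - compound_balance(principal, rate, year))
        <= tolerance * compound_balance(principal, rate, year)
    )
    return correct / len(points)

# $1,000 at 5% annual interest; the examinee's year-2 point is slightly off
# but falls inside the 5% tolerance band.
examinee_points = [(1, 1050.0), (2, 1100.0), (3, 1158.0)]
print(score_plot(examinee_points, principal=1000.0, rate=0.05))  # 1.0
```

A less constrained version of the item, in which extrapolation is allowed, would instead need to accept any of several defensible curves, which is where the more elaborate scoring approaches described next become necessary.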
Currently, much of the work on automated scoring has focused on de-
veloping techniques for scoring essays, computer programs, simulated patient in-
teractions, and architectural designs online. The most prominent computer-based
scoring methods, described in the following sections, are summarized in Table 2.
These scoring methods fall into three general categories: essay scoring, expert sys-
tems analysis, and mental modeling.
Automated Scoring of Essays
Essays are perhaps the most common form of constructed responses used in
large-scale assessment, although their use is limited because many testing pro-
grams need two and sometimes three humans to read and evaluate essays accord-
ing to preestablished scoring rubrics. Other constructed-response text-based item
types typically found on paper-and-pencil tests, such as the accounting problems
on the Uniform Certified Public Accountants Examination, are scored in a simi-
larly labor-intensive fashion. Given the expense of extensively training the readers
required to establish score validity, computerized alternatives to human
readers are highly attractive. Research on automated scoring programs and meth-
ods for the most part has demonstrated the comparability of essay scoring across
TABLE 2
Summary of Automated Scoring Programs/Methods

Essays/free-text answers

Project essay grade: Uses a regression model in which the independent variables are surface features of the text (document length, word length, and punctuation) and the dependent variable is the essay score (Page, 1994).

E-rater: Evaluates numerous structural and linguistic features specified in a holistic scoring guide using natural language processing techniques (Burstein & Chodorow, 2002; Burstein et al., 1998).

Latent semantic analysis: A theory and method for extracting and representing the contextual usage of words by statistical computations applied to a large corpus of text (Foltz, Kintsch, & Landauer, 1998; Landauer, Foltz, & Laham, 1998; Landauer et al., 1997).

Text categorization: Evaluates text using automated learning techniques to categorize text documents, where linguistic expressions and contexts extracted from the texts are used to classify texts (Larkey, 1998).

Constructed free response scoring tool: Scores short verbal answers, where examinees key in responses that are pattern-matched to programmed correct responses (Martinez & Bennett, 1992).

Expert systems: The examinee's completed response is compared to a problem-specific knowledge base encoded within the computer's memory banks. The knowledge base is constructed from human content-expert responses that have been coded in a machine-usable form (Bennett & Sebrechts, 1996; Braun et al., 1990; Martinez & Bennett, 1992; Sebrechts, Bennett, & Rock, 1991).

Mental modeling: Elements of the final product are evaluated against the universe of all possible variations using a process that mimics the scoring processes of committees and requires an analysis of the way experts evaluate solutions. Scores can be compared to the results obtained from human raters to assess agreement (Bejar, 1991; Clauser, 2000; Clauser, Harik, & Clyman, 2000; Clauser et al., 1997; Martinez & Bennett, 1992; Williamson, Bejar, & Hone, 1999; Williamson et al., 1998).
human and computer graders (Burstein, Kukich, Wolff, Lu, & Chodorow, 1998;
Rizavi & Sireci, 1999; Yang, Buckendahl, Juszkiewicz, & Bhola, 2002). Currently,
several computerized essay scoring options and methods are available, including
project essay grade (PEG), e-rater, latent semantic analysis, text categorization,
and the constructed-response scoring tool.
Project essay grade.
PEG, developed by Ellis Page in the mid-1960s, was
the first automated essay grading system; the current version evolved from his ear-
lier work (Page, 1994; Page & Peterson, 1995). Like most computerized essay
scoring programs, the specifics of how PEG works are proprietary. However, de-
scriptions of the program suggest that it uses multiple regression to determine the
optimal combination of the surface features of an essay (e.g., average word length,
essay word length, number of uncommon words, number of commas) as well as
complex structure of the essay (e.g., soundness of sentence structure) to best pre-
dict the score that would be assigned by a human grader (Page, 1994; Page &
Peterson, 1995). By assigning weights to these surface and intrinsic features, the
computer attempts to mimic human scoring. Although it is unclear whether PEG is
currently being used in large-scale assessment, it is clear that it set the stage for
other developments in the computerized scoring of essays.
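Because PEG's actual feature set and weights are proprietary, the following Python fragment only sketches the general idea: a score is predicted as a weighted combination of surface features, with weights that would in practice be estimated by regressing human scores on those features. All feature names and coefficients here are invented.

```python
def extract_features(essay):
    """Surface features of the kind PEG reportedly uses."""
    words = essay.split()
    return {
        "length": len(words),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "commas": essay.count(","),
    }

# Invented regression weights; real weights would be fit to human scores.
WEIGHTS = {"length": 0.05, "avg_word_len": 0.4, "commas": 0.2}
INTERCEPT = 0.5

def peg_score(essay):
    """Predict a holistic score as a weighted sum of surface features,
    clamped to a 1-6 holistic scale."""
    features = extract_features(essay)
    raw = INTERCEPT + sum(WEIGHTS[k] * v for k, v in features.items())
    return max(1.0, min(6.0, raw))
```

The design choice is notable: nothing in the model reads the essay for meaning; the surface features serve purely as statistical proxies for the human judgment being predicted.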
E-rater.
E-rater (Burstein et al., 1998; Burstein & Chodorow, 2002) is the
essay scoring system developed by ETS for the essay portion of the Graduate Man-
agement Admission Test (GMAT). It is designed to evaluate numerous structural
and linguistic features specified in a holistic scoring guide. On the GMAT, each
examinee responds to two essay questions, which are scored by both a trained hu-
man grader and an electronic reader. Currently, e-rater serves as the second reader.
If human and e-rater scores on a particular essay differ by more than one point, the
essay is sent to a second human expert, and finally, if consensus is still not reached,
to a final human referee. Thus, the GMAT scoring system provides an example of
how the computer can be used to increase the efficiency of essay scoring while
maintaining the validity of the final scores assigned to an essay.
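The routing logic just described can be sketched as follows. The passage specifies when additional readers are consulted but not how agreeing scores are combined, so the averaging used below is our assumption, and the callables standing in for extra readers are purely illustrative.

```python
def resolve_essay_score(first_human, erater, second_human=None, referee=None):
    """Sketch of the GMAT-style adjudication workflow.

    second_human and referee are callables standing in for additional
    expert readings; they are consulted only when earlier scores
    disagree by more than one point.
    """
    if abs(first_human - erater) <= 1:
        return (first_human + erater) / 2          # scores agree closely
    second = second_human()                        # second human expert reads
    if abs(second - first_human) <= 1:
        return (second + first_human) / 2
    if abs(second - erater) <= 1:
        return (second + erater) / 2
    return referee()                               # final human referee decides

print(resolve_essay_score(4, 5))  # prints 4.5 -- no second reading needed
```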
Latent semantic analysis.
Latent semantic analysis (LSA) is a theory and
method for extracting and representing the contextual usage of words by statisti-
cal computations applied to a large corpus of text (Foltz, Kintsch, & Landauer,
1998; Landauer, Foltz, & Laham, 1998; Rehder et al., 1998). The underlying
idea is that the aggregate of all the word contexts in which a given word does
and does not appear provides a set of mutual constraints that largely determines
the similarity of meaning of words and sets of words to each other. A possible
analogy for LSA could be the way multidimensional scaling allows relationships
between variables to be plotted in n dimensions. In LSA, words can be mapped
into semantic space and distances between words are derived from shadings of
meaning, which are obtained through context. LSA’s algorithm has a learning
component that “reads” through a text and develops an understanding of the sen-
tence or passage by evolving a meaning for each word in relation to all the other
words in the sentence or passage. The LSA system can be “trained” to work in
different content areas by having it electronically read texts relevant to the do-
main of interest. One caveat to the use of LSA: At this time, the algorithm
does not derive sentence meaning from word order, a limitation that examinees
could potentially exploit, although ongoing research addresses this point (Landauer,
Laham, Rehder, & Schreiner, 1997).
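A minimal illustration of the word-by-context representation underlying LSA follows. Real LSA additionally applies singular value decomposition to reduce this matrix to a lower-dimensional semantic space; that defining step is omitted here for brevity, and the toy corpus is invented.

```python
import math

corpus = [
    "the patient showed symptoms of infection",
    "the doctor treated the infection with antibiotics",
    "the essay discussed themes of justice and law",
    "legal scholars debate justice in the law",
]

def context_vector(word):
    """A word's row in the word-by-document matrix: how often it occurs
    in each document. Real LSA would next reduce this matrix via SVD."""
    return [doc.split().count(word) for doc in corpus]

def cosine(u, v):
    """Cosine similarity between two context vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Words that share contexts lie close together in the semantic space.
print(cosine(context_vector("infection"), context_vector("antibiotics")))
print(cosine(context_vector("infection"), context_vector("justice")))  # 0.0
```

Even without the SVD step, the example shows the core intuition: "infection" and "antibiotics" end up similar because they co-occur in the same contexts, not because any dictionary links them.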
Text categorization.
Text categorization is a method for evaluating text that
uses automated learning techniques to categorize documents, where linguistic ex-
pressions and contexts extracted from the texts are used to classify them (Larkey,
1998). This type of analysis is informed by work in areas of machine learning,
Bayesian networks, information retrieval, natural language processing, case-based
reasoning, language modeling, and speech recognition. A number of text categori-
zation algorithms have been developed, incorporating different schema for classi-
fying text. The sorting of verbal content may be related to topic, to specified levels
of quality, or perhaps by keywords. It is interesting that the evaluation of essays is
only one of the many situations in which text categorization techniques have been
applied. These algorithms are also used in sorting documents in databases in an in-
formation retrieval context such as in the code that powers Internet search engines.
One organization with ongoing research into text categorization is the Edinburgh
Language Technology Group (http://www.ltg.ed.ac.uk/papers/class.html), whose
Web site details much of its work on multiple applications of text categorization
methodology.
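As one concrete instance of such an algorithm, the Python sketch below classifies short texts into quality levels with a multinomial naive Bayes model, one of several learning techniques drawn on in this literature; the training texts and labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Invented training set: short texts labeled by holistic quality level.
training = [
    ("clear thesis strong evidence good organization", "high"),
    ("strong argument clear structure good evidence", "high"),
    ("no thesis weak evidence poor organization", "low"),
    ("unclear argument weak structure poor grammar", "low"),
]

word_counts = defaultdict(Counter)
label_counts = Counter()
for text, label in training:
    label_counts[label] += 1
    word_counts[label].update(text.split())
vocab = {w for text, _ in training for w in text.split()}

def classify(text):
    """Pick the label with the highest log posterior under a multinomial
    naive Bayes model with add-one smoothing."""
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / len(training))
        total = sum(word_counts[label].values())
        for word in text.split():
            score += math.log((word_counts[label][word] + 1)
                              / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("clear thesis good evidence"))  # prints "high"
```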
The Constructed Free Response Scoring Tool.
The Constructed Free Re-
sponse Scoring Tool (FRST) is an automated approach to scoring examinees' con-
structed responses online (Martinez & Bennett, 1992). FRST is an algorithm devel-
oped to score short verbal answers, where examinees key in responses that are
pattern-matched to programmed correct responses (Martinez & Bennett, 1992).
FRST has demonstrated a 100% congruence rate with human raters when examinee responses
range in length between 5 and 7 words, and an 88% congruence rate for responses be-
tween 12 and 15 words. In this case, congruence rate is defined as the rate at which
scores assigned to examinee responses by the computer and the human rater exactly
match.
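The pattern-matching idea can be sketched as follows; FRST's actual matching rules are not described in detail in the sources cited, so the normalization and regular-expression matching here are stand-ins, and the item, patterns, and responses are invented.

```python
import re

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def frst_score(response, answer_patterns):
    """Score 1 if the normalized response matches any keyed pattern, else 0."""
    normalized = normalize(response)
    return int(any(re.fullmatch(p, normalized) for p in answer_patterns))

# Keyed variants for the short-answer prompt "What does CPU stand for?"
patterns = [r"central processing unit", r"(the )?cpu", r"central processor"]
print(frst_score("Central Processing Unit.", patterns))  # prints 1
print(frst_score("computer chip", patterns))             # prints 0
```

The reported drop in congruence for longer responses is intuitive under this scheme: the more words an answer contains, the more legitimate variants the keyed patterns must anticipate.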
Expert Systems Analysis
Expert systems analysis provides another example of the use of computers to score
complex examinee responses. Expert systems are computer programs designed to
emulate the scoring behaviors of human content specialists. Expert scoring has
been studied in a number of contexts, including computer programming and math-
ematics problems. For example, PROUST and MicroPROUST are two expert sys-
tems developed to automatically score computer programs that examinees write
using the Pascal language (Braun, Bennett, Frye, & Soloway, 1990; Martinez &
Bennett, 1992). Each system has the knowledge to reason about programming prob-
lems within an intention-based analysis framework. Based on how humans reason
about computer programs, the expert system formulates deep-structure, goal,
and plan representations in the process of trying to identify nonsyntactical errors.
In terms of constructed-response quantitative items, the expert scoring system
known as GIDE produces a series of comments about errors present in examinees’
solutions and then incorporates that information into computation of partial-credit
scores (Bennett & Sebrechts, 1996; Martinez & Bennett, 1992; Sebrechts, Ben-
nett, & Rock, 1991). The expert systems program consults a problem-specific
knowledge base constructed from human content-expert responses that are coded
in a machine-usable form. The examinee responses are broken down into compo-
nent parts, and each piece is evaluated against multiple programmed alternatives.
Here, analysis has shown that reasonable machine-rater congruence can be ob-
tained (e.g., a .86 correlation between the scores assigned to solutions by a machine
and by a human rater; Martinez & Bennett, 1992). Interestingly, research into
GIDE, PROUST, and MicroPROUST expert systems scoring mechanisms sug-
gests that although each is highly accurate at classifying examinee responses as
correct or incorrect, they are less able to provide specific diagnostic information
about examinee errors.
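A heavily simplified sketch of this kind of partial-credit scoring follows; the step decomposition and knowledge base shown are invented, and the real GIDE encoding is far richer than set membership.

```python
def partial_credit(solution_steps, knowledge_base):
    """Compare each step of a worked solution to its keyed alternatives.

    knowledge_base is a list (one entry per expected step) of sets of
    acceptable forms. Credit is the fraction of steps that match some
    keyed alternative; unmatched steps generate diagnostic comments.
    """
    earned = 0
    comments = []
    for step, alternatives in zip(solution_steps, knowledge_base):
        if step in alternatives:
            earned += 1
        else:
            comments.append(f"unexpected step: {step!r}")
    return earned / len(knowledge_base), comments

# Accounting-style problem keyed with acceptable forms for each step.
kb = [
    {"revenue - expenses", "expenses subtracted from revenue"},
    {"net_income * tax_rate"},
    {"net_income - tax"},
]
score, comments = partial_credit(
    ["revenue - expenses", "net_income * tax_rate", "net_income + tax"], kb)
print(score)     # 2 of 3 steps matched
print(comments)  # one diagnostic comment for the faulty last step
```

Note how the structure mirrors the finding above: deciding whether a step matches (for credit) is easy, while turning an unmatched step into a genuinely diagnostic comment is the hard part.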
Mental Modeling
An additional approach to computerized scoring of complex performance tasks is
mental modeling, which is currently used to score portions of the ARE. The perfor-
mance tasks on the ARE are often cited as highly interactive and innovative exam-
ples of the possibilities that exist for automated scoring in computer-based testing.
Examinees are presented with architectural design tasks given various constraints
and, in effect, create blueprints for buildings during the testing session. Each prob-
lem is graded on four attributes (grammatical, code compliance, diagrammatic
compliance, and design logic and efficiency) using feature-extraction analysis,
in which elements of the final product are evaluated against the universe of all
possible variations (Bejar, 1991; Martinez & Bennett, 1992). The elements ex-
tracted from an examinee's constructed response are compared with this universe
using a procedure that mimics the scoring processes of committees and requires an
analysis of the way experienced experts evaluate solutions (Williamson, Bejar, &
Hone, 1999). This “mental modeling” approach to scoring, done by computers,
can be compared to the results obtained from human raters to assess the extent to
which these methodologies agree on results.
In addition to being used on the ARE, the National Board of Medical Examiners
has incorporated the mental model algorithm into its patient care simulations
(Clauser, Harik, & Clyman, 2000; Clauser et al., 1997). Each action that exam-
inees key in is classified either as benefiting the simulated patient or as an inap-
propriate action carrying some level of risk. Feature-extraction analysis and
mental modeling may be applicable as a medium for the automated scoring of es-
says as well, where the features could be specified as components of an essay, such
as sentences.
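In outline, such a feature-extraction scorer evaluates each graded attribute with rules distilled from expert judgment. The Python sketch below is purely illustrative: the attribute names, features, and thresholds are invented, not those of the ARE.

```python
def mental_model_score(extracted_features, expert_rules):
    """Evaluate each graded attribute of a design with expert-derived rules.

    extracted_features: feature name -> value pulled from the final design
    expert_rules: attribute name -> list of predicates over the features
    An attribute passes only if every one of its rules is satisfied,
    mimicking a committee that must be convinced on each point.
    """
    return {attribute: all(rule(extracted_features) for rule in attribute_rules)
            for attribute, attribute_rules in expert_rules.items()}

# Invented rules and features -- not the ARE's actual criteria.
rules = {
    "code_compliance": [lambda f: f["exit_count"] >= 2,
                        lambda f: f["corridor_width_in"] >= 44],
    "design_logic": [lambda f: f["circulation_ratio"] <= 0.35],
}
design = {"exit_count": 2, "corridor_width_in": 48, "circulation_ratio": 0.30}
verdicts = mental_model_score(design, rules)
print(verdicts)  # both attributes judged acceptable
```

Agreement checking of the kind described above would then compare these attribute-level verdicts to those of human committees over many designs.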
The various algorithms for automatically scoring constructed responses repre-
sent an especially exciting direction for computerized assessment practices. Using
computers in this regard will help to improve the extent to which uniformity and
precision in scoring rules can be implemented (Clauser et al., 2000). As a result,
test users and examinees alike may develop greater confidence in the inferences
about domain proficiency made on the basis of test scores. Likewise, as Bennett
(1998) mentioned, delivery efficiency will improve. In consequence, performance
tasks that can be automatically evaluated will become more logistically and practi-
cally feasible for use in high-stakes credentialing.
DIRECTIONS FOR FUTURE RESEARCH
The incorporation of different response actions, task formats, multimedia prompts,
and reference materials into test items has the potential to substantially increase
the types of skills, abilities, and processes that can be measured. Likewise, the im-
plementation of automated scoring methods can greatly facilitate the processing of
examinee responses. The potential benefits of incorporating these innovations and
tasks to testing are great, but such benefits cannot fully materialize without further
research on a number of psychometric and operational concerns (particularly with
regard to many of the emerging item types, as found by Zenisky & Sireci, 2001).
Technology-related dimensions of test format, administration, and scoring cannot
be accepted without due psychometric scrutiny.
In terms of integrating innovations in task presentation into more large-scale as-
sessments, there are several critical directions for research. The intricacy of the
task and how it relates to the skill(s) being assessed in a given testing context is an
issue of central importance that must be rigorously evaluated (Crocker, 1997).
Examinees should not be overwhelmed with innovative item types and item fea-
tures that are extraneous to the task. To this end, the relative simplicity or complex-
ity of the user interface should remain a fundamental concern for test developers,
especially in light of the potential for gadgetry to eclipse real technological bene-
fits. Oftentimes, extended tutorials may be necessary to sufficiently familiarize
examinees with novel testing tasks, and thus use of these item types may require sub-
stantial testing time and development resource commitments from test developers.
Further research specific to different types and task presentation variables can help
determine the kinds of preparation and tutorials necessary for this purpose.
Practical validity concerns in CBT include the adequacy of construct represen-
tation (Huff & Sireci, 2001; Kane, Crooks, & Cohen, 1999; Messick, 1995) and
task generalizability (Brennan & Johnson, 1995; Linn & Burton, 1994; Shavelson,
Baxter, & Pine, 1991). Issues of task specificity such as the relative number of
tasks and the extent to which examinee performance can be generalized from the
selected tasks (Guion, 1995) are additional concerns. Equally important are stud-
ies to determine potential sources of construct-irrelevant variance associated with
such item types (Huff & Sireci, 2001). Furthermore, work to evaluate adverse im-
pact for different subgroups of examinee populations has by and large not been
completed for most emerging item types. This problem needs to be addressed in
future research.
Automated scoring must be evaluated with respect to potential losses in score
validity, perhaps in the direction of multitrait–multimethod analyses (Clauser,
2000; Yang et al., 2002). The emerging area of multidimensional item response
theory (IRT) models may provide some interesting ways for scoring complex con-
structed responses (see Ackerman, 1994, and van der Linden & Hambleton, 1997,
for further information on multidimensional IRT). Preliminary research suggests
that compromises in reliability and information per minute of testing time may
occur when complex, computerized constructed-response item types are used
(Jodoin, 2001), so further research in the areas of reliability and test and item infor-
mation should be accelerated.
CONCLUSIONS
Technological advances in CBT represent positive future directions for the evalua-
tion of complex performances within large-scale testing programs, especially
given the escalating use of technology in many aspects of everyday life. Ex-
aminees value the opportunity to demonstrate what they know when tasks on a
test more faithfully relate to the skills necessary for a particular domain, and these
methods may provide test users with ways to acquire information about an
examinee’s proficiency on a given knowledge or skill area more directly (Kane,
1992). As psychometric research related to computerized performance assessment
is completed, the application of empirical findings on the fusion of technology
and measurement will continue to have a positive impact on assessment practices.
The overview presented in this article provides a broad (although not exhaustive)
description of numerous recent technological developments in computer-based
assessment.
Many of these innovations have both strengths and weaknesses from practical and
psychometric perspectives, and thus enthusiasm for these emerging measurement
methods must be tempered by scientific wariness about their technical characteris-
tics. Still, although it is difficult to predict the future, many of these CBT innova-
tions are likely to dramatically change the testing experience for many examinees
who sit for assessments in a wide variety of testing contexts, including certification
and licensure, admissions, and achievement testing.
ACKNOWLEDGMENTS
Laboratory of Psychometric and Evaluative Research Report No. 383, School of
Education, University of Massachusetts, Amherst. This research was funded in
part by the American Institute of Certified Public Accountants (AICPA). We are
grateful for this support. The opinions expressed in this article are ours and do not
represent official positions of the AICPA.
REFERENCES
Ackerman, T. A. (1994). Using multidimensional item response theory to understand what items and
tests are measuring. Applied Measurement in Education, 7, 255–278.
Ackerman, T. A., Evans, J., Park, K., Tamassia, C., & Turner, R. (1999). Computer assessment using vi-
sual stimuli: A test of dermatological skin disorders. In F. Drasgow & J. B. Olson-Buchanan (Eds.),
Innovations in computerized assessment (pp. 137–150). Mahwah, NJ: Lawrence Erlbaum Associ-
ates, Inc.
American College Testing Program. (2000). Calculators and the ACT math test. Retrieved March 20,
2000, from http://www.act.org/aap/taking/calculator.html
Ashworth, S. D., & Joyce, T. M. (1994, April). Developing scoring protocols for a computerized multi-
media in-basket exercise. Paper presented at the Ninth Annual Conference of the Society for Indus-
trial and Organizational Psychology, Nashville, TN.
Bejar, I. (1991). A methodology for scoring open-ended architectural design problems. Journal of Ap-
plied Psychology, 76, 522–532.
Bennett, R. E. (1998). Reinventing assessment: Speculations on the future of large-scale educational
testing. Princeton, NJ: Educational Testing Service.
Bennett, R. E., Morley, M., & Quardt, D. (2000). Three response types for broadening the conception of
mathematical problem solving in computerized-adaptive tests. Applied Psychological Measurement,
24, 294–309.
Bennett, R. E., Morley, M., Quardt, D., & Rock, D. A. (2000). Graphical modeling: A new response
type for measuring the qualitative component of mathematical reasoning. Applied Measurement in
Education, 13, 303–322.
Bennett, R. E., Morley, M., Quardt, D., Rock, D. A., Singley, M. K., Katz, I. R., & Nhouyvanisvong, A.
(1999). Psychometric and cognitive functioning of an under-determined computer-based response
type for quantitative reasoning. Journal of Educational Measurement, 36, 233–252.
Bennett, R. E., & Rock, D. A. (1995). Generalizability, validity, and examinee perceptions of a com-
puter-delivered formulating-hypotheses test. Journal of Educational Measurement, 32, 19–36.
Bennett, R. E., & Sebrechts, M. M. (1996). The accuracy of expert-system diagnoses of mathematical
problem solutions. Applied Measurement in Education, 9, 133–150.
Bennett, R. E., & Sebrechts, M. M. (1997). A computer-based task for measuring the representational
component of quantitative proficiency. Journal of Educational Measurement, 34, 64–78.
Bennett, R. E., Sebrechts, M. M., & Rock, D. A. (1991). Expert system scores for complex con-
structed-response quantitative items: A study of convergent validity. Applied Psychological Mea-
surement, 15, 227–239.
Bennett, R. E., Steffen, M., Singley, M. K., Morley, M., & Jacquemin, D. (1997). Evaluating an auto-
matically scorable, open-ended response type for measuring mathematical reasoning in com-
puter-adaptive tests. Journal of Educational Measurement, 34, 162–176.
Braun, H. I., Bennett, R. E., Frye, D., & Soloway, E. (1990). Scoring constructed responses using expert
systems. Journal of Educational Measurement, 27, 93–108.
Breland, H. M. (1999). Exploration of an automated editing task as a GRE writing measure (RR–99–9).
Princeton, NJ: Educational Testing Service.
Brennan, R. L., & Johnson, E. G. (1995). Generalizability of performance assessments. Educational
Measurement: Issues and Practice, 14(4), 25–27.
Burstein, J., & Chodorow, M. (2002). Directions in automated essay scoring analysis. In R. Kaplan (Ed.),
Oxford handbook of applied linguistics (pp. 487–497). Oxford, England: Oxford University Press.
Burstein, J., Kukich, K., Wolff, S., Lu, C., & Chodorow, M. (1998, April). Computer analysis of essays.
Paper presented at the NCME Symposium on Automated Scoring, San Diego, CA.
Carey, P. (2001, April). Overview of current computer-based TOEFL. Paper presented at the annual
meeting of the National Council on Measurement in Education, Seattle, WA.
Chung, G. K. W. K., O’Neil, H. F., Jr., & Herl, H. E. (1999). The use of computer-based collaborative
knowledge mapping to measure team processes and outcomes. Computers in Human Behavior, 15,
463–494.
Clauser, B. E. (2000). Recurrent issues and recent advances in scoring performance assessments. Ap-
plied Psychological Measurement, 24, 310–324.
Clauser, B. E., Harik, P., & Clyman, S. G. (2000). The generalizability of scores for a performance as-
sessment scored with a computer-automated scoring system. Journal of Educational Measurement,
37, 245–261.
Clauser, B. E., Margolis, M. J., Clyman, S. G., & Ross, L. P. (1997). Development of automated scoring
algorithms for complex performance assessments: A comparison of two approaches. Journal of Edu-
cational Measurement, 34, 141–161.
College Board. (2000a). AP Calculus for the new century. Retrieved March 20, 2000, from http://www.collegeboard.org/index_this/ap/calculus/new_century/evolution.html
College Board. (2000b). Calculators. Retrieved March 20, 2000, from http://www.collegeboard.org/index_this/sat/center/html/counselors/prep009.html
Crocker, L. (1997). Assessing content representativeness of performance assessment exercises. Ap-
plied Measurement in Education, 10, 83–95.
Davey, T., Godwin, J., & Mittelholtz, D. (1997). Developing and scoring an innovative computerized
writing assessment. Journal of Educational Measurement, 34, 21–42.
Desmarais, L. B., Dyer, P. J., Midkiff, K. R., Barbera, K. M., Curtis, J. R., Esrig, F. H., & Masi, D. L.
(1992, May). Scientific uncertainties in the development of a multimedia test: Trade-offs and deci-
sions. Paper presented at the Seventh Annual Conference of the Society for Industrial and Organiza-
tional Psychology, Montreal, Quebec, Canada.
Drasgow, F., Olson-Buchanan, J. B., & Moberg, P. J. (1999). Development of an interactive video as-
sessment: Trials and tribulations. In F. Drasgow & J. B. Olson-Buchanan (Eds.), Innovations in com-
puterized assessment (pp. 197–219). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Educational Testing Service. (1993). Tests at a glance: Praxis I: Academic Skills Assessment. Prince-
ton, NJ: Author.
Educational Testing Service. (2000). The Praxis Series: Professional Assessments for Beginning Teachers: Tests and test dates. Retrieved March 20, 2000, from http://www.teachingandlearning.org/licensure/praxis/prxtest.html
Enright, M. K., Rock, D. A., & Bennett, R. E. (1998). Improving measurement for graduate admissions.
Journal of Educational Measurement, 35, 250–267.
Fitzgerald, C. (2001, April). Rewards and challenges of implementing an innovative CBT certification
exam program. Paper presented at the annual meeting of the National Council on Measurement in
Education, Seattle, WA.
Foltz, P. W., Kintsch, W., & Landauer, T. K. (1998). The measurement of textual coherence with latent
semantic analysis. Discourse Processes, 25, 285–307.
Foster, D., Olsen, J. B., Ford, J., & Sireci, S. G. (1997, March). Administering computerized certifica-
tion exams in multiple languages: Lessons learned from the international marketplace. Paper pre-
sented at the meeting of the American Educational Research Association, Chicago.
French, A., & Godwin, J. (1996). Using multimedia technology to create innovative items. Paper pre-
sented at the annual meeting of the American Educational Research Association, New York.
Glaser, R. (1991). Expertise and assessment. In M. C. Wittrock & E. L. Baker (Eds.), Testing and cogni-
tion (pp. 17–30). Englewood Cliffs, NJ: Prentice Hall.
Guion, R. M. (1995). Comments on values and standards in performance assessments. Educational
Measurement: Issues and Practice, 14(4), 25–27.
Hambleton, R. K. (1997, October). Promising GMAT item formats for the 21st century. Invited presen-
tation at the international workshop on the GMAT, Paris, France.
Herl, H. E., O’Neil, H. F., Jr., Chung, G. K. W. K., & Schacter, J. (1999). Reliability and validity of a
computer-based knowledge mapping system to measure content understanding. Computers in Hu-
man Behavior, 15, 315–333.
Huff, K. L., & Sireci, S. G. (2001). Validity issues in computer-based testing. Educational Measure-
ment: Issues and Practice, 20(3), 16–25.
Jamieson, J., Taylor, C., Kirsch, I., & Eignor, D. (1998). Design and evaluation of a computer-based
TOEFL tutorial. System, 26, 485–513.
Jodoin, M. G. (2001, April). An empirical examination of IRT information for innovative item formats
in a computer-based certification testing program. Paper presented at the annual meeting of the Na-
tional Council on Measurement in Education, Seattle, WA.
Kane, M. T. (1992). The assessment of professional competence. Evaluation and the Health Profes-
sions, 15, 163–182.
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measure-
ment: Issues and Practice, 18(2), 5–17.
Kaplan, R. M., & Bennett, R. E. (1994). Using the free-response scoring tool to automatically score the
formulating hypotheses item (ETS Research Report No. 94–08). Princeton, NJ: Educational Testing
Service.
Klein, D. C. D., O’Neil, H. F., Jr., & Baker, E. L. (1998). A cognitive demands analysis of innovative
technologies (CSE Tech. Rep. No. 454). Los Angeles, CA: UCLA, National Center for Research on
Evaluation, Student Standards, and Testing.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Dis-
course Processes, 25, 259–284.
Landauer, T. K., Laham, D., Rehder, B., & Schreiner, M. E. (1997). How well can passage meaning be
derived without using word order? A comparison of latent semantic analysis and humans. In G.
Shafto & P. Langley (Eds.), Proceedings of the 19th annual meeting of the Cognitive Science Society
(pp. 412–417). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Larkey, L. S. (1998). Automatic essay grading using text categorization techniques. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 90–95). Melbourne, Australia.
Linn, R. L., & Burton, E. (1994). Performance-based assessment: Implications of task specificity. Educational Measurement: Issues and Practice, 13(1), 5–8, 15.
Luecht, R. M. (2001, April). Capturing, codifying, and scoring complex data for innovative, computer-based items. Paper presented at the annual meeting of the National Council on Measurement in Education, Seattle, WA.
Martinez, M. E. (1991). A comparison of multiple-choice and constructed figural response items. Journal of Educational Measurement, 28, 131–145.
Martinez, M. E., & Bennett, R. E. (1992). A review of automatically scorable constructed-response item types for large-scale assessment. Applied Measurement in Education, 5, 151–169.
Messick, S. (1995). Standards of validity and the validity of standards in performance assessment. Educational Measurement: Issues and Practice, 14(4), 5–9.
Microsoft Corporation. (1998, September). Procedures and guidelines for writing Microsoft Certification Exams. Redmond, WA: Author.
Mills, C. (2000, February). Unlocking the promise of CBT. Keynote address presented at a conference of the Association of Test Publishers, Carmel Valley, CA.
Mislevy, R. J., Steinberg, L. S., Breyer, F. J., Almond, R. G., & Johnson, L. (1999, September 16–17). Making sense of data from complex assessments. Paper presented at the 1999 CRESST Conference, Los Angeles, CA.
National Council of Architectural Registration Boards. (2000). ARE practice program [Computer software]. Retrieved March 20, 2000, from http://www.ncarb.org/are/tutorial2.html
Nhouyvanisvong, A., Katz, I. R., & Singley, M. K. (1997). Toward a unified model of problem solving in well-determined and under-determined algebra word problems. Paper presented at the annual meeting of the American Educational Research Association, Chicago.
O'Neil, K., & Folk, V. (1996, April). Innovative CBT item formats in a teacher licensing program. Paper presented at the annual meeting of the National Council on Measurement in Education, New York.
Page, E. B. (1994). Computer grading of student prose, using modern concepts and software. Journal of Experimental Education, 62(2), 127–142.
Page, E. B., & Petersen, N. S. (1995). The computer moves into essay grading: Updating the ancient test. Phi Delta Kappan, 76, 561–565.
Parshall, C. G., & Balizet, S. (2001). Audio computer-based tests (CBTs): An initial framework for the use of sound in computerized tests. Educational Measurement: Issues and Practice, 20(2), 5–15.
Parshall, C. G., Davey, T., & Pashley, P. (2000). Innovative item types for computerized testing. In W. J. van der Linden & C. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 129–148). Boston: Kluwer Academic.
Rehder, B., Schreiner, M. E., Wolfe, M. B., Laham, D., Landauer, T. K., & Kintsch, W. (1998). Using latent semantic analysis to assess knowledge: Some technical considerations. Discourse Processes, 25, 337–354.
Rizavi, S., & Sireci, S. G. (1999). Comparing computerized and human scoring of WritePlacer essays (Laboratory of Psychometric and Evaluative Research Report No. 354). Amherst: School of Education, University of Massachusetts.
Sebrechts, M. M., Bennett, R. E., & Rock, D. A. (1991). Agreement between expert system and human raters' scores on complex constructed-response quantitative items. Journal of Applied Psychology, 76, 856–862.
Shavelson, R. J., Baxter, G., & Pine, J. (1991). Performance assessment in science. Applied Measurement in Education, 4, 347–362.
Taylor, C., Jamieson, J., Eignor, D., & Kirsch, I. (1998). The relationship between computer familiarity and performance on computer-based TOEFL test tasks (ETS Research Report No. 98–08). Princeton, NJ: Educational Testing Service.
van der Linden, W. J., & Hambleton, R. K. (Eds.). (1997). Handbook of modern item response theory. New York: Springer.
Vispoel, W. P. (1999). Creating computerized adaptive tests of music aptitude: Problems, solutions, and future directions. In F. Drasgow & J. B. Olson-Buchanan (Eds.), Innovations in computerized assessment (pp. 151–176). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Walker, G., & Crandall, J. (1999, February). Value added by computer-based TOEFL test [TOEFL briefing]. Princeton, NJ: Educational Testing Service.
Williamson, D. M., Bejar, I. I., & Hone, A. S. (1999). 'Mental model' comparisons of automated and human scoring. Journal of Educational Measurement, 36, 158–184.
Williamson, D. M., Hone, A. S., Miller, S., & Bejar, I. I. (1998, April). Classification trees for quality control processes in automated constructed response scoring. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
Yang, Y., Buckendahl, C. W., Juszkiewicz, P. I., & Bhola, D. S. (2002/this issue). A review of strategies for validating computer automated scoring. Applied Measurement in Education, 15, 391–412.
Zenisky, A. L., & Sireci, S. G. (2001). Feasibility review of selected performance assessment item types for the computerized Uniform CPA Exam (Laboratory of Psychometric and Evaluative Research Report No. 405). Amherst: School of Education, University of Massachusetts.