Intl. Journal of Human–Computer Interaction, 30: 663–684, 2014
Copyright © Taylor & Francis Group, LLC
ISSN: 1044-7318 print / 1532-7590 online
DOI: 10.1080/10447318.2014.930311
Usability: Lessons Learned . . . and Yet to Be Learned
James R. Lewis
IBM Corporation, Software Group, Boca Raton, Florida, USA
The philosopher of science J. W. Grove (1989) once wrote,
“There is, of course, nothing strange or scandalous about divi-
sions of opinion among scientists. This is a condition for scientific
progress” (p. 133). Over the past 30 years, usability, both as a prac-
tice and as an emerging science, has had its share of controversies.
It has inherited some from its early roots in experimental psychol-
ogy, measurement, and statistics. Others have emerged as the field
of usability has matured and extended into user-centered design
and user experience. In many ways, a field of inquiry is shaped by
its controversies. This article reviews some of the persistent con-
troversies in the field of usability, starting with their history, then
assessing their current status from the perspective of a pragmatic
practitioner. Put another way: Over the past three decades, what
are some of the key lessons we have learned, and what remains to
be learned? Some of the key lessons learned are:
• When discussing usability, it is important to distinguish between the goals and practices of summative and formative usability.
• There is compelling rational and empirical support for the practice of iterative formative usability testing—it appears to be effective in improving both objective and perceived usability.
• When conducting usability studies, practitioners should use one of the currently available standardized usability questionnaires.
• Because “magic number” rules of thumb for sample size requirements for usability tests are optimal only under very specific conditions, practitioners should use the tools that are available to guide sample size estimation rather than relying on “magic numbers.”
1. INTRODUCTION
Therefore, the seeker after the truth is not one who studies the writings of the ancients and, following his natural disposition, puts his trust in them, but rather the one who suspects his faith in them and questions what he gathers from them, the one who submits to argument and demonstration, and not to the sayings of a human being whose nature is fraught with all kinds of imperfection and deficiency. Thus the duty of the man who investigates the writings of scientists, if learning the truth is his goal, is to make himself an enemy of all that he reads, and, applying his mind to the core and margins of its content, attack it from every side. He should also suspect himself as he performs his critical examination of it, so that he may avoid falling into either prejudice or leniency. (Ibn al-Haytham, 965–1040, as cited in Sabra, 2003, p. 55)

Editors’ note: This article was the keynote presentation at HCII 2014, the 16th International Conference on Human–Computer Interaction and its Affiliated International Conferences, 22–27 June 2014, Crete, Greece.

Address correspondence to James R. Lewis, 7329 Serrano Terrace, Delray Beach, FL 33446, USA. E-mail: jimlewis@us.ibm.com
I entered the field of usability engineering a little more than
30 years ago. In 1980 I interned at the IBM usability lab
in Boca Raton, Florida (see Figure 1). Although there has
been considerable development in the methods of usability
engineering over the years, a modern practitioner would rec-
ognize the activities. Some of the evaluations that were under
way were traditional human factors experiments, for example,
studies of the optimal angle of inclination of typing key-
boards. Some of the evaluations shared certain properties of
traditional experiments but differed slightly in their focus on
measuring behaviors and attitudes captured as participants completed the key tasks for a product. Other evaluations departed
even more from traditional experimentation in their focus on
the iterative discovery and remediation of usability problems
rather than performance and satisfaction metrics. Thus, even
in the late 1970s, the usability practices commonly referred
to as summative and formative testing (Lewis, 2012) were
present.
Although present, these fundamental usability engineering
methods were not well established. Controversies abounded.
What was the true definition of usability? How reliable were the less traditional iterative evaluations, given their departure from classical psychological experimentation? What was the appro-
priate role of statistical methods (such as hypothesis testing,
confidence intervals, psychometrics and sample size estimation)
in usability engineering?
Much of industrial usability engineering work is confiden-
tial. Companies are reluctant to expose the usability blemishes
of their products and services to the public, preferring instead
to keep them “in house” as they track them and seek to elim-
inate or reduce their impact on users. Practitioners have been
much freer to publish the results of methodological investiga-
tions, exposing and discussing methodological controversies of
significant importance to the development of the field of usabil-
ity engineering. An important aspect of publication is criticism
(through the peer review process and critical literature reviews),
and “criticism is the mother of methodology” (Abelson, 1995, 8th law, p. xv). The purpose of this article is to summarize some of these controversies, providing arguments from both sides and a pragmatic assessment for practitioners. In other words, what are the lessons learned (which controversies appear to be settled) and which are yet to be learned (those for which there is still work to do)?

FIG. 1. Usability lab at IBM facility in Boca Raton, Florida, circa 1978.
2. DO WE KNOW WHAT USABILITY IS?
There has long been a general understanding of the word
usability (sometimes spelled useability). For example, a refrig-
erator advertisement from the 1930s included usability as a
feature, and listed characteristics of usability such as “handier to
use,” “saves steps, saves work,” and “compare with others” (S.
Isensee, personal communication, January 17, 2010; see http://
tinyurl.com/yjn3caa). In 1979, Bennett published what may
have been the first scientific article to have the term “usability”
in the title. But do we really know what usability is?
2.1. One Point of View
The following quotations span a period of roughly 10 to
20 years from Bennett’s early scientific use of the term “usabil-
ity.”
One of the most important issues is that there is, as yet, no gen-
erally agreed definition of usability and its measurement. (Shackel,
1990, p. 31)
Attempts to derive a clear and crisp definition of usability can be
aptly compared to attempts to nail a blob of Jell-O to the wall. (Gray
& Salzman, 1998, p. 242)
A major obstacle to the implantation of User-Centered Design
in the real world is the fact that no precise definition of the concept
of usability exists that is widely accepted and applied in practice.
(Alonso-Ríos, Vázquez-Garcia, Mosqueira-Rey, & Moret-Bonillo,
2010, p. 53)
Basically, the argument from this side is that it is either
impossible or so difficult to define usability that for a period
of more than 30 years there has yet to be a clear and generally
accepted definition. There are several reasons why this might be
so. The measurement of usability is complex because usability
is not a specific property of a person or thing. You cannot mea-
sure usability with a simple “usability” thermometer (Dumas,
2003; Hertzum, 2010; Hornbæk, 2006). Rather, it is an emer-
gent property dependent on interactions among users, products,
tasks, and environments.
Also, there are two major conceptions of usability (Lewis,
2012), commonly referred to as “summative” and “formative.”
Although there are similarities, the differences between summa-
tive and formative usability are substantial enough that a single
concise definition cannot cover both. The focus of summative
usability measurement is on metrics associated with meeting
global task and product goals (i.e., measurement-based usabil-
ity). The focus of formative usability is the detection of usability
problems and the design of interventions to reduce or eliminate
their impact (i.e., diagnostic usability).
2.2. Another Point of View
Although it may be difficult to define usability, it should be
possible given an appropriate distinction between summative
and formative conceptions. One of the early attempts to define
summative usability was the MUSiC project—Measurement of
Usability in Context (Bevan, Kirakowski, & Maissel, 1991).
The MUSiC project focused on metrics of effectiveness and
efficiency in the context of use. This type of research led to
the current International Organization for Standardization (ISO;
1998) and American National Standards Institute (ANSI; 2001)
usability standards, which continued to emphasize the impor-
tance of effectiveness and efficiency in the context of use and
added the subjective metric of satisfaction. Most usability prac-
titioners have at least some familiarity with these standards and
their three key metrics. These metrics and their collection and
interpretation bear a strong resemblance to the methods and
metrics of experimental psychology, especially as instantiated
in human factors engineering (Lewis, 2011a).
The concept of formative usability, on the other hand, repre-
sents a significant departure from the practices of experimental
psychology. Formative usability has strong ties to the prac-
tice of iterative design—building something, checking to see
where it could be improved, improving it, and trying again (i.e.,
design–test–redesign–retest). The earliest definitions of forma-
tive usability came from Chapanis and his students (Al-Awar,
Chapanis, & Ford, 1981; Chapanis, 1981; Kelley, 1984).
Although it is not easy to measure “ease of use,” it is easy to mea-
sure difficulties that people have in using something. Difficulties and
errors can be identified, classified, counted, and measured. So my
premise is that ease of use is inversely proportional to the number
and severity of difficulties people have in using software. There are,
of course, other measures that have been used to assess ease of use,
but I think the weight of the evidence will support the conclusion
that these other dependent measures are correlated with the number
and severity of difficulties. (Chapanis, 1981, p. 3)
The publications of Chapanis and his colleagues had an
almost immediate influence on product development practices
at IBM (Kennedy, 1982; Lewis, 1982) and other companies,
notably Xerox (Smith, Irby, Kimball, Verplank, & Harslem,
1982) and Apple (Williams, 1983). Shortly thereafter, John
Gould and his associates at the IBM T. J. Watson Research
Center began publishing influential papers on usability testing
and iterative design (Gould, 1988; Gould & Boies, 1983; Gould,
Boies, Levy, Richards, & Schoonard, 1987; Gould & Lewis,
1984), as did Whiteside, Bennett, and Holtzblatt (1988) at DEC
(Baecker, 2008; Dumas, 2007). More recently, there have been
contributions by practitioners from the ANSI, both for stan-
dardization of summative usability testing reports (the Common
Industry Format; ANSI, 2001) and recommendations for effec-
tive reporting of formative usability test results (Theofanos &
Quesenbery, 2005), along with several informative government
websites (zing.ncsl.nist.gov/iusr/; www.usability.gov).
The concept of formative usability evaluation in the con-
text of iterative design has affected the development of a
number of empirical and inspection methods. Empirical meth-
ods include standard formative usability testing, either with or
without the think-aloud (TA) protocol (Dumas, 2003) and con-
textual evaluations (Whiteside et al., 1988). Inspection methods
include expert and heuristic evaluations (Nielsen & Mack,
1994), as well as a variety of other structured protocols such as
GOMS (Card, Moran, & Newell, 1983), cognitive walkthroughs
(Spencer, 2000; Wharton, Rieman, Lewis, & Polson, 1994),
and card sorting (Snyder, Happ, Malcus, Paap, & Lewis, 1985;
Tullis, 1985; Tullis & Albert, 2008, 2013).
2.3. Lessons Learned
When discussing usability, it is important to distinguish
between the goals and practices of summative and formative
usability. Following the summative conception, a product is
usable when people can use it for its intended purpose effec-
tively, efficiently, and with a feeling of satisfaction. Following
the formative conception, the presence of usability depends on
the absence of usability problems.
Regarding the assessment of usability, summative and formative methods share a number of properties. Both require
• a careful plan of study, including initial instructions and debriefing protocols;
• participants who are members of the population of interest; and
• appropriate tasks and environments (i.e., context of use).
There are also important differences. Summative evaluations
tend to be more like traditional experiments, with little to no
interaction between observers and participants during the per-
formance of tasks and no changes to the system or product
during the study. Formative studies permit a wide variation in
technique (Wildman, 1995), which can be very formal or infor-
mal, silent or TA; participants can work solo or in pairs; use low-
or high-fidelity prototypes; or use current, future, or competitive
products (Lewis, 2012).
Ideally, practitioners should use both conceptualizations of
usability during iterative design and should combine qualitative
and quantitative methods. Before conducting testing with users,
part of the preparation of a study should include an inspec-
tion method such as heuristic or expert evaluation. Any iterative
method must include a stopping rule to prevent infinite itera-
tions. In the real world, resource constraints and deadlines can
dictate the stopping rule (although this practice is valid only
if there is a reasonable expectation that undiscovered problems
will not lead to drastic consequences). In an ideal setting, the
results of summative usability testing can act as a stopping
rule for the iterative formative studies when user performance
and preference meet predefined summative goals (Lewis, 2012).
This is not a new concept, having appeared in the seminal
paper by Al-Awar et al. (1981) describing iterative usability
evaluation:
Our methodology is strictly empirical. You write a program, test
it on the target population, find out what’s wrong with it, and revise
it. The cycle of test–rewrite is repeated over and over until a sat-
isfactory level of performance is reached. Revisions are based on
the performance, that is, the difficulties typical users have in going
through the program. (p. 31)
2.4. Lessons Yet to Be Learned
Usability science is a relatively young endeavor, and there
are a number of lessons yet to be learned. An exhaustive
treatment is beyond the scope of this article. Two of the exist-
ing controversies are the scope of usability and the details of
appropriate TA evaluations.
The scope of usability. What is the appropriate scope of
usability? As just described, the typical definitions of usabil-
ity have to do with measurements of effectiveness, efficiency,
and satisfaction (summative) or the absence of usability prob-
lems (formative). These definitions are fairly well engrained
in current practices of usability engineering. Over the decades
since the early scientific descriptions of usability, there have
been several extensions, including user-centered design (UCD;
Vredenburg, Mao, Smith, & Carey, 2002) and, more recently,
user experience (UX; Tullis & Albert, 2008, 2013).
These extensions typically have traditional usability as a core
concept. The extensions of UCD were primarily in the specifi-
cation of product development practices and included usability
engineering, human factors engineering, and ergonomics, all
within frameworks intended to incorporate these activities into
the product development life cycle. For UX, the extensions have
been more in the direction of design and measurement beyond
the traditional goals of effectiveness, efficiency, and satisfaction
to experiences that have a more compelling emotional effect.
Historically, UCD subsumed usability engineering (as well as
ergonomics and human factors engineering), and UX has sub-
sumed UCD. In the near future, perhaps UX will become part
of a larger customer experience effort, especially given recent
emphasis on service design and the emergence of the discipline
of service science.
Service science (Lusch, Vargo, & O’Brien, 2007; Lusch,
Vargo, & Wessels, 2008; Pitkänen, Virtanen, & Kemppinen,
2008; Spohrer & Maglio, 2008) is an interdisciplinary area of
study focused on systematic innovation in service as opposed
to physical product design. The U.S. economy relies heavily
on service industries (>75%; see Larson, 2008). In a service
industry customers pay for performance rather than physical
goods. Some key attributes of service are that it is time per-
ishable, created and used simultaneously, and includes a client
who participates in the coproduction of value. As work in a ser-
vice system matures, there tends to be a shift from service based
on human talent to technology-based self-service (Spohrer &
Maglio, 2008). With the change to automated service deliv-
ered through interactive voice response systems (Lewis, 2011b),
mobile services, or websites, it is important to design for an
excellent user experience (effective, efficient, and satisfying),
especially when it is relatively easy for users to switch service
providers. A highly usable and compelling service experience
leads to enhanced customer attraction and retention (Xue &
Harker, 2002). It seems likely that service science and customer
experience would benefit from the adoption of lessons learned
in usability engineering and science.
Throughout the transformations from usability engineering
to UCD to UX, usability can be a relatively stable component.
Some researchers have suggested changes to the fundamental
definition of usability. Bevan (2009) recommended including
flexibility and safety to create a more comprehensive quality-
of-use model. An even more expansive scheme (quality in use
integrated measurement) included 10 factors, 26 subfactors, and
127 specific metrics (Seffah, Donyaee, Kline, & Padda, 2006).
Winter, Wagner, and Deissenboeck (2008) published a two-
dimensional model of usability that associated a large number of
system properties with user activities. Alonso-Rios et al. (2010)
assembled a taxonomy that included traditional and nontradi-
tional aspects of usability organized under the primary factors
of Knowability, Operability, Efficiency, Robustness, Safety,
and Subjective Satisfaction. It isn’t yet clear what role these
expanded models of usability will play in the work of future
practitioners and researchers. There is compelling psychome-
tric evidence for an underlying construct of usability for the
traditional metrics of effectiveness, efficiency, and satisfaction
(Sauro & Lewis, 2009), but these expanded definitions have
yet to undergo statistical testing to confirm their hypothesized
structures.
There are still lessons to be learned in investigating the
effects of culture on the construct of usability (Hertzum et al.,
2007; Marcus, 2007). To what extent are aspects of usabil-
ity culturally invariant and what aspects are affected by cul-
ture? As discussed later in this article, there are clear cul-
tural considerations when conducting TA studies (Clemmensen,
Hertzum, Hornbæk, Shi, & Yammiyavar, 2009) or when trans-
lating standardized usability questionnaires (van de Vijver
& Leung, 2001). Surveys of Danish and Chinese users
revealed differences in the understanding and prioritization of
aspects of usability (Frandsen-Thorlacius, Hornbæk, Hertzum,
& Clemmensen, 2009). The finding that system acceptance was
more affected by perceived usefulness for Chinese students but more by perceived ease of use for Indonesian students (Evers & Day,
1997) suggests that a simple East–West cultural dichotomy will
be insufficient.
TA methodology. In a TA study, participants receive
instructions to talk about what they’re doing as they do it, and
receive reminders to talk aloud if they forget to do so. The most
common theoretical justification for the use of TA is from the
human problem-solving research of Ericsson and Simon (1980),
who found that certain kinds of verbal reports could produce
reliable data. Specifically, the verbalizations that participants
produce during task performance that do not require additional
cognitive processing beyond that required for task performance
and verbalization tend to be reliable.
Some of the common claims associated with TA studies
are that they are more productive for finding usability prob-
lems (van den Haak & de Jong, 2003; Virzi, Sorce, & Herbert,
1993) and thinking aloud does not affect user ratings or per-
formance (Bowers & Snyder, 1990; Ohnemus & Biers, 1993;
Olmsted-Hawala, Murphy, Hawala, & Ashenfelter, 2010).
Although there is some evidence in support of these claims,
the evidence is mixed. For example, Berry and Broadbent
(1990) reported that the TA process invoked cognitive pro-
cesses that improved rather than degraded performance. Wright
and Converse (1992) compared silent with TA usability testing
and found that the TA group performed better with the differ-
ence increasing as a function of task difficulty. MacDonald,
McGarry, and Willis (2013) found that task complexity likely
interacts with differences in TA protocols.
TA practice in usability testing often does not conform to
its most cited theoretical basis (Ericsson & Simon, 1980), with
reported inconsistencies in explanations to participants about
how to do TA, practice periods, styles of reminding participants
to TA, prompting intervals, and styles of intervention (Boren
& Ramey, 2000; MacDonald, Edwards, & Zhao, 2012). Boren
and Ramey (2000) suggested an alternative theoretical approach
to TA based on speech communication theory, with clearly
defined communicative roles for the participant (in the role
of domain expert or valued customer, making the participant
the primary speaker) and the usability practitioner (the learner
or listener, thus a secondary speaker). They recommended the
use of acknowledgment tokens that do not take speakership
away from the participant, such as “mm hm?” and “uh-huh?”
because in normal communication silence can be interpreted as
aloofness or condescension.
Krahmer and Ummelen (2004) conducted an exploratory
comparison of the Ericsson and Simon (E&S) versus the Boren
and Ramey (B&R) TA procedures and found similar outcomes
for both procedures. The main difference was that moderators in
the B&R condition intervened more frequently with the conse-
quence that the participants were less lost and completed more
tasks. Hertzum, Hansen, and Andersen (2009) compared silent
task completion with strict E&S and more relaxed TA. Strict
E&S TA required more time for task completion, but the TA
method did not affect successful task completion rates (which
tended to be high in the study).
Olmsted-Hawala et al. (2010) studied E&S TA, B&R TA,
a less restrictive coaching protocol in which moderators could
freely probe participants, and silence (no TA at all). The out-
comes were similar for silence, E&S, and B&R procedures.
Participants in the coaching condition successfully completed
significantly more tasks and had higher satisfaction ratings.
Their results for B&R differed from those reported by Krahmer
and Ummelen (2004): “Since the test administrator in the
Krahmer & Ummelen study offered assistance and encour-
agement to the test subject during the session, we think their
speech-communication protocol is more akin to the coaching
condition in our study” (Olmsted-Hawala et al., 2010, p. 2387).
Clemmensen et al. (2009) discussed the impact of cultural
differences on TA. There are several ways in which cultural
differences could affect testing, such as the instructions and
tasks, the participant’s verbalization, how the observer “reads”
the participant, and the overall relationship between participant
and observer. In particular, with regard to studies that have
Western observers and Eastern participants, they recommended
that observers should allow sufficient time for participants to
pause while thinking aloud, rely less on expressions of surprise,
and be sensitive to the tendency for indirect criticism.
The evidence indicates that relative to silent participation, TA
can affect task performance and reported satisfaction, depend-
ing on the exact TA protocol in use. If the primary purpose
of the study is problem discovery, TA appears to be advanta-
geous over completely silent task completion. If the primary
purpose of the test is task performance measurement, the use
of TA is somewhat more complicated. As long as all the tasks
in the planned comparisons were completed under the same
conditions, performance comparisons should be legitimate. It is
critical, however, that practitioners using TA provide a complete
description of their method, including the kind and frequency
of probing. There is still work to do before we will have a
deep understanding of the effects of current variations in TA
protocols.
3. IS FORMATIVE USABILITY TESTING RELIABLE?
The widespread use of formative usability testing is evi-
dence that practitioners generally believe that it is effective.
However, there are fields in which practitioners’ belief in the
effectiveness of their methods does not appear to be warranted
by the evidence (e.g., the use of projective techniques such
as the Rorschach test in psychotherapy; see Lilienfeld, Wood,
& Garb, 2000). Might it be possible that formative usability
testing is fundamentally unreliable and, if so, should usability
practitioners abandon the method?
3.1. One Point of View
Human factors work can be reliable: different human factors engineers, using different human factors techniques at different stages of a product’s development, identified many of the same potential usability defects. (Marshall, Brendan, & Prail, 1990, p. 243)
There are rational arguments in favor of iterative formative
usability testing, starting with the early publications that ini-
tially described and promoted the method (Al-Awar, Chapanis,
& Ford, 1981; Chapanis, 1981; Gould, 1988; Gould & Lewis,
1984). The basic idea of achieving usability by watching people
use a product, noting the problems, and then fixing the prob-
lems, seems irrefutable. This rational argument has received
support from a number of published case studies and a few early
experiments (G. Bailey, 1993; R. W. Bailey, Allan, & Raiello,
1992; Gould et al., 1987; Høegh & Jensen, 2008; Lewis, 1996;
Marshall et al., 1990). Published cost–benefit analyses (Bias
& Mayhew, 1994) have demonstrated the value of usability
engineering processes that include usability testing, with cost–
benefit ratios ranging from 1:2 for smaller projects to 1:100 for
larger projects (Karat, 1997). For example, consider the results
of one case study (Lewis, 1996) and one experiment (G. Bailey,
1993).
Lewis (1996) published a case study of the development
of the Simon, a personal communicator now widely consid-
ered to be the first commercially available smartphone. The
development team (including the usability engineers) defined
a set of tasks to use to develop competitive benchmarks and
for iterative formative usability testing. As shown in Figure 2,
the perceived usability of the Simon (measured using the Post-
Study System Usability Questionnaire [PSSUQ]; Lewis, 1995)
dramatically improved after the first iteration (from Simon A
to Simon B), then showed very little improvement after the
second iteration (from Simon B to Simon C). The perceived
usability of the Simon after the application of formative usabil-
ity testing was better than the initially established benchmarks
(Lewis, 1996).
G. Bailey (1993) conducted an experiment in which he
had eight designers use a prototyping tool to create a recipes
application. Bailey then recorded participants performing tasks
with each of the prototypes, three participants per prototype in
a between-subjects design. Each designer reviewed the tapes
of the use of his or her prototype and used those observa-
tions to modify the design. This process continued until each
designer indicated that further improvements were not possible.
All designers stopped after three to five iterations. Comparison
of the first and last designs showed significant improvement
in successful task completion rates, task completion times, and
number of serious errors.
3.2. Another Point of View
Our main conclusion is that our simple assumption that we are
all doing the same and getting the same results in a usability test is
plainly wrong. (Molich, Ede, Kaasgaard, & Karyukin, 2004, p. 65)
Since 1998, a number of papers have questioned the reliabil-
ity of usability problem discovery (Kessner, Wood, Dillon, &
West, 2001; Molich et al., 1998; Molich, Ede, Kaasgaard, &
Karyukin, 2004; Molich & Dumas, 2008)—a process that is
at the heart of iterative usability testing. The consistent find-
ing from this line of research (in particular, Molich’s series of
competitive usability evaluation [CUE] studies) has been that
observers, individually or in teams, who evaluated the same
product discovered very different sets of usability problems.
In Molich et al. (1998), four independent usability labo-
ratories carried out inexpensive usability tests of a software
application for new users. The four teams reported 141 dif-
ferent problems, with only one problem common among all
four teams. Kessner et al. (2001) had six professional usabil-
ity teams independently test an early prototype of a dialog
box. None of the problems were detected by every team, and
18 problems were described by only one team. Molich et al.
(2004) assessed the consistency of usability testing across nine
independent organizations that evaluated the same website.
They documented considerable variability in methodologies,
resources applied, and problems reported. There were a total of
310 reported problems, with only two problems reported by six
or more organizations, and 232 (75%) uniquely reported prob-
lems. The fourth CUE (CUE-4; Molich & Dumas, 2008) had a
similar method and similar outcomes.
FIG. 2. Improvements in perceived usability through iterative usability testing (mean PSSUQ ratings, on a 1-to-7 scale, for the SysUse, InfoQual, IntQual, and Overall scales, comparing the Benchmark with Simon A, Simon B, and Simon C). Note. Lower Post-Study System Usability Questionnaire (PSSUQ) scores indicate better perceived usability.
3.3. Lessons Learned
There is compelling rational and empirical support for the
practice of iterative formative usability testing—it appears to be
effective in improving both objective and perceived usability.
There is also compelling evidence that a key aspect of formative
usability testing—problem discovery—is not reliable because
independent evaluations do not appear to produce identical (or
even similar) lists of usability issues. Thus, we have two lessons
learned that appear to be in stark contrast with one another.
To reconcile this apparent dilemma, it is important to keep
in mind the limitations of small-sample formative usability test-
ing. It is not realistic to expect similar lists of usability problems
when sample sizes are small and when there is variation in the
sample of tested tasks. These conditions will lead inevitably
to different sets of usability issues. The process of iterative
usability testing is a hill-climbing procedure, so variation in the
sets of usability problems driving redesign should still result in
movement up the hill toward more usable design—just not nec-
essarily the exact same design. A characteristic of hill-climbing
procedures is that there may be many paths up the hill.
Iterative usability testing alone cannot ensure a usable-
enough design. It is also important, when working in an existing
design space, to have an understanding of the competitive
design landscape. To accomplish this, define the key set of tasks
early in the design process, and then conduct usability tests
with key competitors to establish benchmarks for objective and
perceived usability (e.g., as in Lewis, 1996). Iterative forma-
tive usability evaluation shows the way to climb the hill toward
more usable design, and competitive usability testing points to
the right hill to climb.
The results of the CUE (and similar) studies of Molich and
colleagues (Molich et al., 1998; Molich & Dumas, 2008; Molich
et al., 2004) show that usability practitioners must conduct their
usability tests as carefully as possible, document their methods
completely, and show proper caution when interpreting their
results. Although still a valuable and necessary part of UCD,
the limitations of usability testing make it insufficient for cer-
tain testing goals, such as quality assurance of safety-critical
systems (Schmettow, 2008; Thimbleby, 2007), which need addi-
tional activities such as acceptance and continuous use testing.
It can be difficult to assess complex systems with complex goals
and tasks (Howard, 2008; Howard & Howard, 2009; Redish,
2007). On the other hand, as Landauer stated in 1997 (and it remains true today): “There is ample evidence that expanded
task analysis and formative evaluation can, and almost always
do, bring substantial improvements in the effectiveness and
desirability of systems” (p. 204).
3.4. Lessons Yet to Be Learned
There is a clear need for more research in the reliability
of formative usability testing—not so much the lack of relia-
bility as its practical consequences. For example, a limitation
of research that stops with the comparison of problem lists is
that it is not possible to assess the magnitude of the usability
improvement (if any) that would result from product redesigns
based on design recommendations derived from the problem
lists (Hornbæk, 2010; Wixon, 2003).
A related research area is in the quality of usability prob-
lem lists and associated recommendations. Because the content
of these lists provides the most direct information about how
to improve usability, developers have an intense interest in
them (Capra, 2007; Høegh, Nielsen, Overgaard, Pedersen, &
Stage, 2006; Nørgaard & Hornbæk, 2009). There are, however,
different approaches to the construction of usability recommen-
dations and to their prioritization.
Recommendations. For example, there is some controversy
in the usability practitioner community regarding whether prob-
lem descriptions should or should not include recommendations
(Theofanos & Quesenbery, 2005). In an exploratory study of
different presentation methods for usability issues and recom-
mendations (Nørgaard & Hornbæk, 2009), developers rated
redesign proposals, multimedia presentations, and screenshots
as useful inputs, problem lists second, and scenarios as least
helpful, with problem lists best suited for documenting rel-
atively simple problems that did not require a strong con-
text for understanding the issue. Molich, Jeffries, and Dumas
(2007) analyzed data collected during CUE-4 to develop guide-
lines for making usability recommendations useful and usable,
including
• Communicate clearly at the conceptual level.
• Ensure that recommendations improve overall usability.
• Be aware of business or technical constraints.
• Solve the whole problem, not just a special case.
The results and recommendations from this line of research
seem reasonable, but they still require validation so practition-
ers can understand their true downstream utility in leading to
changes that improve usability.
Prioritization. Another area in which there is variability
in usability engineering practice is the prioritization of usabil-
ity problems. Because usability tests can reveal more problems
than there are resources to address, it is important to have
some means for prioritization, keeping in mind that design pro-
cess considerations can influence the specific usability changes
made to a product (Hertzum, 2006). Two fundamentally dif-
ferent approaches to prioritization are judgment driven (Virzi,
1992) and data driven (Dumas & Redish, 1999; Lewis, Henry,
& Mack, 1990; Rubin, 1994; Rubin & Chisnell, 2008). The
bases for judgment-driven prioritizations are the ratings of
stakeholders in the project (such as usability practitioners and
developers). The bases for data-driven prioritizations are the
data associated with the problems, such as frequency, impact,
ease of correction, and likelihood of usage of the portion of
the product in which the problem occurred (Lewis, 2012).
Of these, the most common measurements are frequency and
impact (sometimes referred to as severity). In a study of the
two approaches to prioritization, Hassenzahl (2000) found that
data-driven and judgment-driven estimates differed.
The measurement of frequency of occurrence is the straightforward division of the number of occurrences within participants by the number of participants, usually at the
task level. A common method (Dumas & Redish, 1999; Rubin,
1994; Rubin & Chisnell, 2008) for assessing impact is to assign
impact scores according to whether the problem (a) prevents
task completion, (b) causes a significant delay or frustration, (c)
has a relatively minor effect on task performance, or (d) is a
suggestion.
Prioritization based on multiple types of data requires some
means of data combination. For example, one could employ a
graphical problem grid with frequency on one axis and impact
on the other. High-frequency, high-impact problems would
receive treatment before low-frequency, low-impact problems.
The relative treatment of high-frequency, low-impact problems
and low-frequency, high-impact problems would depend on
practitioner judgment.
Rubin (1994) described an arithmetic procedure for com-
bining four levels of impact (using the criteria just described
with 4 assigned to the most serious level) with four levels of
frequency (4: frequency ≥ 90%; 3: 51–89%; 2: 11–50%; 1: ≤ 10%) by adding the scores. For example, if a problem had
an observed frequency of occurrence of 80% and had a minor
effect on performance, its priority would be 5 (a frequency rat-
ing of 3 plus an impact rating of 2). With this approach, priority
scores can range from a low of 2 to a high of 8. If information
is available about the likelihood that a user would work with
the part of the product that enables the problem, this informa-
tion would be used to adjust the frequency rating. Continuing
the example, if the expectation is that only 10% of users would
encounter the problem, the priority would be 3 (a frequency rat-
ing of 1 for the 10% × 80%, or an 8% likelihood of occurrence
plus an impact rating of 2).
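To make the arithmetic concrete, the sketch below implements the additive scheme just described in Python. The function names and example values are illustrative assumptions; only the frequency cutoffs, impact levels, and worked example follow the description above.

```python
def rubin_frequency_rating(frequency_pct):
    """Map an observed frequency of occurrence (%) to Rubin's 1-4 rating."""
    if frequency_pct >= 90:
        return 4
    elif frequency_pct >= 51:
        return 3
    elif frequency_pct >= 11:
        return 2
    else:
        return 1

def rubin_priority(frequency_pct, impact_rating, likelihood_of_use=1.0):
    """Additive priority: frequency rating plus impact rating (range 2-8).

    impact_rating: 4 = prevents task completion, 3 = significant delay or
    frustration, 2 = minor effect on task performance, 1 = suggestion.
    likelihood_of_use optionally discounts the observed frequency.
    """
    adjusted_frequency = frequency_pct * likelihood_of_use
    return rubin_frequency_rating(adjusted_frequency) + impact_rating

# Worked example from the text: 80% frequency, minor impact (2) -> 3 + 2 = 5.
print(rubin_priority(80, 2))                           # 5
# If only 10% of users are expected to encounter it: 1 + 2 = 3.
print(rubin_priority(80, 2, likelihood_of_use=0.10))   # 3
```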
A similar strategy is to multiply the observed percentage fre-
quency of occurrence by the impact score (Lewis et al., 1990).
The range of priorities depends on the values assigned to each
impact level. Assigning 10 to the most serious impact level
leads to a maximum priority (severity) score of 1000 (which can
optionally be divided by 10 to create a scale ranging from 1 to
100). Appropriate values for the remaining three impact cate-
gories depend on practitioner judgment, but a reasonable set is
5, 3, and 1. Using those values, the problem with an observed
frequency of occurrence of 80% and a minor effect on perfor-
mance would have a priority of 24 (80 × 3/10). It is possible
to extend this method to account for the likelihood of use using
the same procedure as that described by Rubin (1994), which
in the example resulted in modifying the frequency measure-
ment from 80 to 8%. Another way to extend the method is to
categorize the likelihood of use with a set of categories such as
very high likelihood (assigned a score of 10), high likelihood
(assigned a score of 5), moderate likelihood (assigned a score
of 3), and low likelihood (assigned a score of 1) and multiply
all three scores to get the final priority (severity) score (then
optionally divide by 100 to create a scale that ranges from 1 to
100). Continuing the previous example with the assumption that
the task in which the problem occurred has a high likelihood of
occurrence, the problem’s priority would be 12 (5 × 240/100).
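A comparable sketch of the multiplicative approach (frequency × impact, optionally × likelihood of use) follows. The impact and likelihood values of 10, 5, 3, and 1 match the reasonable set suggested above, but the function itself is only an illustration, not a fixed standard.

```python
def multiplicative_priority(frequency_pct, impact_score, likelihood_score=None):
    """Severity = frequency (%) x impact score, optionally x likelihood score.

    impact_score: 10 for the most serious level; for example, 5, 3, and 1 for
    the lesser levels. likelihood_score: for example, 10 (very high), 5 (high),
    3 (moderate), 1 (low). Results are rescaled so the maximum score is 100.
    """
    if likelihood_score is None:
        return frequency_pct * impact_score / 10
    return frequency_pct * impact_score * likelihood_score / 100

# Worked examples from the text: 80% frequency with a minor (3) impact.
print(multiplicative_priority(80, 3))                       # 24.0
# The same problem in a task with a high (5) likelihood of use.
print(multiplicative_priority(80, 3, likelihood_score=5))   # 12.0
```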
As far as I know, no one has systematically compared
various prioritization schemes against one or more sets of
usability problems to investigate the similarities and differ-
ences in their outputs. Of even greater utility to practitioners
would be insights into the downstream effectiveness of various
prioritization schemes, but it is notoriously difficult to con-
duct that type of research. Clearly, there are still lessons to be
learned about how to effectively prioritize usability problems
and recommendations.
4. IS IT OK TO AVERAGE MULTIPOINT SCALES?
Usability practitioners who use measurement and statistics
to guide design recommendations, as most do, inherit the con-
troversies from those fields. One of the ongoing controversies
in measurement and statistics is what role, if any, the level of
measurement plays in determining acceptable arithmetic and
statistical manipulation. The controversy started when S. S.
Stevens (1946) declared that all numbers are not created equal,
and defined the following levels of measurement:
• Nominal: Numbers that are simply labels, such as the numbering of football players or model numbers.
• Ordinal: Numbers that have an order, but the differences between numbers do not necessarily correspond to the differences in the underlying attribute, such as levels of multipoint rating scales or rank order of sports teams based on percentage of wins.
• Interval: Numbers that not only are ordinal but for which equal differences in the numbers correspond to equal differences in the underlying attribute, such as Fahrenheit or Celsius temperature scales.
• Ratio: Numbers that not only are interval but for which there is a true 0 point so equal ratios in the numbers correspond to equal ratios in the underlying attribute, such as time intervals (reaction time, task completion times) or the Kelvin temperature scale.
4.1. One Point of View
From these four classes of measurements, Stevens argued
that certain types of arithmetic operations were not reason-
able to apply to certain types of data. Based on the “principle
of invariance,” he recommended against doing anything more
than counting nominal and ordinal data, and he restricted addi-
tion, subtraction, multiplication, and division to interval and
ratio data. From this perspective, strictly speaking, the mul-
tipoint scales commonly used for rating attitudes are ordinal
measurements, so it would not be permissible to compute their
arithmetic means. If it’s illogical to compute means of rating
scale data, then it follows that it is incorrect to use statistical
procedures such as t tests that depend on computing the mean.
Stevens’s levels of measurement have been very influential,
appearing in numerous statistics textbooks and used to guide
recommendations given to users of some statistical analysis
programs (Velleman & Wilkinson, 1993).
4.2. Another Point of View
That I do not accept Stevens’ position on the relationship
between strength of measurement and “permissible” statistical pro-
cedures should be evident from the kinds of data used as examples
throughout this Primer: level of agreement with a questionnaire item,
as measured on a five-point scale having attached verbal labels. ...
This is not to say, however, that the researcher may simply ignore the
level of measurement provided by his or her data. It is indeed crucial
for the investigator to take this factor into account in considering the
kinds of theoretical statements and generalizations he or she makes
on the basis of significance tests. (Harris, 1985, pp. 326–328)
Even if one believes that there is a “real” scale for each attribute,
which is either mirrored directly in a particular measure or mirrored
as some monotonic transformation, an important question is, “What
difference does it make if the measure does not have the same zero
point or proportionally equal intervals as the ‘real’ scale?” If the
scientist assumes, for example, that the scale is an interval scale
when it “really” is not, something should go wrong in the daily
work of the scientist. What would really go wrong? All that could go
wrong would be that the scientist would make misstatements about
the specific form of the relationship between the attribute and other
variables. . . . How seriously are such misassumptions about scale
properties likely to influence the reported results of scientific exper-
iments? In psychology at the present time, the answer in most cases
is “very little.” (Nunnally, 1978, p. 28)
For analyzing ordinal data, some researchers have recom-
mended the use of nonparametric statistical methods that are
similar to the well-known t and F tests but which replace the
original data with ranks before analysis (Bradley, 1976). These
methods (e.g., the Mann-Whitney U-test, the Friedman test,
or the Kruskal-Wallis test), however, involve taking the means
and standard deviations of the ranks, which are ordinal—not
interval or ratio—data. Despite these violations of permissible
manipulation from Stevens’s point of view, those methods work
perfectly well.
Probably the most famous counterargument was by Lord
(1953) with his parable of a retired professor who had a machine
used to randomly assign football numbers to the jerseys of
freshmen and sophomore football players at his university—a
clear use of numbers as labels (nominal data). After assigning
numbers, the freshmen complained that the assignment wasn’t
random—they claimed to have received generally smaller num-
bers than the sophomores and that the sophomores must have
tampered with the machine. In a panic and to avoid impend-
ing violence between the classes, the professor consulted with a
statistician to investigate how likely it was that the freshmen got
their low numbers by chance. Over the professor’s objections,
the statistician determined the population mean and standard
deviation of the football numbers—54.3 and 16.0, respectively.
He found that the mean of the freshmen’s numbers was too
low to have happened by chance, strongly indicating that the
sophomores had tampered with the football number machine
to get larger numbers. The famous fictional dialog between the
professor and the statistician was as follows:
“But these numbers are not cardinal numbers,” the professor
expostulated. “You can’t add them.”
“Oh, can’t I?” said the statistician. “I just did. Furthermore, after
squaring each number, adding the squares, and proceeding in the
usual fashion, I find the population standard deviation to be exactly
16.0.”
“But you can’t multiply ‘football numbers,’” the professor
wailed. “Why, they aren’t even ordinal numbers, like test scores.”
“The numbers don’t know that,” said the statistician. “Since the
numbers don’t remember where they came from, they always behave
just the same way, regardless.” (Lord, 1953, p. 751)
The controversy continued for decades, with measurement
theorists generally supporting the importance of levels of mea-
surement and applied statisticians arguing against it. In their
recap of the controversy, Velleman and Wilkinson (1993) wrote,
“At times, the debate has been less than cordial. Gaito (1980)
aimed sarcastic barbs at the measurement theory camp and
Townsend and Ashby (1984) fired back. Unfortunately, as
Mitchell (1986) noted, they often shot past each other” (p. 68).
The debate has continued into the 21st century (Scholten &
Borsboom, 2009).
It is interesting to note that in Stevens’s (1946) original
paper, he actually took a fairly moderate stance.
On the other hand, for this ‘illegal’ statisticizing there can be
invoked a kind of pragmatic sanction: In numerous instances it leads
to fruitful results. While the outlawing of this procedure would prob-
ably serve no good purpose, it is proper to point out that means and
standard deviations computed on an ordinal scale are in error to the
extent that the successive intervals on the scale are unequal in size.
When only the rank-order of data is known, we should proceed cau-
tiously with our statistics, and especially with the conclusions we
draw from them. (p. 679)
Responding to criticisms of the implications of his
1953 paper, Lord (1954) challenged critics of his logic to par-
ticipate in a game based on the “football numbers” story, with
the statistician paying the critic one dollar every time the statis-
tician incorrectly designated a sample as being drawn from one
of two populations of nominal two-digit numbers and the critic
paying the statistician one dollar when he is right. No critic ever
agreed to play the game.
4.3. Lessons Learned
In the late 1980s I worked on a high-profile project to com-
pare performance and satisfaction across a set of common tasks
for three competitive office application suites (Lewis et al.,
1990). Based on what I had learned in college about Stevens’s
levels of measurement, I pointed out that the multipoint rating
scale data we were dealing with did not meet the assumptions
required for the computation of means, so we should instead
present medians. I also advised using the nonparametric Mann-
Whitney U-test rather than t tests for individual comparisons of
the rating scale results.
The practitioners who started running the statistics and
putting the presentation together (which would have been given
to a group that included high-level IBM executives) called me
in a panic after they started following this advice. In the analy-
ses, there were cases where the medians were identical but the
U-test detected a statistically significant difference. The U-test
is sensitive not only to central tendency but also to the shape of
the distribution, and in these cases the distributions had opposite
skew with overlapping medians. After this experience, I system-
atically investigated the relationship among mean and median
differences for multipoint scales and the observed significance
levels of t tests and U-tests conducted on the same data, all taken
from this fairly large-scale usability test. The mean difference
correlated more than the median difference with the observed
significance levels (both parametric and nonparametric) for dis-
crete multipoint scale data (Lewis, 1993). For the purposes of
analysis and presentation, mean differences were significantly
superior to median differences, and there was apparently no
compelling reason to use the nonparametric rather than the t test
for the assessment of significant differences. However, it would
be a mistake to just ignore the level of measurement.
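The following illustration shows the kind of situation described above, using invented 7-point ratings (not the data from Lewis, 1993) with identical medians but different means and opposite skew; both the t test on the means and the Mann-Whitney U test detect a difference that the medians alone would hide. It is a minimal sketch that assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy import stats

# Hypothetical 7-point ratings with the same median (4) but opposite skew.
group_a = np.array([1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 4, 4])
group_b = np.array([4, 4, 4, 4, 4, 4, 4, 5, 6, 6, 7, 7])

print("medians:", np.median(group_a), np.median(group_b))  # 4.0 vs 4.0
print("means:  ", group_a.mean(), group_b.mean())          # about 3.1 vs 4.9

t_stat, t_p = stats.ttest_ind(group_a, group_b)
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"t test:        t = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Mann-Whitney:  U = {u_stat:.1f}, p = {u_p:.4f}")
```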
When making claims about the interpretation of the out-
comes of statistical tests, it is important to keep in mind that
rating scale data are ordinal rather than interval. An average rat-
ing of 4 might be better than an average rating of 2, and a t test
might indicate that across a group of participants, the difference
is consistent enough to be statistically significant. Even so, it
would be inappropriate to claim that it is twice as good (a ratio
claim), nor should one claim that the difference between 4 and
2 is equal to the difference between 4 and 6 (an interval claim).
The only reasonable claim is that there is a consistent difference.
Fortunately, even if one made the mistake of thinking one prod-
uct was twice as good as another when the scale didn’t justify
it, it would be a mistake that often would not affect the practical
decision of which product is better.
4.4. Lessons Yet to Be Learned
There are methods other than computing the median that are
alternatives to the mean when dealing with multipoint scales.
One of the most common is to use top-box scoring. For exam-
ple, one way to analyze data collected using a 5-point scale is to
report the percentage of ratings of 4 and 5—a top-2-box score.
A potential downside of this method is that these types of met-
rics tend to have a higher variability than the mean, so to achieve
an equal precision of measurement would require a larger sam-
ple size (Sauro & Lewis, 2012). I do not know of any systematic
investigation of means and top-box scoring methods using real
usability data, nor any work done to compare their downstream
utility (which is more effective in leading to changes that
improve usability), so there are still lessons to learn in this area.
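As a rough illustration of the variability point, the sketch below compares the standard error of the mean with the standard error of a top-2-box proportion for a hypothetical set of 5-point ratings; the data and the scaling to a common range are assumptions for illustration only.

```python
import numpy as np

# Invented 5-point ratings from 20 hypothetical participants.
ratings = np.array([2, 3, 3, 4, 4, 4, 5, 5, 5, 5,
                    3, 4, 2, 5, 4, 4, 3, 5, 4, 4])
n = len(ratings)

mean_score = ratings.mean()
sem = ratings.std(ddof=1) / np.sqrt(n)        # standard error of the mean

top2 = np.mean(ratings >= 4)                  # top-2-box proportion (4s and 5s)
se_top2 = np.sqrt(top2 * (1 - top2) / n)      # binomial standard error

print(f"mean = {mean_score:.2f} (SE = {sem:.3f} on the 1-5 scale)")
print(f"top-2-box = {top2:.0%} (SE = {se_top2:.1%})")
# Express both as a fraction of their possible range (4 points vs. 1)
# to compare precision: the top-2-box estimate is noisier.
print(f"relative SE: mean = {sem / 4:.1%}, top-2-box = {se_top2:.1%}")
```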
5. HOW ROBUST ARE STANDARDIZED USABILITY
QUESTIONNAIRES?
A questionnaire is a form designed to obtain information
from respondents. The items in a questionnaire can be open-
ended questions but are more typically multiple choice, with
respondents selecting from a set of alternatives (“Please select
the type of the wine that you prefer.”) or points on a rating
scale (“On a scale of 1 to 5 where 1 is very dissatisfied and
5 is very satisfied, how satisfied were you with your recent visit
to our airport?”). A standardized questionnaire is one designed
for repeated use, typically with a specific set of questions
presented in a specified order using a specified format, with spe-
cific rules for producing metrics. As part of the development
of standardized questionnaires, it is customary for the devel-
oper to report its reliability, validity, and sensitivity—in other
words, for the questionnaire to have undergone psychometric
qualification (Nunnally, 1978).
Standardized measures offer many advantages to practi-
tioners, specifically objectivity, easier replicability, quantifica-
tion, economy, and easier communication of results (Nunnally,
1978). The earliest standardized questionnaires in this area
focused on the measurement of computer satisfaction (e.g., the
Gallagher Value of MIS Reports Scale and the Hatcher and
Diebert Computer Acceptance Scale) but were not designed for
the assessment of usability following participation in scenario-
based usability tests (see LaLomia & Sidowski, 1990, for
a review of computer satisfaction questionnaires published
between 1974 and 1988). The first standardized usability ques-
tionnaires appropriate for usability testing appeared in the late
1980s (Chin, Diehl, & Norman, 1988; Kirakowski & Dillon,
1988; Lewis, 1990a, 1990b). Some standardized usability ques-
tionnaires are for administration at the end of a study. Others
are for a quick, more contextual assessment at the end of each
task or scenario. Currently, the most widely used standard-
ized usability questionnaires for assessment of the perception of
usability at the end of a study (after completing a set of test sce-
narios) and those cited in national and international standards
(ANSI, 2001; ISO, 1998) are as follows:
The Questionnaire for User Interaction Satisfaction
(Chin et al., 1988)
The Software Usability Measurement Inventory
(Kirakowski & Corbett, 1993; McSweeney, 1992)
The PSSUQ (Lewis, 1990a, 1992, 1995)
The System Usability Scale (SUS; Brooke, 1996, 2013; Sauro, 2011)
There is not much controversy about the utility of standardized
questionnaires in usability engineering. Unsurprisingly, stan-
dardized usability questionnaires are more reliable than ad hoc
questionnaires (Hornbæk, 2006; Hornbæk & Law, 2007; Sauro
& Lewis, 2009). As with other aspects of applied statistics,
however, there are controversies in psychometrics inherited by
usability practitioners. A complete treatment of the topic is
beyond the scope of this article. For a recent special issue on
the topic of developing and evaluating scales for use in studies
of human-computer interaction, see Lindgaard and Kirakowski
(2013). Here I focus on a few controversies of particular interest
to usability practitioners and scientists, particularly with regard
to the robustness (tolerance to deviation from specified use) of
standardized usability questionnaires.
5.1. One Point of View
As mentioned previously, a standardized questionnaire has a
specific set of questions presented in a specified order (which
could be random order) using a specified format, with spe-
cific rules for producing metrics. Any deviation, no matter how
slight, from these specifications makes it possible that the result-
ing measures would be invalid. These deviations might include
changing the wording of an item, changing the number of scale
steps or step labels, or using the questionnaire in a setting differ-
ent from its development setting. Also, to avoid artifacts such as
the extreme response bias and the acquiescence bias, the items
in a standardized questionnaire that measures sentiments such
as satisfaction should have a mixed tone, about half positive and
half negative. Mixing the tone of items also provides a way to
check that respondents are making some effort when complet-
ing the questionnaire, at least enough to avoid marking items in
a way that is clearly careless.
5.2. Another Point of View
Robust psychometric instruments should be able to toler-
ate some deviation from specification. When there is deviation
from specification, the question is whether the practitioner
or researcher has merely bent the rules or has broken the
instrument. There is evidence from the psychometric litera-
ture specific to usability engineering that standardized usability
questionnaires are reasonably robust.
All-positive versus mixed-tone items. When I worked with
the team at IBM that produced the PSSUQ and Computer
System Usability Questionnaire (CSUQ) in 1988, we had quite
a bit of discussion regarding whether to use a mixed or consis-
tently positive item tone. Ultimately, we decided to be consis-
tently positive, even though that was not the prevailing practice
in questionnaire development (Lewis, 1995). Our primary con-
cern was that varying the tone would make the questionnaires
more difficult for users to complete, and as a consequence might
increase the frequency of user error in marking items (Lewis,
1999, 2002). As a result, the most common criticism of these
questionnaires has been the consistently positive tone of the
items (e.g., Travis, 2008).
There are a number of articles from different literatures,
however, that have been critical of mixed tone because it can
create undesirable structure in a metric in which positive items
align with one factor and negative items align with the other
(Barnette, 2000; Davis, 1989; Pilotte & Gable, 1990; Schmitt &
Stults, 1985; Schriesheim & Hill, 1981; Stewart & Frye, 2004;
Wong, Rindfleisch, & Burroughs, 2003). It is also possible that
mixed tone could contribute to three types of errors in practice
(Sauro & Lewis, 2011), specifically the following:
Misinterpretation: Users might respond to items forced
into a negative tone in a way that is not the simple
negative of the positive version of the item.
Mistake: Users might not intend to respond differently
to mixed-tone items but might forget to reverse their
score, accidentally agreeing with a negative item when
they meant to disagree.
Miscode: To generate a composite overall score from
mixed-tone items, it is necessary to reverse the scoring
of the negative-tone items before combination with the
positive-tone items. Failure to perform this step would
result in incorrect composite values, with the errors not
necessarily easy to detect.
The SUS (Brooke, 1996), probably the most widely used
standardized usability questionnaire (Sauro & Lewis, 2009), is
an example of an instrument with mixed tone, composed of
10 items with alternating positive and negative tone. As part
of the eighth CUE workshop (Molich, Kirakowski, Sauro, &
Tullis, 2009), eight of 15 teams used the SUS, and one of those
had miscoded results. As part of a meta-analysis of prototypical
usability metrics (Sauro & Lewis, 2009), two of 19 contributed
SUS data sets had coding errors. Taken together, this suggests
that about 10% of SUS data sets may have coding issues.
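For readers unfamiliar with the scoring rule, the following minimal sketch (in Python) implements the standard SUS computation described by Brooke (1996): odd-numbered items contribute their rating minus 1, even-numbered items are reverse coded and contribute 5 minus their rating, and the sum is multiplied by 2.5. Omitting the reverse-coding step for the even-numbered items produces exactly the miscoding error described above.

```python
def sus_score(responses):
    """Standard SUS scoring (Brooke, 1996) for a list of 10 ratings (1-5),
    given in item order. Odd-numbered (positive-tone) items contribute
    (rating - 1); even-numbered (negative-tone) items are reverse coded
    and contribute (5 - rating). The sum is scaled to 0-100."""
    if len(responses) != 10:
        raise ValueError("The SUS has exactly 10 items.")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # index 0 is Item 1 (odd, positive)
        for i, r in enumerate(responses)
    ]
    return sum(contributions) * 2.5

# A respondent who strongly agrees with every positive item (5) and strongly
# disagrees with every negative item (1) gets the maximum score of 100.
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
```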
To more systematically investigate the effect of all-positive
versus mixed tone items, Sauro and Lewis (2011) created an all-
positive version of the SUS and compared its scores with those
of the standard (mixed-tone) version. There were no signifi-
cant differences between the means of the overall SUS scores,
the means of the odd items (positive tone in both versions), or
the means of the even items (positive tone in the all-positive
version, negative tone in the standard version). There was a sig-
nificant difference in the means of the odd- and even-numbered
items, but the difference was consistent across the two versions
of the questionnaire (see Figure 3). There were no significant
differences in either acquiescence or extreme response bias
between the versions. Examination of responses to the stan-
dard version indicated about 17% of completed questionnaires
had an internal inconsistency consistent with a user mistake in
marking one or more items.
FIG. 3. Interaction between odd and even items of the standard and positive versions of the System Usability Scale (SUS). [Figure not reproduced; the plot shows mean composite item ratings after recoding for the odd and even items of each version.]
It is possible to change items so drastically that it affects the
metrics (Sauro, 2010b; Sauro & Lewis, 2012). In an experiment
exploring the manipulation of item tone and intensity, volun-
teer participants rated the Usability Professionals Association
website using one of five versions of the SUS—an all-positive
extreme, an all-negative extreme, one of two versions of an
extreme mix (half positive and half negative extreme), or the
standard SUS. For example, the extreme negative version of
the SUS Item 4 was “I think that I would need a permanent
hot-line to the help desk to be able to use the web site.” The
extreme positive and extreme negative items were significantly
different from the original SUS, consistent with prior research
showing that people tend to agree with statements that are
close to their attitude and to disagree with all other statements
(Spector, Van Katwyk, Brannick, & Chen, 1997; Thurstone,
1928). By rephrasing items to extremes, only respondents who
passionately favored the usability of the Usability Professionals
Association website tended to agree with the extremely phrased
positive statements—resulting in a significantly lower average
score. Likewise, only respondents who passionately disfavored
its usability agreed with the extremely negative statements—
resulting in a significantly higher average score.
Using standardized usability questionnaires outside of a
usability testing context. The psychometric literature includes
research that illustrates invariance in psychometric properties
across different contexts of measurement (e.g., Bangor, Kortum,
& Miller, 2008; Davis & Venkatesh, 1996; Kortum & Bangor,
2013; Lewis, 1995, 2002; Lewis & Mayes, 2014 [included in
this issue]; Sauro & Lewis, 2012). For example, Lewis (1995,
2002) used a variety of methods to collect data for the PSSUQ
and CSUQ, two questionnaires that have essentially the same
items but differ slightly to support use in the lab (PSSUQ) and
as a mailed or online survey (CSUQ). Regardless of the data col-
lection method and slight differences in the content of the items,
the resulting factor structures were identical, as were estimates
of scale reliability and validity.
The SUS also seems to have similar psychometric proper-
ties when used in the lab or when used as part of a web survey
(Bangor et al., 2008; Grier, Bangor, Kortum, & Peres, 2013;
Kortum & Bangor, 2013), at least with regard to reliability and
concurrent validity. There have been some inconsistencies in
its content validity (factor structure), but that does not seem
to be related to its context of measurement (Sauro & Lewis,
2012). The SUS also seems to tolerate other minor changes to
its wording, for example, using “website” or a product name in
place of the original “system,” or the replacement of the word
“cumbersome” with “awkward” (Bangor et al., 2008; Finstad,
2006; Lewis & Sauro, 2009). In a study comparing different
standardized usability questionnaires, the SUS was the fastest
to converge on its large-sample mean (Tullis & Stetson, 2004).
5.3. Lessons Learned
When conducting usability studies, practitioners should use
one of the currently available standardized usability question-
naires. Table 1 lists a number of questionnaires along with some
of their key properties.
It is possible to bend these standardized questionnaires to
an extent, but it is also possible to break them if the devi-
ations from standardization are extreme. Minor changes in
wording should not typically have extreme effects, and the ques-
tionnaires should be robust against the inclusion of additional
custom items. With minor adjustment as needed, the ques-
tionnaires should work well in either a laboratory or remote
usability testing context, or as part of a mailed or online survey.
It is important to plan for whether the measurements will
need interpretation in isolation or will be comparative (e.g.,
comparing competitive products or different versions of a
product). If comparative, then any of the currently available
questionnaires should suffice, but if there is a need for inter-
pretation without comparison, then it would be wise to select a
questionnaire that has at least some type of norms available.
From its inception, the developers of the Software Usability
Measurement Inventory have maintained an extensive norma-
tive database (Kirakowski, 1996). One of the reasons that the
SUS has been gaining in popularity is the relatively recent
publication of normative data to aid in the interpretation of
SUS scores (Bangor et al., 2008; Kortum & Bangor, 2013;
Sauro & Lewis, 2012). There is also some normative data avail-
able for the PSSUQ/CSUQ (Lewis, 2002; Sauro & Lewis,
2012). Keep in mind, however, that variation in products and
tasks can weaken the generalizability of the norms (Cavallin,
Martin, & Heylighen, 2007). As for item wording, slight
deviations should not cause a serious problem, but extreme
deviations probably will. For any planned deviations, if possible,
check the data to ensure that nothing has obviously broken.

TABLE 1
Key Characteristics of Some Standardized Usability Questionnaires

Questionnaire | Requires License Fee | No. of Items | No. of Subscales | Global Reliability | Validity Notes | References
QUIS | Yes ($50–750) | 27 | 5 | 0.94 | Construct validity; evidence of sensitivity | Chin et al., 1988
SUMI | Yes (€0–1000) | 50 | 5 | 0.92 | Construct validity; evidence of sensitivity; availability of norms | Kirakowski, 1996
USE | Unknown | 30 | 4 | Not published | Not published | Lund, 1998, 2001
PSSUQ/CSUQ | No | 16 | 3 | 0.94 | Construct validity; concurrent validity; evidence of sensitivity; some normative information | Lewis, 1995, 2002
SUS | No | 10 | 2 | 0.92 | Construct validity; evidence of sensitivity; some normative information | Brooke, 1996, 2013; Sauro, 2011
UMUX | No | 4 | 2 | 0.81, 0.87, 0.97 | Construct validity; evidence of sensitivity | Finstad, 2010, 2013; Lewis, 2013
UMUX-LITE | No | 2 | 1 | 0.83 | Construct validity; evidence of sensitivity | Lewis et al., 2013

Note. QUIS = Questionnaire for User Interaction Satisfaction; SUMI = Software Usability Measurement Inventory; USE = Usefulness, Satisfaction, and Ease of Use; PSSUQ = Post-Study System Usability Questionnaire; CSUQ = Computer System Usability Questionnaire; SUS = System Usability Scale; UMUX = Usability Metric for User Experience.
5.4. Lessons Yet to Be Learned
There are still lessons to be learned in the domain of standardized
usability measurement—still work to do. For example, what
is the real factor structure of the SUS? What is the status of the
translation of standardized usability questionnaires into other
languages? How effective are the new, short questionnaires
designed to produce SUS-like measurements, the Usability
Metric for User Experience (UMUX) and UMUX-LITE? How
effective is the Emotional Metric Outcomes (EMO), a new
questionnaire designed to assess the emotional consequences of
interaction?
Factor structure of the SUS. The original intent was for
the SUS to be a unidimensional (one factor) measurement of
perceived usability (Brooke, 1996). Once researchers began to
publish data sets (or correlation matrices) from sample sizes
large enough to support factor analysis, it began to appear that
the items of the SUS might align with two factors. Data from
three independent studies (Borsci, Federici, & Lauriola, 2009;
Lewis & Sauro, 2009) indicated a consistent two-factor struc-
ture (with Items 4 and 10 aligning on a factor separate from
the remaining items). Analyses conducted since 2009 (Lewis,
Utesch, & Maher, 2013; Sauro & Lewis, 2011; and a number
of unpublished analyses) have typically resulted in a two-factor
structure but have not replicated the item-factor alignment that
seemed apparent in 2009. The more recent analyses have been
somewhat consistent with a general alignment of positive- and
negative-tone items on separate factors—the type of uninten-
tional structure that can occur with sets of mixed-tone items (see
the earlier section on all-positive vs. mixed-tone items). It would
be helpful for usability practitioners or researchers who have
fairly large-sample data sets of SUS questionnaires to publish
the results of factor analysis of their data, or at least to publish
the correlation matrix of the items so other researchers could
conduct factor analyses.
Translation into other languages. There is more to the
translation of a standardized usability questionnaire into another
language than simply translating the wording of the items. It is
also necessary to conduct psychometric analyses to ensure that
the translated questionnaire has the same (or similar) proper-
ties as the source questionnaire (van de Vijver & Leung, 2001).
There have been two recent translations, one of the CSUQ
into Turkish (Erdinç & Lewis, 2013) and one of the SUS into
Slovene (Blažica & Lewis, 2014), but there are many more
opportunities for usability researchers to extend the benefits
of standardized usability measurement to other languages and
cultures.
There may be times when practitioners will need to work
with translated questionnaires that have not yet undergone psy-
chometric revalidation. For example, if the study will have a
small sample size, then there will not be sufficient data to con-
duct factor analysis or other psychometric analyses. There is
little research on the extent to which doing this would pro-
duce results that are useful or results that are misleading, so
practitioners finding themselves in this situation should proceed
with caution. It is possible, over time (possibly years), to collect
enough cases to conduct revalidation analyses (Lewis, 2002).
With the advent of web survey tools and remote unmoderated
usability testing (Albert, Tullis, & Tedesco, 2010), this process
could occur much more rapidly than in the past.
The UMUX and UMUX-LITE. As widely used as the SUS
is, some practitioners have a need for a standardized question-
naire that has fewer than 10 items. This need is most pressing
when standardized usability measurement is one part of a larger
poststudy or online questionnaire (Lewis et al., 2013). Finstad
(2010, 2013) created the UMUX to address that need. The
UMUX is a relatively new standardized usability questionnaire
designed to get a measurement of perceived usability consistent
with the SUS, but using fewer items. After standard psychomet-
ric development, the final version of the UMUX had four items,
two with positive tone and two with negative. Using a recod-
ing scheme similar to the SUS, a UMUX score can range from
0 to 100. Finstad reported a UMUX reliability of 0.94 and an
extremely high correlation of 0.96 with concurrently collected
SUS scores.
Lewis et al. (2013) took the UMUX as a starting point for
an even shorter questionnaire, just using the two positive-tone
items of the UMUX. The UMUX-LITE items were “This sys-
tem’s capabilities meet my requirements” and “This system is
easy to use.” Data from two independent surveys demonstrated
adequate psychometric quality of the questionnaire. Estimates
of reliability were .82 and .83—excellent for a two-item
instrument. Concurrent validity was also high, with significant
correlation with the SUS (r = .81) and with likelihood-to-recommend
scores (r = .73). UMUX-LITE score means were slightly
lower than those for the SUS but were easily adjusted using
linear regression to match the SUS scores. Given its parsimony
(two items), reliability, validity, structural basis (usefulness and
usability), and, after applying the corrective regression formula,
its correspondence to SUS scores, the UMUX-LITE appears to
be a promising alternative to the SUS when it is not desirable to
use a 10-item instrument.
Because they are such new metrics, few practitioners have
used the UMUX or UMUX-LITE. For some period, it would be
useful for practitioners who use the SUS to also use the UMUX
(which contains the UMUX-LITE) and to report their observed
correspondences among the metrics to help the usability engi-
neering community develop an improved understanding of the
potential utility of these new metrics.
The EMO. Most standardized usability questionnaires have
focused on assessing satisfaction with usability or perceived
usability, and more from a cognitive than an emotional per-
spective (Agarwal & Meyer, 2009; Lottridge, Chignell, &
Jovicic, 2011). The growing trend toward user experience
(UX) design has created a need for a concise, psychometri-
cally qualified measurement of the emotional consequences of
interaction. There have been some previous efforts to develop
instruments with a more emotional focus (Benedek & Miner,
2002; Hassenzahl, 2001, 2004; Tullis & Albert, 2008, 2013),
but none that have directly addressed the consequences of
interaction.
Lewis and Mayes (this issue) developed and evaluated the
EMO questionnaire—a new questionnaire designed to assess
the emotional outcomes of interaction, especially the interac-
tion of customers with service-provider personnel or software.
The EMO is a concise multifactor standardized questionnaire
that provides an assessment of transaction-driven personal and
relationship emotional outcomes, both positive and negative.
Psychometric evaluation showed that the EMO and its compo-
nent scales had high reliability and concurrent validity with loy-
alty and overall experience metrics in a variety of measurement
contexts. Concurrent measurement with the SUS indicated that
a reported significant correlation of the SUS with likelihood-
to-recommend ratings (Sauro & Lewis, 2012) may be due to
emotional rather than utilitarian aspects of the SUS.
Like the UMUX and UMUX-LITE, the EMO is a new metric
with desirable psychometric properties. One of its current weak-
nesses is that there is no EMO data yet collected in the context
of a usability study. There is every reason to believe it should
be robust enough to have the same psychometric properties in
that new context of measurement, but it is important to verify
this. Usability practitioners and researchers who can include
the EMO in their battery of post-study questionnaires should do
so—especially those who desire a metric that has a strong rela-
tionship to loyalty metrics such as likelihood-to-recommend.
To as great an extent as possible, they should publish their
findings.
6. WHAT ABOUT THE MAGIC NUMBER 5 (OR 8, OR 10,
OR 30)?
In the context of usability testing, “magic” numbers refer to
rules of thumb for sample sizes. A common rule of thumb for
summative usability tests, based on a common convention in
applied statistics, is to have a sample size of at least 30. For
formative usability testing, the best known magic number is 5
(Nielsen, 2000; Nielsen & Landauer, 1993), although 8 (Perfetti
& Landesman, 2001; Spool & Schroeder, 2001) and 10 (Hwang
& Salvendy, 2010) have also appeared in the literature. Do these
magic numbers have any validity?
6.1. One Point of View
Summative usability testing. According to the central limit
theorem, as the sample size increases, the distribution of the
mean becomes more and more normal, regardless of the nor-
mality of the underlying distribution. Some simulation studies
have shown that for a wide variety of distributions (but not all;
see Bradley, 1978), the distribution of the mean becomes near
normal when n = 30. Another consideration is that it is slightly
simpler to use z scores rather than t scores because z scores do
not require the use of degrees of freedom. As shown in Figure 4,
by the time there are about 30 degrees of freedom the value of
t closely approaches the value of z. Consequently, there can be
a feeling that you don’t have to deal with small samples that
require special small-sample treatment (Cohen, 1990).
FIG. 4. t approaches z as n increases. [Figure not reproduced; the plot shows critical values of t at the .10, .05, and .01 levels across 1 to 30 degrees of freedom, converging toward the corresponding values of z.]
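For readers who want to see that convergence numerically rather than graphically, the short sketch below (which assumes the scipy library is available) prints the two-sided .05 critical value of t for several degrees of freedom next to the corresponding value of z (about 1.96).

```python
# Sketch of the convergence of t toward z (assumes scipy is installed).
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # two-sided .05 critical value of z, ~1.96
for df in (1, 5, 10, 15, 20, 25, 30):
    t_crit = stats.t.ppf(0.975, df)
    print(f"df = {df:2d}: t = {t_crit:6.3f}   (z = {z_crit:.3f})")
```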
Formative usability testing. Early descriptions of formative
usability testing (Al-Awar et al., 1981; Chapanis, 1981) did not
provide any guidance regarding the sample size per iteration
other than saying to run “a few” participants. Lewis (1982) proposed
using the cumulative binomial probability formula, P(at
least one occurrence) = 1 − (1 − p)^n, as an aid in sample size
estimation for these types of problem-discovery studies. In this
formula, n is the sample size and p is the likelihood of occurrence
of a usability problem (or whatever the investigator is
trying to discover via observation).
In the early 1990s there were a number of fairly large-sample
formative usability studies run for the purpose of exploring
the relationship between sample size and problem discovery
(Nielsen & Molich, 1990; Virzi, 1990,1992). Nielsen and
Landauer (1993) collected a number of the studies from the
literature and from the practitioner community and determined
that the average value for p was .31. They determined that if
you use this value for p, set n to 5, and then compute the cumulative
binomial probability that an event with p = .31 will occur
at least once out of five opportunities, that probability is about
85% (1 − (1 − .31)^5 = .8436). In other words, the first five
participants observed in a formative usability study should usually
reveal about 85% of the problems available for discovery in that
iteration, where the properties of the study (type of participants
and tasks employed) place limits on what is discoverable. But
over time, in the minds of many usability practitioners, the rule
became simplified to “All you need to do is watch five people
to find 85% of a product’s usability problems.”
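As a quick check on the arithmetic, the following sketch implements the cumulative binomial discovery formula and reproduces the roughly 85% figure for p = .31 and n = 5.

```python
def p_discovery(p, n):
    """Probability of observing a problem with occurrence probability p
    at least once across n participants: 1 - (1 - p)**n."""
    return 1 - (1 - p) ** n

print(round(p_discovery(0.31, 5), 4))  # 0.8436, i.e., about 85%
```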
6.2. Another Point of View
Summative usability testing. The idea that even with the
t-distribution (as opposed to the z-distribution) you need to
have a sample size of at least 30 is inconsistent with the
history of the development of the distribution. In 1899, William
S. Gosset joined the Guinness brewery. Surprisingly, brew-
ing has certain economic realities in common with moderated
usability testing. “The nature of the process of brewing, with
its variability in temperature and ingredients, means that it is
not possible to take large samples over a long run” (Cowles,
1989, pp. 108–109). For much of his research, Gossett per-
formed an early version of Monte Carlo simulations (Stigler,
1999) by preparing 3,000 cards labeled with physical measure-
ments taken on criminals, shuffling them, then dealing them
out into 750 groups of n = 4, a much smaller sample size
than 30.
When the cost of a sample is expensive, as it typically is in
many types of user research (e.g., moderated usability testing),
it is important to estimate the needed sample size as accu-
rately as possible, with the understanding that it is an estimate.
The likelihood that 30 is exactly the right sample for a given
set of circumstances is very low. It is more appropriate to use
the formulas for computing the significance levels of a statis-
tical test and, using algebra to solve for n, to convert them
to sample size estimation formulas (e.g., see Sauro & Lewis,
2012).
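As one illustration of that approach, the sketch below (assuming scipy is available, and using hypothetical values for the expected standard deviation and the desired margin of error) searches for the smallest n whose t-based confidence interval meets a specified precision target; this is equivalent to solving the confidence interval formula for n and iterating.

```python
# Sketch of t-based sample size estimation for a summative usability metric.
# The inputs are hypothetical: s is the expected standard deviation of the
# measure and d is the desired margin of error (confidence interval
# half-width) in the same units.
import math
from scipy import stats

def sample_size_for_margin(s, d, alpha=0.05, n_max=1000):
    """Smallest n whose two-sided (1 - alpha) confidence interval for a mean
    has a half-width no larger than d, given standard deviation s."""
    for n in range(2, n_max + 1):
        t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
        if t_crit * s / math.sqrt(n) <= d:
            return n
    raise ValueError("No n up to n_max meets the precision target.")

# Example: an expected standard deviation of 15 and a desired margin of
# error of 10 points at 95% confidence yields n = 12.
print(sample_size_for_margin(15, 10))  # 12
```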
Formative usability testing. Sometimes the Magic Number
5 is correct, but only under special circumstances (Borsci et al.,
2013; Lewis, 1994). If the design space under investigation
has many problems available for discovery whose probabili-
ties of occurrence are markedly different from 0.31, then there
is no guarantee that observing five participants will lead to
the discovery of 85% of the problems available for discovery.
For example, in their famous “Eight is Not Enough” paper,
Perfetti and Landesman (2001) found that testing five users
fell far short of achieving 85% problem discovery. A reanal-
ysis of their data indicated that their value of p was probably
about .03—one tenth of the value needed for the Magic
Number 5 rule of thumb to work (Lewis, 2006; Sauro & Lewis,
2012).
Rather than relying on magic numbers of any kind, it may be
more reasonable to use formulas or tables based on the cumula-
tive binomial probability formula when planning sample sizes
for one-shot or iterative formative usability testing. Using
algebra to solve for n, the sample size formula based on
1 − (1 − p)^n is n = ln(1 − discoveryGoal)/ln(1 − p). For example,
if the desired discovery goal is to discover at least 80% of the
problems that have a likelihood of occurrence of 0.15, where
discovery means to find them at least once, then n = 9.9, which
rounds up to 10. Alternatively, practitioners can use a table like
the one shown in Table 2 (built using the preceding sample size
formula) to get a sense of what problem discovery to expect as
a function of sample size.

TABLE 2
Likelihood of Discovery for Various Sample Sizes and Probabilities of Occurrence

p | n = 1 | n = 2 | n = 3 | n = 4 | n = 5 | n = 10 | n = 15 | n = 20
.01 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.10 | 0.14 | 0.18
.05 | 0.05 | 0.10 | 0.14 | 0.19 | 0.23 | 0.40 | 0.54 | 0.64
.10 | 0.10 | 0.19 | 0.27 | 0.34 | 0.41 | 0.65 | 0.79 | 0.88
.15 | 0.15 | 0.28 | 0.39 | 0.48 | 0.56 | 0.80 | 0.91 | 0.96
.25 | 0.25 | 0.44 | 0.58 | 0.68 | 0.76 | 0.94 | 0.99 | 1.00
.50 | 0.50 | 0.75 | 0.88 | 0.94 | 0.97 | 1.00 | 1.00 | 1.00
.90 | 0.90 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
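The following sketch implements the sample size formula given above and also generates discovery likelihoods like those in Table 2; the first line of output reproduces the n = 10 result for an 80% discovery goal and p = .15.

```python
import math

def n_for_discovery(goal, p):
    """Smallest n for which 1 - (1 - p)**n >= goal,
    i.e., ln(1 - goal) / ln(1 - p), rounded up."""
    return math.ceil(math.log(1 - goal) / math.log(1 - p))

print(n_for_discovery(0.80, 0.15))  # 10, matching the example above

# Likelihood of discovering a problem at least once, by p and n (cf. Table 2).
for p in (0.01, 0.05, 0.10, 0.15, 0.25, 0.50, 0.90):
    row = "  ".join(f"{1 - (1 - p) ** n:.2f}" for n in (1, 2, 3, 4, 5, 10, 15, 20))
    print(f"p = {p:.2f}: {row}")
```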
For example, suppose circumstances limit a practitioner to
running a single-shot study with five participants. As shown
in Table 2, the expectation is that the study will expose (at
least once) almost all of the problems that have a probability
of occurrence of 0.5 or greater. Then, in descending order, the
expectation is discovery of about 76% of problems where p = .25,
56% of problems where p = .15, 41% of problems where p = .1,
23% of problems where p = .05, and 5% of problems where
p = .01. In other words, a study with n = 5 is likely to
leave many problems undiscovered if their probability of occur-
rence is less than .5, but, more optimistically, the study will
probably uncover enough problems to give the developers the
information needed to improve the product’s usability. This is a
more nuanced approach than a simple magic number, but also
more likely to lead to more realistic expectations on the part
of practitioners and stakeholders. Note that the incompleteness
of discovery for lower frequency problems may contribute to
the discrepancies observed in the usability problem lists gener-
ated in the CUE studies (see the Is Formative Usability Testing
Reliable? section). When you roll dice, you don’t expect the
same number to come up every time.
6.3. Lessons Learned
There are rational and empirical bases underlying the magic
number rules of thumb for sample size requirements, but they
are optimal only under very specific conditions. Rather than
referring to magic numbers, it would be better practice to use
the tools that are available to guide sample size estimation, both
for summative and for formative usability testing.
6.4. Lessons Yet to Be Learned
Summative usability testing. Given its close connection
to traditional experimentation and use of inferential statistics,
there are probably relatively few lessons yet to be learned for
statistical analyses associated with summative usability test-
ing. It is, however, important for at least a subset of usability
researchers to keep abreast of developments in applied statis-
tics and to publish their findings. For example, there have been
recent articles on new developments for computing binomial
confidence intervals (Agresti & Coull, 1998; Sauro & Lewis,
2005) and a better formula for chi-square tests (Campbell, 2007)
when sample sizes are small (for details, see Sauro & Lewis,
2012). Continuing concerns about the effects of violating the
assumptions of parametric statistical methods (such as the t
test and analysis of variance) may be addressed in the future
by bootstrapping (or similar resampling) procedures (Chernick,
2008), but we have yet to see a set of easy-to-use bootstrap-
ping analysis tools, especially with a focus on the concerns of
user researchers. We also have yet to see systematic analyses
of the potential benefits and liabilities of bootstrapping methods
in user research, especially given the documented robustness of
parametric statistical procedures (Sauro & Lewis, 2012).
Formative usability testing. The methods for estimating
sample size requirements for formative usability testing are
much younger than those for summative usability testing.
Consequently, there is a richer set of opportunities to learn more
lessons. Of particular interest are questions about the mean-
ing of pand potential problems associated with basing these
methods on the binomial probability formula.
As previously discussed, the value of pis a critical com-
ponent in the cumulative binomial probability formula. There
are outstanding questions, however, about what, if anything, it
means. For example, if the value of pis small, as in the “Eight
is Not Enough” study, then that indicates that there was very
little overlap in the problems experienced across participants.
Was that due to the product being mature and free of errors
likely to affect a large proportion of users? If so, then the small
value of pwould be indicative of a generally positive situation.
If the value of pis large, then there is considerable overlap in
the problems experienced across participants. One would gen-
erally expect this outcome for new products that have not yet
undergone usability testing, and would also be a generally pos-
itive outcome. But might there also be situations in which the
value of pwould indicate a negative outcome—perhaps a low
value when you’d expect a high one, or vice versa? There is
opportunity for new research on this topic.
A number of publications have criticized the use of the
binomial probability formula as the basis for estimating sam-
ple sizes for formative usability studies. A key assumption of
the binomial model is that the value of pis constant from
trial to trial (Ennis & Bi, 1998). It seems likely that this
assumption does not strictly hold in user research due to dif-
ferences in users’ capabilities and experiences (Caulton, 2001;
Schmettow, 2008; Woolrych & Cockton, 2001). The extent to
which this should affect the use of the binomial formula in
modeling problem discovery is an ongoing topic of research
(Briand, El Emam, Freimut, & Laitenberger, 2000; Kanis, 2011;
Lewis, 2001; Schmettow, 2008,2009). Some researchers have
investigated alternative means of discovery modeling that do
not make the assumption of the homogeneity of p, including
the beta-binomial (Schmettow, 2008), logit-normal binomial
(Schmettow, 2009,2012), bootstrapping (Borsci, Londei, &
Federici, 2011), and capture–recapture models (Briand et al.,
2000). These more complex alternative models may turn out
to be advantageous over the simple binomial model, but they
may also have disadvantages, particularly in the sample sizes
required to accurately estimate their parameters. Only future
research will tell.
Note that the sample size estimation procedures provided
earlier in this article for formative usability testing are not
affected by the assumption of homogeneity because they take
as given (not as estimated) one or more specific values of p.
There is still work to do to compare the overall effectiveness of
the approach presented earlier with those driven by other types
of discovery models.
7. CONCLUSION
This article has presented information about five aspects of
usability that have generated published research and discussion
in the usability science and usability engineering communities.
What is usability? Is formative usability testing reliable? Is it
OK to average ratings from multipoint scales? How robust are
standardized usability questionnaires? How useful are “magic
numbers” for planning sample sizes for usability testing? For
each topic, there is coverage of its background, discussion of
the controversies, summarization of the lessons learned, and
description of lessons yet to be learned. Some of the key lessons
learned are as follows:
When discussing usability, it is important to distin-
guish between the goals and practices of summative
and formative usability.
There is compelling rational and empirical support for
the practice of iterative formative usability testing—it
appears to be effective in improving both objective and
perceived usability.
It is permissible to average multipoint scale ratings,
but there are restrictions on the interpretations of the
results.
When conducting usability studies, practitioners
should include one or more of the currently available
standardized usability questionnaires.
Because “magic number” rules of thumb for sample
size requirements for usability tests are optimal only
under very specific conditions, practitioners should
use the tools that are available to guide sample size
estimation rather than relying on magic numbers.
The usability practitioner community owes a substantial debt
to those who have made the effort to share their research,
both through peer-reviewed publication and through books. The
contribution of those who have undergone the peer review pro-
cess is evident in the references cited throughout this article.
For recent book-level treatments of usability and UX issues,
see Albert et al. (2010); Barnum (2010); Lazar, Feng, and
Hochheiser (2010); MacKenzie (2014); Sauro (2010a); Sauro
& Lewis (2012); Shneiderman & Plaisant (2010); and Tullis and
Albert (2013).
I want to end with a call to action. Specifically, I encourage
practitioners as well as researchers to look for opportunities in
their day-to-day work to study and compare different methods
and, most important, to publish the findings. In the long run, this
is how we will move questions from lessons yet to be learned to
lessons learned. Looking back over the past three decades, this
may be the most important lesson learned.
ACKNOWLEDGEMENTS
I express many thanks to Dr. Gavriel Salvendy for his sup-
port throughout my career, giving me significant opportunities
to develop my skills as an author, reviewer, editor, and speaker.
Thanks to the reviewers who responded quickly to requests to
review this paper, and whose comments were insightful and
invaluable. I also thank Pete Kennedy and his wife, Audrey, who
respectively made sure that in my first years at IBM I had plenty
to learn and plenty to eat, and a continuing friendship. Thanks
also to all my coworkers and clients for the fascinating work
of building usable systems and the accompanying intellectual
challenges.
REFERENCES
Abelson, R. P. (1995). Statistics as principled argument. Hillsdale, NJ: Erlbaum.
Agarwal, A., & Meyer, A. (2009). Beyond usability: Evaluating emotional
response as an integral part of the user experience. In Proceedings of CHI
2009 Extended Abstracts on Human Factors in Computing Systems (pp.
2919–2930). Boston, MA: Association for Computing Machinery.
Agresti, A., & Coull, B. (1998). Approximate is better than ‘exact’ for interval
estimation of binomial proportions. The American Statistician,52, 119–126.
Al-Awar, J., Chapanis, A., & Ford, R. (1981). Tutorials for the first-time
computer user. IEEE Transactions on Professional Communication,24,
30–37.
Albert, B., Tullis, T., & Tedesco, D. (2010). Beyond the usability lab.
Burlington, MA: Morgan Kaufmann.
Alonso-Ríos, D., Vázquez-Garcia, A., Mosqueira-Rey, E., & Moret-Bonillo, V.
(2010). Usability: A critical analysis and a taxonomy. International Journal
of Human-Computer Interaction,26, 53–74.
American National Standards Institute. (2001). Common industry format for
usability test reports (ANSI-NCITS 354-2001). Washington, DC: Author.
Baecker, R. M. (2008). Themes in the early history of HCI—Some unanswered
questions. Interactions,15(2), 22–27.
Bailey, G. (1993). Iterative methodology and designer training in human–
computer interface design. In INTERCHI ‘93 Conference Proceedings (pp.
198–205). New York, NY: Association for Computing Machinery.
Bailey, R. W., Allan, R. W., & Raiello, P. (1992). Usability testing vs. heuris-
tic evaluation: A head to head comparison. In Proceedings of the Human
Factors and Ergonomics Society 36th Annual Meeting (pp. 409–413). Santa
Monica, CA: Human Factors and Ergonomics Society.
Bangor, A., Kortum, P. T., & Miller, J. T. (2008). An empirical evaluation
of the System Usability Scale. International Journal of Human–Computer
Interaction,24, 574–594.
Barnette, J. J. (2000). Effects of stem and Likert response option reversals on
survey internal consistency: If you feel the need, there is a better alterna-
tive to using those negatively worded stems. Educational and Psychological
Measurement,60, 361–370.
Barnum, C. M. (2010). Usability testing essentials: Ready, set ... test!
Burlington, MA: Morgan Kaufmann.
Benedek, J., & Miner, T. (2002). Measuring desirability: New methods for
evaluating desirability in a usability lab setting. In Proceedings of the
Usability Professionals’ Association. Orlando, FL: Usability Professionals
Association. Available at: http://www.microsoft.com/usability/uepostings/
desirabilitytoolkit.doc (Accessed June 10, 2014).
Bennett, J. L. (1979). The commercial impact of usability in interactive sys-
tems. Infotech State of the Art Report: Man/Computer Communications,2,
289–297.
Berry, D. C., & Broadbent, D. E. (1990). The role of instruction and verbal-
ization in improving performance on complex search tasks. Behaviour &
Information Technology,9, 175–190.
Bevan, N. (2009). Extending quality in use to provide a framework for usability
measurement. In M. Kurosu (Ed.), Human centered design, HCII 2009 (pp.
13–22). Heidelberg, Germany: Springer-Verlag.
Bevan, N., Kirakowski, J., & Maissel, J. (1991). What is usability? In H.
J. Bullinger (Ed.), Human Aspects in Computing, Design and Use of
Interactive Systems and Work with Terminals, Proceedings of the 4th
International Conference on Human–Computer Interaction (pp. 651–655).
Stuttgart, Germany: Elsevier Science.
Bias, R. G., & Mayhew, D. J. (1994). Cost-justifying usability. Boston, MA:
Academic.
Blažica, B., & Lewis, J. R. (2014). A Slovene translation of the System Usability
Scale: The SUS-SI. International Journal of Human–Computer Interaction.
In Press.
Boren, T., & Ramey, J. (2000). Thinking aloud: Reconciling theory and practice.
IEEE Transactions on Professional Communications,43, 261–278.
Borsci, S., Federici, S., & Lauriola, M. (2009). On the dimensionality of the
system usability scale: A test of alternative measurement models. Cognitive
Processes,10, 193–197.
Borsci, S., Londei, A., & Federici, S. (2011). The Bootstrap Discovery
Behaviour (BDB): A new outlook on usability evaluation. Cognitive
Processes,12, 23–31.
Borsci, S., Macredie, R. D., Barnett, J., Martin, J., Kuljis, J., & Young, T.
(2013). Reviewing and extending the five-user assumption: A grounded pro-
cedure for interaction evaluation. ACM Transactions on Computer-Human
Interaction,20, 29:01–29:23.
Bowers, V., & Snyder, H. (1990). Concurrent versus retrospective verbal proto-
cols for comparing window usability. In Proceedings of the Human Factors
Society 34th Annual Meeting (pp. 1270–1274). Santa Monica, CA: Human
Factors Society.
Bradley, J. V. (1976). Probability; decision; statistics. Englewood Cliffs, NJ:
Prentice-Hall.
Bradley, J. V. (1978). Robustness? British Journal of Mathematical and
Statistical Psychology,31, 144–152.
Briand, L. C., El Emam, K., Freimut, B. G., & Laitenberger, O. (2000). A com-
prehensive evaluation of capture-recapture models for estimating software
defect content. IEEE Transactions on Software Engineering,26, 518–540.
Brooke, J. (1996). SUS—A “quick and dirty” usability scale. In P. W. Jordan
(Ed.), Usability evaluation in industry (pp. 189–194). London, UK: Taylor
& Francis.
Brooke, J. (2013). SUS: A retrospective. Journal of Usability Studies,8(2),
29–40.
Campbell, I. (2007). Chi-squared and Fisher-Irwin tests of two-by-two tables
with small sample recommendations. Statistics in Medicine,26, 3661–3675.
Capra, M. G. (2007). Comparing usability problem identification and descrip-
tion by practitioners and students. In Proceedings of the Human Factors and
Ergonomics Society 51st Annual Meeting (pp. 474–478). Santa Monica, CA:
Human Factors and Ergonomics Society.
Card, S. K., Moran, T. P., & Newell, A. (1983). The psychology of human-
computer interaction. London, UK: Erlbaum.
Caulton, D. A. (2001). Relaxing the homogeneity assumption in usability
testing. Behaviour & Information Technology,20, 1–7.
Cavallin, H., Martin, W. M., & Heylighen, A. (2007). How relative absolute can
be: SUMI and the impact of the nature of the task in measuring perceived
software usability. Artificial Intelligence and Society,22, 227–235.
Chapanis, A. (1981). Evaluating ease of use. Unpublished manuscript prepared
for IBM, Boca Raton, FL. (Available from J. R. Lewis, ADDRESS).
Chernick, M. R. (2008). Bootstrap methods: A guide for practitioners and
researchers. Hoboken, NJ: Wiley.
Chin, J. P., Diehl, V. A., & Norman, K. L. (1988). Development of an instru-
ment measuring user satisfaction of the human–computer interface. In
Proceedings of CHI 1988 (pp. 213–218). Washington, DC: Association for
Computing Machinery.
Clemmensen, T., Hertzum, M., Hornbæk, K., Shi, Q., & Yammiyavar, P. (2009).
Cultural cognition in usability evaluation. Interacting with Computers,21,
212–220.
Cohen, J. (1990). Things I have learned (so far). American Psychologist,45,
1304–1312.
Cowles, M. (1989). Statistics in psychology: An historical perspective.
Hillsdale, NJ: Erlbaum.
Davis, F. D. (1989). Perceived usefulness, perceived ease of use, and user
acceptance of information technology. MIS Quarterly,13, 319–339.
Davis, F. D., & Venkatesh, V. (1996). A critical assessment of potential mea-
surement biases in the Technology Acceptance Model: Three experiments.
International Journal of Human-Computer Studies,45, 19–45.
Dumas, J. S. (2003). User-based evaluations. In J. A. Jacko & A. Sears (Eds.),
The human–computer interaction handbook (pp. 1093–1117). Mahwah, NJ:
Erlbaum.
Dumas, J. S. (2007). The great leap forward: The birth of the usability
profession (1988–1993). Journal of Usability Studies,2, 54–60.
Dumas, J., & Redish, J. C. (1999). A practical guide to usability testing.
Portland, OR: Intellect.
Ennis, D. M., & Bi, J. (1998). The beta-binomial model: Accounting for
inter-trial variation in replicated difference and preference tests. Journal of
Sensory Studies,13, 389–412.
Erdinç, O., & Lewis, J. R. (2013). Psychometric evaluation of the T-CSUQ:
The Turkish version of the Computer System Usability Questionnaire.
International Journal of Human-Computer Interaction,29, 319–326.
Ericsson, K. A., & Simon, H. A. (1980). Verbal reports as data. Psychological
Review,87, 215–251.
Evers, V., & Day, D. (1997). The role of culture in interface acceptance. In
Proceedings of Interact 1997 (pp. 260–267). Sydney, Australia: Chapman
and Hall.
Finstad, K. (2006). The System Usability Scale and non-native English speak-
ers. Journal of Usability Studies,1, 185–188.
Finstad, K. (2010). The usability metric for user experience. Interacting with
Computers,22, 323–327.
Finstad, K. (2013). Response to commentaries on “The Usability Metric for
User Experience”. Interacting with Computers,25, 327–330.
Frandsen-Thorlacius, O., Hornbæk, K., Hertzum, M., & Clemmensen, T.
(2009). Non-universal usability? A survey of how usability is understood by
Chinese and Danish users. In Proceedings of CHI 2009 (pp. 41–50). Boston,
MA: Association for Computing Machinery.
Gaito, J. (1980). Measurement scales and statistics: Resurgence of an old
misconception. Psychological Bulletin,87, 564–567.
Gould, J. D. (1988). How to design usable systems. In M. Helander (Ed.),
Handbook of human–computer interaction (pp. 757–789). Amsterdam, the
Netherlands: North-Holland.
Gould, J. D., & Boies, S. J. (1983). Human factors challenges in creating a
principal support office system: The Speech Filing System approach. ACM
Transactions on Information Systems,1, 273–298.
Gould, J. D., Boies, S. J., Levy, S., Richards, J. T., & Schoonard, J. (1987). The
1984 Olympic Message System: A test of behavioral principles of system
design. Communications of the ACM,30, 758–769.
Gould, J. D., & Lewis, C. (1984). Designing for usability: Key principles and
what designers think (IBM Tech. Report No. RC-10317). Yorktown Heights,
NY: International Business Machines Corporation.
Gray, W. D., & Salzman, M. C. (1998). Damaged merchandise? A review of
experiments that compare usability evaluation methods. Human–Computer
Interaction,13, 203–261.
Grier, R. A., Bangor, A., Kortum, P., & Peres, S. C. (2013). The System
Usability Scale: Beyond standard usability testing. In Proceedings of the
Human Factors and Ergonomics Society (pp. 187–191). Santa Monica, CA:
Human Factors and Ergonomics Society.
Grove, J. W. (1989). In defence of science: Science, technology, and politics in
modern society. Toronto, Canada: University of Toronto Press.
Harris, R. J. (1985). A primer of multivariate statistics. Orlando, FL: Academic
Press.
Hassenzahl, M. (2000). Prioritizing usability problems: Data driven and judg-
ment driven severity estimates. Behaviour & Information Technology,19,
29–42.
Hassenzahl, M. (2001). The effect of perceived hedonic quality on product
appealingness. International Journal of Human-Computer Interaction,13,
481–499.
Hassenzahl, M. (2004). The interplay of beauty, goodness, and usabil-
ity in interactive products. Human-Computer Interaction,19,
319–349.
Hertzum, M. (2006). Problem prioritization in usability evaluation: From
severity assessments to impact on design. International Journal of Human-
Computer Interaction,21, 125–146.
Hertzum, M. (2010). Images of usability. International Journal of Human-
Computer Interaction,26, 567–600.
Hertzum, M., Clemmensen, T., Hornbæk, K., Kumar, J., Shi, Q., &
Yammiyavar, P. (2007). Usability constructs: A cross-cultural study of
how users and developers experience their use of information systems.
In Proceedings of HCI International 2007 (pp. 317–326). Beijing, China:
Springer-Verlag.
Hertzum, M., Hansen, K. D., & Andersen, H. H. K. (2009). Scrutinising usabil-
ity evaluation: Does thinking aloud affect behaviour and mental workload?
Behaviour & Information Technology,28, 165–181.
Høegh, R. T., & Jensen, J. J. (2008). A case study of three software projects:
Can software developers anticipate the usability problems in their software?
Behaviour & Information Technology,27, 307–312.
Høegh, R. T., Nielsen, C. M., Overgaard, M., Pedersen, M. B., & Stage, J.
(2006). The impact of usability reports and user test observations on devel-
opers’ understanding of usability data: An exploratory study. International
Journal of Human-Computer Interaction,21, 173–196.
Hornbæk, K. (2006). Current practice in measuring usability: Challenges to
usability studies and research. International Journal of Human-Computer
Studies,64, 79–102.
Hornbæk, K. (2010). Dogmas in the assessment of usability evaluation methods.
Behaviour & Information Technology,29, 97–111.
Hornbæk, K., & Law, E. L. (2007). Meta-analysis of correlations among usabil-
ity measures. In Proceedings of CHI 2007 (pp. 617–626). San Jose, CA:
Association for Computing Machinery.
Howard, T. W. (2008). Unexpected complexity in a traditional usability study.
Journal of Usability Studies,3, 189–205.
Howard, T., & Howard, W. (2009). Unexpected complexity in user testing of
information products. In Proceedings of the Professional Communication
Conference (pp. 1–5). Waikiki, HI: Institute of Electrical and Electronics
Engineers.
Hwang, W., & Salvendy, G. (2010). Number of people required for usabil-
ity evaluation: The 10±2 rule. Communications of the ACM,53,
130–133.
International Organization for Standardization. (1998). Ergonomic require-
ments for office work with visual display terminals (VDTs), Part 11,
Guidance on usability (ISO 9241-11: 1998(E)). Geneva, Switzerland:
Author.
Kanis, H. (2011). Estimating the number of usability problems. Applied
Ergonomics,42, 337–347.
Karat, C. (1997). Cost-justifying usability engineering in the software life cycle.
In M. Helander, T. K. Landauer, & P. Prabhu (Eds.), Handbook of human–
computer interaction (2nd ed., pp. 767–778). Amsterdam, the Netherlands:
Elsevier.
Kelley, J. F. (1984). An iterative design methodology for user-friendly natural
language office information applications. ACM Transactions on Information
Systems,2, 26–41.
Kennedy, P. J. (1982). Development and testing of the operator training package
for a small computer system. In Proceedings of the Human Factors Society
26th Annual Meeting (pp. 715–717). Santa Monica, CA: Human Factors
Society.
Kessner, M., Wood, J., Dillon, R. F., & West, R. L. (2001). On the reliability
of usability testing. In J. Jacko & A. Sears (Eds.), Conference on Human
Factors in Computing Systems: CHI 2001 Extended Abstracts (pp. 97–98).
Seattle, WA: Association for Computing Machinery.
Kirakowski, J. (1996). The Software Usability Measurement Inventory:
Background and usage. In P. Jordan, B. Thomas, & B. Weerdmeester (Eds.),
Usability evaluation in industry (pp. 169–178). London, UK: Taylor &
Francis.
Kirakowski, J., & Corbett, M. (1993). SUMI: The Software Usability
Measurement Inventory. British Journal of Educational Technology,24,
210–212.
Kirakowski, J., & Dillon, A. (1988). The Computer User Satisfaction Inventory
(CUSI): Manual and scoring key. Cork, Ireland: Human Factors Research
Group, University College of Cork.
Kortum, P. T., & Bangor, A. (2013). Usability ratings for everyday products
measured with the System Usability Scale. International Journal of Human-
Computer Interaction,29, 67–76.
Krahmer, E., & Ummelen, N. (2004). Thinking about thinking aloud: A com-
parison of two verbal protocols for usability testing. IEEE Transactions on
Professional Communication,47, 105–117.
LaLomia, M. J., & Sidowski, J. B. (1990). Measurements of computer satis-
faction, literacy, and aptitudes: A review. International Journal of Human–
Computer Interaction,2, 231–253.
Landauer, T. K. (1997). Behavioral research methods in human–computer inter-
action. In M. Helander, K. T. Landauer, & P. Prabhu (Eds.), Handbook
of human–computer interaction (2nd ed., pp. 203–227). Amsterdam, the
Netherlands: Elsevier.
Larson, R. C. (2008). Service science: At the intersection of management,
social, and engineering sciences. IBM Systems Journal,47, 41–51.
Lazar, J., Feng, J. H., & Hochheiser, H. (2010). Research methods in human-
computer interaction. Chichester, UK: Wiley.
Lewis, J. R. (1982). Testing small system customer set-up. In Proceedings of the
Human Factors Society 26th Annual Meeting (pp. 718–720). Santa Monica,
CA: Human Factors Society.
Lewis, J. R. (1990a). Psychometric evaluation of a post-study system usabil-
ity questionnaire: The PSSUQ (Tech. Rep. No. 54.535). Boca Raton, FL:
International Business Machines Corp.
Lewis, J. R. (1990b). Psychometric evaluation of an after-scenario question-
naire for computer usability studies: The ASQ (Tech. Rep. No. 54.541).
Boca Raton, FL: International Business Machines Corp.
Lewis, J. R. (1992). Psychometric evaluation of the Post-Study System
Usability Questionnaire: The PSSUQ. In Proceedings of the Human Factors
Society 36th Annual Meeting (pp. 1259–1263). Santa Monica, CA: Human
Factors Society.
Lewis, J. R. (1993). Multipoint scales: Mean and median differences and
observed significance levels. International Journal of Human-Computer
Interaction,5, 383–392.
Lewis, J. R. (1994). Sample sizes for usability studies: Additional considera-
tions. Human Factors,36, 368–378.
Lewis, J. R. (1995). IBM computer usability satisfaction questionnaires:
Psychometric evaluation and instructions for use. International Journal of
Human-Computer Interaction,7, 57–78.
Lewis, J. R. (1996). Reaping the benefits of modern usability evaluation: The
Simon story. In G. Salvendy & A. Ozok (Eds.), Advances in Applied
Ergonomics: Proceedings of the 1st International Conference on Applied
Ergonomics—ICAE ‘96 (pp. 752–757). Istanbul, Turkey: USA Publishing.
Lewis, J. R. (1999). Tradeoffs in the design of the IBM computer usability
satisfaction questionnaires. In Proceedings of HCI International 1999 (pp.
1023–1027). Mahwah, NJ: Erlbaum.
Lewis, J. R. (2001). Evaluation of procedures for adjusting problem-discovery
rates estimated from small samples. International Journal of Human–
Computer Interaction,13, 445–479.
Lewis, J. R. (2002). Psychometric evaluation of the PSSUQ using data from
five years of usability studies. International Journal of Human–Computer
Interaction,14, 463–488.
Lewis, J. R. (2006). Sample sizes for usability tests: Mostly math, not magic.
Interactions,13(6),29–33. (See corrected formula in Interactions, 14(1), 4)
Lewis, J. R. (2011a). Human factors engineering. In P. A. LaPlante (Ed.),
Encyclopedia of software engineering (pp. 383–394). New York, NY: Taylor
& Francis.
Lewis, J. R. (2011b). Practical speech user interface design. Boca Raton, FL:
Taylor & Francis.
Lewis, J. R. (2012). Usability testing. In G. Salvendy (Ed.), Handbook of human
factors and ergonomics (4th ed., pp. 1267–1312). New York, NY: Wiley.
Lewis, J. R. (2013). Critical review of “The Usability Metric for User
Experience”. Interacting with Computers,25, 320–324.
Lewis, J. R., Henry, S. C., & Mack, R. L. (1990). Integrated office software
benchmarks: A case study. In Proceedings of the 3rd IFIP Conference on
Human-Computer Interaction, INTERACT ‘90 (pp. 337–343). Cambridge,
UK: Elsevier Science.
Lewis, J. R., & Sauro, J. (2009). The factor structure of the system usability
scale. In M. Kurosu (Ed.), Human centered design (pp. 94–103). Heidelberg,
Germany: Springer-Verlag.
Lewis, J. R., Utesch, B. S., & Maher, D. E. (2013). UMUX-LITE—When
there’s no time for the SUS. In Proceedings of CHI 2013 (pp. 2099–2102).
Paris, France: Association for Computing Machinery.
Lilienfeld, S. O., Wood, J. M., & Garb, H. N. (2000). The scientific status
of projective techniques. Psychological Science in the Public Interest,1,
27–66.
Lindgaard, G., & Kirakowski, J. (2013). Introduction to the special issue:
The tricky landscape of developing rating scales in HCI. Interacting with
Computers,25, 271–277.
Lord, F. M. (1953). On the statistical treatment of football numbers. American
Psychologist,8, 750–751.
Lord, F. M. (1954). Further comment on “football numbers.” American
Psychologist,9, 264–265.
Lottridge, D., Chignell, M., & Jovicic, A. (2011). Affective design:
Understanding, evaluating, and designing for human emotion. Reviews of
Human Factors and Ergonomics,7, 197–237.
Lund, A. (1998). USE Questionnaire Resource Page. Retrieved from http://
usesurvey.com.
Lund, A. (2001). Measuring usability with the USE questionnaire. Usability and
User Experience Newsletter of the STC Usability SIG,8(2), 1–4.
Lusch, R. F., Vargo, S. L., & O’Brien, M. (2007). Competing through service:
Insights from service-dominant logic. Journal of Retailing,83, 5–18.
Lusch, R. F., Vargo, S. L., & Wessels, G. (2008). Toward a conceptual foun-
dation for service science: Contributions from service-dominant logic. IBM
Systems Journal,47, 5–14.
McDonald, S., Edwards, H. M., & Zhao, T. (2012). Exploring think-
alouds in usability testing: An international survey. IEEE Transactions on
Professional Communication,55, 2–19.
McDonald, S., McGarry, K., & Willis, L. M. (2013). Thinking-aloud about
web navigation: The relationship between think-aloud instructions, task
difficulty and performance. In Proceedings of the Human Factors and
Ergonomics Society Annual Meeting (pp. 2037–2041). Santa Monica, CA:
Human Factors and Ergonomics Society.
MacKenzie, I. S. (2014). Human-computer interaction: An empirical research
perspective. Waltham, MA: Morgan Kaufmann.
Marcus, A. (2007). Global/intercultural user-interface design. In J. Jacko &
A. Sears (Eds.), Handbook of human-computer interaction (3rd ed., pp.
355–380). New York, NY: Erlbaum.
Marshall, C., Brendan, M., & Prail, A. (1990). Usability of Product X: Lessons
from a real product. Behaviour & Information Technology,9, 243–253.
McSweeney, R. (1992). SUMI: A psychometric approach to software evaluation
(Unpublished master’s thesis). Cork, Ireland: University College of Cork.
Michell, J. (1986). Measurement scales and statistics: A clash of paradigms.
Psychological Bulletin,100, 398–407.
Molich, R., Bevan, N., Curson, I., Butler, S., Kindlund, E., Miller, D.,
& Kirakowski, J. (1998). Comparative evaluation of usability tests. In
Usability Professionals Association Annual Conference Proceedings (pp.
189–200). Washington, DC: Usability Professionals Association.
Molich, R., & Dumas, J. S. (2008). Comparative usability evaluation (CUE-4).
Behaviour & Information Technology,27, 263–281.
Molich, R., Ede, M. R., Kaasgaard, K., & Karyukin, B. (2004). Comparative
usability evaluation. Behaviour & Information Technology,23, 65–74.
Molich, R., Jeffries, R., & Dumas, J. S. (2007). Making usability recommenda-
tions useful and usable. Journal of Usability Studies,2, 162–179.
Molich, R., Kirakowski, J., Sauro, J., & Tullis, T. (2009). Comparative usabil-
ity task measurement workshop (CUE-8). Workshop conducted at the UPA
2009 Conference in Portland, OR.
Nielsen, J. (2000). Why you only need to test with 5 users. Alertbox. Retrieved
from http://www.useit.com/alertbox/20000319.html
Nielsen, J., & Landauer, T. K. (1993). A mathematical model of the find-
ing of usability problems. In Proceedings of INTERCHI’93 (pp. 206–213).
Amsterdam, the Netherlands: Association for Computing Machinery.
Nielsen, J., & Mack, R. L. (1994). Usability inspection methods. New York,
NY: Wiley.
Nielsen, J., & Molich, R. (1990). Heuristic evaluation of user interfaces. In
Proceedings of CHI ’90 (pp. 249–256). New York, NY: Association for
Computing Machinery.
Nørgaard, M., & Hornbæk, K. (2009). Exploring the value of usability feed-
back formats. International Journal of Human-Computer Interaction,25,
49–74.
Nunnally, J. C. (1978). Psychometric theory. New York, NY: McGraw-Hill.
Ohnemus, K. R., & Biers, D. W. (1993). Retrospective versus thinking aloud
in usability testing. In Proceedings of the Human Factors and Ergonomics
Society 37th Annual Meeting (pp. 1127–1131). Seattle, WA: Human Factors
and Ergonomics Society.
Olmsted-Hawala, E. L., Murphy, E., Hawala, S., & Ashenfelter, K. T. (2010).
Think-aloud protocols: A comparison of three think-aloud protocols for use
in testing data-dissemination web sites for usability. In Proceedings of CHI
2010 (pp. 2381–2390). Atlanta, GA: Association for Computing Machinery.
Perfetti, C., & Landesman, L. (2001). Eight is not enough. Retrieved from http://
www.uie.com/articles/eight_is_not_enough/
Pilotte, W. J., & Gable, R. K. (1990). The impact of positive and negative
item stems on the validity of a computer anxiety scale. Educational and
Psychological Measurement,50, 603–610.
Pitkänen, O., Virtanen, P., & Kemppinen, J. (2008). Legal research topics in
user-centric services. IBM Systems Journal,47, 143–152.
Redish, J. (2007). Expanding usability testing to evaluate complex systems.
Journal of Usability Studies,2, 102–111.
Rubin, J. (1994). Handbook of usability testing: How to plan, design, and
conduct effective tests. New York, NY: Wiley.
Rubin, J., & Chisnell, D. (2008). Handbook of usability testing: How to plan,
design, and conduct effective tests, 2nd ed. New York, NY: Wiley.
Sabra, A. I. (2003). Ibn al-Haytham. Harvard Magazine,106, 54–55.
Sauro, J. (2010a). A practical guide to measuring usability. Denver, CO: Create
Space.
Sauro, J. (2010b). That’s the worst website ever! Effects of extreme survey items.
Retrieved from http://www.measuringusability.com/blog/extreme-items.php.
Sauro, J. (2011). A practical guide to the System Usability Scale (SUS):
Background, benchmarks & best practices. Denver, CO: Measuring
Usability.
Sauro, J., & Lewis, J. R. (2005). Estimating completion rates from small
samples using binomial confidence intervals: Comparisons and recommen-
dations. In Proceedings of the Human Factors and Ergonomics Society 49th
Annual Meeting (pp. 2100–2104). Santa Monica, CA: Human Factors and
Ergonomics Society.
Sauro, J., & Lewis, J. R. (2009). Correlations among prototypical usability met-
rics: Evidence for the construct of usability. In Proceedings of CHI 2009
(pp. 1609–1618). Boston, MA: Association for Computing Machinery.
Sauro, J., & Lewis, J. R. (2011). When designing usability questionnaires,
does it hurt to be positive? In Proceedings of CHI 2011 (pp. 2215–2223).
Vancouver, Canada: Association for Computing Machinery.
Sauro, J., & Lewis, J. R. (2012). Quantifying the user experience: Practical
statistics for user research. Burlington, MA: Morgan Kaufmann.
Schmettow, M. (2008). Heterogeneity in the usability evaluation process. In
Proceedings of the 22nd British HCI Group Annual Conference on HCI
2008: People and Computers XXII: Culture, Creativity, Interaction - Volume
1 (pp. 89–98). Liverpool, UK: Association for Computing Machinery.
Schmettow, M. (2009). Controlling the usability evaluation process under vary-
ing defect visibility. In Proceedings of the 2009 British Computer Society
Conference on Human-Computer Interaction (pp. 188–197). Cambridge,
UK: Association for Computing Machinery.
Schmettow, M. (2012). Sample size in usability studies. Communications of the
ACM,55(4), 64–70.
Schmitt, N., & Stults, D. (1985). Factors defined by negatively keyed items:
The result of careless respondents? Applied Psychological Measurement,9,
367–373.
Scholten, A. Z., & Borsboom, D. (2009). A reanalysis of Lord’s statistical
treatment of football numbers. Journal of Mathematical Psychology,53,
69–75.
Schriesheim, C. A., & Hill, K. D. (1981). Controlling acquiescence response
bias by item reversals: The effect on questionnaire validity. Educational and
Psychological Measurement,41, 1101–1114.
Seffah, A., Donyaee, M., Kline, R. B., & Padda, H. K. (2006). Usability mea-
surement and metrics: A consolidated model. Software Quality Journal,14,
159–178.
Shackel, B. (1990). Human factors and usability. In J. Preece & L. Keller
(Eds.), Human–computer interaction: Selected readings (pp. 27–41). Hemel
Hempstead, England: Prentice Hall International.
Shneiderman, B., & Plaisant, C. (2010). Designing the user interface: Strategies
for effective human-computer interaction, 5th ed. Reading, MA: Addison-
Wesley.
Smith, D. C., Irby, C., Kimball, R., Verplank, B., & Harslem, E. (1982).
Designing the Star user interface. Byte,7, 242–282.
Snyder, K. M., Happ, A. J., Malcus, L., Paap, K. R., & Lewis, J. R. (1985).
Using cognitive models to create menus. In Proceedings of the Human
Factors Society 29th Annual Meeting (pp. 655–658). Baltimore, MD:
Human Factors Society.
Spector, P., Van Katwyk, P., Brannick, M., & Chen, P. (1997). When two fac-
tors don’t reflect two constructs: How item characteristics can produce
artifactual factors. Journal of Management,23, 659–677.
Spencer, R. (2000). The streamlined cognitive walkthrough method: Working
around social constraints encountered in a software development company.
In Proceedings of CHI 2000 (pp. 353–359). New York, NY: Association for
Computing Machinery.
Spohrer, J., & Maglio, P. P. (2008). The emergence of service science:
Toward systematic service innovations to accelerate co-creation of value.
Production and Operations Management,17, 238–246.
Spool, J., & Schroeder, W. (2001). Testing websites: Five users is nowhere near
enough. In CHI 2001 extended abstracts (pp. 285–286). New York, NY:
Association for Computing Machinery.
Stevens, S. S. (1946). On the theory of scales of measurement. Science,103,
677–680.
Stewart, T. J., & Frye, A. W. (2004). Investigating the use of negatively-phrased
survey items in medical education settings: Common wisdom or common
mistake? Academic Medicine,79 (Suppl. 10), S1–S3.
Stigler, S. M. (1999). Statistics on the table: The history of statistical concepts
and methods. Cambridge, MA: Harvard University Press.
Theofanos, M., & Quesenbery, W. (2005). Towards the design of effective
formative test reports. Journal of Usability Studies,1(1), 27–45.
Thimbleby, H. (2007). User-centered methods are insufficient for safety criti-
cal systems. In A. Holzinger (Ed.), Proceedings of USAB 2007 (pp. 1–20).
Heidelberg, Germany: Springer-Verlag.
Thurstone, L. L. (1928). Attitudes can be measured. American Journal of
Sociology,33, 529–554.
Townsend, J. T., & Ashby, F. G. (1984). Measurement scales and statistics: The
misconception misconceived. Psychological Bulletin,96, 394–401.
Travis, D. (2008). Measuring satisfaction: Beyond the usability questionnaire.
Retrieved from http://www.userfocus.co.uk/articles/satisfaction.html.
Tullis, T. S. (1985). Designing a menu-based interface to an operating system.
In Proceedings of CHI 1985 (pp. 79–84). San Francisco, CA: Association
for Computing Machinery.
Tullis, T. S., & Albert, W. (2008). Measuring the user experience: Collecting,
analyzing, and presenting usability data. Waltham, MA: Morgan Kaufmann.
Tullis, T. S., & Albert, W. (2013). Measuring the user experience: Collecting,
analyzing, and presenting usability data, 2nd ed. Waltham, MA: Morgan Kaufmann.
Tullis, T. S., & Stetson, J. N. (2004). A comparison of questionnaires for
assessing website usability. Paper presented at the Usability Professionals
Association Annual Conference. Minneapolis, MN: Usability Professionals
Association.
van de Vijver, F. J. R., & Leung, K. (2001). Personality in cultural context:
Methodological issues. Journal of Personality,69, 1007–1031.
van den Haak, M. J., & de Jong, D. T. M. (2003). Exploring two methods
of usability testing: Concurrent versus retrospective think-aloud proto-
cols. In Proceedings of the International Professional Communication
Conference, IPCC 2003 (pp. 285–287). Orlando, FL: Institute of Electrical
and Electronics Engineers.
Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal, interval,
and ratio typologies are misleading. The American Statistician,47,
65–72.
Virzi, R. A. (1990). Streamlining the design process: Running fewer subjects.
In Proceedings of the Human Factors Society 34th Annual Meeting (pp.
291–294). Santa Monica, CA: Human Factors Society.
Virzi, R. A. (1992). Refining the test phase of usability evaluation: How many
subjects is enough? Human Factors,34, 457–468.
Virzi, R. A., Sorce, J. F., & Herbert, L. B. (1993). A comparison of three usabil-
ity evaluation methods: Heuristic, think-aloud, and performance testing. In
Proceedings of the Human Factors and Ergonomics Society 37th Annual
Meeting (pp. 309–313). Santa Monica, CA: Human Factors and Ergonomics
Society.
Vredenburg, K., Mao, J. Y., Smith, P. W., & Carey, T. (2002). A survey of
user centered design practice. In Proceedings of CHI 2002 (pp. 471–478).
Minneapolis, MN: Association for Computing Machinery.
Wharton, C., Rieman, J., Lewis, C., & Polson, P. (1994). The cognitive walk-
through method: A practitioner’s guide. In J. Nielsen & R. L. Mack (Eds.),
Usability inspection methods (pp. 105–140). New York, NY: Wiley.
Whiteside, J., Bennett, J., & Holtzblatt, K. (1988). Usability engineering: Our
experience and evolution. In M. Helander (Ed.), Handbook of human–
computer interaction (pp. 791–817). Amsterdam, the Netherlands: North-
Holland.
Wildman, D. (1995). Getting the most from paired-user testing. Interactions,
2(3), 21–27.
Williams, G. (1983). The Lisa computer system. Byte,8(2), 33–50.
Winter, S., Wagner, S., & Deissenboeck, F. (2008). A comprehensive model
of usability. In Engineering Interactive Systems (pp. 106–122). Heidelberg,
Germany: International Federation for Information Processing.
Wixon, D. (2003). Evaluating usability methods: Why the current literature fails
the practitioner. Interactions,10(4), 28–34.
Wong, N., Rindfleisch, A., & Burroughs, J. (2003). Do reverse-worded items
confound measures in cross-cultural consumer research? The case of the
material values scale. Journal of Consumer Research,30, 72–91.
Woolrych, A., & Cockton, G. (2001). Why and when five test users
aren’t enough. In J. Vanderdonckt, A. Blandford, & A. Derycke (Eds.),
Proceedings of IHM–HCI 2001 Conference, Vol. 2 (pp. 105–108). Toulouse,
France: Cépadèus Éditions.
Wright, R. B., & Converse, S. A. (1992). Method bias and concurrent ver-
bal protocol in software usability testing. In Proceedings of the Human
Factors and Ergonomics Society 36th Annual Meeting (pp. 1220–1224).
Santa Monica, CA: Human Factors and Ergonomics Society.
Xue, M., & Harker, P. T. (2002). Customer efficiency: Concept and its
impact on e-business management. Journal of Service Research,4,
253–267.
ABOUT THE AUTHOR
James R. Lewis has been a usability practitioner at IBM
since 1981. His books include Practical Speech User Interface
Design (2011) and Quantifying the User Experience (2012,
with Jeff Sauro). He serves on the editorial boards of the
International Journal of Human-Computer Interaction and the
Journal of Usability Studies, is an IBM Master Inventor with
more than 80 U.S. patents, and is currently president of the
Association for Voice Interaction Design.