Intl. Journal of Human–Computer Interaction, 30: 663–684, 2014
Copyright © Taylor & Francis Group, LLC
ISSN: 1044-7318 print / 1532-7590 online
DOI: 10.1080/10447318.2014.930311
Usability: Lessons Learned . . . and Yet to Be Learned
James R. Lewis
IBM Corporation, Software Group, Boca Raton, Florida, USA
The philosopher of science J. W. Grove (1989) once wrote,
“There is, of course, nothing strange or scandalous about divi-
sions of opinion among scientists. This is a condition for scientific
progress” (p. 133). Over the past 30 years, usability, both as a prac-
tice and as an emerging science, has had its share of controversies.
It has inherited some from its early roots in experimental psychol-
ogy, measurement, and statistics. Others have emerged as the field
of usability has matured and extended into user-centered design
and user experience. In many ways, a field of inquiry is shaped by
its controversies. This article reviews some of the persistent con-
troversies in the field of usability, starting with their history, then
assessing their current status from the perspective of a pragmatic
practitioner. Put another way: Over the past three decades, what
are some of the key lessons we have learned, and what remains to
be learned? Some of the key lessons learned are:
• When discussing usability, it is important to distinguish between the goals and practices of summative and formative usability.
• There is compelling rational and empirical support for the practice of iterative formative usability testing—it appears to be effective in improving both objective and perceived usability.
• When conducting usability studies, practitioners should use one of the currently available standardized usability questionnaires.
• Because “magic number” rules of thumb for sample size requirements for usability tests are optimal only under very specific conditions, practitioners should use the tools that are available to guide sample size estimation rather than relying on “magic numbers.”
1. INTRODUCTION
Therefore, the seeker after the truth is not one who studies the writings of the ancients and, following his natural disposition, puts his trust in them, but rather the one who suspects his faith in them and questions what he gathers from them, the one who submits to argument and demonstration, and not to the sayings of a human being whose nature is fraught with all kinds of imperfection and deficiency. Thus the duty of the man who investigates the writings of scientists, if learning the truth is his goal, is to make himself an enemy of all that he reads, and, applying his mind to the core and margins of its content, attack it from every side. He should also suspect himself as he performs his critical examination of it, so that he may avoid falling into either prejudice or leniency. (Ibn al-Haytham, 965–1040, as cited in Sabra, 2003, p. 55)

Editors’ note: This article was the keynote presentation at HCII 2014, the 16th International Conference on Human–Computer Interaction and its Affiliated International Conferences, 22–27 June 2014, Crete, Greece.

Address correspondence to James R. Lewis, 7329 Serrano Terrace, Delray Beach, FL 33446, USA. E-mail: jimlewis@us.ibm.com
I entered the field of usability engineering a little more than
30 years ago. In 1980 I interned at the IBM usability lab
in Boca Raton, Florida (see Figure 1). Although there has
been considerable development in the methods of usability
engineering over the years, a modern practitioner would rec-
ognize the activities. Some of the evaluations that were under
way were traditional human factors experiments, for example,
studies of the optimal angle of inclination of typing key-
boards. Some of the evaluations shared certain properties of
traditional experiments but differed slightly in their focus on
measuring behaviors and attitudes captured as participants completed the key tasks for a product. Other evaluations departed
even more from traditional experimentation in their focus on
the iterative discovery and remediation of usability problems
rather than performance and satisfaction metrics. Thus, even
in the late 1970s, the usability practices commonly referred
to as summative and formative testing (Lewis, 2012) were
present.
Although present, these fundamental usability engineering
methods were not well established. Controversies abounded.
What was the true definition of usability? How reliable were the less traditional iterative evaluations, given their departure from classical psychological experimentation? What was the appro-
priate role of statistical methods (such as hypothesis testing,
confidence intervals, psychometrics and sample size estimation)
in usability engineering?
Much of industrial usability engineering work is confiden-
tial. Companies are reluctant to expose the usability blemishes
of their products and services to the public, preferring instead
to keep them “in house” as they track them and seek to elim-
inate or reduce their impact on users. Practitioners have been
much freer to publish the results of methodological investiga-
tions, exposing and discussing methodological controversies of
significant importance to the development of the field of usabil-
ity engineering. An important aspect of publication is criticism
(through the peer review process and critical literature reviews),
and “criticism is the mother of methodology” (Abelson, 1995, 8th law, p. xv). The purpose of this article is to summarize some of these controversies, providing arguments from both sides and a pragmatic assessment for practitioners. In other words, what are the lessons learned (which controversies appear to be settled) and which are yet to be learned (those for which there is still work to do)?

FIG. 1. Usability lab at IBM facility in Boca Raton, Florida, circa 1978.
2. DO WE KNOW WHAT USABILITY IS?
There has long been a general understanding of the word
usability (sometimes spelled useability). For example, a refrig-
erator advertisement from the 1930s included usability as a
feature, and listed characteristics of usability such as “handier to
use,” “saves steps, saves work,” and “compare with others” (S.
Isensee, personal communication, January 17, 2010; see http://
tinyurl.com/yjn3caa). In 1979, Bennett published what may
have been the first scientific article to have the term “usability”
in the title. But do we really know what usability is?
2.1. One Point of View
The following quotations span a period of roughly 10 to
20 years from Bennett’s early scientific use of the term “usabil-
ity.”
One of the most important issues is that there is, as yet, no gen-
erally agreed definition of usability and its measurement. (Shackel,
1990, p. 31)
Attempts to derive a clear and crisp definition of usability can be
aptly compared to attempts to nail a blob of Jell-O to the wall. (Gray
& Salzman, 1998, p. 242)
A major obstacle to the implantation of User-Centered Design
in the real world is the fact that no precise definition of the concept
of usability exists that is widely accepted and applied in practice.
(Alonso-Ríos, Vázquez-Garcia, Mosqueira-Rey, & Moret-Bonillo,
2010, p. 53)
Basically, the argument from this side is that it is either
impossible or so difficult to define usability that for a period
of more than 30 years there has yet to be a clear and generally
accepted definition. There are several reasons why this might be
so. The measurement of usability is complex because usability
is not a specific property of a person or thing. You cannot mea-
sure usability with a simple “usability” thermometer (Dumas,
2003; Hertzum, 2010; Hornbæk, 2006). Rather, it is an emer-
gent property dependent on interactions among users, products,
tasks, and environments.
Also, there are two major conceptions of usability (Lewis,
2012), commonly referred to as “summative” and “formative.”
Although there are similarities, the differences between summa-
tive and formative usability are substantial enough that a single
concise definition cannot cover both. The focus of summative
usability measurement is on metrics associated with meeting
global task and product goals (i.e., measurement-based usabil-
ity). The focus of formative usability is the detection of usability
problems and the design of interventions to reduce or eliminate
their impact (i.e., diagnostic usability).
2.2. Another Point of View
Although it may be difficult to define usability, it should be
possible given an appropriate distinction between summative
and formative conceptions. One of the early attempts to define
summative usability was the MUSiC project—Measurement of
Usability in Context (Bevan, Kirakowski, & Maissel, 1991).
The MUSiC project focused on metrics of effectiveness and
efficiency in the context of use. This type of research led to
the current International Organization for Standardization (ISO;
1998) and American National Standards Institute (ANSI; 2001)
usability standards, which continued to emphasize the impor-
tance of effectiveness and efficiency in the context of use and
added the subjective metric of satisfaction. Most usability prac-
titioners have at least some familiarity with these standards and
their three key metrics. These metrics and their collection and
interpretation bear a strong resemblance to the methods and
metrics of experimental psychology, especially as instantiated
in human factors engineering (Lewis, 2011a).
The concept of formative usability, on the other hand, repre-
sents a significant departure from the practices of experimental
psychology. Formative usability has strong ties to the prac-
tice of iterative design—building something, checking to see
where it could be improved, improving it, and trying again (i.e.,
design–test–redesign–retest). The earliest definitions of forma-
tive usability came from Chapanis and his students (Al-Awar,
Chapanis, & Ford, 1981; Chapanis, 1981; Kelley, 1984).
Although it is not easy to measure “ease of use,” it is easy to mea-
sure difficulties that people have in using something. Difficulties and
errors can be identified, classified, counted, and measured. So my
premise is that ease of use is inversely proportional to the number
and severity of difficulties people have in using software. There are,
of course, other measures that have been used to assess ease of use,
but I think the weight of the evidence will support the conclusion
that these other dependent measures are correlated with the number
and severity of difficulties. (Chapanis, 1981, p. 3)
The publications of Chapanis and his colleagues had an
almost immediate influence on product development practices
at IBM (Kennedy, 1982; Lewis, 1982) and other companies,
notably Xerox (Smith, Irby, Kimball, Verplank, & Harslem,
1982) and Apple (Williams, 1983). Shortly thereafter, John
Gould and his associates at the IBM T. J. Watson Research
Center began publishing influential papers on usability testing
and iterative design (Gould, 1988; Gould & Boies, 1983; Gould,
Boies, Levy, Richards, & Schoonard, 1987; Gould & Lewis,
1984), as did Whiteside, Bennett, and Holtzblatt (1988) at DEC
(Baecker, 2008; Dumas, 2007). More recently, there have been
contributions by practitioners from the ANSI, both for stan-
dardization of summative usability testing reports (the Common
Industry Format; ANSI, 2001) and recommendations for effec-
tive reporting of formative usability test results (Theofanos &
Quesenbery, 2005), along with several informative government
websites (zing.ncsl.nist.gov/iusr/; www.usability.gov).
The concept of formative usability evaluation in the con-
text of iterative design has affected the development of a
number of empirical and inspection methods. Empirical meth-
ods include standard formative usability testing, either with or
without the think-aloud (TA) protocol (Dumas, 2003) and con-
textual evaluations (Whiteside et al., 1988). Inspection methods
include expert and heuristic evaluations (Nielsen & Mack,
1994), as well as a variety of other structured protocols such as
GOMS (Card, Moran, & Newell, 1983), cognitive walkthroughs
(Spencer, 2000; Wharton, Rieman, Lewis, & Polson, 1994),
and card sorting (Snyder, Happ, Malcus, Paap, & Lewis, 1985;
Tullis, 1985; Tullis & Albert, 2008, 2013).
2.3. Lessons Learned
When discussing usability, it is important to distinguish
between the goals and practices of summative and formative
usability. Following the summative conception, a product is
usable when people can use it for its intended purpose effec-
tively, efficiently, and with a feeling of satisfaction. Following
the formative conception, the presence of usability depends on
the absence of usability problems.
Regarding the assessment of usability, summative and formative methods share a number of properties. Both require
• a careful plan of study, including initial instructions and debriefing protocols;
• participants who are members of the population of interest; and
• appropriate tasks and environments (i.e., context of use).
There are also important differences. Summative evaluations
tend to be more like traditional experiments, with little to no
interaction between observers and participants during the per-
formance of tasks and no changes to the system or product
during the study. Formative studies permit a wide variation in
technique (Wildman, 1995), which can be very formal or infor-
mal, silent or TA; participants can work solo or in pairs; use low-
or high-fidelity prototypes; or use current, future, or competitive
products (Lewis, 2012).
Ideally, practitioners should use both conceptualizations of
usability during iterative design and should combine qualitative
and quantitative methods. Before conducting testing with users,
part of the preparation of a study should include an inspec-
tion method such as heuristic or expert evaluation. Any iterative
method must include a stopping rule to prevent infinite itera-
tions. In the real world, resource constraints and deadlines can
dictate the stopping rule (although this practice is valid only
if there is a reasonable expectation that undiscovered problems
will not lead to drastic consequences). In an ideal setting, the
results of summative usability testing can act as a stopping
rule for the iterative formative studies when user performance
and preference meet predefined summative goals (Lewis, 2012).
This is not a new concept, having appeared in the seminal
paper by Al-Awar et al. (1981) describing iterative usability
evaluation:
Our methodology is strictly empirical. You write a program, test
it on the target population, find out what’s wrong with it, and revise
it. The cycle of test–rewrite is repeated over and over until a sat-
isfactory level of performance is reached. Revisions are based on
the performance, that is, the difficulties typical users have in going
through the program. (p. 31)
2.4. Lessons Yet to Be Learned
Usability science is a relatively young endeavor, and there
are a number of lessons yet to be learned. An exhaustive
treatment is beyond the scope of this article. Two of the exist-
ing controversies are the scope of usability and the details of
appropriate TA evaluations.
The scope of usability. What is the appropriate scope of
usability? As just described, the typical definitions of usabil-
ity have to do with measurements of effectiveness, efficiency,
and satisfaction (summative) or the absence of usability prob-
lems (formative). These definitions are fairly well engrained
in current practices of usability engineering. Over the decades
since the early scientific descriptions of usability, there have
been several extensions, including user-centered design (UCD;
Vredenburg, Mao, Smith, & Carey, 2002) and, more recently,
user experience (UX; Tullis & Albert, 2008, 2013).
These extensions typically have traditional usability as a core
concept. The extensions of UCD were primarily in the specifi-
cation of product development practices and included usability
engineering, human factors engineering, and ergonomics, all
within frameworks intended to incorporate these activities into
the product development life cycle. For UX, the extensions have
been more in the direction of design and measurement beyond
the traditional goals of effectiveness, efficiency, and satisfaction
to experiences that have a more compelling emotional effect.
Historically, UCD subsumed usability engineering (as well as
ergonomics and human factors engineering), and UX has sub-
sumed UCD. In the near future, perhaps UX will become part
of a larger customer experience effort, especially given recent
emphasis on service design and the emergence of the discipline
of service science.
Service science (Lusch, Vargo, & O’Brien, 2007; Lusch,
Vargo, & Wessels, 2008; Pitkänen, Virtanen, & Kemppinen,
2008; Spohrer & Maglio, 2008) is an interdisciplinary area of
study focused on systematic innovation in service as opposed
to physical product design. The U.S. economy relies heavily
on service industries (>75%; see Larson, 2008). In a service
industry customers pay for performance rather than physical
goods. Some key attributes of service are that it is time per-
ishable, created and used simultaneously, and includes a client
who participates in the coproduction of value. As work in a ser-
vice system matures, there tends to be a shift from service based
on human talent to technology-based self-service (Spohrer &
Maglio, 2008). With the change to automated service deliv-
ered through interactive voice response systems (Lewis, 2011b),
mobile services, or websites, it is important to design for an
excellent user experience (effective, efficient, and satisfying),
especially when it is relatively easy for users to switch service
providers. A highly usable and compelling service experience
leads to enhanced customer attraction and retention (Xue &
Harker, 2002). It seems likely that service science and customer
experience would benefit from the adoption of lessons learned
in usability engineering and science.
Throughout the transformations from usability engineering
to UCD to UX, usability can be a relatively stable component.
Some researchers have suggested changes to the fundamental
definition of usability. Bevan (2009) recommended including
flexibility and safety to create a more comprehensive quality-
of-use model. An even more expansive scheme (quality in use
integrated measurement) included 10 factors, 26 subfactors, and
127 specific metrics (Seffah, Donyaee, Kline, & Padda, 2006).
Winter, Wagner, and Deissenboeck (2008) published a two-
dimensional model of usability that associated a large number of
system properties with user activities. Alonso-Rios et al. (2010)
assembled a taxonomy that included traditional and nontradi-
tional aspects of usability organized under the primary factors
of Knowability, Operability, Efficiency, Robustness, Safety,
and Subjective Satisfaction. It isn’t yet clear what role these
expanded models of usability will play in the work of future
practitioners and researchers. There is compelling psychome-
tric evidence for an underlying construct of usability for the
traditional metrics of effectiveness, efficiency, and satisfaction
(Sauro & Lewis, 2009), but these expanded definitions have
yet to undergo statistical testing to confirm their hypothesized
structures.
There are still lessons to be learned in investigating the
effects of culture on the construct of usability (Hertzum et al.,
2007; Marcus, 2007). To what extent are aspects of usabil-
ity culturally invariant and what aspects are affected by cul-
ture? As discussed later in this article, there are clear cul-
tural considerations when conducting TA studies (Clemmensen,
Hertzum, Hornbæk, Shi, & Yammiyavar, 2009) or when trans-
lating standardized usability questionnaires (van de Vijver
& Leung, 2001). Surveys of Danish and Chinese users
revealed differences in the understanding and prioritization of
aspects of usability (Frandsen-Thorlacius, Hornbæk, Hertzum,
& Clemmensen, 2009). The finding that system acceptance was
more affected by perceived usefulness for Chinese students but more by perceived ease of use for Indonesian students (Evers & Day,
1997) suggests that a simple East–West cultural dichotomy will
be insufficient.
TA methodology. In a TA study, participants receive
instructions to talk about what they’re doing as they do it, and
receive reminders to talk aloud if they forget to do so. The most
common theoretical justification for the use of TA is from the
human problem-solving research of Ericsson and Simon (1980),
who found that certain kinds of verbal reports could produce
reliable data. Specifically, the verbalizations that participants
produce during task performance that do not require additional
cognitive processing beyond that required for task performance
and verbalization tend to be reliable.
Some of the common claims associated with TA studies
are that they are more productive for finding usability prob-
lems (van den Haak & de Jong, 2003; Virzi, Sorce, & Herbert,
1993) and thinking aloud does not affect user ratings or per-
formance (Bowers & Snyder, 1990; Ohnemus & Biers, 1993;
Olmsted-Hawala, Murphy, Hawala, & Ashenfelter, 2010).
Although there is some evidence in support of these claims,
the evidence is mixed. For example, Berry and Broadbent
(1990) reported that the TA process invoked cognitive pro-
cesses that improved rather than degraded performance. Wright
and Converse (1992) compared silent with TA usability testing
and found that the TA group performed better with the differ-
ence increasing as a function of task difficulty. MacDonald,
McGarry, and Willis (2013) found that task complexity likely
interacts with differences in TA protocols.
TA practice in usability testing often does not conform to
its most cited theoretical basis (Ericsson & Simon, 1980), with
reported inconsistencies in explanations to participants about
how to do TA, practice periods, styles of reminding participants
to TA, prompting intervals, and styles of intervention (Boren
& Ramey, 2000; MacDonald, Edwards, & Zhao, 2012). Boren
and Ramey (2000) suggested an alternative theoretical approach
to TA based on speech communication theory, with clearly
defined communicative roles for the participant (in the role
of domain expert or valued customer, making the participant
the primary speaker) and the usability practitioner (the learner
or listener, thus a secondary speaker). They recommended the
use of acknowledgment tokens that do not take speakership
away from the participant, such as “mm hm?” and “uh-huh?”
because in normal communication silence can be interpreted as
aloofness or condescension.
Krahmer and Ummelen (2004) conducted an exploratory
comparison of the Ericsson and Simon (E&S) versus the Boren
and Ramey (B&R) TA procedures and found similar outcomes
for both procedures. The main difference was that moderators in
the B&R condition intervened more frequently with the conse-
quence that the participants were less lost and completed more
tasks. Hertzum, Hansen, and Andersen (2009) compared silent
task completion with strict E&S and more relaxed TA. Strict
E&S TA required more time for task completion, but the TA
method did not affect successful task completion rates (which
tended to be high in the study).
Olmsted-Hawala et al. (2010) studied E&S TA, B&R TA,
a less restrictive coaching protocol in which moderators could
freely probe participants, and silence (no TA at all). The out-
comes were similar for silence, E&S, and B&R procedures.
Participants in the coaching condition successfully completed
significantly more tasks and had higher satisfaction ratings.
Their results for B&R differed from those reported by Krahmer
and Ummelen (2004): “Since the test administrator in the
Krahmer & Ummelen study offered assistance and encour-
agement to the test subject during the session, we think their
speech-communication protocol is more akin to the coaching
condition in our study” (Olmsted-Hawala et al., 2010, p. 2387).
Clemmensen et al. (2009) discussed the impact of cultural
differences on TA. There are several ways in which cultural
differences could affect testing, such as the instructions and
tasks, the participant’s verbalization, how the observer “reads”
the participant, and the overall relationship between participant
and observer. In particular, with regard to studies that have
Western observers and Eastern participants, they recommended
that observers should allow sufficient time for participants to
pause while thinking aloud, rely less on expressions of surprise,
and be sensitive to the tendency for indirect criticism.
The evidence indicates that relative to silent participation, TA
can affect task performance and reported satisfaction, depend-
ing on the exact TA protocol in use. If the primary purpose
of the study is problem discovery, TA appears to be advanta-
geous over completely silent task completion. If the primary
purpose of the test is task performance measurement, the use
of TA is somewhat more complicated. As long as all the tasks
in the planned comparisons were completed under the same
conditions, performance comparisons should be legitimate. It is
critical, however, that practitioners using TA provide a complete
description of their method, including the kind and frequency
of probing. There is still work to do before we will have a
deep understanding of the effects of current variations in TA
protocols.
3. IS FORMATIVE USABILITY TESTING RELIABLE?
The widespread use of formative usability testing is evi-
dence that practitioners generally believe that it is effective.
However, there are fields in which practitioners’ belief in the
effectiveness of their methods does not appear to be warranted
by the evidence (e.g., the use of projective techniques such
as the Rorschach test in psychotherapy; see Lilienfeld, Wood,
& Garb, 2000). Might it be possible that formative usability
testing is fundamentally unreliable and, if so, should usability
practitioners abandon the method?
3.1. One Point of View
Human factors work can be reliable: different human factors engineers, using different human factors techniques at different stages of a product’s development, identified many of the same potential usability defects. (Marshall, Brendan, & Prail, 1990, p. 243)
There are rational arguments in favor of iterative formative
usability testing, starting with the early publications that ini-
tially described and promoted the method (Al-Awar, Chapanis,
& Ford, 1981; Chapanis, 1981; Gould, 1988; Gould & Lewis,
1984). The basic idea of achieving usability by watching people
use a product, noting the problems, and then fixing the prob-
lems, seems irrefutable. This rational argument has received
support from a number of published case studies and a few early
experiments (G. Bailey, 1993; R. W. Bailey, Allan, & Raiello,
1992; Gould et al., 1987; Høegh & Jensen, 2008; Lewis, 1996;
Marshall et al., 1990). Published cost–benefit analyses (Bias
& Mayhew, 1994) have demonstrated the value of usability
engineering processes that include usability testing, with cost–
benefit ratios ranging from 1:2 for smaller projects to 1:100 for
larger projects (Karat, 1997). For example, consider the results
of one case study (Lewis, 1996) and one experiment (G. Bailey,
1993).
Lewis (1996) published a case study of the development
of the Simon, a personal communicator now widely consid-
ered to be the first commercially available smartphone. The
development team (including the usability engineers) defined
a set of tasks to use to develop competitive benchmarks and
for iterative formative usability testing. As shown in Figure 2,
the perceived usability of the Simon (measured using the Post-
Study System Usability Questionnaire [PSSUQ]; Lewis, 1995)
dramatically improved after the first iteration (from Simon A
to Simon B), then showed very little improvement after the
second iteration (from Simon B to Simon C). The perceived
usability of the Simon after the application of formative usabil-
ity testing was better than the initially established benchmarks
(Lewis, 1996).
G. Bailey (1993) conducted an experiment in which he
had eight designers use a prototyping tool to create a recipes
application. Bailey then recorded participants performing tasks
with each of the prototypes, three participants per prototype in
a between-subjects design. Each designer reviewed the tapes
of the use of his or her prototype and used those observa-
tions to modify the design. This process continued until each
designer indicated that further improvements were not possible.
All designers stopped after three to five iterations. Comparison
of the first and last designs showed significant improvement
in successful task completion rates, task completion times, and
number of serious errors.
3.2. Another Point of View
Our main conclusion is that our simple assumption that we are
all doing the same and getting the same results in a usability test is
plainly wrong. (Molich, Ede, Kaasgaard, & Karyukin, 2004, p. 65)
Since 1998, a number of papers have questioned the reliabil-
ity of usability problem discovery (Kessner, Wood, Dillon, &
West, 2001; Molich et al., 1998; Molich, Ede, Kaasgaard, &
Karyukin, 2004; Molich & Dumas, 2008)—a process that is
at the heart of iterative usability testing. The consistent find-
ing from this line of research (in particular, Molich’s series of
competitive usability evaluation [CUE] studies) has been that
observers, individually or in teams, who evaluated the same
product discovered very different sets of usability problems.
In Molich et al. (1998), four independent usability labo-
ratories carried out inexpensive usability tests of a software
application for new users. The four teams reported 141 dif-
ferent problems, with only one problem common among all
four teams. Kessner et al. (2001) had six professional usabil-
ity teams independently test an early prototype of a dialog
box. None of the problems were detected by every team, and
18 problems were described by only one team. Molich et al.
(2004) assessed the consistency of usability testing across nine
independent organizations that evaluated the same website.
They documented considerable variability in methodologies,
resources applied, and problems reported. There were a total of
310 reported problems, with only two problems reported by six
or more organizations, and 232 (75%) uniquely reported prob-
lems. The fourth CUE (CUE-4; Molich & Dumas, 2008) had a
similar method and similar outcomes.
FIG. 2. Improvements in perceived usability through iterative usability testing (mean PSSUQ ratings, on a 1-to-7 scale, for the SysUse, InfoQual, IntQual, and Overall scales, comparing the Benchmark with Simon A, Simon B, and Simon C). Note. Lower Post-Study System Usability Questionnaire (PSSUQ) scores indicate better perceived usability.
3.3. Lessons Learned
There is compelling rational and empirical support for the
practice of iterative formative usability testing—it appears to be
effective in improving both objective and perceived usability.
There is also compelling evidence that a key aspect of formative
usability testing—problem discovery—is not reliable because
independent evaluations do not appear to produce identical (or
even similar) lists of usability issues. Thus, we have two lessons
learned that appear to be in stark contrast with one another.
To reconcile this apparent dilemma, it is important to keep
in mind the limitations of small-sample formative usability test-
ing. It is not realistic to expect similar lists of usability problems
when sample sizes are small and when there is variation in the
sample of tested tasks. These conditions will lead inevitably
to different sets of usability issues. The process of iterative
usability testing is a hill-climbing procedure, so variation in the
sets of usability problems driving redesign should still result in
movement up the hill toward more usable design—just not nec-
essarily the exact same design. A characteristic of hill-climbing
procedures is that there may be many paths up the hill.
Iterative usability testing alone cannot ensure a usable-
enough design. It is also important, when working in an existing
design space, to have an understanding of the competitive
design landscape. To accomplish this, define the key set of tasks
early in the design process, and then conduct usability tests
with key competitors to establish benchmarks for objective and
perceived usability (e.g., as in Lewis, 1996). Iterative forma-
tive usability evaluation shows the way to climb the hill toward
more usable design, and competitive usability testing points to
the right hill to climb.
The results of the CUE (and similar) studies of Molich and
colleagues (Molich et al., 1998; Molich & Dumas, 2008; Molich
et al., 2004) show that usability practitioners must conduct their
usability tests as carefully as possible, document their methods
completely, and show proper caution when interpreting their
results. Although still a valuable and necessary part of UCD,
the limitations of usability testing make it insufficient for cer-
tain testing goals, such as quality assurance of safety-critical
systems (Schmettow, 2008; Thimbleby, 2007), which need addi-
tional activities such as acceptance and continuous use testing.
It can be difficult to assess complex systems with complex goals
and tasks (Howard, 2008; Howard & Howard, 2009; Redish,
2007). On the other hand, as Landauer stated in 1997 (and it remains true today): “There is ample evidence that expanded
task analysis and formative evaluation can, and almost always
do, bring substantial improvements in the effectiveness and
desirability of systems” (p. 204).
3.4. Lessons Yet to Be Learned
There is a clear need for more research in the reliability
of formative usability testing—not so much the lack of relia-
bility as its practical consequences. For example, a limitation
of research that stops with the comparison of problem lists is
that it is not possible to assess the magnitude of the usability
improvement (if any) that would result from product redesigns
based on design recommendations derived from the problem
lists (Hornbæk, 2010; Wixon, 2003).
A related research area is in the quality of usability prob-
lem lists and associated recommendations. Because the content
of these lists provides the most direct information about how
to improve usability, developers have an intense interest in
them (Capra, 2007; Høegh, Nielsen, Overgaard, Pedersen, &
Stage, 2006; Nørgaard & Hornbæk, 2009). There are, however,
different approaches to the construction of usability recommen-
dations and to their prioritization.
Recommendations. For example, there is some controversy
in the usability practitioner community regarding whether prob-
lem descriptions should or should not include recommendations
(Theofanos & Quesenbery, 2005). In an exploratory study of
different presentation methods for usability issues and recom-
mendations (Nørgaard & Hornbæk, 2009), developers rated
redesign proposals, multimedia presentations, and screenshots
as useful inputs, problem lists second, and scenarios as least
helpful, with problem lists best suited for documenting rel-
atively simple problems that did not require a strong con-
text for understanding the issue. Molich, Jeffries, and Dumas
(2007) analyzed data collected during CUE-4 to develop guide-
lines for making usability recommendations useful and usable,
including
• Communicate clearly at the conceptual level.
• Ensure that recommendations improve overall usability.
• Be aware of business or technical constraints.
• Solve the whole problem, not just a special case.
The results and recommendations from this line of research
seem reasonable, but they still require validation so practition-
ers can understand their true downstream utility in leading to
changes that improve usability.
Prioritization. Another area in which there is variability
in usability engineering practice is the prioritization of usabil-
ity problems. Because usability tests can reveal more problems
than there are resources to address, it is important to have
some means for prioritization, keeping in mind that design pro-
cess considerations can influence the specific usability changes
made to a product (Hertzum, 2006). Two fundamentally dif-
ferent approaches to prioritization are judgment driven (Virzi,
1992) and data driven (Dumas & Redish, 1999; Lewis, Henry,
& Mack, 1990; Rubin, 1994; Rubin & Chisnell, 2008). The
bases for judgment-driven prioritizations are the ratings of
stakeholders in the project (such as usability practitioners and
developers). The bases for data-driven prioritizations are the
data associated with the problems, such as frequency, impact,
ease of correction, and likelihood of usage of the portion of
the product in which the problem occurred (Lewis, 2012).
Of these, the most common measurements are frequency and
impact (sometimes referred to as severity). In a study of the
two approaches to prioritization, Hassenzahl (2000) found that
data-driven and judgment-driven estimates differed.
The measurement of frequency of occurrence is the straightforward division of the number of occurrences within participants by the number of participants, usually at the
task level. A common method (Dumas & Redish, 1999; Rubin,
1994; Rubin & Chisnell, 2008) for assessing impact is to assign
impact scores according to whether the problem (a) prevents
task completion, (b) causes a significant delay or frustration, (c)
has a relatively minor effect on task performance, or (d) is a
suggestion.
Prioritization based on multiple types of data requires some
means of data combination. For example, one could employ a
graphical problem grid with frequency on one axis and impact
on the other. High-frequency, high-impact problems would
receive treatment before low-frequency, low-impact problems.
The relative treatment of high-frequency, low-impact problems
and low-frequency, high-impact problems would depend on
practitioner judgment.
Rubin (1994) described an arithmetic procedure for com-
bining four levels of impact (using the criteria just described
with 4 assigned to the most serious level) with four levels of
frequency (4: frequency ≥ 90%; 3: 51–89%; 2: 11–50%; 1: ≤ 10%) by adding the scores. For example, if a problem had
an observed frequency of occurrence of 80% and had a minor
effect on performance, its priority would be 5 (a frequency rat-
ing of 3 plus an impact rating of 2). With this approach, priority
scores can range from a low of 2 to a high of 8. If information
is available about the likelihood that a user would work with
the part of the product that enables the problem, this informa-
tion would be used to adjust the frequency rating. Continuing
the example, if the expectation is that only 10% of users would
encounter the problem, the priority would be 3 (a frequency rat-
ing of 1 for the 10% × 80%, or an 8% likelihood of occurrence
plus an impact rating of 2).
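To make the arithmetic concrete, the sketch below implements the additive scheme just described in Python. The function names and example values are illustrative assumptions; only the frequency cutoffs, impact levels, and worked example follow the description above.

```python
def rubin_frequency_rating(frequency_pct):
    """Map an observed frequency of occurrence (%) to Rubin's 1-4 rating."""
    if frequency_pct >= 90:
        return 4
    elif frequency_pct >= 51:
        return 3
    elif frequency_pct >= 11:
        return 2
    else:
        return 1

def rubin_priority(frequency_pct, impact_rating, likelihood_of_use=1.0):
    """Additive priority: frequency rating plus impact rating (range 2-8).

    impact_rating: 4 = prevents task completion, 3 = significant delay or
    frustration, 2 = minor effect on task performance, 1 = suggestion.
    likelihood_of_use optionally discounts the observed frequency.
    """
    adjusted_frequency = frequency_pct * likelihood_of_use
    return rubin_frequency_rating(adjusted_frequency) + impact_rating

# Worked example from the text: 80% frequency, minor impact (2) -> 3 + 2 = 5.
print(rubin_priority(80, 2))                           # 5
# If only 10% of users are expected to encounter it: 1 + 2 = 3.
print(rubin_priority(80, 2, likelihood_of_use=0.10))   # 3
```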
A similar strategy is to multiply the observed percentage fre-
quency of occurrence by the impact score (Lewis et al., 1990).
The range of priorities depends on the values assigned to each
impact level. Assigning 10 to the most serious impact level
leads to a maximum priority (severity) score of 1000 (which can
optionally be divided by 10 to create a scale ranging from 1 to
100). Appropriate values for the remaining three impact cate-
gories depend on practitioner judgment, but a reasonable set is
5, 3, and 1. Using those values, the problem with an observed
frequency of occurrence of 80% and a minor effect on perfor-
mance would have a priority of 24 (80 × 3/10). It is possible
to extend this method to account for the likelihood of use using
the same procedure as that described by Rubin (1994), which
in the example resulted in modifying the frequency measure-
ment from 80 to 8%. Another way to extend the method is to
categorize the likelihood of use with a set of categories such as
very high likelihood (assigned a score of 10), high likelihood
(assigned a score of 5), moderate likelihood (assigned a score
of 3), and low likelihood (assigned a score of 1) and multiply
all three scores to get the final priority (severity) score (then
optionally divide by 100 to create a scale that ranges from 1 to
100). Continuing the previous example with the assumption that
the task in which the problem occurred has a high likelihood of
occurrence, the problem’s priority would be 12 (5 × 240/100).
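A comparable sketch of the multiplicative approach (frequency × impact, optionally × likelihood of use) follows. The impact and likelihood values of 10, 5, 3, and 1 match the reasonable set suggested above, but the function itself is only an illustration, not a fixed standard.

```python
def multiplicative_priority(frequency_pct, impact_score, likelihood_score=None):
    """Severity = frequency (%) x impact score, optionally x likelihood score.

    impact_score: 10 for the most serious level; for example, 5, 3, and 1 for
    the lesser levels. likelihood_score: for example, 10 (very high), 5 (high),
    3 (moderate), 1 (low). Results are rescaled so the maximum score is 100.
    """
    if likelihood_score is None:
        return frequency_pct * impact_score / 10
    return frequency_pct * impact_score * likelihood_score / 100

# Worked examples from the text: 80% frequency with a minor (3) impact.
print(multiplicative_priority(80, 3))                       # 24.0
# The same problem in a task with a high (5) likelihood of use.
print(multiplicative_priority(80, 3, likelihood_score=5))   # 12.0
```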
As far as I know, no one has systematically compared
various prioritization schemes against one or more sets of
usability problems to investigate the similarities and differ-
ences in their outputs. Of even greater utility to practitioners
would be insights into the downstream effectiveness of various
prioritization schemes, but it is notoriously difficult to con-
duct that type of research. Clearly, there are still lessons to be
learned about how to effectively prioritize usability problems
and recommendations.
4. IS IT OK TO AVERAGE MULTIPOINT SCALES?
Usability practitioners who use measurement and statistics
to guide design recommendations, as most do, inherit the con-
troversies from those fields. One of the ongoing controversies
in measurement and statistics is what role, if any, the level of
measurement plays in determining acceptable arithmetic and
statistical manipulation. The controversy started when S. S.
Stevens (1946) declared that all numbers are not created equal,
and defined the following levels of measurement:
• Nominal: Numbers that are simply labels, such as the numbering of football players or model numbers.
• Ordinal: Numbers that have an order, but the differences between numbers do not necessarily correspond to the differences in the underlying attribute, such as levels of multipoint rating scales or rank order of sports teams based on percentage of wins.
• Interval: Numbers that not only are ordinal but for which equal differences in the numbers correspond to equal differences in the underlying attribute, such as Fahrenheit or Celsius temperature scales.
• Ratio: Numbers that not only are interval but for which there is a true 0 point so equal ratios in the numbers correspond to equal ratios in the underlying attribute, such as time intervals (reaction time, task completion times) or the Kelvin temperature scale.
4.1. One Point of View
From these four classes of measurements, Stevens argued
that certain types of arithmetic operations were not reason-
able to apply to certain types of data. Based on the “principle
of invariance,” he recommended against doing anything more
than counting nominal and ordinal data, and he restricted addi-
tion, subtraction, multiplication, and division to interval and
ratio data. From this perspective, strictly speaking, the mul-
tipoint scales commonly used for rating attitudes are ordinal
measurements, so it would not be permissible to compute their
arithmetic means. If it’s illogical to compute means of rating
scale data, then it follows that it is incorrect to use statistical
procedures such as t tests that depend on computing the mean.
Stevens’s levels of measurement have been very influential,
appearing in numerous statistics textbooks and used to guide
recommendations given to users of some statistical analysis
programs (Velleman & Wilkinson, 1993).
4.2. Another Point of View
That I do not accept Stevens’ position on the relationship
between strength of measurement and “permissible” statistical pro-
cedures should be evident from the kinds of data used as examples
throughout this Primer: level of agreement with a questionnaire item,
as measured on a five-point scale having attached verbal labels. ...
This is not to say, however, that the researcher may simply ignore the
level of measurement provided by his or her data. It is indeed crucial
for the investigator to take this factor into account in considering the
kinds of theoretical statements and generalizations he or she makes
on the basis of significance tests. (Harris, 1985, pp. 326–328)
Even if one believes that there is a “real” scale for each attribute,
which is either mirrored directly in a particular measure or mirrored
as some monotonic transformation, an important question is, “What
difference does it make if the measure does not have the same zero
point or proportionally equal intervals as the ‘real’ scale?” If the
scientist assumes, for example, that the scale is an interval scale
when it “really” is not, something should go wrong in the daily
work of the scientist. What would really go wrong? All that could go
wrong would be that the scientist would make misstatements about
the specific form of the relationship between the attribute and other
variables. . . . How seriously are such misassumptions about scale
properties likely to influence the reported results of scientific exper-
iments? In psychology at the present time, the answer in most cases
is “very little.” (Nunnally, 1978, p. 28)
For analyzing ordinal data, some researchers have recom-
mended the use of nonparametric statistical methods that are
similar to the well-known t and F tests but which replace the
original data with ranks before analysis (Bradley, 1976). These
methods (e.g., the Mann-Whitney U-test, the Friedman test,
or the Kruskal-Wallis test), however, involve taking the means
and standard deviations of the ranks, which are ordinal—not
interval or ratio—data. Despite these violations of permissible
manipulation from Stevens’s point of view, those methods work
perfectly well.
Probably the most famous counterargument was by Lord
(1953) with his parable of a retired professor who had a machine
used to randomly assign football numbers to the jerseys of
freshmen and sophomore football players at his university—a
clear use of numbers as labels (nominal data). After assigning
numbers, the freshmen complained that the assignment wasn’t
random—they claimed to have received generally smaller num-
bers than the sophomores and that the sophomores must have
tampered with the machine. In a panic and to avoid impend-
ing violence between the classes, the professor consulted with a
statistician to investigate how likely it was that the freshmen got
their low numbers by chance. Over the professor’s objections,
the statistician determined the population mean and standard
deviation of the football numbers—54.3 and 16.0, respectively.
He found that the mean of the freshmen’s numbers was too
low to have happened by chance, strongly indicating that the
sophomores had tampered with the football number machine
to get larger numbers. The famous fictional dialog between the
professor and the statistician was as follows:
“But these numbers are not cardinal numbers,” the professor
expostulated. “You can’t add them.”
“Oh, can’t I?” said the statistician. “I just did. Furthermore, after
squaring each number, adding the squares, and proceeding in the
usual fashion, I find the population standard deviation to be exactly
16.0.”
“But you can’t multiply ‘football numbers,’” the professor
wailed. “Why, they aren’t even ordinal numbers, like test scores.”
“The numbers don’t know that,” said the statistician. “Since the
numbers don’t remember where they came from, they always behave
just the same way, regardless.” (Lord, 1953, p. 751)
The controversy continued for decades, with measurement
theorists generally supporting the importance of levels of mea-
surement and applied statisticians arguing against it. In their
recap of the controversy, Velleman and Wilkinson (1993) wrote,
“At times, the debate has been less than cordial. Gaito (1980)
aimed sarcastic barbs at the measurement theory camp and
Townsend and Ashby (1984) fired back. Unfortunately, as
Mitchell (1986) noted, they often shot past each other” (p. 68).
The debate has continued into the 21st century (Scholten &
Borsboom, 2009).
It is interesting to note that in Stevens’s (1946) original
paper, he actually took a fairly moderate stance.
On the other hand, for this ‘illegal’ statisticizing there can be
invoked a kind of pragmatic sanction: In numerous instances it leads
to fruitful results. While the outlawing of this procedure would prob-
ably serve no good purpose, it is proper to point out that means and
standard deviations computed on an ordinal scale are in error to the
extent that the successive intervals on the scale are unequal in size.
When only the rank-order of data is known, we should proceed cau-
tiously with our statistics, and especially with the conclusions we
draw from them. (p. 679)
Responding to criticisms of the implications of his
1953 paper, Lord (1954) challenged critics of his logic to par-
ticipate in a game based on the “football numbers” story, with
the statistician paying the critic one dollar every time the statis-
tician incorrectly designated a sample as being drawn from one
of two populations of nominal two-digit numbers and the critic
paying the statistician one dollar when he is right. No critic ever
agreed to play the game.
4.3. Lessons Learned
In the late 1980s I worked on a high-profile project to com-
pare performance and satisfaction across a set of common tasks
for three competitive office application suites (Lewis et al.,
1990). Based on what I had learned in college about Stevens’s
levels of measurement, I pointed out that the multipoint rating
scale data we were dealing with did not meet the assumptions
required for the computation of means, so we should instead
present medians. I also advised using the nonparametric Mann-
Whitney U-test rather than t tests for individual comparisons of
the rating scale results.
The practitioners who started running the statistics and
putting the presentation together (which would have been given
to a group that included high-level IBM executives) called me
in a panic after they started following this advice. In the analy-
ses, there were cases where the medians were identical but the
U-test detected a statistically significant difference. The U-test
is sensitive not only to central tendency but also to the shape of
the distribution, and in these cases the distributions had opposite
skew with overlapping medians. After this experience, I system-
atically investigated the relationship among mean and median
differences for multipoint scales and the observed significance
levels of t tests and U-tests conducted on the same data, all taken
from this fairly large-scale usability test. The mean difference
correlated more than the median difference with the observed
significance levels (both parametric and nonparametric) for dis-
crete multipoint scale data (Lewis, 1993). For the purposes of
analysis and presentation, mean differences were significantly
superior to median differences, and there was apparently no
compelling reason to use the nonparametric rather than the t test
for the assessment of significant differences. However, it would
be a mistake to just ignore the level of measurement.
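The following illustration shows the kind of situation described above, using invented 7-point ratings (not the data from Lewis, 1993) with identical medians but different means and opposite skew; both the t test on the means and the Mann-Whitney U test detect a difference that the medians alone would hide. It is a minimal sketch that assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy import stats

# Hypothetical 7-point ratings with the same median (4) but opposite skew.
group_a = np.array([1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 4, 4])
group_b = np.array([4, 4, 4, 4, 4, 4, 4, 5, 6, 6, 7, 7])

print("medians:", np.median(group_a), np.median(group_b))  # 4.0 vs 4.0
print("means:  ", group_a.mean(), group_b.mean())          # about 3.1 vs 4.9

t_stat, t_p = stats.ttest_ind(group_a, group_b)
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"t test:        t = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Mann-Whitney:  U = {u_stat:.1f}, p = {u_p:.4f}")
```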
When making claims about the interpretation of the out-
comes of statistical tests, it is important to keep in mind that
rating scale data are ordinal rather than interval. An average rat-
ing of 4 might be better than an average rating of 2, and a t test
might indicate that across a group of participants, the difference
is consistent enough to be statistically significant. Even so, it
would be inappropriate to claim that it is twice as good (a ratio
claim), nor should one claim that the difference between 4 and
2 is equal to the difference between 4 and 6 (an interval claim).
The only reasonable claim is that there is a consistent difference.
Fortunately, even if one made the mistake of thinking one prod-
uct was twice as good as another when the scale didn’t justify
it, it would be a mistake that often would not affect the practical
decision of which product is better.
4.4. Lessons Yet to Be Learned
There are methods other than computing the median that are
alternatives to the mean when dealing with multipoint scales.
One of the most common is to use top-box scoring. For exam-
ple, one way to analyze data collected using a 5-point scale is to
report the percentage of ratings of 4 and 5—a top-2-box score.
A potential downside of this method is that these types of met-
rics tend to have a higher variability than the mean, so to achieve
an equal precision of measurement would require a larger sam-
ple size (Sauro & Lewis, 2012). I do not know of any systematic
investigation of means and top-box scoring methods using real
usability data, nor any work done to compare their downstream
utility (which is more effective in leading to changes that
improve usability), so there are still lessons to learn in this area.
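As a rough illustration of the variability point, the sketch below compares the standard error of the mean with the standard error of a top-2-box proportion for a hypothetical set of 5-point ratings; the data and the scaling to a common range are assumptions for illustration only.

```python
import numpy as np

# Invented 5-point ratings from 20 hypothetical participants.
ratings = np.array([2, 3, 3, 4, 4, 4, 5, 5, 5, 5,
                    3, 4, 2, 5, 4, 4, 3, 5, 4, 4])
n = len(ratings)

mean_score = ratings.mean()
sem = ratings.std(ddof=1) / np.sqrt(n)        # standard error of the mean

top2 = np.mean(ratings >= 4)                  # top-2-box proportion (4s and 5s)
se_top2 = np.sqrt(top2 * (1 - top2) / n)      # binomial standard error

print(f"mean = {mean_score:.2f} (SE = {sem:.3f} on the 1-5 scale)")
print(f"top-2-box = {top2:.0%} (SE = {se_top2:.1%})")
# Express both as a fraction of their possible range (4 points vs. 1)
# to compare precision: the top-2-box estimate is noisier.
print(f"relative SE: mean = {sem / 4:.1%}, top-2-box = {se_top2:.1%}")
```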
5. HOW ROBUST ARE STANDARDIZED USABILITY
QUESTIONNAIRES?
A questionnaire is a form designed to obtain information
from respondents. The items in a questionnaire can be open-
ended questions but are more typically multiple choice, with
respondents selecting from a set of alternatives (“Please select
the type of the wine that you prefer.”) or points on a rating
scale (“On a scale of 1 to 5 where 1 is very dissatisfied and
5 is very satisfied, how satisfied were you with your recent visit
to our airport?”). A standardized questionnaire is one designed
for repeated use, typically with a specific set of questions
presented in a specified order using a specified format, with spe-
cific rules for producing metrics. As part of the development
of standardized questionnaires, it is customary for the devel-
oper to report its reliability, validity, and sensitivity—in other
words, for the questionnaire to have undergone psychometric
qualification (Nunnally, 1978).
Standardized measures offer many advantages to practi-
tioners, specifically objectivity, easier replicability, quantifica-
tion, economy, and easier communication of results (Nunnally,
1978). The earliest standardized questionnaires in this area
focused on the measurement of computer satisfaction (e.g., the
Gallagher Value of MIS Reports Scale and the Hatcher and
Diebert Computer Acceptance Scale) but were not designed for
the assessment of usability following participation in scenario-
based usability tests (see LaLomia & Sidowski, 1990, for
a review of computer satisfaction questionnaires published
between 1974 and 1988). The first standardized usability ques-
tionnaires appropriate for usability testing appeared in the late
1980s (Chin, Diehl, & Norman, 1988; Kirakowski & Dillon,
1988; Lewis, 1990a, 1990b). Some standardized usability ques-
tionnaires are for administration at the end of a study. Others
are for a quick, more contextual assessment at the end of each
task or scenario. Currently, the most widely used standard-
ized usability questionnaires for assessment of the perception of
usability at the end of a study (after completing a set of test sce-
narios) and those cited in national and international standards
(ANSI, 2001; ISO, 1998) are as follows:
The Questionnaire for User Interaction Satisfaction
(Chin et al., 1988)
The Software Usability Measurement Inventory
(Kirakowski & Corbett, 1993; McSweeney, 1992)
The PSSUQ (Lewis, 1990a, 1992, 1995)
The System Usability Scale (SUS; Brooke, 1996, 2013; Sauro, 2011)
There is not much controversy about the utility of standardized
questionnaires in usability engineering. Unsurprisingly, stan-
dardized usability questionnaires are more reliable than ad hoc
questionnaires (Hornbæk, 2006; Hornbæk & Law, 2007; Sauro
& Lewis, 2009). As with other aspects of applied statistics,
however, there are controversies in psychometrics inherited by
usability practitioners. A complete treatment of the topic is
beyond the scope of this article. For a recent special issue on
the topic of developing and evaluating scales for use in studies
of human-computer interaction, see Lindgaard and Kirakowski
(2013). Here I focus on a few controversies of particular interest
to usability practitioners and scientists, particularly with regard
to the robustness (tolerance to deviation from specified use) of
standardized usability questionnaires.
5.1. One Point of View
As mentioned previously, a standardized questionnaire has a
specific set of questions presented in a specified order (which
could be random order) using a specified format, with spe-
cific rules for producing metrics. Any deviation, no matter how
slight, from these specifications makes it possible that the result-
ing measures would be invalid. These deviations might include
changing the wording of an item, changing the number of scale
steps or step labels, or using the questionnaire in a setting differ-
ent from its development setting. Also, to avoid artifacts such as
the extreme response bias and the acquiescence bias, the items
in a standardized questionnaire that measures sentiments such
as satisfaction should have a mixed tone, about half positive and
half negative. Mixing the tone of items also provides a way to
check that respondents are making some effort when complet-
ing the questionnaire, at least enough to avoid marking items in
a way that is clearly careless.
5.2. Another Point of View
Robust psychometric instruments should be able to toler-
ate some deviation from specification. When there is deviation
from specification, the question is whether the practitioner
or researcher has merely bent the rules or has broken the
instrument. There is evidence from the psychometric litera-
ture specific to usability engineering that standardized usability
questionnaires are reasonably robust.
All-positive versus mixed-tone items. When I worked with
the team at IBM that produced the PSSUQ and Computer
System Usability Questionnaire (CSUQ) in 1988, we had quite
a bit of discussion regarding whether to use a mixed or consis-
tently positive item tone. Ultimately, we decided to be consis-
tently positive, even though that was not the prevailing practice
in questionnaire development (Lewis, 1995). Our primary con-
cern was that varying the tone would make the questionnaires
more difficult for users to complete, and as a consequence might
increase the frequency of user error in marking items (Lewis,
1999, 2002). As a result, the most common criticism of these
questionnaires has been the consistently positive tone of the
items (e.g., Travis, 2008).
There are a number of articles from different literatures,
however, that have been critical of mixed tone because it can
create undesirable structure in a metric in which positive items
align with one factor and negative items align with the other
(Barnette, 2000; Davis, 1989; Pilotte & Gable, 1990; Schmitt &
Stults, 1985; Schriesheim & Hill, 1981; Stewart & Frye, 2004;
Wong, Rindfleisch, & Burroughs, 2003). It is also possible that
mixed tone could contribute to three types of errors in practice
(Sauro & Lewis, 2011), specifically the following:
Misinterpretation: Users might respond to items forced
into a negative tone in a way that is not the simple
negative of the positive version of the item.
Mistake: Users might not intend to respond differently
to mixed-tone items but might forget to reverse their
score, accidentally agreeing with a negative item when
they meant to disagree.
Miscode: To generate a composite overall score from
mixed-tone items, it is necessary to reverse the scoring
of the negative-tone items before combination with the
positive-tone items. Failure to perform this step would
result in incorrect composite values, with the errors not
necessarily easy to detect.
The SUS (Brooke, 1996), probably the most widely used
standardized usability questionnaire (Sauro & Lewis, 2009), is
an example of an instrument with mixed tone, composed of
10 items with alternating positive and negative tone. As part
of the eighth CUE workshop (Molich, Kirakowski, Sauro, &
Tullis, 2009), eight of 15 teams used the SUS, and one of those
had miscoded results. As part of a meta-analysis of prototypical
usability metrics (Sauro & Lewis, 2009), two of 19 contributed
SUS data sets had coding errors. Taken together, this suggests
that about 10% of SUS data sets may have coding issues.
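For readers unfamiliar with the scoring rule, the following minimal sketch (in Python) implements the standard SUS computation described by Brooke (1996): odd-numbered items contribute their rating minus 1, even-numbered items are reverse coded and contribute 5 minus their rating, and the sum is multiplied by 2.5. Omitting the reverse-coding step for the even-numbered items produces exactly the miscoding error described above.

```python
def sus_score(responses):
    """Standard SUS scoring (Brooke, 1996) for a list of 10 ratings (1-5),
    given in item order. Odd-numbered (positive-tone) items contribute
    (rating - 1); even-numbered (negative-tone) items are reverse coded
    and contribute (5 - rating). The sum is scaled to 0-100."""
    if len(responses) != 10:
        raise ValueError("The SUS has exactly 10 items.")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # index 0 is Item 1 (odd, positive)
        for i, r in enumerate(responses)
    ]
    return sum(contributions) * 2.5

# A respondent who strongly agrees with every positive item (5) and strongly
# disagrees with every negative item (1) gets the maximum score of 100.
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
```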
To more systematically investigate the effect of all-positive
versus mixed tone items, Sauro and Lewis (2011) created an all-
positive version of the SUS and compared its scores with those
of the standard (mixed-tone) version. There were no signifi-
cant differences between the means of the overall SUS scores,
the means of the odd items (positive tone in both versions), or
the means of the even items (positive tone in the all-positive
version, negative tone in the standard version). There was a sig-
nificant difference in the means of the odd- and even-numbered
items, but the difference was consistent across the two versions
of the questionnaire (see Figure 3). There were no significant
differences in either acquiescence or extreme response bias
between the versions. Examination of responses to the stan-
dard version indicated about 17% of completed questionnaires
had an internal inconsistency consistent with a user mistake in
marking one or more items.
FIG. 3. Interaction between odd and even items of the standard and positive versions of the System Usability Scale (SUS). [Figure not reproduced; the plot shows mean composite item ratings after recoding for the odd and even items of each version.]
It is possible to change items so drastically that it affects the
metrics (Sauro, 2010b; Sauro & Lewis, 2012). In an experiment
exploring the manipulation of item tone and intensity, volun-
teer participants rated the Usability Professionals Association
website using one of five versions of the SUS—an all-positive
extreme, an all-negative extreme, one of two versions of an
extreme mix (half positive and half negative extreme), or the
standard SUS. For example, the extreme negative version of
the SUS Item 4 was “I think that I would need a permanent
hot-line to the help desk to be able to use the web site.” The
extreme positive and extreme negative items were significantly
different from the original SUS, consistent with prior research
showing that people tend to agree with statements that are
close to their attitude and to disagree with all other statements
(Spector, Van Katwyk, Brannick, & Chen, 1997; Thurstone,
1928). By rephrasing items to extremes, only respondents who
passionately favored the usability of the Usability Professionals
Association website tended to agree with the extremely phrased
positive statements—resulting in a significantly lower average
score. Likewise, only respondents who passionately disfavored
its usability agreed with the extremely negative statements—
resulting in a significantly higher average score.
Using standardized usability questionnaires outside of a
usability testing context. The psychometric literature includes
research that illustrates invariance in psychometric properties
across different contexts of measurement (e.g., Bangor, Kortum,
& Miller, 2008; Davis & Venkatesh, 1996; Kortum & Bangor,
2013; Lewis, 1995, 2002; Lewis & Mayes, 2014 [included in
this issue]; Sauro & Lewis, 2012). For example, Lewis (1995,
2002) used a variety of methods to collect data for the PSSUQ
and CSUQ, two questionnaires that have essentially the same
items but differ slightly to support use in the lab (PSSUQ) and
as a mailed or online survey (CSUQ). Regardless of the data col-
lection method and slight differences in the content of the items,
the resulting factor structures were identical, as were estimates
of scale reliability and validity.
The SUS also seems to have similar psychometric proper-
ties when used in the lab or when used as part of a web survey
(Bangor et al., 2008; Grier, Bangor, Kortum, & Peres, 2013;
Kortum & Bangor, 2013), at least with regard to reliability and
concurrent validity. There have been some inconsistencies in
its content validity (factor structure), but that does not seem
to be related to its context of measurement (Sauro & Lewis,
2012). The SUS also seems to tolerate other minor changes to
its wording, for example, using “website” or a product name in
place of the original “system,” or the replacement of the word
“cumbersome” with “awkward” (Bangor et al., 2008; Finstad,
2006; Lewis & Sauro, 2009). In a study comparing different
standardized usability questionnaires, the SUS was the fastest
to converge on its large-sample mean (Tullis & Stetson, 2004).
5.3. Lessons Learned
When conducting usability studies, practitioners should use
one of the currently available standardized usability question-
naires. Table 1 lists a number of questionnaires along with some
of their key properties.
It is possible to bend these standardized questionnaires to
an extent, but it is also possible to break them if the devi-
ations from standardization are extreme. Minor changes in
wording should not typically have extreme effects, and the ques-
tionnaires should be robust against the inclusion of additional
custom items. With minor adjustment as needed, the ques-
tionnaires should work well in either a laboratory or remote
usability testing context, or as part of a mailed or online survey.
It is important to plan for whether the measurements will
need interpretation in isolation or will be comparative (e.g.,
comparing competitive products or different versions of a
product). If comparative, then any of the currently available
questionnaires should suffice, but if there is a need for inter-
pretation without comparison, then it would be wise to select a
questionnaire that has at least some type of norms available.
From its inception, the developers of the Software Usability
Measurement Inventory have maintained an extensive norma-
tive database (Kirakowski, 1996). One of the reasons that the
SUS has been gaining in popularity is the relatively recent
publication of normative data to aid in the interpretation of
SUS scores (Bangor et al., 2008; Kortum & Bangor, 2013;
Sauro & Lewis, 2012). There is also some normative data avail-
able for the PSSUQ/CSUQ (Lewis, 2002; Sauro & Lewis,
2012). Keep in mind, however, that variation in products and
tasks can weaken the generalizability of the norms (Cavallin,
Martin, & Heylighen, 2007). As for item wording, slight
deviations should not cause a serious problem, but extreme
deviations probably will. For any planned deviations, if possible,
check the data to ensure that nothing has obviously broken.

TABLE 1
Key Characteristics of Some Standardized Usability Questionnaires

Questionnaire | Requires License Fee | No. of Items | No. of Subscales | Global Reliability | Validity Notes | References
QUIS | Yes ($50–750) | 27 | 5 | 0.94 | Construct validity; evidence of sensitivity | Chin et al., 1988
SUMI | Yes (€0–1000) | 50 | 5 | 0.92 | Construct validity; evidence of sensitivity; availability of norms | Kirakowski, 1996
USE | Unknown | 30 | 4 | Not published | Not published | Lund, 1998, 2001
PSSUQ/CSUQ | No | 16 | 3 | 0.94 | Construct validity; concurrent validity; evidence of sensitivity; some normative information | Lewis, 1995, 2002
SUS | No | 10 | 2 | 0.92 | Construct validity; evidence of sensitivity; some normative information | Brooke, 1996, 2013; Sauro, 2011
UMUX | No | 4 | 2 | 0.81, 0.87, 0.97 | Construct validity; evidence of sensitivity | Finstad, 2010, 2013; Lewis, 2013
UMUX-LITE | No | 2 | 1 | 0.83 | Construct validity; evidence of sensitivity | Lewis et al., 2013

Note. QUIS = Questionnaire for User Interaction Satisfaction; SUMI = Software Usability Measurement Inventory; USE = Usefulness, Satisfaction, and Ease of Use; PSSUQ = Post-Study System Usability Questionnaire; CSUQ = Computer System Usability Questionnaire; SUS = System Usability Scale; UMUX = Usability Metric for User Experience.
5.4. Lessons Yet to Be Learned
There are still lessons to be learned in the domain of standardized
usability measurement—still work to do. For example, what
is the real factor structure of the SUS? What is the status of the
translation of standardized usability questionnaires into other
languages? How effective are the new, short questionnaires
designed to produce SUS-like measurements, the Usability
Metric for User Experience (UMUX) and UMUX-LITE? How
effective is the Emotional Metric Outcomes (EMO), a new
questionnaire designed to assess the emotional consequences of
interaction?
Factor structure of the SUS. The original intent was for
the SUS to be a unidimensional (one factor) measurement of
perceived usability (Brooke, 1996). Once researchers began to
publish data sets (or correlation matrices) from sample sizes
large enough to support factor analysis, it began to appear that
the items of the SUS might align with two factors. Data from
three independent studies (Borsci, Federici, & Lauriola, 2009;
Lewis & Sauro, 2009) indicated a consistent two-factor struc-
ture (with Items 4 and 10 aligning on a factor separate from
the remaining items). Analyses conducted since 2009 (Lewis,
Utesch, & Maher, 2013; Sauro & Lewis, 2011; and a number
of unpublished analyses) have typically resulted in a two-factor
structure but have not replicated the item-factor alignment that
seemed apparent in 2009. The more recent analyses have been
somewhat consistent with a general alignment of positive- and
negative-tone items on separate factors—the type of uninten-
tional structure that can occur with sets of mixed-tone items (see
the earlier section on all-positive vs. mixed-tone items). It would
be helpful for usability practitioners or researchers who have
fairly large-sample data sets of SUS questionnaires to publish
the results of factor analysis of their data, or at least to publish
the correlation matrix of the items so other researchers could
conduct factor analyses.
Translation into other languages. There is more to the
translation of a standardized usability questionnaire into another
language than simply translating the wording of the items. It is
also necessary to conduct psychometric analyses to ensure that
the translated questionnaire has the same (or similar) proper-
ties as the source questionnaire (van de Vijver & Leung, 2001).
There have been two recent translations, one of the CSUQ
into Turkish (Erdinç & Lewis, 2013) and one of the SUS into
Slovene (Blažica & Lewis, 2014), but there are many more
opportunities for usability researchers to extend the benefits
of standardized usability measurement to other languages and
cultures.
There may be times when practitioners will need to work
with translated questionnaires that have not yet undergone psy-
chometric revalidation. For example, if the study will have a
small sample size, then there will not be sufficient data to con-
duct factor analysis or other psychometric analyses. There is
little research on the extent to which doing this would pro-
duce results that are useful or results that are misleading, so
practitioners finding themselves in this situation should proceed
with caution. It is possible, over time (possibly years), to collect
enough cases to conduct revalidation analyses (Lewis, 2002).
With the advent of web survey tools and remote unmoderated
usability testing (Albert, Tullis, & Tedesco, 2010), this process
could occur much more rapidly than in the past.
The UMUX and UMUX-LITE. As widely used as the SUS
is, some practitioners have a need for a standardized question-
naire that has fewer than 10 items. This need is most pressing
when standardized usability measurement is one part of a larger
poststudy or online questionnaire (Lewis et al., 2013). Finstad
(2010, 2013) created the UMUX to address that need. The
UMUX is a relatively new standardized usability questionnaire
designed to get a measurement of perceived usability consistent
with the SUS, but using fewer items. After standard psychomet-
ric development, the final version of the UMUX had four items,
two with positive tone and two with negative. Using a recod-
ing scheme similar to the SUS, a UMUX score can range from
0 to 100. Finstad reported a UMUX reliability of 0.94 and an
extremely high correlation of 0.96 with concurrently collected
SUS scores.
Lewis et al. (2013) took the UMUX as a starting point for
an even shorter questionnaire, just using the two positive-tone
items of the UMUX. The UMUX-LITE items were “This sys-
tem’s capabilities meet my requirements” and “This system is
easy to use.” Data from two independent surveys demonstrated
adequate psychometric quality of the questionnaire. Estimates
of reliability were .82 and .83—excellent for a two-item
instrument. Concurrent validity was also high, with significant
correlation with the SUS (r = .81) and with likelihood-to-recommend
scores (r = .73). UMUX-LITE score means were slightly
lower than those for the SUS but were easily adjusted using
linear regression to match the SUS scores. Given its parsimony
(two items), reliability, validity, structural basis (usefulness and
usability), and, after applying the corrective regression formula,
its correspondence to SUS scores, the UMUX-LITE appears to
be a promising alternative to the SUS when it is not desirable to
use a 10-item instrument.
Because they are such new metrics, few practitioners have
used the UMUX or UMUX-LITE. For some period, it would be
useful for practitioners who use the SUS to also use the UMUX
(which contains the UMUX-LITE) and to report their observed
correspondences among the metrics to help the usability engi-
neering community develop an improved understanding of the
potential utility of these new metrics.
The EMO. Most standardized usability questionnaires have
focused on assessing satisfaction with usability or perceived
usability, and more from a cognitive than an emotional per-
spective (Agarwal & Meyer, 2009; Lottridge, Chignell, &
Jovicic, 2011). The growing trend toward user experience
(UX) design has created a need for a concise, psychometri-
cally qualified measurement of the emotional consequences of
interaction. There have been some previous efforts to develop
instruments with a more emotional focus (Benedek & Miner,
2002; Hassenzahl, 2001, 2004; Tullis & Albert, 2008, 2013),
but none that have directly addressed the consequences of
interaction.
Lewis and Mayes (this issue) developed and evaluated the
EMO questionnaire—a new questionnaire designed to assess
the emotional outcomes of interaction, especially the interac-
tion of customers with service-provider personnel or software.
The EMO is a concise multifactor standardized questionnaire
that provides an assessment of transaction-driven personal and
relationship emotional outcomes, both positive and negative.
Psychometric evaluation showed that the EMO and its compo-
nent scales had high reliability and concurrent validity with loy-
alty and overall experience metrics in a variety of measurement
contexts. Concurrent measurement with the SUS indicated that
a reported significant correlation of the SUS with likelihood-
to-recommend ratings (Sauro & Lewis, 2012) may be due to
emotional rather than utilitarian aspects of the SUS.
Like the UMUX and UMUX-LITE, the EMO is a new metric
with desirable psychometric properties. One of its current weak-
nesses is that there is no EMO data yet collected in the context
of a usability study. There is every reason to believe it should
be robust enough to have the same psychometric properties in
that new context of measurement, but it is important to verify
this. Usability practitioners and researchers who can include
the EMO in their battery of post-study questionnaires should do
so—especially those who desire a metric that has a strong rela-
tionship to loyalty metrics such as likelihood-to-recommend.
To as great an extent as possible, they should publish their
findings.
6. WHAT ABOUT THE MAGIC NUMBER 5 (OR 8, OR 10,
OR 30)?
In the context of usability testing, “magic” numbers refer to
rules of thumb for sample sizes. A common rule of thumb for
summative usability tests, based on a common convention in
applied statistics, is to have a sample size of at least 30. For
formative usability testing, the best known magic number is 5
(Nielsen, 2000; Nielsen & Landauer, 1993), although 8 (Perfetti
& Landesman, 2001; Spool & Schroeder, 2001) and 10 (Hwang
& Salvendy, 2010) have also appeared in the literature. Do these
magic numbers have any validity?
6.1. One Point of View
Summative usability testing. According to the central limit
theorem, as the sample size increases, the distribution of the
mean becomes more and more normal, regardless of the nor-
mality of the underlying distribution. Some simulation studies
have shown that for a wide variety of distributions (but not all;
see Bradley, 1978), the distribution of the mean becomes near
normal when n = 30. Another consideration is that it is slightly
simpler to use z scores rather than t scores because z scores do
not require the use of degrees of freedom. As shown in Figure 4,
by the time there are about 30 degrees of freedom the value of
t closely approaches the value of z. Consequently, there can be
a feeling that you don’t have to deal with small samples that
require special small-sample treatment (Cohen, 1990).
FIG. 4. t approaches z as n increases. [Figure not reproduced; the plot shows critical values of t at the .10, .05, and .01 levels across 1 to 30 degrees of freedom, converging toward the corresponding values of z.]
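For readers who want to see that convergence numerically rather than graphically, the short sketch below (which assumes the scipy library is available) prints the two-sided .05 critical value of t for several degrees of freedom next to the corresponding value of z (about 1.96).

```python
# Sketch of the convergence of t toward z (assumes scipy is installed).
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # two-sided .05 critical value of z, ~1.96
for df in (1, 5, 10, 15, 20, 25, 30):
    t_crit = stats.t.ppf(0.975, df)
    print(f"df = {df:2d}: t = {t_crit:6.3f}   (z = {z_crit:.3f})")
```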
Formative usability testing. Early descriptions of formative
usability testing (Al-Awar et al., 1981; Chapanis, 1981) did not
provide any guidance regarding the sample size per iteration
other than saying to run “a few” participants. Lewis (1982) proposed
using the cumulative binomial probability formula, P(at
least one occurrence) = 1 − (1 − p)^n, as an aid in sample size
estimation for these types of problem-discovery studies. In this
formula, n is the sample size and p is the likelihood of occurrence
of a usability problem (or whatever the investigator is
trying to discover via observation).
In the early 1990s there were a number of fairly large-sample
formative usability studies run for the purpose of exploring
the relationship between sample size and problem discovery
(Nielsen & Molich, 1990; Virzi, 1990,1992). Nielsen and
Landauer (1993) collected a number of the studies from the
literature and from the practitioner community and determined
that the average value for p was .31. They determined that if
you use this value for p, set n to 5, and then compute the cumulative
binomial probability that an event with p = .31 will occur
at least once out of five opportunities, that probability is about
85% (1 − (1 − .31)^5 = .8436). In other words, the first five
participants observed in a formative usability study should usually
reveal about 85% of the problems available for discovery in that
iteration, where the properties of the study (type of participants
and tasks employed) place limits on what is discoverable. But
over time, in the minds of many usability practitioners, the rule
became simplified to “All you need to do is watch five people
to find 85% of a product’s usability problems.”
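As a quick check on the arithmetic, the following sketch implements the cumulative binomial discovery formula and reproduces the roughly 85% figure for p = .31 and n = 5.

```python
def p_discovery(p, n):
    """Probability of observing a problem with occurrence probability p
    at least once across n participants: 1 - (1 - p)**n."""
    return 1 - (1 - p) ** n

print(round(p_discovery(0.31, 5), 4))  # 0.8436, i.e., about 85%
```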
6.2. Another Point of View
Summative usability testing. The idea that even with the
t-distribution (as opposed to the z-distribution) you need to
have a sample size of at least 30 is inconsistent with the
history of the development of the distribution. In 1899, William
S. Gosset joined the Guinness brewery. Surprisingly, brew-
ing has certain economic realities in common with moderated
usability testing. “The nature of the process of brewing, with
its variability in temperature and ingredients, means that it is
not possible to take large samples over a long run” (Cowles,
1989, pp. 108–109). For much of his research, Gossett per-
formed an early version of Monte Carlo simulations (Stigler,
1999) by preparing 3,000 cards labeled with physical measure-
ments taken on criminals, shuffling them, then dealing them
out into 750 groups of n = 4, a much smaller sample size
than 30.
When the cost of a sample is expensive, as it typically is in
many types of user research (e.g., moderated usability testing),
it is important to estimate the needed sample size as accu-
rately as possible, with the understanding that it is an estimate.
The likelihood that 30 is exactly the right sample for a given
set of circumstances is very low. It is more appropriate to use
the formulas for computing the significance levels of a statis-
tical test and, using algebra to solve for n, to convert them
to sample size estimation formulas (e.g., see Sauro & Lewis,
2012).
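As one illustration of that approach, the sketch below (assuming scipy is available, and using hypothetical values for the expected standard deviation and the desired margin of error) searches for the smallest n whose t-based confidence interval meets a specified precision target; this is equivalent to solving the confidence interval formula for n and iterating.

```python
# Sketch of t-based sample size estimation for a summative usability metric.
# The inputs are hypothetical: s is the expected standard deviation of the
# measure and d is the desired margin of error (confidence interval
# half-width) in the same units.
import math
from scipy import stats

def sample_size_for_margin(s, d, alpha=0.05, n_max=1000):
    """Smallest n whose two-sided (1 - alpha) confidence interval for a mean
    has a half-width no larger than d, given standard deviation s."""
    for n in range(2, n_max + 1):
        t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
        if t_crit * s / math.sqrt(n) <= d:
            return n
    raise ValueError("No n up to n_max meets the precision target.")

# Example: an expected standard deviation of 15 and a desired margin of
# error of 10 points at 95% confidence yields n = 12.
print(sample_size_for_margin(15, 10))  # 12
```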
Formative usability testing. Sometimes the Magic Number
5 is correct, but only under special circumstances (Borsci et al.,
2013; Lewis, 1994). If the design space under investigation
has many problems available for discovery whose probabili-
ties of occurrence are markedly different from 0.31, then there
is no guarantee that observing five participants will lead to
the discovery of 85% of the problems available for discovery.
For example, in their famous “Eight is Not Enough” paper,
Perfetti and Landesman (2001) found that testing five users
fell far short of achieving 85% problem discovery. A reanal-
ysis of their data indicated that their value of p was probably
about .03—one tenth of the value needed for the Magic
Number 5 rule of thumb to work (Lewis, 2006; Sauro & Lewis,
2012).
Rather than relying on magic numbers of any kind, it may be
more reasonable to use formulas or tables based on the cumula-
tive binomial probability formula when planning sample sizes
for one-shot or iterative formative usability testing. Using
algebra to solve for n, the sample size formula based on
1 − (1 − p)^n is n = ln(1 − discoveryGoal)/ln(1 − p). For example,
if the desired discovery goal is to discover at least 80% of the
problems that have a likelihood of occurrence of 0.15, where
discovery means to find them at least once, then n = 9.9, which
rounds up to 10. Alternatively, practitioners can use a table like
the one shown in Table 2 (built using the preceding sample size
formula) to get a sense of what problem discovery to expect as
a function of sample size.

TABLE 2
Likelihood of Discovery for Various Sample Sizes and Probabilities of Occurrence

p | n = 1 | n = 2 | n = 3 | n = 4 | n = 5 | n = 10 | n = 15 | n = 20
.01 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.10 | 0.14 | 0.18
.05 | 0.05 | 0.10 | 0.14 | 0.19 | 0.23 | 0.40 | 0.54 | 0.64
.10 | 0.10 | 0.19 | 0.27 | 0.34 | 0.41 | 0.65 | 0.79 | 0.88
.15 | 0.15 | 0.28 | 0.39 | 0.48 | 0.56 | 0.80 | 0.91 | 0.96
.25 | 0.25 | 0.44 | 0.58 | 0.68 | 0.76 | 0.94 | 0.99 | 1.00
.50 | 0.50 | 0.75 | 0.88 | 0.94 | 0.97 | 1.00 | 1.00 | 1.00
.90 | 0.90 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
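The following sketch implements the sample size formula given above and also generates discovery likelihoods like those in Table 2; the first line of output reproduces the n = 10 result for an 80% discovery goal and p = .15.

```python
import math

def n_for_discovery(goal, p):
    """Smallest n for which 1 - (1 - p)**n >= goal,
    i.e., ln(1 - goal) / ln(1 - p), rounded up."""
    return math.ceil(math.log(1 - goal) / math.log(1 - p))

print(n_for_discovery(0.80, 0.15))  # 10, matching the example above

# Likelihood of discovering a problem at least once, by p and n (cf. Table 2).
for p in (0.01, 0.05, 0.10, 0.15, 0.25, 0.50, 0.90):
    row = "  ".join(f"{1 - (1 - p) ** n:.2f}" for n in (1, 2, 3, 4, 5, 10, 15, 20))
    print(f"p = {p:.2f}: {row}")
```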
For example, suppose circumstances limit a practitioner to
running a single-shot study with five participants. As shown
in Table 2, the expectation is that the study will expose (at
least once) almost all of the problems that have a probability
of occurrence of 0.5 or greater. Then, in descending order, the
expectation is discovery of about 76% of problems where p = .25,
56% of problems where p = .15, 41% of problems where p = .1,
23% of problems where p = .05, and 5% of problems where
p = .01. In other words, a study with n = 5 is likely to
leave many problems undiscovered if their probability of occur-
rence is less than .5, but, more optimistically, the study will
probably uncover enough problems to give the developers the
information needed to improve the product’s usability. This is a
more nuanced approach than a simple magic number, but also
more likely to lead to more realistic expectations on the part
of practitioners and stakeholders. Note that the incompleteness
of discovery for lower frequency problems may contribute to
the discrepancies observed in the usability problem lists gener-
ated in the CUE studies (see the Is Formative Usability Testing
Reliable? section). When you roll dice, you don’t expect the
same number to come up every time.
6.3. Lessons Learned
There are rational and empirical bases underlying the magic
number rules of thumb for sample size requirements, but they
are optimal only under very specific conditions. Rather than
referring to magic numbers, it would be better practice to use
the tools that are available to guide sample size estimation, both
for summative and for formative usability testing.
6.4. Lessons Yet to Be Learned
Summative usability testing. Given its close connection
to traditional experimentation and use of inferential statistics,
there are probably relatively few lessons yet to be learned for
statistical analyses associated with summative usability test-
ing. It is, however, important for at least a subset of usability
researchers to keep abreast of developments in applied statis-
tics and to publish their findings. For example, there have been
recent articles on new developments for computing binomial
confidence intervals (Agresti & Coull, 1998; Sauro & Lewis,
2005) and a better formula for chi-square tests (Campbell, 2007)
when sample sizes are small (for details, see Sauro & Lewis,
2012). Continuing concerns about the effects of violating the
assumptions of parametric statistical methods (such as the t
test and analysis of variance) may be addressed in the future
by bootstrapping (or similar resampling) procedures (Chernick,
2008), but we have yet to see a set of easy-to-use bootstrap-
ping analysis tools, especially with a focus on the concerns of
user researchers. We also have yet to see systematic analyses
of the potential benefits and liabilities of bootstrapping methods
in user research, especially given the documented robustness of
parametric statistical procedures (Sauro & Lewis, 2012).
Formative usability testing. The methods for estimating
sample size requirements for formative usability testing are
much younger than those for summative usability testing.
Consequently, there is a richer set of opportunities to learn more
lessons. Of particular interest are questions about the mean-
ing of pand potential problems associated with basing these
methods on the binomial probability formula.
As previously discussed, the value of pis a critical com-
ponent in the cumulative binomial probability formula. There
are outstanding questions, however, about what, if anything, it
means. For example, if the value of pis small, as in the “Eight
is Not Enough” study, then that indicates that there was very
little overlap in the problems experienced across participants.
Was that due to the product being mature and free of errors
likely to affect a large proportion of users? If so, then the small
value of pwould be indicative of a generally positive situation.
If the value of pis large, then there is considerable overlap in
the problems experienced across participants. One would gen-
erally expect this outcome for new products that have not yet
undergone usability testing, and would also be a generally pos-
itive outcome. But might there also be situations in which the
value of pwould indicate a negative outcome—perhaps a low
value when you’d expect a high one, or vice versa? There is
opportunity for new research on this topic.
A number of publications have criticized the use of the
binomial probability formula as the basis for estimating sam-
ple sizes for formative usability studies. A key assumption of
the binomial model is that the value of pis constant from
trial to trial (Ennis & Bi, 1998). It seems likely that this
assumption does not strictly hold in user research due to dif-
ferences in users’ capabilities and experiences (Caulton, 2001;
Schmettow, 2008; Woolrych & Cockton, 2001). The extent to
which this should affect the use of the binomial formula in
modeling problem discovery is an ongoing topic of research
(Briand, El Emam, Freimut, & Laitenberger, 2000; Kanis, 2011;
Lewis, 2001; Schmettow, 2008,2009). Some researchers have
investigated alternative means of discovery modeling that do
not make the assumption of the homogeneity of p, including
the beta-binomial (Schmettow, 2008), logit-normal binomial
(Schmettow, 2009,2012), bootstrapping (Borsci, Londei, &
Federici, 2011), and capture–recapture models (Briand et al.,
2000). These more complex alternative models may turn out
to be advantageous over the simple binomial model, but they
may also have disadvantages, particularly in the sample sizes
required to accurately estimate their parameters. Only future
research will tell.
Note that the sample size estimation procedures provided
earlier in this article for formative usability testing are not
affected by the assumption of homogeneity because they take
as given (not as estimated) one or more specific values of p.
There is still work to do to compare the overall effectiveness of
the approach presented earlier with those driven by other types
of discovery models.
7. CONCLUSION
This article has presented information about five aspects of
usability that have generated published research and discussion
in the usability science and usability engineering communities.
What is usability? Is formative usability testing reliable? Is it
OK to average ratings from multipoint scales? How robust are
standardized usability questionnaires? How useful are “magic
numbers” for planning sample sizes for usability testing? For
each topic, there is coverage of its background, discussion of
the controversies, summarization of the lessons learned, and
description of lessons yet to be learned. Some of the key lessons
learned are as follows:
When discussing usability, it is important to distin-
guish between the goals and practices of summative
and formative usability.
There is compelling rational and empirical support for
the practice of iterative formative usability testing—it
appears to be effective in improving both objective and
perceived usability.
It is permissible to average multipoint scale ratings,
but there are restrictions on the interpretations of the
results.
When conducting usability studies, practitioners
should include one or more of the currently available
standardized usability questionnaires.
Because “magic number” rules of thumb for sample
size requirements for usability tests are optimal only
under very specific conditions, practitioners should
use the tools that are available to guide sample size
estimation rather than relying on magic numbers.
The usability practitioner community owes a substantial debt
to those who have made the effort to share their research,
both through peer-reviewed publication and through books. The
contribution of those who have undergone the peer review pro-
cess is evident in the references cited throughout this article.
For recent book-level treatments of usability and UX issues,
see Albert et al. (2010); Barnum (2010); Lazar, Feng, and
Hochheiser (2010); MacKenzie (2014); Sauro (2010a); Sauro
& Lewis (2012); Shneiderman & Plaisant (2010); and Tullis and
Albert (2013).
I want to end with a call to action. Specifically, I encourage
practitioners as well as researchers to look for opportunities in
their day-to-day work to study and compare different methods
and, most important, to publish the findings. In the long run, this
is how we will move questions from lessons yet to be learned to
lessons learned. Looking back over the past three decades, this
may be the most important lesson learned.
ACKNOWLEDGEMENTS
I express many thanks to Dr. Gavriel Salvendy for his sup-
port throughout my career, giving me significant opportunities
to develop my skills as an author, reviewer, editor, and speaker.
Thanks to the reviewers who responded quickly to requests to
review this paper, and whose comments were insightful and
invaluable. I also thank Pete Kennedy and his wife, Audrey, who
respectively made sure that in my first years at IBM I had plenty
to learn and plenty to eat, and a continuing friendship. Thanks
also to all my coworkers and clients for the fascinating work
of building usable systems and the accompanying intellectual
challenges.
REFERENCES
Abelson, R. P. (1995). Statistics as principled argument. Hillsdale, NJ: Erlbaum.
Agarwal, A., & Meyer, A. (2009). Beyond usability: Evaluating emotional
response as an integral part of the user experience. In Proceedings of CHI
2009 Extended Abstracts on Human Factors in Computing Systems (pp.
2919–2930). Boston, MA: Association for Computing Machinery.
Agresti, A., & Coull, B. (1998). Approximate is better than ‘exact’ for interval
estimation of binomial proportions. The American Statistician,52, 119–126.
Al-Awar, J., Chapanis, A., & Ford, R. (1981). Tutorials for the first-time
computer user. IEEE Transactions on Professional Communication,24,
30–37.
Albert, B., Tullis, T., & Tedesco, D. (2010). Beyond the usability lab.
Burlington, MA: Morgan Kaufmann.
Alonso-Ríos, D., Vázquez-Garcia, A., Mosqueira-Rey, E., & Moret-Bonillo, V.
(2010). Usability: A critical analysis and a taxonomy. International Journal
of Human-Computer Interaction,26, 53–74.
American National Standards Institute. (2001). Common industry format for
usability test reports (ANSI-NCITS 354-2001). Washington, DC: Author.
Baecker, R. M. (2008). Themes in the early history of HCI—Some unanswered
questions. Interactions,15(2), 22–27.
Bailey, G. (1993). Iterative methodology and designer training in human–
computer interface design. In INTERCHI ‘93 Conference Proceedings (pp.
198–205). New York, NY: Association for Computing Machinery.
Bailey, R. W., Allan, R. W., & Raiello, P. (1992). Usability testing vs. heuris-
tic evaluation: A head to head comparison. In Proceedings of the Human
Factors and Ergonomics Society 36th Annual Meeting (pp. 409–413). Santa
Monica, CA: Human Factors and Ergonomics Society.
Bangor, A., Kortum, P. T., & Miller, J. T. (2008). An empirical evaluation
of the System Usability Scale. International Journal of Human–Computer
Interaction,24, 574–594.
Barnette, J. J. (2000). Effects of stem and Likert response option reversals on
survey internal consistency: If you feel the need, there is a better alterna-
tive to using those negatively worded stems. Educational and Psychological
Measurement,60, 361–370.
Barnum, C. M. (2010). Usability testing essentials: Ready, set ... test!
Burlington, MA: Morgan Kaufmann.
Benedek, J., & Miner, T. (2002). Measuring desirability: New methods for
evaluating desirability in a usability lab setting. In Proceedings of the
Usability Professionals’ Association. Orlando, FL: Usability Professionals
Association. Available at: http://www.microsoft.com/usability/uepostings/
desirabilitytoolkit.doc (Accessed June 10, 2014).
Bennett, J. L. (1979). The commercial impact of usability in interactive sys-
tems. Infotech State of the Art Report: Man/Computer Communications,2,
289–297.
Berry, D. C., & Broadbent, D. E. (1990). The role of instruction and verbal-
ization in improving performance on complex search tasks. Behaviour &
Information Technology,9, 175–190.
Bevan, N. (2009). Extending quality in use to provide a framework for usability
measurement. In M. Kurosu (Ed.), Human centered design, HCII 2009 (pp.
13–22). Heidelberg, Germany: Springer-Verlag.
Bevan, N., Kirakowski, J., & Maissel, J. (1991). What is usability? In H.
J. Bullinger (Ed.), Human Aspects in Computing, Design and Use of
Interactive Systems and Work with Terminals, Proceedings of the 4th
International Conference on Human–Computer Interaction (pp. 651–655).
Stuttgart, Germany: Elsevier Science.
Bias, R. G., & Mayhew, D. J. (1994). Cost-justifying usability. Boston, MA:
Academic.
Blažica, B., & Lewis, J. R. (2014). A Slovene translation of the System Usability
Scale: The SUS-SI. International Journal of Human–Computer Interaction.
In Press.
Boren, T., & Ramey, J. (2000). Thinking aloud: Reconciling theory and practice.
IEEE Transactions on Professional Communications,43, 261–278.
Borsci, S., Federici, S., & Lauriola, M. (2009). On the dimensionality of the
system usability scale: A test of alternative measurement models. Cognitive
Processes,10, 193–197.
Borsci, S., Londei, A., & Federici, S. (2011). The Bootstrap Discovery
Behaviour (BDB): A new outlook on usability evaluation. Cognitive
Processes,12, 23–31.
Borsci, S., Macredie, R. D., Barnett, J., Martin, J., Kuljis, J., & Young, T.
(2013). Reviewing and extending the five-user assumption: A grounded pro-
cedure for interaction evaluation. ACM Transactions on Computer-Human
Interaction,20, 29:01–29:23.
Bowers, V., & Snyder, H. (1990). Concurrent versus retrospective verbal proto-
cols for comparing window usability. In Proceedings of the Human Factors
Society 34th Annual Meeting (pp. 1270–1274). Santa Monica, CA: Human
Factors Society.
Bradley, J. V. (1976). Probability; decision; statistics. Englewood Cliffs, NJ:
Prentice-Hall.
Bradley, J. V. (1978). Robustness? British Journal of Mathematical and
Statistical Psychology,31, 144–152.
Briand, L. C., El Emam, K., Freimut, B. G., & Laitenberger, O. (2000). A com-
prehensive evaluation of capture-recapture models for estimating software
defect content. IEEE Transactions on Software Engineering,26, 518–540.
Brooke, J. (1996). SUS—A “quick and dirty” usability scale. In P. W. Jordan
(Ed.), Usability evaluation in industry (pp. 189–194). London, UK: Taylor
& Francis.
Brooke, J. (2013). SUS: A retrospective. Journal of Usability Studies,8(2),
29–40.
Campbell, I. (2007). Chi-squared and Fisher-Irwin tests of two-by-two tables
with small sample recommendations. Statistics in Medicine,26, 3661–3675.
Capra, M. G. (2007). Comparing usability problem identification and descrip-
tion by practitioners and students. In Proceedings of the Human Factors and
Ergonomics Society 51st Annual Meeting (pp. 474–478). Santa Monica, CA:
Human Factors and Ergonomics Society.
Card, S. K., Moran, T. P., & Newell, A. (1983). The psychology of human-
computer interaction. London, UK: Erlbaum.
Caulton, D. A. (2001). Relaxing the homogeneity assumption in usability
testing. Behaviour & Information Technology,20, 1–7.
Cavallin, H., Martin, W. M., & Heylighen, A. (2007). How relative absolute can
be: SUMI and the impact of the nature of the task in measuring perceived
software usability. Artificial Intelligence and Society,22, 227–235.
Chapanis, A. (1981). Evaluating ease of use. Unpublished manuscript prepared
for IBM, Boca Raton, FL. (Available from J. R. Lewis, ADDRESS).
Chernick, M. R. (2008). Bootstrap methods: A guide for practitioners and
researchers. Hoboken, NJ: Wiley.
Chin, J. P., Diehl, V. A., & Norman, K. L. (1988). Development of an instru-
ment measuring user satisfaction of the human–computer interface. In
Proceedings of CHI 1988 (pp. 213–218). Washington, DC: Association for
Computing Machinery.
Clemmensen, T., Hertzum, M., Hornbæk, K., Shi, Q., & Yammiyavar, P. (2009).
Cultural cognition in usability evaluation. Interacting with Computers,21,
212–220.
Cohen, J. (1990). Things I have learned (so far). American Psychologist,45,
1304–1312.
Cowles, M. (1989). Statistics in psychology: An historical perspective.
Hillsdale, NJ: Erlbaum.
Davis, F. D. (1989). Perceived usefulness, perceived ease of use, and user
acceptance of information technology. MIS Quarterly,13, 319–339.
Davis, F. D., & Venkatesh, V. (1996). A critical assessment of potential mea-
surement biases in the Technology Acceptance Model: Three experiments.
International Journal of Human-Computer Studies,45, 19–45.
Dumas, J. S. (2003). User-based evaluations. In J. A. Jacko & A. Sears (Eds.),
The human–computer interaction handbook (pp. 1093–1117). Mahwah, NJ:
Erlbaum.
Dumas, J. S. (2007). The great leap forward: The birth of the usability
profession (1988–1993). Journal of Usability Studies,2, 54–60.
Dumas, J., & Redish, J. C. (1999). A practical guide to usability testing.
Portland, OR: Intellect.
Ennis, D. M., & Bi, J. (1998). The beta-binomial model: Accounting for
inter-trial variation in replicated difference and preference tests. Journal of
Sensory Studies,13, 389–412.
Erdinç, O., & Lewis, J. R. (2013). Psychometric evaluation of the T-CSUQ:
The Turkish version of the Computer System Usability Questionnaire.
International Journal of Human-Computer Interaction,29, 319–326.
Ericsson, K. A., & Simon, H. A. (1980). Verbal reports as data. Psychological
Review,87, 215–251.
Evers, V., & Day, D. (1997). The role of culture in interface acceptance. In
Proceedings of Interact 1997 (pp. 260–267). Sydney, Australia: Chapman
and Hall.
Finstad, K. (2006). The System Usability Scale and non-native English speak-
ers. Journal of Usability Studies,1, 185–188.
Finstad, K. (2010). The usability metric for user experience. Interacting with
Computers,22, 323–327.
Finstad, K. (2013). Response to commentaries on “The Usability Metric for
User Experience”. Interacting with Computers,25, 327–330.
Frandsen-Thorlacius, O., Hornbæk, K., Hertzum, M., & Clemmensen, T.
(2009). Non-universal usability? A survey of how usability is understood by
Chinese and Danish users. In Proceedings of CHI 2009 (pp. 41–50). Boston,
MA: Association for Computing Machinery.
Gaito, J. (1980). Measurement scales and statistics: Resurgence of an old
misconception. Psychological Bulletin,87, 564–567.
Gould, J. D. (1988). How to design usable systems. In M. Helander (Ed.),
Handbook of human–computer interaction (pp. 757–789). Amsterdam, the
Netherlands: North-Holland.
Gould, J. D., & Boies, S. J. (1983). Human factors challenges in creating a
principal support office system: The Speech Filing System approach. ACM
Transactions on Information Systems,1, 273–298.
Gould, J. D., Boies, S. J., Levy, S., Richards, J. T., & Schoonard, J. (1987). The
1984 Olympic Message System: A test of behavioral principles of system
design. Communications of the ACM,30, 758–769.
Gould, J. D., & Lewis, C. (1984). Designing for usability: Key principles and
what designers think (IBM Tech. Report No. RC-10317). Yorktown Heights,
NY: International Business Machines Corporation.
Gray, W. D., & Salzman, M. C. (1998). Damaged merchandise? A review of
experiments that compare usability evaluation methods. Human–Computer
Interaction,13, 203–261.
Grier, R. A., Bangor, A., Kortum, P., & Peres, S. C. (2013). The System
Usability Scale: Beyond standard usability testing. In Proceedings of the
Human Factors and Ergonomics Society (pp. 187–191). Santa Monica, CA:
Human Factors and Ergonomics Society.
Grove, J. W. (1989). In defence of science: Science, technology, and politics in
modern society. Toronto, Canada: University of Toronto Press.
Harris, R. J. (1985). A primer of multivariate statistics. Orlando, FL: Academic
Press.
Hassenzahl, M. (2000). Prioritizing usability problems: Data driven and judg-
ment driven severity estimates. Behaviour & Information Technology,19,
29–42.
Hassenzahl, M. (2001). The effect of perceived hedonic quality on product
appealingness. International Journal of Human-Computer Interaction,13,
481–499.
Hassenzahl, M. (2004). The interplay of beauty, goodness, and usabil-
ity in interactive products. Human-Computer Interaction,19,
319–349.
Hertzum, M. (2006). Problem prioritization in usability evaluation: From
severity assessments to impact on design. International Journal of Human-
Computer Interaction,21, 125–146.
Hertzum, M. (2010). Images of usability. International Journal of Human-
Computer Interaction,26, 567–600.
Hertzum, M., Clemmensen, T., Hornbæk, K., Kumar, J., Shi, Q., &
Yammiyavar, P. (2007). Usability constructs: A cross-cultural study of
how users and developers experience their use of information systems.
In Proceedings of HCI International 2007 (pp. 317–326). Beijing, China:
Springer-Verlag.
Hertzum, M., Hansen, K. D., & Andersen, H. H. K. (2009). Scrutinising usabil-
ity evaluation: Does thinking aloud affect behaviour and mental workload?
Behaviour & Information Technology,28, 165–181.
Høegh, R. T., & Jensen, J. J. (2008). A case study of three software projects:
Can software developers anticipate the usability problems in their software?
Behaviour & Information Technology,27, 307–312.
Høegh, R. T., Nielsen, C. M., Overgaard, M., Pedersen, M. B., & Stage, J.
(2006). The impact of usability reports and user test observations on devel-
opers’ understanding of usability data: An exploratory study. International
Journal of Human-Computer Interaction,21, 173–196.
Hornbæk, K. (2006). Current practice in measuring usability: Challenges to
usability studies and research. International Journal of Human-Computer
Studies,64, 79–102.
Hornbæk, K. (2010). Dogmas in the assessment of usability evaluation methods.
Behaviour & Information Technology,29, 97–111.
Hornbæk, K., & Law, E. L. (2007). Meta-analysis of correlations among usabil-
ity measures. In Proceedings of CHI 2007 (pp. 617–626). San Jose, CA:
Association for Computing Machinery.
Howard, T. W. (2008). Unexpected complexity in a traditional usability study.
Journal of Usability Studies,3, 189–205.
Howard, T., & Howard, W. (2009). Unexpected complexity in user testing of
information products. In Proceedings of the Professional Communication
Conference (pp. 1–5). Waikiki, HI: Institute of Electrical and Electronics
Engineers.
Hwang, W., & Salvendy, G. (2010). Number of people required for usabil-
ity evaluation: The 10±2 rule. Communications of the ACM,53,
130–133.
International Organization for Standardization. (1998). Ergonomic require-
ments for office work with visual display terminals (VDTs), Part 11,
Guidance on usability (ISO 9241-11: 1998(E)). Geneva, Switzerland:
Author.
Kanis, H. (2011). Estimating the number of usability problems. Applied
Ergonomics,42, 337–347.
Karat, C. (1997). Cost-justifying usability engineering in the software life cycle.
In M. Helander, T. K. Landauer, & P. Prabhu (Eds.), Handbook of human–
computer interaction (2nd ed., pp. 767–778). Amsterdam, the Netherlands:
Elsevier.
Kelley, J. F. (1984). An iterative design methodology for user-friendly natural
language office information applications. ACM Transactions on Information
Systems,2, 26–41.
Kennedy, P. J. (1982). Development and testing of the operator training package
for a small computer system. In Proceedings of the Human Factors Society
26th Annual Meeting (pp. 715–717). Santa Monica, CA: Human Factors
Society.
Kessner, M., Wood, J., Dillon, R. F., & West, R. L. (2001). On the reliability
of usability testing. In J. Jacko & A. Sears (Eds.), Conference on Human
Factors in Computing Systems: CHI 2001 Extended Abstracts (pp. 97–98).
Seattle, WA: Association for Computing Machinery.
Kirakowski, J. (1996). The Software Usability Measurement Inventory:
Background and usage. In P. Jordan, B. Thomas, & B. Weerdmeester (Eds.),
Usability evaluation in industry (pp. 169–178). London, UK: Taylor &
Francis.
Kirakowski, J., & Corbett, M. (1993). SUMI: The Software Usability
Measurement Inventory. British Journal of Educational Technology,24,
210–212.
Kirakowski, J., & Dillon, A. (1988). The Computer User Satisfaction Inventory
(CUSI): Manual and scoring key. Cork, Ireland: Human Factors Research
Group, University College of Cork.
Kortum, P. T., & Bangor, A. (2013). Usability ratings for everyday products
measured with the System Usability Scale. International Journal of Human-
Computer Interaction,29, 67–76.
Krahmer, E., & Ummelen, N. (2004). Thinking about thinking aloud: A com-
parison of two verbal protocols for usability testing. IEEE Transactions on
Professional Communication,47, 105–117.
LaLomia, M. J., & Sidowski, J. B. (1990). Measurements of computer satis-
faction, literacy, and aptitudes: A review. International Journal of Human–
Computer Interaction,2, 231–253.
Landauer, T. K. (1997). Behavioral research methods in human–computer inter-
action. In M. Helander, K. T. Landauer, & P. Prabhu (Eds.), Handbook
of human–computer interaction (2nd ed., pp. 203–227). Amsterdam, the
Netherlands: Elsevier.
Larson, R. C. (2008). Service science: At the intersection of management,
social, and engineering sciences. IBM Systems Journal,47, 41–51.
Lazar, J., Feng, J. H., & Hochheiser, H. (2010). Research methods in human-
computer interaction. Chichester, UK: Wiley.
Lewis, J. R. (1982). Testing small system customer set-up. In Proceedings of the
Human Factors Society 26th Annual Meeting (pp. 718–720). Santa Monica,
CA: Human Factors Society.
Lewis, J. R. (1990a). Psychometric evaluation of a post-study system usabil-
ity questionnaire: The PSSUQ (Tech. Rep. No. 54.535). Boca Raton, FL:
International Business Machines Corp.
Lewis, J. R. (1990b). Psychometric evaluation of an after-scenario question-
naire for computer usability studies: The ASQ (Tech. Rep. No. 54.541).
Boca Raton, FL: International Business Machines Corp.
Lewis, J. R. (1992). Psychometric evaluation of the Post-Study System
Usability Questionnaire: The PSSUQ. In Proceedings of the Human Factors
Society 36th Annual Meeting (pp. 1259–1263). Santa Monica, CA: Human
Factors Society.
Lewis, J. R. (1993). Multipoint scales: Mean and median differences and
observed significance levels. International Journal of Human-Computer
Interaction,5, 383–392.
Lewis, J. R. (1994). Sample sizes for usability studies: Additional considera-
tions. Human Factors,36, 368–378.
Lewis, J. R. (1995). IBM computer usability satisfaction questionnaires:
Psychometric evaluation and instructions for use. International Journal of
Human-Computer Interaction,7, 57–78.
Lewis, J. R. (1996). Reaping the benefits of modern usability evaluation: The
Simon story. In G. Salvendy & A. Ozok (Eds.), Advances in Applied
Ergonomics: Proceedings of the 1st International Conference on Applied
Ergonomics—ICAE ‘96 (pp. 752–757). Istanbul, Turkey: USA Publishing.
Lewis, J. R. (1999). Tradeoffs in the design of the IBM computer usability
satisfaction questionnaires. In Proceedings of HCI International 1999 (pp.
1023–1027). Mahwah, NJ: Erlbaum.
Lewis, J. R. (2001). Evaluation of procedures for adjusting problem-discovery
rates estimated from small samples. International Journal of Human–
Computer Interaction,13, 445–479.
Lewis, J. R. (2002). Psychometric evaluation of the PSSUQ using data from
five years of usability studies. International Journal of Human–Computer
Interaction,14, 463–488.
Lewis, J. R. (2006). Sample sizes for usability tests: Mostly math, not magic.
Interactions,13(6),29–33. (See corrected formula in Interactions, 14(1), 4)
Lewis, J. R. (2011a). Human factors engineering. In P. A. LaPlante (Ed.),
Encyclopedia of software engineering (pp. 383–394). New York, NY: Taylor
& Francis.
Lewis, J. R. (2011b). Practical speech user interface design. Boca Raton, FL:
Taylor & Francis.
Lewis, J. R. (2012). Usability testing. In G. Salvendy (Ed.), Handbook of human
factors and ergonomics (4th ed., pp. 1267–1312). New York, NY: Wiley.
Lewis, J. R. (2013). Critical review of “The Usability Metric for User
Experience”. Interacting with Computers,25, 320–324.
Lewis, J. R., Henry, S. C., & Mack, R. L. (1990). Integrated office software
benchmarks: A case study. In Proceedings of the 3rd IFIP Conference on
Human-Computer Interaction, INTERACT ‘90 (pp. 337–343). Cambridge,
UK: Elsevier Science.
Lewis, J. R., & Sauro, J. (2009). The factor structure of the system usability
scale. In M. Kurosu (Ed.), Human centered design (pp. 94–103). Heidelberg,
Germany: Springer-Verlag.
Lewis, J. R., Utesch, B. S., & Maher, D. E. (2013). UMUX-LITE—When
there’s no time for the SUS. In Proceedings of CHI 2013 (pp. 2099–2102).
Paris, France: Association for Computing Machinery.
Lilienfeld, S. O., Wood, J. M., & Garb, H. N. (2000). The scientific status
of projective techniques. Psychological Science in the Public Interest,1,
27–66.
Lindgaard, G., & Kirakowski, J. (2013). Introduction to the special issue:
The tricky landscape of developing rating scales in HCI. Interacting with
Computers,25, 271–277.
Lord, F. M. (1953). On the statistical treatment of football numbers. American
Psychologist,8, 750–751.
Lord, F. M. (1954). Further comment on “football numbers.” American
Psychologist,9, 264–265.
Lottridge, D., Chignell, M., & Jovicic, A. (2011). Affective design:
Understanding, evaluating, and designing for human emotion. Reviews of
Human Factors and Ergonomics,7, 197–237.
Lund, A. (1998). USE Questionnaire Resource Page. Retrieved from http://
usesurvey.com.
Lund, A. (2001). Measuring usability with the USE questionnaire. Usability and
User Experience Newsletter of the STC Usability SIG,8(2), 1–4.
Lusch, R. F., Vargo, S. L., & O’Brien, M. (2007). Competing through service:
Insights from service-dominant logic. Journal of Retailing,83, 5–18.
Lusch, R. F., Vargo, S. L., & Wessels, G. (2008). Toward a conceptual foun-
dation for service science: Contributions from service-dominant logic. IBM
Systems Journal,47, 5–14.
McDonald, S., Edwards, H. M., & Zhao, T. (2012). Exploring think-
alouds in usability testing: An international survey. IEEE Transactions on
Professional Communication,55, 2–19.
McDonald, S., McGarry, K., & Willis, L. M. (2013). Thinking-aloud about
web navigation: The relationship between think-aloud instructions, task
difficulty and performance. In Proceedings of the Human Factors and
Ergonomics Society Annual Meeting (pp. 2037–2041). Santa Monica, CA:
Human Factors and Ergonomics Society.
MacKenzie, I. S. (2014). Human-computer interaction: An empirical research
perspective. Waltham, MA: Morgan Kaufmann.
Marcus, A. (2007). Global/intercultural user-interface design. In J. Jacko &
A. Sears (Eds.), Handbook of human-computer interaction (3rd ed., pp.
355–380). New York, NY: Erlbaum.
Marshall, C., Brendan, M., & Prail, A. (1990). Usability of Product X: Lessons
from a real product. Behaviour & Information Technology,9, 243–253.
McSweeney, R. (1992). SUMI: A psychometric approach to software evaluation
(Unpublished master’s thesis). Cork, Ireland: University College of Cork.
Michell, J. (1986). Measurement scales and statistics: A clash of paradigms.
Psychological Bulletin,100, 398–407.
Molich, R., Bevan, N., Curson, I., Butler, S., Kindlund, E., Miller, D.,
& Kirakowski, J. (1998). Comparative evaluation of usability tests. In
Usability Professionals Association Annual Conference Proceedings (pp.
189–200). Washington, DC: Usability Professionals Association.
Molich, R., & Dumas, J. S. (2008). Comparative usability evaluation (CUE-4).
Behaviour & Information Technology,27, 263–281.
Molich, R., Ede, M. R., Kaasgaard, K., & Karyukin, B. (2004). Comparative
usability evaluation. Behaviour & Information Technology,23, 65–74.
Molich, R., Jeffries, R., & Dumas, J. S. (2007). Making usability recommenda-
tions useful and usable. Journal of Usability Studies,2, 162–179.
Molich, R., Kirakowski, J., Sauro, J., & Tullis, T. (2009). Comparative usabil-
ity task measurement workshop (CUE-8). Workshop conducted at the UPA
2009 Conference in Portland, OR.
Nielsen, J. (2000). Why you only need to test with 5 users. Alertbox. Retrieved
from http://www.useit.com/alertbox/20000319.html
Nielsen, J., & Landauer, T. K. (1993). A mathematical model of the find-
ing of usability problems. In Proceedings of INTERCHI’93 (pp. 206–213).
Amsterdam, the Netherlands: Association for Computing Machinery.
Nielsen, J., & Mack, R. L. (1994). Usability inspection methods. New York,
NY: Wiley.
Nielsen, J., & Molich, R. (1990). Heuristic evaluation of user interfaces. In
Proceedings of CHI ’90 (pp. 249–256). New York, NY: Association for
Computing Machinery.
Nørgaard, M., & Hornbæk, K. (2009). Exploring the value of usability feed-
back formats. International Journal of Human-Computer Interaction,25,
49–74.
Nunnally, J. C. (1978). Psychometric theory. New York, NY: McGraw-Hill.
Ohnemus, K. R., & Biers, D. W. (1993). Retrospective versus thinking aloud
in usability testing. In Proceedings of the Human Factors and Ergonomics
Society 37th Annual Meeting (pp. 1127–1131). Seattle, WA: Human Factors
and Ergonomics Society.
Olmsted-Hawala, E. L., Murphy, E., Hawala, S., & Ashenfelter, K. T. (2010).
Think-aloud protocols: A comparison of three think-aloud protocols for use
in testing data-dissemination web sites for usability. In Proceedings of CHI
2010 (pp. 2381–2390). Atlanta, GA: Association for Computing Machinery.
Perfetti, C., & Landesman, L. (2001). Eight is not enough. Retrieved from http://
www.uie.com/articles/eight_is_not_enough/
Pilotte, W. J., & Gable, R. K. (1990). The impact of positive and negative
item stems on the validity of a computer anxiety scale. Educational and
Psychological Measurement,50, 603–610.
Pitkänen, O., Virtanen, P., & Kemppinen, J. (2008). Legal research topics in
user-centric services. IBM Systems Journal,47, 143–152.
Redish, J. (2007). Expanding usability testing to evaluate complex systems.
Journal of Usability Studies,2, 102–111.
Rubin, J. (1994). Handbook of usability testing: How to plan, design, and
conduct effective tests. New York, NY: Wiley.
Rubin, J., & Chisnell, D. (2008). Handbook of usability testing: How to plan,
design, and conduct effective tests, 2nd ed. New York, NY: Wiley.
Sabra, A. I. (2003). Ibn al-Haytham. Harvard Magazine,106, 54–55.
Sauro, J. (2010a). A practical guide to measuring usability. Denver, CO: Create
Space.
Sauro, J. (2010b). That’s the worst website ever! Effects of extreme survey items.
Retrieved from http://www.measuringusability.com/blog/extreme-items.php.
Sauro, J. (2011). A practical guide to the System Usability Scale (SUS):
Background, benchmarks & best practices. Denver, CO: Measuring
Usability.
Sauro, J., & Lewis, J. R. (2005). Estimating completion rates from small
samples using binomial confidence intervals: Comparisons and recommen-
dations. In Proceedings of the Human Factors and Ergonomics Society 49th
Annual Meeting (pp. 2100–2104). Santa Monica, CA: Human Factors and
Ergonomics Society.
Sauro, J., & Lewis, J. R. (2009). Correlations among prototypical usability met-
rics: Evidence for the construct of usability. In Proceedings of CHI 2009
(pp. 1609–1618). Boston, MA: Association for Computing Machinery.
Sauro, J., & Lewis, J. R. (2011). When designing usability questionnaires,
does it hurt to be positive? In Proceedings of CHI 2011 (pp. 2215–2223).
Vancouver, Canada: Association for Computing Machinery.
Sauro, J., & Lewis, J. R. (2012). Quantifying the user experience: Practical
statistics for user research. Burlington, MA: Morgan Kaufmann.
Schmettow, M. (2008). Heterogeneity in the usability evaluation process. In
Proceedings of the 22nd British HCI Group Annual Conference on HCI
2008: People and Computers XXII: Culture, Creativity, Interaction - Volume
1 (pp. 89–98). Liverpool, UK: Association for Computing Machinery.
Schmettow, M. (2009). Controlling the usability evaluation process under vary-
ing defect visibility. In Proceedings of the 2009 British Computer Society
Conference on Human-Computer Interaction (pp. 188–197). Cambridge,
UK: Association for Computing Machinery.
Schmettow, M. (2012). Sample size in usability studies. Communications of the
ACM,55(4), 64–70.
Schmitt, N., & Stults, D. (1985). Factors defined by negatively keyed items:
The result of careless respondents? Applied Psychological Measurement,9,
367–373.
Scholten, A. Z., & Borsboom, D. (2009). A reanalysis of Lord’s statistical
treatment of football numbers. Journal of Mathematical Psychology,53,
69–75.
Schriesheim, C. A., & Hill, K. D. (1981). Controlling acquiescence response
bias by item reversals: The effect on questionnaire validity. Educational and
Psychological Measurement,41, 1101–1114.
Seffah, A., Donyaee, M., Kline, R. B., & Padda, H. K. (2006). Usability mea-
surement and metrics: A consolidated model. Software Quality Journal,14,
159–178.
Shackel, B. (1990). Human factors and usability. In J. Preece & L. Keller
(Eds.), Human–computer interaction: Selected readings (pp. 27–41). Hemel
Hempstead, England: Prentice Hall International.
Shneiderman, B., & Plaisant, C. (2010). Designing the user interface: Strategies
for effective human-computer interaction, 5th ed. Reading, MA: Addison-
Wesley.
Smith, D. C., Irby, C., Kimball, R., Verplank, B., & Harslem, E. (1982).
Designing the Star user interface. Byte,7, 242–282.
Snyder, K. M., Happ, A. J., Malcus, L., Paap, K. R., & Lewis, J. R. (1985).
Using cognitive models to create menus. In Proceedings of the Human
Factors Society 29th Annual Meeting (pp. 655–658). Baltimore, MD:
Human Factors Society.
Spector, P., Van Katwyk, P., Brannick, M., & Chen, P. (1997). When two fac-
tors don’t reflect two constructs: How item characteristics can produce
artifactual factors. Journal of Management,23, 659–677.
Spencer, R. (2000). The streamlined cognitive walkthrough method: Working
around social constraints encountered in a software development company.
In Proceedings of CHI 2000 (pp. 353–359). New York, NY: Association for
Computing Machinery.
Spohrer, J., & Maglio, P. P. (2008). The emergence of service science:
Toward systematic service innovations to accelerate co-creation of value.
Production and Operations Management,17, 238–246.
Spool, J., & Schroeder, W. (2001). Testing websites: Five users is nowhere near
enough. In CHI 2001 extended abstracts (pp. 285–286). New York, NY:
Association for Computing Machinery.
Stevens, S. S. (1946). On the theory of scales of measurement. Science,103,
677–680.
Stewart, T. J., & Frye, A. W. (2004). Investigating the use of negatively-phrased
survey items in medical education settings: Common wisdom or common
mistake? Academic Medicine,79 (Suppl. 10), S1–S3.
Stigler, S. M. (1999). Statistics on the table: The history of statistical concepts
and methods. Cambridge, MA: Harvard University Press.
Theofanos, M., & Quesenbery, W. (2005). Towards the design of effective
formative test reports. Journal of Usability Studies,1(1), 27–45.
Thimbleby, H. (2007). User-centered methods are insufficient for safety criti-
cal systems. In A. Holzinger (Ed.), Proceedings of USAB 2007 (pp. 1–20).
Heidelberg, Germany: Springer-Verlag.
Thurstone, L. L. (1928). Attitudes can be measured. American Journal of
Sociology,33, 529–554.
Townsend, J. T., & Ashby, F. G. (1984). Measurement scales and statistics: The
misconception misconceived. Psychological Bulletin,96, 394–401.
Travis, D. (2008). Measuring satisfaction: Beyond the usability questionnaire.
Retrieved from http://www.userfocus.co.uk/articles/satisfaction.html.
Tullis, T. S. (1985). Designing a menu-based interface to an operating system.
In Proceedings of CHI 1985 (pp. 79–84). San Francisco, CA: Association
for Computing Machinery.
Tullis, T. S., & Albert, W. (2008). Measuring the user experience: Collecting,
analyzing, and presenting usability data. Waltham, MA: Morgan Kaufmann.
Tullis, T. S., & Albert, W. (2013). Measuring the user experience: Collecting,
analyzing, and presenting usability data, 2nd ed. Waltham, MA: Morgan Kaufmann.
Tullis, T. S., & Stetson, J. N. (2004). A comparison of questionnaires for
assessing website usability. Paper presented at the Usability Professionals
Association Annual Conference. Minneapolis, MN: Usability Professionals
Association.
van de Vijver, F. J. R., & Leung, K. (2001). Personality in cultural context:
Methodological issues. Journal of Personality,69, 1007–1031.
van den Haak, M. J., & de Jong, D. T. M. (2003). Exploring two methods
of usability testing: Concurrent versus retrospective think-aloud proto-
cols. In Proceedings of the International Professional Communication
Conference, IPCC 2003 (pp. 285–287). Orlando, FL: Institute of Electrical
and Electronics Engineers.
Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal, interval,
and ratio typologies are misleading. The American Statistician,47,
65–72.
Virzi, R. A. (1990). Streamlining the design process: Running fewer subjects.
In Proceedings of the Human Factors Society 34th Annual Meeting (pp.
291–294). Santa Monica, CA: Human Factors Society.
Virzi, R. A. (1992). Refining the test phase of usability evaluation: How many
subjects is enough? Human Factors,34, 457–468.
Virzi, R. A., Sorce, J. F., & Herbert, L. B. (1993). A comparison of three usabil-
ity evaluation methods: Heuristic, think-aloud, and performance testing. In
Proceedings of the Human Factors and Ergonomics Society 37th Annual
Meeting (pp. 309–313). Santa Monica, CA: Human Factors and Ergonomics
Society.
Vredenburg, K., Mao, J. Y., Smith, P. W., & Carey, T. (2002). A survey of
user centered design practice. In Proceedings of CHI 2002 (pp. 471–478).
Minneapolis, MN: Association for Computing Machinery.
Wharton, C., Rieman, J., Lewis, C., & Polson, P. (1994). The cognitive walk-
through method: A practitioner’s guide. In J. Nielsen & R. L. Mack (Eds.),
Usability inspection methods (pp. 105–140). New York, NY: Wiley.
Whiteside, J., Bennett, J., & Holtzblatt, K. (1988). Usability engineering: Our
experience and evolution. In M. Helander (Ed.), Handbook of human–
computer interaction (pp. 791–817). Amsterdam, the Netherlands: North-
Holland.
Wildman, D. (1995). Getting the most from paired-user testing. Interactions,
2(3), 21–27.
Williams, G. (1983). The Lisa computer system. Byte,8(2), 33–50.
Winter, S., Wagner, S., & Deissenboeck, F. (2008). A comprehensive model
of usability. In Engineering Interactive Systems (pp. 106–122). Heidelberg,
Germany: International Federation for Information Processing.
Wixon, D. (2003). Evaluating usability methods: Why the current literature fails
the practitioner. Interactions,10(4), 28–34.
Wong, N., Rindfleisch, A., & Burroughs, J. (2003). Do reverse-worded items
confound measures in cross-cultural consumer research? The case of the
material values scale. Journal of Consumer Research,30, 72–91.
Woolrych, A., & Cockton, G. (2001). Why and when five test users
aren’t enough. In J. Vanderdonckt, A. Blandford, & A. Derycke (Eds.),
Proceedings of IHM–HCI 2001 Conference, Vol. 2 (pp. 105–108). Toulouse,
France: Cépadèus Éditions.
Wright, R. B., & Converse, S. A. (1992). Method bias and concurrent ver-
bal protocol in software usability testing. In Proceedings of the Human
Factors and Ergonomics Society 36th Annual Meeting (pp. 1220–1224).
Santa Monica, CA: Human Factors and Ergonomics Society.
Xue, M., & Harker, P. T. (2002). Customer efficiency: Concept and its
impact on e-business management. Journal of Service Research,4,
253–267.
ABOUT THE AUTHOR
James R. Lewis has been a usability practitioner at IBM
since 1981. His books include Practical Speech User Interface
Design (2011) and Quantifying the User Experience (2012,
with Jeff Sauro). He serves on the editorial boards of the
International Journal of Human-Computer Interaction and the
Journal of Usability Studies, is an IBM Master Inventor with
more than 80 U.S. patents, and is currently president of the
Association for Voice Interaction Design.