Multimodal Behaviour and Interaction as Indicators of Cognitive
Load
FANG CHEN,
NICTA, Australia
NATALIE RUIZ,
NICTA, Australia
ERIC CHOI,
NICTA, Australia
JULIEN EPPS,
NICTA, Australia
M. ASIF KHAWAJA,
NICTA, Australia
RONNIE TAIB,
NICTA, Australia
BO YIN,
NICTA, Australia
YANG WANG,
NICTA, Australia
High cognitive demand arises from complex, time and safety-critical tasks, e.g., mapping out flight paths,
monitoring traffic or even managing nuclear reactors, causing stress, errors and lowered performance.
Over the last five years, our research has focused on using the multimodal interaction paradigm to detect
fluctuations in cognitive load in user behaviour during system interaction. Cognitive load variations
have been found to impact interactive behaviour: by monitoring variations in specific modal input features
executed in tasks of varying complexity, we gain an understanding of the communicative changes that
occur when cognitive load is high. So far, we have identified specific changes in: speech, namely acoustic,
prosodic and linguistic changes; interactive gesture; and digital pen input, both interactive and freeform. As
ground truth measurements, galvanic skin response, subjective and performance ratings have been used to
verify task complexity.
The data suggests that it is feasible to use features extracted from behavioural changes in multiple modal
inputs as indices of cognitive load. The speech based indicators of load, based on data collected from user
studies in a variety of domains, have shown considerable promise. Scenarios include single-user and team-
based tasks, think-aloud and interactive speech, single-word, reading and conversational speech, among
others. Pen based cognitive load indices have also been tested with some success, specifically with pen-
gesture, handwriting and free-form pen input, including diagramming. After examining some of the
properties of these measurements, we present a multimodal fusion model, illustrated with quantitative
examples from a case study.
The feasibility of using user input and behaviour patterns as indices of cognitive load is supported by
experimental evidence. Moreover, using symptomatic cues of cognitive load derived from user behaviour,
such as acoustic speech signals, transcribed text, and digital pen trajectories of handwriting and shapes,
can be supported by well-established theoretical frameworks, including O’Donnell and Eggemeier’s
workload measurement [O’Donnell & Eggemeier, 1986], Sweller’s Cognitive Load Theory [Chandler &
Sweller, 1991] and Baddeley’s model of modal working memory [A. D. Baddeley, 1992], as well as
McKinstry et al.’s [McKinstry, Dale, & Spivey, 2008] and Rosenbaum’s [Rosenbaum, 2005] action dynamics work.
The benefit of using this approach to determine the user's cognitive load in real time is that the data can
be collected implicitly, i.e. during day-to-day usage of intelligent interactive systems, thus overcoming
problems of intrusiveness and increasing applicability in real-world environments, while adapting
information selection and presentation in a dynamic computer interface with reference to load.
This work is supported by NICTA, which is funded by the Australian Government as represented by the
Department of Broadband, Communications and the Digital Economy and the Australian Research
Council through the ICT Centre of Excellence program.
Author’s addresses: F. Chen, E. Choi, J. Epps, M. A. Khawaja, N. Ruiz, R. Taib, and B. Yin, NICTA,
Level 5, 13 Garden Street, Eveleigh, NSW 2015, Australia.
Permission to make digital or hardcopies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies show this notice on the first page or initial screen of a display along with the full citation.
Copyrights for components of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any
component of this work in other works requires prior specific permission and/or a fee. Permissions may be
requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA,
fax +1 (212) 869-0481, or permissions@acm.org.
© 2010 ACM 1539-9087/2010/03-ART39 $10.00
DOI: 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000
Categories and Subject Descriptors: H.1.2 [User/Machine Systems] – Human information processing;
H.5.2 [User Interfaces] - Interaction Styles
General Terms: Measurement, Experimentation, Human Factors
Additional Key Words and Phrases: Cognitive Load, Pen Input, Assessment, Multimodal
ACM Reference Format:
Chen, F., Ruiz, N., Khawaja, M. A., Choi, E., Epps, J., Taib, R., Yin, B., and Wang, Y., Multimodal Behaviour and
Interaction as Indicators of Cognitive Load. ACM Trans. Intell. Interact. Syst. (May 2011), 26 pages. DOI =
10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000
1. INTRODUCTION
The past few decades have been marked by the rapid development of new and
powerful information systems, granting access to volumes of data previously unheard
of. The dramatic evolution of computers and networks has allowed exponential
functionality to be offered by expert software. However, the capabilities of the human
brain, working memory in particular, have remained unchanged and fairly limited.
Even domain experts well trained with these tools now struggle to manage the available
information. Worse still, the lack of metrics in complex environments makes it
impossible to predict the tipping point, when the user no longer has control of the
situation. This issue is exacerbated in high-intensity, data-laden and safety-critical
environments, and calls for a robust and real-time measurement of user’s cognitive
load.
Indeed, traffic management centres, crisis or air-traffic control rooms, and other intelligent
interactive systems involve inherently complex domain tasks for operators to solve.
High cognitive demand arises from such tasks as mapping out flight paths,
monitoring traffic or even managing nuclear reactors. The ability to measure a user’s
cognitive load in real-time can support personalised system adaptation to users
affected by high cognitive load, easing the demand and avoiding stress, frustration
and errors. While conventional human-computer interaction paradigms (e.g.,
Graphical User Interfaces) are useful in personal computing applications such as
word processing, they do not adequately support tasks that require the manipulation
of complex data types and constraints in the way intelligent, interactive systems
have the potential to do.
Our research goal for the past five years has been the development of technology
for the implicit, objective, automated and real-time estimation of a user’s cognitive
load, suitable for real-time deployment as part of an intelligent interactive system.
The approach is focused on the identification of possible correlations between
increasing levels of cognitive demand and both passive and active modalities, from
speech, digital pen, and freehand gesture, to eye-activity, galvanic skin response and
electroencephalography (EEG). This article begins with a summary of the underlying
psychological theories on which our research rests, an overview of our individual
approach, and a review of the most promising indices and features that we have
found to be sensitive to high cognitive load. Finally, we discuss the implications of
our findings and plans for future work.
2. RELATED WORK
2.1 Working Memory and Cognitive Load
It is well-established that the two main limitations of working memory resources are
its capacity and duration [A. D. Baddeley, 1992]. According to Baddeley’s model [A.
D. Baddeley, 1992], working memory has separate processors for visual and verbal
information. Only a limited number of item “chunks” can be held in working memory
at any one time and then only for a short amount of time [Cowan, 2001]. These
limitations are never more evident than when users undertake complex tasks, or
when in the process of learning, resulting in extremely high demands being placed
on working memory. The construct of cognitive load refers to the working memory
demand induced by a complex task in a particular instance where novel information
or novel processing is required [Sweller, Merrienboer, & Paas, 1998]. Any single task
can induce differing levels of mental effort or cognitive load from one user to another,
or as a user gains expertise. This discrepancy in the mental demand from person to
person could be due to a number of reasons, for example, level of domain expertise or
prior knowledge, interface familiarity, the user’s age, or any mental or physical
impediments. A task that may cause high load in one user may not necessarily do so
in a more experienced user, for example.
The cognitive load construct comprises at least two separate load sources:
intrinsic load and extraneous load [Paas, Tuovinen, Tabbers, & Gerven, 2003;
Sweller, et al., 1998]. Intrinsic load refers to the inherent complexity of the task
itself, whereas extraneous load refers to representational complexity, that is,
complexity that varies depending on the way the task is presented. In an intelligent
interactive system, the inherent task complexity would be dictated by the domain.
For example, in a traffic management scenario, a sample domain task may be to find
the exact location of an accident. The equipment, tools and applications the operator
employs to complete the task, for example a paper-based directory, a GIS or
electronic maps, or even street monitoring cameras, each contribute to extraneous
load. Both of these types of load combine to form the overall experience of cognitive
load. Situations that induce high levels of cognitive load can impede learning and
efficient performance on designated tasks [Paas, et al., 2003; Sweller, et al., 1998].
The ability to determine exactly when a user is being cognitively loaded beyond a
level that they are able to manage could enable the system to adapt its interaction
strategy intelligently. For example, the system could attempt to reduce the cognitive
load experienced by the operator, particularly in terms of extraneous load, such
that optimal performance is facilitated. A number of methods have been used, both
in Human-Computer Interaction (HCI) and other domains, to estimate the level of
cognitive load experienced. There are four main methods comprising the state of the
art: subjective (self-report) measures, where users rank their experienced level of
load on single or multiple rating scales [Gopher & Braune, 1984]; physiological
measures, such as galvanic skin response, and heart rate [Delis, Kramer, & Kaplan,
2001]; performance measures, such as task completion time, speed or correctness,
critical errors and false starts [Gawron, 2000; O’Donnell & Eggemeier, 1986; Paas,
Ayers, & Pachman, 2008], as well as dual-tasks [Chandler & Sweller, 1991]; and
finally, behavioural measures, which observe feature patterns of interactive
behaviour, such as linguistic or dialogue patterns [Berthold & Jameson, 1999], and
even text input events and mouse-click events [Ikehara & Crosby, 2005]. However,
while most of these types of measures are suitable for research purposes, many are
unfeasible for widespread deployment in intelligent interactive systems.
2.2 Subjective Measures
Traditionally, the most consistent results for cognitive load measurement have been
achieved through subjective measures [O’Donnell & Eggemeier, 1986]. These
measures reflect each user’s perception of cognitive load by means of introspection:
the user is required to perform a self-assessment of their mental demand by
answering a set of assessment questions immediately after the task. However, such
an approach is impractical in real, day-to-day situations because the questionnaires
not only interrupt task flow but also add more tasks to the load of potentially
overloaded users.
2.3 Physiological Measures
The physiological approach for cognitive load measurement is based on the
assumption that changes in human cognitive functioning are reflected in human
physiology [Kramer, 1991]. The measures that have been used in the
literature to show some relationship between subjects’ mental workload or cognitive
load and their physiological behaviour include, among others, heart rate and heart
rate variability [Kennedy & Scholey, 2000; Mousavi, Low, & Sweller, 1995; Nickel &
Nachreiner, 2000], brain activity (e.g. changes in oxygenation and blood volume,
electrocardiography (ECG), electroencephalography (EEG)) [Brunken, Plass, &
Leutner, 2003; Wilson & Russell, 2003], galvanic skin response (GSR) or skin
conductance [Jacobs et al., 1994; Shi, Ruiz, Taib, Choi, & Chen, 2007], and eye
activity (e.g. blink rate, eye movement, pupillary dilation) [Backs & Walrath, 1992;
Iqbal, Zheng, & Bailey, 2004; Lipp & Neumann, 2004; Marshall, Pleydell-Pearce, &
Dickson, 2003]. Changes in the physiological data occur with the level of stimulation
experienced by the person and can represent various levels of mental processing. The
data collected from body functions are useful as they are continuous and allow the
signal to be measured at a high rate and in fine detail. However, physiological
measures require users to wear cumbersome equipment, e.g. EEG headsets, which
not only interferes with their task but is also prohibitive in cost and
implementation. Additionally, the large amounts of physiological data that need to be
collected and the expertise needed to interpret those signals render many types of
physiological signals unsuitable for common intelligent interactive systems [Delis, et
al., 2001]. While they can be very sensitive to cognitive activity, the above issues in
combination with the degree of variability of physiological signals, due to external
factors such as temperature and movement, mean they may have limited suitability
for environments other than laboratory conditions [Delis, et al., 2001].
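For illustration, simple summary features of the kind typically derived from GSR recordings can be computed directly from the sampled signal. The feature choices, function name and signal below are our own illustrative assumptions, not those of the cited studies:

```python
import numpy as np

def gsr_features(signal, fs):
    """Summarise a skin-conductance recording (microsiemens, sampled at fs Hz).

    Returns the mean level, the overall drift (slope), and a crude count of
    phasic peaks -- illustrative choices only, not a validated feature set.
    """
    t = np.arange(len(signal)) / fs
    mean_level = float(np.mean(signal))
    slope = float(np.polyfit(t, signal, 1)[0])  # drift in uS per second
    # Peak count: interior samples exceeding both neighbours and the mean
    d = np.diff(signal)
    peaks = np.sum((d[:-1] > 0) & (d[1:] < 0) & (signal[1:-1] > mean_level))
    return {"mean": mean_level, "slope": slope, "peaks": int(peaks)}

# Example: a slowly rising baseline with two superimposed phasic bumps
fs = 10
t = np.arange(0, 30, 1 / fs)
sig = 2.0 + 0.01 * t
sig[50] += 0.5
sig[200] += 0.5
feats = gsr_features(sig, fs)
```

Because such features are cheap to compute per window, they can in principle be updated continuously, which is what makes GSR attractive despite the variability issues noted above.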
2.4 Performance Measures
The hypothetical relationship between performance and workload as discussed by
O’Donnell and Eggemeier [O’Donnell & Eggemeier, 1986], is composed of three
regions, A, B and C, as seen in Fig. 1. The authors claim that primary task measures
of workload cannot be used to reflect mental workload in Region A, because this
region is characterised as indicating “adequate task performance” on behalf of the
subject [O’Donnell & Eggemeier, 1986]. However, in many real world tasks, what
constitutes “adequate task performance” is analogous to a band of acceptable
outcomes rather than a single correct response, and subtle differences may occur
between different solution alternatives which may not be reflected in the overall
performance measures used. As discussed in this paper, we propose that certain
features of the behavioural responses have the potential to differentiate between
these solution outcomes by identifying compensatory behaviours. In much the same
way, performance measures cannot measure spare capacity, when a user still has
plenty of cognitive resources to deploy.
In region B, both primary and secondary task performance measures can be used
to reflect workload as performance decreases. Dual-task approaches have been
incorporated in several studies to measure subjects’ performance in controlled
conditions [O’Donnell & Eggemeier, 1986]. While secondary task performance can
provide a measure of remaining resources not being used by the primary task [Kerr,
1973; Marcus, Cooper, & Sweller, 1996], it is not feasible for operators to complete
dual tasks “in the wild”, and hence these cannot be adopted for widespread use. In
real world tasks, performance measures from the primary task can be extremely
difficult to calculate on the fly, if at all. In the case of Transport Management
Centres, senior staff will often conduct reviews of incident handling to debrief
operators and qualitatively rate their performance.
Fig. 1. Hypothetical relationship between workload and operator performance, after
O’Donnell and Eggemeier [O’Donnell & Eggemeier, 1986].
Performance measures tend to remain stable as load increases in region B,
particularly when the operator exerts a greater amount of mental effort, as noted by
[O’Donnell & Eggemeier, 1986]. This is addressed more specifically by Hockey
[Hockey, 2003] who proposes a range within which compensatory efforts may have an
effect. Fig. 2 illustrates this concept – the subject still achieves a high level of
performance within the region labelled “effort”, depending on the degree of effort
exerted. Exposure to high cognitive load (workload) culminates in a higher likelihood
of errors [Byrne, Sellen, & Jones, 1998; Hockey, 2003; Ruffell Smith, 1979] and
compensatory efforts can only be maintained for a time; the subject then fatigues and
their performance begins to decline [Hockey, 2003]. At the Overload stage,
compensatory efforts no longer make a difference – it is too late for the system to
react appropriately to ease the operator’s load and both the system and the user must
engage in costly recovery strategies.
Fig. 2. Relationship between performance and workload, adapted from Hockey
[Hockey, 2003].
The approaches described thus far have the disadvantage of being physically or
psychologically intrusive. Likewise, many of them are also post-hoc and hence not
conducive to the implementation of real-time adaptive behaviours and strategies by
an intelligent interactive system or interface. Performance measures can also depend
on the subject completing the task, which may not always be possible in high load
situations – for example, the subject may be stuck on one or two steps of the overall
task for a relatively long period of time and no valid task-based performance
assessment can be calculated.
Similarly, performance measures, which we define as measures that reflect the
accuracy and correctness of a user’s response and are directly relevant to the outcome
of the task, are often calculated after the fact, if they can be assessed objectively at
all. In the kinds of complex domains which we are targeting, measures based on
performance outcomes are impossible to access in real-time such that the system is
able to act on the information in a timely manner. For example, the spontaneous
nature of crisis management and other control room situations means the user’s
performance in this sense is very difficult to rate, even during debriefing, and unique
to almost every situation. The actions taken can vary widely from operator to
operator, both in order and content, while still being equally effective in achieving
the task goals and solving the problem to an adequate level of performance. In some
cases, performance cannot be calculated automatically at all.
2.5 Behavioural Measures
On the other hand, we define response-based behavioural features as those that can
be extracted from any user activity that is related to deliberate/voluntary task
completion, for example, eye-gaze tracking, mouse pointing and clicking, keyboard
usage, application usage, digital pen input, gesture input or any other kind of
interactive input used to issue system commands. These responses provide two types
of information: first, the inherent content or meaning of the response, and secondly,
the manner in which the response was made. For example, one could type in a
sequence of numbers as part of a task in different ways using a variety of equipment:
the keys on the top part of the keyboard (above the alphabet), or the keys on the
number pad on the right side of the keyboard, or by clicking buttons on a numeric
display with a mouse. The string of numbers is the same – this is the content or
meaning in the response relevant to the domain task. The manner in which the
response is made does not directly affect the outcome of the task, but does provide
other information, for example, how long it took to enter the sequence of numbers, how
much pressure was exerted on each key, and in the case of the mouse usage, features
such as the mouse trajectory and the time between clicks.
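The distinction between content and manner can be made concrete with an event log. The log format and feature names below are illustrative assumptions, not an interface we have built:

```python
import statistics

# Hypothetical input log: (timestamp_ms, event_type, payload)
events = [
    (0,    "key", "4"),
    (180,  "key", "2"),
    (1100, "key", "7"),   # long hesitation before this key
    (1260, "key", "9"),
]

# Content of the response: the digit string itself (task-relevant)
content = "".join(p for _, e, p in events if e == "key")

# Manner of the response: timing features (task-irrelevant, behaviourally rich)
times = [t for t, e, _ in events if e == "key"]
intervals = [b - a for a, b in zip(times, times[1:])]
features = {
    "total_ms": times[-1] - times[0],
    "mean_iki_ms": statistics.mean(intervals),  # inter-key interval
    "max_iki_ms": max(intervals),               # longest hesitation
}
```

Two users entering the same string produce identical content but potentially very different timing features, which is precisely the margin the behavioural approach exploits.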
We define these sources as behavioural rather than performance centric because
the information they hold does not directly affect the domain-based outcome of the
task; hence there is considerable margin for differences within and between users. They
are objective, and can be collected implicitly, i.e. while the user is completing their
task and without overt collection activities (e.g. stopping to ask the user to provide a
subjective rating of difficulty), hence suitable for control-room type environments.
They are also distinct from physiological measures in that they are mostly or entirely
under the user’s voluntary control. While this is likely to introduce some variability
relative to physiological measures, this variability may turn out to be smaller when the
behaviour is the response to a task or task type that occurs very often in the user’s
environment. There is evidence that shows that these kinds of behavioural features
can reflect mental states, such as mental effort and cognitive load. For instance, Gütl
et al. [Gütl et al., 2005] used eye tracking to observe subjects’ learning activities in
real-time by monitoring their eye movements for adaptive learning purposes. Others
have used mouse clicking and keyboard key-pressing behaviour to make inferences
about their emotional state and adapt the system’s response accordingly [Ark, Dryer,
& Lu, 1999; Liu et al., 2003].
Previous research also suggests the existence of major speech cues that are
related to high cognitive load [Berthold & Jameson, 1999; Jameson et al., 2009;
Keränen et al., 2004; Müller, Großmann-Hutter, Jameson, Rummer, & Wittig, 2001].
Examples of features that have been shown to vary according to task difficulty
include pitch, prosody, speech rate, speech energy, and fundamental speech
frequency. Some studies have reported an increase in the subjects’ rate of speech as
well as speech energy, amplitude, and variability under high load conditions
[Brenner, Shipp, Doherty, & Morrissey, 1985; Lively, Pisoni, Summers, & Bernacki,
1993]. Others have found specific peak intonation [Kettebekov, 2004] and pitch range
patterns [Lively, et al., 1993; Wood et al., 2004] in high load conditions. Pitch
variability has also been shown to potentially correlate with cognitive load
[Brenner, et al., 1985; Tolkmitt & Scherer, 1986; Wood et al., 2004]. These features
are classified as behavioural because they show variations regardless of the meaning
of the utterance being conveyed.
Higher level features, such as linguistic and grammatical features, may also be
extracted from a user’s spoken language for patterns that may be indicative of high
cognitive load. Significant variations in levels of spoken disfluency, articulation rate
and filler and pause rates [Berthold & Jameson, 1999] have been found in users
experiencing low versus high cognitive load. Extensions of this work attempt to
recognise cognitive load levels using a Bayesian network approach [Müller, et al.,
2001]; other work has found changes in word frequency and first person plurals
[Sexton & Helmreich, 2000]. Changes in linguistic and grammatical features have
also been used for purposes other than cognitive load measurement [Schilperoord,
2001].
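Linguistic measures of this kind reduce to simple counting once a transcript is available. The following sketch computes filler and pause rates per hundred words; the pause-marker convention and filler list are our own assumptions, not those of Berthold and Jameson:

```python
import re

FILLERS = {"um", "uh", "er", "like"}  # illustrative filler set

def disfluency_features(transcript):
    """Filler words and marked pauses per 100 words of a transcript.

    Assumes pauses were annotated as '<pause>' during transcription;
    the marker convention and filler inventory are assumptions.
    """
    pauses = transcript.count("<pause>")
    words = re.findall(r"[a-z']+", transcript.replace("<pause>", " ").lower())
    n = len(words)
    return {
        "filler_rate": 100.0 * sum(1 for w in words if w in FILLERS) / n,
        "pause_rate": 100.0 * pauses / n,
    }

f = disfluency_features("um I think <pause> the route goes um north of the bridge")
```

In an on-line system the transcript would come from a speech recogniser, so recognition accuracy bounds the reliability of these rates.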
More closely linking multimodal interactive systems and cognitive load, users
have been found to change and adapt their multimodal behaviour in complex
situations. Empirical evidence suggests that when tasks are more difficult, users
prefer to interact multimodally rather than unimodally across a variety of different
application domains [Oviatt, Coulston, & Lunsford, 2004]. As task complexity
increases, users tend to spread information acquisition and production over distinct
modalities, seemingly for more effective use of the various modality-based working
memory resources [Alibali, Kita, & Young, 2000; Goldin-Meadow, Nusbaum, Kelly, &
S.Wagner, 2001; Mousavi, et al., 1995; Oviatt, 1997, 2006]. Temporal relationships
that exist between interaction modalities (e.g., speech and pen) have also been shown
to change under increased load conditions, showing a deeper entrenchment into the
participant’s preferred multimodal pattern, either simultaneous or sequential
[Oviatt, et al., 2004]. Another study employed users’ digital-pen gestures and usage
patterns to evaluate the usability and complexity of different interfaces [Oviatt,
2006]. It has been suggested that pen-based interfaces can dramatically improve
subjects’ ability to express themselves over traditional interfaces because linguistic,
alphanumeric and spatial representations bear little cognitive overhead [Oviatt,
2009].
2.6 Estimating Load from Interactive Behaviour
The premise of our research is that observations of interactive features may be
suitable for cognitive load assessment because a user experiencing a high cognitive
load will show behavioural symptoms relating to the management of that load. This
suggests a more generalised effect of an attempt to maximise working memory
resources during completion of complex tasks [Mousavi, et al., 1995; Oviatt, et al.,
2004]. High cognitive load tasks increase the cognitive demand, forcing more
cognitive processes to share fewer resources. We hypothesise that such reactions will
cause changes in interactive and communicative behaviour, whether voluntary or
otherwise.
The hypothesis that behavioural responses can provide insight into mental states
and processing is not without precedent. Spivey et al. contend that reaching
movements made with a computer mouse provide a continuous two-dimensional
index of which regions of a scene are influencing or guiding “action plans” – and
therefore reflective of changes in cognitive processes [Spivey, Grosjean, & Knoblich,
2005]. In an experiment involving decision-making, McKinstry et al. found that
mouse trajectories toward answer selection options (YES and NO) are characterised by
the greatest curvature and the lowest peak velocity when the “correct” choice to be
made is more ambiguous or more complex [McKinstry, et al., 2008]. They conclude
that spatial extent and temporal dynamics of motor movements can provide insight
into high-level cognition [McKinstry, et al., 2008; Rosenbaum, 2005]. Dale et al. ran a
study where participants’ hand movements were continuously tracked using a
Nintendo® Wii™ remote, as they learned to categorise elements [Dale, Roche,
Snyder, & McCall, 2008]. They noted that participants’ arm movements started and
finished more quickly and more smoothly (decreased fluctuation and perturbation)
after learning the categorisation rules. The “features of action
dynamics” show that participants grow more “confident” over a learning task, and
indicate learning has taken place. Van Galen and Huygevoort have shown that time
pressure and dual task load result in “biomechanical adaptations of pen pressure” as
a coping mechanism for increased load [Galen & van Huygevoort, 2000]. These studies
provide evidence that features of behavioural responses can be harnessed to provide
an indication of changes in cognitive processing and strategy.
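Trajectory measures of the kind used in these studies can be computed from logged cursor samples. The sketch below uses one common operationalisation of curvature (maximum perpendicular deviation from the straight start-to-end path) together with peak velocity; both the operationalisation and the sample data are our own illustrative choices, not those of the cited papers:

```python
import math

def trajectory_features(samples):
    """Curvature and peak-velocity features for a pointing trajectory.

    samples: list of (t_seconds, x, y) points from start to target,
    assumed to span a non-zero distance and strictly increasing time.
    """
    (t0, x0, y0), (tn, xn, yn) = samples[0], samples[-1]
    dx, dy = xn - x0, yn - y0
    length = math.hypot(dx, dy)
    # Maximum perpendicular distance from the straight start-to-end line
    max_dev = max(
        abs(dy * (x - x0) - dx * (y - y0)) / length
        for _, x, y in samples
    )
    # Peak velocity over consecutive sample pairs
    peak_v = max(
        math.hypot(x2 - x1, y2 - y1) / (t2 - t1)
        for (t1, x1, y1), (t2, x2, y2) in zip(samples, samples[1:])
    )
    return max_dev, peak_v

# A straight, fast movement vs. a curved, slower one toward the same target
straight = [(0.0, 0, 0), (0.1, 50, 0), (0.2, 100, 0)]
curved   = [(0.0, 0, 0), (0.2, 40, 30), (0.4, 100, 0)]
```

Higher deviation with lower velocity, on this account, would signal greater ambiguity or complexity in the underlying decision.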
Symptomatic changes in structure, form and quality of communicative and
interactive responses are more likely to appear as people are increasingly loaded, as
will be described in this paper. With the proliferation of sensor data that can be
collected from users through the latest intelligent systems, there is a very specific
opportunity to use this behavioural input to detect patterns of change that are
correlated with high load, and use these cues to guide the adaptation strategies
employed by the system. Such features have the added advantage of offering an
implicit, as opposed to overt, way to collect and assess cues that indicate changes in
cognitive load. However, to do this it is necessary to first identify and quantify the
fluctuations of features in user interaction as cognitive load varies over a variety of
input modalities.
The major challenge of choosing the assessment features for automated load
detection is to make sure they satisfy the requirements of consistency, compact
representation and automatic acquisition [Yin, Chen, Ruiz, & Ambikairajah, 2008].
Our aim is to find effective features which reliably reflect the cues and can be
extracted automatically such that they are useful in adaptive systems. A second goal
is to find a suitable learning or modelling scheme for each index to resolve the
corresponding level of cognitive load [Yin, et al., 2008]. By manipulating the level of
task complexity and cognitive load, and conducting a series of repeated-measures
user studies in a variety of scenarios, we have identified a series of cognitive load
indices based on features from a number of input modalities: specifically, significant
changes in speech and digital-pen input that are abstracted from the individual
application domains in which they occur and are correlated with high cognitive load.
3. SPEECH BASED FEATURES OF COGNITIVE LOAD
Speech signal features can be a particularly good choice for cognitive load indices,
since speech data exists in many real life tasks (e.g. telephone conversations, voice
command and control systems, self-talk) and can be easily collected in a non-
intrusive and inexpensive way with a close talk microphone. Types of speech features
can vary, from intensity, pitch and formants inspired by speech production, to other
acoustic, prosodic or linguistic features such as grammar and syntax. We have
explored all types of features with significant results.
3.1 Speech Datasets
Over the last five years, we have conducted a series of user studies in which we have
collected a large amount of interactive and conversational speech data in a variety of
application domains, ranging from speech responses to simple psychological tests
such as the Stroop test [Stroop, 1935], to reading and comprehension speech, to
interactive speech (in both simulated and real multimodal interactive system
environments) and think-aloud speech from controlled user studies with interactive
systems [P. Le, Epps, Ambikairajah, & Sethu, 2010; P. Le, Epps, Choi, &
Ambikairajah, 2010; Stroop, 1935; T. Yap, 2011; T. F. Yap, Ambikairajah, Epps, &
Choi, 2010; T. F. Yap, Epps, Ambikairajah, & Choi, 2010; Yin & Chen, 2007; Yin, et
al., 2008]. All data was collected through a series of specially designed and controlled
experiments, where we have manipulated multiple parameters to isolate different
cognitive load factors. Finally, through collaborative partnerships with industry, we
have also collected speech from the field, generated in real life environments such as
air traffic control rooms, call centres and bushfire control training exercises.
The speech datasets we have used in our investigations have either been elicited
during tasks of increasing cognitive load a-priori or labelled a-posteriori with expert
ratings of task complexity and cognitive load [Yin, et al., 2008; Yin, Ruiz, Chen, &
Khawaja, 2007]. While six different speech datasets have been collected in our team –
including field data, and lab studies featuring multimodal interaction with speech
and gesture, multimodal interaction with speech and pen in two different application
domains (Incident management and Basketball training), as well as a simulated
driving user study – two key databases have been used in the development of the
speech cognitive load measurement system. The first is the Stroop database and the
second is the Reading and Comprehension database. Both were generated from lab
controlled experiments featuring cognitive load manipulations.
The Stroop test corpus is based on the original test by John Ridley Stroop [Stroop,
1935]. Three levels of cognitive load were derived from this paradigm – the task
difficulty arises from cognitive interference when reading colour names or naming
colour words. In our version of the Stroop test, speech from the low cognitive load
task was recorded by asking the subjects to read aloud a series of words (which were
the names of colours) written either in black font or a congruent font colour (e.g. the
word red, written in red font). During the medium load level, subjects were asked to
name the font colour of words written in incongruent colour (i.e., the font colour of
the words is different to the meaning of the word, e.g. white written in blue font). In
the high cognitive load level, a time constraint was added to the medium load task,
forcing the subjects to complete the task faster. An additional recording was collected
from each subject when they were asked to read a short story aloud for
approximately 90 s; this was used as baseline data and for the background model of
the base cognitive load measurement engine illustrated in Fig. 3 below. This corpus
contains single word utterances of ten colour names, spoken slowly in a series. The
majority are also very short, containing only one or two syllables (e.g. red, blue,
white). In addition, there is a speech rate artefact caused by the time constraint for
the high cognitive load speech.
The second corpus, the Reading and Comprehension corpus was generated by
asking the subjects to read 3 stories aloud, each corresponding to a load level from
low to high. The difficulty levels of the stories were estimated based on the Lexile
scale [Lennon & Burdick, 2004] – a semantic difficulty and syntactic complexity
measure scale ranging from 200 to 1700 Lexiles (L), corresponding to the reading
level expected from a first grade student to a graduate student. The Lexile ratings of
the stories used were 925 L, 1200 L, and 1350 L, respectively. The approximate
lengths of the utterances corresponding to reading the story for the low, medium and
high cognitive loads are 90 s, 140 s, and 230 s, respectively. The story reading speech
is referred to as the Reading data. After each story, the subjects were asked to
answer three open ended questions related to the content of the story. The
approximate length of each answer to the three questions for all three levels of
cognitive load is 30 s. In contrast to the Stroop dataset, the Reading and
Comprehension corpus contains a significantly larger vocabulary due to the less
constrained content of the stories and the answers given to the questions.
3.2 Acoustic and Prosodic Speech Features
Inspired by previous research on emotional and stressed speech [Fernandez &
Picard, 2003; Hansen, 1996; Picard, 1997], we expected that prosodic patterns (i.e.
voice pitch variation) could be used as a cue to reflect cognitive load. The rate of
pauses and rate of pitch peaks emerged as good potential indicators of cognitive load
levels for the speech in multimodal interaction tasks in a study described in [Ruiz,
Taib, & Chen, 2006]. We used a sliding window implementation, which showed these
indicators to be higher when the cognitive load level was higher [Yin & Chen, 2007].
This proved to be the first of the speech signal based indices we uncovered.
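The sliding-window computation can be sketched as follows. This is a minimal numpy sketch, assuming frame-level pitch and energy contours have already been extracted (e.g. at 100 frames per second); the function name, window length and energy threshold are illustrative, not the settings used in the study.

```python
import numpy as np

def pause_and_pitch_peak_rates(pitch, energy, frame_rate=100.0,
                               win_s=3.0, hop_s=1.0, energy_thresh=0.02):
    """Slide a window over frame-level tracks and report, per window,
    the fraction of low-energy (pause) frames and the rate of local
    pitch maxima among voiced frames. Thresholds are illustrative."""
    win = int(win_s * frame_rate)
    hop = int(hop_s * frame_rate)
    results = []
    for start in range(0, len(pitch) - win + 1, hop):
        p = pitch[start:start + win]
        e = energy[start:start + win]
        pause_rate = np.mean(e < energy_thresh)
        voiced = p > 0  # assume 0 marks unvoiced frames
        # a pitch peak: a voiced frame higher than both neighbours
        peaks = (p[1:-1] > p[:-2]) & (p[1:-1] > p[2:]) & voiced[1:-1]
        peak_rate = peaks.sum() / win_s  # peaks per second
        results.append((pause_rate, peak_rate))
    return results
```

Windows with higher pause rates and pitch-peak rates would then be flagged as candidate high-load episodes.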
Since the areas of interest in cognitive load monitoring are the extreme levels of
cognitive load (too high or too low), the assessment problem was re-interpreted:
rather than a continuous scale of degrees of cognitive load, we introduced the notion
of discrete levels of load. A classification approach could then be employed for cognitive load
measurement [Yin, et al., 2008; Yin, et al., 2007]. In a bottom-up, data-driven
strategy for cognitive load assessment, the subsequent datasets were employed in a
statistical machine learning approach, in an effort to build a cognitive load
monitoring engine based solely on the changes in the speech signal [Yin, et al., 2008;
Yin, et al., 2007]. We have developed an automatic, real-time, speaker-independent
cognitive load assessment module, which can be adapted to varied task scenarios
[Yin, et al., 2008; Yin, et al., 2007].
A Gaussian Mixture Model (GMM) based classifier [Reynolds & Rose, 1992] –
forming part of the base system as pictured in Fig. 3 – was created with semi-
supervised training, from hours of annotated data from both of these sets, where each
of the cognitive load levels is modelled by a GMM. The classification engine
determines the best-matched model based on a calculated likelihood score. Channel
and speaker normalisation are also deployed to improve robustness. The
classification process uses a mixture of frame-based acoustic features: Mel-Frequency
Cepstral coefficients (MFCCs), pitch, and intensity. MFCCs are a set of features
commonly used in speech recognition applications, and they capture information in
the magnitude part of the speech spectrum. Pitch and intensity, on the other hand,
are features that capture information relating to the prosody of speech. Additionally,
a background model was introduced, in the form of another GMM trained on data
from all the cognitive load levels. All individual cognitive load level models are
adapted from it using the maximum-a-posteriori (MAP) estimation technique. Since
the background model models the basic feature distribution shared by all speakers
under all load levels, it is a good initial distribution from which to adapt models of
specific levels, and therefore improves the generalisation capabilities of models of
specific CL levels when training and/or test data are limited.
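The mean-only MAP adaptation step can be sketched as below, in the style of Reynolds-type adaptation of a diagonal-covariance UBM. The function name, the relevance factor value and the toy parameters are assumptions for illustration, not the exact configuration of our engine.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_vars, ubm_weights, data, r=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM (the
    background model / UBM) toward class-specific data. The relevance
    factor r controls how far each component mean may move."""
    M, D = ubm_means.shape
    # responsibilities: P(component m | frame t) under the UBM
    log_px = np.zeros((len(data), M))
    for m in range(M):
        diff = data - ubm_means[m]
        log_px[:, m] = (np.log(ubm_weights[m])
                        - 0.5 * np.sum(np.log(2 * np.pi * ubm_vars[m]))
                        - 0.5 * np.sum(diff ** 2 / ubm_vars[m], axis=1))
    log_px -= log_px.max(axis=1, keepdims=True)
    gamma = np.exp(log_px)
    gamma /= gamma.sum(axis=1, keepdims=True)
    n = gamma.sum(axis=0)                                # soft counts per component
    ex = gamma.T @ data / np.maximum(n, 1e-10)[:, None]  # per-component data means
    alpha = (n / (n + r))[:, None]                       # adaptation coefficients
    return alpha * ex + (1 - alpha) * ubm_means
```

Components well supported by the class data move toward it, while rarely used components stay close to the background model, which is what gives the adapted models their robustness under limited training data.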
Fig. 3. The base system structure for acoustic and prosodic CLM.
The classification accuracy for both of these databases using the baseline MFCC
cognitive load measurement system has been very positive. The Stroop test data
reveals an accuracy of 78.9% for classification into low, medium and high load in a
speaker-independent scenario [Yin, et al., 2008]. Similarly, over three discrete
cognitive level ranges, classification of the Reading and Comprehension dataset
(comprehension data) achieved a 71.1% accuracy in a speaker-independent closed-set
setting [Yin, et al., 2007].
MFCCs proved to be an effective set of baseline frame-based features for cognitive
load classification. However, MFCCs do not provide us with any insight into how
cognitive load affects the speech spectrum or the underlying speech production
system. Moreover, since MFCCs may have higher dimensionality than is strictly
required for the problem, it may be possible to achieve the same result using more
targeted sets of features. Voice source or glottal features have been investigated in an
attempt to link cognitive load to the speech production system [P. Le, et al., 2010;
T. F. Yap, Ambikairajah, et al., 2010] with some success. The system tested on
the Stroop test corpus, after fusing the scores of the baseline system (combination of
MFCC, pitch, and intensity) with the scores of a glottal parameter feature based
system [T. F. Yap, Epps, Choi, & Ambikairajah, 2010] produces an accuracy of 84.4%
on that dataset.
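The exact fusion rule is not detailed here, so the following sketch shows one common score-level fusion scheme: z-normalise each subsystem's per-class scores so neither dominates, then combine them with a weighted sum. The function name and the equal weighting are illustrative assumptions.

```python
import numpy as np

def fuse_scores(scores_a, scores_b, w=0.5):
    """Weighted linear fusion of two subsystems' per-class scores
    (e.g. an MFCC-based and a glottal-feature-based system). Each
    array is (n_utterances, n_classes); scores are z-normalised per
    system before mixing. The weight w is illustrative."""
    def znorm(s):
        return (s - s.mean()) / (s.std() + 1e-10)
    fused = w * znorm(scores_a) + (1 - w) * znorm(scores_b)
    return fused.argmax(axis=1)  # predicted load level per utterance
```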
Investigations conducted in order to assess the effectiveness of detailed spectral
features such as Spectral Centroid Frequency (SCF) [Paliwal, 1998], and Spectral
Centroid Amplitude (SCA) [P. Le, Ambikairajah, Epps, Vidhyasaharan, & Choi,
2011], as part of the cognitive load classification system, have also proven successful.
Inclusion of these features has resulted in improvements to the baseline
classification result. First, the Stroop test classification over three levels reaches
88.5% accuracy with the fusion of the SCF-based and SCA-based systems. Second,
the speaker-dependent fusion of the SCF-based and SCA-based systems achieves
84.3% accuracy on the Reading and Comprehension corpus.
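A minimal sketch of subband spectral centroid features follows, assuming a single windowed speech frame and an illustrative linear band layout (published SCF/SCA systems typically use perceptually motivated filterbanks, so treat this only as the shape of the computation).

```python
import numpy as np

def subband_spectral_centroids(frame, sr=16000, n_bands=4):
    """Per-subband Spectral Centroid Frequency (SCF) and Spectral
    Centroid Amplitude (SCA) of one windowed speech frame. The
    linear band layout here is illustrative."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    edges = np.linspace(0, len(spec), n_bands + 1, dtype=int)
    scf, sca = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        s, f = spec[lo:hi], freqs[lo:hi]
        total = s.sum() + 1e-10
        scf.append(float((f * s).sum() / total))  # magnitude-weighted mean frequency
        sca.append(float((s * s).sum() / total))  # magnitude-weighted mean amplitude
    return scf, sca
```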
A recent step has been to study the effect of cognitive load on the vocal tract
through an investigation of formant frequencies. Formant frequencies (the
frequencies at which broad spectral peaks occur in the magnitude spectrum of
speech) are closely related to the underlying configuration of the vocal tract. The
results show that 2-class (low and high) and 3-class (low, medium and high)
utterance-based evaluations on both of the databases, using frame-based formant
features, perform at least as well as the baseline system with MFCC features. This is
despite formant features having a dimensionality of 3 compared with MFCCs with a
dimensionality of 7 [T. F. Yap, Epps, Ambikairajah, & Choi, 2011]. This finding
suggests that cognitive load information can be captured using features with lower
dimensionality [T. F. Yap, et al., 2011], potentially reducing the amount of data
needed for training models. Combinations of features derived from formants and
speech production models have produced accuracies of up to 95% in more recent work
[T. Yap, 2011].
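Formant frequencies can be estimated from the angles of the complex roots of a linear-prediction polynomial. The sketch below uses the autocorrelation (Yule-Walker) method; the function name, LPC order and frequency threshold are illustrative, not the settings of the cited systems.

```python
import numpy as np

def lpc_formants(frame, sr=8000, order=8, n_formants=3):
    """Estimate the first few formant frequencies of a voiced frame:
    fit LPC coefficients from the autocorrelation (Yule-Walker),
    then convert the angles of the complex roots of the LPC
    polynomial to frequencies."""
    frame = frame * np.hamming(len(frame))
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Toeplitz autocorrelation system R a = r
    R = np.array([[ac[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, ac[1:order + 1])      # prediction coefficients
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]            # one of each conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)
    return sorted(f for f in freqs if f > 90)[:n_formants]  # drop near-DC roots
```

With only three formant values per frame, this is consistent with the observation that low-dimensional features can carry much of the load-related information.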
3.3 Real-Life Case Studies
As part of our research, collaborative industry partners (kept anonymous for
confidentiality purposes) have been engaged in order to obtain field data on which to
test our behavioural indices, in particular the speech features. The following case
studies show the viability of using speech-based cognitive load indices for
measurement and assessment with our system.
The first case study took place in an Emergency Communications Centre in North
America, where the operators are responsible for receiving information from police,
traffic authorities and other sources, and dispatching the relevant ambulance units.
A total of 37 working sessions were recorded from 10 participants during
training. Each 30 min. session contained a number of events under different
workload levels, each lasting approx. 10 seconds. All events were manually annotated
with an observed workload level during a post-review session by domain experts to
label the data for adaptation and evaluation purposes. Our speech-based cognitive
load measurement system successfully estimated the load level of participants with
an average accuracy of 82.2% over 3 levels of load. Moreover, the high load event
detection achieved a 95.9% hit rate and 4.1% false alarm rate.
In a second case study conducted at a large outsourced Contact Centre operator in
Australia (5000+ seats), high personnel attrition rates and associated hiring and
training expenses were key issues to be addressed. Our speech-based cognitive load
measure was used to investigate the correlation between tenure and demonstrated
load level under a series of tests. By analysing the speech responses of the potential
candidates, it was possible to predict whether the candidate was likely to perform
well as a contact centre operator, and hence was more likely to have a longer
tenure. A group of 191 freshly hired agents received the assessment, and the
attrition reduction was evaluated over 12 weeks. The overall attrition rate reached
18% at the end of week 12, while the attrition rate of the most suitable candidates, as
identified by the system, was only 9%, representing a relative 50% improvement
over existing assessments.
A third case study conducted on real-life training data involved air traffic
controllers from 3 different regional airports in Australia. The speech data produced
by the controllers in their communications with the pilots during a shift was
recorded. In addition, every two minutes, the controllers were also asked to report on
their current level of cognitive load. The tasks were designed to emulate different
difficulty levels, e.g. from routine landings of single flights to multiple landings in
inclement weather. Our speech-based cognitive load measurement was applied to the
collected speech, resulting in 98-100% accuracy in the identification of the cognitive
load level, according to the post-hoc labelling of the subjective ratings from the
operators, over all three airport locations.
3.4 Linguistic Features
In a complementary, semantically-driven approach, we have also examined the
linguistic features of speech for cues that indicate high cognitive load. It has been
shown that people’s selection of the language elements and linguistic features varies
from one situation to another depending on the circumstances of the situation
[Dechert & Raupach, 1980; Sexton & Helmreich, 2000]. We have been successful in
isolating a number of cognitive indices based on linguistic features that correlate
strongly with high load. The data examined here have been gathered from one of
three (and in many cases, more than one) scenarios: Reading and Comprehension lab
study; the Bushfire training team lab study; or the real-life Bushfire team training
field study. The linguistic features of interest are: Pause features, Grammar
features, Language Complexity features and Word features.
Pause Features. Traditionally in psychology, the pauses during natural speech
have been associated with a person’s thinking and cognitive processes. It is argued
that the more time it takes to produce the response, the more cognitive energy it
requires to do so [Schilperoord, 2001]. In other words, the increased amount of time
spent in pausing (and hence thinking) while talking represents the increased level of
cognitive load experienced [Esposito, Stejskal, Smékal, & Bourbakis, 2007;
Schilperoord, 2001]. We have found that people used more and longer pauses
(including both silent and filled pauses) under high cognitive load conditions versus
low load conditions. Furthermore, it was found that people’s response latency
increased confirming results from other studies in the literature [Berthold &
Jameson, 1999; Müller, et al., 2001].
Grammatical features. The use of personal pronouns has also been found to differ
significantly, specifically individual versus collective pronoun use in team-based
tasks. Four personal pronoun categories (1st person singular, 1st person plural, 3rd
person singular, and 3rd person plural) were examined in low vs. high load conditions.
The results show an interaction between usage of pronoun types (singular vs. plural)
and task difficulty (and so the cognitive load). People’s use of singular pronouns
decreased while their use of plural pronouns increased significantly when cognitive
load was high. As task difficulty increases, teams tend to share more of the work
[Kirschner, Paas, & Kirschner, 2009] and this behaviour is visible through their
pronominal usage preferences. A further analysis of the results confirmed that the
use of both singular personal pronoun words (1st and 3rd person) decreased while the
use of both plural pronoun words (1st and 3rd person) increased as cognitive load
increased.
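A simple transcript-level pronoun profile along these lines can be computed as follows; the function name and word lists are illustrative, not the study's exact categories.

```python
import re
from collections import Counter

# Illustrative pronoun categories; the study's exact word lists are
# not reproduced here.
PRONOUNS = {
    "1sg": {"i", "me", "my", "mine"},
    "1pl": {"we", "us", "our", "ours"},
    "3sg": {"he", "she", "him", "her", "his", "hers"},
    "3pl": {"they", "them", "their", "theirs"},
}

def pronoun_profile(transcript):
    """Count singular vs. plural personal pronoun usage in a
    transcript, normalised by total word count, as a simple
    linguistic load cue."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter()
    for w in words:
        for cat, vocab in PRONOUNS.items():
            if w in vocab:
                counts[cat] += 1
    total = max(len(words), 1)
    singular = (counts["1sg"] + counts["3sg"]) / total
    plural = (counts["1pl"] + counts["3pl"]) / total
    return {"singular": singular, "plural": plural, **counts}
```

A shift toward the plural categories in team transcripts would then be consistent with the load-sharing behaviour described above.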
Language complexity. The complexity of a written or spoken text or transcript can
be measured by two main factors: semantic difficulty and syntactic complexity
[Lennon & Burdick, 2004]. Our investigations show that while working
collaboratively and performing tasks of high difficulty, people speak more and use
longer sentences as the cognitive load increases. That is because under high
workload conditions, as things become more complex, team members communicate
more and provide more explanations as a strategy to deal with high task difficulty
[Katz, Fraser, & Wagner, 1998]. The language complexity measures we used include
lexical density [Chalker & Weiner, 1998; Ure, 1971], complex word ratio [Chalker &
Weiner, 1998], Gunning Fog Index [Gunning, 1952; Reck & Reck, 2007], Flesch-
Kincaid Grade [Flesch, 1948], SMOG Grade [McLaughlin, 1969], and Lexile Level
[Lennon & Burdick, 2004]. While these complexity measures have mostly been used
for written texts, e.g. articles and essays, we have successfully demonstrated their
use for measuring people’s cognitive load from their spoken or written texts
[Khawaja, Chen, & Marcus, 2010]. People’s lexical density, i.e. their use of unique
and different words (or vocabulary richness), decreases as cognitive load increases:
with fewer working memory resources available for the language processing task
[A. Baddeley, 2003], fewer unique words are retrieved from the pool of words stored
in long-term memory. A second result reveals that people’s spoken language became
more complex and difficult to comprehend under high load conditions; again, working
memory resources are allocated to the task itself rather than to speaking, and as a
result speech becomes more complicated and often comprises ill-formed sentences.
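Two of these measures are easy to sketch: lexical density, read here in the vocabulary-richness sense used above (unique words over total words), and the Gunning Fog index. The vowel-group syllable count is a rough approximation, and the function name is illustrative.

```python
import re

def complexity_measures(text):
    """Two illustrative language-complexity cues: lexical density
    (unique words / total words) and the Gunning Fog index, which
    combines mean sentence length with the share of 'complex'
    (3+ syllable) words. Syllables are approximated by vowel groups."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    def syllables(w):
        return max(1, len(re.findall(r"[aeiouy]+", w)))
    complex_words = [w for w in words if syllables(w) >= 3]
    density = len(set(words)) / max(len(words), 1)
    fog = 0.4 * (len(words) / max(len(sentences), 1)
                 + 100.0 * len(complex_words) / max(len(words), 1))
    return density, fog
```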
Word categories and valence. Qualitative investigations into word category usage
show that while performing a collaborative task, as the cognitive load increased,
people’s use of negative emotion words significantly increased and their use of
positive emotion words decreased significantly. The analyses also show that people
used, on average, fewer emotion words overall (either negative or positive) under
high load situations as compared to low cognitive load. More importantly, we found a
significant interaction between the emotion words (negative vs. positive) and
cognitive load levels (low vs. high). People working in groups and exhibiting negative
emotions spend more time negotiating and engaging in more group discussions
[Donner & Hancock, 2011], which also supports our previous findings of increased
word count in high load. Members of teams experiencing high load used significantly
more cognitive mechanism words (e.g. think, consider, know, remember) and
perceptual words (e.g. hear, view, touch) than those under low load conditions,
reflecting their increased mental effort and concentration on the task. Similarly,
their use of conflicting and disagreement words (e.g. no, wrong, never) increased
significantly while there was a significant decrease in the use of agreement words
(e.g. ok, right, fine). Other studies show that people experiencing negative emotions
tend to show more disagreement [Hancock, Gee, Ciaccio, & Lin, 2008].
The linguistic and grammatical features can be used as an individual set of cognitive
load indices in the domains where speech and/or conversational transcripts are used
as the main input modalities. Cognitive load measurement from the proposed
linguistic features will require state-of-the-art automatic speech recognition (ASR)
technology with highly accurate automatic speech-to-text (STT) functionality to
realise its potential in practical applications. Our linguistic approach to measuring
cognitive load may also be used as a post-hoc analysis technique for user interface
evaluation and interaction design improvement, in addition to the acoustic speech
analysis of load.
4. DIGITAL PEN-BASED FEATURES OF COGNITIVE LOAD
Digital pen or digital ink is becoming an increasingly popular method for interaction
in specialised systems, beyond use as a pointing device. Recognition accuracy has
markedly improved for handwriting purposes, symbolic drawing and gesture
recognition, and sketching and visualisation, as well as mark-up/annotation
applications. An intuitive input mechanism, pen usage is said to be spontaneous,
allowing users to produce 2D representations almost as quickly as they are envisaged
[Oviatt, 2009; Schwartz & Heiser, 2006]. In particular, pen input can support
thought-organising activities such as counting, ordering, grouping, labelling and
showing relationships, helping complex problem solving in high load contexts [Oviatt,
2006], and is thus an ideal candidate for capturing symptomatic input changes.
Geometric and temporal features of ink trajectory can be used as potential cues
suited for automated extraction, while higher level recognition of characters and
meaning can be compiled in a post-hoc manner to offer an alternative view of how
pen input is affected by high load tasks. Our investigation of cognitive load
assessment features from digital pen spans three types of input: pen gesture input of
predefined shapes, handwriting and free-form pen input, including note-taking and
sketching.
4.1 Symbolic Pen Gesture Features
Symbolic gestures refer to pen input methods that require the user to reproduce a
specific 2D shape modelled on a predefined shape to trigger a specific function within
an intelligent interactive system. Our motivation for examining pen-input features
when cognitive load is high is based on the premise that a user’s performance is
likely to be affected at a fine-grained level, where the quality of their motor
productions may diminish in much the same way as their speech signal, due to low
working memory resources [Ruiz, 2011]. In fact, empirical evidence we have collected
shows that the degeneration of geometric features in pre-defined pen-gesture shapes
increases significantly when cognitive load is very high, suggesting that a cognitive
load index could be derived from such a measure [Ruiz, Taib, Shi, Choi, & Chen,
2007].
Pen-gesture ink trajectories used in our analysis were collected using a custom
interactive application, where users were required to build alternative routes on a
map. The cognitive load factor was manipulated through three levels of task
complexity, requiring users to satisfy increasing sets of constraints related to the
distance and traffic congestion of the roads along possible alternative routes [Ruiz, et
al., 2007]. We defined three predefined functional pen-gesture inputs: a traffic flow
query, a distance cost query, and a toggle function marking the start and end
of the route being constructed. These were invoked when the user drew any of the
possible pen inputs, shown in Fig. 4 below, on the map area. It was expected that
when cognitive load increased, the shapes produced as part of the user’s input would
degrade in quality, that is, differ more significantly from the standard form of that
shape. Dissimilarity could be due to asymmetry, jittery strokes, or generally ‘messy’
script. Users produced a set of pen gesture inputs in a ‘no-load’ task – essentially 10
instances of each type of gesture on a blank screen, in order to create a standard form
for that user from these trajectory instances.
Fig. 4. Pen gestures - Traffic flow, distance cost and toggle route.
The Mahalanobis distance (MDIST) was used to measure the level of degeneration of
each single-stroke shape instance to the standard form [Rubine, 1991; Ruiz, 2011].
MDIST is most often used for recognition purposes, to classify a sample input into
its correct type; however, we were able to leverage this function by instead recording
the degree of difference between each input and the baseline form of that pen-shape
type for each user. MDIST is a statistical measure of how similar
an unknown vector of features is to a known vector of features, and considers the
correlations between the features being assessed [Rubine, 1991]. This is done via a
classification technique that calculates MDIST as the weighted Euclidean distance
from the vector of sample points (each input trajectory) to a standard model created
for the inputs of that type, for each user. Subsequent pen inputs, produced during
task time, were generated in each load level and MDIST was used to quantify the
degree of dissimilarity between the standard form and each specific instance. Hence,
the greater the distance from its own type, the higher the MDIST value and
geometric difference between the sample and the standard. The changes in MDIST
have been found to be statistically significantly different between low cognitive load
and high cognitive load tasks [Ruiz, 2011; Ruiz, et al., 2007].
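A minimal sketch of this degeneration measure follows: the Mahalanobis distance from a task-time gesture's feature vector to the user's own no-load baseline. The function name and the regularisation term are illustrative, and the geometric features themselves (e.g. Rubine-style features) are assumed to be extracted elsewhere.

```python
import numpy as np

def mahalanobis_degeneration(baseline_feats, sample_feat):
    """Mahalanobis distance from one pen-gesture feature vector to the
    user's baseline ('standard form') for that gesture type; larger
    values indicate more geometric degeneration. baseline_feats is
    (n_samples, n_features) from the no-load calibration task."""
    mu = baseline_feats.mean(axis=0)
    cov = np.cov(baseline_feats, rowvar=False)
    # regularise so the covariance stays invertible for few or
    # highly correlated features
    cov += 1e-6 * np.eye(cov.shape[0])
    diff = sample_feat - mu
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```

Because the covariance is estimated from each user's own calibration gestures, the same computation yields a personalised measure for every gesture type.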
Some examples of a standard form and sample inputs are shown in Fig. 5 below. A
MDIST result of zero is interpreted as a perfect replicate of the standard form of the
shape for that user. The set of MDIST values generated by the system was computed
using a variant of Rubine’s specification for a classification-based gesture recogniser
[Rubine, 1991].
Fig. 5. Standard form and degenerate samples.
Using MDIST as a proxy for the geometric degeneration of pen-shapes has a number
of useful properties. Firstly, it can combine a number of individual geometric features
into a single value measurement of degeneration. This means that the trajectory
changes from the baseline, or standard form, can occur in any or all of the features
used by the classifier. MDIST can combine very small geometric changes in multiple
features to register a significant combined change from the standard, and this type of
deformation can be compared with large geometric change in a single feature. In this
sense, it is a ‘true’ measure of degeneration of the shape, with respect to any of the
features used. Similarly, the same set of geometric features can be used uniformly to
measure degeneration of all types of single stroke pen-shapes, hence providing
comparable results for a wide variety of shapes and allowing us to group the data to
obtain higher level of confidence in the assessment of degeneration.
In high load situations, a combination of both high intrinsic and high extraneous
load can negatively affect both intrinsic and extraneous types of processing, and this
is reflected in decreased performance scores (intrinsic) as well as degradation in the
quality of modal productions (extraneous). The degree of degradation as measured by
MDIST can provide an indication of increased load between extreme load levels (low
and high) for 85% of subjects, regardless of expertise. This measure is also
personalised: each user has their own standard form, and the baseline can be
updated over time, so that the user is not penalised for learning to use the pen input
more efficiently, e.g. holding the pen differently or more comfortably [Ruiz, 2011].
4.2 Handwriting Features
While the technical challenge of handwriting recognition has received ample
attention from HCI researchers, analysis of form and structure of handwriting itself
has received relatively little, and certainly not in the context of cognitive load
assessment. Research into cognitive load during handwriting is important for
improving the performance and experience of users in pen-based interactions
[Frankish, Hull, & Morgan, 1995]. Our team’s investigations have resulted in the
first pen-based cognitive load classification engine for handwriting.
According to handwriting experts, the writing process comprises three distinct
phases: planning, translating and reviewing [Vanderberg & Swanson, 2007]. Each of
these processes places its own demand on working memory resources at subsequent
stages of writing. We hypothesised that, in the same way as low-level trace features
were found to change in pen-gesture inputs, stroke-level changes may also be
detectable in handwriting produced under high load. An initial
attempt to statistically analyse the stroke-level features of velocity, length and
pressure information with respect to increasing cognitive load showed that local
maximum writing pressure, and local minimum writing velocity for strokes in
particular are sensitive to the cognitive load of the writer.
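These stroke-level features can be sketched as follows, assuming one stroke's time-stamped trajectory and pressure samples are available from the digital pen; the function name and the simple neighbour-comparison definition of a local extremum are illustrative.

```python
import numpy as np

def stroke_extrema(x, y, pressure, t):
    """For one pen stroke (time-stamped trajectory with pressure),
    return indices of local pressure maxima and local velocity minima,
    the stroke features reported as sensitive to writer load."""
    dt = np.diff(t)
    # point-to-point speed along the trajectory
    v = np.hypot(np.diff(x), np.diff(y)) / np.maximum(dt, 1e-9)
    p = np.asarray(pressure)
    p_max_idx = [i for i in range(1, len(p) - 1)
                 if p[i] > p[i - 1] and p[i] > p[i + 1]]
    v_min_idx = [i for i in range(1, len(v) - 1)
                 if v[i] < v[i - 1] and v[i] < v[i + 1]]
    return p_max_idx, v_min_idx
```

Statistics of these extrema (e.g. their values and counts per stroke) would then serve as the candidate load features.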
The findings are based on a study where subjects completed a sentence
composition task using sets of given words. The number of words that were to be
used in each sentence increased in each of the three complexity levels. Subjective
ratings confirmed that the levels induced appropriate cognitive load differences. The
handwriting dataset produced included approximately 600 handwritten sentences.
Sentences from two subjects are shown in Fig. 6 as examples.
Fig. 6. Handwriting samples for a high load task from two different users [Yu, Epps,
& Chen, 2011c].
The analysis shows that maximum writing pressure tends to occur at the beginning,
at the corners and at the end of strokes, where the minimum writing velocity is
observed concurrently [Yu, Epps, & Chen, 2011a]. This could be attributed to the
shaping of alphanumeric letters, as it appears that writers experience higher
cognitive load when forming the shapes than when producing straight parts of a
stroke [Yu, et al., 2011a]. Straight sections do not require a change in the direction of
the pen trace, leaving the writer spare resources for other cognitive processes, such
as reviewing previously written material. This may suggest that cognitive load can
potentially fluctuate even during the process of a stroke, correlated with the tempo of
stroke construction [Yu, et al., 2011a].
A second attempt has been made to use sample-based rather than stroke-based
features from the ink trace [Yu, Epps, & Chen, 2011b]. Specifically, this meant
examining each writing point as a set of attributes including time-stamped trajectory
coordinates and pressure of the pen-tip, and the orientation of the pen tip [Yu, et al.,
2011b]. This information was modelled using Gaussian mixture models. Taking the
combination of pressure and azimuth as features of the pen trace and using the same
classifier, the application of altitude intervals improved the classification accuracy
from 50.1% to 63.5%. Also, a particular span of pen altitudes, corresponding to about
12% of the writing samples, was found to produce a higher cognitive load
classification accuracy of 75.4%. Using altitude to sort the samples used in the
models resulted in significant improvement, which signified that for samples with
similar altitude, their pressure and azimuth attributes are sensitive to cognitive load
changes. This finding could potentially decrease the computational cost of pen-based cognitive load classification [Yu, et al., 2011b].
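A simplified, numpy-only sketch of this sample-based approach is given below: a single Gaussian per load level over (pressure, azimuth) features, with samples first filtered to a fixed altitude span. The original work used full Gaussian mixture models; all data, spans and feature values here are synthetic.

```python
# Simplified stand-in for the sample-based classifier: one diagonal Gaussian
# per load level over (pressure, azimuth), fitted only on samples whose pen
# altitude lies within a chosen span. All numbers below are invented.
import numpy as np

def fit_gaussian(rows):
    """Diagonal Gaussian over the (pressure, azimuth) columns."""
    return rows.mean(axis=0), rows.var(axis=0) + 1e-6

def log_likelihood(model, rows):
    mu, var = model
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + (rows - mu) ** 2 / var)))

def train_models(samples_by_level, alt_lo=45.0, alt_hi=60.0):
    """samples_by_level: {level: array of (pressure, azimuth, altitude) rows}."""
    models = {}
    for level, rows in samples_by_level.items():
        keep = rows[(rows[:, 2] >= alt_lo) & (rows[:, 2] <= alt_hi)]
        models[level] = fit_gaussian(keep[:, :2])
    return models

def classify(models, rows):
    """Pick the level whose model best explains the pressure/azimuth samples."""
    return max(models, key=lambda lvl: log_likelihood(models[lvl], rows[:, :2]))

# Synthetic data: the 'high' load condition has heavier pen pressure.
rng = np.random.default_rng(0)
low = np.column_stack([rng.normal(0.3, 0.05, 200), rng.normal(90, 5, 200), rng.uniform(40, 65, 200)])
high = np.column_stack([rng.normal(0.7, 0.05, 200), rng.normal(90, 5, 200), rng.uniform(40, 65, 200)])
models = train_models({"low": low, "high": high})
```

The altitude filter mirrors the reported finding that restricting attention to a span of pen altitudes improves accuracy.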
As a non-intrusive supporting component, a cognitive load measurement module
based on handwriting with a digital pen can provide a useful reference to control the
difficulty level of tasks where a writing requirement exists. In order to use the
engine, a user would initially need to set up a profile to log the individual
characteristics of their handwriting (e.g. altitude distribution) by producing a
sufficient amount of text [Yu, et al., 2011b]. The quantity of text needed is a trade-off between profiling time and system accuracy [Yu, et al., 2011b].
4.3 Free-Form Pen Features
The use of a digital pen for freeform note-taking, including sketches, symbolic diagrams and other miscellaneous “doodles”, is less common in high-pressure control room environments, despite the fact that many operators use an array of low-tech tools, such as physical paper and pen, to support their work processes; indeed, this has been found in a variety of domains [Lajoie, 2000; Schwartz & Heiser, 2006]. In our
industry case studies, in particular with a large traffic monitoring centre, we have
seen that some of the work processes are duplicated, with operators transferring
information organised on paper notes into the system after the fact. At the same
time, freeform pen input recognition is not yet mature or robust enough for
deployment in mission-critical systems. Nevertheless, it is possible that freeform pen
input can provide further insights into cognitive effort. In much the same way as we
expected the quality of pen-gesture inputs and handwriting to be affected by high
mental demand, we expected there to also be changes in low-level temporal features
of freeform pen input to signify reduced resources available for cognitive processing
[Ruiz, 2011]. Indeed, our investigations reveal significant changes in stroke
frequency to be correlated with low versus high cognitive load [Ruiz, Taib, & Chen,
2011]. We also found that the discrepancy between stroke frequencies under low and
high load is reduced with expertise. These results indicate that pen stroke frequency,
which can be automated, could be used as an indicator of cognitive load, or
conversely, of expertise level [Ruiz, et al., 2011].
The analysis was carried out on a dataset of freeform pen input, generated as part of the same user study as the pen gesture input data, with tasks at three levels of increasing load (sample inputs are shown in Fig. 7). In contrast
with symbolic pen gesture inputs however, freeform pen markings were used solely
in the digital notepad area and did not trigger any specific functionality. The role of
the digital “notepad” was simply to emulate low-tech tools such as pen and paper.
Fig. 7. Sample digital notepad input.
Due to the high variability in content matter in each user’s scratchpad, the analysis
needed to be based on features sufficiently abstracted from the task content and
semantics of the data itself. The features we investigated were based on the pen trajectories and the task time, chosen to ensure that the content from all subjects could be judged equally. Using normalised stroke frequency measures (strokes per second),
we found a main effect of cognitive load, where the frequency increased as cognitive
load increased [Ruiz, et al., 2011]. This signified that operators were writing much
faster and relying on the digital notepad much more as cognitive load increased
[Ruiz, et al., 2011].
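The normalised stroke-frequency measure itself is straightforward to compute; a minimal sketch, assuming hypothetical pen-down/pen-up timestamps:

```python
# Minimal sketch of the stroke-frequency measure: strokes per second over a
# task, computed from pen-down/pen-up event timestamps. The event format is
# hypothetical, not the study's logging format.
def stroke_frequency(stroke_times, task_duration_s):
    """stroke_times: list of (pen_down_t, pen_up_t) pairs for one task."""
    if task_duration_s <= 0:
        raise ValueError("task duration must be positive")
    return len(stroke_times) / task_duration_s

strokes = [(0.0, 0.4), (0.9, 1.3), (2.0, 2.2), (3.1, 3.5)]
freq = stroke_frequency(strokes, task_duration_s=8.0)   # 4 strokes over 8 s
```

Normalising by task duration is what allows tasks of different lengths to be compared on the same scale.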
However, we also found that the discrepancy between stroke frequencies under
low and high loads reduced with expertise, with both eventually converging. This
suggests the possibility of using a convergence of stroke frequency in spite of varying
task complexity to diagnose gains in expertise – increased expertise indicates an
improvement in schema representations in working memory and the fact that
learning has taken place [Ruiz, et al., 2011]. Noting the fact that stroke frequency
can easily be extracted in real-time and unobtrusively using a tablet monitor or
electronic pen, not only can this measure be applied to assess cognitive load levels,
but also to detect when a user has acquired enough expertise on a given task or
interface and hence allow them to progress to the next level of complexity [Ruiz, et
al., 2011].
5. PROPERTIES OF BEHAVIOURAL INDICES OF LOAD
Previous work on mental load measures, most notably by O’Donnell and Eggemeier
[O’Donnell & Eggemeier, 1986], Wickens and Hollands [Wickens & Hollands, 2000],
Kramer [Kramer, 1991] and Gopher and Braune [Gopher & Braune, 1984], has
sought to describe them using a series of properties, to enable comparisons to be
made such that the most appropriate measure is chosen for any situation. These
include: Sensitivity, Diagnosticity, Primary Task Intrusion, Implementation Requirements, Operator Acceptance, Selectivity, and Bandwidth and Reliability.
In the context of our work, other properties have also proven useful in classifying behavioural indices, namely: the potential for implementing the measurement in real time; the provision and use of contextual information in interpreting the index or measure; dimensionality; and temporal scales.
Closely related is the issue of weighting methods for each of the individual modal
index types and their sub features during fusion approaches. Weighting can be based
on the task context, or can be based on the sensitivity or diagnostic power of the
index, feature or modality from which it is sourced [Ruiz, 2011]. Confidence levels for
the reliability of each index, as well as the combined multimodal index can be
provided on a task or user basis, depending on the quality of calibration and index
combination types available [Ruiz, 2011]. Other limitations can occur when combining data derived from inputs with varying sampling rates.
5.1 Real Time Potential
One of the main goals of this work was to produce an automated method for cognitive
load assessment which would allow the measure to have a high potential for being
implemented and used in real time. The basic requirement then is that the features
used for assessment be extracted automatically, without the need for labelling or
human intervention. The features used to derive each of the modal indices presented
here can be fully automated – the process from extraction to assessment can be done
on the fly with very good results. Both types of indices (speech based and digital pen
based) require some level of user calibration at the very beginning in order to
improve the accuracy of the results. However, this process is quite simple in all
instances and not at all prohibitive either in terms of time or effort. Individual
measures within these modal index types may have differing levels of automation,
e.g. some may require a baseline or bootstrapping sequence at initialisation.
5.2 Temporal Scales
Whether the features themselves offer discrete or continuous assessments will also
affect how they are combined in a multimodal index. Information about how often the
feature is updated or refreshed will need to be included as part of that single
modality index information profile and serve as a reference as to which others it can
be combined with. For example, in a constant speech-signal monitoring scenario, the
acoustic speech index can be updated after every 2 seconds of active speech or less. In
contrast, the linguistic index based on the same raw input will have a longer update
response lag since the speech needs to be transcribed on the fly and a minimum
amount of data needs to be accumulated before classification, e.g. a complete
sentence, phrase, or word. Hence, specific temporal update windows will need to be
defined for specific modal index combinations. The granularity at which each type of index is refreshed or updated can be increased by implementing sliding-window algorithms on the streaming signal, with these higher-rate estimates initially weighted less heavily during the fusion stages. Many of the indices explored have displayed significant differences
average-based or rate-based features (e.g. MDIST, freeform pen), which means they
have the potential to be indicative even with fewer samples than those collected here.
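A sliding-window refresh of a streaming index can be sketched as below; the window and hop sizes, and the per-sample scoring function, are illustrative assumptions rather than values from our studies.

```python
# Sketch of a sliding-window refresh for a streaming index. A per-sample
# scoring function is assumed to exist; window/hop sizes are examples only.
from collections import deque

def windowed_index(stream, score_fn, window=20, hop=5):
    """Yield an index value every `hop` samples over the last `window` samples."""
    buf = deque(maxlen=window)
    for i, sample in enumerate(stream, start=1):
        buf.append(sample)
        if len(buf) == window and i % hop == 0:
            yield sum(score_fn(s) for s in buf) / window

# Toy stream: 100 samples scored by their own value.
scores = list(windowed_index(range(100), score_fn=float, window=20, hop=5))
```

Shrinking the hop raises the update rate at the cost of more strongly overlapping (and thus correlated) estimates, which is why such estimates would initially be weighted less heavily in fusion.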
5.3 Dimensionality
Dimensionality refers to the number of features, or individual measures, included in a
single index. For example, the MDIST measure is a single reading which combines
the information from 12 separate information points derived from a trajectory.
Particularly when using multi-dimensional indices, weighting of each sub-feature can
be a make-or-break factor, especially if the availability or quality of the sensor data on which those features are based cannot be guaranteed. Further, in early fusion
implementations, the dimensionality of the features will certainly have an effect on
how often they can be updated and combined with each other and contribute to the
final multimodal index: features with high dimensionality and high update frequency
can be revised much more often and potentially a higher level of confidence can be
attached to such an index.
5.4 Contextual Information
Any multimodal index of load will need to be tightly coupled with the task context
and workflow process. An understanding of the task flow is imperative because the
indices presented here cannot be implemented as a universal solution applicable in
all scenarios. For example, the linguistic indices are most useful in think-aloud data,
or human-human communication between operators, but would be completely
ineffective when applied on command and control speech input. Therefore, the
contextual information will need to be closely derived from the work process, to select
the indices most likely to be a) present, b) reliably collected, and c) the best match in
each task scenario. In other cases, more than one modal indicator may be activated,
for example when an operator is speaking on the phone, the handwriting indices and
a continuous speech signal index may be equally effective. Similarly, the contextual
information can be considered on a per-user basis – user profiles can notify the
system of individual preferences (e.g. Operator A does not like to use pen input or
handwriting; Operator B is quite reliant on pen for taking notes, but not interacting
with the system or issuing commands).
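Such context- and profile-driven selection could be realised with a simple rule table; the contexts, index names and profile fields below are invented purely for illustration.

```python
# Toy rule-based selector for which indices to run, driven by task context and
# a per-user profile. The rule set and profile fields are hypothetical.
RULES = {
    "phone_call":      {"acoustic", "linguistic", "handwriting"},
    "command_control": {"acoustic", "pen_gesture"},
    "think_aloud":     {"acoustic", "linguistic"},
}

def select_indices(context, user_profile):
    """Indices plausible in this context, minus those the user's profile rules out."""
    candidates = RULES.get(context, set())
    return candidates - set(user_profile.get("disabled_indices", ()))

# e.g. an operator who does not use pen input for handwriting:
operator_a = {"disabled_indices": ["handwriting"]}
active = select_indices("phone_call", operator_a)
```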
5.5 Domain Independence
The advantage of the kinds of behavioural indices for cognitive load that are
presented here is that they largely remain domain-independent. For example, among
the speech features, the acoustic indices appear to be relatively successful regardless of the domain (as shown by the variety of lab experiments conducted and the real-life case studies presented). The pen features, such as handwriting, can also
be applied in any domain, as long as the user is using a digital pen that is
instrumented with the right sensors to capture the relevant features that comprise
the index. Domain-independence refers to how tightly coupled the features are to the
domain in which they were observed – high domain independence means a loose
coupling exists, whereas low domain independence means a tight coupling exists
between the dataset and the domain.
5.6 Summary
Table 1 and Table 2 below summarise the properties previously defined in the
literature as well as the pertinent new ones we defined.
Table 1. Properties previously defined in the literature.

Sensitivity: Capability for discriminating significant variations along the workload continuum (refer to Fig. 1 and Fig. 2).
Diagnosticity: Capability for discriminating the specific computational process causing the load changes.
Primary Task Intrusion: Whether the task workflow is interrupted or not.
Implementation Requirements: How difficult the measure is to implement within context, including any operator training or instrumentation required.
Operator Acceptance: Willingness of operators to follow instructions.
Selectivity: Whether the measure is sensitive to mental workload only, or also to physical changes.
Bandwidth and Reliability: The workload estimate must be reliable both within and across tests.
Table 2. Five more pertinent properties we define.

Real-time potential: Is it possible to automatically extract these features from the input signal?
Temporal Scales: How often can this feature be updated?
Dimensionality: How many dimensions are in this measure?
Contextual Information: Information regarding which indices can be combined with which others for maximum effect, and which should not be combined at all.
Domain independence: How tightly coupled is this measure to the domain it was observed in?
Table 3 and Table 4 summarise the measures for the most pertinent previously
defined properties and three of the new properties. Implementation Requirements,
Operator Acceptance, Bandwidth and Reliability, Contextual Information and
Domain Independence can only be assessed meaningfully with reference to a specific
application domain, and hence are not included in this table.
Table 3. Speech based measures and indices.

Linguistic Features

Pauses: Sensitivity: Low, High. Diagnosticity: High. Primary Task Intrusion: Low. Selectivity: Med. Real Time Automation: High (using voice activity detection). Temporal Scales: significant pauses take around 0.3 s; sliding window every 5 seconds. Dimensionality: ~3 individual features (total duration, frequency and avg. length).

Pronouns: Sensitivity: Low, High. Diagnosticity: High. Primary Task Intrusion: Low. Selectivity: High. Real Time Automation: Med (dependent on recogniser). Temporal Scales: word level; sliding window every sentence or phrase, using voice activity detection. Dimensionality: single feature.

Complexity: Sensitivity: Low, High. Diagnosticity: High. Primary Task Intrusion: Low. Selectivity: High. Real Time Automation: High (dependent on recogniser; use voice activity detection). Temporal Scales: word, sentence or phrase level. Dimensionality: ~3 measures of complexity.

Category: Sensitivity: Low, High. Diagnosticity: High. Primary Task Intrusion: Low. Selectivity: High. Real Time Automation: High (dependent on recogniser and word categories; sliding window every task). Temporal Scales: word level. Dimensionality: 2-10 significant categories.

Valence: Sensitivity: Low, High. Diagnosticity: Low. Primary Task Intrusion: Low. Selectivity: High. Real Time Automation: Med to High (dependent on recogniser; sliding window every task). Temporal Scales: word level. Dimensionality: 2 significant categories.

Acoustic Features

Acoustic: Sensitivity: Low, Normal, High+. Diagnosticity: High. Primary Task Intrusion: Low. Selectivity: Med. Real Time Automation: High. Temporal Scales: 2-10 second windows yield excellent results. Dimensionality: High (>72 features).
Table 4. Other indices and measures.

Handwriting

Velocity, Length, Pressure, Orientation, Altitude, Azimuth: Sensitivity: Low, Normal, High+. Diagnosticity: Med to High. Primary Task Intrusion: Low. Selectivity: High. Real Time Automation: High (per-stroke basis). Temporal Scales: High (per-stroke basis). Dimensionality: ~6 individual features per stroke.

Frequency: Sensitivity: Low, Normal, High+. Diagnosticity: Med to High. Primary Task Intrusion: Low. Selectivity: High. Real Time Automation: High (per-stroke basis). Temporal Scales: Med to High (dependent on segmentation). Dimensionality: ~3 individual features per stroke.

Symbolic

MDIST: Sensitivity: Low, High. Diagnosticity: Med to High. Primary Task Intrusion: Low. Selectivity: High. Real Time Automation: High (per-symbol basis). Temporal Scales: High. Dimensionality: ~12 features per stroke.

Custom geometric features: Sensitivity: Low, High+. Diagnosticity: Med to High. Primary Task Intrusion: Low. Selectivity: High. Real Time Automation: High (per-stroke basis; dependent on segmentation scheme). Temporal Scales: High (dependent on recogniser). Dimensionality: variable.

GSR

Mean: Sensitivity: Low, Med, High+. Diagnosticity: High. Primary Task Intrusion: Med (can use embedded sensors). Selectivity: Low. Real Time Automation: High (sampled every 100 ms). Temporal Scales: High. Dimensionality: variable.
6. MULTIMODAL INDICES OF LOAD
Given previous successes in finding features from pen and speech input that allow us
to differentiate cognitive load levels for up to 3 levels of load, the next step is to apply
a multimodal index of load that combines output from different sources. Correlations
between single-modality indices offer a way in which to introduce redundancy and
robustness to a multimodal index of cognitive load. Dual-modality indices working together in a complementary fashion, such as speech-signal-based classification and the degree of degeneration of pen input, are likely to align quite well, reinforcing each other. However, there are a number of aspects that need to be considered in the
development of a multimodal index of load, for example, whether early or late fusion
approaches are used. At an abstract level, multimodal indices can be derived in four
ways [Ruiz, 2011]:
(1) Combining component features within each modality, e.g. combining within pen-input features such as stroke frequency, MDIST or altitude span;
(2) Combining component features across modalities, e.g. combining stroke frequency (from pen) with use of singular pronouns (linguistic);
(3) Combining index results between modalities, e.g. pen-only assessment vs. speech signal-only assessment;
(4) Using a combination of any of the three methods above.
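Option (3), combining index results between modalities, can be sketched as a confidence-weighted vote; the modality names, load levels and weights below are illustrative assumptions.

```python
# Sketch of late fusion of per-modality load assessments via a weighted vote.
# Modality names, levels and weights are invented for illustration.
def fuse_indices(assessments):
    """assessments: {modality: (predicted_level, weight)} -> fused level."""
    totals = {}
    for level, weight in assessments.values():
        totals[level] = totals.get(level, 0.0) + weight
    return max(totals, key=totals.get)

fused = fuse_indices({
    "speech_acoustic": ("high", 0.6),
    "pen_handwriting": ("medium", 0.3),
    "linguistic":      ("high", 0.4),
})
```

The weights here stand in for the per-index confidence and diagnostic power discussed in Section 5.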
6.1 An Abstract Model for Multimodal Assessment
Fig. 8 depicts a high level functional model of a proposed Cognitive Load
Measurement (CLM) system. The abstract system model embodies four high level
processes: pre-processing and data cleaning, feature extraction, load assessment and
index fusion. The great advantage of multimodal behavioural indices of cognitive load
is that they are derived from activity already undertaken as part of the task, and
thus can be collected implicitly, or ‘passively’ [Zander & Kothe, 2011]. The raw
modality input sources are first and foremost intended for purposes other than
cognitive load measurement, specifically to do with the domain application.
Fig. 8. High level functional model of a multiple modality CLM system.
For example, the data may be used for semantic interpretation or rendering (e.g. in
the case of command and control speech or interactive pen gestures). The data may
therefore need to be duplicated and diverted with the original stream sent to the
recognisers, and a secondary stream sent to the Cognitive Load Measurement engine.
In Fig. 8, speech input data is first captured through a close-talk microphone. This
generates two kinds of data, speech signal data (e.g. a wav file) and text (through a
speech to text engine). Likewise, pen input data is collected as trajectory tuples,
including pressure, pen orientation and other information transmitted directly from
the device drivers, alongside system timestamps.
Data pre-processing and data cleaning refers to any reformatting or restructuring of the input data, or removal of unnecessary information: for example, outliers or segments that are too short for geometric and temporal analysis, words not recognised in the text, and words that are not used in the analysis. Input
streams from other modalities will follow the same processes. Similarly, a number of
other non-behavioural indices will also undergo pre-processing as needed; these
include indices that may also be used in the process, such as galvanic skin response,
or other body-based data, such as posture, movement or temperature. Environmental
and other external context information may also be provided to the CLM system for
enhanced performance at this point.
The second stage involves streaming the individual modal input into their
respective feature extraction components. The same data may be used for multiple
feature extraction components, while other extraction components may not be
activated, depending on domain-specific contextual information gathered from the active applications and workflow diagrams established a priori. This will allow the
feature extraction engine to choose the most appropriate modules to activate for each
incoming input stream. For example, if the incoming speech is sourced from a phone
call or radio conversation, the feature extraction component will activate both MFCC
and prosodic feature extraction as well as the linguistic category extraction
components, since both can provide meaningful measures on this kind of data. On the
other hand, if the incoming speech is sourced from command and control input, only
MFCC and prosodic feature extraction will be activated, as the linguistic categories
cannot provide any meaningful cognitive load measurement information on short,
closed vocabulary, single word speech.
The third stage involves the decision-making aspect of the process, where
thresholds are invoked and the appropriate models for each modality are selected
from the database from which to carry out the classification. For example, for the
speech signal based cognitive load measurement, different models are required for
single word cognitive load classification versus continuous speech classification.
Likewise, different MDIST models exist for each shape, and also for each user. Any
calibration data that is needed for classification or for comparison purposes is also
accessed at this point.
The final stage involves the fusion of indices resolved from the previous stage. The
assessment results obtained from each modality can also convey confidence
information to support the fusion process. The fusion engine accesses information
regarding the modality load assessment combination rules in each specific context,
e.g. whether the time-windows for the collected inputs are compatible; which indices
are complementary with which others; and the appropriate weightings for each
index, given the scenario and the user situation. Fig. 8 shows how mid- and late-
fusion may be achieved from a set of cognitive load assessments from each of the sub-
features. Mid-level fusion, for example, is achieved by combining multiple
assessments that are based on the same input modality, for example, speech based
and linguistic assessments. Late fusion for a multimodal index can be likewise
achieved by combining the results from all the features individually (regardless of
input modality), or combining the input modality subgroup from the mid-level fusion
results. The final output from the CLM engine can then be passed onto the output
generation system in order to implement appropriate adaptation strategies.
We now present a user study illustrating the applicability of this model to
multimodal data processing.
6.2 Basketball Skills Training
In order to illustrate how a multimodal cognitive load measurement system could
work, we now present a lab-based study in which cognitive load and complexity were
manipulated, and multiple behavioural modalities were recorded. The objective is to assess how well individual and combined modalities can reflect levels of cognitive
load, and provide a concrete application for our multimodal cognitive load
measurement model.
Elite athletes at the Australian Institute of Sport (AIS) are required to complete
cognitive skills training using a targeted sports-specific software application called
AISReact [Mackintosh, 2010]. While the training aims at ever faster situation analysis and decision making through the construction of better mental schemas, it is desirable to precisely determine onsets of very high cognitive load in order to adapt the training
rate to each individual athlete. In this experiment, we modified the software to
accept pen based interaction, and added the modalities of speech and eye-activity. In
addition, performance (accuracy) measures, physiological signals (GSR) and
subjective ratings were also collected to establish a ground truth for cognitive load
and task difficulty. The set-up is shown in Fig. 9.
Fig. 9. Physical set-up of a user completing a task using a digital pen and with GSR attached.
Twelve male recreational basketball players, aged 19-36, each with more than 2 years’ experience (average of 9.4 years), volunteered to complete the study. The task
consisted of a 10 second video basketball clip played on a tablet monitor, which was
then frozen and replaced with a blank court schematic. The clips involved 10 players
and the participants had to remember the locations and roles of some players in
three task difficulty levels (remember 3 players for Low level, 6 for Medium, and all
10 for High). Each level consisted of 6 distinct clips. The clips were filmed from above
and cover half the court, with all plays moving from the bottom of the screen towards
the top, where the basketball hoop was located, as seen in Fig. 10.
The participants used specific pen marks to identify the remembered player
positions on the tablet monitor: attackers were denoted by crosses, defenders by circles, and the ball carrier by a circle with a dot in the middle, as illustrated in Fig. 10b.
Participants were also instructed to think aloud through their answers, and these
utterances were captured using a close-talk microphone.
Fig. 10a. Last frame of video clip
before freeze.
Fig. 10b. Blank court image with
player markings.
6.3 Subjective Ratings and Performance Results
Subjective ratings were collected using a Likert 9-point scale, where 1 was minimal
effort and 9 was extreme effort. The task complexity levels induced clearly different levels of load, as reflected in the subjective ratings, which increased significantly with task level, with means of 3.2 (SD=1.34), 5.5 (SD=1.62) and 7.6 (SD=1.23) for the Low, Medium and High load tasks respectively (Fig. 11). Due to the non-parametric dataset, this was verified using Friedman’s χ2 test (χ2(12,2)=25.53,
p<0.001), where Low, Med and High were ranked 1.00, 2.04 and 2.96 respectively.
As expected, the participants’ performance decreased significantly from Low load
to High load. Scores were given for each mark whose centroid was placed within a
radius of 8% screen distance (in pixels) from the correct player position, as
recommended by basketball experts at the Australian Institute of Sport, who also
annotated the correct player positions on the schematic. The mean scores for the Low, Med and High load tasks were 83.5% (SD=11.63), 77.7% (SD=12.26) and 68.1% (SD=15.14). The decrease was verified through a repeated-measures ANOVA test (F(2,22)=4.84, p=.018). Subsequent planned contrasts show a significant linear trend (F(1,11)=5.59, p=.04, r=0.46) at the 0.05 level, with a medium effect size. This is
evident in Fig. 11 also, where the performance decreases gradually between Low and
Medium load levels and then more steeply from Medium to High levels.
Fig. 11. Performance scores and subjective ratings [Ruiz, Liu, Yin, Farrow, & Chen,
2010].
Overall, participants’ performance decreased significantly, while their subjective
ratings of load increased significantly, from Low load to High load, validating that
the responses elicited by these tasks are affected by extreme levels of cognitive load.
6.4 Individual Modalities
In this section, we analyse the capacity of individual modalities to classify load levels.
In addition to speech and pen input, we present galvanic skin response (GSR)
although it is a physiological measure, because we compare the relative potential of
these three modalities in the next section.
Speech data was analysed for all 12 subjects as described in section 3 (and in
[Ruiz, et al., 2010]), and the results use the average of the two evaluation folds,
classifying into three pre-designed load levels. As shown in Table 5, Low load achieved 100% accuracy on testing samples, and High load 82%. Interestingly, however, testing samples from the Medium load level were mostly misclassified into
either the Low or High load, suggesting that no distinct pattern was captured. We
suspect participants with subtly varied basketball skills and load capacity may have
experienced slightly lower or higher loads in this level. The average accuracy for the
3 levels was 62.7%.
Unfortunately, due to corrupted signals from some of the GSR input sensors, and data losses caused by unexpected crashes in the software, only 9 subjects have complete data for the purpose of fusion; hence this subset is used exclusively for the remainder of this case study. For these 9 subjects, the average speech classification accuracy drops slightly, to 61.8%.
Table 5: Confusion matrix of three-level speech classification
(rows: true level of testing samples; columns: classified as Low / Medium / High).

Low:    100%  /  0%  /  0%
Medium:  40%  /  6%  / 54%
High:    15%  /  3%  / 82%
Pen input was analysed through a set of simple, objective features based on circling shapes drawn by the participants. Table 6 summarises the features and their individual accuracy at classifying load levels for the 9 subjects. The results range from 31% to 41% for the 3-level classification, i.e. in some cases not outperforming chance (33%).
Table 6: Pen-input trajectory features (with accuracy on test samples).

Duration: stroke duration, in milliseconds. Accuracy: 32.6%.
Length: cumulative distance between sampled points along the trajectory. Accuracy: 40.7%.
Mean velocity: mean velocity of the stroke trajectory, calculated point to point. Accuracy: 30.7%.
Mean acceleration: mean acceleration of the stroke trajectory, calculated point to point. Accuracy: 37.0%.
Area: the area in pixels taken by the circle shape, enclosed by the trajectory. Accuracy: 36.3%.
First-Last: distance between the first and last points of the trajectory. Accuracy: 33.3%.
Overlap ratio: the ratio of the overlapping distance between the first and last points of the trajectory to the total size of the shape. Accuracy: 37.4%.
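Most of these geometric features can be computed directly from the stroke trajectory; the sketch below assumes (x, y, t) tuples in pixels and milliseconds, and approximates the enclosed area with the shoelace formula (an assumption, since the study's exact area computation is not specified).

```python
# Sketch of the geometric features in Table 6 for one circling stroke, assuming
# a trajectory of (x, y, t) tuples in pixels/milliseconds.
import math

def geometric_features(traj):
    xs, ys, ts = zip(*traj)
    seg = [math.hypot(x1 - x0, y1 - y0)
           for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:]))]
    duration = ts[-1] - ts[0]
    length = sum(seg)
    # Shoelace formula approximates the area enclosed by the circling shape.
    area = 0.5 * abs(sum(xs[i] * ys[i + 1] - xs[i + 1] * ys[i]
                         for i in range(-1, len(xs) - 1)))
    return {
        "duration_ms": duration,
        "length": length,
        "mean_velocity": length / duration if duration else 0.0,
        "first_last": math.hypot(xs[-1] - xs[0], ys[-1] - ys[0]),
        "area": area,
    }

# Nearly closed square "circle": 10x10 px, drawn over 200 ms.
square = [(0, 0, 0), (10, 0, 50), (10, 10, 100), (0, 10, 150), (0, 1, 200)]
feats = geometric_features(square)
```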
Although galvanic skin response is not a behavioural measure of cognitive load but a
physiological one (i.e. it is not a voluntary reaction but a function of the autonomic
nervous system), it was used as a ground truth measurement for the study.
Measured in micro-Siemens (μS), the signal was simply analysed using an average measurement over the task period, yielding a classification accuracy of 64.4% over 3 load levels, across all 9 subjects, using a leave-one-out evaluation scheme.
6.5 Multimodal Fusion
In this section we fuse the above features extracted from speech, pen input and GSR using the AdaBoost boosting algorithm. Boosting [Freund, 1995; Schapire, 1990] is a general ensemble learning algorithm, which creates an accurate strong classifier H by iteratively combining a number T of moderately inaccurate weak classifiers h_t. By definition, a strong classifier has high classification accuracy on the data set, while a weak classifier’s accuracy is just above that of a random guess. The final strong classifier can be defined as:

$$H(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \ge 0 \\ -1 & \text{otherwise} \end{cases}$$

where α_t is a weight coefficient. In simple cases, each weak classifier is attached to a feature, so the process of combining weak classifiers in Boosting is equivalent to a feature fusion process.
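The strong-classifier rule can be transcribed directly; the threshold "stumps" and weights below are hypothetical stand-ins for trained weak classifiers, not values from the study.

```python
# Direct transcription of the strong-classifier rule: output 1 when the
# weighted sum of weak-classifier votes is non-negative, -1 otherwise.
def strong_classify(x, weak_classifiers, alphas):
    total = sum(a * h(x) for h, a in zip(weak_classifiers, alphas))
    return 1 if total >= 0 else -1

# Hypothetical stumps: each votes +1/-1 based on one feature of x.
stumps = [
    lambda x: 1 if x[0] > 0.5 else -1,   # e.g. a speech-based score
    lambda x: 1 if x[1] > 0.3 else -1,   # e.g. a pen-based score
]
alphas = [0.686, 0.150]

pred = strong_classify([0.8, 0.1], stumps, alphas)   # 0.686 - 0.150 >= 0 -> 1
```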
We used AdaBoost [Freund & Schapire, 1997; Schapire, Freund, Bartlett, & Lee,
1998], an adaptive version of Boosting. Sample weights are initially set equal, then
refined iteratively during training. In order to select the features that are most
discriminative for a given problem, in each iteration AdaBoost selects a new weak
classifier h_t with the minimal weighted classification error with respect to the
training sample weight distribution, so that the newly selected weak classifier tends
to classify the more important samples (those with higher weights) correctly. The
weights of incorrectly classified samples are then increased, so that in the next
iteration AdaBoost focuses on these samples.
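As a concrete illustration of the training loop just described, the following is a minimal generic AdaBoost with one-feature threshold stumps, so that each weak classifier is attached to a single feature, as in our fusion. It is a textbook sketch under the convention that labels are coded −1/+1, not the exact implementation used in the study.

```python
import numpy as np

def adaboost_train(X, y, n_rounds=10):
    """Minimal AdaBoost with one-feature threshold stumps (labels in {-1, +1}).

    Each weak classifier uses a single feature, so the learned alpha weights
    indicate how much each feature contributes to the fused decision.
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)                 # uniform initial sample weights
    stumps = []
    for _ in range(n_rounds):
        best = None
        # pick the stump (feature, threshold, polarity) with minimal weighted error
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)           # avoid log(0)
        alpha = 0.5 * np.log((1 - err) / err)           # weak classifier weight
        pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
        w *= np.exp(-alpha * y * pred)      # up-weight misclassified samples
        w /= w.sum()
        stumps.append((alpha, j, thr, pol))
    return stumps

def adaboost_predict(stumps, X):
    """Weighted vote of the stumps: sign of the sum of alpha_t * h_t(x)."""
    score = sum(a * np.where(p * (X[:, j] - t) >= 0, 1, -1)
                for a, j, t, p in stumps)
    return np.where(score >= 0, 1, -1)
```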
Table 7 details the weights obtained for speech and the pen features. The average
classification accuracy when fusing all these features is 64.1% on the testing
samples, for the 3 load levels across all 9 subjects.
Table 7: AdaBoost weights for speech and pen input features.

Feature  Speech  Duration  Length  Velocity  Acceleration  Area   First-Last  Overlap
Weight   0.686   0.150     0.053   0.000     0.051         0.059  0.000       0.002
Similarly, Table 8 details the weights obtained when fusing speech, pen features and
GSR. The average classification accuracy is then 77.8% on the testing samples, for
the 3 load levels across all 9 subjects.
Table 8: AdaBoost weights for speech, pen input features and GSR.

Feature  Speech  Duration  Length  Velocity  Acceleration  Area   First-Last  Overlap  GSR
Weight   0.478   0.176     0.055   0.050     0.041         0.053  0.011       0.000    0.181
Adding the GSR feature provides a significant improvement, supporting the benefits
of feature fusion for workload detection. This case study is offered as one
implementation example of the model; however, the results indicate that other
behavioural features, yet to be explored, may further improve multimodal cognitive
load measurement accuracy.
7. DYNAMIC SYSTEM ADAPTATION BASED ON COGNITIVE LOAD INDICES
Intelligent interactive systems equipped with methods for unobtrusive, real-time
detection of cognitive load and general cognitive load awareness should be able to
adapt content delivery in more appropriate ways by sensing what the user is able to
cognitively cope with at any given moment. Presentation and interaction strategies
can be used to adapt the pace, volume and format of the information conveyed to the
user, depending on their individual cognitive load experience [Ruiz, 2011]. For
example, in the case of a real-life bushfire management control centre scenario, the
interaction system may be able to adapt many elements of the interface to decrease
the cognitive load experienced by a user: from highlighting a critical computer screen
or a specific information window, to sorting and prioritising task checklists, to
showing controlled reminders, to filtering email or SMS messages, to redirecting
phone calls to the less cognitively loaded operators, the system has the power to
subtly ease the user’s cognitive demand [Khawaja, et al., 2010; Khawaja, Chen,
Owen, & Hickey, 2009].
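By way of illustration, such a strategy can be expressed as a simple policy that maps a detected load level to candidate interface actions. The thresholds and action names below are purely hypothetical, not drawn from the bushfire management system itself.

```python
# Hypothetical adaptation policy: maps a detected cognitive load level
# (1 = low, 2 = medium, 3 = high) to candidate interface actions.
# Action names are illustrative only.
def adaptation_actions(load_level):
    actions = []
    if load_level >= 2:   # moderate load: reduce interface clutter
        actions += ["prioritise_task_checklist", "filter_low_priority_messages"]
    if load_level >= 3:   # high load: actively offload work from the user
        actions += ["highlight_critical_window", "redirect_calls_to_other_operators"]
    return actions
```

A real system would of course condition these rules on user preferences and task context rather than on the load level alone.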
Recent advances in the design of applications and user interfaces have promoted
awareness of the user's context. It is crucial to establish a reliable indicator of
cognitive load for each individual, by assessing which feature patterns are likely to
occur at high or low levels of load on a case-by-case basis, given that there are large
individual variations within a trend or pattern from one person to another. Many of
the potential pen and speech indices summarised above need a relative baseline or
standardisation feature. In addition, user preferences can be used as the basis for
response strategies under high cognitive load: some users may prefer to be overtly
alerted when the system detects their high load, while others may choose to let the
system support them in a more autonomous way, e.g. by redirecting incoming calls
to voicemail.
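One simple way to realise such a relative baseline, sketched here under the assumption of a z-score standardisation (the text does not prescribe a specific scheme), is to normalise each user's feature values against a calibration recording made by that same user:

```python
import numpy as np

# Per-user standardisation sketch: z-score a feature (e.g. pause rate or
# pen velocity) against a baseline recorded for the same user on a simple
# calibration task, so that load thresholds can generalise across people.
# Function and variable names are illustrative.
def standardise(raw_values, baseline_values):
    baseline = np.asarray(baseline_values, dtype=float)
    mu, sigma = baseline.mean(), baseline.std()
    return (np.asarray(raw_values, dtype=float) - mu) / sigma
```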
7.1 Performance Monitoring
The type of multimodal interaction environment we are targeting, featuring high
complexity, safety-critical tasks, could benefit from dynamic adaptation based on
cognitive load assessment, as part of performance monitoring. Such complex work
scenarios do not provide readily usable metrics for an operator’s performance;
instead, debriefings are used to assess the team’s performance and address any
undesirable outcomes. Cognitive load assessment can provide a real-time indicator of
the load experienced by each operator: from this point, the system can be equipped to
provide feedback to them, offer a warning, or suggest ways in which the system can
“help”. Other less technically-oriented solution strategies are possible, for example,
where team leaders or managers manually redirect incoming incidents or incoming
tasks to operators who may have more cognitive resources available to attend to
them, while scaffolding others who are struggling to cope with demand.
7.2 Targeted Training
The use of cognitive load indices in intelligent environments, possibly in conjunction
with performance and other measures, could provide an individually targeted
learning experience. Interface and system learning environments are often aimed at
a group level; training programs seldom take into account individual differences in
cognitive load during progression through increasingly complex material. Although
some systems already exist that cater for performance differences in training
scenarios and can adapt slightly to accommodate them, it is generally acknowledged
within the field of educational psychology that performance does not always
accurately reflect the level of load. The latter, in fact, represents the subject's
cognitive cost (e.g., the cognitive resources spent and mental effort invested
[Kalyuga, 2007]) to produce those results. By deploying cognitive assessment during
training sessions,
learners can benefit from a self-paced curriculum, supported by system
recommendations as to when it may be appropriate to advance to the next module.
This could potentially reduce training time and increase efficiency, with learners
spending more time on material when necessary, and less otherwise.
8. CONCLUSION AND FUTURE WORK
The work presented here summarises research aiming to measure the cognitive load
expended by human operators, especially using unobtrusive, real-time
measurements. These are crucial for practical applications, where they can be used to
optimise user interaction.
Previous research has tried to assess users’ cognitive load using several methods
including physiological, performance-based, and subjective measures. However,
intelligent interactive systems lend themselves more to the collection of behavioural
measures – in particular modal inputs and communication – for cognitive load
assessment. The goal is to measure a user’s cognitive load implicitly and in real-time
so as to adapt systems to users affected by high cognitive load, easing the demand
and avoiding stress, frustration and errors. This work has explored
the viability of a number of behavioural modal data sources, especially from speech
and pen input, to identify symptomatic cues of high cognitive load.
The feasibility of using user input and behaviour patterns as indices of cognitive
load is supported by experimental evidence. The benefits of this approach are that
these measures can be collected implicitly, i.e. by monitoring variations in specific
modal features executed during day-to-day usage of intelligent interactive systems,
thus overcoming problems of intrusiveness and increasing applicability in real-world
environments. Moreover, using symptomatic cues of cognitive load derived from user
behaviour, such as acoustic speech signals, linguistic analysis of transcribed text,
digital pen trajectories of handwriting and geometric shapes, can be supported by
well-established theoretical frameworks, including O'Donnell and Eggemeier's
workload measurement [O'Donnell & Eggemeier, 1986], Sweller's Cognitive Load
Theory [Sweller, et al., 1998] and Baddeley's model of modal working memory
[Baddeley, 1992], as well as McKinstry et al.'s [McKinstry, et al., 2008] and
Rosenbaum's [Rosenbaum, 2005] action dynamics findings.
Behaviour-based cognitive load measurement also benefits from its very means of
data collection. It does not require extra physical instrumentation of the user or
environment, since the inputs it captures are part of the natural interaction required
by the task. Moreover, the data is always available and current, so long as the user is
interacting with the system or completing a task. Such real-time assessment of the
user’s cognitive load can then help achieve the ultimate goal of adapting information
selection and presentation in a dynamic computer interface with reference to load.
The development of standardised tasks to compare cognitive load measures would go
a long way to achieving more definitive comparisons between indices and assessing
their applicability in real-life dynamic scenarios.
Extensive investigations into a complete speech signal analysis for cognitive load
measurement have culminated in the development of a fully functional automatic
cognitive load assessment engine, able to produce a result in real-time without
manual intervention. It provides reliable speaker-independent measurement of
cognitive load, 85% accurate over 3 levels, without the need to create a model for
each individual subject, for data collected using a close-talk microphone at minimal
cost. This would be a significant improvement in industrial environments where no
cognitive load assessment technology currently exists. The changes in the user's
voice that characterise high cognitive load occur in the acoustic and prosodic
features of the speech data, so the technology can make an accurate assessment
regardless of the specific words uttered, the meaning of the message or the
vocabulary used. Likewise, it is difficult for the user to consciously manipulate the
assessment.
In regard to the linguistic analysis of the speech data, our studies show that the
frequency of selected linguistic and grammatical features changes between low- and
high-load tasks. We have successfully isolated a number of cognitive load indices
based on pause features, grammar features, language complexity features and word
category features such as emotive and agreement words. These indices are an ideal
complement to current speech-signal-based results because they assess the content
of the user's speech.
The results of our ongoing research also suggest that pen input data produced
under high cognitive load exhibits symptomatic characteristics, specifically in
the structure, form and manner of the trajectories generated in pen gesture,
handwriting and drawing. The findings demonstrate that the quality of interactive
pen gesture trajectories degrades as tasks become more complex; altitude, pressure
and orientation features show significant changes in handwriting produced in high
load situations; and finally, the frequency of sketching, drawing and other note
taking activities using a digital pen increases significantly in very difficult tasks
compared to very simple tasks. Of these three pen input measures, structural
handwriting analysis has proven to be the most promising index of cognitive load.
Strokes and inter-strokes provide a comprehensive record of writing behaviour,
conveying rich insights into the cognitive load experienced by a writer. The overall
classification accuracy showed that pen altitude, pen orientation and pressure reflect
cognitive load variations successfully, reaching 75% accuracy over three load levels.
These specific changes in modal and communicative behaviours when cognitive
resources are scarce reflect a mental mechanism designed to extend working
memory and reserve resources for problem-solving strategies and processes.
Despite significant evidence linking physical alterations to behavioural changes, the
question of causality, where we can definitively link these changes to cognitive load,
is still an open issue, and one we are actively investigating.
Finally, we proposed a high level model of a system for assessment of cognitive
load using a number of behavioural indices over two modalities: speech and pen. The
real-time assessment of cognitive load provided by the system offers new potential for
dynamic support and adaptive system behaviour, promising to optimise the human-
computer interaction throughput, and reduce the burden placed on the limited
human cognitive capabilities.
ACKNOWLEDGMENTS
NICTA is funded by the Australian Government as represented by the Department of Broadband,
Communications and the Digital Economy and the Australian Research Council through the ICT
Centre of Excellence program. We
would like to thank our annotators, students and volunteers whose efforts are presented here. Special
thanks to Guanzhong Li and Zhidong Li for their essential contributions.
REFERENCES
ALIBALI, M. W., KITA, S., & YOUNG, A. J. 2000. Gesture and the process of speech production: We think,
therefore we gesture. Language and Cognitive Processes, 15, 6, 593-613.
ARK, W. S., DRYER, D. C., & LU, D. J. 1999. The Emotion Mouse. In Bullinger & Ziegler (Eds.),
Proceedings of HCI International (the 8th International Conference on Human-Computer Interaction)
on Human-Computer Interaction: Ergonomics and User Interfaces (Vol. 1, pp. 818-823): Lawrence
Erlbaum Association, London.
BACKS, R. W., & WALRATH, L. C. 1992. Eye movement and pupillary response indices of mental
workload during visual search of symbolic displays. Applied Ergonomics, 23, 243-254.
BADDELEY, A. 2003. Working Memory and Language: An Overview. Journal of Communication
Disorders, 36, 189-208.
BADDELEY, A. D. 1992. Working Memory. Science, 255, 556-559.
BERTHOLD, A., & JAMESON, A. 1999. Interpreting Symptoms of Cognitive Load in Speech Input. In
Proceedings of the Seventh International Conference on User Modeling (UM99).
BRENNER, M., SHIPP, T., DOHERTY, E., & MORRISSEY, P. 1985. Voice measures of psychological
stress: Laboratory and field data. In I. Titze & R. Scherer (Eds.), Vocal Fold Physiology, Biomechanics,
Acoustics, and Phonatory Control (pp. 239-248): Denver Center for the Performing Arts, Denver,
Colorado.
BRUNKEN, R., PLASS, J. L., & LEUTNER, D. 2003. Direct measurement of cognitive load in multimedia
learning. Educational Psychologist, 38, 1, 53-61.
BYRNE, A. J., SELLEN, A. J., & JONES, J. G. 1998. Errors on anaesthetic record charts as a measure of
anaesthetic performance during simulated critical incidents. British Journal of Anaesthetics, 80, 58-
62.
CHALKER, S., & WEINER, E. 1998. The Oxford Dictionary of English Grammar: New York: Oxford
University Press.
CHANDLER, P., & SWELLER, J. 1991. Cognitive Load Theory and the Format of Instruction. Cognition
and Instruction, 8, 4, 293-332.
COWAN, N. 2001. The magical number 4 in short-term memory: A reconsideration of mental storage
capacity. Behavioral and Brain Sciences, 24, 1, 87-114.
DALE, R., ROCHE, J., SNYDER, K., & MCCALL, R. 2008. Exploring Action Dynamics as an Index of
Paired-Associate Learning. PLoS ONE 3, 3, e1728. doi: 10.1371/journal.pone.0001728
DECHERT, H. W., & RAUPACH, M. 1980. Towards a Cross-Linguistic Assessment of Speech Production:
Frankfurt: Lang.
DELIS, D. C., KRAMER, J. H., & KAPLAN, E. 2001. The Delis-Kaplan Executive Function System. The
Psychological Corporation.
DONNER, W., & HANCOCK, J. T. 2011. Upset Now? Emotion Contagion in Distributed Groups. In
Proceedings of the International Conference on Computer-Human Interaction (CHI 2011), Vancouver,
BC, Canada.
ESPOSITO, A., STEJSKAL, V., SMÉKAL, Z., & BOURBAKIS, N. 2007. The Significance of Empty Speech
Pauses: Cognitive and Algorithmic Issues. In Proceedings of the International Symposium on Brain,
Vision and Artificial Intelligence (BVAI'07), LNCS 4729, Naples, Italy.
FERNANDEZ, R., & PICARD, R. W. 2003. Modeling drivers’ speech under stress. Speech Communication,
40, 1-2.
FLESCH, R. 1948. A New Readability Yardstick. Journal of Applied Psychology, 32, 3, 221-233.
FRANKISH, G., HULL, R., & MORGAN, P. 1995. Recognition accuracy and user acceptance of pen
interfaces. In Proceedings of the CHI'95.
FREUND, Y. 1995. Boosting a weak learning algorithm by majority. Information and Computation, 121, 2,
256–285.
FREUND, Y., & SCHAPIRE, R. E. 1997. A decision-theoretic generalization of on-line learning and an
application to boosting. Journal of Computer and System Sciences, 55, 1, 119-139.
GALEN, G. P., & VAN HUYGEVOORT, M. 2000. Error, stress and the role of neuromotor noise in space
oriented behaviour. Biological Psychology, 51, 151-171.
GAWRON, V. J. 2000. Human Performance Measures Handbook: Lawrence Erlbaum Associates, New
Jersey, NJ.
GOLDIN-MEADOW, S., NUSBAUM, H., KELLY, S., & WAGNER, S. 2001. Explaining math: Gesturing
lightens the load. Psychological Science, 12, 516-522.
GOPHER, D., & BRAUNE, R. 1984. On the psychophysics of workload: Why bother with subjective
measures? Human Factors, 26, 519-532.
GUNNING, R. 1952. The Technique of Clear Writing: McGraw-Hill.
GÜTL, C., PIVEC, M., TRUMMER, C., GARCÍA-BARRIOS, V. M., MÖDRITSCHER, F., PRIPFL, J., &
UMGEHER, M. 2005. AdeLE (adaptive e-learning with eye-tracking): Theoretical background, system
architecture and application scenarios. European Journal of Open, Distance and E-Learning, 2.
HANCOCK, J. T., GEE, K., CIACCIO, K., & LIN, J. M. 2008. I'm sad, you're sad: Emotion Contagion in
CMC. In Proceedings of the CSCW 2008.
HANSEN, J. H. L. 1996. Analysis and compensation of speech under stress and noise for environmental
robustness in speech recognition. Speech Communication, 20, 1-2, Speech under Stress, 151-173. doi:
10.1016/S0167-6393(96)00050-7
HOCKEY, G. R. J. 2003. Operator Functional State as a Framework for the Assessment of Performance
Degradation. Paper presented at the NATO Advances Research Workshop on Operator Functional
State and Impaired Performance in Complex Work Environments, Il Ciocco, Italy.
IKEHARA, C. S., & CROSBY, M. E. 2005. Assessing Cognitive Load with Physiological Sensors. In
Proceedings of the HICSS'05: 38th Annual Hawaii International Conference on System Sciences
Hawaii.
IQBAL, S. T., ZHENG, X. S., & BAILEY, B. P. 2004. Task-Evoked Pupillary Response to Mental Workload
in Human-Computer Interaction. In Proceedings of the International Conference on Computer-Human
Interaction (CHI'04), Vienna, Austria.
JACOBS, S. C., FRIEDMAN, R., PARKER, J. D., TOFLER, G. H., JIMENEZ, A. H., MULLER, J. E., . . .
STONE, P. H. 1994. Use of skin conductance changes during mental stress testing as an index of
autonomic arousal in cardiovascular research. American Heart Journal, 128, 6, 1170-1177.
JAMESON, A., KIEFER, J., MÜLLER, C., GROßMANN-HUTTER, B., WITTIG, F., & RUMMER, R. 2009.
Assessment of a User’s Time Pressure and Cognitive Load on the Basis of Features of Speech. In M. W.
Crocker & J. Siekmann (Eds.), Resource-Adaptive Cognitive Processes (pp. 171): Springer Berlin
Heidelberg.
KALYUGA, S. 2007. Enhancing instructional efficiency of interactive e-learning environments: A cognitive
load perspective. Educational Psychology Review, 19, 387-399.
KATZ, C., FRASER, E. B., & WAGNER, T. L. 1998. Rotary-Wing Crew Communication Patterns Across
Workload Levels. In Proceedings of the RTO HFM Symposium on Current Aeromedical Issues in
Rotary Wing Operations, San Diego, USA.
KENNEDY, D., & SCHOLEY, A. 2000. Glucose administration, heart rate and cognitive performance:
effects of increasing mental effort. Psychopharmacology, 149, 1, 63-71.
KERÄNEN, H., VÄYRYNEN, E., PÄÄKKÖNEN, R., LEINO, T., KURONEN, P., TOIVANEN, J., &
SEPPÄNEN, T. 2004. Prosodic Features of Speech Produced By Military Pilots During Demanding
Tasks. In Proceedings of Fonetiikan Päivät 2004, Oulu, Finland.
KERR, B. 1973. Processing demands during mental operations. Memory & Cognition, 1, 401-412.
KETTEBEKOV, S. 2004. Exploiting prosodic structuring of coverbal gesticulation. In Proceedings of the
ICMI’04: 6th international conference on Multimodal interfaces, State College, PA, USA.
KHAWAJA, M. A., CHEN, F., & MARCUS, N. 2010. Using Language Complexity to Measure Cognitive
Load for Adaptive Interaction Design. In Proceedings of the International Conference on Intelligent
User Interfaces (IUI 2010), Hong Kong, China.
KHAWAJA, M. A., CHEN, F., OWEN, C., & HICKEY, G. 2009. Cognitive Load Measurement from User’s
Linguistic Speech Features for Adaptive Interaction Design. In Proceedings of the International
Conference on Human-Computer Interaction (INTERACT’09), Part I, LNCS 5726, Uppsala, Sweden.
KIRSCHNER, F., PAAS, F., & KIRSCHNER, P. A. 2009. Cognitive Load Approach to Collaborative
Learning: United Brains for Complex Tasks. Educational Psychology Review, 21, 1.
KRAMER, A. F. 1991. Physiological metrics of mental workload: a review of recent progress. In D. L.
Damos (Ed.), Multiple-task performance (pp. 279-328): Taylor and Francis, London.
LAJOIE, S. P. 2000. Computers as Cognitive Tools: No More Walls. Hillsdale, NJ: Lawrence Erlbaum.
LE, P., AMBIKAIRAJAH, E., EPPS, J., SETHU, V., & CHOI, E. 2011. Investigation of Spectral Centroid
Features for Cognitive Load Classification. Speech Communication, 53, 4, 540-551.
LE, P., EPPS, J., AMBIKAIRAJAH, E., & SETHU, V. 2010. Robust Speech-Based Cognitive Load
Classification Using a Multi-band Approach. In Proceedings of the APSIPA Annual Summit and
Conference (APSIPA’10), Biopolis, Singapore.
LE, P., EPPS, J., CHOI, E., & AMBIKAIRAJAH, E. 2010. A Study of Voice Source and Vocal Tract Filter
Based Features in Cognitive Load Classification. In Proceedings of the International Conference on
Pattern Recognition (ICPR’10), Istanbul, Turkey.
LENNON, C., & BURDICK, H. 2004. The Lexile Framework as an Approach for Reading Measurement
and Success. Retrieved from http://www.Lexile.com
LIPP, O. V., & NEUMANN, D. L. 2004. Attentional blink reflex modulation in a continuous performance
task is modality specific. Psychophysiology, 41, 3, 417-425.
LIU, J., ET AL. 2003. An Adaptive User Interface Based on Personalised Learning. IEEE Intelligent
Systems, 18, 2, 52-57.
LIVELY, E., PISONI, D. B., SUMMERS, W. V., & BERNACKI, R. 1993. Effects of cognitive workload on
speech production: Acoustic analyses and perceptual consequences. Journal of the Acoustical Society of
America, 93, 2962-2973.
MACKINTOSH, C. 2010. AIS React Software v.6.6. Australian Institute of Sport.
MARCUS, N., COOPER, M., & SWELLER, J. 1996. Understanding Instructions. Journal of Educational
Psychology, 88, 1, 49-63.
MARSHALL, S. P., PLEYDELL-PEARCE, C. W., & DICKSON, B. T. 2003. Integrating psychological
measures of cognitive workload and eye movements to detect strategy shifts. In Proceedings of the
36th Hawaii International Conference on System Sciences (HICSS'03) (Vol. 5): IEEE Computer Society,
Washington, DC, USA.
MCKINSTRY, C., DALE, R., & SPIVEY, M. J. 2008. Action Dynamics Reveal Parallel Competition in
Decision Making. Psychological Science, 19, 1, 22-24.
MCLAUGHLIN, H. G. 1969. SMOG Grading - A New Readability Formula. Journal of Reading, 12, 8,
639-646.
MOUSAVI, S. Y., LOW, R., & SWELLER, J. 1995. Reducing cognitive load by mixing auditory and visual
presentation modes. Journal of Educational Psychology, 87, 2, 319-334.
MÜLLER, C., GROßMANN-HUTTER, B., JAMESON, A., RUMMER, R., & WITTIG, F. 2001. Recognising
time pressure and cognitive load on the basis of speech: An experimental study. In Proceedings of the
UM2001, User Modeling: Proceedings of the Eighth International Conference, Berlin.
NICKEL, P., & NACHREINER, F. 2000. Psychometric Properties of the 0.1Hz Component of HRV as an
Indicator of Mental Strain. In Proceedings of the IEA 2000/HFES 2000 Congress: the XIVth Triennial
Congress of the International Ergonomics Association and 44th Annual Meeting of the Human Factors
and Ergonomics Society, San Diego, California.
O’DONNELL, R. D., & EGGEMEIER, F. T. 1986. Workload assessment methodology. In K. R. Boff, L.
Kaufman & J. P. Thomas (Eds.), Handbook of perception and human performance (Vol. 2, pp. 1-49):
Wiley, New York.
OVIATT, S. 1997. Multimodal interactive maps: Designing for human performance. Human-Computer
Interaction, 12, 93-129.
OVIATT, S. 2006. Human-Centered Design Meets Cognitive Load Theory: Designing Interfaces that Help
People Think. In Proceedings of the ACM Multimedia.
OVIATT, S. 2009. Designing Interfaces that stimulate ideational superfluency. In Proceedings of the INKE
2009: Research Foundations for Understanding Book and Reading in the Digital Age: Implementing
New Knowledge Environments.
OVIATT, S., COULSTON, R., & LUNSFORD, R. 2004. When do we interact multimodally?: Cognitive load
and multimodal communication patterns. In Proceedings of the ICMI ’04: Proceedings of the 6th
international conference on Multimodal interfaces, New York, NY, USA.
PAAS, F., AYERS, P., & PACHMAN, M. 2008. Assessment of cognitive load in multimedia learning:
Theory, methods and applications. In D. H. Robinson & G. Schraw (Eds.), Recent Innovations in
Educational Technology that Facilitate Student Learning (pp. 11-36).
PAAS, F., TUOVINEN, J. E., TABBERS, H., & GERVEN, P. 2003. Cognitive Load Measurement as a
Means to Advance Cognitive Load Theory. Educational Psychologist, 38, 1, 63-71.
PALIWAL, K. K. 1998. Spectral subband centroid features for speech recognition. In Proceedings of the
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 1998), Seattle,
WA, USA.
PICARD, R. 1997. Affective Computing: MIT Press.
RECK, R. P., & RECK, R. A. 2007. Generating and rendering readability scores for Project Gutenberg texts.
In Proceedings of the Corpus Linguistics Conference, Birmingham, UK.
REYNOLDS, D. A., & ROSE, R. C. 1992. An integrated speech-background model for robust speaker
identification. In Proceedings of the IEEE ICASSP 1992.
ROSENBAUM, D. A. 2005. The Cinderella of psychology: The neglect of motor control in the science of
mental life and behavior. American Psychologist, 60, 308-317.
RUBINE, D. 1991. Specifying gestures by example. In Proceedings of the SIGGRAPH ’91: 18th annual
conference on Computer graphics and interactive techniques, New York, NY, USA.
RUFFELL SMITH, H. P. 1979. A simulator study of the interaction of pilot workload with errors,
vigilance, and decisions NASA Technical Memorandum.
Moffett Field, CA
: NASA Ames
Research Center.
RUIZ, N. 2011. Cognitive Load Measurement in Multimodal Interfaces. PhD Dissertation, University of
New South Wales, Sydney, Australia.
RUIZ, N., LIU, G., YIN, B., FARROW, D., & CHEN, F. 2010. Teaching Athletes Cognitive Skills: Detecting
Load in Speech Input. In Proceedings of the 24th BCS Conference on Human Computer Interaction
(HCI2010), Dundee, Scotland.
RUIZ, N., TAIB, R., & CHEN, F. 2006. Examining the Redundancy of Multimodal Input. In Proceedings of
the Annual Conference of the Australian Computer-Human Interaction Special Interest Group
(OzCHI’06), Sydney.
RUIZ, N., TAIB, R., & CHEN, F. 2011. Freeform Pen-Input as Evidence of Cognitive Load and Expertise. In
Proceedings of the International Conference on Multimodal Interfaces 2011, Spain.
RUIZ, N., TAIB, R., SHI, Y., CHOI, E., & CHEN, F. 2007. Using Pen Input Features as Indices of Cognitive
Load. In Proceedings of the 9th International Conference on Multimodal Interfaces (ICMI'07), Nagoya,
Japan.
SCHAPIRE, R. E. 1990. The strength of weak learnability. Machine Learning, 5, 2, 197-227.
SCHAPIRE, R. E., FREUND, Y., BARTLETT, P., & LEE, W. S. 1998. Boosting the margin: a new
explanation for the effectiveness of voting methods. The Annals of Statistics, 26, 5, 1651-1686.
SCHILPEROORD, J. 2001. On the Cognitive Status of Pauses in Discourse Production. In T. Olive & C. M.
Levy (Eds.), Contemporary tools and techniques for studying writing: London: Kluwer Academic
Publishers.
SCHWARTZ, D. L., & HEISER, J. 2006. Spatial representations and imagery in learning. In K. Sawyer
(Ed.), Handbook of the Learning Sciences: Cambridge: Lawrence Erlbaum.
SEXTON, J. B., & HELMREICH, R. L. 2000. Analyzing Cockpit Communication: The Links between
Language, Performance, Error, and Workload. The Journal of the Human Performance in Extreme
Environments, 5, 1, 63-68.
SHI, Y., RUIZ, N., TAIB, R., CHOI, E., & CHEN, F. 2007. Galvanic Skin Response (GSR) as an Index of
Cognitive Load. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
(CHI’07), San Jose.
SPIVEY, M. J., GROSJEAN, M., & KNOBLICH, G. 2005. Continuous attraction toward phonological
competitors. Proceedings of the National Academy of Sciences of the United States of America, 102, 29,
10393-10398. doi: 10.1073/pnas.0503903102
STROOP, J. R. 1935. Studies of interference in serial verbal reactions. Journal of Experimental
Psychology, 18, 643-662.
SWELLER, J., VAN MERRIËNBOER, J. J. G., & PAAS, F. 1998. Cognitive Architecture and Instructional
Design. Educational Psychology Review, 10, 3, 251-296.
TOLKMITT, E. J., & SCHERER, K. R. 1986. Effect of experimentally induced stress on vocal parameters.
Journal of Experimental Psychology, 12, 302-312.
URE, J. 1971. Lexical Density and Register Differentiation. In G. Perren & T. Trim (Eds.), Applications of
Linguistics (pp. 443-452): London: Cambridge University Press.
VANDERBERG, R., & SWANSON, H. L. 2007. Which components of working memory are important in
the writing process? Reading and Writing: An Interdisciplinary Journal, 20, 7.
WICKENS, C. D., & HOLLANDS, J. G. 2000. Engineering psychology and human performance (3rd ed.).
Upper Saddle River, NJ: Pearson/Prentice Hall.
WILSON, G. F., & RUSSELL, C. A. 2003. Real-time Assessment of Mental Workload using Psychological
Measures and Artificial Neural Network. Human Factors, 45, 4, 635-643.
WOOD, C., ET AL. 2004. Using Driver’s Speech to Detect Cognitive Workload. In Proceedings of the 9th
Conference on Speech and Computer (SPECOM'04): International Speech Communication Association
Press, France.
YAP, T. 2011. Speech production under cognitive load: Effects and Classification. PhD, University of New
South Wales, Sydney, Australia.
YAP, T. F., AMBIKAIRAJAH, E., EPPS, J., & CHOI, E. 2010. Cognitive Load Classification Using
Formant Features. In Proceedings of the IEEE International Conference on Information Sciences,
Signal Processing and Their Applications (ISSPA’10), Kuala Lumpur, Malaysia.
YAP, T. F., EPPS, J., AMBIKAIRAJAH, E., & CHOI, E. 2010. An Investigation of Formant Frequencies for
Cognitive Load Classification. In Proceedings of the Annual Conference of the International Speech
Communication Association (InterSpeech’10), Makuhari, Japan.
YAP, T. F., EPPS, J., AMBIKAIRAJAH, E., & CHOI, E. 2011. Formant Frequencies under Cognitive Load:
Effects and Classification. EURASIP Journal on Advances in Signal Processing, 2011, Article ID
219253. doi: 10.1155/2011/219253
YAP, T. F., EPPS, J., CHOI, E., & AMBIKAIRAJAH, E. 2010. Glottal Features For Speech-Based Cognitive
Load Classification. In Proceedings of the IEEE International Conference on Acoustic, Speech and
Signal Processing (ICASSP’10), Dallas, USA.
YIN, B., & CHEN, F. 2007. Towards automatic cognitive load measurement from speech analysis. In J.
Jacko (Ed.), Human-Computer Interaction. Interaction Design and Usability (Vol. 4550, pp. 1011-1020):
Springer Berlin, Heidelberg.
YIN, B., CHEN, F., RUIZ, N., & AMBIKAIRAJAH, E. 2008. Speech-based Cognitive Load Monitoring
System. In Proceedings of the IEEE International Conference on Acoustic, Speech and Signal
Processing (ICASSP’08).
YIN, B., RUIZ, N., CHEN, F., & KHAWAJA, M. A. 2007. Automatic Cognitive Load Detection from Speech
Features. In Proceedings of the Australasian Computer-Human Interaction Conference (OzCHI'07),
Adelaide, Australia.
YU, K., EPPS, J., & CHEN, F. 2011a. Cognitive Load Evaluation of Handwriting Using Stroke-level
Features. In Proceedings of the International Conference on Intelligent User Interfaces (IUI’11), Palo
Alto, USA.
YU, K., EPPS, J., & CHEN, F. 2011b. Cognitive Load Evaluation with Pen Orientation and Pressure. In
Proceedings of the UIST2011, Santa Barbara, CA, USA.
YU, K., EPPS, J., & CHEN, F. 2011c. Cognitive Load Measurement with Pen Orientation and Pressure. In
Proceedings of the MMCogEmS 2011 - A workshop of ICMI2011, Alicante, Spain.
http://icmi11.forge.nicta.com.au/papers/MMCogEmS2011_Yu.pdf
ZANDER, T. O., & KOTHE, C. 2011. Towards passive brain–computer interfaces: applying brain–
computer interface technology to human–machine systems in general. Journal of Neural Engineering,
8.
... Cognitive workload measures are commonly classified as self-reported, task performance, or physiological (O'Donnell and Eggemeier 1986; Lysaght et al. 1989; Young et al. 2015; Longo et al. 2022). One can also add the class of behavioural measures (Parasuraman 2003; Chen et al. 2012; Durantin et al. 2014). Self-report measures require participants to quantify their experience of workload (Tsang and Vidulich 2006). ...
... The behavioural basis might range from cursor movements and interface navigation to primary task performance such as response time, accuracy, and task errors (Annett 2002; Parasuraman 2003; Durantin et al. 2014). For the purpose of this review, we will use the term somewhat less broadly than many authors by excluding primary and secondary task measures (Khawaja, Chen, and Marcus 2012; Chen et al. 2012; Braarud and Pignoni 2023). A premise for behavioural measures is that observable overt behaviour may indicate cognitive effort to handle task demand, and that operators adapt their behaviour to manage cognitive workload (Hockey 1997; Hancock and Warm 1989; De Waard and Evans 2014; Hancock and Matthews 2019). ...
... Sensitivity, a fundamental measurement criterion, refers to whether the measure discriminates between distinct levels of cognitive workload (O'Donnell and Eggemeier 1986; Wickens et al. 2012). An additional important criterion for complex dynamic work is resolution or granularity (Muckler and Seven 1992; Chen et al. 2012; Chuang et al. 2016). Depending on the purpose of the evaluation, the measurement should be able to provide granularity with regard to work phases or task steps. ...
Article
Despite the substantial literature and human factors guidance, evaluators report challenges in selecting cognitive workload measures for the evaluation of complex human–technology systems. A review of 32 articles found that self-report measures and secondary tasks were systematically sensitive to human–system interface conditions and correlated with physiological measures. Therefore, including a self-report measure of cognitive workload is recommended when evaluating human–system interfaces. Physiological measures were mainly used in method studies, and future research must demonstrate the utility of these measures for human–system evaluation in complex work settings. However, indexes of physiological measures showed promise for cognitive workload assessment. The review revealed a limited focus on the measurement of excessive cognitive workload, although this is a key topic in nuclear process control. To support human–system evaluation of adequate cognitive workload, future research on behavioural measures may be useful in the identification and analysis of underload and overload.
... Studies by [54,55] have demonstrated that user eye movements play a crucial role in showing visual behavior. Previous research has indicated that eye gaze is closely tied to visual attention, which is fundamental in information selection for user perception and action [56], as well as reflective of a user's interest in absorbing information [57]. ...
... Chen et al. [55] utilized user input and behavioral patterns as practical benchmarks for measuring cognitive load. Kalyani and Gadiraju [59] investigated the correlation between a user's search behavior and the level of learning by analyzing their search interactions. ...
Article
Compared to traditional techniques, augmented reality (AR) confers notable benefits in facilitating complex product assembly processes. The efficacy of AR systems in assembly contexts is notably influenced by the pivotal role of AR instructions. As such, meeting users’ demands for AR instructions is crucial during AR-guided assembly processes. In the present study, an investigation was conducted into the influence of complex assembly task types and user assembly experience on their demands for AR instructions. Firstly, complex assembly tasks were categorized into repetitive complex assembly tasks (RAT) and non-repetitive complex assembly tasks (NRAT) based on their complex characteristics. A user study was conducted using HMD-HoloLens 2 as the experimental device. User performances were recorded during iterative execution of AR experimental tasks under the aforementioned task conditions. The specific measures included users’ attention process, interface interaction behaviors, assembly errors, and users’ subjective experience. The results indicate significant differences in users’ demands for AR instructions across different task types. Moreover, users’ demands for AR instructions also changed with increasing assembly experience. Through comprehensive analysis, the rules of users’ demands for AR instructions were summarized. The present findings enhance the current comprehension of users’ demands regarding AR instructions and offer valuable insights for designing and developing efficient AR-guided assembly systems.
... It has been shown that gestures help to structure an interaction and thereby minimize verbal load [31]. Chen et al. [17] show which features are relevant to measure the current cognitive load and apply their method to different experimental scenarios. Oviatt et al. [71] show that as cognitive load rises people tend to go multimodal, presumably to distribute the load over the used modalities. ...
Preprint
In human interaction, gestures serve various functions such as marking speech rhythm, highlighting key elements, and supplementing information. These gestures are also observed in explanatory contexts. However, the impact of gestures on explanations provided by virtual agents remains underexplored. A user study was carried out to investigate how different types of gestures influence perceived interaction quality and listener understanding. This study addresses the effect of gestures in explanation by developing an embodied virtual explainer integrating both beat gestures and iconic gestures to enhance its automatically generated verbal explanations. Our model combines beat gestures generated by a learned speech-driven synthesis module with manually captured iconic gestures, supporting the agent's verbal expressions about the board game Quarto! as an explanation scenario. Findings indicate that neither the use of iconic gestures alone nor their combination with beat gestures outperforms the baseline or beat-only conditions in terms of understanding. Nonetheless, compared to prior research, the embodied agent significantly enhances understanding.
... In previous studies, users' attention and cognitive workload have been regarded as factors to be considered in multimodal interactions [20,64,65]. Particularly, context-aware interactions, such as interactions with a visually attentive interface, rely on a person's attention as the primary input [66]. ...
Article
People are expected to have more opportunities to spend their free time inside the vehicle with advanced vehicle automation in the near future. This will enable people to turn their attention to desirable activities other than driving and to have varied in-vehicle interactions through multimodal ways of conveying and receiving information. Previous studies on in-vehicle multimodal interactions primarily have focused on making users evaluate the impacts of particular multimodal integrations on them, which do not fully provide an overall understanding of user expectations of the multimodal experience in autonomous vehicles. The research was thus designed to fill the research gap by posing the key question “What are the critical aspects that differentiate and characterise in-vehicle multimodal experiences?” To answer this question, five sessions of design fiction workshops were separately conducted with 17 people to understand the users’ expectations of the multimodal experience in autonomous vehicles. Twenty-two subthemes of users’ expected tasks of multimodal experience were extracted through thematic analysis. The research found that two dimensions, attention and duration, are critical aspects that impact in-vehicle multimodal interactions. With this knowledge, a conceptual model of the users’ in-vehicle multimodal experience was proposed with a two-dimensional spectrum, which populates four different layers: sustained, distinct, concurrent, and coherent. The proposed conceptual model could help designers understand and approach users’ expectations more clearly, allowing them to make more informed decisions from the initial stages of the design process.
... Cognitive load is also found to be closely reflected by changes in various physiological states (Chen et al., 2012). Typical physiological MMD includes heart rate, galvanic skin response or skin conductance, eye activity and brain activity (di Mitri et al., 2017; Giannakos et al., 2019). ...
Article
Although the utilization of mobile technologies has recently emerged in various educational settings, limited research has focused on cognitive load detection in the pen‐based learning process. This research conducted two experimental studies to investigate what and how multimodal data can be used to measure and classify learners' real‐time cognitive load. The results showed that analysing learners' handwriting, touch gestural and eye‐tracking data, individually and conjunctively, is a promising method for predicting their cognitive load. The machine learning approach used in this research achieved a prediction accuracy of 0.86 area under the receiver operating characteristic curve (AUC) and 0.85/0.86 sensitivity/specificity by only using handwriting data, 0.93 AUC and 0.93/0.94 sensitivity/specificity by only using touch gestural data, and 0.94 AUC and 0.94/0.95 sensitivity/specificity by using both the touch gestural and eye‐tracking data. The results can contribute to the optimization of cognitive load and the development of adaptive learning systems for pen‐based mobile learning.
Practitioner notes
What is already known about this topic:
Pen‐based mobile learning systems allow natural ways of handwriting and gestural touching, which can facilitate learners' cognitive processes in mobile learning.
Behavioural and physiological multimodal data are helpful in detecting learners' real‐time cognitive load in mobile learning.
The effectiveness of behavioural and physiological multimodal data for measuring cognitive load in pen‐based mobile learning has received limited investigation.
What this paper adds:
This paper confirms the effectiveness of handwriting and touch gestural multimodal data for measuring pen‐based learning cognitive load, in terms of their stroke‐, path‐ and time‐based features.
This paper explores the potential of eye‐tracking data in measuring pen‐based learning cognitive load.
A combination of behavioural and physiological multimodal data is reported to increase the prediction accuracy for cognitive load measurement.
Implications for practice and/or policy:
Practitioners are suggested to use behavioural and physiological multimodal data individually or conjunctively for measuring cognitive load in pen‐based learning.
The results provide guides for developing adaptive pen‐based learning systems by optimizing the real‐time cognitive load.
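The evaluation metrics cited in the abstract above (AUC, sensitivity, specificity) are standard and easy to reproduce. The following sketch computes each in plain Python; the toy scores stand in for a hypothetical high/low cognitive-load classifier and are invented for illustration, not taken from the study's data.

```python
def auc_score(y_true, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Fraction of (positive, negative) pairs ranked correctly; ties count 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sens_spec(y_true, y_pred):
    """Sensitivity (true-positive rate) and specificity (true-negative rate)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Toy scores from a hypothetical high-load (1) vs. low-load (0) classifier.
y = [1, 1, 1, 1, 0, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
auc = auc_score(y, s)                          # 15 of 16 pairs ranked correctly
sens, spec = sens_spec(y, [1 if x >= 0.5 else 0 for x in s])
print(auc, sens, spec)
```

AUC is threshold-free (it scores the ranking of all positive/negative pairs), while sensitivity and specificity depend on the chosen decision threshold, here 0.5.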
Chapter
This volume collects research on language, cognition, and communication in multilingualism. Apart from theoretical concerns including grammatical description, language-specific analyses, and modeling of multilingualism, different fields of study and research interests center around three core themes: The Early Years (aspects of language acquisition and development, including vernaculars or minority languages, reading, writing, and cognition, and multilingual extensions), Issues in Everyday Life (the role of multilingualism in and for speech–language–communication difficulties, including diagnosis, provisions of services, and later language breakdown), and From the Past to the Future (aspects of multilingualism beyond acquisition, education, or pathology, with a focus on heritage languages and translanguaging). Specialists from each of these areas introduce state-of-the-art research, novel experimental studies, and/or quantitative as well as qualitative data bearing on ‘multifaceted multilingualism’. There is a broad spectrum for take-home messages, ranging from new theoretical analyses or approaches to assess multilingual speakers all the way to recommendations for policy-makers.
Chapter
Learning on adaptive e-learning platforms plays a central role in the current educational revolution, drawing on various pedagogical technologies and on collaboration to create educational scenarios in collaborative systems during adaptive e-learning activities. The main objective of this work is to present collaborative adaptation scenarios in an online collaborative adaptive system during adaptive learning activities, in order to group learners in a collaborative space where they can discuss with each other and develop the aggregation of knowledge. This amounts to fostering adaptive collaboration. The authors discuss the effectiveness of the ADDIE method with this adaptive scenario, which groups collaborative adaptive content between the group and the collaborative learning system to provide an engaging online learning experience focused on skill development and problem-solving practices. As a result, they meet their objective of designing collaborative scenarios in an adaptive system. The authors value collaborative learning through the interpretation of data generated by the system so that an online collaborative environment surrounds learners.
Article
Reviews major categories of empirical workload measurement techniques and provides guidelines for the choice of appropriate assessment procedures for particular applications, covering sensitivity, diagnosticity, rating scales, and psychometric techniques. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Speech-based cognitive load classification has begun to attract research attention in the last couple of years. Previous studies in this research area, conducted in laboratory environments with no noise, have utilized speech features extracted from the entire speech bandwidth to discriminate between cognitive load levels. However, in more realistic environments, these approaches may not be effective due to the presence of noise. In this paper, we utilize speech features computed in disjoint bands of the spectrum and investigate the effectiveness of this multi-band approach for classifying cognitive load from speech in the presence of noise. Our experimental results indicate that by extracting the speech features in different frequency bands separately and controlling the contribution of each frequency band based on noise estimates from that band, a multi-band approach can yield better performance than conventional approaches for robust cognitive load classification.
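The multi-band scheme described above, in which each frequency band's contribution to the classification is controlled by a noise estimate for that band, can be illustrated with a simplified sketch. The band layout, the logistic weighting rule, and all numbers below are assumptions made for illustration, not the authors' method.

```python
import math

def band_weights(snr_db):
    """Map per-band SNR estimates (dB) to normalized contribution weights."""
    # Logistic squashing: clean bands (high SNR) approach weight 1 and
    # noisy bands approach 0, before normalization to sum to 1.
    raw = [1.0 / (1.0 + math.exp(-s / 5.0)) for s in snr_db]
    total = sum(raw)
    return [r / total for r in raw]

def fuse_band_scores(band_scores, snr_db):
    """Combine per-band classifier scores into one noise-aware score."""
    return sum(w * s for w, s in zip(band_weights(snr_db), band_scores))

# Three disjoint bands (hypothetically 0-1 kHz, 1-2.5 kHz, 2.5-4 kHz); the
# middle band is heavily corrupted by noise, so it is down-weighted.
scores = [0.8, 0.2, 0.7]       # per-band "high load" probabilities
snr = [15.0, -10.0, 10.0]      # per-band SNR estimates in dB
print(fuse_band_scores(scores, snr))
```

The fused score stays close to the clean bands' estimates because the unreliable middle band receives a near-zero weight, which is the intuition behind the paper's robustness claim.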
Chapter
This chapter is divided into two parts. The first describes the effect of Pat Rabbitt's influence in encouraging the first author to use the increasingly sophisticated methods of ageing research to answer questions about the fundamental characteristics of working memory, together with reflections on why so little of this work reached publication. The second part presents a brief review of the literature on working memory and ageing, followed by an account of more recent work attempting to apply the traditional method of experimental dissociation to research on normal ageing and Alzheimer's disease. The discussion suggests that even such simple methods can throw light on both the processes of ageing and the understanding of working memory.
Article
This adaptive user interface provides individualized, just-in-time assistance to users by recording user interface events and frequencies, organizing them into episodes, and automatically deriving patterns. It also builds, maintains, and makes suggestions based on user profiles.
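The pipeline this abstract outlines (record interface events, organize them into episodes, derive recurring patterns) can be sketched in toy form. The idle-gap threshold, event names, and pair-counting pattern definition below are invented for illustration and are not taken from the system described.

```python
from collections import Counter

def episodes(events, gap=30.0):
    """Split (timestamp, action) logs into episodes at idle gaps > `gap` seconds."""
    eps, current, last_t = [], [], None
    for t, action in events:
        if last_t is not None and t - last_t > gap:
            eps.append(current)          # idle gap closes the current episode
            current = []
        current.append(action)
        last_t = t
    if current:
        eps.append(current)
    return eps

def frequent_pairs(eps):
    """Count adjacent action pairs across episodes as candidate patterns."""
    counts = Counter()
    for ep in eps:
        counts.update(zip(ep, ep[1:]))
    return counts

# A toy event log: two episodes separated by a long idle period.
log = [(0, "open"), (2, "copy"), (3, "paste"),
       (100, "open"), (101, "copy"), (103, "paste")]
eps = episodes(log)
print(len(eps), frequent_pairs(eps).most_common(1))
```

A real system would mine longer sequences and attach frequencies to user profiles; pair counting is the smallest version of that idea.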
Article
Current graphical keyboard and mouse interfaces are better suited for handling mechanical tasks, like email and text editing, than they are at supporting focused problem solving or complex learning tasks. One reason is that graphical interfaces limit users’ ability to fluidly express content involving different representational systems (e.g., symbols, diagrams) as they think through steps during complex problem solutions. We asked: Can interfaces be designed that actively stimulate students’ ability to “think on paper,” including providing better support for both ideation and convergent problem solving? In this talk, we will summarize new research on the affordances of different types of interface (e.g., pen-based, keyboard-based), and how these basic computer input capabilities function to substantially facilitate or impede people’s ideational fluency. We also will show data on the relation between interface support for communicative fluency (i.e., both linguistic and non-linguistic forms) and ideational fluency. In addition, we’ll discuss the relation between interface support for active marking (i.e., both formal structures like diagrams, and informal ones such as “thinking marks”) and successful problem solving. Finally, we’ll present new data on interfaces that improve support for learning and performance in lower-performing populations, and we will discuss how these new directions in interface media could play a role in improving their education and minimizing the persistent achievement gap between low- versus high-performing groups.
Article
Legal regulations in the EU concerning the evaluation of mental workload require that suitable and practical methods for the assessment of mental workload at the workplace be available. Currently the 0.1 Hz component of heart rate variability (HRV) is considered an attractive and promising measure of mental strain. However, systematic and comprehensive studies investigating the psychometric properties of this cardiovascular measure are still missing. Therefore this problem has been addressed experimentally: if the 0.1 Hz component of HRV is a valid measure of mental strain, it should discriminate between mental load produced by different types of tasks (diagnosticity) and different levels of difficulty (sensitivity). Comparing psychophysiological, performance, and subjective data, the results for the psychophysiological data cannot be interpreted as support for a sufficient sensitivity and diagnosticity of the 0.1 Hz component of HRV as a measure of mental strain. This cardiovascular indicator does not meet conventional requirements to be used in mental and especially cognitive workload evaluation. However, there is evidence that the 0.1 Hz component of HRV is more likely to indicate emotional strain (stress reactions) or general activation.
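For readers unfamiliar with the measure, the 0.1 Hz component of HRV discussed above is typically estimated as spectral power of the interbeat-interval signal in a narrow band around 0.1 Hz. The sketch below uses a synthetic, evenly resampled RR series and assumed band edges (0.08 to 0.12 Hz); it illustrates the general computation, not the study's own procedure.

```python
import numpy as np

def band_power(signal, fs, f_lo=0.08, f_hi=0.12):
    """Power of `signal` within [f_lo, f_hi] Hz from its periodogram."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                              # remove the DC offset
    spec = np.abs(np.fft.rfft(x)) ** 2 / len(x)   # one-sided periodogram
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return spec[mask].sum()

# Synthetic RR-interval series, evenly resampled at 4 Hz, with a pure
# 0.1 Hz oscillation of 50 ms amplitude around an 0.8 s mean beat interval.
fs = 4.0
t = np.arange(0, 300, 1.0 / fs)                   # five minutes of data
rr = 0.8 + 0.05 * np.sin(2 * np.pi * 0.1 * t)
p_01 = band_power(rr, fs)                         # power near 0.1 Hz
p_03 = band_power(rr, fs, 0.28, 0.32)             # power near 0.3 Hz
print(round(p_01, 3), p_01 > 10 * p_03)
```

Because the synthetic oscillation sits exactly at 0.1 Hz, essentially all of its variance lands in the 0.08 to 0.12 Hz band and almost none near 0.3 Hz, which is the contrast such an index relies on.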
Article
Human error has been identified as the primary contributor to aircraft mishaps. A critical implication of this finding is the potential for the CIRCLE system to address communication patterns, suggesting its utility towards analysis of other factors affecting crew communications. Several investigations have revealed that a large proportion of aircraft mishaps are attributable to human error, and that most of these errors stem from failures in communication, teamwork, and decision making (1). Crew communication, or the flow of information between individual operators, serves as the coupling agent that determines the functioning of the operators as an ensemble (2). In the cockpit, crew members coordinate their actions through commands, statements of intent, self-reports, acknowledgments, and questions (3), and it has been suggested that a breakdown in these communication components is the first step leading to accidents and incidents (4,5). This contention has been corroborated by several retrospective studies of commercial and military rotary-wing accidents (6). Robert Helmreich (1) recommended that a potentially effective deterrent to aircraft mishaps would be "training focusing on the inherent limitations of human performance, including the impact of stress on the ability to absorb information and make decisions." However, research has neglected to systematically examine the relationship between workload and cockpit communications, perhaps because crew coordination is dynamic and difficult to objectively quantify with a meaningful metric. Our objective was to quantitatively assess the effects of workload levels on the coordinated verbal behaviors of two-person military aircrews.