Integration of Gestures and Speech in Human-
Robot Interaction
Raveesh Meena*, Kristiina Jokinen, ** and Graham Wilcock**
* KTH Royal Institute of Technology, TMH, Stockholm, Sweden
** University of Helsinki, Helsinki, Finland
raveesh@csc.kth.se, kristiina.jokinen@helsinki.fi, graham.wilcock@helsinki.fi
Abstract— We present an approach to enhance the
interaction abilities of the Nao humanoid robot by extending
its communicative behavior with non-verbal gestures (hand
and head movements, and gaze following). A set of non-
verbal gestures were identified that Nao could use for
enhancing its presentation and turn-management
capabilities in conversational interactions. We discuss our
approach for modeling and synthesizing gestures on the Nao
robot. We also present a scheme for system evaluation
that compares users’ expectations with their actual
experiences. We found that open arm gestures, head
movements and gaze following could significantly enhance
Nao’s ability to be expressive and appear lively, and to
engage human users in conversational interactions.
I. INTRODUCTION
Human-human face-to-face conversational interactions
involve not just the exchange of verbal expressions, but
also of non-verbal ones. Conversational partners may
use verbal feedback for various activities, such as asking
clarification or information questions, responding to a
question, providing new information, expressing
understanding of or uncertainty about the new information,
or simply encouraging the speaker, through backchannels
(‘ah’, ‘uhu’, ‘mhm’), to continue speaking.
Often verbal expressions are accompanied by non-
verbal expressions, such as gestures (e.g., hand, head and
facial movements) and eye-gaze. Non-verbal expressions
of this kind are not mere artifacts in a conversation, but
are intentionally used by the speaker to draw attention to
certain pieces of information present in the verbal
expression. There are some other non-verbal expressions
that may function as important signals to manage the
dialogue and the information flow in a conversational
interaction [1]. Thus, while a speaker employs verbal and
non-verbal expressions to convey her communicative
intentions appropriately, the listener(s) combine cues from
these expressions to ground the meaning of the verbal
expression and establish a common ground [2].
It is desirable for artificial agents, such as the Nao
humanoid robot, to be able to understand and exhibit
verbal and non-verbal behavior in human-robot
conversational interactions. Exhibiting non-verbal
expressions would not only add to their ability to draw
the user’s attention to useful pieces of information,
but also make them appear more expressive and
intelligible, which will help them build social rapport with
their users.
In this paper we report our work on enhancing Nao’s
presentation capabilities by extending its communicative
behavior with non-verbal expressions. In section II we
briefly discuss some gesture types and their functions in
conversational interactions. In section III we identify a set
of gestures that are useful for Nao in the context of this
work. In section IV we first discuss the general approach
for synthesis of non-verbal expressions in artificial agents
and then present our approach. Next, in section V we
discuss our scheme for user evaluation of the non-verbal
behavior in Nao. In section VI we present the results and
discuss our findings. In section VII we discuss possible
extensions to this work and report our conclusions.
II. BACKGROUND
Gestures belong to the communicative repertoire that
the speakers have at their disposal in order to express
meanings and give feedback. According to Kendon,
gestures are intentionally communicative actions and they
have certain immediately recognizable features which
distinguish them from other kinds of activity such as
postural adjustments or spontaneous hand and arm
movements. In addition, he refers to the act of gesturing as
gesticulation, with a preparatory phase in the beginning of
the movement, the stroke, or the peak structure in the
middle, and the recovery phase at the end of the
movement [1].
Gestures can be classified based on their form (e.g.,
iconic, symbolic and emblem gestures) or based on
their function. For instance, a gesture can complement the
speech and single out a certain referent as is the case with
typical deictic pointing gestures (that box). They can also
illustrate the speech like iconic gestures do, e.g., a speaker
may spread her arms while uttering the box was quite big
to illustrate that the box was really big. Hand gestures
could also be used to add rhythm to the speech, as beats
do. Beats are usually synchronized with the important
concepts in the spoken utterance, i.e., they accompany
spoken foci (e.g., when uttering Shakespeare had three
children: Susanna and twins Hamnet and Judith, the beats
fall on the names of the children). Gesturing can thus
direct the conversational partners’ attention to an
important aspect of the spoken message without the
speaker needing to put their intentions in words.
The gestures that we are particularly interested in this
work are Kendon’s Open Hand Supine (“palm up”) and
Open Hand Prone (“palm down”). Gestures in these two
families have their own semantic themes, which are
related to offering and giving vs. stopping and halting,
respectively. Gestures in “palm-up” family generally
express offering or giving of ideas, and they accompany
speech which aims at presenting, explaining,
summarizing, etc. [1].
While most gestures accompany speech, some
gestures may function as important signals that are used to
manage the dialogue and the information flow. According
to Allwood, some gestures may be classified as having a
turn-management function. Turn-management involves
turn transitions depending on the interlocutor’s action with
respect to the turn: turn-accepting (the speaker takes over
the floor), turn-holding (the speaker keeps the floor), and
turn-yielding (the speaker hands over the floor) [3].
It has been established that conversational partners take
cues from various sources, such as the intonation of the
utterance, phrase boundaries, pauses, and the semantic
and syntactic context, to infer turn transition relevance
places. In addition to these verbal cues, eye-gaze shift is a
non-verbal cue that conversational participants employ for
turn management in conversational interactions. The
speaker is particularly more influential than the other
partners in coordinating turn changes. It has been shown
that if the speaker wants to give the turn, she looks at the
listeners, while the listeners tend to look at the current
speaker but turn their gaze away if they do not want to
take the turn. If a listener wants to take the turn, the
listener also looks at the speaker, and turn taking is agreed
by the mutual gaze. Mutual gaze is usually broken by the
listener who takes the turn, and once the planning of the
utterance starts, the listener usually looks away, following
the typical gaze aversion pattern [3].
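The gaze cues above can be summarized, in a deliberately simplified form, as a rule. The boolean inputs and the function itself are an illustration of this account, not a model from the literature:

```python
# Simplified sketch of gaze-based turn transfer: the turn passes from
# speaker to listener only under mutual gaze, and only when the
# listener actually wants the turn (otherwise the listener averts gaze).

def turn_transition(speaker_looks_at_listener,
                    listener_looks_at_speaker,
                    listener_wants_turn):
    """Decide whether the turn passes from speaker to listener."""
    mutual_gaze = speaker_looks_at_listener and listener_looks_at_speaker
    return mutual_gaze and listener_wants_turn
```

Real gaze behavior is of course continuous and noisy; the booleans stand in for the outcome of gaze tracking over a time window.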
III. GESTURES AND NAO
The task of integrating non-verbal gestures in the Nao
humanoid robot was part of a project on multimodal
conversational interaction with a humanoid robot [4]. We
started with WikiTalk [5], a spoken dialogue system for
open domain conversation using Wikipedia as a
knowledge source. By implementing WikiTalk on the
Nao, we greatly extended the robot’s interaction
capabilities by enabling Nao to talk about an unlimited
range of topics. One critical aspect of this interaction is
that, since the user does not have access to a computer
monitor, she is completely unaware of the structure of the
article and of the hyperlinks present in it, which could
serve as possible sub-topics for the user to continue the
conversation. The robot should be able to bring the user’s
attention to these hyperlinks, which we treat as the new
information. While prosody plays a vital role in
emphasizing content words, in this work we aim
specifically at achieving the same effect with non-verbal
gestures. In order to make the interaction smooth we
wanted the robot to coordinate turn taking. Here again we
were more interested in the turn-management aspect of
non-verbal gestures and eye-gaze. Based on these
objectives we set the two primary goals of this work as:
Goal 1: Extend the speaking Nao with hand gesturing that
will enhance its presentation capabilities.
Goal 2: Extend Nao’s turn-management capabilities using
non-verbal gestures.
Towards the first goal we identified a set of
presentation gestures to mark topic, the end of a sentence
or a paragraph, beat gestures and head nods to attract
attention to hyperlinks (the new information), and head
nodding as backchannels. Towards the second goal we put
the following scheme in place: Nao speaks and observes
the human partner at the same time. After presenting a
piece of new information the user is expected to signal
interest by making explicit requests or using
backchannels. Nao should observe and react to such user
responses. After each paragraph the human is invited to
signal continuation (verbal command phrases like
‘enough’, ‘continue’, ‘stop’, etc.). Nao asks for explicit
feedback (and may also gesture, stop, etc. depending on
the previous interaction). Table I provides the summary of the
gestures (along with their functions and their placements)
that we aimed to integrate in Nao.
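A minimal sketch of this turn-management scheme follows. Only the command phrases and backchannels come from the text above; the dialogue-move names are hypothetical, not the actual WikiTalk API:

```python
# Hypothetical dispatch for the post-paragraph turn-management scheme:
# after each paragraph Nao yields the turn and reacts to the user's
# recognized command phrase or backchannel.

def handle_user_response(command, current_move):
    """Map a recognized user phrase to the next dialogue move."""
    moves = {
        "continue": "present_next_paragraph",  # user signals continuation
        "enough":   "offer_new_topic",         # user is done with this topic
        "stop":     "end_interaction",         # user ends the conversation
    }
    # Backchannels are acknowledgements: keep the current move going.
    if command in ("ah", "uhu", "mhm"):
        return current_move
    # Unrecognized input: ask for explicit feedback, as in the scheme.
    return moves.get(command, "ask_explicit_feedback")
```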
IV. APPROACH
A. The choice and timing of non-verbal gestures
Synthesizing non-verbal behavior in artificial agents
primarily requires choosing the right non-verbal
behavior to generate and aligning that behavior with
the verbal expression with respect to the
temporal, semantic, and discourse related aspects of the
dialogue. The content of a spoken utterance, its intonation
contour, and the non-verbal expressions accompanying it
together express the communicative intention of the
speaker. The logical choice therefore is to have a
composite semantic representation that captures the
meanings along these three dimensions. The agent’s
domain plan and the discourse context play a crucial role
in planning the communicative goal (e.g. should the agent
provide an answer to a question or seek clarification).
TABLE I.
NON-VERBAL GESTURES AND THEIR ROLE IN INTERACTION WITH NAO

Open Hand Palm Up. Functions: indicating new paragraph; discourse
structure. Placement: beginning of a paragraph. The Open Hand Palm Up
gesture has the semantic theme of offering information or ideas.

Open Hand Palm Vertical. Function: indicating new information.
Placement: hyperlink in a sentence. The Open Hand Palm Vertical
rhythmic up and down movement emphasizes new information (beat gesture).

Head Nod Down. Function: indicating new information. Placement:
hyperlink in a sentence. A slight head nod marks emphasis on pieces
of verbal information.

Head Nod Up. Functions: expressing surprise (on being interrupted by
the user through the tactile sensors); turn-yielding and discourse
structure (end of a sentence where Nao expects the user to provide an
explicit response; speaker gaze at the listener indicates a possibility
for the listener to grab the conversational floor).

Speaking-to-Listening. Function: turn-yielding. Placement: listening
mode; Nao goes to the standing posture from the speaking pose and
listens to the user.

Listening-to-Speaking. Function: turn-accepting. Placement:
presentation mode; Nao goes to the speaking posture from the standing
pose to prepare for presenting information to the user.

Open Arms Open Hand Palm Up. Functions: presenting new topic;
discourse structure. Placement: beginning of a new topic. The Open
Arms Open Hand Palm Up gesture has the semantic theme of offering
information or ideas.
However, an agent requires a model of attention (what is
currently salient) and intention (next dialogue act) for
extending the communicative intention with pragmatic
factors that determine what intonation contours and
gestures are appropriate in its linguistic realization. This
includes the theme (information that is grounded) and the
rheme (information yet to be grounded) marking of the
elements in the composite semantic representation. The
realizer should be able to synthesize the correct surface
form, the appropriate intonation, and the correct gesture.
Text is generated and pitch accents and phrasal melodies
are placed on generated text which is then produced by a
text to speech synthesizer. The non-verbal synthesizer
produces the animated gestures.
As for the timing of gestures, information about the
duration of intonational phrases is acquired during speech
generation and then used to time the gestures. This is because
gestural domains are observed to be isomorphic with
intonational domains. The speaker’s hands rise into space
with the beginning of the intonational rise at the
beginning of an utterance, and the hands fall at the end of
the utterance along with the final intonational marking.
The most effortful part of the gesture (the “stroke”) co-
occurs with the pitch accent, or most effortful part of
pronunciation. Furthermore, gestures co-occur with the
rhematic part of speech, just as we find particular
intonational tunes co-occurring with the rhematic part of
speech [6].
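A minimal sketch of this stroke-to-pitch-accent alignment follows. The function and the timings are illustrative assumptions, not the paper's implementation:

```python
# Illustrative scheduling sketch: start the preparatory phase early
# enough that the stroke (the most effortful part of the gesture)
# co-occurs with the pitch accent of the intonational phrase.

def schedule_gesture(accent_time, prep_duration):
    """Time (seconds into the phrase) to start the preparatory phase
    so that the stroke begins exactly at the pitch accent."""
    start = accent_time - prep_duration
    return max(0.0, start)  # clamp: cannot start before the utterance

# If the pitch accent falls 1.2 s into the phrase and the preparation
# takes 0.5 s, the gesture must start 0.7 s into the phrase.
```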
[6] presents various embodied cognitive agents that
exhibit multimodal non-verbal behavior, including hand
gestures, facial expressions (eyebrow movements, lip
movements) and head nods, based on the scheme
discussed above. In [7] a back-projected talking head is
presented that exhibits non-verbal facial expressions such
as lip movement, eyebrow movement, and eye gaze. The
timing of these gestures is again derived from the
intonational phrases of the verbal expressions.
B. Integrating non-verbal behavior in Nao
The preparation, stroke, and retraction phases of a
gesture may be differentiated by short holding phases
surrounding the stroke. It is the second phase, the stroke,
that contains the meaning features that allow one to
interpret the gesture. Towards animating gestures in Nao,
our first step was to define the stroke phase for each
gesture type identified in TABLE I. We refer to Nao’s
full body pose during the stroke phase as the key pose that
captures the essence of the action. Figs. A to G in TABLE
II illustrate the key poses for the set of gestures
identified in TABLE I. For example, Fig. A in TABLE II
illustrates the key pose for the Open Hand Palm Up
gesture.
In our approach we model the preparatory phase of a
gesture as comprising an intermediate gesture, the
preparatory pose, which is a pose halfway on the
transition from the current Nao posture to the target key
pose. Similarly, the retraction phase comprises an
intermediate gesture, the retraction pose, which is a
pose halfway on the transition between the target
key pose and the follow-up gesture. The complete gesture
was then synthesized using the B-spline algorithm [8] for
interpolating the joint positions from the preparatory
pose to the key pose and from the key pose to the
retraction pose.
It is critical for the key pose of a gesture to coincide
with the pitch accent in the intonational contour of the
verbal expression. During trials in the lab we observed
that there is always some latency in Nao’s motor
response. Since gestures can be chained, and the
preparatory phase of the follow-up gesture unifies
with the retraction phase of the previous gesture,
considering the Listening key pose (Fig. E, TABLE II),
the default standing position for Nao, as the starting pose
for all gestures increased the latency, and was often
unnatural as well. We therefore specified the Speaking
key pose (Fig. F, TABLE II) as the default follow-up
posture. This approach not only reduced the latency,
but the transitions from the Listening key pose to the
Speaking key pose (presentation mode) and vice versa
also served the purpose of turn-management. Synthesizing
a specific gesture on Nao then basically required an
animated movement of joints from any current body pose
to the target gestural key pose and on to the follow-up
pose.
As an illustration, the Open Hand Palm Up gesture for
a paragraph beginning was synthesized as a B-spline
interpolation of the following sequence of key poses:
Standing → Speaking → Open Hand Palm Up
preparatory pose → Open Hand Palm Up key pose →
Open Hand Palm Up retraction pose → Speaking.
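This procedure can be sketched as follows. The joint names follow Nao's naming convention but the angle values are invented, and the paper's B-spline interpolation [8] is approximated here by a uniform quadratic B-spline segment over the three gesture poses:

```python
# Sketch of gesture synthesis: preparatory and retraction poses are the
# halfway points on the transitions into and out of the key pose, and
# the joint trajectory is a smooth curve pulled toward the key pose.

def midpoint(pose_a, pose_b):
    """Intermediate pose halfway between two poses (per-joint average)."""
    return {j: 0.5 * (pose_a[j] + pose_b[j]) for j in pose_a}

def quad_bspline(p0, p1, p2, t):
    """Uniform quadratic B-spline segment at t in [0, 1]; the curve is
    pulled toward the middle control point p1 (here, the key pose)."""
    return (0.5 * (1 - t) ** 2 * p0
            + (0.5 + t * (1 - t)) * p1
            + 0.5 * t ** 2 * p2)

def synthesize(current, key, follow_up, steps=20):
    """Frames from current posture via preparatory, key and retraction
    poses towards the follow-up posture."""
    prep = midpoint(current, key)        # preparatory pose
    retract = midpoint(key, follow_up)   # retraction pose
    frames = []
    for i in range(steps + 1):
        t = i / steps
        frames.append({j: quad_bspline(prep[j], key[j], retract[j], t)
                       for j in key})
    return frames

# Open Hand Palm Up from the Speaking follow-up posture (angles invented):
speaking = {"RShoulderPitch": 1.0, "RElbowRoll": 0.6}
palm_up_key = {"RShoulderPitch": 0.3, "RElbowRoll": 0.2}
frames = synthesize(speaking, palm_up_key, speaking)
```

On the real robot these frames would be sent to the motor controller as timed joint-angle targets; the `steps` parameter then controls the duration of the animation.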
Beat gestures, the rhythmic movement of Open Hand
Palm Vertical gesture, are different from the other
gestures as they are characterized by two phases of
movement: a movement into the gesture space, and a
movement out of it [6]. In contrast to the pause in the
stroke phase of other gestures, it is the rhythm of the beat
gestures that is intended to draw the listeners’ attention to
the verbal expressions. A beat gesture was synthesized as
a B-spline interpolation of: Speaking key pose → Open
Hand Palm Vertical key pose → Speaking key pose, with
no Open Hand Palm Vertical preparatory and retraction
poses. This sequence of key poses was animated in loops
for synthesizing rhythmic beat gestures for drawing
attention to a sequence of new information.

TABLE II.
KEY POSES FOR VARIOUS GESTURES AND HEAD MOVEMENTS
Fig. A: Open Hand Palm Up. Fig. A1: Side view of Fig. A.
Fig. B: Open Hand Palm Vertical. Fig. B1: Side view of Fig. B.
Fig. C: Head Nod Down. Fig. D: Head Nod Up.
Fig. E: Listening key pose. Fig. F: Speaking key pose.
Fig. G: Open Arms Open Hand Palm Up.

We handcrafted the preparatory, key and retraction
poses for all the animated gestures using Choregraphe®
(part of Nao’s toolkit). Choregraphe® offers an intuitive
way of designing animated actions on Nao, from which we
obtained the corresponding C++/Python code. This
enabled us to develop a parameterized gesture function
library of all the gestures. We could then synthesize a
gesture with varying duration of the animation and
amplitude of the joint movements. This approach of
defining gestures as parameterized functions obtained
from templates is also used for synthesizing non-verbal
behavior in embodied cognitive agents [6] and facial
gestures in back-projected talking heads [7].

C. Synchronizing Nao gestures with Nao speech
Since most of the gestures that we have focused on in
this work accompany speech, we wanted to align the key
pose of a target gesture with the content words bearing
new information. To achieve this we should have
extracted the intonational phrase information from Nao’s
text-to-speech synthesis system. However, back then we
were unable to obtain the intonational phrase information
from Nao’s speech synthesizer. Therefore we took the
rather simple approach of finding the average number of
words before which the gesture synthesis should be
triggered such that the key pose coincides with the
content word. This number is calculated from the
gesture’s duration (of the template) and the length of the
sentence (word count) to be spoken. Based on these two
we approximated (online) the duration parameter of the
gesture to be synthesized. In similar fashion we used the
punctuation and structural details (new paragraph,
sentence end, paragraph end) of a Wikipedia article to
time the turn-management gestures. Often, if not always,
the timing of these gestures was perceived as okay by the
developers in the lab.

FIGURE 1 provides an overview of Nao’s Multimodal
Interaction Manager (MIM). On receiving the user input,
the Nao Manager instructs the MIM to process the User
Input. The MIM interacts with the Wikipedia Manager to
obtain the content and the structural details of the topic
from Wikipedia. The MIM instructs the Gesture Manager
to use these pieces of information in conjunction with the
Discourse Context to specify the gesture type (referring to
the Gesture Library). Next, the duration parameter of this
gesture is calculated (Gesture Timing) and used for
placing the gesture tag at the appropriate place in the text
to be spoken. While the Nao Text-to-Speech synthesizer
produces the verbal expression, the Nao Manager
instructs the Nao Movement Controller to synthesize the
gesture (Gesture Synthesizer).

FIGURE 1: NAO’S MULTIMODAL INTERACTION MANAGER

V. USER EVALUATION
We evaluated the impact of Nao’s verbal and non-
verbal expressions in a conversational interaction with
human subjects. Since we wanted to also measure the
significance of individual gesture types, we created three
versions of Nao’s MIM, with each system exhibiting a
limited set of non-verbal gestures. TABLE III summarizes
the non-verbal gesturing abilities of the three systems.

For evaluation we followed the scheme [9] of
comparing users’ expectations before the evaluation with
their actual experiences of the system. Under this scheme,
users were first asked to fill in a questionnaire that was
designed to measure their expectations from the system.
Subjects then took part in three interactions of about 10
minutes each, and after each interaction with the system
the users filled in another questionnaire that gauged their
experience with the system they had just interacted with.

Both questionnaires contained 31 statements, which
were aimed at seeking users’ expectation and experience
feedback on the following aspects of the systems:
Interface, Responsiveness, Expressiveness, Usability and
Overall Experience. TABLE IV shows the 14 statements
from the two questionnaires that were aimed at evaluating
Nao’s non-verbal behavior. The expectation questionnaire
served the dual purpose of priming users’ attention to the
system behaviors that we wanted to evaluate. Participants
provided their responses on a Likert scale from one to
five (with five indicating strong agreement).

Twelve users participated in the evaluation. They were
participants of the 8th International Summer Workshop
on Multimodal Interfaces, eNTERFACE-2012. Subjects
were instructed that Nao can provide them information
from Wikipedia and that they can talk to Nao and play
with it as much as they wish. There were no constraints
or restrictions on the topics. Users could ask Nao to talk
about almost anything. In addition to this, they were
provided a list of commands to help them familiarize
themselves with the interaction control. All the users
interacted with the three systems in the same order:
System 1, System 2 and then System 3.

TABLE III.
NON-VERBAL GESTURE CAPABILITIES OF THE MIM INSTANTIATIONS
System 1: Face tracking, always in the Speaking pose.
System 2: Head Nod Up, Head Nod Down, Open Hand Palm Up,
Open Hand Palm Vertical, Listening and Standing pose.
System 3: Head Nod Up, Open Hand Palm Up and Beat Gesture
(Open Hand Palm Vertical).

VI. RESULTS
The figure in TABLE V presents the values of the
expected and observed features for all the test users. The
x axis corresponds to the statement id (S.Id) in TABLE IV.
Measuring the significance of these values is part of
ongoing work; therefore we report here just the
preliminary observations based on this figure.
Interface: Users expected Nao hand gestures to be
linked to exploring topics (I1). They perceived their
experience with System 2 to be above their expectations,
while System 3 was perceived somewhat closer to what
they had expected. As System 1 lacked any hand gestures
the expected behavior was hardly observed. Users
expected Nao hand and body movement to be distracting
(I3). However, the observed values suggest that it wasn’t
the case with any of the three interactions. Among the
three, System 1 was perceived the least distracting which
could be due to lack of hand and body movements. Users
expected Nao’s hand and body movement to cause
curiosity (I4). This is in fact true for the observed values
for Systems 2 and 3. Despite its gaze-following behavior,
System 1 was not able to create enough curiosity.
Expressiveness: The users expected Nao to be
expressive (E1). Among the three systems, the interaction
with System 2 was experienced closest to the
expectations. System 2 exceeded the users’ expectation
when it comes to Nao’s liveliness (E2). Interaction with
System 3 was experienced as more lively than interaction
with System 1, suggesting that body movements could
add significantly to the liveliness of an agent that exhibits
only head gestures. Among the three systems, the users
found System 2 to meet their expectations about the
timeliness of head nods (E3). Concerning the naturalness
of the gestures, System 2 clearly beat the users’
expectations, while System 3 was perceived as okay. Users
found all the three interactions very engaging (E6).
Responsiveness: The users expected Nao’s
presentation to be easy to follow (R6). The gaze
following gesture in System 1 was perceived the easiest
to follow. System 2 and 3 were able to achieve this only
to an extent. As to whether gesturing and information
presentation are linked (R7), the interactions with System
2 were perceived closer to the users’ expectations.
Usability: Users expected to remember possible topics
without visual feedback (U1). For all the three systems,
the observed values were close to expected values.
Overall: The Nao gestures in System 1 were observed
to meet the users’ expectations (O1). The head nods in
System 2 were also perceived to meet the users’
expectations (O2), and the gaze tracking in System 1 was
also liked by the users (O3). The responses to O2 and O3
indicate that the users were able to distinguish head nods
from gaze following movements of the Nao head.
In all, the users liked the interaction with System 2
most. This can be attributed to the large variety of non-
verbal gestures exhibited by System 2. Systems 2 and 3
would benefit from incorporating the gaze-following
behavior of System 1. Among the hand
gestures, open arm gestures were perceived better than
beat gestures. We attribute this to the poor synthesis of
beat gestures by the Nao motors.
VII. DISCUSSION AND CONCLUSIONS
In this work we extended the Nao humanoid robot’s
presentation capabilities by integrating a set of non-verbal
behaviors (hand gestures, head movements and gaze
following). We identified a set of gestures that Nao could
use for information presentation and turn-management.
We discussed our approach to synthesize these gestures on
the Nao robot. We presented a scheme for evaluating the
system’s non-verbal behavior based on the users’
expectations and actual experiences. The results suggest
that Nao can significantly enhance its expressivity by
exhibiting open arms gestures (they serve the function of
structuring the discourse), as well as gaze-following and
head movements for keeping the users engaged.
Synthesizing sophisticated movements such as beat
gestures would require a more elaborate model for gesture
placement and smooth yet responsive robot motor actions.
In this work we handcrafted the gestures ourselves, using
Choregraphe®. We believe other approaches in the field
such as use of motion capture devices or Kinect could be
TABLE IV.
QUESTIONNAIRES FOR MEASURING USER EXPECTATIONS AND REAL EXPERIENCE WITH NAO.
System aspect S.Id. Expectation questionnaire Experience questionnaire
Interface
I2 I expect to notice if Nao's hand gestures are linked
to exploring topics.
I noticed Nao's hand gestures were linked to
exploring topic.
I3 I expect to find Nao's hand and body movement
distracting.
Nao's hand and body movement distracted me.
I4 I expect to find Nao’s hand and body movements
creating curiosity in me.
Nao’s hand and body movements created
curiosity in me.
Expressiveness
E1 I expect Nao's behaviour to be expressive. Nao's behaviour was expressive.
E2 I expect Nao will appear lively. Nao appeared lively.
E3 I expect Nao to nod at suitable times. Nao nodded at suitable times.
E5 I expect Nao's gesturing will be natural. Nao’s gesturing was natural.
E6 I expect Nao's conversations will be engaging. Nao's conversations were engaging.
Responsiveness
R6 I expect Nao’s presentation will be easy to follow. Nao’s presentation was easy to follow.
R7 I expect it will be clear that Nao’s gesturing and
information presentation are linked.
It was clear that Nao’s gesturing and
information presentation were linked.
Usability U1
I expect it will be easy to remember the possible
topics without visual feedback.
It was easy to remember the possible topics
without visual feedback.
Overall
O1 I expect I will like Nao's gesturing. I liked Nao's gesturing.
O2 I expect I will like Nao's head movements. I liked Nao's head movements.
O3 I expect I will like Nao’s head tracking. I liked Nao’s head tracking.
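The expectation-versus-experience comparison [9] behind these questionnaires can be sketched as follows. The per-statement gap computation reflects the scheme described above; the ratings shown are invented for illustration:

```python
# Sketch of the expectation-vs-experience analysis: each statement
# (S.Id) is rated 1-5 before (expectation) and after (experience) the
# interaction; a positive difference of mean ratings means the system
# exceeded the users' expectations on that statement.

def mean(xs):
    return sum(xs) / len(xs)

def expectation_gap(expect_ratings, experience_ratings):
    """Per-statement difference of mean Likert ratings
    (experience minus expectation)."""
    return {sid: mean(experience_ratings[sid]) - mean(expect_ratings[sid])
            for sid in expect_ratings}

# Illustrative ratings for two statements from TABLE IV (invented data):
expect = {"E1": [3, 4, 3], "E2": [2, 3, 3]}
experience = {"E1": [4, 4, 5], "E2": [4, 4, 4]}
gaps = expectation_gap(expect, experience)
# A positive gaps["E2"] would indicate liveliness exceeded expectations.
```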
used to design more natural gestures. Also, we did not
conduct any independent perception studies for the
synthesized gestures to gauge how human users perceive
the meaning of such gestures in the context of speech.
Perception studies similar to the ones presented in [3], [8]
should be useful for us.

We believe the traditional approach of gesture
alignment using phoneme information would have given
better gesture timings. We also need a better model for
determining the duration and amplitude parameters for
the gesture functions. Exploring the range of these
parameters along the lines of [10], on exploring the affect
space for robots to display emotional body language,
would be an interesting direction to follow.

Whether the users were able to remember the new
information conveyed by the emphatic hand gestures has
not been verified yet. This requires extensive analysis of
the video recordings and has been planned as future work.
Moreover, previous research has shown that hand gestures
and head movements play a vital role in turn management.
We could not verify whether Nao's gestures also served
this kind of role in interaction coordination (Goal 2, p. 2),
but we believe that non-verbal gestures will be well suited
for turn-management, especially to be used instead of the
default beep sound that the Nao robot currently employs to
explicitly indicate turn changes. However, our findings
suggest that open arm hand gestures, head nods and gaze
following can significantly enhance Nao's ability to
engage users (Goal 1, p. 2), verified by the positive
difference between the users' experiences and expectations
of Nao's interactive capability.

TABLE V.
USER EXPECTATIONS (uExpect'n) AND THEIR EXPERIENCES (ueSys1/2/3) WITH NAO

ACKNOWLEDGMENT
The authors thank the organizers of eNTERFACE 2012
at Supelec, Metz, for the excellent environment for this
project.

REFERENCES
[1] K. Jokinen, "Pointing Gestures and Synchronous Communication
Management," in Development of Multimodal Interfaces: Active
Listening and Synchrony, vol. 5967, A. Esposito, N. Campbell,
C. Vogel, A. Hussain and A. Nijholt, Eds., Heidelberg, Springer
Berlin Heidelberg, 2010, pp. 33-49.
[2] H. H. Clark and E. F. Schaefer, "Contributing to Discourse,"
Cognitive Science, pp. 259-294, 1989.
[3] K. Jokinen, H. Furukawa, M. Nishida and S. Yamamoto, "Gaze
and Turn-Taking Behavior in Casual Conversational Interactions,"
in ACM Transactions on Interactive Intelligent Systems, Special
Issue on Eye Gaze in Intelligent Human-Machine Interaction,
ACM, 2010.
[4] A. Csapo, E. Gilmartin, J. Grizou, F. Han, R. Meena, D.
Anastasiou, K. Jokinen and G. Wilcock, "Multimodal
Conversational Interaction with a Humanoid Robot," in
Proceedings of the 3rd IEEE International Conference on
Cognitive Infocommunications (CogInfoCom 2012), Kosice,
Slovakia, 2012.
[5] G. Wilcock, "WikiTalk: A Spoken Wikipedia-based Open-Domain
Knowledge Access System," in Question Answering in Complex
Domains (QACD 2012), Mumbai, India, 2012.
[6] J. Cassell, "Embodied Conversation: Integrating Face and Gesture
into Automatic Spoken Dialogue Systems," MIT Press, 1989.
[7] S. Al Moubayed, J. Beskow, G. Skantze and B. Granström,
"Furhat: A Back-projected Human-like Robot Head for Multiparty
Human-Machine Interaction," in Cognitive Behavioural Systems.
Lecture Notes in Computer Science, A. Esposito, A. Esposito, A.
Vinciarelli, R. Hoffmann and V. C. Müller, Eds., Springer, 2012.
[8] A. Beck, A. Hiolle, A. Mazel and L. Canamero, "Interpretation of
Emotional Body Language Displayed by Robots," in Proceedings
of the 3rd International Workshop on Affective Interaction in
Natural Environments (AFFINE'10), Firenze, Italy, 2010.
[9] K. Jokinen and T. Hurtig, "User Expectations and Real Experience
on a Multimodal Interactive System," in Proceedings of
Interspeech 2006, Pittsburg, Pennsylvania, US, 2006.
[10] A. Beck, L. Canamero and K. A. Bard, "Towards an Affect Space
for Robots to Display Emotional Body Language," in Proceedings
of the 19th IEEE International Symposium on Robot and Human
Interactive Communication (Ro-MAN 2010), Principe di
Piemonte-Viareggio, Italy, 2010.