Integration of Gestures and Speech in Human-Robot Interaction
Raveesh Meena*, Kristiina Jokinen** and Graham Wilcock**
* KTH Royal Institute of Technology, TMH, Stockholm, Sweden
** University of Helsinki, Helsinki, Finland
raveesh@csc.kth.se, kristiina.jokinen@helsinki.fi, graham.wilcock@helsinki.fi
Abstract— We present an approach to enhance the
interaction abilities of the Nao humanoid robot by extending
its communicative behavior with non-verbal gestures (hand
and head movements, and gaze following). A set of non-
verbal gestures were identified that Nao could use for
enhancing its presentation and turn-management
capabilities in conversational interactions. We discuss our
approach for modeling and synthesizing gestures on the Nao
robot. A scheme for system evaluation that compares the
values of users’ expectations and actual experiences has
been presented. We found that open arm gestures, head
movements and gaze following could significantly enhance
Nao’s ability to be expressive and appear lively, and to
engage human users in conversational interactions.
I. INTRODUCTION
Human-human face-to-face conversational interactions involve not just the exchange of verbal feedback, but also of non-verbal expressions. Conversational partners may use verbal feedback for various activities, such as asking clarification or information questions, giving a response to a question, providing new information, expressing understanding of or uncertainty about new information, or simply encouraging the speaker, through backchannels (‘ah’, ‘uhu’, ‘mhm’), to continue speaking.
Often verbal expressions are accompanied by non-
verbal expressions, such as gestures (e.g., hand, head and
facial movements) and eye-gaze. Non-verbal expressions
of this kind are not mere artifacts in a conversation, but
are intentionally used by the speaker to draw attention to
certain pieces of information present in the verbal
expression. There are some other non-verbal expressions
that may function as important signals to manage the
dialogue and the information flow in a conversational
interaction [1]. Thus, while a speaker employs verbal and
non-verbal expressions to convey her communicative
intentions appropriately, the listener(s) combine cues from
these expressions to ground the meaning of the verbal
expression and establish a common ground [2].
It is desirable for artificial agents, such as the Nao
humanoid robot, to be able to understand and exhibit
verbal and non-verbal behavior in human-robot
conversational interactions. Exhibiting non-verbal expressions would not only add to their ability to draw the attention of the user(s) to useful pieces of information, but also make them appear more expressive and intelligible, which will help them build social rapport with their users.
In this paper we report our work on enhancing Nao’s
presentation capabilities by extending its communicative
behavior with non-verbal expressions. In section II we
briefly discuss some gesture types and their functions in
conversational interactions. In section III we identify a set
of gestures that are useful for Nao in the context of this
work. In section IV we first discuss the general approach
for synthesis of non-verbal expressions in artificial agents
and then present our approach. Next, in section V we
discuss our scheme for user evaluation of the non-verbal
behavior in Nao. In section VI we present the results and
discuss our findings. In section VII we discuss possible
extensions to this work and report our conclusions.
II. BACKGROUND
Gestures belong to the communicative repertoire that
the speakers have at their disposal in order to express
meanings and give feedback. According to Kendon,
gestures are intentionally communicative actions and they
have certain immediately recognizable features which
distinguish them from other kinds of activity, such as postural adjustments or spontaneous hand and arm movements. In addition, he refers to the act of gesturing as
gesticulation, with a preparatory phase in the beginning of
the movement, the stroke, or the peak structure in the
middle, and the recovery phase at the end of the
movement [1].
Gestures can be classified based on their form (e.g., iconic, symbolic and emblem gestures) or based on their function. For instance, a gesture can complement the
speech and single out a certain referent as is the case with
typical deictic pointing gestures (that box). They can also
illustrate the speech like iconic gestures do, e.g., a speaker
may spread her arms while uttering the box was quite big
to illustrate that the box was really big. Hand gestures
could also be used to add rhythm to the speech as the
beats. Beats are usually synchronized with the important
concepts in the spoken utterance, i.e., they accompany
spoken foci (e.g., when uttering Shakespeare had three
children: Susanna and twins Hamnet and Judith, the beats
fall on the names of the children). Gesturing can thus
direct the conversational partners’ attention to an
important aspect of the spoken message without the
speaker needing to put their intentions in words.
The gestures that we are particularly interested in for this work are Kendon’s Open Hand Supine (“palm up”) and
Open Hand Prone (“palm down”). Gestures in these two
families have their own semantic themes, which are
related to offering and giving vs. stopping and halting,
respectively. Gestures in “palm-up” family generally
express offering or giving of ideas, and they accompany
speech which aims at presenting, explaining,
summarizing, etc. [1].
While most gestures accompany speech, some
gestures may function as important signals that are used to
manage the dialogue and the information flow. According
to Allwood, some gestures may be classified as having
turn-management function. Turn-management involves
turn transitions depending on the interlocutor’s action with
respect to the turn: turn-accepting (the speaker takes over
the floor), turn-holding (the speaker keeps the floor), and
turn-yielding (the speaker hands over the floor) [3].
It has been established that conversational partners take cues from various sources: the intonation of the utterance, phrase boundaries, pauses, and the semantic and syntactic context, to infer turn-transition relevance places. In addition to these verbal cues, eye-gaze shift is a non-verbal cue that conversational participants employ for turn management in conversational interactions. The speaker is particularly more influential than the other partners in coordinating turn changes. It has been shown that if the speaker wants to give the turn, she looks at the listeners, while the listeners tend to look at the current speaker, but turn their gaze away if they do not want to take the turn. If a listener wants to take the turn, she also looks at the speaker, and turn taking is agreed by mutual gaze. Mutual gaze is usually broken by the listener who takes the turn, and once the planning of the utterance starts, the listener usually looks away, following the typical gaze aversion pattern [3].
III. GESTURES AND NAO
The task of integrating non-verbal gestures in the Nao
humanoid robot was part of a project on multimodal
conversational interaction with a humanoid robot [4]. We
started with WikiTalk [5], a spoken dialogue system for
open domain conversation using Wikipedia as a
knowledge source. By implementing WikiTalk on the
Nao, we greatly extended the robot’s interaction
capabilities by enabling Nao to talk about an unlimited
range of topics. One of the critical aspects of this interaction is that, since the user does not have access to a computer monitor, she is completely unaware of the structure of the article and of the hyperlinks in it, each of which could be a possible sub-topic for continuing the conversation. The robot should therefore be able to bring the user's attention to these hyperlinks, which we treat as the new information. While prosody plays a vital role in emphasizing content words, in this work we aim specifically at achieving the same effect with non-verbal gestures. In order to make the interaction smooth, we also wanted the robot to coordinate turn taking. Here again we
were more interested in the turn-management aspect of
non-verbal gestures and eye-gaze. Based on these
objectives we set the two primary goals of this work as:
Goal 1: Extend the speaking Nao with hand gesturing that
will enhance its presentation capabilities.
Goal 2: Extend Nao’s turn-management capabilities using
non-verbal gestures.
Towards the first goal we identified a set of
presentation gestures to mark topic, the end of a sentence
or a paragraph, beat gestures and head nods to attract
attention to hyperlinks (the new information), and head
nodding as backchannels. Towards the second goal we put
the following scheme in place: Nao speaks and observes the human partner at the same time. After
presenting a piece of new information the user is expected
to signal interest by making explicit requests or using
backchannels. Nao should observe and react to such user
responses. After each paragraph the human is invited to
signal continuation (verbal command phrases like
‘enough’, ‘continue’, ‘stop’, etc.). Nao asks for explicit feedback (and may also gesture, stop, etc. depending on
previous interaction). Table I provides the summary of the
gestures (along with their functions and their placements)
that we aimed to integrate in Nao.
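To make this mapping concrete before describing our approach, the sketch below shows one way such a gesture library could be encoded; the event names, gesture identifiers and lookup function are illustrative assumptions made for this example, not the actual data structures of our system.

```python
# Illustrative sketch of a gesture library keyed on discourse events, mirroring
# the mapping summarized in Table I. Event names and gesture identifiers are
# assumptions made for this example, not the system's actual identifiers.
GESTURE_LIBRARY = {
    "new_topic":        "Open Arms Open Hand Palm Up",  # presenting a new topic
    "new_paragraph":    "Open Hand Palm Up",            # indicating a new paragraph
    "new_information":  "Open Hand Palm Vertical",      # beat gesture on a hyperlink
    "emphasis":         "Head Nod Down",                # emphasis on a piece of verbal information
    "user_interrupt":   "Head Nod Up",                  # expressing surprise / turn-yielding
    "yield_turn":       "Speaking-to-Listening",        # turn-yielding
    "take_turn":        "Listening-to-Speaking",        # turn-accepting
}

def select_gesture(discourse_event: str) -> str:
    """Return the gesture type for a discourse event, or 'None' if no gesture applies."""
    return GESTURE_LIBRARY.get(discourse_event, "None")

print(select_gesture("new_paragraph"))  # -> Open Hand Palm Up
```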
IV. APPROACH
A. The choice and timing of non-verbal gestures
Synthesizing non-verbal behavior in artificial agents primarily requires choosing the right non-verbal behavior to generate and aligning that non-verbal behavior with the verbal expression with respect to the temporal, semantic, and discourse-related aspects of the dialogue. The content of a spoken utterance, its intonation
contour, and the non-verbal expressions accompanying it
together express the communicative intention of the
speaker. The logical choice therefore is to have a
composite semantic representation that captures the
meanings along these three dimensions. The agent’s
domain plan and the discourse context play a crucial role
in planning the communicative goal (e.g. should the agent
provide an answer to a question or seek clarification).
TABLE I.
NON-VERBAL GESTURES AND THEIR ROLE IN INTERACTION WITH NAO

Open Hand Palm Up
  Function(s): Indicating new paragraph; discourse structure.
  Placement and meaning: Beginning of a paragraph. The Open Hand Palm Up gesture has the semantic theme of offering information or ideas.

Open Hand Palm Vertical
  Function(s): Indicating new information.
  Placement and meaning: Hyperlink in a sentence. The Open Hand Palm Vertical rhythmic up and down movement emphasizes new information (beat gesture).

Head Nod Down
  Function(s): Indicating new information.
  Placement and meaning: Hyperlink in a sentence. A slight head nod marks emphasis on pieces of verbal information.

Head Nod Up
  Function(s): Expressing surprise; turn-yielding; discourse structure.
  Placement and meaning: On being interrupted by the user (through tactile sensors); at the end of a sentence where Nao expects the user to provide an explicit response. Speaker gaze at the listener indicates a possibility for the listener to grab the conversational floor.

Speaking-to-Listening
  Function(s): Turn-yielding.
  Placement and meaning: Listening mode. Nao goes to the standing posture from the speaking pose and listens to the user.

Listening-to-Speaking
  Function(s): Turn-accepting.
  Placement and meaning: Presentation mode. Nao goes to the speaking posture from the standing pose to prepare for presenting information to the user.

Open Arms Open Hand Palm Up
  Function(s): Presenting a new topic.
  Placement and meaning: Beginning of a new topic. The Open Arms Open Hand Palm Up gesture has the semantic theme of offering information or ideas.
However, an agent requires a model of attention (what is
currently salient) and intention (next dialogue act) for
extending the communicative intention with pragmatic
factors that determine what intonation contours and
gestures are appropriate in its linguistic realization. This
includes the theme (information that is grounded) and the
rheme (information yet to be grounded) marking of the
elements in the composite semantic representation. The
realizer should be able to synthesize the correct surface
form, the appropriate intonation, and the correct gesture.
Text is generated, and pitch accents and phrasal melodies are placed on the generated text, which is then produced by a text-to-speech synthesizer. The non-verbal synthesizer produces the animated gestures.
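As an illustration, a composite representation of this kind might bundle the surface text with its intonational and gestural annotations, as in the minimal Python sketch below; the class and field names are our own illustrative assumptions rather than the representation of any particular realizer.

```python
# A minimal, assumed sketch of a composite representation that keeps the verbal
# content, its intonational marking and the accompanying gesture in one
# structure, so that a realizer can keep the three dimensions aligned.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CompositeUtterance:
    text: str                                               # surface form to be spoken
    rheme: List[str] = field(default_factory=list)          # words carrying not-yet-grounded information
    pitch_accents: List[str] = field(default_factory=list)  # words to receive pitch accents
    gesture: Optional[str] = None                           # gesture type aligned with the rheme

utt = CompositeUtterance(
    text="Shakespeare had three children: Susanna and twins Hamnet and Judith",
    rheme=["Susanna", "Hamnet", "Judith"],
    pitch_accents=["Susanna", "Hamnet", "Judith"],
    gesture="Beat",
)
```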
As for the timing of gestures, the information about the duration of intonational phrases is acquired during speech generation and then used to time the gestures. This is because
gestural domains are observed to be isomorphic with
intonational domains. The speaker’s hands rise into space
with the beginning of the intonational rise at the
beginning of an utterance, and the hands fall at the end of
the utterance along with the final intonational marking.
The most effortful part of the gesture (the “stroke”) co-
occurs with the pitch accent, or most effortful part of
pronunciation. Furthermore, gestures co-occur with the
rhematic part of speech, just as we find particular
intonational tunes co-occurring with the rhematic part of
speech [6].
[6] presents various embodied cognitive agents that exhibit multimodal non-verbal behavior, including hand gestures, facial expressions (eyebrow movements, lip movements) and head nods, based on the scheme discussed above. In [7] a back-projected talking head is presented that exhibits non-verbal facial expressions such as lip movement, eyebrow movement, and eye gaze. The timing of these gestures is again motivated by the intonational phrases of the verbal expressions.
B. Integrating non-verbal behavior in Nao
The preparation, stroke, and retraction phases of a
gesture may be differentiated by short holding phases
surrounding the stroke. It is the second phase, the stroke, that contains the meaning features that allow one to interpret the gesture. Towards animating gestures
in Nao our first step was to define the stroke phase for
each gesture type identified in TABLE I. We refer to Nao’s
full body pose during the stroke phase as the key pose that
captures the essence of the action. Figs. A to G in TABLE II illustrate the key poses for the set of gestures identified in TABLE I. For example, Fig. A in TABLE II
illustrates the key pose for the Open Hand Palm Up
gesture.
In our approach we model the preparatory phase of a
gesture as comprising an intermediate gesture, the preparatory pose, which is a gesture pose halfway on the transition from the current Nao posture to the target key pose. Similarly, the retraction phase comprises an intermediate gesture, the retraction pose, which is a gesture pose halfway on the transition between the target
key pose and the follow-up gesture. The complete gesture
was then synthesized using the B-spline algorithm [8] for
interpolating the joint positions from the preparatory
pose to the key pose and from the key pose to the
retraction pose.
It is critical for the key pose of a gesture to coincide
with the pitch accent in the intonational contour of the
verbal expression. During trials in the lab we observed
that there is always some latency in Nao’s motor
response. Since gestures can be chained and the preparatory phase of the follow-up gesture unifies with the retraction phase of the previous gesture,
considering the Listening key pose (Fig. E TABLE II), the
default standing position for Nao, as the starting pose for
all gestures, increased the latency, and was often
unnatural as well. We therefore specified the Speaking
key pose (Fig. F TABLE II) as the default follow-up
posture. This approach has the practical relevance of not
only reducing the latency but also that the transitions
from the Listening key pose to Speaking key pose
(presentation mode) and vice versa served the purpose of
turn-management. Synthesizing a specific gesture on Nao
then basically required an animated movement of joints
from any current body pose to the target gestural key
pose and the follow-up pose.
As an illustration, the Open Hand Palm Up gesture for a paragraph beginning was synthesized as a B-spline interpolation over the following sequence of key poses: Standing → Speaking → Open Hand Palm Up preparatory pose → Open Hand Palm Up key pose → Open Hand Palm Up retraction pose → Speaking.
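A minimal sketch of this kind of key-pose interpolation is shown below. It uses SciPy's spline interpolation in place of the exact B-spline algorithm of [8], and the joint names, angle values and timings are illustrative assumptions rather than Nao's actual calibration.

```python
# Sketch of synthesizing a gesture as a spline interpolation through a sequence
# of key poses (joint-angle vectors). Joint names, angle values and timings are
# illustrative assumptions, not Nao's actual calibration; SciPy's spline stands
# in for the B-spline algorithm of [8].
import numpy as np
from scipy.interpolate import make_interp_spline

JOINTS = ["LShoulderPitch", "LShoulderRoll", "LElbowRoll", "LWristYaw"]

def midpoint(a, b):
    """Halfway pose, used for the preparatory and retraction phases."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]

# Hand-crafted example poses (radians) for a left-arm Open Hand Palm Up gesture.
standing = [1.40, 0.15, -0.40, 0.00]
speaking = [1.10, 0.25, -0.70, -0.20]
key_pose = [0.60, 0.35, -1.00, -1.00]        # stroke: arm offered forward, palm turned up

sequence = [
    standing,
    speaking,
    midpoint(speaking, key_pose),            # preparatory pose
    key_pose,                                # stroke, to coincide with the pitch accent
    midpoint(key_pose, speaking),            # retraction pose
    speaking,                                # default follow-up posture
]
times = np.linspace(0.0, 1.8, num=len(sequence))   # seconds over the whole gesture
spline = make_interp_spline(times, np.array(sequence), k=3, axis=0)

fps = 25
for t in np.linspace(times[0], times[-1], int(fps * times[-1])):
    angles = spline(t)                       # interpolated joint angles at time t
    # A real robot would now be driven, e.g. via NAOqi's ALMotion:
    # motion_proxy.setAngles(JOINTS, list(angles), 0.2)
```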
Beat gestures, the rhythmic movement of the Open Hand Palm Vertical gesture, are different from the other
gestures as they are characterized by two phases of
movement: a movement into the gesture space, and a
movement out of it [6]. In contrast to the pause in the
stroke phase of other gestures, it is the rhythm of the beat
gestures that is intended to draw the listeners’ attention to
TABLE II.
KEY POSES FOR VARIOUS GESTURES AND HEAD MOVEMENTS (figures omitted).
Fig. A: Open Hand Palm Up (Fig. A1: side view of Fig. A)
Fig. B: Open Hand Palm Vertical (Fig. B1: side view of Fig. B)
Fig. C: Head Nod Down
Fig. D: Head Nod Up
Fig. E: Listening key pose
Fig. F: Speaking key pose
Fig. G: Open Arms Open Hand Palm Up
the verbal expressions. A beat gesture was synthesized as a B-spline interpolation of Speaking key pose → Open Hand Palm Vertical key pose → Speaking key pose, with no Open Hand Palm Vertical preparatory and retraction poses. This sequence of key poses was animated in loops for synthesizing rhythmic beat gestures for drawing attention to a sequence of new information.

We handcrafted the preparatory, key and retraction poses for all the animated gestures using Choregraphe® (part of Nao's toolkit). Choregraphe® offers an intuitive way of designing animated actions on Nao, from which we obtained the corresponding C++/Python code. This enabled us to develop a parameterized gesture function library of all the gestures. We could then synthesize a gesture with varying duration of the animation and amplitude of joint movements. This approach of defining gestures as parameterized functions obtained from templates is also used for synthesizing non-verbal behavior in embodied cognitive agents [6] and facial gestures in back-projected talking heads [7].

C. Synchronizing Nao gestures with Nao speech
Since most of the gestures that we have focused on in this work accompany speech, we wanted to align the key pose of a target gesture with the content words bearing new information. To achieve this we should have extracted the intonational phrase information from Nao's text-to-speech synthesis system. However, back then, we were unable to obtain the intonational phrase information from Nao's speech synthesizer. Therefore we took the rather simple approach of finding the average number of words before which the gesture synthesis should be triggered such that the key pose coincides with the content word. This number is calculated based on a gesture's duration (of the template) and the length of the sentence (word count) to be spoken. Based on these two we approximated (online) the duration parameter of the gesture to be synthesized. In a similar fashion we used the punctuation and structural details (new paragraph, sentence end, paragraph end) of a Wikipedia article to time the turn-management gestures. Often, if not always, the timing of these gestures was perceived as acceptable by the developers in the lab.

FIGURE 1 provides an overview of Nao's Multimodal Interaction Manager (MIM). On receiving the user input, the Nao Manager instructs the MIM to process the User Input. The MIM interacts with the Wikipedia Manager to obtain the content and the structural details of the topic from Wikipedia. The MIM instructs the Gesture Manager to use these pieces of information in conjunction with the Discourse Context to specify the gesture type (referring to the Gesture Library). Next, the duration parameter of this gesture is calculated (Gesture Timing) and used for placing the gesture tag at the appropriate place in the text to be spoken. While the Nao Text-to-Speech synthesizer produces the verbal expression, the Nao Manager instructs the Nao Movement Controller to synthesize the gesture (Gesture Synthesizer).

FIGURE 1: NAO'S MULTIMODAL INTERACTION MANAGER (figure omitted).

V. USER EVALUATION
We evaluated the impact of Nao's verbal and non-verbal expressions in a conversational interaction with human subjects. Since we wanted to also measure the significance of individual gesture types, we created three versions of Nao's MIM, with each system exhibiting a limited set of non-verbal gestures. TABLE III summarizes the non-verbal gesturing abilities of the three systems.

TABLE III.
NON-VERBAL GESTURE CAPABILITIES OF THE MIM INSTANTIATIONS
System 1: Face tracking, always in the Speaking pose.
System 2: Head Nod Up, Head Nod Down, Open Hand Palm Up, Open Hand Palm Vertical, Listening and Standing pose.
System 3: Head Nod Up, Open Hand Palm Up and Beat Gesture (Open Hand Palm Vertical).

For evaluation we followed the scheme [9] of comparing users' expectations before the evaluation with their actual experiences of the system. Under this scheme users were first asked to fill in a questionnaire that was designed to measure their expectations from the system. Subjects then took part in three interactions of about 10 minutes each, and after each interaction with the system the users filled in another questionnaire that gauged their experience with the system they had just interacted with.

Both questionnaires contained 31 statements, which were aimed at seeking users' expectation and experience feedback on the following aspects of the systems: Interface, Responsiveness, Expressiveness, Usability and Overall Experience. TABLE IV shows the 14 statements from the two questionnaires that were aimed at evaluating Nao's non-verbal behavior. The expectation questionnaire served the dual purpose of priming the users' attention to the system behaviors that we wanted to evaluate. Participants provided their responses on a Likert scale from one to five (with five indicating strong agreement).

Twelve users participated in the evaluation. They were participants of the 8th International Summer Workshop on Multimodal Interfaces, eNTERFACE-2012. Subjects were instructed that Nao can provide them information from Wikipedia and that they can talk to Nao and play with it as much as they wish. There were no constraints or restrictions on the topics: users could ask Nao to talk about almost anything. In addition to this, they were provided a list of commands to help them familiarize themselves with the interaction control. All the users interacted with the three systems in the same order: System 1, System 2 and then System 3.

VI. RESULTS
The figure in TABLE V presents the values of the expected and observed features for all the test users. The x axis corresponds to the statement id. (S.Id) in TABLE IV.
Measuring the significance of these values is part of the ongoing work; therefore we report here just the preliminary observations based on this figure.
Interface: Users expected Nao hand gestures to be
linked to exploring topics (I1). They perceived their
experience with System 2 to be above their expectations,
while System 3 was perceived somewhat closer to what
they had expected. As System 1 lacked any hand gestures
the expected behavior was hardly observed. Users
expected Nao hand and body movement to be distracting
(I3). However, the observed values suggest that it wasn’t
the case with any of the three interactions. Among the
three, System 1 was perceived the least distracting which
could be due to lack of hand and body movements. Users
expected Nao’s hand and body movement to cause
curiosity (I4). This is in fact true for the observed values
for System 2 and 3. Despite the gaze following behavior
in System 1 it wasn’t able to cause enough curiosity.
Expressiveness: The users expected Nao to be
expressive (E1). Among the three systems, the interaction
with System 2 was experienced closest to the
expectations. System 2 exceeded the users’ expectation
when it comes to Nao’s liveliness (E2). Interaction with
System 3 was experienced more lively than interaction
with System 1, suggesting that body movements could add significantly to the liveliness of an agent that exhibits only head gestures. Among the three systems, the users
found System 2 to meet their expectations about the
timeliness of head nods (E3). Concerning the naturalness of the gestures (E5), System 2 clearly beat the users' expectations, while System 3 was perceived as acceptable. Users
found all the three interactions very engaging (E6).
Responsiveness: The users expected Nao’s
presentation to be easy to follow (R6). The gaze
following gesture in System 1 was perceived the easiest
to follow. System 2 and 3 were able to achieve this only
to an extent. As to whether gesturing and information
presentation are linked (R7), the interactions with System
2 were perceived closer to the users’ expectations.
Usability: Users expected to remember possible topics
without visual feedback (U1). For all the three systems,
the observed values were close to expected values.
Overall: The Nao gestures in System 1 were observed
to meet the users’ expectations (O1). The head nods in
System 2 were also perceived to meet the users’
expectations (O2), and the gaze tracking in System 1 was
also liked by the users (O3). The responses to O2 and O3
indicate that the users were able to distinguish head nods
from gaze following movements of the Nao head.
In all, the users liked the interaction with System 2
most. This can be attributed to the large variety of non-
verbal gestures exhibited by System 2. System 2 and
System 3 should benefit by incorporating the gaze
following gestures of System 1. Among the hand gestures, open arm gestures were perceived better than beat gestures. We attribute this to the poor synthesis of
beat gestures by the Nao motors.
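As a rough illustration of how the expectation-experience comparison behind these observations can be tabulated, the sketch below averages Likert ratings per statement and reports the gap for each system; the data layout and the placeholder ratings are assumptions for illustration only, not the study's data, and no significance testing is attempted.

```python
# Sketch: compare mean expectation ratings with mean experience ratings per
# statement (Likert 1-5). A positive gap means the system exceeded expectations.
# The ratings below are placeholder values showing the data layout only.
from statistics import mean

expectations = {"E1": [4, 5, 4], "E2": [3, 4, 4]}          # statement id -> per-user ratings
experiences = {
    "System 2": {"E1": [4, 4, 5], "E2": [5, 5, 4]},
    "System 3": {"E1": [3, 4, 4], "E2": [4, 4, 3]},
}

for system, ratings in experiences.items():
    for sid, scores in ratings.items():
        gap = mean(scores) - mean(expectations[sid])
        print(f"{system} {sid}: expected {mean(expectations[sid]):.2f}, "
              f"observed {mean(scores):.2f}, gap {gap:+.2f}")
```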
VII. DISCUSSION AND CONCLUSIONS
In this work we extended the Nao humanoid robot’s
presentation capabilities by integrating a set of non-verbal
behaviors (hand gestures, head movements and gaze
following). We identified a set of gestures that Nao could
use for information presentation and turn-management.
We discussed our approach to synthesize these gestures on
the Nao robot. We presented a scheme for evaluating the
system’s non-verbal behavior based on the users’
expectations and actual experiences. The results suggest
that Nao can significantly enhance its expressivity by
exhibiting open arms gestures (they serve the function of
structuring the discourse), as well as gaze-following and
head movements for keeping the users engaged.
Synthesizing sophisticated movements such as beat
gestures would require a more elaborate model for gesture
placement and smooth yet responsive robot motor actions.
In this work we handcrafted the gestures ourselves, using Choregraphe®. We believe other approaches in the field, such as the use of motion capture devices or Kinect, could be used to design more natural gestures.
TABLE IV.
QUESTIONNAIRES FOR MEASURING USER EXPECTATIONS AND REAL EXPERIENCE WITH NAO.
Interface
  I2 - Expectation: I expect to notice if Nao's hand gestures are linked to exploring topics. Experience: I noticed Nao's hand gestures were linked to exploring topics.
  I3 - Expectation: I expect to find Nao's hand and body movement distracting. Experience: Nao's hand and body movement distracted me.
  I4 - Expectation: I expect to find Nao's hand and body movements creating curiosity in me. Experience: Nao's hand and body movements created curiosity in me.
Expressiveness
  E1 - Expectation: I expect Nao's behaviour to be expressive. Experience: Nao's behaviour was expressive.
  E2 - Expectation: I expect Nao will appear lively. Experience: Nao appeared lively.
  E3 - Expectation: I expect Nao to nod at suitable times. Experience: Nao nodded at suitable times.
  E5 - Expectation: I expect Nao's gesturing will be natural. Experience: Nao's gesturing was natural.
  E6 - Expectation: I expect Nao's conversations will be engaging. Experience: Nao's conversations were engaging.
Responsiveness
  R6 - Expectation: I expect Nao's presentation will be easy to follow. Experience: Nao's presentation was easy to follow.
  R7 - Expectation: I expect it will be clear that Nao's gesturing and information presentation are linked. Experience: It was clear that Nao's gesturing and information presentation were linked.
Usability
  U1 - Expectation: I expect it will be easy to remember the possible topics without visual feedback. Experience: It was easy to remember the possible topics without visual feedback.
Overall
  O1 - Expectation: I expect I will like Nao's gesturing. Experience: I liked Nao's gesturing.
  O2 - Expectation: I expect I will like Nao's head movements. Experience: I liked Nao's head movements.
  O3 - Expectation: I expect I will like Nao's head tracking. Experience: I liked Nao's head tracking.
Also, we did not conduct any independent perception studies of the synthesized gestures to gauge how human users perceive the meaning of such gestures in the context of speech. Perception studies similar to the ones presented in [3], [8] should be useful for us.

We believe the traditional approach of gesture alignment using the phoneme information would have given better gesture timings. We also need a better model for determining the duration and amplitude parameters for the gesture functions. Exploring the range of these parameters along the lines of [10] on exploring the affect space for robots to display emotional body language would be an interesting direction to follow.

As to whether the users were able to remember the new information conveyed by the emphatic hand gestures, this has not been verified yet. It requires extensive analysis of the video recordings and has been planned as future work. Moreover, previous research has shown that hand gestures and head movements play a vital role in turn management. We could not verify whether Nao's gestures also served this kind of role in interaction coordination (Goal 2, Section III), but we believe that non-verbal gestures will be well suited for turn-management, especially to be used instead of the default beep sound that the Nao robot currently employs to explicitly indicate turn changes. However, our findings suggest that open arm hand gestures, head nods and gaze following can significantly enhance Nao's ability to engage users (Goal 1, Section III), verified by the positive difference between the users' experience and expectations of Nao's interactive capability.

TABLE V. USER EXPECTATIONS (uExpect'n) AND THEIR EXPERIENCES (ueSys1/2/3) WITH NAO (figure omitted).

ACKNOWLEDGMENT
The authors thank the organizers of eNTERFACE 2012 at Supelec, Metz, for the excellent environment for this project.

REFERENCES
[1] K. Jokinen, "Pointing Gestures and Synchronous Communication Management," in Development of Multimodal Interfaces: Active Listening and Synchrony, vol. 5967, A. Esposito, N. Campbell, C. Vogel, A. Hussain and A. Nijholt, Eds., Heidelberg, Springer Berlin Heidelberg, 2010, pp. 33-49.
[2] H. H. Clark and E. F. Schaefer, "Contributing to Discourse," Cognitive Science, pp. 259-294, 1989.
[3] K. Jokinen, H. Furukawa, M. Nishida and S. Yamamoto, "Gaze and Turn-Taking Behavior in Casual Conversational Interactions," in ACM Transactions on Interactive Intelligent Systems, Special Issue on Eye Gaze in Intelligent Human-Machine Interaction, ACM, 2010.
[4] A. Csapo, E. Gilmartin, J. Grizou, F. Han, R. Meena, D. Anastasiou, K. Jokinen and G. Wilcock, "Multimodal Conversational Interaction with a Humanoid Robot," in Proceedings of the 3rd IEEE International Conference on Cognitive Infocommunications (CogInfoCom 2012), Kosice, Slovakia, 2012.
[5] G. Wilcock, "WikiTalk: A Spoken Wikipedia-based Open-Domain Knowledge Access System," in Question Answering in Complex Domains (QACD 2012), Mumbai, India, 2012.
[6] J. Cassell, "Embodied Conversation: Integrating Face and Gesture into Automatic Spoken Dialogue Systems," MIT Press, 1989.
[7] S. Al Moubayed, J. Beskow, G. Skantze and B. Granström, "Furhat: A Back-projected Human-like Robot Head for Multiparty Human-Machine Interaction," in Cognitive Behavioural Systems, Lecture Notes in Computer Science, A. Esposito, A. Esposito, A. Vinciarelli, R. Hoffmann and V. C. Müller, Eds., Springer, 2012.
[8] A. Beck, A. Hiolle, A. Mazel and L. Canamero, "Interpretation of Emotional Body Language Displayed by Robots," in Proceedings of the 3rd International Workshop on Affective Interaction in Natural Environments (AFFINE'10), Firenze, Italy, 2010.
[9] K. Jokinen and T. Hurtig, "User Expectations and Real Experience on a Multimodal Interactive System," in Proceedings of Interspeech 2006, Pittsburgh, Pennsylvania, US, 2006.
[10] A. Beck, L. Canamero and K. A. Bard, "Towards an Affect Space for Robots to Display Emotional Body Language," in Proceedings of the 19th IEEE International Symposium on Robot and Human Interactive Communication (Ro-MAN 2010), Principe di Piemonte - Viareggio, Italy, 2010.