Integration of Gestures and Speech in Human-Robot Interaction
Raveesh Meena*, Kristiina Jokinen** and Graham Wilcock**
* KTH Royal Institute of Technology, TMH, Stockholm, Sweden
** University of Helsinki, Helsinki, Finland
raveesh@csc.kth.se, kristiina.jokinen@helsinki.fi, graham.wilcock@helsinki.fi
Abstract— We present an approach to enhance the
interaction abilities of the Nao humanoid robot by extending
its communicative behavior with non-verbal gestures (hand
and head movements, and gaze following). A set of non-
verbal gestures were identified that Nao could use for
enhancing its presentation and turn-management
capabilities in conversational interactions. We discuss our
approach for modeling and synthesizing gestures on the Nao
robot. A scheme for system evaluation that compares the
values of users’ expectations and actual experiences has
been presented. We found that open arm gestures, head
movements and gaze following could significantly enhance
Nao’s ability to be expressive and appear lively, and to
engage human users in conversational interactions.
I. INTRODUCTION
Human-human face-to-face conversational interactions involve not just the exchange of verbal feedback, but also of non-verbal expressions. Conversational partners may use verbal feedback for various activities, such as asking clarification or information questions, giving a response to a question, providing new information, expressing understanding of or uncertainty about new information, or simply encouraging the speaker, through backchannels (‘ah’, ‘uhu’, ‘mhm’), to continue speaking.
Often verbal expressions are accompanied by non-
verbal expressions, such as gestures (e.g., hand, head and
facial movements) and eye-gaze. Non-verbal expressions
of this kind are not mere artifacts in a conversation, but
are intentionally used by the speaker to draw attention to
certain pieces of information present in the verbal
expression. There are some other non-verbal expressions
that may function as important signals to manage the
dialogue and the information flow in a conversational
interaction [1]. Thus, while a speaker employs verbal and
non-verbal expressions to convey her communicative
intentions appropriately, the listener(s) combine cues from
these expressions to ground the meaning of the verbal
expression and establish a common ground [2].
It is desirable for artificial agents, such as the Nao
humanoid robot, to be able to understand and exhibit
verbal and non-verbal behavior in human-robot
conversational interactions. Exhibiting non-verbal expressions would not only add to their ability to draw the attention of the user(s) to useful pieces of information, but also make them appear more expressive and intelligible, which will help them build social rapport with their users.
In this paper we report our work on enhancing Nao’s
presentation capabilities by extending its communicative
behavior with non-verbal expressions. In section II we
briefly discuss some gesture types and their functions in
conversational interactions. In section III we identify a set
of gestures that are useful for Nao in the context of this
work. In section IV we first discuss the general approach
for synthesis of non-verbal expressions in artificial agents
and then present our approach. Next, in section V we
discuss our scheme for user evaluation of the non-verbal
behavior in Nao. In section VI we present the results and
discuss our findings. In section VII we discuss possible
extensions to this work and report our conclusions.
II. BACKGROUND
Gestures belong to the communicative repertoire that
the speakers have at their disposal in order to express
meanings and give feedback. According to Kendon,
gestures are intentionally communicative actions and they
have certain immediately recognizable features which
distinguish them from other kinds of activity, such as postural adjustments or spontaneous hand and arm movements. In addition, he refers to the act of gesturing as
gesticulation, with a preparatory phase in the beginning of
the movement, the stroke, or the peak structure in the
middle, and the recovery phase at the end of the
movement [1].
Gestures can be classified based on their form (e.g., iconic, symbolic and emblem gestures) or based on their function. For instance, a gesture can complement the
speech and single out a certain referent as is the case with
typical deictic pointing gestures (that box). They can also
illustrate the speech like iconic gestures do, e.g., a speaker
may spread her arms while uttering the box was quite big
to illustrate that the box was really big. Hand gestures
could also be used to add rhythm to the speech as the
beats. Beats are usually synchronized with the important
concepts in the spoken utterance, i.e., they accompany
spoken foci (e.g., when uttering Shakespeare had three
children: Susanna and twins Hamnet and Judith, the beats
fall on the names of the children). Gesturing can thus
direct the conversational partners’ attention to an
important aspect of the spoken message without the
speaker needing to put their intentions in words.
The gestures that we are particularly interested in for this work are Kendon’s Open Hand Supine (“palm up”) and
Open Hand Prone (“palm down”). Gestures in these two
families have their own semantic themes, which are
related to offering and giving vs. stopping and halting,
respectively. Gestures in “palm-up” family generally
express offering or giving of ideas, and they accompany
speech which aims at presenting, explaining,
summarizing, etc. [1].
While most gestures accompany speech, some
gestures may function as important signals that are used to
manage the dialogue and the information flow. According
to Allwood, some gestures may be classified as having
turn-management function. Turn-management involves
turn transitions depending on the interlocutor’s action with
respect to the turn: turn-accepting (the speaker takes over
the floor), turn-holding (the speaker keeps the floor), and
turn-yielding (the speaker hands over the floor) [3].
It has been established that conversational partners take cues from various sources: the intonation of the utterance, phrase boundaries, pauses, and the semantic and syntactic context, to infer turn-transition relevance places. In addition to these verbal cues, eye-gaze shift is a non-verbal cue that conversational participants employ for turn management in conversational interactions. The speaker is particularly more influential than the other partners in coordinating turn changes. It has been shown that if the speaker wants to give the turn, she looks at the listeners, while the listeners tend to look at the current speaker, but turn their gaze away if they do not want to take the turn. If a listener wants to take the turn, she also looks at the speaker, and turn taking is agreed by mutual gaze. Mutual gaze is usually broken by the listener who takes the turn, and once the planning of the utterance starts, the listener usually looks away, following the typical gaze aversion pattern [3].
III. GESTURES AND NAO
The task of integrating non-verbal gestures in the Nao
humanoid robot was part of a project on multimodal
conversational interaction with a humanoid robot [4]. We
started with WikiTalk [5], a spoken dialogue system for
open domain conversation using Wikipedia as a
knowledge source. By implementing WikiTalk on the
Nao, we greatly extended the robot’s interaction
capabilities by enabling Nao to talk about an unlimited
range of topics. One of the critical aspects of this interaction is that, since the user does not have access to a computer monitor, she is completely unaware of the structure of the article and of the hyperlinks in it, each of which could be a possible sub-topic for continuing the conversation. The robot should therefore be able to bring the user's attention to these hyperlinks, which we treat as the new information. While prosody plays a vital role in emphasizing content words, in this work we aim specifically at achieving the same effect with non-verbal gestures. In order to make the interaction smooth, we also wanted the robot to coordinate turn taking. Here again we
were more interested in the turn-management aspect of
non-verbal gestures and eye-gaze. Based on these
objectives we set the two primary goals of this work as:
Goal 1: Extend the speaking Nao with hand gesturing that
will enhance its presentation capabilities.
Goal 2: Extend Nao’s turn-management capabilities using
non-verbal gestures.
Towards the first goal we identified a set of
presentation gestures to mark topic, the end of a sentence
or a paragraph, beat gestures and head nods to attract
attention to hyperlinks (the new information), and head
nodding as backchannels. Towards the second goal we put
the following scheme in place: Nao speaks and observes the human partner at the same time. After
presenting a piece of new information the user is expected
to signal interest by making explicit requests or using
backchannels. Nao should observe and react to such user
responses. After each paragraph the human is invited to
signal continuation (verbal command phrases like
‘enough’, ‘continue’, ‘stop’, etc.). Nao asks for explicit feedback (and may also gesture, stop, etc. depending on
previous interaction). Table I provides the summary of the
gestures (along with their functions and their placements)
that we aimed to integrate in Nao.
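To make this mapping concrete before describing our approach, the sketch below shows one way such a gesture library could be encoded; the event names, gesture identifiers and lookup function are illustrative assumptions made for this example, not the actual data structures of our system.

```python
# Illustrative sketch of a gesture library keyed on discourse events, mirroring
# the mapping summarized in Table I. Event names and gesture identifiers are
# assumptions made for this example, not the system's actual identifiers.
GESTURE_LIBRARY = {
    "new_topic":        "Open Arms Open Hand Palm Up",  # presenting a new topic
    "new_paragraph":    "Open Hand Palm Up",            # indicating a new paragraph
    "new_information":  "Open Hand Palm Vertical",      # beat gesture on a hyperlink
    "emphasis":         "Head Nod Down",                # emphasis on a piece of verbal information
    "user_interrupt":   "Head Nod Up",                  # expressing surprise / turn-yielding
    "yield_turn":       "Speaking-to-Listening",        # turn-yielding
    "take_turn":        "Listening-to-Speaking",        # turn-accepting
}

def select_gesture(discourse_event: str) -> str:
    """Return the gesture type for a discourse event, or 'None' if no gesture applies."""
    return GESTURE_LIBRARY.get(discourse_event, "None")

print(select_gesture("new_paragraph"))  # -> Open Hand Palm Up
```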
IV. APPROACH
A. The choice and timing of non-verbal gestures
Synthesizing non-verbal behavior in artificial agents primarily requires choosing the right non-verbal behavior to generate and aligning that non-verbal behavior with the verbal expression with respect to the temporal, semantic, and discourse-related aspects of the dialogue. The content of a spoken utterance, its intonation
contour, and the non-verbal expressions accompanying it
together express the communicative intention of the
speaker. The logical choice therefore is to have a
composite semantic representation that captures the
meanings along these three dimensions. The agent’s
domain plan and the discourse context play a crucial role
in planning the communicative goal (e.g. should the agent
provide an answer to a question or seek clarification).
TABLE I.
NON-VERBAL GESTURES AND THEIR ROLE IN INTERACTION WITH NAO

Open Hand Palm Up
  Function(s): Indicating new paragraph; discourse structure.
  Placement and meaning: Beginning of a paragraph. The Open Hand Palm Up gesture has the semantic theme of offering information or ideas.

Open Hand Palm Vertical
  Function(s): Indicating new information.
  Placement and meaning: Hyperlink in a sentence. The Open Hand Palm Vertical rhythmic up and down movement emphasizes new information (beat gesture).

Head Nod Down
  Function(s): Indicating new information.
  Placement and meaning: Hyperlink in a sentence. A slight head nod marks emphasis on pieces of verbal information.

Head Nod Up
  Function(s): Expressing surprise; turn-yielding; discourse structure.
  Placement and meaning: On being interrupted by the user (through tactile sensors); at the end of a sentence where Nao expects the user to provide an explicit response. Speaker gaze at the listener indicates a possibility for the listener to grab the conversational floor.

Speaking-to-Listening
  Function(s): Turn-yielding.
  Placement and meaning: Listening mode. Nao goes to the standing posture from the speaking pose and listens to the user.

Listening-to-Speaking
  Function(s): Turn-accepting.
  Placement and meaning: Presentation mode. Nao goes to the speaking posture from the standing pose to prepare for presenting information to the user.

Open Arms Open Hand Palm Up
  Function(s): Presenting a new topic.
  Placement and meaning: Beginning of a new topic. The Open Arms Open Hand Palm Up gesture has the semantic theme of offering information or ideas.
However, an agent requires a model of attention (what is
currently salient) and intention (next dialogue act) for
extending the communicative intention with pragmatic
factors that determine what intonation contours and
gestures are appropriate in its linguistic realization. This
includes the theme (information that is grounded) and the
rheme (information yet to be grounded) marking of the
elements in the composite semantic representation. The
realizer should be able to synthesize the correct surface
form, the appropriate intonation, and the correct gesture.
Text is generated, and pitch accents and phrasal melodies are placed on the generated text, which is then produced by a text-to-speech synthesizer. The non-verbal synthesizer produces the animated gestures.
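As an illustration, a composite representation of this kind might bundle the surface text with its intonational and gestural annotations, as in the minimal Python sketch below; the class and field names are our own illustrative assumptions rather than the representation of any particular realizer.

```python
# A minimal, assumed sketch of a composite representation that keeps the verbal
# content, its intonational marking and the accompanying gesture in one
# structure, so that a realizer can keep the three dimensions aligned.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CompositeUtterance:
    text: str                                               # surface form to be spoken
    rheme: List[str] = field(default_factory=list)          # words carrying not-yet-grounded information
    pitch_accents: List[str] = field(default_factory=list)  # words to receive pitch accents
    gesture: Optional[str] = None                           # gesture type aligned with the rheme

utt = CompositeUtterance(
    text="Shakespeare had three children: Susanna and twins Hamnet and Judith",
    rheme=["Susanna", "Hamnet", "Judith"],
    pitch_accents=["Susanna", "Hamnet", "Judith"],
    gesture="Beat",
)
```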
As for the timing of gestures, the information about the duration of intonational phrases is acquired during speech generation and then used to time the gestures. This is because
gestural domains are observed to be isomorphic with
intonational domains. The speaker’s hands rise into space
with the beginning of the intonational rise at the
beginning of an utterance, and the hands fall at the end of
the utterance along with the final intonational marking.
The most effortful part of the gesture (the “stroke”) co-
occurs with the pitch accent, or most effortful part of
pronunciation. Furthermore, gestures co-occur with the
rhematic part of speech, just as we find particular
intonational tunes co-occurring with the rhematic part of
speech [6].
[6] presents various embodied cognitive agents that exhibit multimodal non-verbal behavior, including hand gestures, facial expressions (eyebrow movements, lip movements) and head nods, based on the scheme discussed above. In [7] a back-projected talking head is presented that exhibits non-verbal facial expressions such as lip movement, eyebrow movement, and eye gaze. The timing of these gestures is again motivated by the intonational phrases of the verbal expressions.
B. Integrating non-verbal behavior in Nao
The preparation, stroke, and retraction phases of a
gesture may be differentiated by short holding phases
surrounding the stroke. It is the second phase, the stroke, that contains the meaning features that allow one to interpret the gesture. Towards animating gestures
in Nao our first step was to define the stroke phase for
each gesture type identified in TABLE I. We refer to Nao’s
full body pose during the stroke phase as the key pose that
captures the essence of the action. Figs. A to G in TABLE II illustrate the key poses for the set of gestures identified in TABLE I. For example, Fig. A in TABLE II
illustrates the key pose for the Open Hand Palm Up
gesture.
In our approach we model the preparatory phase of a
gesture as comprising an intermediate gesture, the preparatory pose, which is a gesture pose halfway on the transition from the current Nao posture to the target key pose. Similarly, the retraction phase comprises an intermediate gesture, the retraction pose, which is a gesture pose halfway on the transition between the target
key pose and the follow-up gesture. The complete gesture
was then synthesized using the B-spline algorithm [8] for
interpolating the joint positions from the preparatory
pose to the key pose and from the key pose to the
retraction pose.
It is critical for the key pose of a gesture to coincide
with the pitch accent in the intonational contour of the
verbal expression. During trials in the lab we observed
that there is always some latency in Nao’s motor
response. Since gestures can be chained and the preparatory phase of the follow-up gesture unifies with the retraction phase of the previous gesture,
considering the Listening key pose (Fig. E TABLE II), the
default standing position for Nao, as the starting pose for
all gestures, increased the latency, and was often
unnatural as well. We therefore specified the Speaking
key pose (Fig. F TABLE II) as the default follow-up
posture. This approach has the practical relevance of not
only reducing the latency but also that the transitions
from the Listening key pose to Speaking key pose
(presentation mode) and vice versa served the purpose of
turn-management. Synthesizing a specific gesture on Nao
then basically required an animated movement of joints
from any current body pose to the target gestural key
pose and the follow-up pose.
As an illustration, the Open Hand Palm Up gesture for a paragraph beginning was synthesized as a B-spline interpolation over the following sequence of key poses: Standing → Speaking → Open Hand Palm Up preparatory pose → Open Hand Palm Up key pose → Open Hand Palm Up retraction pose → Speaking.
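A minimal sketch of this kind of key-pose interpolation is shown below. It uses SciPy's spline interpolation in place of the exact B-spline algorithm of [8], and the joint names, angle values and timings are illustrative assumptions rather than Nao's actual calibration.

```python
# Sketch of synthesizing a gesture as a spline interpolation through a sequence
# of key poses (joint-angle vectors). Joint names, angle values and timings are
# illustrative assumptions, not Nao's actual calibration; SciPy's spline stands
# in for the B-spline algorithm of [8].
import numpy as np
from scipy.interpolate import make_interp_spline

JOINTS = ["LShoulderPitch", "LShoulderRoll", "LElbowRoll", "LWristYaw"]

def midpoint(a, b):
    """Halfway pose, used for the preparatory and retraction phases."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]

# Hand-crafted example poses (radians) for a left-arm Open Hand Palm Up gesture.
standing = [1.40, 0.15, -0.40, 0.00]
speaking = [1.10, 0.25, -0.70, -0.20]
key_pose = [0.60, 0.35, -1.00, -1.00]        # stroke: arm offered forward, palm turned up

sequence = [
    standing,
    speaking,
    midpoint(speaking, key_pose),            # preparatory pose
    key_pose,                                # stroke, to coincide with the pitch accent
    midpoint(key_pose, speaking),            # retraction pose
    speaking,                                # default follow-up posture
]
times = np.linspace(0.0, 1.8, num=len(sequence))   # seconds over the whole gesture
spline = make_interp_spline(times, np.array(sequence), k=3, axis=0)

fps = 25
for t in np.linspace(times[0], times[-1], int(fps * times[-1])):
    angles = spline(t)                       # interpolated joint angles at time t
    # A real robot would now be driven, e.g. via NAOqi's ALMotion:
    # motion_proxy.setAngles(JOINTS, list(angles), 0.2)
```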
Beat gestures, the rhythmic movement of the Open Hand Palm Vertical gesture, are different from the other
gestures as they are characterized by two phases of
movement: a movement into the gesture space, and a
movement out of it [6]. In contrast to the pause in the
stroke phase of other gestures, it is the rhythm of the beat
gestures that is intended to draw the listeners’ attention to
TABLE II.
KEY POSES FOR VARIOUS GESTURES AND HEAD MOVEMENTS (figures omitted).
Fig. A: Open Hand Palm Up (Fig. A1: side view of Fig. A)
Fig. B: Open Hand Palm Vertical (Fig. B1: side view of Fig. B)
Fig. C: Head Nod Down
Fig. D: Head Nod Up
Fig. E: Listening key pose
Fig. F: Speaking key pose
Fig. G: Open Arms Open Hand Palm Up
the verbal expressions. A beat gesture was synthesized as a B-spline interpolation of Speaking key pose → Open Hand Palm Vertical key pose → Speaking key pose, with no Open Hand Palm Vertical preparatory and retraction poses. This sequence of key poses was animated in loops for synthesizing rhythmic beat gestures for drawing attention to a sequence of new information.

We handcrafted the preparatory, key and retraction poses for all the animated gestures using Choregraphe® (part of Nao's toolkit). Choregraphe® offers an intuitive way of designing animated actions on Nao, from which we obtained the corresponding C++/Python code. This enabled us to develop a parameterized gesture function library of all the gestures. We could then synthesize a gesture with varying duration of the animation and amplitude of joint movements. This approach of defining gestures as parameterized functions obtained from templates is also used for synthesizing non-verbal behavior in embodied cognitive agents [6] and facial gestures in back-projected talking heads [7].

C. Synchronizing Nao gestures with Nao speech
Since most of the gestures that we have focused on in this work accompany speech, we wanted to align the key pose of a target gesture with the content words bearing new information. To achieve this we should have extracted the intonational phrase information from Nao's text-to-speech synthesis system. However, back then, we were unable to obtain the intonational phrase information from Nao's speech synthesizer. Therefore we took the rather simple approach of finding the average number of words before which the gesture synthesis should be triggered such that the key pose coincides with the content word. This number is calculated based on a gesture's duration (of the template) and the length of the sentence (word count) to be spoken. Based on these two we approximated (online) the duration parameter of the gesture to be synthesized. In a similar fashion we used the punctuation and structural details (new paragraph, sentence end, paragraph end) of a Wikipedia article to time the turn-management gestures. Often, if not always, the timing of these gestures was perceived as acceptable by the developers in the lab.

FIGURE 1 provides an overview of Nao's Multimodal Interaction Manager (MIM). On receiving the user input, the Nao Manager instructs the MIM to process the User Input. The MIM interacts with the Wikipedia Manager to obtain the content and the structural details of the topic from Wikipedia. The MIM instructs the Gesture Manager to use these pieces of information in conjunction with the Discourse Context to specify the gesture type (referring to the Gesture Library). Next, the duration parameter of this gesture is calculated (Gesture Timing) and used for placing the gesture tag at the appropriate place in the text to be spoken. While the Nao Text-to-Speech synthesizer produces the verbal expression, the Nao Manager instructs the Nao Movement Controller to synthesize the gesture (Gesture Synthesizer).

FIGURE 1: NAO'S MULTIMODAL INTERACTION MANAGER (figure omitted).

V. USER EVALUATION
We evaluated the impact of Nao's verbal and non-verbal expressions in a conversational interaction with human subjects. Since we wanted to also measure the significance of individual gesture types, we created three versions of Nao's MIM, with each system exhibiting a limited set of non-verbal gestures. TABLE III summarizes the non-verbal gesturing abilities of the three systems.

TABLE III.
NON-VERBAL GESTURE CAPABILITIES OF THE MIM INSTANTIATIONS
System 1: Face tracking, always in the Speaking pose.
System 2: Head Nod Up, Head Nod Down, Open Hand Palm Up, Open Hand Palm Vertical, Listening and Standing pose.
System 3: Head Nod Up, Open Hand Palm Up and Beat Gesture (Open Hand Palm Vertical).

For evaluation we followed the scheme [9] of comparing users' expectations before the evaluation with their actual experiences of the system. Under this scheme users were first asked to fill in a questionnaire that was designed to measure their expectations from the system. Subjects then took part in three interactions of about 10 minutes each, and after each interaction with the system the users filled in another questionnaire that gauged their experience with the system they had just interacted with.

Both questionnaires contained 31 statements, which were aimed at seeking users' expectation and experience feedback on the following aspects of the systems: Interface, Responsiveness, Expressiveness, Usability and Overall Experience. TABLE IV shows the 14 statements from the two questionnaires that were aimed at evaluating Nao's non-verbal behavior. The expectation questionnaire served the dual purpose of priming the users' attention to the system behaviors that we wanted to evaluate. Participants provided their responses on a Likert scale from one to five (with five indicating strong agreement).

Twelve users participated in the evaluation. They were participants of the 8th International Summer Workshop on Multimodal Interfaces, eNTERFACE-2012. Subjects were instructed that Nao can provide them information from Wikipedia and that they can talk to Nao and play with it as much as they wish. There were no constraints or restrictions on the topics: users could ask Nao to talk about almost anything. In addition to this, they were provided a list of commands to help them familiarize themselves with the interaction control. All the users interacted with the three systems in the same order: System 1, System 2 and then System 3.

VI. RESULTS
The figure in TABLE V presents the values of the expected and observed features for all the test users. The x axis corresponds to the statement id. (S.Id) in TABLE IV.
Measuring the significance of these values is part of the ongoing work; therefore we report here just the preliminary observations based on this figure.
Interface: Users expected Nao hand gestures to be
linked to exploring topics (I1). They perceived their
experience with System 2 to be above their expectations,
while System 3 was perceived somewhat closer to what
they had expected. As System 1 lacked any hand gestures
the expected behavior was hardly observed. Users
expected Nao hand and body movement to be distracting
(I3). However, the observed values suggest that it wasn’t
the case with any of the three interactions. Among the
three, System 1 was perceived the least distracting which
could be due to lack of hand and body movements. Users
expected Nao’s hand and body movement to cause
curiosity (I4). This is in fact true for the observed values
for System 2 and 3. Despite the gaze following behavior
in System 1 it wasn’t able to cause enough curiosity.
Expressiveness: The users expected Nao to be
expressive (E1). Among the three systems, the interaction
with System 2 was experienced closest to the
expectations. System 2 exceeded the users’ expectation
when it comes to Nao’s liveliness (E2). Interaction with
System 3 was experienced more lively than interaction
with System 1, suggesting that body movements could add significantly to the liveliness of an agent that exhibits only head gestures. Among the three systems, the users
found System 2 to meet their expectations about the
timeliness of head nods (E3). Concerning the naturalness of the gestures (E5), System 2 clearly beat the users' expectations, while System 3 was perceived as acceptable. Users
found all the three interactions very engaging (E6).
Responsiveness: The users expected Nao’s
presentation to be easy to follow (R6). The gaze
following gesture in System 1 was perceived the easiest
to follow. System 2 and 3 were able to achieve this only
to an extent. As to whether gesturing and information
presentation are linked (R7), the interactions with System
2 were perceived closer to the users’ expectations.
Usability: Users expected to remember possible topics
without visual feedback (U1). For all the three systems,
the observed values were close to expected values.
Overall: The Nao gestures in System 1 were observed
to meet the users’ expectations (O1). The head nods in
System 2 were also perceived to meet the users’
expectations (O2), and the gaze tracking in System 1 was
also liked by the users (O3). The responses to O2 and O3
indicate that the users were able to distinguish head nods
from gaze following movements of the Nao head.
In all, the users liked the interaction with System 2
most. This can be attributed to the large variety of non-
verbal gestures exhibited by System 2. System 2 and
System 3 should benefit by incorporating the gaze
following gestures of System 1. Among the hand gestures, open arm gestures were perceived better than beat gestures. We attribute this to the poor synthesis of
beat gestures by the Nao motors.
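As a rough illustration of how the expectation-experience comparison behind these observations can be tabulated, the sketch below averages Likert ratings per statement and reports the gap for each system; the data layout and the placeholder ratings are assumptions for illustration only, not the study's data, and no significance testing is attempted.

```python
# Sketch: compare mean expectation ratings with mean experience ratings per
# statement (Likert 1-5). A positive gap means the system exceeded expectations.
# The ratings below are placeholder values showing the data layout only.
from statistics import mean

expectations = {"E1": [4, 5, 4], "E2": [3, 4, 4]}          # statement id -> per-user ratings
experiences = {
    "System 2": {"E1": [4, 4, 5], "E2": [5, 5, 4]},
    "System 3": {"E1": [3, 4, 4], "E2": [4, 4, 3]},
}

for system, ratings in experiences.items():
    for sid, scores in ratings.items():
        gap = mean(scores) - mean(expectations[sid])
        print(f"{system} {sid}: expected {mean(expectations[sid]):.2f}, "
              f"observed {mean(scores):.2f}, gap {gap:+.2f}")
```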
VII. DISCUSSION AND CONCLUSIONS
In this work we extended the Nao humanoid robot’s
presentation capabilities by integrating a set of non-verbal
behaviors (hand gestures, head movements and gaze
following). We identified a set of gestures that Nao could
use for information presentation and turn-management.
We discussed our approach to synthesize these gestures on
the Nao robot. We presented a scheme for evaluating the
system’s non-verbal behavior based on the users’
expectations and actual experiences. The results suggest
that Nao can significantly enhance its expressivity by
exhibiting open arms gestures (they serve the function of
structuring the discourse), as well as gaze-following and
head movements for keeping the users engaged.
Synthesizing sophisticated movements such as beat
gestures would require a more elaborate model for gesture
placement and smooth yet responsive robot motor actions.
In this work we handcrafted the gestures ourselves, using Choregraphe®. We believe other approaches in the field, such as the use of motion capture devices or Kinect, could be used to design more natural gestures.
TABLE IV.
QUESTIONNAIRES FOR MEASURING USER EXPECTATIONS AND REAL EXPERIENCE WITH NAO.
Interface
  I2 - Expectation: I expect to notice if Nao's hand gestures are linked to exploring topics. Experience: I noticed Nao's hand gestures were linked to exploring topics.
  I3 - Expectation: I expect to find Nao's hand and body movement distracting. Experience: Nao's hand and body movement distracted me.
  I4 - Expectation: I expect to find Nao's hand and body movements creating curiosity in me. Experience: Nao's hand and body movements created curiosity in me.
Expressiveness
  E1 - Expectation: I expect Nao's behaviour to be expressive. Experience: Nao's behaviour was expressive.
  E2 - Expectation: I expect Nao will appear lively. Experience: Nao appeared lively.
  E3 - Expectation: I expect Nao to nod at suitable times. Experience: Nao nodded at suitable times.
  E5 - Expectation: I expect Nao's gesturing will be natural. Experience: Nao's gesturing was natural.
  E6 - Expectation: I expect Nao's conversations will be engaging. Experience: Nao's conversations were engaging.
Responsiveness
  R6 - Expectation: I expect Nao's presentation will be easy to follow. Experience: Nao's presentation was easy to follow.
  R7 - Expectation: I expect it will be clear that Nao's gesturing and information presentation are linked. Experience: It was clear that Nao's gesturing and information presentation were linked.
Usability
  U1 - Expectation: I expect it will be easy to remember the possible topics without visual feedback. Experience: It was easy to remember the possible topics without visual feedback.
Overall
  O1 - Expectation: I expect I will like Nao's gesturing. Experience: I liked Nao's gesturing.
  O2 - Expectation: I expect I will like Nao's head movements. Experience: I liked Nao's head movements.
  O3 - Expectation: I expect I will like Nao's head tracking. Experience: I liked Nao's head tracking.
Also, we did not conduct any independent perception studies of the synthesized gestures to gauge how human users perceive the meaning of such gestures in the context of speech. Perception studies similar to the ones presented in [3], [8] should be useful for us.

We believe the traditional approach of gesture alignment using the phoneme information would have given better gesture timings. We also need a better model for determining the duration and amplitude parameters for the gesture functions. Exploring the range of these parameters along the lines of [10] on exploring the affect space for robots to display emotional body language would be an interesting direction to follow.

As to whether the users were able to remember the new information conveyed by the emphatic hand gestures, this has not been verified yet. It requires extensive analysis of the video recordings and has been planned as future work. Moreover, previous research has shown that hand gestures and head movements play a vital role in turn management. We could not verify whether Nao's gestures also served this kind of role in interaction coordination (Goal 2, Section III), but we believe that non-verbal gestures will be well suited for turn-management, especially to be used instead of the default beep sound that the Nao robot currently employs to explicitly indicate turn changes. However, our findings suggest that open arm hand gestures, head nods and gaze following can significantly enhance Nao's ability to engage users (Goal 1, Section III), verified by the positive difference between the users' experience and expectations of Nao's interactive capability.

TABLE V. USER EXPECTATIONS (uExpect'n) AND THEIR EXPERIENCES (ueSys1/2/3) WITH NAO (figure omitted).

ACKNOWLEDGMENT
The authors thank the organizers of eNTERFACE 2012 at Supelec, Metz, for the excellent environment for this project.

REFERENCES
[1] K. Jokinen, "Pointing Gestures and Synchronous Communication Management," in Development of Multimodal Interfaces: Active Listening and Synchrony, vol. 5967, A. Esposito, N. Campbell, C. Vogel, A. Hussain and A. Nijholt, Eds., Heidelberg, Springer Berlin Heidelberg, 2010, pp. 33-49.
[2] H. H. Clark and E. F. Schaefer, "Contributing to Discourse," Cognitive Science, pp. 259-294, 1989.
[3] K. Jokinen, H. Furukawa, M. Nishida and S. Yamamoto, "Gaze and Turn-Taking Behavior in Casual Conversational Interactions," in ACM Transactions on Interactive Intelligent Systems, Special Issue on Eye Gaze in Intelligent Human-Machine Interaction, ACM, 2010.
[4] A. Csapo, E. Gilmartin, J. Grizou, F. Han, R. Meena, D. Anastasiou, K. Jokinen and G. Wilcock, "Multimodal Conversational Interaction with a Humanoid Robot," in Proceedings of the 3rd IEEE International Conference on Cognitive Infocommunications (CogInfoCom 2012), Kosice, Slovakia, 2012.
[5] G. Wilcock, "WikiTalk: A Spoken Wikipedia-based Open-Domain Knowledge Access System," in Question Answering in Complex Domains (QACD 2012), Mumbai, India, 2012.
[6] J. Cassell, "Embodied Conversation: Integrating Face and Gesture into Automatic Spoken Dialogue Systems," MIT Press, 1989.
[7] S. Al Moubayed, J. Beskow, G. Skantze and B. Granström, "Furhat: A Back-projected Human-like Robot Head for Multiparty Human-Machine Interaction," in Cognitive Behavioural Systems, Lecture Notes in Computer Science, A. Esposito, A. Esposito, A. Vinciarelli, R. Hoffmann and V. C. Müller, Eds., Springer, 2012.
[8] A. Beck, A. Hiolle, A. Mazel and L. Canamero, "Interpretation of Emotional Body Language Displayed by Robots," in Proceedings of the 3rd International Workshop on Affective Interaction in Natural Environments (AFFINE'10), Firenze, Italy, 2010.
[9] K. Jokinen and T. Hurtig, "User Expectations and Real Experience on a Multimodal Interactive System," in Proceedings of Interspeech 2006, Pittsburgh, Pennsylvania, US, 2006.
[10] A. Beck, L. Canamero and K. A. Bard, "Towards an Affect Space for Robots to Display Emotional Body Language," in Proceedings of the 19th IEEE International Symposium on Robot and Human Interactive Communication (Ro-MAN 2010), Principe di Piemonte - Viareggio, Italy, 2010.