Tracking of Facial Features to Support Human-Robot Interaction
Maria Pateraki, Haris Baltzakis, Polychronis Kondaxakis, Panos Trahanias
Institute of Computer Science,
Foundation for Research and Technology - Hellas,
Heraklion, Crete, Greece
{pateraki,xmpalt,konda,trahania}@ics.forth.gr
This work has been partially supported by the EU Information Society Technologies research project INDIGO (FP6-045388) and by the Greek national GSRT project XENIOS.
Abstract— In this paper we present a novel methodology for the detection and tracking of facial features such as the eyes, nose and mouth in image sequences. The proposed methodology is intended to support natural interaction with autonomously navigating robots that guide visitors in museums and exhibition centers and, more specifically, to provide input for the analysis of the facial expressions that humans utilize while engaged in various conversational states. For face and facial feature region detection and tracking, we propose a methodology that combines appearance-based and feature-based methods for recognition and tracking, respectively. For the face tracking stage, the introduced method is based on Least Squares Matching (LSM), a matching technique able to effectively model radiometric and geometric differences between image patches in different images. Compared with previous research, the LSM approach can thus overcome the problems of variable scene illumination and in-plane head rotation. Another significant characteristic of the proposed approach is that tracking is performed on the image plane only where laser range information suggests so. The resulting computational efficiency meets the real-time demands of human-robot interaction applications and hence facilitates the development of relevant systems.
I. INTRODUCTION
A key enabling technology for next-generation robots for
the service, domestic and entertainment market is Human-
Robot Interaction. A socially interactive robot, i.e. a robot that collaborates with humans on a daily basis (be it in care applications or in a professional or private context),
requires interactive skills that go beyond keyboards, button
clicks or metallic voices. For this class of robots, human-like
interactivity is a fundamental part of their functionality. Some
of the greatest challenges towards this goal are related to how
robots perceive the world. As pointed out in [1], in order
to interact meaningfully with humans, a socially interactive
robot must be able to perceive, analyze and interpret the
state of the surrounding environment and/or humans in a
way similar to the way humans do. In other words, it must be
able to sense and interpret the same phenomena that humans
observe.
Unlike humans, who mostly depend on their eyes, most current robots utilize range sensors such as sonars, infrared sensors and laser range scanners in addition to vision sensors. Approaches based on range sensors are very popular for tasks like autonomous navigation [2], [3], [4], [5] and 2D people
tracking [6]. The main advantage of such sensors over vision
ones is that they are capable of providing accurate range
measurements of the environment in large angular fields and
at very fast rates. On the other hand, for tasks like gesture
recognition and face detection, i.e. tasks that require richer information (e.g. intensity, color) or information beyond the
2D scanning plane of a typical range sensor setup, vision is
the only alternative [7].
In this paper we present a novel methodology for detection
and tracking of facial features like eyes, nose and mouth
in image sequences. The proposed approach is intended to
support natural interaction with autonomously navigating
robots that guide visitors in museums and exhibition centers
and, more specifically, to provide input for the analysis
of facial expressions that humans utilize while engaged in
various conversational states. The operational requirements
of such an application challenge existing approaches in that
the visual perception system should operate efficiently un-
der unconstrained conditions regarding occlusions, variable
illumination, moving cameras, and varying background. The
proposed approach combines and extends multiple state-of-the-art techniques to solve a number of related subproblems
like (a) detection and tracking of people in both the ground
plane and the image plane, (b) detection and tracking of
human faces on the image plane and, (c) tracking of specific
facial features like eyes, nose and mouth on the image plane.
People tracking, given the constraints of the application at
hand, is a very challenging task by itself. This is because the
applied method must be computationally efficient, in order
to perform in almost real-time, and robust in the presence
of occlusions, variable illumination, moving cameras and
varying background. A thorough survey on vision-based
approaches to people tracking can be found in [7]. The
referenced methods rely to a great extent on visual detection
of head or face and tend to be time-consuming and less robust
in uncontrolled lighting conditions. Laser-based detection
and tracking can provide a more reliable automatic detection
of humans in dynamic scenes, using one [8], [9], [10] or
multiple registered laser scanners [11]. However, the lack of color information causes difficulties in laser-based methods, e.g. in maintaining the tracked trajectories of different objects when occlusions occur. Therefore, the combination of distance and angle information obtained from a laser scanner with visual information obtained from a camera could support vision-
based methods for faster and more reliable human tracking.
In the field of robotics, “hybrid” methods combining laser
and camera data have appeared recently, and in [12] repre-
sentative methods, e.g. [13], [14], are discussed.
Tracking of human faces and facial features on the image
plane constitutes another challenging task because of face
variability in location, scale, orientation (up-right, rotated),
pose (frontal, profile), age and expression. Furthermore,
it should be irrespective of lighting conditions and scene
content. Detection can be based on different cues: skin color
(color images/videos), motion (videos), face/head shape,
facial appearance, or a combination of them. Comprehen-
sive surveys on face detection and tracking are [15], [16].
Appearance-based methods avoid the difficulties of modeling the 3D structure of faces by considering possible face appearances under various conditions, with AdaBoost learning-based algorithms, e.g. [17], [18], [19], being the most effective so far. Color-based systems may be computationally attractive, but the color constraint alone is insufficient for achieving high-accuracy face detection, mainly due to the large facial color variation under different lighting conditions and among humans of different skin color. Other methods primarily based on color models, e.g. [20], may prove more robust in laboratory environments, but under unconstrained lighting their performance is still limited and they are less suitable for deriving head rotations.
The most important contribution of this paper is the methodology used for face and facial feature region detection and tracking, which combines appearance-based and feature-based methods for recognition and tracking,
respectively. For the face tracking stage, the introduced method is based on Least Squares Matching (LSM), a matching technique able to effectively model radiometric and geometric differences between image patches in different images. Compared with previous research, the LSM approach can overcome the problems of variable scene illumination and of in- and off-plane head rotations.
Another significant characteristic of the proposed method-
ology is that visual people tracking is performed only where laser range information suggests so. The increased computational efficiency meets the real-time demands of the
specific application at hand and facilitates its application to
other crucial robotics tasks as well. Moreover, since informa-
tion encapsulated in visual data acts supplementary to laser
range information, inherent advantages of both sensors are
maintained, leading to implementations combining accuracy,
efficiency and robustness at the same time.
The proposed methodology was tested extensively with
a variety of real data gathered with two different robotic
platforms, both in laboratory and in real museum/exhibition
setups. The results obtained are very promising and demon-
strate its effectiveness.
II. METHODOLOGY OVERVIEW
The basic idea in the proposed methodology is to sufficiently exploit the detection capability of laser scanners and to combine visual information (both grey-level and color images can be used) with the detection results in order to track people, as well as faces and facial features, in dynamic scenes.

Fig. 1. Overview of the laser/vision-based system

The temporal detection of humans relies on laser-based detection and tracking of moving objects (DATMO), i.e. of moving legs, and on the registration of vision-based information to localize the expected face region. After human detection, the face and the facial features are detected and tracked over time. The tracking information of the facial features is used for a later analysis of conversational states.
The methodology is schematically depicted in Fig. 1. Once moving objects have been detected using the Laser Range Scanner (LRS) data, humans are tracked by integrating the calibrated camera information, the field of view, the distance from the laser scanner (baseline) and the minimum and maximum human height. The face that is frontal and closest to the camera is detected and its position and rotation are tracked. Within the enclosing area of the detected face, the facial subregions, i.e. eyes, mouth and nose, are then detected by imposing anthropometric constraints. Tracking of these features is done by area-based matching approaches. In both face and facial region tracking, quality measures of the matching result are computed and evaluated; these measures determine whether the final tracking result should be accepted.
III. LASER-BASED DETECTION AND TRACKING OF MOVING TARGETS
To extract and track multiple moving targets (e.g. legs) against a stationary background (e.g. building walls, desks, chairs, etc.), a Joint Probabilistic Data Association with Interacting Multiple Model (JPDA-IMM) algorithm is employed. Its robustness, in comparison to other techniques, is thoroughly presented in [6]; here, the consecutive steps are
briefly described. Initially, a single occupancy grid is created and every individual grid-cell accumulates a counter, which increases linearly when a laser measurement falls inside its occupancy area. Grid-cells with values above a dynamic threshold level are selected as stationary background. The resulting background is subtracted from every laser frame, leaving only the measurements that represent
possible moving targets. The remaining measurements are
clustered into groups and a center of gravity is assigned
to each one. Finally, the JPDA-IMM initiates and allocates
tracks to clusters that exceed a certain velocity level.
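To illustrate the background-extraction and clustering steps described above, the following Python sketch outlines one possible realization; the grid size, counter threshold and clustering radius are illustrative values only (not taken from the paper), and the JPDA-IMM filter itself is omitted.

import numpy as np

CELL = 0.10              # grid cell size in metres (illustrative value)
GRID_DIM = 400           # 40 m x 40 m occupancy grid centred on the robot
BG_THRESHOLD = 50        # counter value above which a cell is treated as background
CLUSTER_RADIUS = 0.30    # metres; points closer than this are grouped together

occupancy = np.zeros((GRID_DIM, GRID_DIM), dtype=np.int32)

def to_cells(points):
    # Map 2D laser points (metres) to grid indices.
    return np.clip((points / CELL).astype(int) + GRID_DIM // 2, 0, GRID_DIM - 1)

def update_background(points):
    # Accumulate hit counters; frequently hit cells become static background.
    idx = to_cells(points)
    np.add.at(occupancy, (idx[:, 0], idx[:, 1]), 1)

def foreground_points(points):
    # Keep only measurements that fall outside the learned background.
    idx = to_cells(points)
    return points[occupancy[idx[:, 0], idx[:, 1]] < BG_THRESHOLD]

def cluster_centroids(points):
    # Greedy clustering of the remaining points; one centroid per candidate leg.
    centroids, used = [], np.zeros(len(points), dtype=bool)
    for i, p in enumerate(points):
        if used[i]:
            continue
        members = np.linalg.norm(points - p, axis=1) < CLUSTER_RADIUS
        used |= members
        centroids.append(points[members].mean(axis=0))
    return centroids   # fed to the JPDA-IMM tracker as candidate leg positions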
The technique effectively distinguishes targets which move in close proximity to each other, and is also able to compensate for the relative movement of the robot. The identified
moving targets are treated as potential leg candidates, con-
sidering also that, apart from humans, other moving objects
may appear in the scene.
IV. VISION-BASED SYSTEM
The camera is mounted on the same robotic platform as the LRS, at an appropriate height to view humans face-on, and with its optical axis parallel to the LRS pointing direction.
The camera is calibrated using Zhang’s method [21], and
calibration information is utilized at a later stage for LRS
and vision data registration.
A. Human detection
The camera calibration parameters, image resolution, frame rate and the known baseline between the LRS and the video camera are used to register the image with the LRS data. The
laser points, indicated as leg candidates, are projected on
the visual image plane and by including information on the
minimum and maximum human height, e.g. 1.20 m and
1.90 m, respectively, the expected face region is localized in
the image plane. The results of the method are demonstrated
in Fig. 2(a), where the ellipses mark the four moving objects,
corresponding to four legs, and in Fig. 2(b), where the
expected face region is localized using the registered LRS
and vision data and the human height constraints.
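A minimal sketch of this projection step is given below, assuming a simple pinhole model with hypothetical intrinsic parameters, an assumed camera height above the floor, and a negligible LRS-camera baseline; in the actual system the calibrated baseline and full calibration parameters are used.

import numpy as np

# Hypothetical intrinsics (from a Zhang-style calibration): fx, fy, cx, cy in pixels.
K = np.array([[525.0,   0.0, 320.0],
              [  0.0, 525.0, 240.0],
              [  0.0,   0.0,   1.0]])

CAM_HEIGHT = 1.30          # camera height above the floor in metres (assumed)
H_MIN, H_MAX = 1.20, 1.90  # minimum and maximum human height used in the text

def face_roi_from_leg(range_m, bearing_rad, half_width=0.25):
    # Leg candidate in camera coordinates: x to the right, y downwards, z forward.
    x = range_m * np.sin(bearing_rad)
    z = range_m * np.cos(bearing_rad)
    corners = []
    for h in (H_MIN, H_MAX):
        y = CAM_HEIGHT - h                 # points above the camera get negative y
        for dx in (-half_width, half_width):
            p = K @ np.array([x + dx, y, z])
            corners.append(p[:2] / p[2])   # perspective division
    corners = np.array(corners)
    u0, v0 = corners.min(axis=0)
    u1, v1 = corners.max(axis=0)
    return int(u0), int(v0), int(u1), int(v1)   # expected face region in pixels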
B. Face Detection and Tracking
Following body region localization, faces are detected within the given region in order to reject possible outliers arising after LRS processing and to verify that there are people moving towards the robot. We assume that, for the initial detection of the person interacting with the robot, we have frontal views of the person. However, in face tracking we still have to tackle the issues of in- and off-plane head rotations, which are important in the analysis of communicative signs, and of variable scene illumination.

Fig. 2. Registered LRS and vision data: (a) the detected moving objects are marked within the ellipses; (b) the localized expected face areas.
Fig. 3. Detection of faces.

With respect to illumination, our
aim is to place the robot in environments with unconstrained
lighting conditions.
We utilize a hybrid approach by integrating an appearance-
based approach for face detection and a feature-based ap-
proach for face tracking. In the introductory part the advantages of appearance-based methods have already been
pointed out. The robust face detector developed by Viola and Jones [19] is employed in this work. This detector combines four key concepts: Haar features, the integral image for rapid feature computation, the AdaBoost machine-learning method and a cascaded classifier that combines many features efficiently. Unfortunately, this approach suffers from two significant limitations: (a) inability to handle significant in-plane rotations (i.e. rotations of 30 degrees or more), and (b) increased processing time. Although some recent approaches (e.g. [18], [22]) tackle the first limitation (inability to track in-plane rotations) using an extended set of rotated Haar-like features, the required computational power still prohibits their use in applications that involve higher frame rates and/or higher image resolutions. Moreover, they aim at detecting faces in every input frame without maintaining the IDs of specific persons, i.e. they do not perform face tracking.
Therefore, in our method the face is detected in the
initial frame using the Haar method and tracked in the
subsequent frames with the LSM approach, described in the
next section. Only if the LSM tracker fails to converge to a solution does the Haar detector re-initialize the process. In
Fig. 3, the rectangles mark the faces detected in the image by the Haar method.
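The detect-then-track logic can be summarized by the following sketch, using the OpenCV implementation of the Viola-Jones detector; the lsm_track callable stands in for the LSM tracker of the next section, and the parameter values are illustrative.

import cv2

# OpenCV ships a pre-trained frontal-face cascade; the path may differ per installation.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

face_box = None   # (x, y, w, h) of the currently tracked face, or None

def process_frame(gray, roi, lsm_track):
    # Run the Haar detector only when no face is being tracked (or when the
    # LSM tracker failed to converge); otherwise keep tracking the existing face.
    global face_box
    if face_box is None:
        x0, y0, x1, y1 = roi   # expected face region from the laser/projection stage
        faces = cascade.detectMultiScale(gray[y0:y1, x0:x1],
                                         scaleFactor=1.2, minNeighbors=3)
        if len(faces) > 0:
            x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # largest = closest face
            face_box = (x0 + x, y0 + y, w, h)
    else:
        face_box = lsm_track(gray, face_box)   # returns None if LSM does not converge
    return face_box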
C. A Least-Squares Approach to Face Tracking
Cross-correlation is based on the assumption that geometric differences are modeled only by translation and that radiometric differences exist only due to brightness and contrast. Thus, its precision is limited and decreases rapidly if the geometric model is violated (rotations greater than 20 degrees or scale differences between images greater than 30%). A generalization of cross-correlation is Least Squares Matching (LSM) [23], which in its general form can compensate for geometric differences in rotation, scale and shearing.
Several approaches exist in the current literature, mainly from the photogrammetric community, that use least squares for image registration, calibration, surface reconstruction, etc. To the best of our knowledge, the implementation of LSM described in this paper applies LSM to face tracking in a robotic application for the first time.
The formulation of the general estimation model is based on the assumption that there are two or more image windows (called image patches), given as discrete functions f(x, y) and g_i(x, y), i = 1, ..., n−1, where f is the template and g_i the search image patch in the i-th of the n−1 search images. The problem statement is to find the corresponding part of the template image patch f(x, y) in the search images g_i(x, y), i = 1, ..., n−1:

f(x, y) − e_i(x, y) = g_i(x, y)    (1)

Equation (1) gives the least squares grey level observation equations, which relate the f(x, y) and g_i(x, y) image functions or image patches. The true error vector e_i(x, y) is included to model errors that arise from radiometric and geometric differences between the images. For the selection of the geometric model it is assumed that the object surface is approximated by local planar facets, and an affine transformation is generally used. Radiometric corrections (e.g. equalization) for the compensation of different lighting conditions are not included in the model but are applied during LSM.
In our implementation we use two images, and the affine transformation is applied with respect to an initial position (x_0, y_0):

x = a_0 + a_1·x_0 + a_2·y_0
y = b_0 + b_1·x_0 + b_2·y_0    (2)

After linearization of the function g(x, y), (1) becomes:

f(x, y) − e(x, y) = g(x_0, y_0) + (∂g(x_0, y_0)/∂x)·dx + (∂g(x_0, y_0)/∂y)·dy    (3)

With the simplified notation

g_x = ∂g(x_0, y_0)/∂x,   g_y = ∂g(x_0, y_0)/∂y

and by differentiating (2), (3) results in:

f(x, y) − e(x, y) = g(x_0, y_0) + g_x·da_0 + g_x·x_0·da_1 + g_x·y_0·da_2 + g_y·db_0 + g_y·x_0·db_1 + g_y·y_0·db_2    (4)

with the parameter vector x defined as:

x^T = (da_0, da_1, da_2, db_0, db_1, db_2)    (5)

The least squares solution of the system is given by (6):

x̂ = (A^T P A)^{−1} (A^T P l)    (6)
where x̂ is the vector of unknowns, A is the design matrix of the grey level observation equations, P is the weight matrix, and l is the discrepancy vector of the observations. The weight matrix is typically diagonal, with elements set to unity for all grey level observation equations. The number of grey level observations is related to the template size; if, e.g., a patch size of 9×9 is selected, then the number of observation equations is 81.
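For illustration, the sketch below implements a single Gauss-Newton iteration of equations (2)-(6) for the full affine model with unit weights; the conformal constraint and the equalization step used in our implementation are omitted, and bilinear resampling is done with SciPy's map_coordinates.

import numpy as np
from scipy.ndimage import map_coordinates

def lsm_affine_step(f, g, params):
    # One least squares matching iteration: estimate corrections to the affine
    # parameters (a0, a1, a2, b0, b1, b2) that map the coordinates (x0, y0) of
    # the template patch f into the search image g.
    a0, a1, a2, b0, b1, b2 = params
    h, w = f.shape
    yy, xx = np.mgrid[0:h, 0:w]
    x0, y0 = xx.ravel().astype(float), yy.ravel().astype(float)
    # Current warp of the template grid into the search image (eq. 2).
    x = a0 + a1 * x0 + a2 * y0
    y = b0 + b1 * x0 + b2 * y0
    g_warp = map_coordinates(g.astype(float), [y, x], order=1)
    # Image gradients of g, resampled at the warped positions.
    gy_img, gx_img = np.gradient(g.astype(float))
    gx = map_coordinates(gx_img, [y, x], order=1)
    gy = map_coordinates(gy_img, [y, x], order=1)
    # Design matrix A (one grey level observation per template pixel, eq. 4)
    # and discrepancy vector l = f - g; unit weights, so P is the identity.
    A = np.column_stack([gx, gx * x0, gx * y0, gy, gy * x0, gy * y0])
    l = f.ravel().astype(float) - g_warp
    dx = np.linalg.solve(A.T @ A, A.T @ l)   # normal equations, eq. (6) with P = I
    return np.array(params, dtype=float) + dx, np.abs(dx).max()

Starting from the shifts given by the Haar detection (a_0, b_0) and a_1 = b_2 = 1, a_2 = b_1 = 0, such a step would be iterated until the corrections become negligible or the quality criteria described below reject the solution.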
In our implementation of LSM for face tracking, the affine transformation is constrained to a conformal transformation to avoid over-parametrization of the system, since estimating only shifts, rotation and scale suffices to model the geometric differences of frontal faces in the frame sequence. These geometric differences refer to: (a) face scaling, when the person moves closer to or away from the robot, (b) in-plane head rotations and (c) off-plane head rotations. It is known that if there is insufficient signal content, or if the initial approximations in the least squares solution are not close to the real solution, the solution will not converge. These issues can be easily handled in face tracking. Initial approximations for the shifts are taken from the center of the face area detected with the Haar method that initialized the face localization process. The template used for temporal matching is initialized to the center of the detected face and scaled to 75% of its area. Equalization is applied to both the template and the search area during LSM to compensate for radiometric differences. The template is updated when the solution converges, and the process continues. Variable illumination poses less of an issue, since the search area expands around the initial patch by at most half the size of the largest dimension of the patch. The only drawback is the size of the template when the face is very close to the camera, which increases the number of observation equations. A proposed solution is to apply LSM to images of lower resolution and to transform the matching results back to the original level.
As far as quality is concerned, the criteria used to evaluate the matching results are the number of iterations, the variation of the parameter values in each iteration, and the magnitude of the estimated parameters. The number of iterations is a rather good criterion, assuming that the approximations are good. In parallel, the variations in the parameter values (magnitude and sign) in each iteration have to be observed in order to evaluate the stability of the solution. The threshold for the iterations should not be set too high (a maximum of 15 iterations), considering that fast convergence should be achieved, since the initial values are close to the correct solution. The estimated parameters, especially the scale values, should stay within a lower bound (0.3) and an upper bound (3.0), and the difference from their initial values should be small; the initial values for the scales are set to 1. The variation of the x, y coordinates from their initial values is also checked, considering the utilized frame rate, and if it is above a certain threshold the point is rejected.
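These acceptance criteria can be summarized in a small helper such as the following sketch; max_shift is an application-specific threshold that depends on the frame rate.

def accept_lsm_solution(n_iter, scale_x, scale_y, shift,
                        max_shift, max_iter=15, scale_low=0.3, scale_high=3.0):
    # Reject slowly converging solutions (unstable estimates).
    if n_iter > max_iter:
        return False
    # Reject implausible scale estimates.
    for s in (scale_x, scale_y):
        if not (scale_low < s < scale_high):
            return False
    # Reject patches that moved too far from their initial position.
    return shift <= max_shift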
D. Facial Feature Region Detection
Eyes, nose and mouth are the facial regions that are
detected. However, the image resolution of the face is an
important factor in facial feature region detection. When the
face area is smaller than 70 x 90 pixels the facial regions
become hard to detect [24]. As in the case of face detection,
individual sets of Haar-like features for each region are used
to detect the eyes, mouth and nose area, within the detected
and tracked face region using the method from [19].
Fig. 4. Examples of face and facial feature tracking results under variable lighting conditions and backgrounds.

False detections may arise, especially in faces of larger scale or higher resolution, and anthropometric constraints are imposed for the reliability of the solution. The eye, mouth and nose regions should be found in the upper half, the lower half and the central part of the face, respectively, while the ratio of their respective widths to the width of the face should, by coarse approximation, be close to 0.4, a value learned empirically from normalized measurements on a number of frontal faces. These constraints proved to improve detection reliability.
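A sketch of such an anthropometric filter is shown below; the vertical bands and the width-ratio tolerance are illustrative values built around the coarse 0.4 ratio mentioned above.

def plausible_feature(face_box, feat_box, kind, width_ratio=0.4, tol=0.15):
    # Reject feature detections that violate coarse anthropometric constraints:
    # eyes in the upper half, mouth in the lower half, nose in the central part
    # of the face, and a feature width of roughly 0.4 of the face width.
    fx, fy, fw, fh = face_box
    x, y, w, h = feat_box
    cy = (y + h / 2.0 - fy) / float(fh)   # vertical centre, normalised to [0, 1]
    if kind == "eye" and cy > 0.5:
        return False
    if kind == "mouth" and cy < 0.5:
        return False
    if kind == "nose" and not (0.3 < cy < 0.7):
        return False
    return abs(w / float(fw) - width_ratio) < tol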
E. Facial Feature Region Tracking
The detected facial feature regions are tracked in the
subsequent video frames by using cross-correlation of image
patches. This approach is justified as there are only small
deviations in the relative positions of these feature areas with
respect to the position of the detected and tracked face within
the image. The previously detected eye, nose and mouth
regions are used as templates in the matching process. The
process is computationally efficient, since Haar detection is used only for the initial detection, whereas the templates used in the tracking procedure are taken from the same object.
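The per-feature tracking step can be sketched as follows, using OpenCV's normalised cross-correlation; the search margin is an illustrative value.

import cv2

def track_feature(frame_gray, template, prev_box, margin=10):
    # Track a previously detected feature region by cross-correlating its stored
    # template inside a small search window around the last known position.
    x, y, w, h = prev_box
    H, W = frame_gray.shape
    x0, y0 = max(x - margin, 0), max(y - margin, 0)
    x1, y1 = min(x + w + margin, W), min(y + h + margin, H)
    search = frame_gray[y0:y1, x0:x1]
    result = cv2.matchTemplate(search, template, cv2.TM_CCOEFF_NORMED)
    _, score, _, loc = cv2.minMaxLoc(result)
    return (x0 + loc[0], y0 + loc[1], w, h), score   # updated box and match quality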
V. EXPERIMENTAL RESULTS
The method proposed in this paper has been implemented
and assessed on two different robotic platforms in various
laboratory and real application setups.
For laboratory testing, we utilized an iRobot-B21r plat-
form equipped with a SICK-PLS laser range finder and
a low-end camera operating at a standard resolution of
640×480 pixels. The range finder is capable of scanning 180
degrees of the environment, with an angular resolution of one
measurement per degree and a range measuring accuracy of
5 cm.
The platform utilized to collect data from the actual exhibition place was a Neobotix NE-470 robotic platform equipped with a Point Grey Bumblebee2 stereo vision head, operating at the same 640×480 resolution, and a SICK
S300 laser range finder. This specific range finder is capable
of achieving an angular resolution of two measurements per
degree and its placement in front of the robot ensured a field
of view of about 210 degrees.
Prior to testing our methodology, an internal calibration procedure was applied to both robots in order to estimate
the relative positions and the intrinsic parameters of all
sensors.
Examples from the human tracking results have been
previously shown in Figs. 2 and 3.
The proposed methodology for face and facial feature detection and tracking was tested with different video sequences recorded in the laboratory, as well as in various locations at the exhibition place. Fig. 4 shows results from the video sequences recorded at various locations of the exhibition place. As can be seen, the method is able to handle severe illumination and background conditions, yet extract the face and facial features reliably. The images in Figs. 4(a) and 4(b) were recorded in a room with low lighting and monitors in the background, which pose a problem for methods employing background subtraction. Even in the case of strong background lighting, as in Fig. 4(b), the feature areas can be tracked. Moreover, our method is able to handle (a) dynamic backgrounds, e.g. Figs. 4(d), 4(e), 4(f), and (b) scene objects very close in color to skin, e.g. Figs. 4(b) and 4(f) (clothes of similar color to skin).
In addition, the comparative advantage of the LSM method for face tracking over other commonly used methods is demonstrated. LSM is able to compensate for geometric and radiometric differences between image patches. In Fig. 5,
results of LSM versus pure Haar-based face tracking and the
CMU tracker are shown. The indicative images are selected
from a video sequence recorded in the laboratory. As can be
easily observed, in this result, the Haar-based method fails
to track the face when in-plane rotations occur. The CMU
method also fails to provide reliable results for position as
well as for rotation, whereas the LSM provides a solution,
along with certain measures to evaluate the tracking result.
In all experiments conducted, including the ones presented
above, the LSM tracking operated at a frame rate of 30 fps,
on a Pentium Core Duo 2.8 GHz.
VI. CONCLUSIONS AND FUTURE WORK
In this paper we have presented a novel methodology
for robust detection and tracking of human faces and fa-
cial features in image sequences, intended for human-robot
interaction applications.
According to the proposed methodology, the 3D locations
of people, as produced by a people tracker that utilizes laser
range data to track people on the ground plane, are projected
on the image plane in order to identify image regions that
may contain human faces. A state-of-the-art, appearance-
based method is used in order to specifically detect human
faces within these regions and to initiate a feature-based tracker
that tracks these faces as well as specific facial features over
time.
Experimental results have confirmed the effectiveness
and the increased computational efficiency of the proposed
methodology, proving that the individual advantages of all
involved components are maintained, leading to implemen-
tations that combine accuracy, efficiency and robustness at
the same time.
We intend to use the proposed methodology in order
to support natural interaction with autonomously navigating
robots that guide visitors in museums and exhibition centers.
More specifically the proposed methodology will provide
input for the analysis of facial expressions that humans utilize
while engaged in various conversational states.
Future work includes extension of the LSM temporal
tracking to handle stereo vision by exploiting epipolar con-
straints. Moreover, the methodology presented in the paper
will be employed in an integrated system for naturalistic
human-robot interaction.
Fig. 5. Comparison of the Haar, CMU and LSM face trackers in the presence of in-plane rotation. Results from the Haar-based detection in (a), (b), (c), the CMU tracker in (d), (e), (f), and the LSM tracker in (g), (h), (i).
REFERENCES
[1] T. Fong, I. Nourbakhsh, and K. Dautenhahn, “A survey of socially interactive robots,” Robotics and Autonomous Systems, vol. 42, pp. 143–166, 2003.
[2] J. Castellanos, J. Tardós, and J. Neira, “Constraint-based mobile robot localization,” in Proc. Intl. Workshop on Advanced Robotics and Intelligent Machines, University of Salford, Manchester, U.K., 1996.
[3] S. Thrun, A. Buecken, W. Burgard, D. Fox, T. Froehlinghaus, D. Hen-
nig, T. Hofmann, M. Krell, and T. Schmidt, “Map learning and high-
speed navigation in RHINO,” University of Bonn, Department of
Computer Science, Technical Report IAI-TR-96-3, July 1996.
[4] J.-S. Gutmann and K. Konolige, “Incremental mapping of large cyclic environments,” in Proc. IEEE Intl. Symposium on Computational Intelligence in Robotics and Automation (CIRA), Monterey, CA, 1999, pp. 318–325.
[5] H. Baltzakis and P. Trahanias, “A hybrid framework for mobile
robot localization. formulation using switching state-space models,”
Autonomous Robots, vol. 15, no. 2, pp. 169–191, 2003.
[6] P. Kondaxakis, S. Kasderidis, and P. Trahanias, “A multi-target tracking technique for mobile robots using a laser range scanner,” in Proc. IEEE Intl. Conf. on Robots and Systems (IROS’08), Nice, France, 2008.
[7] D. Gavrilla, “The visual analysis of human movement: A survey,”
Computer Vision and Image Understanding, vol. 73, no. 1, pp. 82–98,
1999.
[8] A. Fod, A. Howard, and M. Mataric, “Laser-based people tracking,” in Proc. IEEE Intl. Conf. on Robotics and Automation (ICRA’02), Washington, DC, USA, 2002, pp. 3024–3029.
[9] D. Schulz, W. Burgard, D. Fox, and A. Cremers, “Tracking multiple
moving objects with a mobile robot,” in Proc. IEEE Intl. Conf. on
Computer Vision and Pattern Recognition (CVPR’01), 2001, pp. 371–
377.
[10] B. Kluge, C. Koehler, and E. Prassler, “Fast and robust tracking of multiple moving objects with a laser range finder,” in Proc. IEEE Intl. Conf. on Robotics and Automation (ICRA’01), 2001, pp. 1683–1688.
[11] H. Zhao and R. Shibasaki, “A novel system for tracking pedestrians using multiple single-row laser range scanners,” IEEE Trans. Syst., Man, Cybern. A, vol. 35, no. 2, pp. 283–291, 2004.
[12] J. Cui, H. Zha, H. Zhao, and R. Shibasaki, “Multi-modal tracking
of people using laser scanners and video camera,” Image and Vision
Computing, vol. 26, no. 2, pp. 240–252, 2008.
[13] Z. Byers, M. Dixon, K. Goodier, C. Grimm, and W. Smart, “An autonomous robot photographer,” in Proc. IEEE Intl. Conf. on Robots and Systems (IROS’03), Las Vegas, Nevada, USA, 2003, pp. 2636–2641.
[14] M. Scheutz, J. McRaven, and G. Cserey, “Fast, reliable, adaptive bimodal people tracking for indoor environments,” in Proc. IEEE Intl. Conf. on Robots and Systems (IROS’04), Sendai, Japan, 2004, pp. 1347–1352.
[15] M.-H. Yang, D. J. Kriegman, and N. Ahuja, “Detecting faces in images: A survey,” IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 1, pp. 34–58, 2002.
[16] E. Hjelmas and B. K. Low, “Face detection: A survey,” Computer Vision and Image Understanding, vol. 3, no. 3.
[17] S. Li and Z. Zhang, “Floatboost learning and statistical face detection,”
IEEE Trans. Pattern Anal. Machine Intell., vol. 26, no. 9.
[18] R. Lienhart, A. Kuranov, and V. Pisarevsky, “Empirical analysis of
detection cascades of boosted classifiers for rapid object detection,”
Intel Labs, Amherst, MA, MRL Technical Report, 2002.
[19] P. Viola and M. Jones, “Robust real-time face detection,” Int. J.
Comput. Vision, vol. 57, no. 2, pp. 137–154, 2004.
[20] F. J. Huang and T. Chen, “Tracking of multiple faces for human-computer interfaces and virtual environments,” in IEEE Intl. Conf. on Multimedia and Expo, 2000.
[21] Z. Zhang, “A flexible new technique for camera calibration,” IEEE
Trans. Pattern Anal. Machine Intell., vol. 22, no. 11, pp. 1330–1334,
2000.
[22] C. Messom and A. Barczak, “Fast and efficient rotated haar-like
features using rotated integral images,” in Proc. of 2006 Australian
Conference on Robotics and Automation (ACRA ’06), 2006.
[23] F. Ackermann, “High Precision Digital Image Correlation,” in Photogrammetric Week, Heft 9, Universität Stuttgart, 1984.
[24] Y. Tian, “Evaluation of face resolution for expression analysis,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition Workshop (CVPRW’04). IEEE Computer Society, 2004, pp. 82–89.