Re-identification with RGB-D sensors
Igor Barros Barbosa1,3, Marco Cristani1,2, Alessio Del Bue1,
Loris Bazzani1, and Vittorio Murino1
1Pattern Analysis and Computer Vision (PAVIS) - Istituto Italiano di Tecnologia
(IIT), Via Morego 30, 16163 Genova, Italy
2Dipartimento di Informatica, University of Verona,
Strada Le Grazie 15, 37134 Verona, Italy
3Université de Bourgogne, 720 Avenue de l'Europe, 71200 Le Creusot, France
Abstract. Person re-identification is a fundamental operation for any multi-camera surveillance scenario. Until now, it has been performed primarily by exploiting appearance cues, under the hypothesis that the individuals do not change their clothes. In this paper, we relax this constraint by presenting a set of 3D soft-biometric cues, insensitive to appearance variations, that are gathered using RGB-D technology. The joint use of these characteristics provides encouraging performance on a benchmark of 79 people captured on different days and in different clothing. This promotes a novel research direction for the re-identification community, supported also by the fact that a new generation of affordable RGB-D cameras has recently entered the worldwide market.
Keywords: Re-identification, RGB-D sensors, Kinect
1 Introduction
The task of person re-identification (re-id) consists in recognizing an individual
in different locations over a set of non-overlapping camera views. It represents
a fundamental task for heterogeneous video surveillance applications, especially
for modeling long-term activities inside large and structured environments, such
as airports, museums, shopping malls, etc. In most cases, re-id approaches rely only on appearance-based techniques, in which it is assumed that individuals do not change their clothing within the observation period [1–3]. This hypothesis is a very strong restriction, since it constrains re-id methods to a limited temporal range (reasonably, on the order of minutes).
In this paper we remove this restriction, presenting a new approach to person re-id that uses soft-biometric cues as features. In general, soft-biometric cues have been exploited in different contexts: to aid facial recognition [4], as features in security surveillance solutions [5, 6], or for person recognition under a bag-of-words policy [7]. In [4], the soft-biometric cues are limb sizes, which were measured manually. The approaches in [5–7] are based on data coming from 2D cameras and extract soft-biometric cues such as gender, ethnicity, clothing, etc.
To the best of our knowledge, 3D soft-biometric features for re-identification have been employed only in [4], but in that case the scenario is strongly supervised and requires the full cooperation of the user for manual measurements. In contrast, a viable soft-biometric system should mostly deal with subjects without requiring strong collaboration from them, in order to extend its applicability to more practical scenarios.
In our case, the cues are extracted from range data computed using RGB-D cameras. Recently, novel RGB-D camera sensors such as the Microsoft Kinect and the Asus Xtion PRO, both manufactured using techniques developed by PrimeSense [8], have provided the community with a fast and affordable way of acquiring depth information. This has driven researchers to use RGB-D cameras in different fields of application, such as pose estimation [9] and object recognition [10], to name a few. In our opinion, re-id can be extended to novel scenarios by exploiting this technology, overcoming the constraint that people must not change their clothes.
In particular, our aim is to extract a set of features computed directly on the range measurements given by the sensor. Such features are related to specific anthropometric measurements computed automatically from the person's body. In more detail, we introduce two distinct subsets of features. The first subset contains cues computed from the skeleton fitted to the depth data, i.e., Euclidean distances between selected body parts, such as the legs and arms, and the overall height. The second subset contains features computed on the surface given by the range data. They come in the form of geodesic distances computed between a predefined set of joints (e.g., from torso to right hip). The latter measure gives an indication of the curvature (and, by approximation, of the size) of specific regions of the body.
After analyzing the effectiveness of each feature separately and performing a pruning stage aimed at removing non-influential cues, we studied how the remaining features should be weighted in order to maximize re-identification performance. We obtained encouraging re-id results on a pool of 79 people, acquired at different times across several days. This supports our approach and, in general, the idea of performing re-id with 3D soft-biometric cues extracted from RGB-D cameras.
The remainder of the paper is organized as follows. Section 2 briefly presents the re-identification literature. Section 3 details our approach, followed by Section 4, which shows experimental results. Finally, Section 5 concludes the paper, envisaging some future perspectives.
2 State of the art
Most of the re-identification approaches build on appearance-based features [1, 11, 3], which prevents them from addressing re-id scenarios where the clothing may change. A few approaches constrain the re-id operating conditions by reducing the problem to temporal reasoning: they use the layout of the camera network and temporal information to prune away some candidates in the gallery set [12].
The adoption of 3D body information in the re-identification problem was first introduced in [13], where a coarse, rigid 3D body model was fitted to different pedestrians. Given this 3D localization, the person silhouettes can be related across the different orientations of the body as viewed from different cameras. The registered data are then used to perform appearance-based re-identification. In contrast, we employ genuine soft-biometric cues of a truly non-rigid body and disregard appearance altogether. This possibility is afforded by current technology, which allows reliable anatomic cues to be extracted from the depth information provided by a range sensor.
In general, the methodological approach to re-identification can be divided
into two groups: learning-based and direct strategies. Learning-based methods split a re-id dataset into two sets, training and test [1, 3]: the training set is used for learning features and strategies for combining them, while the test set is used for validation. Direct strategies [11] are simple feature extractors. Learning-based strategies are usually considerably more time-consuming (considering the training and testing steps), but more effective than direct ones. Under this taxonomy, our proposal can be defined as a learning-based strategy.
3 Our approach
Our re-identification approach has two distinct phases. First, a signature is computed from the range data of each subject. This signature is a composition of several soft-biometric cues extracted from the depth data acquired with an RGB-D sensor. In the second phase, these signatures are matched against the test subjects of the gallery set. A learning stage, performed beforehand, determines how each feature is weighted when combined with the others: a feature with a high weight is one that is useful for obtaining good re-identification performance.
3.1 First stage: signature extraction
The first step processes the data acquired by an RGB-D camera such as the Kinect. This sensor projects a structured infrared light pattern [8] onto the scene; the system then obtains a depth map of the scene by measuring the distortion of the pattern induced by the 3D relief of the objects. When RGB-D cameras are used with the OpenNI framework [14], the acquired depth map can be used to segment and track human bodies, estimate the human pose, and perform metric 3D scene reconstruction. In our case, the information used is the segmented point cloud of a person, the positions of the fifteen body joints, and the estimate of the floor plane. Although the person's depth map and pose are given by the OpenNI software libraries, the segmentation of the floor required an initial pre-processing step using RANSAC to fit a plane to the ground. Additionally, a mesh was generated from the person's point cloud using the "Greedy Projection" method [15].
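To make this pre-processing concrete, the following is a minimal sketch of a RANSAC plane fit of the kind used for the floor, written in plain NumPy. The function name, parameters and thresholds are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ransac_floor_plane(points, n_iters=500, inlier_thresh=0.02, seed=None):
    """Fit a plane n.x + d = 0 to an (N, 3) point cloud with RANSAC.

    Returns the (unit normal, d) pair supported by the most inliers, where
    points within `inlier_thresh` meters of the plane count as inliers.
    """
    rng = np.random.default_rng(seed)
    best_inliers, best_plane = 0, None
    for _ in range(n_iters):
        # 1. Sample a minimal set of three points and build a candidate plane.
        p0, p1, p2 = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                 # degenerate (collinear) sample
            continue
        normal /= norm
        d = -normal.dot(p0)
        # 2. Count the points lying close enough to the candidate plane.
        inliers = int((np.abs(points @ normal + d) < inlier_thresh).sum())
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane
```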
Before focusing on the signature extraction, a preliminary study was performed by examining a set of 121 features on a dataset of 79 individuals, each captured on 4 different days (see more information on the dataset in Sec. 4). These features can be partitioned into two groups. The first contains the skeleton-based features, i.e., cues based on the exhaustive combination of distances among joints and distances between the floor plane and all the joints. The second group contains the surface-based features, i.e., geodesic distances on the mesh surface computed between different joint pairs. In order to determine the most relevant features, a feature selection stage evaluates the re-identification performance of each cue independently, one at a time. In particular, as a measure of re-id accuracy, we evaluated the normalized area under the curve (nAUC) of the cumulative matching curve (CMC), discarding those features which performed equivalently to a random choice of the correct match (see more information on these classification measures in Sec. 4).
The result of this pruning stage was the following set of 10 features:
Skeleton-based features
d1: Euclidean distance between floor and head
d2: Ratio between torso and legs
d3: Height estimate
d4: Euclidean distance between floor and neck
d5: Euclidean distance between neck and left shoulder
d6: Euclidean distance between neck and right shoulder
d7: Euclidean distance between torso center and right shoulder
Surface-based features
d8: Geodesic distance between torso center and left shoulder
d9: Geodesic distance between torso center and left hip
d10: Geodesic distance between torso center and right hip
Some of the features based on the distance from the floor are illustrated in Fig. 1, together with the localization of the joints on the body. In particular, the second feature (ratio between torso and legs) is computed according to the following equation:
d2 = [ mean(d5 + d6) / mean(d_floor-Lhip + d_floor-Rhip) ] · (d1)^(-1)    (1)
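As an illustration, the sketch below computes the skeleton-based cues from a dictionary of OpenNI-style 3D joint positions and the floor plane estimated above. The joint names and the helper are hypothetical, d3 is omitted because its estimation procedure is not detailed here, and d2 follows Eq. (1) as printed (the mean over the two paired distances cancels in the ratio).

```python
import numpy as np

def point_to_floor(p, normal, d):
    # Unsigned distance from a 3D point to the floor plane n.x + d = 0
    # (unit normal n, as returned by the RANSAC fit).
    return abs(np.dot(normal, p) + d)

def skeleton_features(joints, normal, d):
    d1 = point_to_floor(joints["head"], normal, d)               # floor-head
    d4 = point_to_floor(joints["neck"], normal, d)               # floor-neck
    d5 = np.linalg.norm(joints["neck"] - joints["l_shoulder"])   # neck-L shoulder
    d6 = np.linalg.norm(joints["neck"] - joints["r_shoulder"])   # neck-R shoulder
    d7 = np.linalg.norm(joints["torso"] - joints["r_shoulder"])  # torso-R shoulder
    floor_lhip = point_to_floor(joints["l_hip"], normal, d)
    floor_rhip = point_to_floor(joints["r_hip"], normal, d)
    d2 = (d5 + d6) / (floor_lhip + floor_rhip) / d1              # Eq. (1)
    return {"d1": d1, "d2": d2, "d4": d4, "d5": d5, "d6": d6, "d7": d7}
```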
The computation of the (approximated) geodesic distances, i.e., torso to left shoulder, torso to left hip, and torso to right hip, is given by the following steps. First, the selected joint pairs, which normally do not lie on the point cloud, are projected onto their respective closest points on the surface. This yields a start and an end point on the surface, from which an A* algorithm computes the minimum path over the point cloud (Fig. 2). Since the torso is usually recovered by the RGB-D sensor with higher precision, the computed geodesic features should also be reliable.
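The sketch below approximates this procedure with SciPy: both joints are projected onto their nearest surface points, and a shortest-path search is run over a k-nearest-neighbor graph of the point cloud. The paper runs A* on the mesh; Dijkstra (A* with a zero heuristic) is used here for brevity, and all names are ours.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def geodesic_distance(cloud, joint_a, joint_b, k=8):
    """Approximate geodesic distance between two joints over an (N, 3) cloud."""
    tree = cKDTree(cloud)
    # 1. Project both joints onto their closest points on the surface.
    src = tree.query(joint_a)[1]
    dst = tree.query(joint_b)[1]
    # 2. Build a k-NN graph weighted by Euclidean edge length
    #    (column 0 of the query result is the point itself, so skip it).
    dists, nbrs = tree.query(cloud, k=k + 1)
    rows = np.repeat(np.arange(len(cloud)), k)
    graph = csr_matrix((dists[:, 1:].ravel(), (rows, nbrs[:, 1:].ravel())),
                       shape=(len(cloud), len(cloud)))
    # 3. The shortest-path length approximates the geodesic distance.
    return dijkstra(graph, directed=False, indices=src)[dst]
```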
As a further check on the 10 selected features, we verified their accuracy by measuring them manually on a restricted set of subjects. We found that the features related to height (d1, ..., d4) were captured with the highest precision, while the other features were slightly noisier. In general, all these features are well suited to indoor usage, in which people do not wear heavy clothes that might hide the shape of the body.
(Figure 1 annotations: d1: floor-head, d3: height, d4: floor-neck, d5, d6, d7, floor-left hip, floor-right hip, floor-torso.)
Fig. 1. Distances employed for building the soft-biometric features (in black), and some of the soft-biometric features (in green). It is important to notice that the joints are not localized on the outskirts of the point cloud but, in most cases, in the proximity of the real articulations of the human body.
Fig. 2. Geodesic features: the red line represents the path found by A* between torso and left shoulder, torso and left hip, and torso and right hip.
3.2 Second stage: signature matching
This section illustrates how the selected features can be jointly employed in the re-id problem. In the literature, a re-id technique is usually evaluated considering two sets of personal ID signatures: a gallery set A and a probe set B. The evaluation consists in associating each ID signature of the probe set B with a corresponding ID signature in the gallery set A. For the sake of clarity, let us suppose we have N different ID signatures in the probe set (each one representing a different individual, so N different individuals) and the same in the gallery set. All N subjects in the probe set are present in the gallery. For evaluating the performance of a re-id technique, the most used measure is the Cumulative Matching Curve (CMC) [1], which models the mean probability that any probe signature is correctly matched within the first T ranked gallery individuals, where the ranking is obtained by sorting the distances between ID signatures in ascending order.
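For reference, the following is a compact sketch of how the CMC and its nAUC can be computed from an N × N probe-gallery distance matrix, under the simplifying assumption that the true match of probe i is gallery entry i.

```python
import numpy as np

def cmc_curve(dist):
    """dist[i, j]: distance of probe i to gallery j; true match on the diagonal."""
    N = dist.shape[0]
    # Rank of the correct gallery entry for each probe (0 = closest).
    ranks = (dist < np.diag(dist)[:, None]).sum(axis=1)
    # CMC(T): fraction of probes whose true match falls in the top T.
    return np.cumsum(np.bincount(ranks, minlength=N)) / N

def nauc(cmc):
    # Normalized area under the CMC curve (1.0 = perfect re-identification).
    return cmc.mean()
```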
In our case, each ID signature is composed of F features (F = 10), and each feature has a numerical value. Let us define the distance between corresponding features as the squared difference between them. For each feature, we obtain an N × N distance matrix. However, such matrices are biased towards features with larger measured values, leading to a problem of heterogeneity of the measures. If a feature such as height is used, it counts more than features whose range of values is more compact (e.g., the distance between neck and left shoulder). To avoid this problem, we normalize all features to zero mean and unit variance, using the data from the gallery set to compute the mean and variance of each feature.
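A minimal sketch of this normalization and of the per-feature distance matrices, assuming the gallery and probe signatures are stored as (N, F) arrays:

```python
import numpy as np

def feature_distance_matrices(gallery, probe):
    mu, sigma = gallery.mean(axis=0), gallery.std(axis=0)
    g = (gallery - mu) / sigma        # z-scoring with gallery statistics only
    p = (probe - mu) / sigma
    # diff[i, j, f]: squared difference of feature f between probe i and
    # gallery j, i.e., one N x N distance matrix per feature.
    return (p[:, None, :] - g[None, :, :]) ** 2   # shape (N, N, F)
```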
Given the normalized N × N distance matrices, we now have to combine them into a single distance matrix, from which the final CMC curve is obtained. The naive way to integrate them would be to simply average the matrices. Instead, we propose a weighted sum of the distance matrices. Let us define a set of weights w_i, for i = 1, ..., F, that represent the importance of the i-th feature: the higher the weight, the more important the feature. Since tuning these weights by hand is usually difficult, we adopt a quasi-exhaustive learning strategy, i.e., we explore the weight space (from 0 to 1 with step 0.01) in order to select the weights that maximize the nAUC score. In the experiments, we report the values of these weights and compare this strategy with the averaging baseline.
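The sketch below illustrates the weighted fusion together with a stand-in for the weight learning. A truly exhaustive scan of a 0.01-step grid over ten weights is infeasible (101^10 combinations), so this version samples the grid at random and keeps the weights with the best nAUC, in the spirit of the quasi-exhaustive search; `feature_distance_matrices`, `cmc_curve` and `nauc` refer to the sketches above.

```python
import numpy as np

def fuse(dist, w):
    # dist: (N, N, F) per-feature distance matrices; w: (F,) weights.
    return dist @ w

def learn_weights(dist_train, n_samples=50_000, seed=0):
    rng = np.random.default_rng(seed)
    F = dist_train.shape[2]
    best_w, best_score = None, -np.inf
    for _ in range(n_samples):
        w = rng.integers(0, 101, size=F) / 100.0   # grid values 0, 0.01, ..., 1
        score = nauc(cmc_curve(fuse(dist_train, w)))
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```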
4 Experiments
In this section, we first describe how we built the experimental dataset and how we formalised the re-id protocol. Then, an extensive validation is carried out on the test dataset under different conditions.
4.1 Database creation
Our dataset is composed of four different groups of data. The first, "Collaborative", group was obtained by recording 79 people with a frontal view, walking slowly, avoiding occlusions, and with stretched arms. This happened in an indoor scenario, where the people were at least 2 meters away from the camera. This scenario represents a collaborative setting, the only one that we considered in these experiments. The second ("Walking") and third ("Walking 2") groups of data are composed of frontal recordings of the same 79 people walking normally while entering the lab where they usually work. The fourth group ("Backwards") is a back-view recording of the people walking away from the lab. Since the acquisitions were performed on different days, there is no guarantee that visual aspects like clothing or accessories are kept constant. Figure 3 shows the meshes computed for different people during the recording of the four sessions, together with some statistics about the collected features.
Fig. 3. Illustration of the different groups in the recorded data, rows from top to bottom: "Walking", "Walking 2", "Backwards" and "Collaborative". Note that people changed their clothing across the acquisitions on different days. On the right, statistics of the "Walking" dataset: for each feature, the histogram is shown; in parentheses, its mean value (in cm, except d2) and standard deviation.
From each acquisition, a single frame was automatically selected for the computation of the biometric features. The selection takes the frame with the highest confidence of the tracked skeleton joints¹ that is closest to the camera and not cropped by the sensor's field of view; in most cases, this frame was approximately 2.5 meters away from the camera. After that, the mesh of each subject was computed, and the 10 soft-biometric cues were extracted using both skeleton and geodesic information.
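A minimal sketch of this selection rule follows; the per-frame fields are assumptions about how the recordings could be stored, not the authors' data structures.

```python
def select_frame(frames):
    # Keep frames where the subject is fully inside the field of view, then
    # pick the highest joint-tracking confidence, breaking ties in favor of
    # frames closer to the camera.
    candidates = [f for f in frames if not f["cropped"]]
    return max(candidates,
               key=lambda f: (f["joint_confidence"], -f["distance_to_camera"]))
```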
4.2 Semi-Cooperative re-id
Given the four datasets, we built a semi-collaborative scenario, in which the gallery set was composed of the ID signatures of the "Collaborative" setting and the test data were the "Walking 2" set. The CMCs related to each feature are portrayed in Fig. 4: they show how well each feature captures discriminative information about the analyzed subjects. Fig. 5 shows the normalized AUC of each feature. Notice that the features associated with the height of the person are very meaningful, as is the ratio between torso and legs.
¹ Such a confidence score is a byproduct of the skeleton-fitting algorithm.
(Figure 4 plots one CMC curve per feature, d1-d10; x-axis: Rank [k], y-axis: Recognition Rate [%].)
Fig. 4. Single-feature CMCs — "Collaborative" vs. "Walking 2" (best viewed in color)
Feature    d1    d2    d3    d4    d5    d6    d7    d8    d9    d10
nAUC (%)   88.1  76.2  80.3  63.8  52.8  52.8  54.1  58.2  62.0  69.7
Fig. 5. Normalized area under the curve (nAUC) for each feature (the numbering follows the feature enumeration presented in Sec. 3) — "Collaborative" vs. "Walking 2". The values are reported in the table above.
The results in Fig. 5 highlight that the nAUC of the different features spans from 52.8% to 88.1%; thus, all of them contribute to better re-identification results. To investigate how their combination helps re-id, we exploit the learning strategy proposed in Sec. 3.2. The weights w_i are learned once, using a different dataset than the one used for testing. The obtained weights are: w1 = 0.24, w2 = 0.17, w3 = 0.18, w4 = 0.09, w5 = 0.02, w6 = 0.02, w7 = 0.03, w8 = 0.05, w9 = 0.08, w10 = 0.12. The weights mirror the nAUC obtained for each feature independently (Fig. 5): the most relevant ones are d1 (Euclidean distance between floor and head), d2 (ratio between torso and legs), d3 (height estimate), and d10 (geodesic distance between torso center and right hip). In Fig. 6, we compare this strategy with a baseline, the averaging case where w_i = 1/F for each i. The learning strategy clearly gives better results (nAUC = 88.88%) than both the baseline (nAUC = 76.19%) and the best single feature (nAUC = 88.10%, corresponding to d1 in Fig. 5). For the rest of the experiments, the learning strategy is adopted.
4.3 Non-Cooperative re-id
Non-cooperative scenarios consist of the "Walking", "Walking 2" and "Backwards" datasets. We generate different experiments by combining cooperative and non-cooperative scenarios as gallery and probe sets. Table 1 reports the nAUC scores of the trials we carried out. The non-cooperative scenarios gave rise to higher performances than the cooperative ones. The reason is that, in the collaborative acquisition, people tended to move in a very unnatural and constrained way, thus biasing the measurements towards a specific posture. In the non-cooperative settings this clearly did not happen.
Fig. 6. Compilation of final CMC curves — "Collaborative" vs. "Walking 2"
Gallery     Probe      nAUC
Collab.     Walking    90.11 %
Collab.     Walking 2  88.88 %
Collab.     Backwards  85.64 %
Walking     Walking 2  91.76 %
Walking     Backwards  88.72 %
Walking 2   Backwards  87.73 %
Table 1. nAUC scores for the different re-id scenarios.
5 Conclusions
In this paper, we presented a person re-identification approach that exploits soft-biometric features extracted from range data, investigating collaborative and non-collaborative settings. Each feature has its own discriminative power, with height and the torso/legs ratio being the most informative cues. Re-identification by 3D soft-biometric information seems to be a very fruitful research direction. Beyond the main advantage of a soft-biometric policy, i.e., being to some extent invariant to clothing, there are several other reasons: on the one hand, the availability of precise yet affordable RGB-D sensors encourages the study of robust software solutions towards the creation of real surveillance systems; on the other hand, the classical appearance-based re-id literature is characterized by powerful learning approaches that can easily be embedded in the 3D setting. Our future research will focus on this last point, and on the creation of a larger 3D non-collaborative dataset.
References
1. D. Gray and H. Tao, "Viewpoint invariant pedestrian recognition with an ensemble of localized features," in ECCV, Marseille, France, 2008, pp. 262–275.
2. M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani, “Person re-
identification by symmetry-driven accumulation of local features,” in CVPR, 2010.
3. W. Zheng, S. Gong, and T. Xiang, “Person re-identification by probabilistic relative
distance comparison,” in Computer Vision and Pattern Recognition (CVPR), 2011
IEEE Conference on. IEEE, 2011, pp. 649–656.
4. C. Velardo and J.-L. Dugelay, “Improving identification by pruning: a case study on
face recognition and body soft biometric,” Eurecom, Tech. Rep. EURECOM+3593,
01 2012.
5. Y.-F. Wang, E. Y. Chang, and K. P. Cheng, “A video analysis framework for
soft biometry security surveillance,” in Proceedings of the third ACM international
workshop on Video surveillance & sensor networks, ser. VSSN ’05, 2005, pp. 71–78.
6. M. Demirkus and K. Garg, "Automated person categorization for video surveillance using soft biometrics," in Proc. of SPIE, Biometric Technology for Human Identification, 2010.
7. A. Dantcheva, J.-L. Dugelay, and P. Elia, “Person recognition using a bag of facial
soft biometrics (BoFSB),” in 2010 IEEE International Workshop on Multimedia
Signal Processing, vol. 85. IEEE, Oct. 2010, pp. 511–516.
8. B. Freedman, A. Shpunt, M. Machline, and Y. Ariel, "US Patent US 2010/0118123," 2010.
9. J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kip-
man, and A. Blake, “Real-time human pose recognition in parts from single depth
images,” in CVPR 2011. IEEE, Jun. 2011, pp. 1297–1304.
10. L. Bo, K. Lai, X. Ren, and D. Fox, “Object recognition with hierarchical kernel
descriptors,” in CVPR 2011, no. c. IEEE, Jun. 2011, pp. 1729–1736.
11. D. S. Cheng, M. Cristani, M. Stoppa, L. Bazzani, and V. Murino, “Custom pictorial
structures for re-identification,” in British Machine Vision Conference (BMVC),
2011.
12. O. Javed, K. Shafique, Z. Rasheed, and M. Shah, “Modeling inter-camera space-
time and appearance relationships for tracking across non-overlapping views,”
Comput. Vis. Image Underst., vol. 109, no. 2, pp. 146–162, 2008.
13. D. Baltieri, R. Vezzani, and R. Cucchiara, “Sarc3d: a new 3d body model for people
tracking and re-identification,” in Proceedings of the 16th international conference
on Image analysis and processing, ser. ICIAP’11, 2011, pp. 197–206.
14. OpenNI, "OpenNI framework," Feb. 2012. [Online]. Available: http://www.openni.org/
15. Z. C. Marton, R. B. Rusu, and M. Beetz, “On Fast Surface Reconstruction Methods
for Large and Noisy Datasets,” in Proceedings of the IEEE International Confer-
ence on Robotics and Automation (ICRA), Kobe, Japan, May 12-17 2009.
Article
Tracking across cameras with non-overlapping views is a challenging problem. Firstly, the observations of an object are often widely separated in time and space when viewed from non-overlapping cameras. Secondly, the appearance of an object in one camera view might be very different from its appearance in another camera view due to the differences in illumination, pose and camera properties. To deal with the first problem, we observe that people or vehicles tend to follow the same paths in most cases, i.e., roads, walkways, corridors etc. The proposed algorithm uses this conformity in the traversed paths to establish correspondence. The algorithm learns this conformity and hence the inter-camera relationships in the form of multivariate probability density of space–time variables (entry and exit locations, velocities, and transition times) using kernel density estimation. To handle the appearance change of an object as it moves from one camera to another, we show that all brightness transfer functions from a given camera to another camera lie in a low dimensional subspace. This subspace is learned by using probabilistic principal component analysis and used for appearance matching. The proposed approach does not require explicit inter-camera calibration, rather the system learns the camera topology and subspace of inter-camera brightness transfer functions during a training phase. Once the training is complete, correspondences are assigned using the maximum likelihood (ML) estimation framework using both location and appearance cues. Experiments with real world videos are reported which validate the proposed approach.