Re-identification with RGB-D sensors
Igor Barros Barbosa1,3, Marco Cristani1,2, Alessio Del Bue1,
Loris Bazzani1, and Vittorio Murino1
1Pattern Analysis and Computer Vision (PAVIS) - Istituto Italiano di Tecnologia
(IIT), Via Morego 30, 16163 Genova, Italy
2Dipartimento di Informatica, University of Verona,
Strada Le Grazie 15, 37134 Verona, Italy
3Université de Bourgogne, 720 Avenue de l'Europe, 71200 Le Creusot, France
Abstract. Person re-identification is a fundamental operation for any multi-camera surveillance scenario. Until now, it has been performed primarily by exploiting appearance cues, under the hypothesis that the individuals do not change their clothes. In this paper, we relax this constraint by presenting a set of 3D soft-biometric cues, insensitive to appearance variations, that are gathered using RGB-D technology. The joint use of these characteristics provides encouraging performance on a benchmark of 79 people captured on different days and in different clothing. This promotes a novel research direction for the re-identification community, supported also by the fact that a new generation of affordable RGB-D cameras has recently entered the worldwide market.
Keywords: Re-identification, RGB-D sensors, Kinect
1 Introduction
The task of person re-identification (re-id) consists in recognizing an individual
in different locations over a set of non-overlapping camera views. It represents
a fundamental task for heterogeneous video surveillance applications, especially
for modeling long-term activities inside large and structured environments, such
as airports, museums, shopping malls, etc. In most cases, re-id approaches rely only on appearance-based techniques, in which it is assumed that individuals do not change their clothing within the observation period [1–3]. This hypothesis is a very strong restriction, since it constrains re-id methods to a limited temporal range (reasonably, on the order of minutes).
In this paper we remove this restriction, presenting a new approach to person re-id that uses soft-biometric cues as features. In general, soft-biometric cues have been exploited in different contexts: to aid facial recognition [4], as features in security surveillance solutions [5, 6], or for person recognition under a bag-of-words policy [7]. In [4], the soft-biometric cues are limb sizes, which were measured manually. The approaches in [5–7] are based on data coming from 2D cameras and extract soft-biometric cues such as gender, ethnicity, clothing, etc.
To the best of our knowledge, 3D soft-biometric features for re-identification have been employed only in [4], but in that case the scenario is strongly supervised and requires the full cooperation of the user for manual measurements. In contrast, a viable soft-biometric system should mostly deal with subjects without requiring strong collaboration from them, in order to extend its applicability to more practical scenarios.
In our case, the cues are extracted from range data computed using RGB-D cameras. Recently, novel RGB-D camera sensors such as the Microsoft Kinect and the Asus Xtion PRO, both manufactured using techniques developed by PrimeSense [8], have provided the community with a fast and affordable way of acquiring depth information. This has driven researchers to use RGB-D cameras in different fields of application, such as pose estimation [9] and object recognition [10], to name a few. In our opinion, re-id can be extended to novel scenarios by exploiting this technology, overcoming the constraint that people must not change their clothes.
In particular, our aim is to extract a set of features computed directly on the range measurements given by the sensor. Such features are related to specific anthropometric measurements computed automatically from the person's body. In more detail, we introduce two distinct subsets of features. The first subset contains cues computed from the skeleton fitted to the depth data, i.e., Euclidean distances between selected body parts, such as the legs and arms, and the overall height. The second subset contains features computed on the surface given by the range data. They come in the form of geodesic distances computed between a predefined set of joints (e.g., from torso to right hip). The latter measure gives an indication of the curvature (and, by approximation, of the size) of specific regions of the body.
After analyzing the effectiveness of each feature separately and performing a pruning stage aimed at removing non-influential cues, we studied how the remaining features should be weighted in order to maximize re-identification performance. We obtained encouraging re-id results on a pool of 79 people, acquired at different times across several days. This supports our approach and, in general, the idea of performing re-id with 3D soft-biometric cues extracted from RGB-D cameras.
The remainder of the paper is organized as follows. Section 2 briefly presents the re-identification literature. Section 3 details our approach, followed by Section 4, which shows experimental results. Finally, Section 5 concludes the paper, envisaging some future perspectives.
2 State of the art
Most of the re-identification approaches build on appearance-based features [1, 11, 3], which prevents them from addressing re-id scenarios where the clothing may change. A few approaches constrain the re-id operating conditions by reducing the problem to temporal reasoning: they use the layout of the camera network and temporal information to prune away some candidates in the gallery set [12].
The adoption of 3D body information in the re-identification problem was first introduced in [13], where a coarse, rigid 3D body model was fitted to different pedestrians. Given this 3D localization, the person silhouettes can be related across the different orientations of the body as viewed from different cameras. The registered data are then used to perform appearance-based re-identification. In contrast, we employ genuine soft-biometric cues of a truly non-rigid body and disregard appearance altogether. This possibility is afforded by current technology, which allows reliable anatomic cues to be extracted from the depth information provided by a range sensor.
In general, the methodological approach to re-identification can be divided
into two groups: learning-based and direct strategies. Learning-based methods split a re-id dataset into two sets, training and test [1, 3]: the training set is used for learning features and strategies for combining them, while the test set is used for validation. Direct strategies [11] are simple feature extractors. Learning-based strategies are usually considerably more time-consuming (considering the training and testing steps), but more effective than direct ones. Under this taxonomy, our proposal can be defined as a learning-based strategy.
3 Our approach
Our re-identification approach has two distinct phases. First, a signature is computed from the range data of each subject. This signature is a composition of several soft-biometric cues extracted from the depth data acquired with an RGB-D sensor. In the second phase, these signatures are matched against the test subjects of the gallery set. A learning stage, performed beforehand, determines how each feature is weighted when combined with the others: a feature with a high weight is one that is useful for obtaining good re-identification performance.
3.1 First stage: signature extraction
The first step processes the data acquired by an RGB-D camera such as the Kinect. This sensor projects a structured infrared light pattern [8] onto the scene; the system then obtains a depth map of the scene by measuring the distortion of the pattern induced by the 3D relief of the objects. When RGB-D cameras are used with the OpenNI framework [14], the acquired depth map can be used to segment and track human bodies, estimate the human pose, and perform metric 3D scene reconstruction. In our case, the information used is the segmented point cloud of a person, the positions of the fifteen body joints, and the estimate of the floor plane. Although the person's depth map and pose are given by the OpenNI software libraries, the segmentation of the floor required an initial pre-processing step using RANSAC to fit a plane to the ground. Additionally, a mesh was generated from the person's point cloud using the "Greedy Projection" method [15].
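To make this pre-processing concrete, the following is a minimal sketch of a RANSAC plane fit of the kind used for the floor, written in plain NumPy. The function name, parameters and thresholds are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ransac_floor_plane(points, n_iters=500, inlier_thresh=0.02, seed=None):
    """Fit a plane n.x + d = 0 to an (N, 3) point cloud with RANSAC.

    Returns the (unit normal, d) pair supported by the most inliers, where
    points within `inlier_thresh` meters of the plane count as inliers.
    """
    rng = np.random.default_rng(seed)
    best_inliers, best_plane = 0, None
    for _ in range(n_iters):
        # 1. Sample a minimal set of three points and build a candidate plane.
        p0, p1, p2 = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                 # degenerate (collinear) sample
            continue
        normal /= norm
        d = -normal.dot(p0)
        # 2. Count the points lying close enough to the candidate plane.
        inliers = int((np.abs(points @ normal + d) < inlier_thresh).sum())
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane
```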
Before focusing on the signature extraction, a preliminary study was performed by examining a set of 121 features on a dataset of 79 individuals, each captured on 4 different days (see more information on the dataset in Sec. 4). These features can be partitioned into two groups. The first contains the skeleton-based features, i.e., cues based on the exhaustive combination of distances among joints and distances between the floor plane and all the joints. The second group contains the surface-based features, i.e., geodesic distances on the mesh surface computed between different joint pairs. In order to determine the most relevant features, a feature selection stage evaluates the re-identification performance of each cue independently, one at a time. In particular, as a measure of re-id accuracy, we evaluated the normalized area under the curve (nAUC) of the cumulative matching curve (CMC), discarding those features which performed equivalently to a random choice of the correct match (see more information on these classification measures in Sec. 4).
The result of this pruning stage was the following set of 10 features:
Skeleton-based features
d1: Euclidean distance between floor and head
d2: Ratio between torso and legs
d3: Height estimate
d4: Euclidean distance between floor and neck
d5: Euclidean distance between neck and left shoulder
d6: Euclidean distance between neck and right shoulder
d7: Euclidean distance between torso center and right shoulder
Surface-based features
d8: Geodesic distance between torso center and left shoulder
d9: Geodesic distance between torso center and left hip
d10: Geodesic distance between torso center and right hip
Some of the features based on the distance from the floor are illustrated in Fig. 1, together with the localization of the joints on the body. In particular, the second feature (ratio between torso and legs) is computed according to the following equation:
d2 = [ mean(d5 + d6) / mean(d_floor-Lhip + d_floor-Rhip) ] · (d1)^(-1)    (1)
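As an illustration, the sketch below computes the skeleton-based cues from a dictionary of OpenNI-style 3D joint positions and the floor plane estimated above. The joint names and the helper are hypothetical, d3 is omitted because its estimation procedure is not detailed here, and d2 follows Eq. (1) as printed (the mean over the two paired distances cancels in the ratio).

```python
import numpy as np

def point_to_floor(p, normal, d):
    # Unsigned distance from a 3D point to the floor plane n.x + d = 0
    # (unit normal n, as returned by the RANSAC fit).
    return abs(np.dot(normal, p) + d)

def skeleton_features(joints, normal, d):
    d1 = point_to_floor(joints["head"], normal, d)               # floor-head
    d4 = point_to_floor(joints["neck"], normal, d)               # floor-neck
    d5 = np.linalg.norm(joints["neck"] - joints["l_shoulder"])   # neck-L shoulder
    d6 = np.linalg.norm(joints["neck"] - joints["r_shoulder"])   # neck-R shoulder
    d7 = np.linalg.norm(joints["torso"] - joints["r_shoulder"])  # torso-R shoulder
    floor_lhip = point_to_floor(joints["l_hip"], normal, d)
    floor_rhip = point_to_floor(joints["r_hip"], normal, d)
    d2 = (d5 + d6) / (floor_lhip + floor_rhip) / d1              # Eq. (1)
    return {"d1": d1, "d2": d2, "d4": d4, "d5": d5, "d6": d6, "d7": d7}
```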
The computation of the (approximated) geodesic distances, i.e., torso to left shoulder, torso to left hip, and torso to right hip, is given by the following steps. First, the selected joint pairs, which normally do not lie on the point cloud, are projected onto their respective closest points on the surface. This yields a start and an end point on the surface, from which an A* algorithm computes the minimum path over the point cloud (Fig. 2). Since the torso is usually recovered by the RGB-D sensor with higher precision, the computed geodesic features should also be reliable.
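The sketch below approximates this procedure with SciPy: both joints are projected onto their nearest surface points, and a shortest-path search is run over a k-nearest-neighbor graph of the point cloud. The paper runs A* on the mesh; Dijkstra (A* with a zero heuristic) is used here for brevity, and all names are ours.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def geodesic_distance(cloud, joint_a, joint_b, k=8):
    """Approximate geodesic distance between two joints over an (N, 3) cloud."""
    tree = cKDTree(cloud)
    # 1. Project both joints onto their closest points on the surface.
    src = tree.query(joint_a)[1]
    dst = tree.query(joint_b)[1]
    # 2. Build a k-NN graph weighted by Euclidean edge length
    #    (column 0 of the query result is the point itself, so skip it).
    dists, nbrs = tree.query(cloud, k=k + 1)
    rows = np.repeat(np.arange(len(cloud)), k)
    graph = csr_matrix((dists[:, 1:].ravel(), (rows, nbrs[:, 1:].ravel())),
                       shape=(len(cloud), len(cloud)))
    # 3. The shortest-path length approximates the geodesic distance.
    return dijkstra(graph, directed=False, indices=src)[dst]
```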
As a further check on the 10 selected features, we verified their accuracy by measuring them manually on a restricted set of subjects. We found that the features related to height (d1, ..., d4) were captured with the highest precision, while the other features were slightly noisier. In general, all these features are well suited to indoor usage, in which people do not wear heavy clothes that might hide the shape of the body.
(Figure 1 annotations: d1: floor-head, d3: height, d4: floor-neck, d5, d6, d7, floor-left hip, floor-right hip, floor-torso.)
Fig. 1. Distances employed for building the soft-biometric features (in black), and some of the soft-biometric features (in green). It is important to notice that the joints are not localized on the outskirts of the point cloud but, in most cases, in the proximity of the real articulations of the human body.
Fig. 2. Geodesic features: the red line represents the path found by A* between torso and left shoulder, torso and left hip, and torso and right hip.
3.2 Second stage: signature matching
This section illustrates how the selected features can be jointly employed in the re-id problem. In the literature, a re-id technique is usually evaluated considering two sets of personal ID signatures: a gallery set A and a probe set B. The evaluation consists in associating each ID signature of the probe set B with a corresponding ID signature in the gallery set A. For the sake of clarity, let us suppose we have N different ID signatures in the probe set (each one representing a different individual, so N different individuals) and the same in the gallery set. All N subjects in the probe set are present in the gallery. For evaluating the performance of a re-id technique, the most used measure is the Cumulative Matching Curve (CMC) [1], which models the mean probability that any probe signature is correctly matched within the first T ranked gallery individuals, where the ranking is obtained by sorting the distances between ID signatures in ascending order.
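For reference, the following is a compact sketch of how the CMC and its nAUC can be computed from an N × N probe-gallery distance matrix, under the simplifying assumption that the true match of probe i is gallery entry i.

```python
import numpy as np

def cmc_curve(dist):
    """dist[i, j]: distance of probe i to gallery j; true match on the diagonal."""
    N = dist.shape[0]
    # Rank of the correct gallery entry for each probe (0 = closest).
    ranks = (dist < np.diag(dist)[:, None]).sum(axis=1)
    # CMC(T): fraction of probes whose true match falls in the top T.
    return np.cumsum(np.bincount(ranks, minlength=N)) / N

def nauc(cmc):
    # Normalized area under the CMC curve (1.0 = perfect re-identification).
    return cmc.mean()
```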
In our case, each ID signature is composed of F features (F = 10), and each feature has a numerical value. Let us define the distance between corresponding features as the squared difference between them. For each feature, we obtain an N × N distance matrix. However, such matrices are biased towards features with larger measured values, leading to a problem of heterogeneity of the measures. If a feature such as height is used, it counts more than features whose range of values is more compact (e.g., the distance between neck and left shoulder). To avoid this problem, we normalize all features to zero mean and unit variance, using the data from the gallery set to compute the mean and variance of each feature.
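A minimal sketch of this normalization and of the per-feature distance matrices, assuming the gallery and probe signatures are stored as (N, F) arrays:

```python
import numpy as np

def feature_distance_matrices(gallery, probe):
    mu, sigma = gallery.mean(axis=0), gallery.std(axis=0)
    g = (gallery - mu) / sigma        # z-scoring with gallery statistics only
    p = (probe - mu) / sigma
    # diff[i, j, f]: squared difference of feature f between probe i and
    # gallery j, i.e., one N x N distance matrix per feature.
    return (p[:, None, :] - g[None, :, :]) ** 2   # shape (N, N, F)
```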
Given the normalized N × N distance matrices, we now have to combine them into a single distance matrix, from which the final CMC curve is obtained. The naive way to integrate them would be to simply average the matrices. Instead, we propose a weighted sum of the distance matrices. Let us define a set of weights w_i, for i = 1, ..., F, that represent the importance of the i-th feature: the higher the weight, the more important the feature. Since tuning these weights by hand is usually difficult, we adopt a quasi-exhaustive learning strategy, i.e., we explore the weight space (from 0 to 1 with step 0.01) in order to select the weights that maximize the nAUC score. In the experiments, we report the values of these weights and compare this strategy with the averaging baseline.
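The sketch below illustrates the weighted fusion together with a stand-in for the weight learning. A truly exhaustive scan of a 0.01-step grid over ten weights is infeasible (101^10 combinations), so this version samples the grid at random and keeps the weights with the best nAUC, in the spirit of the quasi-exhaustive search; `feature_distance_matrices`, `cmc_curve` and `nauc` refer to the sketches above.

```python
import numpy as np

def fuse(dist, w):
    # dist: (N, N, F) per-feature distance matrices; w: (F,) weights.
    return dist @ w

def learn_weights(dist_train, n_samples=50_000, seed=0):
    rng = np.random.default_rng(seed)
    F = dist_train.shape[2]
    best_w, best_score = None, -np.inf
    for _ in range(n_samples):
        w = rng.integers(0, 101, size=F) / 100.0   # grid values 0, 0.01, ..., 1
        score = nauc(cmc_curve(fuse(dist_train, w)))
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```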
4 Experiments
In this section, we first describe how we built the experimental dataset and how we formalised the re-id protocol. Then, an extensive validation is carried out on the test dataset under different conditions.
4.1 Database creation
Our dataset is composed of four different groups of data. The first, "Collaborative", group was obtained by recording 79 people with a frontal view, walking slowly, avoiding occlusions, and with stretched arms. This happened in an indoor scenario, where the people were at least 2 meters away from the camera. This scenario represents a collaborative setting, the only one that we considered in these experiments. The second ("Walking") and third ("Walking 2") groups of data are composed of frontal recordings of the same 79 people walking normally while entering the lab where they usually work. The fourth group ("Backwards") is a back-view recording of the people walking away from the lab. Since the acquisitions were performed on different days, there is no guarantee that visual aspects like clothing or accessories are kept constant. Figure 3 shows the meshes computed for different people during the recording of the four sessions, together with some statistics about the collected features.
Fig. 3. Illustration of the different groups in the recorded data, rows from top to bottom: "Walking", "Walking 2", "Backwards" and "Collaborative". Note that people changed their clothing across the acquisitions on different days. On the right, statistics of the "Walking" dataset: for each feature, the histogram is shown; in parentheses, its mean value (in cm, except d2) and standard deviation.
From each acquisition, a single frame was automatically selected for the computation of the biometric features. The selection takes the frame with the highest confidence of the tracked skeleton joints¹ that is closest to the camera and not cropped by the sensor's field of view; in most cases, this frame was approximately 2.5 meters away from the camera. After that, the mesh of each subject was computed, and the 10 soft-biometric cues were extracted using both skeleton and geodesic information.
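A minimal sketch of this selection rule follows; the per-frame fields are assumptions about how the recordings could be stored, not the authors' data structures.

```python
def select_frame(frames):
    # Keep frames where the subject is fully inside the field of view, then
    # pick the highest joint-tracking confidence, breaking ties in favor of
    # frames closer to the camera.
    candidates = [f for f in frames if not f["cropped"]]
    return max(candidates,
               key=lambda f: (f["joint_confidence"], -f["distance_to_camera"]))
```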
4.2 Semi-Cooperative re-id
Given the four datasets, we built a semi-collaborative scenario, in which the gallery set was composed of the ID signatures of the "Collaborative" setting and the test data were the "Walking 2" set. The CMCs related to each feature are portrayed in Fig. 4: they show how well each feature captures discriminative information about the analyzed subjects. Fig. 5 shows the normalized AUC of each feature. Notice that the features associated with the height of the person are very meaningful, as is the ratio between torso and legs.
¹ Such a confidence score is a byproduct of the skeleton-fitting algorithm.
(Figure 4 plots one CMC curve per feature, d1-d10; x-axis: Rank [k], y-axis: Recognition Rate [%].)
Fig. 4. Single-feature CMCs — "Collaborative" vs. "Walking 2" (best viewed in color)
Feature    d1    d2    d3    d4    d5    d6    d7    d8    d9    d10
nAUC (%)   88.1  76.2  80.3  63.8  52.8  52.8  54.1  58.2  62.0  69.7
Fig. 5. Normalized area under the curve (nAUC) for each feature (the numbering follows the feature enumeration presented in Sec. 3) — "Collaborative" vs. "Walking 2". The values are reported in the table above.
The results in Fig. 5 highlight that the nAUC of the different features spans from 52.8% to 88.1%; thus, all of them contribute to better re-identification results. To investigate how their combination helps re-id, we exploit the learning strategy proposed in Sec. 3.2. The weights w_i are learned once, using a different dataset than the one used for testing. The obtained weights are: w1 = 0.24, w2 = 0.17, w3 = 0.18, w4 = 0.09, w5 = 0.02, w6 = 0.02, w7 = 0.03, w8 = 0.05, w9 = 0.08, w10 = 0.12. The weights mirror the nAUC obtained for each feature independently (Fig. 5): the most relevant ones are d1 (Euclidean distance between floor and head), d2 (ratio between torso and legs), d3 (height estimate), and d10 (geodesic distance between torso center and right hip). In Fig. 6, we compare this strategy with a baseline, the averaging case where w_i = 1/F for each i. The learning strategy clearly gives better results (nAUC = 88.88%) than both the baseline (nAUC = 76.19%) and the best single feature (nAUC = 88.10%, corresponding to d1 in Fig. 5). For the rest of the experiments, the learning strategy is adopted.
4.3 Non-Cooperative re-id
Non-cooperative scenarios consist of the "Walking", "Walking 2" and "Backwards" datasets. We generate different experiments by combining cooperative and non-cooperative scenarios as gallery and probe sets. Table 1 reports the nAUC scores of the trials we carried out. The non-cooperative scenarios gave rise to higher performances than the cooperative ones. The reason is that, in the collaborative acquisition, people tended to move in a very unnatural and constrained way, thus biasing the measurements towards a specific posture. In the non-cooperative settings this clearly did not happen.
Fig. 6. Compilation of final CMC curves — "Collaborative" vs. "Walking 2"
Gallery     Probe      nAUC
Collab.     Walking    90.11 %
Collab.     Walking 2  88.88 %
Collab.     Backwards  85.64 %
Walking     Walking 2  91.76 %
Walking     Backwards  88.72 %
Walking 2   Backwards  87.73 %
Table 1. nAUC scores for the different re-id scenarios.
5 Conclusions
In this paper, we presented a person re-identification approach that exploits soft-biometric features extracted from range data, investigating collaborative and non-collaborative settings. Each feature has its own discriminative power, with height and the torso/legs ratio being the most informative cues. Re-identification by 3D soft-biometric information seems to be a very fruitful research direction. Beyond the main advantage of a soft-biometric policy, i.e., being to some extent invariant to clothing, there are several other reasons: on the one hand, the availability of precise yet affordable RGB-D sensors encourages the study of robust software solutions towards the creation of real surveillance systems; on the other hand, the classical appearance-based re-id literature is characterized by powerful learning approaches that can easily be embedded in the 3D setting. Our future research will focus on this last point, and on the creation of a larger 3D non-collaborative dataset.
References
1. D. Gray and H. Tao, "Viewpoint invariant pedestrian recognition with an ensemble of localized features," in ECCV, Marseille, France, 2008, pp. 262–275.
2. M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani, “Person re-
identification by symmetry-driven accumulation of local features,” in CVPR, 2010.
3. W. Zheng, S. Gong, and T. Xiang, “Person re-identification by probabilistic relative
distance comparison,” in Computer Vision and Pattern Recognition (CVPR), 2011
IEEE Conference on. IEEE, 2011, pp. 649–656.
4. C. Velardo and J.-L. Dugelay, “Improving identification by pruning: a case study on
face recognition and body soft biometric,” Eurecom, Tech. Rep. EURECOM+3593,
01 2012.
5. Y.-F. Wang, E. Y. Chang, and K. P. Cheng, “A video analysis framework for
soft biometry security surveillance,” in Proceedings of the third ACM international
workshop on Video surveillance & sensor networks, ser. VSSN ’05, 2005, pp. 71–78.
6. M. Demirkus and K. Garg, "Automated person categorization for video surveillance using soft biometrics," in Proc. of SPIE, Biometric Technology for Human Identification, 2010.
7. A. Dantcheva, J.-L. Dugelay, and P. Elia, “Person recognition using a bag of facial
soft biometrics (BoFSB),” in 2010 IEEE International Workshop on Multimedia
Signal Processing, vol. 85. IEEE, Oct. 2010, pp. 511–516.
8. B. Freedman, A. Shpunt, M. Machline, and Y. Ariel, "US Patent US 2010/0118123," 2010.
9. J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kip-
man, and A. Blake, “Real-time human pose recognition in parts from single depth
images,” in CVPR 2011. IEEE, Jun. 2011, pp. 1297–1304.
10. L. Bo, K. Lai, X. Ren, and D. Fox, “Object recognition with hierarchical kernel
descriptors,” in CVPR 2011, no. c. IEEE, Jun. 2011, pp. 1729–1736.
11. D. S. Cheng, M. Cristani, M. Stoppa, L. Bazzani, and V. Murino, “Custom pictorial
structures for re-identification,” in British Machine Vision Conference (BMVC),
2011.
12. O. Javed, K. Shafique, Z. Rasheed, and M. Shah, “Modeling inter-camera space-
time and appearance relationships for tracking across non-overlapping views,”
Comput. Vis. Image Underst., vol. 109, no. 2, pp. 146–162, 2008.
13. D. Baltieri, R. Vezzani, and R. Cucchiara, “Sarc3d: a new 3d body model for people
tracking and re-identification,” in Proceedings of the 16th international conference
on Image analysis and processing, ser. ICIAP’11, 2011, pp. 197–206.
14. OpenNI, "OpenNI framework," Feb. 2012. [Online]. Available: http://www.openni.org/
15. Z. C. Marton, R. B. Rusu, and M. Beetz, “On Fast Surface Reconstruction Methods
for Large and Noisy Datasets,” in Proceedings of the IEEE International Confer-
ence on Robotics and Automation (ICRA), Kobe, Japan, May 12-17 2009.
Article
Tracking across cameras with non-overlapping views is a challenging problem. Firstly, the observations of an object are often widely separated in time and space when viewed from non-overlapping cameras. Secondly, the appearance of an object in one camera view might be very different from its appearance in another camera view due to the differences in illumination, pose and camera properties. To deal with the first problem, we observe that people or vehicles tend to follow the same paths in most cases, i.e., roads, walkways, corridors etc. The proposed algorithm uses this conformity in the traversed paths to establish correspondence. The algorithm learns this conformity and hence the inter-camera relationships in the form of multivariate probability density of space–time variables (entry and exit locations, velocities, and transition times) using kernel density estimation. To handle the appearance change of an object as it moves from one camera to another, we show that all brightness transfer functions from a given camera to another camera lie in a low dimensional subspace. This subspace is learned by using probabilistic principal component analysis and used for appearance matching. The proposed approach does not require explicit inter-camera calibration, rather the system learns the camera topology and subspace of inter-camera brightness transfer functions during a training phase. Once the training is complete, correspondences are assigned using the maximum likelihood (ML) estimation framework using both location and appearance cues. Experiments with real world videos are reported which validate the proposed approach.