3D Face Recognition: Two Decades of Progress and Prospects
YULAN GUO, Sun Yat-sen University and National University of Defense Technology, China
HANYUN WANG, Information Engineering University, China
LONGGUANG WANG, Aviation University of Air Force, China
YINJIE LEI, Sichuan University, China
LI LIU, National University of Defense Technology, China
MOHAMMED BENNAMOUN, University of Western Australia, Australia
3D face recognition has been extensively investigated in the last two decades due to its wide range of applications in many
areas such as security and forensics. Numerous methods have been proposed to deal with the challenges faced by 3D face
recognition such as facial expressions, pose variations and occlusions. These methods have achieved superior performances
on several small-scale datasets including FRGC v2.0, Bosphorus, BU-3DFE, and Gavab. However, deep learning based 3D face
recognition methods are still in their infancy due to the lack of large-scale 3D face datasets. To stimulate future research
in this area, we present a comprehensive review of the progress achieved by both traditional and deep learning based 3D
face recognition methods in the last two decades. Moreover, comparative results on several publicly available datasets under
dierent challenges of facial expressions, pose variations and occlusions are also presented.
CCS Concepts: • Computing methodologies → Computer vision; Artificial intelligence.
Additional Key Words and Phrases: 3D face recognition, local feature, deep learning, facial expression, pose variation, facial
occlusion
1 INTRODUCTION
The task of biometrics is to recognize a person based on their physiological (e.g., fingerprint, palmprint, iris, retina, and face) or behavioral characteristics (e.g., gait, handwriting, and voice) [107, 110]. Although different biometrics approaches have been intensively investigated for automatic human identification, face recognition is commonly considered as a major biometrics technique due to its universal availability, distinctiveness, permanence, non-contact collectability, and especially its non-invasiveness [28, 111]. Face recognition can be used in many areas including security, forensic, commercial, medical, education, and robotic applications [121, 198, 246].
Existing face recognition techniques can be broadly divided into 2D and 3D face recognition techniques according to the data modality. Most research efforts and commercial developments have focused on 2D face recognition due to its low cost and the wide availability of digital cameras [88, 121, 225]. As an alternative, 3D face recognition has a number of advantages compared to its 2D counterpart [84, 86, 198]. For instance, (i) 3D
data contain sucient geometrical information of a face without any projection from the 3D physical space to
the 2D imaging plane. (ii) 3D data is more invariant to illumination variations, the use of cosmetics and other
decorations. (iii) Facial pose can be more accurately estimated from 3D data compared to 2D images. Therefore,
3D face recognition has the potential to overcome several of the inherent challenges faced by 2D face recognition
algorithms and provides a solid alternative to the face recognition community [
241
]. With the advancement of
3D sensing devices (e.g., Microsoft Kinect, Iphone X depth camera) and computing devices (e.g., GPU), 3D face
recognition has become an emerging topic in last two decades [89].
Although 3D face recognition has several advantages compared to its 2D counterpart, it also faces several
challenges. First, the shape of a 3D face varies significantly under different expressions as the face is a non-rigid surface. Second, occlusion and clutter introduced by obscuring factors such as glasses, scarves, and hats increase the difficulty of face recognition. Third, the face can change gradually over time due to aging or change in
body health. Furthermore, the low quality (e.g., noise, holes) of 3D data acquired by low-cost 3D sensors poses
further challenges to 3D face recognition. Although a large number of algorithms have been proposed during the
last two decades, 3D face recognition is still far from real-world applications. It is therefore highly necessary to
comprehensively review the existing work and point out future research directions.
Several early survey papers on 3D face recognition appeared about ten years ago [1, 27, 28, 89, 121, 192].
Later, Smeets et al. [198] presented a concise review on 3D face recognition with a particular focus on facial expression issues. Islam et al. [107] presented a review on 3D ear and expression invariant face biometrics. Smeets et al. [197] introduced a comparative study of 3D face recognition under expression variations. Zhou et al. [249] reviewed several algorithms for single-modal and multi-modal face recognition. Three reviews [54, 72, 190] on 3D facial expression recognition rather than face recognition are also worth mentioning. Soltanpour et al. [202] summarized the state-of-the-art local feature based 3D face recognition methods published before 2017, and classified existing methods into three categories: keypoints-based, curve-based, and local surface-based methods. Zhou and Xiao [250] summarized the recent progress of 3D face recognition from three different aspects including pose-invariant recognition, expression-invariant recognition, and occlusion-invariant recognition. Dagnes et al. [56] focused on dealing with 3D face recognition with facial occlusions under non-cooperative and uncontrolled scenarios. Pini et al. [183] evaluated the effect of different 3D data representations (i.e., depth and normal images, voxels, point clouds), different deep learning-based models, and different pre-processing techniques for face recognition. They concluded that the methods based on normal images and point clouds perform and generalize better than other 2D and 3D alternatives. Li et al. [129] and Jing et al. [114] also reviewed current 3D face recognition methods from the aspects of traditional methods and deep learning-based methods. Although these papers provide good reviews on the progress of 3D face recognition, some advanced algorithms such as [61, 115, 238] proposed in recent years are not covered, especially those deep learning-based methods [118, 195].
The major contributions of this paper can be summarized as follows:
(i) This paper reviews the major 3D face recognition methods which have been proposed in the last two
decades. It can be used to help a reader understand the history, status, and future trend of 3D face recognition.
(ii) This paper provides a comprehensive review on both the traditional and the emerging deep learning-based
algorithms and adequately covers a large number of up-to-date 3D face recognition algorithms.
(iii) This paper specically discusses the approaches designed to deal with dierent nuisances that are faced by
a 3D face recognition system including facial expressions, pose variations, and occlusions.
(iv) A comprehensive comparison of existing algorithms on several publicly available datasets is also presented
in tabular forms under facial expression variations (Tables 2 and 3), pose variations (Table 4) and occlusions
(Table 5).
The rest of this paper is organized as follows. Section 2 describes the background concepts and terminology of
3D face recognition. Section 3 reviews several pre-processing approaches. Section 4 provides a comprehensive
survey of existing 3D face recognition methods. Section 5 introduces the recent trends in deep learning methods
for 3D face recognition. Section 6 presents a comprehensive comparison of existing algorithms under different
variations including facial expressions, pose variations, and occlusions. Finally, Section 7 concludes this paper.
2 TERMINOLOGY AND DATASETS
2.1 Terminology
3D face recognition usually includes two different tasks: face identification and face verification [1, 28, 89, 198].
The task of face identication is to compare a probe face against all gallery faces to obtain the identity of the
probe face. The performance of face identication is commonly measured by the Cumulative Match Curve (CMC),
where CMC plots the recognition rate with respect to dierent rank numbers. The Rank-1 Recognition Rate
(R1RR) is an important scalar metric for the evaluation of face identication.
The task of face verication (also called face authentication) is to compare a probe face against the gallery face
with a claimed identity. The performance of face verication is usually measured by the Receiving Operating
Characteristic (ROC) curve which plots the False Rejection Rate (FRR) versus the False Acceptance Rate (FAR)
[
63
]. FRR is the percentage of probes that have incorrectly been determined as non-match against the gallery
face with the same identity, FAR is the percentage of probes that have incorrectly been determined as a match
against the gallery face with a dierent identity. The Equal Error Rate (EER) and Verication Rate (VR) at 0.1%
FAR (VR@0.1%FAR) are two important scalar metrics for the evaluation of face verication. ERR is extracted
from the ROC curve where FAR is equal to FRR.
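To make these evaluation metrics concrete, the following is a minimal sketch (not part of any evaluation protocol cited above) that computes the EER and the VR at a target FAR directly from sets of genuine (same-identity) and impostor (different-identity) match scores, assuming that higher scores indicate better matches; all function and variable names are illustrative.

```python
import numpy as np

def verification_metrics(genuine, impostor, target_far=1e-3):
    """Compute the EER and the VR at a target FAR from raw match scores.

    genuine  : scores of same-identity comparisons (higher = more similar)
    impostor : scores of different-identity comparisons
    """
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    thresholds = np.unique(np.concatenate([genuine, impostor]))   # sweep all observed scores
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false acceptance rate
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejection rate
    # EER: the operating point where FAR and FRR are (approximately) equal.
    i = np.argmin(np.abs(far - frr))
    eer = (far[i] + frr[i]) / 2
    # VR at the target FAR, e.g. VR@0.1%FAR = 1 - FRR at the loosest threshold meeting the FAR.
    valid = np.where(far <= target_far)[0]
    vr = 1.0 - frr[valid[0]] if len(valid) else 0.0
    return eer, vr

# Toy usage with synthetic scores:
rng = np.random.default_rng(0)
eer, vr = verification_metrics(rng.normal(0.8, 0.1, 1000), rng.normal(0.3, 0.1, 10000))
```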
2.2 Datasets
A large number of datasets have been collected to test the performance of 3D face recognition algorithms since
the 1990s. Although several early datasets are available, such as the MaxPlank [213], USF HumanID 3D Face Database [20], XM2VTS [149], 3DRMA [16], FSU [100], Biometrics [39, 75], Gavab [157] and CASIA datasets,
we mainly list the datasets which have been introduced in the last 15 years (as shown in Table 1). The variations
contained in each dataset are also reported in Table 1, including pose variation (p), facial expression (e), occlusion
(o), time elapse (t), and illumination variation (i). The symbol ‘-’ is used where the corresponding information is
not provided. It is obvious that most datasets introduced before 2012 were collected by expensive but accurate
3D acquisition systems including Minolta Vivid, 3dMD and Di3D scanners. With the release of the low-cost
Microsoft Kinect sensor in 2011, the majority of datasets in recent years were collected by Kinect sensors, introducing new challenges to the 3D face recognition community, such as low resolution, high noise, and missing data. In the following subsections, we will briefly describe the FRGC dataset, the BU-3DFE dataset, the
Bosphorus dataset, the Gavab dataset, and the 4DFAB dataset.
2.2.1 FRGC dataset. The FRGC dataset contains 4950 3D facial scans of 466 subjects. All of these scans are
captured frontally with a Minolta Vivid 900/910 scanner at a resolution of 0.6mm in the x and y directions. This dataset is further divided into a training dataset (FRGC v1) and a validation dataset (FRGC v2.0). The training dataset contains 943 scans of 200 different individuals collected in the 2002-2003 academic year, and the validation
dataset contains 4007 scans of 466 individuals collected during the 2003-2004 academic year. The number of scans
per subject varies from 1 to 22. In addition, the validation dataset contains 2410 scans with neutral expression,
and 1597 facial scans with various facial expressions such as disgust, happiness, sadness, surprise, and anger.
2.2.2 BU-3DFE dataset. The BU-3DFE dataset contains 2500 3D facial scans of 100 subjects (44 males and 56 females) with different ages and ethnic/racial ancestries. For each subject, there is one scan with a neutral expression, and six basic non-neutral facial expressions (anger, disgust, fear, happiness, sadness, and surprise), each captured at four levels of intensity.
Table 1. Major 3D Face Datasets. The variations in each dataset are listed, including pose variation (p), facial expression
(e), occlusion (o), time elapse (t), and illumination variation (i). The availability of 2D texture images in each dataset is also
provided, where ‘Y’ and ‘N’ denote Yes and No, respectively.
No. Name Year # Subjects # Images Acquisition Variations Texture Res.
1 Gavab [157] 2004 61 549 Minolta Vivid sensor p, e, o N -
2 BJUT-3D [234] 2005 500 - Cyberware 3030 e Y High
3 FRGC v1 [182] 2005 200 943 Minolta Vivid 900/910 e Y High
4 FRGC v2.0 [182] 2005 466 4007 Minolta Vivid 900/910 e, t Y High
5 BU-3DFE [236] 2006 100 2500 3dMD e Y High
6 ND-2006 [66] 2006 888 13450 Minolta Vivid 910 e Y High
7 CASIA [228] 2006 123 4059 Minolta Vivid 910 p, e, i Y High
8 FRAV 3D [50] 2006 105 1696 Minolta Vivid 700 p, e, i, t Y High
9 ZJU-3DFED [223] 2006 40 360 InSpeck 3D MEGA Capturor DF e Y High
10 Bechman [102] 2007 475 - Cyberware 3030 e Y High
11 Bosphorus [191] 2008 105 4666 Inspeck Mega Capturor II 3D p, e, o N High
12 IV2 [181] 2008 300 2400 Minolta Vivid 7000 p, e, i Y High
13 BU-4DFE [235] 2008 101 606 videos Di3D System e Y High
14 York [99] 2008 350 5250 In-house 3D camera p, e Y Middle
15 Texas [90] 2010 118 1149 MU-2 stereo system e Y High
16 PhotoFace [239] 2011 261 7356 Photometric stereo e, t Y High
17 Houston [177] 2011 281 2150 3dMD p, e N High
18 UHDB11[211] 2014 23 1625 3dMD p, i Y High
19 SHREC 2011 [218] 2011 130 780 Roland and Escan scanners p N High
20 3D-TEC [219] 2011 107 214 Minolta Vivid 910 e Y High
21 UMB-DB [49] 2011 143 1473 Minolta Vivid 900 e, o Y High
22 Florence Superface [12, 13] 2012 50 50 videos 3dMD, Kinect p Y Low/high
23 Aalborg RGB-D Face [101] 2012 31 1581 Kinect p, e Y Low
24 3DMAD [64] 2013 17 255 videos Kinect t Y Low
25 Biwi Kinect Head Pose [71] 2013 20 over 15000 Kinect p Y Low
26 CurtinFaces [125] 2013 52 4784 Kinect p, e, o, i Y Low
27 EURECOM KinectFaceDB [155] 2014 52 936 Kinect p, e, o, t, i Y Low
28 BP4D-Spontanous Database [244] 2014 41 328 videos Di3D e Y High
29 FaceWarehouse [37] 2014 150 3000 Kinect e Y Low
30 HRRFaceD [146] 2014 18 20000 Kinect v2 p, o N Low
31 VT-KFER [5] 2015 32 1956 videos Kinect e Y Low
32 Lock3DFace [241] 2016 509 5711 videos Kinect v2 p, e, o, t, i Y Low
33 Pandora [24] 2017 22 110 sequences Kinect v2 p, o Y Low
34 COMA [186] 2018 12 20466 3dMD LLC, Atlanta e N High
35 4DFAB [44] 2018 180 1,835,513 DI4D, Kinect and grayscale camera e Y High
36 FaceScape [233] 2020 938 18760 Multi-view DSLR cameras e Y High
2.2.3 Bosphorus dataset. The Bosphorus dataset contains 4652 3D facial scans of 105 subjects (60 males and 45 females) with ages between 25 and 35. All of these scans are captured with an Inspeck Mega Capturor II 3D scanner at a resolution of 0.3mm in the x and y directions and a resolution of 0.4mm in the z direction. The number of scans per subject is between 31 and 54, and these scans are recorded under different expressions, poses, and occlusions. For facial expressions, the Bosphorus dataset contains six basic non-neutral facial expressions (anger, disgust, fear, happiness, sadness, and surprise) and 28 facial Action Units (20 Lower AUs, 5 Upper AUs, and 3 Combined AUs). For pose variations, the Bosphorus dataset contains seven yaw rotations (+10°, +20°, +30°, ±45°, and ±90°), four pitch rotations (strong and slight upwards/downwards), and two cross rotations incorporating both yaw and pitch rotations (+45° yaw and ±20° pitch). It should be emphasized
that all pose variation scans are captured with a neutral expression. For facial occlusions, there are four types
of occlusions: occlusion of the mouth with a hand, occlusion of the face with hair, occlusion of the left eye and forehead region with hands, and occlusion with glasses.
2.2.4 Gavab dataset. The Gavab dataset contains 549 3D facial scans of 61 adult Caucasian subjects (45 males
and 16 females). All of these scans are captured with a Minolta Vivid scanner at a resolution of 1.5mm in the image. These scans are recorded under different poses, expressions and occlusions. For each subject, there are two frontal facial scans with a neutral expression, four neutral-expression scans with a rotated posture of the face (looking-up (+35°), looking-down (-35°), left profile (-90°) and right profile (+90°)), and three frontal scans with non-neutral facial expressions (smile, laugh, and arbitrary expression).
2.2.5 4DFAB dataset. The 4DFAB dataset is a recently published large scale dynamic facial expression database,
and it contains 1,835,513 high-resolution 3D meshes of 180 subjects (120 males and 60 females) with ages between 5
and 75. The capturing system consists of a DI4D dynamic system for 4D face capturing and building, a microphone
for audio signal recording, a frontal grayscale camera for frontal face image recording, and a Kinect for RGB-D
data recording. To ensure that multi-modal facial data are captured simultaneously, all sensors are synchronized
with the DI4D system. The expressions of each subject include posed expressions, spontaneous expressions, and
other evident facial movements.
3 PREPROCESSING
Once a raw 3D face is obtained, preprocessing is required to make the 3D face suitable for face recognition.
Typical preprocessing operations include nose tip detection, data filtering, and pose normalization.
3.1 Nose Tip Detection
Nose tip detection can be used for several purposes in a 3D face recognition system. First, the nose tip can be used to accurately locate a 3D face in the raw 3D data [59, 139]. Second, the nose tip is a more distinctive landmark for detection than other facial parts (e.g., the eyes and cheeks) [59]. Besides, the nose tip can be used to guide the detection of other facial landmarks [139].
3.1.1 Curvature based Methods. These methods use different types of curvatures to locate potential landmarks, and then use heuristics to find the final landmarks (including the nose tip).
Colbry et al. [47] detected the nose tip as the point with the largest shape index that satisfies several heuristics (e.g., closest to the scanner). It is demonstrated that the median error of the detected landmarks is around 10mm. Chang et al. [43] detected the nose tip from a 3D face by checking the Gaussian and mean curvatures of the facial surface. Experimental results show that the nose tip landmark can successfully be detected in 99.4% of the 4485 facial images. Dibeklioğlu et al. [58] also used a Gaussian and mean curvatures based heuristic method to detect the nose tip. This method is not appropriate for 3D faces with a yaw larger than 45 degrees [177]. Colombo et al. [48] detected nose candidates by thresholding the mean curvature of a 3D face; these candidates were then filtered to obtain the nose tip using the triangle formed by the eyes and the nose. Lu et al. [142] used the shape index and heuristics to detect a set of candidate landmarks. Gupta et al. [90] first detected the nose tip by registering the query face to a 3D template face using the Iterative Closest Point (ICP) algorithm, and then used Gaussian curvature to refine the nose tip. This method is relatively computationally expensive.
These methods are intuitive, but they suffer from several limitations. First, pre-processing is required to perform accurate curvature estimation [179]. Second, these methods are sensitive to noise as the calculation of curvatures relies on the derivatives of a 3D surface [179]. Third, their applications are limited since a set of empirically designed heuristics is usually required.
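To make the surface quantities used by these methods concrete, the sketch below estimates the Gaussian curvature, the mean curvature, and the shape index of a facial range image from its first- and second-order derivatives under the Monge patch assumption z = f(x, y). It is a generic illustration rather than the procedure of any cited method, and in practice it would be preceded by smoothing because of the noise sensitivity noted above; the pixel spacing and the shape index convention (values in [0, 1]) are assumptions.

```python
import numpy as np

def curvature_maps(depth, spacing=1.0):
    """Mean/Gaussian curvature and shape index for a range image z = f(x, y)."""
    fy, fx = np.gradient(depth, spacing)          # first derivatives (rows = y, cols = x)
    fxy, fxx = np.gradient(fx, spacing)           # second derivatives
    fyy, _ = np.gradient(fy, spacing)
    denom = 1.0 + fx**2 + fy**2
    # Monge patch formulas for the Gaussian (K) and mean (H) curvature.
    K = (fxx * fyy - fxy**2) / denom**2
    H = ((1 + fx**2) * fyy - 2 * fx * fy * fxy + (1 + fy**2) * fxx) / (2 * denom**1.5)
    # Principal curvatures and the shape index mapped to [0, 1].
    disc = np.sqrt(np.maximum(H**2 - K, 0))
    k1, k2 = H + disc, H - disc
    shape_index = 0.5 - (1.0 / np.pi) * np.arctan2(k1 + k2, k1 - k2)
    return K, H, shape_index
```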
(a) Mian et al.[151] (b) Peng et al.[179] (c) Wang et al.[222]
Fig. 1. An illustration of nose tip detection methods.
3.1.2 Profile based Methods. These methods extract profiles from a 3D face and then detect the nose tip from these 2D profiles.
Segundo et al. [194] generated a profile curve and a median curve by calculating the maximum and median depth values for the points with the same y coordinate in a face range image. The nose tip is then defined as the peak of the profile curve and is further checked using both the profile and the median curves. Experimental results show that a detection rate of 99.95% is achieved on the FRGC v2.0 dataset. Mian et al. [151] cut a 3D face into several horizontal slices, and then inscribed a triangle inside a moving circle along each slice. The point with the largest triangle altitude along each slice is then considered as a nose tip candidate, which is further filtered using the Random Sample Consensus (RANSAC) approach. The remaining point with the largest triangle altitude is finally determined as the nose tip, as shown in Fig. 1(a). This method is very time-consuming [222], and it is limited to near-frontal faces with small yaw and pitch variations [177].
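As a simple illustration of the profile idea (in the spirit of Segundo et al. [194], but not their exact procedure), the sketch below builds a profile curve from the maximum depth of each row of a near-frontal range image and takes its most prominent peak as a nose tip candidate; the background value and the absence of the median-curve check are simplifying assumptions, and the candidate would normally be verified as described above.

```python
import numpy as np

def nose_tip_from_profile(depth, invalid=0.0):
    """Crude nose tip candidate for a near-frontal range image.

    depth   : 2D array where larger values mean closer to the sensor
    invalid : value marking background / missing pixels
    """
    masked = np.where(depth != invalid, depth, -np.inf)
    profile = masked.max(axis=1)        # profile curve: maximum depth per image row
    row = int(np.argmax(profile))       # most prominent peak of the profile
    col = int(np.argmax(masked[row]))   # closest point within that row
    return row, col

# Usage: r, c = nose_tip_from_profile(range_image)   # pixel coordinates of the candidate
```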
Faltemier et al. [68] rotated a 3D face around the vertical axis to obtain 37 profiles, and then matched each profile with two manually extracted nose models along the profile; the location with the minimal matching error is finally determined as the nose tip. A detection rate of over 96.5% is achieved on faces under pose, expression, and occlusion variations. However, this method is sensitive to scale variations. Peng et al. [179] rotated a 3D face 61 times to generate a set of left-most and right-most profiles. Nose tip candidates are detected by moving a circle along each face profile and checking the area of the circle enveloped inside the profile. Nose tip candidates are further filtered using a cardinal point fitness and a spike fitness, as shown in Fig. 1(b). This method achieves a detection rate of 99.43% on the FRGC v2.0 dataset and is also able to estimate the roll, yaw and pitch angles of a face.
Wang et al. [222] first obtained the central profile of a 3D face by intersecting the facial surface with its symmetry plane, and then determined the nose tip as the point on the central profile with the largest distance to the fitting plane of the facial surface, as shown in Fig. 1(c). It is demonstrated that 99.75% of nose tips are correctly detected on the FRGC v2.0 dataset with a 4mm tolerance error, which is better than [141]. Spreeuwers [205] first projected a 3D face onto its symmetry plane to obtain a profile, and then detected the point on the profile with the largest depth value. Next, a straight line is fitted to the profile around the detected point, and the nose tip is finally determined as the intersection between this fitted line and the line passing through the detected point along the depth direction. It is claimed that the detected nose tips are slightly more stable than those detected using the largest curvature or coordinate value.
3.1.3 Depth based Methods. These methods assume that the nose tip is the point closest to the sensor and detect
the nose tip using depth information.
Lu et al. [139, 141] first found the position with the maximum depth value in each row of a depth image. The column containing the largest number of these selected positions was used to determine the mid-line of the 3D face. The nose tip was then found along the mid-line using the gradient of the mid-line curve and the depth value. A nose tip localization accuracy of 5mm was achieved. However, this method can only work on frontal faces. To handle 3D faces with different frontal poses, Lu and Jain [140] rotated a face scan around the vertical axis and determined the nose tip candidates as the points with the largest depth value. These candidates were then filtered by checking the nose profile. This method still does not consider the pitch variation of a 3D face. Mohammadzade and Hatzinakos [156] first detected nose tip candidates using the depth information. They then trained a PCA space using a set of nose region surfaces. A candidate was considered as a nose tip if the distance between the nose region of that candidate and its projection onto the PCA space was smaller than a threshold. Experimental results showed that all nose tips in the FRGC dataset can be successfully detected.
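The candidate verification step of Mohammadzade and Hatzinakos [156] can be sketched as follows: a PCA subspace is learned from cropped nose-region depth patches, and a candidate is kept only if its surrounding patch is well reconstructed by that subspace. The patch representation, the number of components and the threshold are illustrative assumptions, not values from the original paper.

```python
import numpy as np

class NoseSubspace:
    """PCA subspace of nose-region depth patches used to verify nose tip candidates."""

    def __init__(self, n_components=20):
        self.n_components = n_components

    def fit(self, patches):                      # patches: (N, h, w) training nose regions
        X = patches.reshape(len(patches), -1).astype(float)
        self.mean_ = X.mean(axis=0)
        # Principal directions from the SVD of the centered training patches.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def reconstruction_error(self, patch):
        x = patch.reshape(-1).astype(float) - self.mean_
        projection = self.components_.T @ (self.components_ @ x)
        return np.linalg.norm(x - projection)

def is_nose_tip(model, patch, threshold=5.0):    # the threshold value is illustrative
    """A candidate is accepted when its patch lies close to the learned nose subspace."""
    return model.reconstruction_error(patch) < threshold
```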
3.1.4 Learning based Methods. These methods first learn a model from a set of training data around labelled nose tips, and then use the trained model for nose tip detection.
Xu et al. [229] first defined an effective energy to characterize the point distribution of a local surface and to select nose tip candidates; they then used the means and variances of the effective energy sets to train an SVM classifier, which finally determines the nose tip location. A correct detection rate of 99.3% is achieved on a dataset containing 280 faces. Mian et al. [150] extracted Haar-like features from a facial range image and its x and y gradient images to train the AdaBoost algorithm for nose detection. Multiple nose detection results in the three images are clustered and anthropometric ratios are used to remove outliers; a detection rate of 99.18% is reported on the FRGC v2.0 dataset. Wang et al. [221] first trained individual PCA subspaces for four landmarks including the nose tip using the point signature feature [46]. Each point on a query face is then projected onto the subspace and the one with the smallest reconstruction error is considered as the landmark. However, this method is computationally expensive. Zhao et al. [247] used a statistical model (i.e., PCA) to learn both the global variations in 3D face morphology and the local variations around each face landmark using both texture and geometry information. The landmarks (including the nose tip) are determined by maximizing a posterior probability. A localization error of less than 5mm is achieved on the FRGC dataset, but the method is very time-consuming. Besides, several methods for 3D facial landmark detection are also available in the literature, e.g., [29, 55, 70, 79, 143, 174, 180, 204], which are highly related to nose tip detection.
3.2 Data Filtering
Raw 3D facial scans usually contain spikes, holes and noise due to the low scanning quality [144]. Spikes are commonly detected by checking the discontinuity of points, and are removed by thresholding [63, 151] or median filtering [67, 185, 247]. Besides, holes can be found in 3D facial scans due to spike removal, self-occlusion, specular reflection of the local surface, and light absorption in dark areas. Small holes can be filled using linear interpolation [63, 156, 205], bilinear interpolation [177], or the linking of boundary edges [59]. Large holes can be inferred using the prior of face symmetry [205]. The noise in 3D facial scans can further be smoothed using different filtering methods, such as the 2D Wiener filter [156] and the bilateral smoothing filter [63]. Finally, resampling is usually performed on the cropped 3D face to ensure a uniform distribution of 3D facial points [151, 177].
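A minimal sketch of such a filtering pipeline, assuming the scan is stored as a range image with zeros marking missing data: spikes are detected as large deviations from a local median, the resulting (and pre-existing) small holes are filled by interpolation, and the remaining noise is smoothed. The window sizes, threshold and filters are illustrative choices, not those of any particular cited method.

```python
import numpy as np
from scipy.ndimage import median_filter, gaussian_filter
from scipy.interpolate import griddata

def clean_range_image(depth, spike_thresh=5.0, hole_value=0.0):
    """Spike removal, hole filling and smoothing for a raw facial range image."""
    z = depth.astype(float)
    valid = z != hole_value
    # 1. Spike removal: points far from the local median are treated as outliers.
    med = median_filter(z, size=5, mode='nearest')
    valid &= np.abs(z - med) <= spike_thresh
    # 2. Hole filling: interpolate missing pixels from the remaining valid ones.
    rows, cols = np.indices(z.shape)
    filled = griddata((rows[valid], cols[valid]), z[valid],
                      (rows, cols), method='linear', fill_value=np.nan)
    # 3. Light smoothing of the remaining noise.
    return gaussian_filter(np.nan_to_num(filled, nan=hole_value), sigma=1.0)
```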
3.3 Pose Normalization
To address the pose variations of facial scans, pose normalization is required by 3D face recognition algorithms working on pose-dependent features. Mian et al. [151] performed PCA on the points of a cropped 3D face to generate three principal axes, which are then used to form a rotation matrix for face pose normalization. The aligned face is then resampled and the pose normalization process is repeated until convergence. Experimental results on
the FRGC v2.0 dataset showed that the algorithm is robust to facial expressions and hair. This algorithm has been used in several 3D face recognition systems [85, 122, 123, 156]. Spreeuwers [205] used the vertical symmetry plane of a facial scan and the slope of the nose bridge to define an Intrinsic Coordinate System (ICS) for the face, and then aligned the face with the ICS to achieve pose normalization. Similarly, Wang et al. [222] used the nose tip, the nose bridge direction, and the unit normal of the symmetry plane to perform pose normalization for a 3D facial scan. Besides, pose normalization can be achieved using facial landmarks. Theoretically, a minimum of three landmarks on a face is sufficient to perform pose normalization [142, 154]. Furthermore, pose normalization can also be achieved by registering a 3D facial scan to a reference 3D face, which is usually an average face model in a canonical pose generated from training data [154].
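The PCA-based normalization of Mian et al. [151] can be sketched as follows: the principal axes of the cropped facial point cloud define a rotation to a canonical frame, and the step is repeated until the estimated rotation is close to the identity. The resampling between iterations and the sign-disambiguation details of the original method are omitted or simplified here.

```python
import numpy as np

def pose_normalize(points, max_iter=10, tol=1e-4):
    """Align a cropped 3D face (N x 3 points) with its principal axes (Mian-style sketch)."""
    pts = points - points.mean(axis=0)
    total_rotation = np.eye(3)
    for _ in range(max_iter):
        _, eigvecs = np.linalg.eigh(np.cov(pts.T))   # eigenvectors of the covariance matrix
        R = eigvecs[:, ::-1].T                       # rows = principal axes, decreasing variance
        # Resolve the sign ambiguity of each axis and keep a proper rotation (no reflection).
        signs = np.sign(R[np.arange(3), np.abs(R).argmax(axis=1)])
        R = R * signs[:, None]
        if np.linalg.det(R) < 0:
            R[2] *= -1
        pts = pts @ R.T
        total_rotation = R @ total_rotation
        if np.abs(R - np.eye(3)).max() < tol:        # converged: the new rotation is ~identity
            break
    return pts, total_rotation
```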
4 3D FACE RECOGNITION
According to the facial representation type, geometry based 3D face recognition algorithms can further be classified into landmark based, curve based, local patch based, and holistic methods.
4.1 Landmark based Algorithms
Gordon [81] used the left eye width, the right eye width, the eye separation, the total width of the eyes, the nose height, the nose width, the nose depth, the head width and the curvatures to generate a feature descriptor of a 3D face. The 3D face recognition experiments are performed by calculating the distances between these feature descriptors of 24 faces. Hüsken et al. [105] first extracted several facial landmarks (e.g., nose, eyes, and mouth) and then used Hierarchical Graph Matching (HGM) to perform 2D and 3D face recognition. It is observed that the fusion of the 2D and 3D modalities improves the results compared with a single modality.
Gupta et al. [90] used the Euclidean and geodesic distances between the 45 pairs (i.e., $\binom{10}{2} = 45$) of 10 anthropometric facial fiducial points as the feature of a 3D face. The stepwise linear discriminant analysis method is then used for feature selection and Fisher's Linear Discriminant Analysis (LDA) classifier is employed to perform 3D face recognition. Experimental results on the Texas 3D Face Recognition Database show that an EER of 1.98% and an R1RR of 96.8% are achieved with automatically detected fiducial points. However, the detection of these fiducial points requires the frontal upright position of a 3D face [90] and is also computationally expensive [59].
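To illustrate the kind of representation used by Gupta et al. [90], the sketch below converts a set of detected 3D fiducial points into a vector of pairwise Euclidean distances (10 landmarks yield the 45 pairs mentioned above); the geodesic counterparts would additionally require the facial mesh and are not shown, and the landmark ordering is an assumption.

```python
import numpy as np
from itertools import combinations

def pairwise_distance_feature(landmarks):
    """Pairwise Euclidean distances between detected 3D fiducial points.

    landmarks : (K, 3) array of landmark coordinates; K = 10 gives a 45-dimensional vector.
    """
    return np.array([np.linalg.norm(landmarks[i] - landmarks[j])
                     for i, j in combinations(range(len(landmarks)), 2)])

# Such vectors would then go through feature selection and an LDA classifier, as described above.
feature = pairwise_distance_feature(np.random.rand(10, 3))   # toy example
assert feature.shape == (45,)
```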
4.2 Curve based Algorithms
These methods extract curves or strips from 3D facial surfaces as feature representations for 3D face recognition. These methods can further be divided into profile-based and contour-based methods, where profiles represent open curves with starting and end points, and contours represent non-intersecting and closed curves with different lengths [144, 197]. The two major problems of these methods are curve extraction and representation matching [222].
4.2.1 Profile based Algorithms. These methods extract vertical profiles, horizontal profiles or radial curves from 3D facial surfaces for face representation.
Nagamine et al. [164] conducted the pioneering work to test the distinctiveness of three different types of profiles (i.e., vertical, horizontal and circular profiles) extracted from various locations on a 3D facial surface. It is observed that the vertical profiles in the central region of a face, the circular sections crossing the inner corners of the eyes and the part of the nose are highly distinctive. In contrast, the distinctiveness of horizontal profiles is relatively low. Beumier and Acheroy [16] extracted the central and the lateral profiles from a 3D face based on the
(a) Zhang et al. [242] (b) Drira et al. [59] (c) Lei et al. [123] (d) Samir et al. [189] (e) Srivastava et al. [207] (f) Berretti et al. [10]
Fig. 2. Examples of different curve based 3D face representations.
assumption of vertical facial symmetry. They then performed face recognition by comparing the curvatures on the profiles of two faces. Experiments are conducted on a 3D face dataset acquired by an in-house structured light system, and an EER of 7.25% is achieved. It is observed that the nose, eyes, moustaches and beards are challenging for 3D scanning. Besides, combining frontal and profile views can improve recognition performance. Beumier and Acheroy [17] further combined the 3D and grey-level data along the central and lateral profiles to improve the face recognition performance. Based on the assumption that a 3D face is symmetric, Pan et al. [171] proposed a robust symmetry plane detection method to extract facial profiles. Profiles are then matched using the Hausdorff distance for face recognition. Zhang et al. [242] used a symmetry profile (i.e., a vertical profile which passes through the nose tip), a forehead profile and a cheek profile to represent a 3D face, as shown in Fig. 2(a). The similarity between two 3D faces is then calculated as the weighted sum of the distances between these three corresponding profiles. However, this method is very sensitive to varying facial expressions.
Drira et al. [59] represented 3D facial surfaces with radial curves emanating from the nose tip by slicing the facial surface with several planes, as shown in Fig. 2(b). A Riemannian framework is then developed to analyze the elastic shapes of these curves and to match the shapes of facial surfaces. Besides, an occlusion detection and removal step is proposed based on the recursive-ICP algorithm. To handle missing data, a restoration step is further introduced using statistical estimation on shape manifolds of curves. Similarly, Lei et al. [123] proposed an Angular Radial Signature (ARS) for 3D face representation by emanating a set of curves from the nose tip at equal angular intervals, as shown in Fig. 2(c). Middle-level features are then extracted from ARSs using Kernel Principal Component Analysis (KPCA) and further fed into an SVM to perform face recognition. This method achieves good performance in terms of both recognition rate and efficiency. Yu et al. [237] represented 3D facial scans by an ordered ensemble of radial strings emanating from the nose tip in 2D space, and then matched two 3D facial scans through a string-to-string scheme based on dynamic programming. The inherent partial matching mechanism during radial string matching ultimately eliminates the impact of occlusions. Jribi et al. [116] proposed a multi-polar geodesic representation for 3D face recognition, which is invariant under the Special Euclidean group SE(3). Based on three reference points extracted from the nose tip and eye corners, a set of level curves on facial meshes are generated and then sampled uniformly to obtain a set of finite and ordered points. Finally, the principal curvatures computed on these points are used as the 3D face feature descriptor. Later, they re-parameterized three static parts around the nose and the two eyes with multi-polar geodesic representations [115]. For the static part around the nose, the nose tip and the two inner corners of the eyes are used to form the three reference points. For the static part around each eye, the center and the two outer corners of the eye are used to form the three reference points. Nassih et al. [165] took the geodesic distance as the feature of the facial curves defined by a set of manually selected points, and 3D face recognition is accomplished based on PCA and a random forest classifier.
4.2.2 Contour based Algorithms. These methods extract contours (i.e., level curves) from 3D facial surfaces for
face representation.
Samir et al. [188] represented a facial surface $S$ with a union of planar level curves of the height (i.e., depth) function, i.e., $S = \bigcup_{\lambda} C_\lambda$, where $C_\lambda = \{p \in S \mid \phi(p) = \lambda\}$ and $\phi(p)$ is the depth of point $p$. The similarity of two 3D facial surfaces is then calculated as the aggregated geodesic distance between their corresponding level curves. This work has clearly demonstrated the potential of geometric facial curves for 3D face recognition. However, the level curves are different for a face with different orientations. That is, this level curve representation is not completely invariant to rotation. Later, Samir et al. [189] represented a facial surface $S$ with a union of 3D level curves of a surface distance function from the nose tip, as shown in Fig. 2(d). This representation is invariant to rotation and translation. Numerical methods for the calculation of geodesic paths between facial surfaces in the Riemannian space are also provided. Note that the level curves can be affected by some facial expressions such as an open mouth. Besides, this method is unable to handle missing data (e.g., caused by occlusion or pose variations).
Similarly, Li et al. [128] generated a Deformation Invariant Image (DII) for a textured 3D face by sampling the intensity image with geodesic level curves (which are defined on the 3D surface). LDA is then performed on the DII representations and the Mahalanobis cosine distance between two facial representations is used to measure their similarity. Srivastava et al. [207] represented a facial surface $S$ using level curves defined in a Darcyan coordinate system, as shown in Fig. 2(e). The coordinate system is located at the nose tip, and its two coordinates specify the distances from the nose tip and from the symmetry plane of the face, respectively. Consequently, deformations or geodesic paths between 3D facial surfaces can be obtained by analyzing geodesics between level curves.
The level curve representation is further extended to a strip representation. For example, Berretti et al. [9, 10] represented each 3D face with a set of iso-surfaces generated by the points with the same geodesic distance from the nose tip, as shown in Fig. 2(f). They then encoded these iso-geodesics and their relationships as a graph representation using 3D Weighted Walkthroughs (3DWWs). 3D face recognition is achieved using a structural similarity defined on 3DWWs. It is claimed that partitioning a facial scan into iso-geodesic stripes approximates the local morphology of faces with facial expressions. This method is therefore robust to facial expressions, and it is also efficient for matching. Experimental results on the Gavab dataset show that R1RRs of 93.5% and 82% are achieved for neutral and non-neutral faces, respectively [9]. It is also reported that VRs@0.1%FAR of 96.31% and 80.87% are achieved for neutral and non-neutral faces, respectively [10]. However, this method requires that the mouths of all faces in the dataset are always open or always closed [10]. Shi et al. [196] first represented a 3D facial surface with iso-geodesic curves, and then extracted four kinds of Frenet frame based features for each point of the iso-geodesic curves.
Abbad et al. [2] decomposed each 3D facial surface into multiple Intrinsic Mode Functions (IMFs) and a residual using Surfaces Empirical Mode Decomposition (SEMD). The different scales of IMFs represent different levels of spatial oscillation modes of a surface, and the residual represents the lowest frequency component of the surface. Then, both the radial and the level facial curves are extracted from the 3D surface and each point on the extracted curves is described by the Wave Kernel Signature (WKS). Thus, each IMF and the residual surface can be represented by the radial curves, the level facial curves and their corresponding wave kernel signatures. For 3D face recognition, the similarity between surfaces (IMF and residual) at the same scale is finally computed based on the angle between feature vectors. In [92], different types of profiles and contours are evaluated to select a subset of facial curves for feature matching. An optimal combination of 8 curves achieves a Mean Average Precision (MAP) of 0.70 and a recognition rate of 92.5% on the Shape Retrieval Contest (SHREC'08) dataset.
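To make the level-curve idea concrete, the sketch below extracts planar level sets of the depth function from a cropped, pose-normalized range image, in the spirit of the height-function representation of Samir et al. [188]; their later surface-distance and geodesic formulations require mesh-based geodesic computations that are not shown. The level spacing and the use of scikit-image's contour finder are assumptions of this sketch.

```python
import numpy as np
from skimage import measure

def depth_level_curves(depth, n_levels=15):
    """Planar level curves of the height (depth) function of a cropped facial range image."""
    valid = depth[depth > 0]
    levels = np.linspace(valid.min(), valid.max(), n_levels + 2)[1:-1]
    curves = {}
    for level in levels:
        # Each level value may produce several open/closed contours in the image plane.
        curves[level] = measure.find_contours(depth, level)
    return curves
```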
4.2.3 Summary. Since curves sampled on a 3D face are denser than landmarks, they offer more geometric information about the 3D facial surface. Consequently, curve based methods are usually more discriminative than landmark based methods. Besides, curve based methods can encode the geometric information of a 3D face from
dierent areas of the face. Therefore, their robustness to facial expressions is boosted [
144
]. However, curved
based methods have also several limitations. First, these methods rely on the accurate localization of proles or
contours. Consequently, robust and accurate preprocessing of 3D faces, such as pose normalization and nose tip
detection, is highly required for the accurate localization of proles and contours. Second, these methods usually
sample an entire 3D facial surface with sparse curves, part of the surface information is lost. Consequently, their
discriminative power is still limited [197].
4.3 Local Patch based Algorithms
These methods extract local patches from 3D faces to handle the global shape variation of faces caused by pose changes, facial expression variations, noise and occlusions [11, 90, 151]. According to the type of classifiers, these methods can further be divided into sub-region feature matching, keypoint feature matching, surface registration, and machine learning based methods.
4.3.1 Sub-region Feature Matching based Algorithms. These methods first extract local patches from several pre-defined sub-regions which are usually less sensitive to facial expressions, and then calculate the similarity between two faces using the feature matching results of these extracted local patches.
Based on the signs of the mean and Gaussian curvatures, Moreno et al. [159] used HK segmentation [214] to isolate the regions of pronounced curvature from a 3D face, and then extracted 86 features (e.g., areas, distances, angles, and average curvature) from these regions. Finally, 35 discriminative features are selected to recognize 3D faces using the Euclidean distance between feature descriptors. Later, based on the signs of the mean and Gaussian curvatures, Moreno et al. [158] assigned each 3D point of the facial mesh a label describing the local shape of the surface. Then, thirty local geometrical features are selected as the most discriminating ones from a set of 86 features according to the Fisher coefficient. Face recognition is finally accomplished using a PCA or SVM classifier.
Xu et al. [231] considered that areas with larger shape variations are important to characterize individuals, and used four regions (mouth, nose, left eye and right eye) described through Gaussian-Hermite moments to represent the local shape variation information of 3D faces. Lin et al. [133] used LDA to learn the optimal weights for the fusion of the similarity scores obtained from multiple local regions of 3D faces. It is shown that the fusion of multiple regions can significantly improve face recognition performance under varying facial expressions. Zhong et al. [248] divided each image into several local patches and used the Gabor filter response vectors of each patch to generate a 3rd-order tensor. The tensors of all local patches are used to generate a number of sub-codebooks, which are further concatenated to form a Learned Visual Codebook (LVC). The $\ell_1$ distance based nearest neighbor (NN) classifier is performed on the LVCs for face recognition.
Spreeuwers [205] defined an intrinsic coordinate system for each face using the vertical symmetry plane of the face, the nose tip and the slope of the nose bridge. They then registered each face to the intrinsic coordinate system and proposed a 3D face classifier based on the fusion of several dependent region classifiers for overlapping regions, as shown in Fig. 3(a). For each region classifier, PCA-LDA is used to extract features from the range image of the face and the likelihood ratio is used as a matching score. The fusion is achieved using majority voting. Later, Spreeuwers improved this method by dealing with head motion, unreliable estimation of registration parameters, and the sensitivity to outliers during training. The verification rate at a FAR of 0.1% increases from 94.6% to 99.3% and the identification rate increases from 99.0% to 99.4% on the FRGC v2.0 dataset [206].
Alyuz et al. [6] divided the whole facial surface into four regions: the eye/forehead, nose, cheek, and mouth-chin regions, as shown in Fig. 3(b). The probe face is then registered with these four regions after a coarse registration with the average face model. Four independent sets of similarity measures between the probe and gallery faces are calculated, and then fused for face recognition. Hajati et al. [93] proposed a Patch Geodesic Distance (PGD) algorithm to transform the 2D texture map for 2.5D face recognition. Specifically, both the range image and the texture image are first partitioned into equal-sized square patches in a non-overlapping manner, as shown in Fig.
(a) Spreeuwers[205] (b) Alyuz et al.[6] (c) Hajati et al.[93]
Fig. 3. Examples of sub-region feature matching based methods.
3(c). To compute the PGD for all surface points, a local geodesic distance for each point within its patch and a global geodesic distance measuring the distance between patches in the partitioned 2.5D image are computed. Then, the 2D texture map is transformed according to the computed patch geodesic distances, and Pseudo-Zernike Moments (PZMs) are computed as a patch descriptor for each patch. The dissimilarity between a probe scan and a gallery scan is computed based on the PZMs and the location of each patch in the transformed texture map. Soltanpour et al. [203] extended the Local Derivative Pattern (LDP) to surface normal components and proposed a Local Normal Derivative Pattern (LNDP) descriptor to encode derivative direction variations. For 3D face recognition, each range image is first resized and then divided into several local patches. The histogram of the LNDP is extracted for each patch and then concatenated over all patches and different directions. The final descriptor consists of three histograms corresponding to the x, y and z channels of the normal components. The similarity between two facial surfaces is measured based on the common areas of the two histograms.
Emambakhsh et al. [62] extracted the nasal regions based on nose tip detection and face segmentation, and detected seven landmarks located on the sub-nasale, eye corners and nasal alar groove of the nasal regions. To reduce the sensitivity to noise and enable the extraction of multi-resolution directional region-based information from the nasal region, the normal vectors are derived from the depth map filtered by Gabor wavelets. Then, new keypoints are obtained by dividing the horizontal and vertical lines that connect the seven landmarks, and are described through spherical patches and nasal curves. Finally, patches and curves that are stable over different facial expressions are selected through a heuristic genetic algorithm. Ocegueda et al. [167, 168] constructed a graph from the 3D mesh of the face and utilized a Markov random field model to measure the probability of each vertex being discriminative or non-discriminative. Then, the authors extended this model and constructed a compact and robust feature consisting of 360 coefficients for face recognition.
4.3.2 Keypoint Feature Matching based Algorithms. These methods first extract a number of repeatable keypoints, and then represent the local patch around each keypoint using a surface feature descriptor. The similarity between two faces is finally calculated by matching these surface descriptors.
Wang and Chua [220] manually localized a few sparse feature points or evenly sampled a large number of dense feature points on a 3D facial scan, and then used the 3D Gabor filter and the 3D spherical Gabor filter to represent each feature point. The Least Trimmed Square Hausdorff Distance (LTS-HD) is finally used to address the partial matching problem between probe and gallery faces. Mian et al. [152] extracted a set of repeatable keypoints from locations on 3D facial surfaces with large shape variations. They then represented each keypoint with a pose invariant feature generated by fitting a surface with a uniform grid to the neighborhood of the keypoint. Local features of two 3D faces are matched to obtain two corresponding graphs. The similarity of two faces is finally
calculated as the similarity between their corresponding graphs. When using 3D data alone, this method achieves an R1RR of 93.5% and a VR@0.1%FAR of 97.4% on the FRGC v2.0 “Neutral versus All” experiment.
Huang et al. [103] first extracted multiscale extended Local Binary Patterns (eLBP) from the range image of a 3D face, resulting in several eLBP images. These eLBP images correspond to different scales and LBP attributes (i.e., the signs and absolute values of gray value differences). Then, the SIFT method [136] is applied to these eLBP images to detect keypoints and generate local feature descriptors. The similarity between a probe face and a gallery face is measured by the fusion of three similarities, i.e., the number of matched keypoint pairs, the similarity of the facial component constraint (i.e., the matching score between local features in several pre-defined subregions of the two faces), and the similarity of the facial configuration constraint based on graph matching [152]. It is claimed that this method is robust to facial expression variations, partial occlusions, and moderate pose changes. Because of its advantage of preserving the full 3D geometry of the shape, Werghi et al. [226] extended the mesh-LBP to face recognition. First, a plane based on the nose tip and the inner-corner landmark points is constructed, and an ordered and regularly spaced set of points on the plane is extracted. Then, the neighborhood facets around these grid points are defined and used to compute multi-resolution mesh-LBP descriptors. Finally, the histograms of these descriptors are integrated to represent the whole or a partial facial surface. In addition, the photometric channel can also be directly fused over the mesh support.
Smeets et al. [201] utilized the meshSIFT algorithm for 3D face recognition. Specifically, points with mean curvature extrema in scale space are first detected as salient points on the 3D facial surface. Second, canonical orientations for these salient points are calculated based on the normal vectors of each vertex. Third, each salient point is described through the concatenation of two histograms of shape indices and slant angles. The similarity between two facial surfaces is then computed based on the angle between their feature vectors. Berretti et al. [14]
extracted a number of 3D keypoints on a facial scan using the MeshDOG method [240], and represented the local surface around each keypoint using the meshHOG [240], Signature of Histograms of OrienTations (SHOT) [212], and Geometric Histogram (GH) descriptors. Face similarity is measured by the number of inliers refined by the RANSAC [74] algorithm. Berretti et al. [11] also extracted keypoints and their corresponding descriptors from the depth image of a 3D facial scan using the SIFT algorithm, and then a set of keypoint correspondences is generated by matching the SIFT descriptors of a probe face to the gallery faces. RANSAC based spatial constraints are imposed to remove outlier correspondences and the similarity between two faces is generated using the distances between facial curves connecting pairs of matched keypoints. Later, Berretti et al. [15] extended this approach through the selection of the optimal scale, and the selection of stable keypoints and the most discriminative features of the local descriptor. Compared with [14], the overall rank-1 recognition rate on Bosphorus improves from 93.4% to 94.5%, and the computational cost is reduced to 1/25. In addition, Berretti et al. [13] proposed a super-resolution approach [12] to construct a high-resolution facial model by iteratively registering a sequence of low-resolution 3D scans to a reference frame, and then performed face recognition using an approach similar to [11].
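The depth-image SIFT matching used by several of the methods above (e.g., Berretti et al. [11]) can be sketched with OpenCV as follows: keypoints and descriptors are extracted from normalized range images and matched with Lowe's ratio test, and the number of surviving correspondences serves as a simple similarity score. The normalization, the ratio threshold, and the omission of the RANSAC-based spatial filtering are simplifications of this sketch rather than details of the cited methods.

```python
import cv2
import numpy as np

def depth_to_u8(depth):
    """Normalize a facial range image to 8-bit so that OpenCV's SIFT can process it."""
    d = np.where(depth > 0, depth, depth[depth > 0].min())   # assumes background is stored as 0
    return cv2.normalize(d, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

def sift_similarity(depth_probe, depth_gallery, ratio=0.75):
    """Number of ratio-test-filtered SIFT correspondences between two range images."""
    sift = cv2.SIFT_create()
    _, des1 = sift.detectAndCompute(depth_to_u8(depth_probe), None)
    _, des2 = sift.detectAndCompute(depth_to_u8(depth_gallery), None)
    if des1 is None or des2 is None:
        return 0
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    # Lowe's ratio test keeps only distinctive correspondences; a spatial-consistency
    # check (e.g. RANSAC, as in the methods above) would typically follow.
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]
    return len(good)
```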
Li et al. [126] first detected repeatable points distributed over the entire facial region based on two principal curvatures. Then, each keypoint is described based on the Histogram of Multiple surface differential Quantities (HOMQ) descriptor which combines the Histogram of Gradient (HOG), the Histogram of Shape index (HOS), and the Histogram of Gradient of Shape index (HOGS) at the feature level. The 3D face recognition is achieved through a Sparse Representation based Classifier (SRC), which computes the accumulated sparse reconstruction error for all keypoints of a probe face. Guo et al. [85] extracted a few highly repeatable keypoints according to the geometric variation of the local surface around a keypoint, and described each keypoint through the Rotational Projection Statistics (RoPS) descriptor [86]. Face recognition is accomplished by combining local feature matching and 3D point cloud registration algorithms. Lei et al. [124] represented each facial scan with a set of local keypoints. Each keypoint is described based on Multiple Triangle Statistics (KMTS) which is robust to partial facial data, large facial expressions and pose variations. Then, a Two-Phase Weighted Collaborative
Representation Classication (TPWCRC) framework is proposed to deal with the face recognition problem.
Compared with other methods, this method pays more attention to partial data (missing parts, and occlusions)
and single training sample.
Hariri et al. [94] represented each 3D facial surface with a set of uniformly sampled feature points. Each feature point is the center of a patch with a fixed radius, and is characterized by the covariance of its geometric features. During matching, the probe facial surface is first aligned with the gallery surfaces using the ICP algorithm, and then a global similarity measure based on the geodesic distances on the manifold is computed between two surfaces. Yu et al. [238] represented a 3D facial mesh with a set of sparse 3D directional vertices (3D²V) and performed 3D face recognition using a set-to-set dissimilarity measure. Specifically, corner points are extracted from the ridge and valley curves to generate directional vertices. Each directional vertex is composed of its 3D coordinates (x, y, z) and two unit vectors pointing to its two neighboring vertices on the curve. The dissimilarity between two 3D²Vs is defined as the cost of a conversion process which makes these two 3D²Vs fully overlap. For the 3D face recognition task, the probe faces and gallery faces are first represented by sets of sparse 3D²Vs, and then the dissimilarity is computed using the Hausdorff distance (HD) or the iterative closest point (ICP) mechanisms. Gilani et al. [78] proposed to utilize dense correspondences between a large number of 3D faces to construct a Keypoint-based 3D Deformable Model (K3DM). Specifically, the faces in the dataset are first organized into a minimum spanning tree to increase the possibility of finding point matches between pairs of faces. Then, the dense correspondences are generated by an iterative process based on the currently established point matches. At each iteration, a 2D Delaunay triangulation on the X-Y plane is performed, and narrow surface patches defined on triangle edges between two parent/child nodes of the constructed tree are aligned using a non-rigid registration algorithm. The points on the narrow surface patches are extracted as keypoints based on the eigenvalues of the covariance matrix, and then matched by calculating the similarity between corresponding feature descriptors. The above process is repeated for all surface patches in a pair of faces and for all pairs of faces in the constructed tree. After obtaining the final set of point matches, dense points are then generated uniformly using a level set based sampling strategy and matched by calculating the similarity of feature vectors. The K3DM model is finally constructed based on these dense correspondences, and face recognition is performed by fitting the query face to the constructed model. Boumedinea et al. [25] constructed a dictionary based on the SURF descriptor for a dataset captured by Kinect, and conducted 3D face recognition using a KNN algorithm in the feature space.
4.3.3 Surface Registration based Algorithms. These methods divide each 3D facial surface into several local regions to handle facial expressions, and then perform surface registration between the corresponding local surfaces of two faces to generate multiple matching scores. These matching scores are finally fused to obtain the overall similarity between the two faces. Different approaches for surface segmentation, surface registration, and score fusion have been developed in the literature.
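A minimal sketch of this recipe is given below. Region segmentation is assumed to have been done already (probe_regions and gallery_regions are hypothetical lists of corresponding (N, 3) point arrays), a basic point-to-point ICP stands in for the more elaborate registration schemes cited below, and the sum rule is used for score fusion.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_rms(src, dst, iters=30):
    """Rigidly align src to dst with a basic point-to-point ICP and
    return the final root-mean-square nearest-neighbour distance."""
    tree = cKDTree(dst)
    src = src.copy()
    for _ in range(iters):
        _, idx = tree.query(src)                 # current closest-point correspondences
        matched = dst[idx]
        mu_s, mu_d = src.mean(0), matched.mean(0)
        H = (src - mu_s).T @ (matched - mu_d)    # cross-covariance for the Kabsch solution
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:                 # avoid reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        src = (src - mu_s) @ R.T + mu_d
    d, _ = tree.query(src)
    return float(np.sqrt((d ** 2).mean()))

def fused_score(probe_regions, gallery_regions):
    # sum rule over per-region registration errors (lower means more similar)
    return sum(icp_rms(p, g) for p, g in zip(probe_regions, gallery_regions))
```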
Chang et al. [41, 43] detected three regions around the nose and then matched each region independently from a probe face to gallery faces, as shown in Fig. 4(a). The three matching scores are combined to determine the identity of the probe face. Experimental results show that this method is more robust to facial expression variations than the holistic methods. Similarly, Faltemier et al. [65] segmented each facial surface into 7 regions and performed face recognition by fusing the matching scores. Later, Faltemier et al. [67] performed score-based fusion on 38 segmented regions of a facial scan. It is observed that the fusion of 28 regions using the Borda count and the consensus voting methods achieves the best performance. Mian et al. [153] registered the eyes-forehead and nose regions of a probe face to their corresponding regions of a gallery face individually (as shown in Fig. 4(b)). The matching scores are then fused to produce the face recognition result. It is reported that an identification rate of 100% and a verification rate of 99.42% are achieved on the UND Biometrics Database. It is also observed that the eyes-forehead is the most important region for 3D face recognition. This work is further extended in [151] by
HQWLUHIDFH UHJLRQ& UHJLRQ1 UHJLRQ,
(a) Chang et al.[43]
H\HVIRUHKHDG
QRVH
FKHHNV
(b) Mian et al.[153]
HQWLUHIDFH FLUFXODUQRVH
DUHD
HOOLSWLFDO
QRVHDUHD
XSSHUKHDG
(c) eirolo et al.[185]
Fig. 4. Surface registration based 3D face recognition methods.
including a rejection classier, which is based on the matching of the holistic 3D Spherical Face Representation
(SFR) and SIFT descriptors. Queirolo et al. [
185
] used four regions of a 3D face for face recognition, including
the entire face, the circular nose area, the elliptical nose area, and the upper head, as shown in Fig. 4(c). Surface
registration between corresponding regions of two faces is performed using a Simulated Annealing (SA) based
approach with the Surface Interpenetration Measure (SIM). Similarity score is obtained by fusing the SIM values
of four regions using the summing rule. It is observed that the entire face and the elliptical nose area produce the
best individual performance, while combining all regions achieves the best overall performance. These methods
are relatively robust to varying facial expressions as rigid or semi-rigid regions of faces are selected for surface
registration. However, these regions are selected heuristically and may not be the optimal choice [
222
]. Besides,
stable segmentation of these regions are also highly challenging [9].
In addition to the above methods, feature matching has been used to further improve the surface registration performance. Chua et al. [45] extracted a set of sample points from the rigid region of a probe face and used the point signature [46] to encode the local patch around each sample point. Possible transformations between two facial scans are generated by matching point signature features and then further verified by point cloud registration. The identity of the probe face is determined by the gallery face with the largest registration rate. Dibeklioglu et al. [57] estimated nasal regions based on curvature values and face recognition is accomplished through registration strategies. On the Bosphorus 2D/3D face database, the proposed method achieves a 94.10% recognition rate for frontal facial expressions and a 79.41% recognition rate for pose variations.
4.3.4 Machine Learning based Algorithms. The similarity between two faces can further be predicted by a machine
learning method, such as Support Vector Machine (SVM).
Wang et al. [221] combined the point signatures from 3D feature points and Gabor filter responses from 2D feature points to obtain an integrated feature, and then used an SVM to achieve face recognition. Cooke et al. [51] first applied 18 Log-Gabor filters on an image and then divided the image into 75 semi-independent observations using 25 square windows and 3 scales. These observations are classified individually using a modified Mahalanobis Cosine metric and then combined at the score level using an SVM. It is reported that this method is more robust to occlusions, distortions and facial expressions. Wang et al. [222, 224] proposed a Collective Shape Difference Classifier (CSDC) to achieve high performance in both recognition rate and computational efficiency. They first generated a Signed Shape Difference Map (SSDM) between two aligned 3D faces as an intermediate representation for shape comparison. Three features including the Haar-like feature, the Gabor feature, and the Local Binary Pattern (LBP) are then extracted from SSDMs to encode the local similarity between facial shapes. These features are further selected using a boosting algorithm to build three CSDCs, which are finally fused to perform 3D face recognition. This method is also very efficient, taking about 3.6s for a recognition against a gallery of 1000 faces. Li
et al. [130] utilized sparse representation and low-level geometric features for 3D face recognition. To collect such features, a uniform remeshing scheme is first applied across 3D faces. Then, all low-level geometric features are ranked according to their sensitivity to expressions. The features relatively insensitive to expressions form a descriptor, which is referred to as the Expression-Insensitive Descriptor (EID). For face recognition, both the gallery and probe faces are represented by EIDs, and face recognition is accomplished under the framework of sparse representation. Li et al. [125] utilized both depth and RGB images to perform face recognition based on multi-modal sparse coding techniques, and achieved state-of-the-art performance on the CurtinFaces dataset. Mantecón et al. [146] specifically designed a Depth Local Quantized Pattern (DLQP) descriptor to capture the depth characteristics of human faces, and then utilized an SVM classifier to perform face recognition.
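The "hand-crafted depth feature + SVM" recipe shared by several of these methods can be sketched as follows. The descriptor here is a plain LBP histogram rather than the specific SSDM or DLQP features of [222, 146], and depth_maps and labels are hypothetical training arrays of pre-aligned facial depth images.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_histogram(depth_map, P=8, R=1):
    """Uniform LBP codes of a depth image, pooled into a normalized histogram."""
    codes = local_binary_pattern(depth_map, P, R, method="uniform")
    hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
    return hist

def train_depth_svm(depth_maps, labels):
    X = np.stack([lbp_histogram(d) for d in depth_maps])
    clf = SVC(kernel="rbf", C=10.0)   # classifier predicting identities from depth features
    return clf.fit(X, labels)
```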
4.3.5 Summary. In local patch based methods, the 3D face is represented by discriminative feature descriptors extracted from the local geometric structures of sub-regions or of the neighborhoods of repeatable keypoints. Thus, local patch based methods are more robust to challenges such as facial expressions, occlusions and pose variations. However, there are also several considerations when designing a new local patch based method. First, the sub-regions or keypoints must be distributed evenly over the whole 3D face, so as to capture as much local structural information as possible. Second, the feature descriptors must be discriminative enough to describe the intrinsic geometric properties of the local patches, especially in the presence of facial expressions and pose variations. Third, the distance metrics or the classifier of feature descriptors must be specifically designed.
4.4 Holistic Algorithms
These algorithms perform 3D face recognition using the information of the whole face.
4.4.1 Statistical Algorithms. Chang et al. [40] applied the PCA technique to both 2D and 3D facial data for face recognition. An R1RR of 83.7% was achieved with the 3D modality and an R1RR of 92.8% with the 2D+3D fusion approach on a dataset containing 166 subjects. Pan et al. [170] parameterized a 3D facial surface into an isomorphic 2D planar circle to preserve the intrinsic geometrical properties. The relative depth values of facial points are mapped and eigenface analysis is performed on the mapped depth image. Mousavi et al. [160] considered the nose tip as the reference point, and normalized the 3D face shape into an image with a standard size. Then, two-dimensional PCA (2D PCA) is applied on the normalized image and the eigenvectors corresponding to the largest eigenvalues are used as the feature vectors of the 3D facial shape. Face recognition is finally conducted using an SVM classifier. Al-Osaimi et al. [4] learned the patterns of expression deformations from shape residues between non-neutral and neutral scan pairs through PCA. The eigenvectors corresponding to the top eigenvalues construct the subspace representing the large expression deformations. In the test stage, the shape residue between the probe face scan and the neutral scan is also projected onto the constructed subspace. The gallery scan with the minimal similarity measure is considered to be the match of the probe scan. Haar et al. [91] utilized PCA to model one neutral face and six neutral-to-expression models for the expressions of anger, disgust, fear, happiness, sadness and surprise. For face matching, all seven models are fitted to the scans in the dataset and three feature vectors of model coefficients are obtained to determine the similarity of faces. The PCA based 3D face recognition approach is also used in [8, 42, 98, 100, 147, 216, 217, 230].
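A minimal eigenface-style sketch of this PCA pipeline on depth images is shown below; depth_images (N, H, W) and labels are hypothetical arrays of pre-aligned, cropped facial depth maps, and a nearest-neighbour rule in the subspace stands in for the various classifiers used above.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_eigenfaces(depth_images, n_components=50):
    """Fit a PCA subspace ("eigenfaces") to flattened depth images."""
    X = depth_images.reshape(len(depth_images), -1).astype(np.float64)
    pca = PCA(n_components=n_components).fit(X)
    return pca, pca.transform(X)              # model + gallery features

def identify(pca, gallery_feats, gallery_labels, probe_image):
    q = pca.transform(probe_image.reshape(1, -1))
    dists = np.linalg.norm(gallery_feats - q, axis=1)   # nearest neighbour in the subspace
    return gallery_labels[int(np.argmin(dists))]
```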
Tsalakanidou et al. [215] first applied the Discrete Cosine Transform (DCT) to both the depth image and the color image of a face, and then used a Hidden Markov Model (HMM) to perform face verification. It is observed that a significant improvement can be obtained using both color and depth information. Cook et al. [52] first partitioned the information of a 3D facial image into frequency bands using the Discrete Wavelet Transform (DWT) or DCT, and then projected each band into a PCA or LDA subspace. The projections in that subspace are finally compared and fused using the Mahalanobis cosine metric. Xu et al. [227] utilized the information from depth and intensity images, and described each individual with local features extracted using a 2D Gabor filter. To reduce
the dimensionality of the extracted features, a novel hierarchical feature selection scheme based on LDA and AdaBoost learning is proposed to select the most effective and robust features. In addition, LDA has also been investigated for 3D face recognition in [97].
Mpiperis et al. [162] used bilinear models to model a 3D facial surface as the interaction of expression and identity components. They first used an elastically deformable model to establish correspondences between a set of 3D faces, and then used bilinear models to decouple the facial expression and identity components. Consequently, both expression-invariant face recognition and identity-invariant expression recognition can be jointly achieved. Huang et al. [104] used the histograms of geometrical features (e.g., depth, surface normal, gradient, and curvature) and 3D Local Binary Patterns (LBPs) to represent the depth image of a facial scan for face recognition. It is observed that the combination of these two features can improve the 3D face recognition performance. Liu et al. [135] characterized the details of a 3D facial surface by the energies contained in spherical harmonics with different frequencies. Specifically, the 3D facial point cloud is first aligned and projected onto spherical coordinates, and a 2D Surface Depth Map (SDM) of the 3D facial surface is then generated. Based on the SDM representation, each 3D facial surface is characterized by the energies at different frequencies of the spherical harmonics. The energies at the low frequencies capture the global shape of the facial surface, whereas the energies at the high frequencies capture the facial surface details. Finally, a subset of the most discriminative features is selected based on the training data for further classification. Smeets et al. [199] proposed an isometric deformation model based on the geodesic distance matrix to deal with expression variations. First, the region in which the vertices have a geodesic distance to the nose tip smaller than a predefined threshold is cropped. Then, that region is downsampled to the same number of points, and a set of eigenvectors corresponding to the largest eigenvalues of the Geodesic Distance Matrix (GDM) is considered as the expression-invariant and permutation-invariant shape descriptor for each face. Finally, the dissimilarity measure for face recognition is computed according to the mean normalized Manhattan distance. Later, Smeets et al. [200] combined the isometric deformation model and the region-based method (which uses only the region around the nose) to perform face recognition.
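A minimal sketch of such a GDM-based descriptor is given below. Geodesic distances are approximated by shortest paths over the mesh edge graph, the cropped mesh is assumed to be connected and already downsampled, and vertices (N, 3) and edges (M, 2) are hypothetical inputs; the published method [199] differs in its exact geodesic computation and normalization.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def gdm_descriptor(vertices, edges, k=20):
    """Leading eigenvectors of the geodesic distance matrix as a shape signature."""
    i, j = edges[:, 0], edges[:, 1]
    w = np.linalg.norm(vertices[i] - vertices[j], axis=1)     # edge lengths
    n = len(vertices)
    graph = csr_matrix((np.r_[w, w], (np.r_[i, j], np.r_[j, i])), shape=(n, n))
    gdm = shortest_path(graph, method="D", directed=False)    # graph geodesics (assumes a connected mesh)
    vals, vecs = np.linalg.eigh(gdm)                          # GDM is symmetric
    order = np.argsort(np.abs(vals))[::-1][:k]                # k largest-magnitude eigenvalues
    return vecs[:, order] * np.abs(vals[order])
```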
4.4.2 Surface Registration based Algorithms. These methods are usually time-consuming due to the use of surface
registration algorithms [9].
Cook et al. [53] first used the ICP algorithm to register a probe face and a gallery face; the registration errors are then modeled by Gaussian Mixture Models (GMMs) to differentiate intra-personal faces from extra-personal faces. Irfanoglu et al. [106] first automatically extracted several landmarks on a 3D face and then established dense point correspondences between a probe face and a gallery face using the TPS warping algorithm. The Euclidean norm between two registered 3D facial scans is used for recognition. Lu et al. [137] first detected multiple feature points on facial scans to achieve coarse registration between a probe face and a gallery face; fine registration is then performed using a hybrid ICP algorithm. A combined metric using surface matching, texture matching, and shape index matching is finally used for 3D face recognition. Lu et al. [138, 142] further integrated range and texture information for 3D face recognition using the ICP based surface registration algorithm. Faltemier et al. [69] used multi-instance enrollment to deal with facial expression variations for 3D face recognition. Particularly, a probe face is matched with multiple gallery faces of a subject using the ICP algorithm, and the minimum Root Mean Square (RMS) error is considered as the distance between the probe and the subject. It is reported that using multiple scans to enroll a person in the gallery can improve the face recognition performance. Mahoor et al. [145] first extracted ridge points on the facial surface based on principal curvatures and then constructed a 3D binary image called a ridge image based on these ridge points. Face recognition is finally accomplished through robust Hausdorff distance or iterative closest point (ICP) algorithms. ICP based surface registration algorithms have also been used in [140, 148, 173] for 3D face recognition. Besides, Russ et al. [187] used a Hausdorff distance based iterative registration algorithm to align two 3D facial scans for face recognition.
Lu et al. [141] first extracted a number of landmarks from each face, and learned 3D facial deformations from a control group containing neutral and non-neutral expression facial scans. Deformed models with synthesized expressions are then generated by transferring the deformations to the 3D neutral facial scans in the gallery. A probe face is finally recognized by matching the facial scan with the deformed models in the gallery. This method is able to perform face recognition under facial expression and pose variations. However, it is time-consuming and requires manual operation for landmark extraction. Gökberk et al. [80] systematically compared different face registration algorithms (including ICP and TPS), different 3D facial features (including point coordinates, surface normals, curvatures, depth images, and profile curves), and different decision-level fusion approaches (e.g., fixed rules, voting schemes, rank-based combination rules) for face recognition. It is observed that face registration without warping provides more discriminatory information, surface normals produce the best recognition performance among the features, and the fusion schemes further improve the recognition accuracy. Mohammadzade et al. [156] combined both the Euclidean distance and the normal distance to find the closest point pairs between the input face and the reference face. Based on the singular value decomposition (SVD) algorithm, the rotation matrix and the translation vector are computed from the cross correlation matrix. The obtained alignment matrix between the input face and the reference face results in a more accurate correspondence between their points. The above process is repeated until no more significant rotation is obtained. These accurately aligned point pairs are finally used for 3D face recognition based on discriminant analysis methods.
4.4.3 2D Parameterization based Algorithms. Bronstein et al. [30] considered facial expressions as isometric transformations and transformed a 3D face to a canonical image using Multi-Dimensional Scaling (MDS). 3D face recognition is then achieved using eigenforms of both the texture and the canonical images, or using the high-order moments of canonical images [32]. This method achieves accurate face recognition results under different facial expressions [30]. Later, Bronstein et al. [33] embedded a 3D facial surface into another face to perform partial isometry-invariant face recognition. Besides, Bronstein et al. [31, 34] generated an isometry-invariant representation by transforming a 3D face into a spherical canonical image, resulting in an improved recognition performance compared to flat embedding. These methods mainly work on frontal facial scans and assume that the mouth is closed under different facial expressions [141]. A limitation of canonical image based methods is that accurate model cropping and topological consistency are required for geodesic distance computation, and these methods are also very time-consuming [9].
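The canonical-form construction at the heart of these methods can be sketched in a few lines: pairwise geodesic distances (e.g., computed as in the GDM sketch above) are embedded with metric MDS, so that near-isometric expression deformations have little effect on the resulting coordinates. The (N, N) matrix geodesic_dists is a hypothetical input.

```python
from sklearn.manifold import MDS

def canonical_form(geodesic_dists, dim=3):
    """Embed pairwise geodesic distances into a low-dimensional 'canonical' shape."""
    mds = MDS(n_components=dim, dissimilarity="precomputed", random_state=0)
    return mds.fit_transform(geodesic_dists)   # (N, dim) expression-robust coordinates
```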
Passalis et al. [176] first fitted the Annotated Face Model (AFM) to a 3D facial scan and used the UV parameterization of the fitted AFM to obtain three deformation images. A wavelet transform is then used to extract a biometric signature from each deformation image, and face recognition is finally performed by comparing the biometric signatures of probe and gallery faces. Kakadiaris et al. [119] represented facial scans with an Annotated Face Model (AFM), which maps all vertices of the model's surface from $\mathbb{R}^3$ to $\mathbb{R}^2$ and vice versa based on a continuous global UV parameterization. Then, the fitted model is converted into a 2D geometry image to encode the surface information of the model. From the geometry image, a normal map image, which distributes the information evenly among its three components, is constructed. Finally, these images are analyzed using Haar and Pyramid transforms and the spectral coefficients are used for comparison between different subjects. Mpiperis et al. [161] proposed a geodesic polar parameterization for 3D facial surfaces. Specifically, a point on a 3D face is represented by a path length and a pole angle, where the path length is the geodesic distance between the pole (e.g., the nose tip) and the point, and the pole angle is the directed angle between the geodesic path linking the pole and the point and a reference geodesic path ending at the pole. Using this parameterization, a 2D representation (namely, geodesic polar images) can be obtained for each 3D face by mapping a specific surface attribute (e.g., curvature, depth). 3D face recognition is finally achieved by using the eigenface method on these geodesic polar images. This method can handle surface deformations caused by facial expressions. However, it needs to detect the lips for faces with an open mouth [34]. Al-Osaimi et al. [3] computed 11 local rank-0 tensor fields from two local neighborhoods of the
vertex for each mesh vertex, and computed 3 global rank-0 tensor fields from the cropped face. All of these tensor fields are invariant to rigid transformations, and are then integrated into multiple 2D histograms of the surface area. Finally, the PCA coefficients of the 2D histograms are concatenated into a single feature vector to represent the face surface. Dutta et al. [61] extracted features from the complementary components of range facial images. First, each range facial image is decomposed into four basic components according to the first-order partial derivatives along the X and Y axes, respectively. Then, four hybrid components are linearly generated based on these four basic components, and all eight components are fused through a genetic algorithm. To select useful features from the fused feature vectors, a two-stage particle swarm optimization (PSO) algorithm is adopted to maximize the recognition rate and minimize the number of features. Final face recognition is performed using an SVM classifier.
4.4.4 3D Morphable Model based Algorithms. The 3D Morphable Model (3DMM) has also been investigated as an intermediate means for 3D face recognition. For example, Amberg et al. [7] proposed an expression and pose invariant 3D Morphable Model (3DMM) by removing pose and expression components during the non-rigid ICP based 3DMM fitting process. Paysan et al. [178] utilized the generative Basel Face Model (BFM) to model face shapes and textures, and the similarity of two faces is measured according to the angle between the coefficients of the BFM in Mahalanobis space. Ter Haar and Veltkamp [210] first constructed a 3D Morphable Model based on the USF HumanID 3D Face Database [20], and each 3D face scan is fitted to the 3DMM through a global-to-local fitting scheme. To obtain a precise fit to the model, the authors also proposed to fit the face scan to seven predefined face components and blend the borders of these components through a post-processing step. Finally, they performed 3D face recognition on the UND datasets [42] using the distances between 15 facial landmarks and the distances between 135 sample points on three facial curves. The recognition results show that the method based on seven face components achieves the best performance, demonstrating that making the recognition method invariant to facial expressions can increase recognition performance. Blanz et al. [19] proposed to fit facial scans to the 3DMM by simultaneously optimizing the shape, texture, pose and illumination. 3D face recognition is performed using the scalar product between two 1000-dimensional coefficient vectors.
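Coefficient-space matching of this kind amounts to an angle (cosine) comparison of the fitted 3DMM coefficients after whitening by the model's per-component variances. The sketch below assumes fitting has already been done; coeffs_a, coeffs_b, and component_vars are hypothetical.

```python
import numpy as np

def mahalanobis_cosine(coeffs_a, coeffs_b, component_vars):
    """Cosine of the angle between two 3DMM coefficient vectors in Mahalanobis space."""
    a = coeffs_a / np.sqrt(component_vars)   # whiten each coefficient by its model variance
    b = coeffs_b / np.sqrt(component_vars)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```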
4.4.5 Summary. Although holistic methods have been extensively investigated in the past years, with acceptable performance achieved on different datasets, they can only handle situations where the whole facial surface is available. These methods cannot be used when some parts of the facial surface are missing, e.g., due to occlusions or pose variations. The evaluation results in Section 6 also demonstrate the above observations.
5 DEEP LEARNING FOR 3D FACE RECOGNITION
Due to the availability of massive training data, deep learning-based methods have shown remarkable performance in the field of 2D face recognition [82]. However, because of the lack of large-scale 3D face datasets, deep learning-based 3D face recognition techniques are still in their infancy. A 3D face can be represented with different types of representations, such as 2D representations (e.g., projected views, depth images) and 3D representations (e.g., point clouds, meshes, voxels). Different representations usually require different types of data processing and face recognition techniques. Therefore, we categorize these methods into learning based on 2D representations, learning based on 3D representations, and learning based on disentangled representations.
5.1 Learning based on 2D Representations
To fully utilize the achievements of 2D deep learning-based methods for 3D face recognition, many methods have been proposed to project 3D faces onto 2D images, and then utilize mature 2D deep learning-based face recognition techniques to perform 3D face recognition. Kim et al. [120] first pre-trained a convolutional neural network (CNN) on a large-scale 2D face dataset, and then fine-tuned the network with expression and pose
variations augmented 3D facial scans. Each 3D facial scan is orthogonally projected onto a 2D depth map, and hard occlusions are added by randomly removing patches from the converted depth map. The fine-tuned CNN is used as the feature extractor, and the similarity between the gallery and the probe set is finally computed based on the learned features. Li et al. [127] projected each 3D facial surface onto a 2D plane and three images of the normal components are estimated based on a local plane fitting method. Then, to generate a deep normal representation, each normal image is fed into a deep face net pre-trained on a 2D face dataset. Finally, a location-sensitive sparse representation classifier is proposed to emphasize the importance of different facial parts. Gilani et al. [77] proposed a Deep Landmark Identification Network (DLIN) with a binary classification loss to detect 11 facial landmarks. The training dataset with known locations of landmarks is synthetically generated using the commercial software FaceGen and contains 3D faces augmented with various shapes. Specifically, variations from age, masculinity/femininity, weight, height, four different facial expressions (surprise, happiness, fear, and disgust), and five different poses (frontal, ±15° in pitch and ±15° in roll) are considered. Each generated 3D face is converted to a spherical representation, and three channels (i.e., depth, azimuth and elevation) of images are generated as the input of DLIN. Based on five detected fiducial landmarks (i.e., the nose tip, the upper and lower lip centers, and the outer eye corners), each 3D face is segmented into five regions based on geodesic level set curves. Then, discriminative keypoints are extracted in each region and dense correspondences across faces are obtained to generate a Region based 3D Deformable Model (R3DM). 3D face recognition is performed by minimizing the cosine distance between the R3DM model parameters of probe and gallery faces. Borghi et al. [22] proposed a depth-based face verification network, JanusNet, which contains three Siamese modules (for depth, hybrid and RGB images) with the same architecture. Specifically, the depth and hybrid Siamese networks take depth image pairs as their input, while the RGB Siamese network takes RGB image pairs as its input. During the training phase, the hybrid Siamese network is trained based on the loss of the RGB Siamese network, which forces the features learned by the hybrid network to be similar to the features of the RGB network. During the test phase, the RGB Siamese network is not employed, i.e., the final face verification result is based only on the depth and hybrid Siamese networks. Later, Borghi et al. [23] took depth maps as the input of a fully convolutional network for 3D face recognition. Random horizontal flips with probability 0.5 and random rotations in the range of [-5°,+5°] are adopted to augment the training data. Xu et al. [232] fused the depth map and texture map
to learn features through a CNN, and performed 3D face recognition through a CNN-based twin neural network. Feng et al. [73] utilized two deep CNNs to learn features from 2D images converted from facial color images and point clouds, and the two learned features are then fused as the input to the face recognition network. Olivetti et al. [169] used a 2D image with three channels (depth, shape index, and curvedness) to represent a 3D facial surface, and then fed these images into a MobileNetV2 architecture to perform 3D face recognition. Three kinds of data augmentation strategies (clockwise rotation of 25°, counterclockwise rotation of 40°, and horizontal mirroring) are adopted during training. Dutta et al. [60] proposed to use an unsupervised deep learning framework for 3D face recognition. First, the input 3D point clouds are aligned to the frontal pose and converted to 2.5D depth images. Features are then learned using a sparse principal component analysis network (SpPCANet), and finally classified using a linear SVM based classifier. Hariri and Zaabi [95] proposed a lightweight deep residual feature quantization method for 3D face recognition. After preprocessing steps such as cropping and denoising, 3D faces are transformed into 2D depth images and fed into a pretrained ResNet-50 network [96]. Radial Basis Function (RBF) neurons are then applied to quantize the learned features and face recognition is finally performed using an SVM classifier.
To address the low-quality 3D face recognition problem, Tan et al. [209] proposed a face recognition framework specifically designed for low-quality 3D data. Based on ResNet [96], a deep registration network (DRNet) is proposed to align a sequence of low-quality data, and a deep convolutional network (FRNet) is proposed to learn deep features from high-quality dense 3D point clouds fused from sequentially registered sparse low-quality data. To simulate the actual distribution of low-quality moving faces from dense and clean facial scans for DRNet,
10% of random noisy points with Gaussian distribution N(0, 4) and random poses (roll angles in the range of [-45°,45°], pitch angles in the range of [-20°,20°], and yaw angles in the range of [-30°,30°]) are first added to the dense facial scans. Then, the augmented dense facial scans are projected onto a 2D plane (which is divided into 1000 grids), and the sparse facial scans are obtained by randomly selecting one point from each grid. The above augmentation process is repeated 6 times to obtain a sequence of sparse facial data. Mu et al. [163] proposed a lightweight CNN architecture, Led3D, for 3D face recognition on low-quality depth images, and constructed a finer and larger dataset for training the deep network. Led3D utilizes four convolutional layers and a Multi-Scale Feature Fusion (MSFF) module to learn a discriminative feature representation of low-quality face data. A Spatial Attention Vectorization (SAV) module is used to capture the importance of different spatial clues in a face. Experiments demonstrate that Led3D achieves state-of-the-art performance on the low-quality 3D face recognition dataset Lock3DFace [241], and can also operate at a very high speed of 136 fps on a Jetson TX2. To train the Led3D model, pose variations (pitch angles in the range of [-40°,40°] and yaw angles in the range of [-60°,60°] with an interval of 20°), shape jittering with random Gaussian noise, and shape scaling are adopted to augment the training data. Lin et al. [131] proposed a multi-quality fusion network, MQFNet, to enhance face recognition performance. First, high-quality facial depth images are generated from low-quality depth images based on the pix2pix network [108], and the training of the network is supervised by these high-quality depth images. Then, the generated high-quality depth images and their corresponding low-quality depth images are fed into a multi-quality fusion network with two identical pipelines to learn global discriminative facial representations. To train MQFNet, in addition to the same pose variations as in [163], scale augmentation and occlusion augmentation are also adopted in the data augmentation step.
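A minimal sketch of this kind of low-quality data simulation is given below: a dense scan is perturbed with Gaussian noise on a random subset of points, rotated to a random pose, and then sparsified by keeping one point per cell of a coarse X-Y grid. Only a yaw rotation is shown, and the default noise level, angle range, and grid size are illustrative rather than the exact settings of [209] or [163].

```python
import numpy as np

def simulate_low_quality(points, noise_ratio=0.1, sigma=2.0, grid=32, rng=None):
    """Turn a dense (N, 3) facial scan into a noisy, sparsely sampled scan."""
    rng = rng or np.random.default_rng()
    pts = points.copy()
    # add Gaussian noise to a random subset of the points
    idx = rng.choice(len(pts), int(noise_ratio * len(pts)), replace=False)
    pts[idx] += rng.normal(0.0, sigma, size=(len(idx), 3))
    # random yaw rotation about the vertical axis (pitch/roll are handled analogously)
    yaw = np.deg2rad(rng.uniform(-30.0, 30.0))
    c, s = np.cos(yaw), np.sin(yaw)
    pts = pts @ np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]]).T
    # sparsify: keep one point per occupied cell of a coarse X-Y grid
    lo = pts[:, :2].min(0)
    cell = ((pts[:, :2] - lo) / (np.ptp(pts[:, :2], axis=0) + 1e-9) * (grid - 1)).astype(int)
    _, keep = np.unique(cell[:, 0] * grid + cell[:, 1], return_index=True)
    return pts[keep]
```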
Cai et al. [36] constructed a combined training dataset from four public 3D datasets (FRGC v2.0, Bosphorus, BU-3DFE, and 3D-TEC) and three in-house datasets with data augmentation, and learned features from range images of four overlapping facial component patches based on improved residual networks [96]. The final representation of a facial surface is defined as the concatenation of the feature vectors from the four facial patches. During training, five random poses in each rotation angle in the range of ±10° and another five arbitrary poses are first adopted to augment the pose variations of the 3D faces. Then, transformation augmentations (including minor random affine transformation, projection transformation, twisting, and horizontal flipping) and a resolution augmentation simulating different z-axis resolutions are also applied to the training images. Gilani et al. [76] constructed a large-scale training dataset with 3.1 million 3D facial scans of 100K individuals for 3D face recognition. They proposed two methods to generate training scans. The first method selects pairs of 3D faces with the maximum shape difference from a real dataset containing 1,785 individuals and generates 3D faces of 90,100 new individuals by varying the expressions of the face pairs based on a dense correspondence method [78]. The second method selects pairs with a smaller shape difference from a synthetic dataset containing 300 individuals, and finally generates 3D faces of 8,120 new individuals. All scans generated through these two methods are then transformed by pose variations and large occlusions. The 3D faces generated by the first method have maximum inter-person variations, whereas the 3D faces generated by the second method have smaller inter-person variations. Based on this dataset, a deep convolutional neural network named FR3DNet is proposed for 3D face recognition. Each facial scan is converted into an image with three channels corresponding to the depth, azimuth and elevation angles of the normal vector. To evaluate the performance of FR3DNet for both 3D face identification and verification, the authors also constructed a large-scale test dataset, LS3DFace, by merging several existing public datasets, such as FRGC v2.0 and Bosphorus.
These methods directly utilize 2D deep learning based face recognition techniques and achieve satisfactory performance on current datasets. However, geometric information is partially lost when the 3D facial data is transformed into a 2D representation.
5.2 Learning based on 3D Representations
Unlike learning methods based on 2D representations, many recent works directly learn facial representations from 3D facial data. Lin et al. [132] first extracted several local feature tensors from 3D face meshes and then fed them into a deep neural network for 3D face recognition. Specifically, salient points are first detected based on the meshSIFT algorithm [201], and three local features (i.e., shape index, slant angle, and relative positions to the salient point) are then extracted for each salient point and concatenated to represent the 3D face. Next, a 2D similarity tensor image is obtained through local feature tensor matching and fed into a ResNet for 3D face classification. To address the lack of large-scale 3D face datasets, a large number of feature tensors are generated based on Voronoi diagrams instead of 3D face samples. Bhople et al. [18] directly took 3D facial point clouds as the input, and proposed a PointNet-CNN architecture to learn the global representation of the 3D face. Then, pairs of learned global features are fed into a Siamese network to calculate the similarity of the two input faces. PointFace [112] encodes pairs of input point clouds with two weight-sharing encoders. In the training stage, PointFace uses both a feature similarity loss and a softmax classification loss to obtain fine-grained representations, which minimizes the embedding distance between scans of the same individual and maximizes the embedding distance between different individuals. During training, one of several strategies (including random anisotropic scaling in the range [-0.66, 1.5], random translation in the range [-0.2, 0.2], and random rotation in the range [-90°, 90°] on yaw and [-30°, 30°] on pitch) is selected to augment the training data, or the training data is kept unchanged. In the test stage, the encoders produce embeddings for the probe and gallery faces, which are then used for 3D face recognition. To fully explore the advantages of contrastive learning and boost training, a pair selection strategy is also adopted to generate positive and negative pairs in the training stage. In [195], 3D face meshes are first voxelized at three different resolutions. Then, fuzzy C-means clustering is performed to unify the count of voxels into the same size, and a 3D voxel-based face reconstruction technique is applied to the clustered voxels. A deep learning framework consisting of variational autoencoders (VAEs) and a bidirectional long short-term memory (BiLSTM) network with a triplet loss is used to extract deep facial features. Finally, an SVM based classifier is used to perform gender, emotion, occlusion and person recognition. On the 4DFAB [44] dataset, 3D face recognition and verification are tested with a simple long short-term memory (LSTM) network. RP-Net [38] integrates the RoPS [86] descriptor into PointNet++ [184] to learn facial feature representations.
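The weight-sharing point-cloud embedding idea behind methods such as PointFace [112] can be illustrated with a small PyTorch sketch: a PointNet-style encoder maps each facial point cloud to a unit-length embedding, and verification thresholds the cosine similarity of two embeddings. The architecture, embedding size, and threshold are illustrative, not the published ones.

```python
import torch
import torch.nn as nn

class TinyPointEncoder(nn.Module):
    """A minimal PointNet-style encoder for (B, N, 3) facial point clouds."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.ReLU(),
        )
        self.head = nn.Linear(256, emb_dim)

    def forward(self, pts):
        x = self.mlp(pts.transpose(1, 2))       # per-point features: (B, 256, N)
        x = x.max(dim=2).values                 # symmetric max pooling over the points
        return nn.functional.normalize(self.head(x), dim=1)

def verify(encoder, cloud_a, cloud_b, threshold=0.6):
    """Decide whether two (N, 3) facial scans belong to the same person."""
    with torch.no_grad():
        ea = encoder(cloud_a.unsqueeze(0))
        eb = encoder(cloud_b.unsqueeze(0))
    return (ea * eb).sum().item() > threshold   # cosine similarity of unit embeddings
```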
Kacem et al. [117] proposed a dynamic 3D face verification network with a triplet loss. The local deformations in 3D face sequences are first encoded by Sparse Localised deformation Components (SPLOCs) [166], and then stacked into 2D arrays for temporal modeling. Finally, the stacked arrays are fed into a triplet loss network for the final facial embedding, and 3D face verification is performed by computing the cosine similarity between the output embeddings. Papadopoulos et al. [172] proposed a novel dynamic 3D face recognition framework named Face-GCN. First, 2D landmarks are extracted from 2D texture facial images and then mapped to 3D facial meshes to extract 3D facial landmarks. The midpoints of geodesic paths on the meshes between pairs of 3D landmarks are added as new 3D landmarks. Based on these landmarks, a spatial-temporal graph containing spatial edges and temporal edges is constructed. Spatial edges are used to connect landmarks according to a predefined neighborhood relationship, and temporal edges are used to connect the same landmarks across consecutive frames in the expression sequences. Finally, 3D face recognition is performed using a spatial-temporal graph convolution network. In the experiments, a cross-emotion protocol is adopted based on the dynamic 3D facial expression dataset BU4DFE [243], which takes three emotions for training and the other three for testing. Experimental results show that the proposed Face-GCN method achieves an average recognition accuracy of 88.45% on this challenging cross-emotion protocol.
Beneting from the achievements of 3D deep learning techniques [
87
], direct face representation learning from
3D data has developed quickly in recent years. However, the datasets used for the training of 3D face recognition
networks are still small, and the performance of existing networks is also limited.
5.3 Learning based on Disentangled Representations
Similar to the idea of decoupling identity attributes from other attributes such as poses and facial expressions, and modeling different attributes with linear combinations [7, 20, 21], some recent works learn non-linear latent representations based on deep learning [26, 113, 134, 186, 245], and perform 3D face recognition based on disentangled representations. Ranjan et al. [186] constructed a hierarchical Convolutional Mesh Autoencoder (CoMA) to learn non-linear representations for modeling 3D facial expressive variations. To train the CoMA network, a dataset consisting of 20,466 meshes with 12 classes of extreme expressions from 12 subjects is introduced. Experimental results demonstrate that CoMA achieves state-of-the-art performance with 75% fewer parameters than linear PCA models. Sun et al. [208] constructed two decoders to disentangle identity and expression latent representations in a variational autoencoder framework. The network is built on an attention based point cloud transformer [83] applied directly on unordered point clouds, and utilizes mutual information regularization on the identity decoder to better reconstruct the identity face. Based on these disentanglement learning achievements, Kacem et al. [118] first learned latent representations by applying a Graph Convolutional Autoencoder [186] on pairs of neutral and expressive facial meshes, and then translated expressive representations to neutral ones using a conditional Generative Adversarial Network (cGAN) [109]. Specifically, the latent representations of both neutral and expressive faces are first learned using spectral graph convolutions [35] at the encoding stage, and then mapped back to the same neutral face mesh at the decoding stage. To map the expressive latent representation to its corresponding neutral latent representation, a translation function is learned by constraining the output and the distribution of the output to be close to the counterpart of the neutral latent representation. Finally, the translated expressive latent representation and the neutral representation are fed into a network with two fully-connected layers to perform 3D face recognition.
In summary, disentanglement learning for 3D facial variations has achieved impressive results for 3D face recognition and has huge potential for many other applications due to its powerful representation ability.
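The core pattern of these methods, using only the identity part of a disentangled latent code for recognition, can be sketched as follows. The backbone below is a stand-in (the published methods use mesh or point-cloud transformers and graph convolutions), the input is assumed to be a flattened fixed-topology face with 1024 vertices, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class DisentangledEncoder(nn.Module):
    """One encoder, two latent heads: identity and expression codes."""
    def __init__(self, in_dim=3 * 1024, id_dim=64, expr_dim=16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                      nn.Linear(512, 256), nn.ReLU())
        self.id_head = nn.Linear(256, id_dim)       # identity latent code
        self.expr_head = nn.Linear(256, expr_dim)   # expression latent code

    def forward(self, flat_face):                   # flat_face: (1, in_dim)
        h = self.backbone(flat_face)
        return self.id_head(h), self.expr_head(h)

def identity_similarity(encoder, face_a, face_b):
    """Compare two faces using only their identity codes."""
    with torch.no_grad():
        id_a, _ = encoder(face_a)
        id_b, _ = encoder(face_b)
    return nn.functional.cosine_similarity(id_a, id_b).item()
```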
6 PERFORMANCE COMPARISON
In this section, comparative results of current approaches with respect to facial expressions, pose variations, and occlusions are presented. Note that, to achieve a fair comparison, all results are directly taken from the referenced works.
6.1 Comparative Results under Facial Expressions
In this section, the performance of current algorithms under facial expressions is evaluated on the FRGC v2.0, BU-3DFE, Bosphorus and Gavab datasets. As described in Section 2.2, all these datasets contain various types of non-neutral facial expressions. Specifically, the FRGC v2.0 dataset contains 1597 non-neutral scans with various types of facial expressions. Such a large number of non-neutral facial scans makes the FRGC v2.0 dataset very popular for evaluating the robustness of algorithms to facial expressions.
Comparative experiments are conducted under different experimental settings for the face verification and identification tasks. For the 'All vs All' experiment, both the gallery set and the probe set contain all scans in the dataset. For the 'Neutral vs All', 'Neutral vs Neutral', and 'Neutral vs Non-neutral' experiments, the gallery set usually contains one neutral scan for each subject, and the probe set contains all the remaining scans for the corresponding experiment. For example, in the FRGC v2.0 dataset, the gallery set contains 466 neutral scans, while the probe set contains 1944 neutral scans, 1597 non-neutral scans, and 3541 scans in the 'Neutral vs Neutral', 'Neutral vs Non-neutral', and 'Neutral vs All' experiments, respectively. Some works [167, 205] use the first image of each subject to form the gallery set. However, not all of the first images in FRGC v2.0 are neutral images. Therefore, the experimental results from these works are not included in this paper.
Table 2. VR at 0.1% FAR results under facial expressions. 'A vs A', 'N vs A', 'N vs N', and 'N vs NN' stand for 'All vs All', 'Neutral vs All', 'Neutral vs Neutral', and 'Neutral vs Non-neutral', respectively.
Methods Modality Dataset ROC I ROC II ROC III A vs A N vs A N vs NN N vs N
Landmark based
Algorithms [105] 3D FRGC V2.0 - - 86.9% - - - -
[105] 3D+2D FRGC V2.0 - - 96.8% - - - -
Curve based
Algorithms
[10] 3D FRGC V2.0 - - - 81.2% 95.5% 91.4% 97.7%
[59] 3D FRGC V2.0 - - 97.14% 93.96% - - -
[123] 3D FRGC V2.0 - - 96.7% - - 97.8% -
[116] 3D FRGC V2.0 - - 99.9% 98.9% 99.9% 98.9% 99.9%
[115] 3D FRGC V2.0 - - 96.2% 99.7% 99.5% 99.7% 99.8%
[115] 3D BU-3DFE - - - 99.6% 99.5% - -
[115] 3D Bosphorus - - - - 99.1% 98.9% 99.9%
Local Patch based
Algorithms
[51] 3D FRGC V2.0 93.71% 92.91% 92.01% 92.31% 95.81% - -
[65] 3D FRGC V2.0 - - 88.8% 87.5% 89.0% - 97.1%
[151] 3D FRGC V2.0 - - - - 98.5% 97.0% 99.4%
[151] 3D+2D FRGC V2.0 - - - - 99.3% 98.3% 99.7%
[133] 3D FRGC V2.0 91.5% 91.0% 90.0% - - - -
[152] 3D FRGC V2.0 - - - - 97.4% 92.7% 99.9%
[152] 3D+2D FRGC V2.0 - - - - 98.6% 96.6% 99.9%
[67] 3D FRGC V2.0 - - 94.8% 93.2% 98.1% - -
[222] 3D FRGC V2.0 97.97% 98.01% 98.04% 98.13% 98.61% - -
[185] 3D FRGC V2.0 - - 96.6% 96.5% - - -
[205] 3D FRGC V2.0 94.6% 94.6% 94.6% 94.6% - - -
[168] 3D FRGC V2.0 96.2% 95.7% 95.2% - - - -
[103] 3D FRGC V2.0 95.1% 95.1% 95.0% 94.2% 98.4% 97.2% 99.6%
[167] 3D FRGC V2.0 96.2% 95.7% 95.2% - - - -
[201] 3D FRGC V2.0 - 78.97% 77.24% - - - -
[15] 3D FRGC V2.0 - - 86.6% - - - -
[206] 3D FRGC V2.0 99.3% 99.3% 99.3% 99.3% - - -
[62] 3D FRGC V2.0 - - 93.5% - - - -
[85] 3D FRGC V2.0 - - - - 99.01% 97.18% 99.9%
[124] 3D FRGC V2.0 - - - - 98.3% 96% 99.9%
[238] (HD) 3D FRGC V2.0 - - 91.1% - - - -
[238] (ICP) 3D FRGC V2.0 - - 94.5% - - - -
[124] 3D BU-3DFE - - - - 94.0% - -
[78] 3D FRGC V2.0 - - - - 98.7% 96.6% 99.9%
Holistic
Algorithms
[148] 3D+2D FRGC V2.0 - - - 93.5% 95.8% - 99.2%
[119] 3D FRGC V2.0 97.3% 97.2% 97.0% - - - -
[3] 3D FRGC V2.0 - - - - - - 95.37%
[4] 3D FRGC V2.0 94.55% 94.12% 94.05% - 98.14% 97.73% 98.35%
[227] 3D+2D FRGC V2.0 - - 95.3% - 97.5% - -
[145] 3D FRGC V2.0 90.69% 88.5% 85.75% - - - -
[91] 3D FRGC V2.0 - - - 87% - - -
[135] 3D FRGC V2.0 - - - 90% - - -
[156] 3D FRGC V2.0 - - 99.2% 99.6% - - -
[91] 3D BU-3DFE - - - 82% - - -
[91] 3D Gavab - - - 80% - - -
[135] 3D Bosphorus - - - 81.4% - - -
Deep Learning
based Algorithms
[36] 3D FRGC V2.0 - - 100% - 100% 100% 100%
[36] 3D Bosphorus - - - - 98.39% 98.30% 100%
[36] 3D BU-3DFE - - - - 98.92% - -
For face verication, the VRs at 0.1% FAR in dierent experimental settings are evaluated. The evaluation
results are shown in Table 2. For the FRGC v2.0 dataset, three additional experiments (ROC I, ROC II, and ROC
III) are also conducted. ROC I means that gallery and probe scans are collected within a semester, ROC II means
that gallery and probe scans are collected within a year, and ROC III means that gallery and probe scans are
collected in dierent semesters. For face identication, R1RR in dierent experimental settings is evaluated. The
evaluation results are shown in Table 3.
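For reference, both reported metrics are straightforward to compute from raw similarity scores; the minimal sketch below shows the verification rate (VR) at a fixed FAR from genuine/impostor score arrays and the rank-1 recognition rate (R1RR) from a probe-vs-gallery similarity matrix. Variable names are hypothetical.

```python
import numpy as np

def vr_at_far(genuine_scores, impostor_scores, far=1e-3):
    """Verification rate at the threshold that accepts a fraction `far` of impostors."""
    thr = np.quantile(impostor_scores, 1.0 - far)
    return float((genuine_scores > thr).mean())

def rank1_rate(similarity, probe_labels, gallery_labels):
    """Rank-1 recognition rate from a (num_probes, num_gallery) similarity matrix."""
    best = np.asarray(gallery_labels)[similarity.argmax(axis=1)]
    return float((best == np.asarray(probe_labels)).mean())
```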
Several observations can be derived from Tables 2 and 3:
• Local patch based algorithms have attracted the most research interest, and have achieved almost the best performance under different experimental settings. This is mainly because local patch based algorithms can use local surface features to handle facial expressions.
• Compared to the 'Neutral vs Neutral' experiments, the performance of current algorithms in the 'Neutral vs Non-neutral' experiments on existing datasets still needs to be improved. In addition, the types of facial expressions in existing datasets are limited. Thus, more attention should be paid to the design of facial expression challenges.
• Deep learning based 3D face recognition algorithms have developed very slowly due to the lack of large-scale datasets. Meanwhile, the performance of existing algorithms on small-scale datasets is saturated. Therefore, more large-scale datasets rich in facial expressions are needed.
6.2 Comparative Results under Pose Variations
The robustness of current algorithms to pose variations is evaluated on the Gavab and Bosphorus datasets. As described in Section 2.2, the Gavab dataset contains four types of pose variations, and the Bosphorus dataset contains 13 types of pose variations. Comparative experiments are conducted on the face identification task in terms of R1RR, and the results are shown in Table 4. For these two datasets, the gallery set contains one frontal scan with a neutral expression for each subject. For the Gavab dataset, the probe sets of 'looking down', 'looking up', 'right' and 'left' contain scans under these specific poses. For the Bosphorus dataset, the scans with pose rotations are further divided into four subsets: Yaw Rotations, Yaw Rotations 90, Pitch Rotations and Cross Rotations. 'Overall' in Table 4 means that all subsets of the probe set are used for these two datasets.
Several observations can be derived from Table 4:
• Local patch based methods are the most frequently used methods to deal with pose variations. This is because local patch based methods utilize only the local structural information for face recognition, which is more robust to pose variations than other types of methods.
• Current algorithms perform well on pitch rotations on both datasets. Note that the subset of pitch rotations of the Bosphorus dataset corresponds to the 'looking down' and 'looking up' subsets of the Gavab dataset.
• The performance on the subsets of yaw rotations is worse than on the subsets of pitch rotations and cross rotations. Current algorithms perform worst on the yaw rotation subsets of the Bosphorus dataset and the 'right' and 'left' subsets of the Gavab dataset.
In summary, pose variations with yaw rotations still raise great challenges for current 3D face recognition algorithms.
6.3 Comparative Results under Occlusions
The robustness of current algorithms to occlusions is evaluated on the Bosphorus dataset. As described in Section 2.2, the Bosphorus dataset contains four types of occlusions. Comparative experiments are conducted on the face identification task in terms of R1RR, and the results are shown in Table 5. Specifically, the gallery set contains one scan with a neutral expression for each subject, and the probe set contains 381 facial scans with occlusions.
Table 3. Identification results under facial expressions. A vs A’, ‘N vs A’, ‘N vs N’, and ‘N vs NN’ stand for ‘All vs All’, ‘Neutral
vs All’, ‘Neutral vs Neutral’, and ‘Neutral vs Non-neutral’, respectively.
Methods Modality Dataset R1RR (ROC III) R1RR (A vs A) R1RR (N vs A) R1RR (N vs NN) R1RR (N vs N)
Curve based
Algorithms
[59] 3D FRGC V2.0 - - 97.7% 96.8% 99.2%
[116] 3D FRGC V2.0 - 98.0% 96.9% 94.3% 95.9%
[115] 3D FRGC V2.0 - 99.3% 99.5% 99.6% 99.8%
[115] 3D Bosphorus - - 99.0% 99.0% 99.9%
[115] 3D BU-3DFE - 99.0% 99.6% - -
[59] 3D Gavab - - 96.99% 94.54% 100%
Local Patch based
Algorithms
[51] 3D FRGC V2.0 - - 94.63% - -
[151] 3D+2D FRGC V2.0 - - 97.37% 95.37% 99.02%
[152] 3D FRGC V2.0 - - 93.5% 86.7% 99.0%
[152] 3D+2D FRGC V2.0 - - 96.1% 92.1% 99.4%
[67] 3D FRGC V2.0 - - 98.1% - -
[222] 3D FRGC V2.0 - - 98.39% - -
[185] 3D FRGC V2.0 99.6% 99.7% - - -
[103] 3D FRGC V2.0 - - 97.6% 95.1% 99.2%
[11] 3D FRGC V2.0 - - 95.6% 92.8% 97.3%
[201] 3D FRGC V2.0 87.19% - - - -
[62] 3D FRGC V2.0 - - 97.9% 98.5% 98.45%
[85] 3D FRGC V2.0 - - 97.0% 94.0% 99.4%
[124] 3D FRGC V2.0 - - 96.3% 92.2% 99.6%
[203] 3D FRGC V2.0 - - 98.1% - -
[103] 3D Bosphorus - - 97.0% - -
[201] 3D Bosphorus - - 93.66% - -
[14] 3D Bosphorus - - 93.4% - 97.9%
[15] 3D Bosphorus - - 94.5% - 98.5%
[126] 3D Bosphorus - - 96.6% 98.8% -
[62] 3D Bosphorus - - 95.35% - -
[203] 3D Bosphorus - - 97.3% - -
[78] 3D Bosphorus - - 98.6% - -
[93] 3D BU-3DFE - - 84.8% - -
[14] 3D BU-3DFE - - 87.5% - -
[11] 3D Gavab - - - 96.17% 100%
[14] 3D Gavab - - - 94% 100%
[124] 3D Gavab - - 96.99% 95.08% 100%
[78] 3D FRGC V2.0 - - 98.5% 96.9% 99.9%
[94] 3D Gavab - - 97.81% 100% 100%
Holistic
Algorithms
[3] 3D FRGC V2.0 - - - - 93.78%
[4] 3D FRGC V2.0 - - 96.52% 95.2% 97.58%
[91] 3D FRGC V2.0 - 97% - - -
[161] 3D BU-3DFE - - 84.4% - -
[91] 3D BU-3DFE - 100% - - -
[145] 3D Gavab - - - - 95%
[91] 3D Gavab - 98% - - -
Deep Learning
based Algorithms
[127] 3D FRGC V2.0 - - 98.01% 96.29% 99.39%
[36] 3D FRGC V2.0 - - 99.94% 99.88% 100%
[120] 3D Bosphorus - - 99.24% 99.2% 100%
[127] 3D Bosphorus - - - 97.6% -
[36] 3D Bosphorus - - 99.75% 99.73% 100%
[77] 3D Bosphorus - - 98.1% 99.0% -
[120] 3D BU-3DFE - - 93% - -
[127] 3D BU-3DFE - - 96.1% - -
[36] 3D BU-3DFE - - 99.88% - -
Table 4. Comparative results under pose variations on the Gavab and Bosphorus datasets. 'd' means looking down, 'u' means looking up, 'r' means a sideways scan from the right, 'l' means a sideways scan from the left, 'yr' means yaw rotation, 'pr' means pitch rotation, 'cr' means cross rotation, and 'o' means overall scans. The types of methods (i.e., curve based (denoted by 'C'), local patch based (denoted by 'L'), holistic (denoted by 'H'), and deep learning based (denoted by 'D')) are also included in the tables.
(a) Evaluation Results on the Gavab Dataset
Method Type d u r l o
[9] C 93.3% 92.8% - - -
[145] H 85.3% 88.6% - - -
[103] L 96.72% 96.72% 78.69% 93.44% 91.39%
[59] C 100% 98.36% 70.49% 86.89% 96.99%
[11] L 96.72% 98.36% 81.97% 93.44% -
[14] L 95.1% 96.7% 83.6% 93.4%
[124] L 98.36% 98.36% - - -
[94] L 99.18% 98.36% 81.96% 83.60% -
(b) Evaluation Results on the Bosphorus Dataset
Method Type yr yr 90 pr cr o
[93] L - - - - 69.1%
[14] L 81.6% 45.7% 98.3% 93.4% -
[15] L 82.6% - 98.8% 95.3% -
[126] L 84.1% 47.1% 99.5% 99.1% 91.1%
[124] L 83.8% 47.4% 98.3% 98.6% 90.6%
[78] L 99.8% 95.2% 100% 99.1% 99.0%
[77] D 94.8% 86.2% 100% 98.6% 95.7%
From Table 5, we can observe that most current methods use local patch based approaches to handle occlusions. Local patch based methods utilize the local structural information of the 3D facial surface by extracting sub-regions or keypoints, and then perform face recognition by matching elaborately designed descriptors of these local structures. For example, Alyuz et al. [6] divided a whole 3D facial surface into four regions, and matched these four regions independently. In [85, 124, 126], a 3D facial surface is represented by a set of repeatable keypoints with elaborately designed descriptors, and face recognition is then accomplished based on these keypoints. In contrast, landmark based methods rely on anthropometric facial fiducial points on a face, curve based methods rely on geometric profiles or contours of a face, and holistic methods rely on the completeness of a 3D facial surface. Compared with these three types of methods, local patch based methods are more robust to occlusions.
7 CONCLUSION AND FUTURE WORK
This paper has presented a survey of the state-of-the-art 3D face recognition methods of the last twenty years. A comprehensive survey of preprocessing techniques such as nose tip detection, data filtering and pose normalization, as well as 3D face recognition methods, has been conducted. The performance of the current methods is evaluated on several challenging datasets under the taxonomy of facial expressions, pose variations and occlusions. In summary, the following conclusions can be made:
(i) For 3D face recognition, local patch based algorithms have attracted more attention than other types of methods. Benefiting from the development of local feature description methods in computer vision, local patch based algorithms can capture the details of 3D facial surfaces, and thus achieve more robust performance.
Table 5. Comparative results under occlusions on the Bosphorus dataset. The gallery set contains 105 scans (one neutral scan per person), and the probe set contains 381 scans with occlusions. The types of methods (i.e., curve based (denoted by ‘C’), local patch based (denoted by ‘L’), and deep learning based (denoted by ‘D’)) are also included in the table.
Method Type Eye Mouth Glasses Hair Overall
[6] L 93.6% 93.6% 97.8% 89.6% 94.12%
[59] C 97.1% 78% 94.2% 81% 87%
[14] L - - - - 93.2%
[15] L - - - - 95.8%
[126] L 100.0% 100.0% 100.0% 95.5% 99.2%
[85] L 96.19% 96.19% 99.04% 95.52% 96.85%
[124] L 90.5% 94.3% 96.2% 88.1% 92.7%
[78] L 99.0% 96.1% 100% 97.3% 98.1%
[77] D 100.0% 97.8% 100.0% 97.1% 98.9%
(ii) More attention has been paid to dealing with facial expressions, and less to the challenges of pose variations and occlusions. From the evaluation results, we can observe that current methods achieve promising results under facial expressions, while their performance under pose variations still needs to be improved, especially
for side scans from left or right. In addition, the datasets designed for occlusions and the occlusion types are
limited. More datasets containing various types of occlusions should be constructed in the future.
(iii) Due to the lack of large-scale datasets for the training of deep neural networks, deep learning based 3D
face recognition methods have developed very slowly compared to their 2D counterparts. As the largest 3D face dataset with real individuals, ND-2006 [66] only contains 13,450 scans of 888 individuals, which is much smaller than 2D face datasets such as the ones used by FaceNet [193] and VGG-Face [175]. That is mainly because it requires more effort to collect a large-scale 3D face dataset than a 2D face dataset, which can easily be obtained by crawling the web [76]. Although some methods have been proposed in recent years, they either utilize models pre-trained from
2D face datasets, or construct their own 3D datasets through face generation or data augmentation techniques.
For example, Gilani et al. [76] constructed a large-scale training dataset with 3.1 million 3D facial scans of 100K individuals for 3D face recognition. However, most of these individuals were generated synthetically [78]. Therefore, a large-scale 3D face dataset with real individuals is still highly needed by the community.
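As a small illustration of the data augmentation route mentioned above, the sketch below applies a random rigid perturbation, sensor-like noise, and a synthetic occlusion to a raw 3D face scan stored as an N×3 point array. The perturbation ranges, and the assumption that coordinates are in millimetres, are arbitrary illustrative choices rather than settings taken from any cited work.

```python
import numpy as np

def augment_scan(points, rng=None):
    """Return a randomly perturbed copy of an (N, 3) 3D face scan."""
    rng = rng or np.random.default_rng()
    # Random yaw rotation of up to +/- 30 degrees about the vertical axis.
    theta = rng.uniform(-np.pi / 6, np.pi / 6)
    rot = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(theta), 0.0, np.cos(theta)]])
    out = points @ rot.T
    # Additive Gaussian noise to mimic depth-sensor noise (assumed mm scale).
    out = out + rng.normal(scale=0.5, size=out.shape)
    # Remove a random spherical region to mimic occlusion or missing data.
    centre = out[rng.integers(len(out))]
    keep = np.linalg.norm(out - centre, axis=1) > 15.0
    return out[keep]
```

Augmented copies of enrolled scans can enlarge the training set of a deep network, although they cannot substitute for genuinely new identities.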
(iv) Most of these deep learning based methods convert 3D facial surfaces into 2D maps, and then utilize existing
2D face recognition networks to learn deep features. However, geometric information is lost during the conversion
from 3D data to 2D maps. Actually, an increasing number of deep learning methods have been proposed in the
last ve years to directly work on point clouds for various 3D vision tasks such as shape classication, object
detection, and point cloud segmentation [
87
]. Deep learning based 3D face recognition methods which work
directly with point clouds is a promising research direction.
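The following PyTorch sketch shows the basic form such a method can take: a PointNet-style encoder (shared per-point MLP followed by order-invariant max pooling) that maps a raw point cloud to a unit-length identity embedding, with matching done by cosine similarity. It is a minimal illustration under assumed layer sizes and names, not the architecture of any specific method discussed in this survey.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointFaceEncoder(nn.Module):
    """PointNet-style encoder mapping a (B, N, 3) face scan to a unit-length
    identity embedding, typically trained with a softmax or triplet loss."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.BatchNorm1d(256), nn.ReLU())
        self.head = nn.Linear(256, embed_dim)

    def forward(self, pts):
        feats = self.point_mlp(pts.transpose(1, 2))   # (B, 256, N)
        global_feat = feats.max(dim=2).values         # symmetric pooling over points
        return F.normalize(self.head(global_feat), dim=1)

# Matching: cosine similarity between probe and gallery embeddings.
encoder = PointFaceEncoder()
probe = torch.randn(1, 2048, 3)      # toy 3D face scan with 2048 points
gallery = torch.randn(1, 2048, 3)
score = (encoder(probe) * encoder(gallery)).sum(dim=1)
```

In practice, such encoders are trained on large labelled sets of scans (real or synthetic) and are often combined with sampling and local grouping strategies such as those of PointNet++ [184].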
(v) Disentanglement learning exhibits excellent performance for dealing with 3D facial variations such as poses
and expressions. It decouples the neutral latent representation from other facial attributes and thus makes 3D face
recognition highly robust to these facial attributes. Due to its powerful representation ability, disentanglement
learning also has great potential for other applications such as face reconstruction.
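A minimal sketch of this idea is given below, assuming each 3D face has already been registered to a common mesh topology and flattened into a vector of vertex coordinates. The encoder splits the latent code into an identity part and an expression part; reconstruction losses against the input and, when available, the subject's neutral scan push expression information out of the identity code, which is then used for matching. The class name, layer sizes, and code dimensions are assumptions for illustration, not those of any published model.

```python
import torch
import torch.nn as nn

class DisentangledFaceAE(nn.Module):
    """Toy autoencoder that separates identity and expression codes of a 3D face
    given as a flattened vector of registered vertex coordinates."""
    def __init__(self, in_dim, id_dim=64, exp_dim=32):
        super().__init__()
        self.id_dim = id_dim
        self.encoder = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                     nn.Linear(512, id_dim + exp_dim))
        self.decoder = nn.Sequential(nn.Linear(id_dim + exp_dim, 512), nn.ReLU(),
                                     nn.Linear(512, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        z_id, z_exp = z[:, :self.id_dim], z[:, self.id_dim:]
        recon = self.decoder(torch.cat([z_id, z_exp], dim=1))
        # "Neutralised" face: identity code with the expression code zeroed out.
        # A loss between this output and the subject's neutral scan encourages
        # z_id to be expression-invariant, so z_id can be matched directly.
        neutral = self.decoder(torch.cat([z_id, torch.zeros_like(z_exp)], dim=1))
        return z_id, recon, neutral
```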
Based on these observations and recent technological developments, several promising directions are worth considering for future work, for example:
(i) Bridging 3D face recognition and generative AI. Generative AI has demonstrated its capability in many
areas, with several systems already being introduced to the area of 3D model generation. For example, CLIP-Mesh
is able to generate textured 3D meshes from text descriptions, and Lumirithmic can generate 3D head meshes from
facial scans. Generative AI systems could potentially be used to mitigate the shortage of 3D facial data, providing deep learning algorithms with rich generated 3D face models covering different identities, poses, occlusions, expressions, and ethnicities.
(ii) Leveraging foundation models for 3D face recognition. Foundation models have dominated research in Natural Language Processing (NLP) and shown promising potential in computer vision tasks. With the help
of multi-modality foundation models, it is possible to boost the performance of 3D face recognition. However,
designing a foundation model architecture that is suitable for 3D data processing is still an open question. How to
leverage the knowledge embedded in other modalities (e.g., text, 2D face images) to improve 3D face recognition
performance also remains largely unexplored.
(iii) Achieving cross-resolution, cross-age, and cross-sensor 3D face recognition. It is very common that the
probe 3D faces and gallery 3D faces are acquired with different sensors at different times, and are represented with different resolutions and noise levels. How to achieve accurate, robust, and efficient 3D face recognition under such conditions is still an unsolved problem.
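One simple step in this direction is to bring probe and gallery scans to a comparable point density before matching, for example by voxel-grid downsampling as sketched below. This is a generic preprocessing illustration rather than a complete solution; the voxel size (assumed to be in millimetres) would need to be tuned to the sensors involved.

```python
import numpy as np

def voxel_downsample(points, voxel=2.0):
    """Replace all points falling in the same voxel by their centroid, so scans
    from different sensors end up with comparable point densities."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    counts = np.bincount(inverse).astype(float)
    centroids = np.zeros((counts.size, 3))
    for dim in range(3):
        centroids[:, dim] = np.bincount(inverse, weights=points[:, dim]) / counts
    return centroids
```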
ACKNOWLEDGMENTS
This work was partially supported by the National Key Research and Development Program of China (No.
2021YFB3100800), the National Natural Science Foundation of China (No. U20A20185, 61972435, 42271457,
62276176), the Guangdong Basic and Applied Basic Research Foundation (2022B1515020103), the Shenzhen
Science and Technology Program (No. RCYX20200714114641140), and the Australian Research Council (Grants
DP210101682 and DP210102674).
REFERENCES
[1]
A. F. Abate, M. Nappi, D. Riccio, and G. Sabatino. 2007. 2D and 3D face recognition: A survey. Pattern Recognition Letters 28, 14 (2007),
1885–1906.
[2]
A. Abbad, K. Abbad, and H. Tairi. 2018. 3D face recognition: Multi-scale strategy based on geometric and local descriptors. Computers
& Electrical Engineering 70 (2018), 525–537.
[3]
F.R. Al-Osaimi, M. Bennamoun, and A. Mian. 2008. Integration of local and global geometrical cues for 3D face recognition. Pattern
Recognition 41, 3 (2008), 1030–1040.
[4]
F. Al-Osaimi, M. Bennamoun, and A. Mian. 2009. An expression deformation approach to non-rigid 3D face recognition. IJCV 81, 3
(2009), 302–316.
[5]
S. Aly, A. Trubanova, L. Abbott, S. White, and A. Youssef. 2015. VT-KFER: A Kinect-based RGBD + Time dataset for spontaneous and
non-spontaneous facial expression recognition. In ICB. 90–97.
[6] N. Alyuz, B. Gokberk, and L. Akarun. 2008. A 3D face recognition system for expression and occlusion invariance. In BTAS. 1–7.
[7] B. Amberg, R. Knothe, and T. Vetter. 2008. Expression invariant 3D face recognition with a morphable model. In FG. 1–6.
[8]
C. BenAbdelkader and P. A. Griffin. 2005. Comparing and combining depth and texture cues for face recognition. Image and Vision
Computing 23, 3 (2005), 339–352.
[9]
S. Berretti, A. D. Bimbo, and P. Pala. 2006. Description and retrieval of 3D face models using iso-geodesic stripes. In ACM MIR. 13–22.
[10] S. Berretti, A. D. Bimbo, and P. Pala. 2010. 3D face recognition using isogeodesic stripes. IEEE TPAMI 32, 12 (2010), 2162–2177.
[11]
S. Berretti, A. D. Bimbo, and P. Pala. 2013. Sparse matching of salient facial curves for recognition of 3D faces with missing parts. IEEE
TIFS 8, 2 (2013), 374–389.
[12] S. Berretti, A. Del Bimbo, and P. Pala. 2012. Superfaces: A super-resolution model for 3D faces. In ECCV Workshops. 73–82.
[13]
S. Berretti, P. Pala, and A. D. Bimbo. 2014. Face recognition by super-resolved 3D models from consumer depth cameras. IEEE TIFS 9, 9
(2014), 1436–1449.
[14]
S. Berretti, N. Werghi, A. D. Bimbo, and P. Pala. 2013. Matching 3D face scans using interest points and local histogram descriptors.
Computers & Graphics 37, 5 (2013), 509–525.
[15]
S. Berretti, N. Werghi, A. D. Bimbo, and P. Pala. 2014. Selecting stable keypoints and local descriptors for person identification using
3D face scans. The Visual Computer (2014), 1–18.
[16] C. Beumier and M. Acheroy. 2000. Automatic 3D face authentication. Image and Vision Computing 18, 4 (2000), 315–321.
[17]
C. Beumier and M. Acheroy. 2001. Face verification from 3D and grey level clues. Pattern Recognition Letters 22, 12 (2001), 1321–1329.
[18]
A. R. Bhople, A. M. Shrivastava, and S. Prakasha. 2020. Point cloud based deep convolutional neural network for 3D face recognition.
Multimedia Tools and Applications (2020), 1–23.
[19] Volker Blanz, Kristina Scherbaum, and Hans-Peter Seidel. 2007. Fitting a Morphable Model to 3D Scans of Faces. In ICCV. 1–8.
[20] V. Blanz and T. Vetter. 1999. A morphable model for the synthesis of 3D faces. In SIGGRAPH. 187–194.
[21]
J. Booth, A. Roussos, S. Zafeiriou, A. Ponniah, and D. Dunaway. 2016. A 3D morphable model learnt from 10,000 Faces. In CVPR.
5543–5552.
[22]
G. Borghi, S. Pini, F. Grazioli, R. Vezzani, and R. Cucchiara. 2018. Face verification from depth using privileged information. In BMVC.
303.
[23] G. Borghi, S. Pini, R. Vezzani, and R. Cucchiara. 2019. Driver face verification with depth maps. Sensors 19, 15 (2019), 3361.
[24]
G. Borghi, M. Venturelli, R. Vezzani, and R. Cucchiara. 2017. POSEidon: face-from-depth for driver pose estimation. In CVPR. 5494–5503.
[25]
A. Y. Boumedine, S. Bentaieb, and A. Ouamri. 2022. An improved KNN classifier for 3D face recognition based on SURF descriptors.
Journal of Applied Security Research 0, 0 (2022), 1–19.
[26]
G. Bouritsas, S. Bokhnyak, S. Ploumpis, S. Zafeiriou, and M. Bronstein. 2019. Neural 3D morphable models: spiral convolutional
networks for 3D shape representation learning and generation. In ICCV. 7212–7221.
[27] K. W. Bowyer, K. Chang, and P. Flynn. 2004. A survey of approaches to three-dimensional face recognition. In ICPR. 358–361.
[28]
K. W. Bowyer, K. Chang, and P. Flynn. 2006. A survey of approaches and challenges in 3D and multi-modal 3D+2D face recognition.
CVIU 101, 1 (2006), 1–15.
[29]
M. D. Breitenstein, D. Kuettel, T. Weise, L. V. Gool, and H. Pfister. 2008. Real-time face pose estimation from single range images. In
CVPR. 1–8.
[30] A. M. Bronstein, M. M. Bronstein, and R. Kimmel. 2003. Expression-invariant 3D face recognition. In AVBPA. 62–70.
[31]
A. M. Bronstein, M. M. Bronstein, and R. Kimmel. 2005. Expression-invariant face recognition via spherical embedding. In ICIP, Vol. 3.
III–756.
[32] A. M. Bronstein, M. M. Bronstein, and R. Kimmel. 2005. Three-dimensional face recognition. IJCV 64, 1 (2005), 5–30.
[33] A. M. Bronstein, M. M. Bronstein, and R. Kimmel. 2006. Robust expression-invariant face recognition from partially missing data. In
ECCV. 396–408.
[34]
A. M. Bronstein, M. M. Bronstein, and R. Kimmel. 2007. Expression-invariant representations of faces. IEEE TIP 16, 1 (2007), 188–197.
[35] J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun. 2013. Spectral networks and locally connected networks on graphs. In ICLR.
[36]
Y. Cai, Y. Lei, M. Yang, Z. You, and S. Shan. 2019. A fast and robust 3D face recognition approach based on deeply learned face
representation. Neurocomputing 363 (2019), 375–397.
[37]
C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. 2014. Facewarehouse: a 3D facial expression database for visual computing. IEEE
TVCG 20, 3 (2014), 413–425.
[38]
Y. Cao, S. Liu, P. Zhao, and H. Zhu. 2022. RP-Net: A pointNet++ 3D face recognition algorithm integrating RoPS local descriptor. IEEE
Access 10 (2022), 91245–91252.
[39] K. Chang, K. Bowyer, and P. Flynn. 2003. Face recognition using 2D and 3D facial data. In MMUA. 25–32.
[40] K. I. Chang, K. W. Bowyer, and P. J. Flynn. 2003. Multimodal 2D and 3D biometrics for face recognition. In AMFG. 187–194.
[41]
K. I. Chang, K. W. Bowyer, and P. J. Flynn. 2005. Adaptive rigid multi-region selection for handling expression variation in 3D face
recognition. In CVPR Workshops. 157–157.
[42]
K. I. Chang, K. W. Bowyer, and P. J. Flynn. 2005. An evaluation of multimodal 2D+ 3D face biometrics. IEEE TPAMI 27, 4 (2005),
619–624.
[43]
K. I. Chang, K. W. Bowyer, and P. J. Flynn. 2006. Multiple nose region matching for 3D face recognition under varying facial expression.
IEEE TPAMI 28, 10 (2006), 1695–1700.
[44]
S. Cheng, I. Kotsia, M. Pantic, and S. Zafeiriou. 2018. 4dfab: A large scale 4d database for facial expression analysis and biometric
applications. In CVPR. 5117–5126.
[45] C. Chua, F. Han, and Y. Ho. 2000. 3D human face recognition using point signature. In FG. 233–238.
[46] C. Chua and R. Jarvis. 1997. Point signatures: a new representation for 3D object recognition. IJCV 25, 1 (1997), 63–85.
[47] D. Colbry, G. Stockman, and A. Jain. 2005. Detection of anchor points for 3D face verification. In CVPR Workshops. 118–118.
[48]
A. Colombo, C. Cusano, and R. Schettini. 2006. 3D face detection using curvature analysis. Pattern Recognition 39, 3 (2006), 444–455.
[49] A. Colombo, C. Cusano, and R. Schettini. 2011. UMB-DB: A database of partially occluded 3D faces. In ICCV Workshops. 2113–2119.
[50] C. Conde, A. Serrano, and E. Cabello. 2006. Multimodal 2D, 2.5D & 3D Face Verification. In ICIP. IEEE, 2061–2064.
[51] J. Cook, V. Chandran, and C. Fookes. 2006. 3D face recognition using log-gabor templates. In BMVC. 769–778.
[52] J. Cook, V. Chandran, and S. Sridharan. 2007. Multiscale representation for 3D face recognition. IEEE TIFS 2, 3 (2007), 529–536.
[53]
J. Cook, V. Chandran, S. Sridharan, and C. Fookes. 2004. Face recognition from 3D data using iterative closest point algorithm and
gaussian mixture models. In 3DimPVT. 502–509.
[54]
C. A. Corneanu, M. O. Simón, J. F. Cohn, and S. E. Guerrero. 2016. Survey on RGB, 3D, thermal, and multimodal approaches for facial
expression recognition: history, trends, and aect-related applications. IEEE TPAMI 38, 8 (2016), 1548–1568.
[55]
C. Creusot, N. Pears, and J. Austin. 2013. A machine-learning approach to keypoint detection and landmarking on 3D meshes. IJCV
102, 1-3 (2013), 146–179.
[56]
N. Dagnes, E. Vezzetti, F. Marcolin, and S. Tornincasa. 2018. Occlusion detection and restoration techniques for 3D face recognition: a
literature review. Machine Vision & Applications 29, 5 (2018), 789–813.
[57]
H. Dibeklioğlu, B. Gökberk, and L. Akarun. 2009. Nasal region-based 3D face recognition under pose and expression variations. In
Advances in Biometrics. 309–318.
[58]
H. Dibeklioglu, A. A. Salah, and L. Akarun. 2008. 3D facial landmarking under expression, pose, and occlusion variations. In BTAS. 1–6.
[59]
H. Drira, B. B. Amor, A. Srivastava, M. Daoudi, and R. Slama. 2013. 3D face recognition under expressions, occlusions, and pose
variations. IEEE TPAMI 35, 9 (2013), 2270–2283.
[60]
K. Dutta, D. Bhattacharjee, and M. Nasipuri. 2020. SpPCANet: a simple deep learning-based feature extraction approach for 3D face
recognition. Multimedia Tools and Applications (2020), 1–24.
[61]
K. Dutta, D. Bhattacharjee, M. Nasipuri, and O. Krejcar. 2021. Complement component face space for 3D face recognition from range
images. Applied Intelligence 51, 4 (April 2021), 2500–2517.
[62]
M. Emambakhsh and A. Evans. 2016. Nasal patches and curves for expression-robust 3D face recognition. IEEE TPAMI 39, 5 (2016),
995–1007.
[63] N. Erdogmus and J. Dugelay. 2014. 3D assisted face recognition: dealing with expression variations. IEEE TIFS 9, 5 (2014), 826–838.
[64] N. Erdogmus and S. Marcel. 2013. Spoong in 2D face recognition with 3D masks and anti-spoong with kinect. In BTAS. 1–6.
[65] T. Faltemier, K. Bowyer, and P. Flynn. 2006. 3D face recognition with region committee voting. In 3DimPVT. 318–325.
[66]
T. C. Faltemier, K. W. Bowyer, and P. J. Flynn. 2007. Using a multi-instance enrollment representation to improve 3D face recognition.
In BTAS. 1–6.
[67] T. C. Faltemier, K. W. Bowyer, and P. J. Flynn. 2008. A region ensemble for 3D face recognition. IEEE TIFS 3, 1 (2008), 62–73.
[68] T. C. Faltemier, K. W. Bowyer, and P. J. Flynn. 2008. Rotated prole signatures for robust 3D feature detection. In FG. 1–7.
[69]
T. C. Faltemier, K. W. Bowyer, and P. J. Flynn. 2008. Using multi-instance enrollment to improve performance of 3D face recognition.
CVIU 112, 2 (2008), 114–125.
[70]
X. Fan, Q. Jia, K. Huyan, X. Gu, and Z. Luo. 2016. 3D facial landmark localization using texture regression via conformal mapping.
Pattern Recognition Letters 83 (2016), 395–402.
[71]
G. Fanelli, M. Dantone, J. Gall, A. Fossati, and L. V. Gool. 2013. Random forests for real time 3D face analysis. IJCV 101, 3 (2013),
437–458.
[72]
T. Fang, X. Zhao, O. Ocegueda, S. K. Shah, and I. A. Kakadiaris. 2011. 3D facial expression recognition: A perspective on promises
and challenges. In FG Workshops. 603–610.
[73]
J. Feng, Q. Guo, Y. Guan, M. Wu, X. Zhang, and C. Ti. 2019. 3D face recognition method based on deep convolutional neural network.
In ICSICCS. 123–130.
[74]
M. A. Fischler and R. C. Bolles. 1981. Random sample consensus: A paradigm for model fitting with applications to image analysis and
automated cartography. Commun. ACM 24, 6 (1981), 381–395.
[75]
P. J. Flynn, K. W. Bowyer, and P. J. Phillips. 2003. Assessment of time dependency in face recognition: An initial study. In AVBPA.
44–51.
[76] S. Z. Gilani and A. Mian. 2018. Learning from millions of 3D scans for large-scale 3D face recognition. In CVPR. 1896–1905.
[77]
S. Z. Gilani, A. Mian, and P. Eastwood. 2017. Deep, dense and accurate 3D face correspondence for generating population specific
deformable models. Pattern Recognition 69 (2017), 238–250.
[78] S. Z. Gilani, A. Mian, F. Shafait, and I. Reid. 2018. Dense 3D face correspondence. IEEE TPAMI 40, 7 (2018), 1584–1598.
[79]
S. Z. Gilani, F. Shafait, and A. Mian. 2015. Shape-based automatic detection of a large number of 3D facial landmarks. In CVPR.
4639–4648.
[80]
B. Gokberk and L. Akarun. 2006. Comparative analysis of decision-level fusion algorithms for 3D face recognition. In ICPR, Vol. 3.
1018–1021.
[81] G. G. Gordon. 1992. Face recognition based on depth and curvature features. In CVPR. 808–810.
[82] G. Guo and N. Zhang. 2019. A survey on deep learning based face recognition. CVIU 189 (2019), 102805.
[83]
M. Guo, J. Cai, Z. Liu, T. Mu, R. R. Martin, and S. Hu. 2021. Pct: Point cloud transformer. Computational Visual Media 7, 2 (2021),
187–199.
[84]
Y. Guo, M. Bennamoun, F. Sohel, M. Lu, and J. Wan. 2014. 3D object recognition in cluttered scenes with local surface features: A
survey. IEEE TPAMI 36, 11 (2014), 2270–2287.
[85]
Y. Guo, Y. Lei, L. Liu, Y. Wang, M. Bennamoun, and F. Sohel. 2016. EI3D: Expression-invariant 3D face recognition based on feature and
shape matching. Pattern Recognition Letters 83 (2016), 403–412.
[86]
Y. Guo, F. Sohel, M. Bennamoun, M. Lu, and J. Wan. 2013. Rotational projection statistics for 3D local surface description and object
recognition. IJCV 105, 1 (2013), 63–86.
[87]
Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun. 2021. Deep learning for 3D point clouds: A survey. IEEE TPAMI 43, 12
(2021), 4338–4364.
[88]
S. Gupta, J. K. Aggarwal, M. K. Markey, and A. C. Bovik. 2007. 3D face recognition founded on the structural diversity of human faces.
In CVPR. 1–7.
[89]
S. Gupta, M. K. Markey, and A. C. Bovik. 2007. Advances and challenges in 3D and 2D+3D human face recognition. Pattern Recognition
in Biology (2007), 63–103.
[90] S. Gupta, M. K. Markey, and A. C. Bovik. 2010. Anthropometric 3D face recognition. IJCV 90, 3 (2010), 331–349.
[91]
F. B. T. Haar and R.C. Veltkamp. 2010. Expression modeling for expression-invariant face recognition. Computers & Graphics 34, 3
(2010), 231–241.
[92] F. B. T. Haar and R. C. Veltkamp. 2009. A 3D face matching framework for facial curves. Graphical Models 71, 2 (2009), 77–91.
[93]
F. Hajati, A. A. Raie, and Y. Gao. 2012. 2.5D face recognition using patch geodesic moments. Pattern Recognition 45, 3 (2012), 969–982.
[94]
W. Hariri, H. Tabia, N. Farah, A. Benouareth, and D. Declercq. 2016. 3D face recognition using covariance based descriptors. Pattern
Recognition Letters 78 (2016), 1–7.
[95] W. Hariri and M. Zaabi. 2021. Deep residual feature quantization for 3D face recognition.
[96] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
[97]
T. Heseltine, N. Pears, and J. Austin. 2004. Three-dimensional face recognition: a shersurface approach. In Image Analysis and
Recognition. 684–691.
[98]
T. Heseltine, N. Pears, and J. Austin. 2004. Three-dimensional face recognition: an eigensurface approach. In ICIP, Vol. 2. 1421–1424.
[99]
T. Heseltine, N. Pears, and J. Austin. 2008. Three-dimensional face recognition using combinations of surface feature map subspace
components. Image and Vision Computing 26, 3 (2008), 382–396.
[100]
C. Hesher, A. Srivastava, and G. Erlebacher. 2003. A novel technique for face recognition using range imaging. In ISSPA, Vol. 2. 201–204.
[101]
R. I. Hg, P. Jasek, C. Rodal, K. Nasrollahi, T. B. Moeslund, and G. Tranchet. 2012. An RGB-D database using Microsoft’s Kinect for
Windows for face detection. In SITIS. 42–46.
[102]
Y. Hu, Z. Zhang, X. Xu, Y. Fu, and T. S. Huang. 2007. Building large scale 3D face database for face analysis. In Multimedia Content
Analysis and Mining. 343–350.
[103]
D. Huang, M. Ardabilian, Y. Wang, and L. Chen. 2012. 3D face recognition using eLBP-based facial description and local feature hybrid
matching. IEEE TIFS 7, 5 (2012), 1551–1565.
[104]
Y. Huang, Y. Wang, and T. Tan. 2006. Combining statistics of geometrical and correlative features for 3D face recognition. In BMVC.
879–888.
[105]
M. Husken, M. Brauckmann, S. Gehlen, and C. von der Malsburg. 2005. Strategies and benefits of fusion of 2D and 3D face recognition.
In CVPR Workshops. 174–174.
[106]
M.O. Irfanoglu, B. Gokberk, and L. Akarun. 2004. 3D shape-based face recognition using automatically registered facial surfaces. In
ICPR, Vol. 4. 183–186.
[107]
S. M. S. Islam, M. Bennamoun, R. A. Owens, and R. Davies. 2012. A review of recent advances in 3D ear and expression invariant face
biometrics. Comput. Surveys 44, 3 (2012), 14.
[108]
P. Isola, J. Zhu, T. Zhou, and A. A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In CVPR. 5967–5976.
[109]
P. Isola, J. Zhu, T. Zhou, and A. A. Efros. 2017. Image-to-image translation with conditional adversarial networks. In CVPR. 5967–5976.
[110]
A. K. Jain, K. Nandakumar, and A. Ross. 2016. 50 years of biometric research: accomplishments, challenges, and opportunities. Pattern
Recognition Letters 79 (2016), 80–105.
[111] A. K. Jain, A. Ross, and S. Prabhakar. 2004. An introduction to biometric recognition. IEEE TCSVT 14, 1 (2004), 4–20.
[112]
C. Jiang, S. Lin, W. Chen, F. Liu, and L. Shen. 2022. PointFace: point cloud encoder based feature embedding for 3D face recognition.
IEEE TBIOM (2022), 1–1.
[113] Z. Jiang, Q. Wu, K. Chen, and J. Zhang. 2019. Disentangled representation learning for 3D face shape. In CVPR. 11949–11958.
[114] Y. Jing, X. Lu, and S. Gao. 2021. 3D face recognition: A survey.
[115]
M. Jribi, S. Mathlouthi, and F. Ghorbel. 2021. A geodesic multipolar parameterization-based representation for 3D face recognition.
Signal Processing: Image Communication 99 (Nov. 2021), 116464.
[116]
M. Jribi, A. Rihani, A. B. Khlifa, and F. Ghorbel. 2019. An SE(3) invariant description for 3D face recognition. Image and Vision
Computing 89 (Sept. 2019), 106–119.
[117]
A. Kacem, H. B. Abdesslam, K. Cherenkova, and D. Aouada. 2021. Space-time triplet loss network for dynamic 3D face verification. In
ICPR. 82–90.
[118]
A. Kacem, K. Cherenkova, and D. Aouada. 2022. Disentangled face identity representations for joint 3D face recognition and
neutralisation. In ICVR. 438–443.
[119]
I. A. Kakadiaris, G. Passalis, G. Toderici, M. N. Murtuza, Y. Lu, N. Karampatziakis, and T. Theoharis. 2007. Three-dimensional face
recognition in the presence of facial expressions: An annotated deformable model approach. IEEE TPAMI 29, 4 (2007), 640–649.
[120] D. Kim, M. Hernandez, J. Choi, and G. Medioni. 2017. Deep 3D face identification. In IJCB. 133–142.
[121]
J. Kittler, A. Hilton, M. Hamouz, and J. Illingworth. 2005. 3D assisted face recognition: A survey of 3D imaging, modelling and recognition
approaches. In CVPR Workshops. 114–114.
[122]
Y. Lei, M. Bennamoun, and A. A. El-Sallam. 2013. An efficient 3D face recognition approach based on the fusion of novel local low-level
features. Pattern Recognition 46, 1 (2013), 24–37.
[123]
Y. Lei, M. Bennamoun, M. Hayat, and Y. Guo. 2014. An efficient 3D face recognition approach using local geometrical signatures.
Pattern Recognition 47, 2 (2014), 509–524.
[124]
Y. Lei, Y. Guo, M. Hayat, M. Bennamoun, and X. Zhou. 2016. A two-phase weighted collaborative representation for 3D partial face
recognition with single sample. Pattern Recognition 52, 4 (2016), 218–237.
[125]
B. Li, A. S. Mian, W. Liu, and A. Krishna. 2013. Using Kinect for face recognition under varying poses, expressions, illumination and
disguise. In WACV. 186–192.
[126]
H. Li, D. Huang, J. M. Morvan, Y. Wang, and L. Chen. 2015. Towards 3D face recognition in the real: a registration-free approach using
fine-grained matching of 3D keypoint descriptors. IJCV 113, 2 (2015), 128–142.
[127]
H. Li, J. Sun, and L. Chen. 2017. Location-sensitive sparse representation of deep normal patterns for expression-robust 3D Face
Recognition. IJCB (2017).
[128]
L. Li, C. Xu, W. Tang, and C. Zhong. 2008. 3D face recognition by constructing deformation invariant image. Pattern Recognition Letters
29, 10 (2008), 1596–1602.
[129]
M. Li, B. Huang, and G. Tian. 2022. A comprehensive survey on 3D face recognition methods. Engineering Applications of Artificial
Intelligence 110 (April 2022), 104669.
[130] X. Li, T. Jia, and H. Zhang. 2009. Expression-insensitive 3D face recognition using sparse representation. In CVPR. 2575–2582.
[131]
S. Lin, C. Jiang, F. Liu, and L. Shen. 2021. High quality facial data synthesis and fusion for 3D low-quality face recognition. In IJCB. 1–8.
[132] S. Lin, F. Liu, Y. Liu, and L. Shen. 2019. Local feature tensor based deep learning for 3D face recognition. In FG. 1–5.
[133]
W. Lin, K. Wong, N. Boston, and Y. Hu. 2007. 3D face recognition under expression variations using similarity metrics fusion. In ICME.
727–730.
[134] F. Liu, L. Tran, and X. Liu. 2019. 3D face modeling from diverse raw scan data. In ICCV. 9407–9417.
[135]
P. Liu, Y. Wang, D. Huang, Z. Zhang, and L. Chen. 2013. Learning the spherical harmonic features for 3D face recognition. IEEE TIP 22,
3 (2013), 914–925.
[136] D. G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. IJCV 60, 2 (2004), 91–110.
[137] X. Lu, D. Colbry, and A. K. Jain. 2004. Matching 2.5D scans for face recognition. In ICBA. 30–36.
[138] X. Lu and A. K. Jain. 2005. Integrating range and texture information for 3D face recognition. In IEEE WACV, Vol. 1. 156–163.
[139] X. Lu and A. K. Jain. 2005. Multimodal facial feature extraction for automatic 3D face recognition. Technical Report (2005).
[140] X. Lu and A. K. Jain. 2006. Automatic feature extraction for multiview 3D face recognition. In FG. 585–590.
[141] X. Lu and A. K. Jain. 2008. Deformation modeling for robust 3D face matching. IEEE TPAMI 30, 8 (2008), 1346–1357.
[142] X. Lu, A. K. Jain, and D. Colbry. 2006. Matching 2.5D face scans to 3D models. IEEE TPAMI 28, 1 (2006), 31–43.
[143]
M. A. de Jong, A. Wollstein, C. Ruff, D. Dunaway, P. Hysi, T. Spector, F. Liu, W. Niessen, M. J. Koudstaal, M. Kayser, E. B. Wolvius, and S. Böhringer. 2016. An automatic 3D facial landmarking algorithm using 2D gabor wavelets. IEEE TIP 25, 2 (2016), 580–588.
[144]
M. Bennamoun, F. Sohel, and Y. Guo. 2015. Feature selection for 2D and 3D face recognition. Wiley Encyclopedia of Electrical and
Electronics Engineering (2015).
[145]
M. H. Mahoor and M. Abdel-Mottaleb. 2009. Face recognition based on 3D ridge images obtained from range data. Pattern Recognition
42, 3 (2009), 445–451.
[146]
T. Mantecon, C. R. del Bianco, F. Jaureguizar, and N. García. 2014. Depth-based face recognition using local quantized patterns adapted
for range data. In ICIP. 293–297.
[147] I. Marras, S. Zafeiriou, and G. Tzimiropoulos. 2012. Robust learning from normals for 3D face recognition. In ECCV. 230–239.
[148]
T. Maurer, D. Guigonis, I. Maslov, B. Pesenti, A. Tsaregorodtsev, D. West, and G. Medioni. 2005. Performance of Geometrix ActiveID™
3D face recognition engine on the FRGC data. In CVPR Workshops. 154–154.
[149]
K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre. 1999. XM2VTSDB: The extended M2VTS database. In AVBPA, Vol. 964. 965–966.
[150] A. Mian. 2011. Robust realtime feature detection in raw 3D face images. In WACV. 220–226.
[151]
A. S. Mian, M. Bennamoun, and R. Owens. 2007. An efficient multimodal 2D-3D hybrid approach to automatic face recognition. IEEE
TPAMI 29, 11 (2007), 1927–1943.
[152]
A. S. Mian, M. Bennamoun, and R. Owens. 2008. Keypoint detection and local feature matching for textured 3D face recognition. IJCV
79, 1 (2008), 1–12.
[153]
A. S. Mian, M. Bennamoun, and R. A. Owens. 2005. Region-based matching for robust 3D face recognition. In BMVC, Vol. 5. 199–208.
[154] A. S. Mian and N. Pears. 2012. 3D face recognition. In 3D Imaging, Analysis and Applications. 311–366.
[155] R. Min, N. Kose, and J. Dugelay. 2014. KinectFaceDB: A Kinect database for face recognition. IEEE TSMC 44, 11 (2014), 1534–1548.
[156]
H. Mohammadzade and D. Hatzinakos. 2013. Iterative closest normal point for 3D face recognition. IEEE TPAMI 35, 2 (2013), 381–397.
[157] A.B. Moreno and A. Sanchez. 2004. GavabDB: a 3D face database. In COST275 Workshop on Biometrics on the Internet. 75–80.
[158]
A. B. Moreno, Á. Sanchez, J. F. Velez, and F. J. Diaz. 2005. Face recognition using 3D local geometrical features: PCA vs. SVM. In ISPA.
185–190.
[159] A. B. Moreno, A. Sánchez, J. F. Vélez, and F. J. Díaz. 2003. Face recognition using 3D surface-extracted descriptors. In IMVIP, Vol. 2.
[160] M. H. Mousavi, K. Faez, and A. Asghari. 2008. Three dimensional face recognition using SVM classifier. In ICIS. 208–213.
[161]
I. Mpiperis, S. Malassiotis, and M. G. Strintzis. 2007. 3D face recognition with the geodesic polar representation. IEEE TIFS 2, 3 (2007),
537–547.
[162]
I. Mpiperis, S. Malassiotis, and M. G. Strintzis. 2008. Bilinear models for 3D face and facial expression recognition. IEEE TIFS 3, 3 (2008),
498–511.
[163]
G. Mu, D. Huang, G. Hu, J. Sun, and Y. Wang. 2019. Led3D: A lightweight and efficient deep approach to recognizing low-quality 3D
faces. In CVPR. 5766–5775.
[164] T. Nagamine, T. Uemura, and I. Masuda. 1992. 3D facial image analysis for human identication. In ICPR. 324–327.
[165]
B. Nassih, A. Amine, M. Ngadi, Y. Azdoud, D. Naji, and N. Hmina. 2021. An efficient three-dimensional face recognition system based
random forest and geodesic curves. Computational Geometry 97 (2021), 101758.
[166] T. Neumann, K. Varanasi, S. Wenger, M. Wacker, M. Magnor, and C. Theobalt. 2013. Sparse localized deformation components. ACM
TOG 32, 6 (2013).
[167]
O. Ocegueda, T. Fang, S. K. Shah, and I. A. Kakadiaris. 2013. 3D face discriminant analysis using Gauss-Markov posterior marginals.
IEEE TPAMI 35, 3 (2013), 728–739.
[168] O. Ocegueda, S. K. Shah, and I. A. Kakadiaris. 2011. Which parts of the face give out your identity?. In CVPR. 641–648.
[169]
E. C. Olivetti, J. Ferretti, G. Cirrincione, F. Nonis, S. Tornincasa, and F. Marcolin. 2019. Deep CNN for 3D face recognition. In Design
Tools and Methods in Industrial Engineering. 665–674.
[170] G. Pan, S. Han, Z. Wu, and Y. Wang. 2005. 3D face recognition using mapped depth images. In CVPR Workshops. 175–175.
[171] G. Pan, Y. Wu, Z. Wu, and W. Liu. 2003. 3D Face recognition by prole and surface matching. In IJCNN, Vol. 3. 2169–2174.
[172]
K. Papadopoulos, A. Kacem, A. E. R. Shabayek, and D. Aouada. 2022. Face-GCN: a graph convolutional network for 3D dynamic face
recognition. In ICVR. 454–458.
[173]
T. Papatheodorou and D. Rueckert. 2004. Evaluation of automatic 4D face recognition using surface and texture registration. In FG.
321–326.
[174]
C. Papazov, T. K. Marks, and M. Jones. 2015. Real-Time 3D head pose and facial landmark estimation from depth images using triangular
surface patch features. In CVPR. 4722–4730.
[175] O. M. Parkhi, A. Vedaldi, and A. Zisserman. 2015. Deep face recognition. In BMVC. 41.1–41.12.
[176]
G. Passalis, I.A. Kakadiaris, T. Theoharis, G. Toderici, and N. Murtuza. 2005. Evaluation of 3D face recognition in the presence of facial
expressions: an annotated deformable model approach. In CVPR Workshops. 171–171.
[177]
G. Passalis, P. Perakis, T. Theoharis, and I. A. Kakadiaris. 2011. Using facial symmetry to handle pose variations in real-world 3D face
recognition. IEEE TPAMI 33, 10 (2011), 1938–1951.
[178]
P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. 2009. A 3D face model for pose and illumination invariant face recognition.
In AVSS. 296–301.
[179]
X. Peng, M. Bennamoun, and A. S. Mian. 2011. A training-free nose tip detection method from face range images. Pattern Recognition
44, 3 (2011), 544–558.
[180]
P. Perakis, G. Passalis, T. Theoharis, and I. A. Kakadiaris. 2013. 3D facial landmark detection under large yaw and expression variations.
IEEE TPAMI 35, 7 (2013), 1552–1564.
[181]
D. Petrovska-Delacretaz, S. Lelandais, J. Colineau, L. Chen, B. Dorizzi, M. Ardabilian, E. Krichen, M. Mellakh, A. Chaari, S. Guerfi, J.
D’Hose, and B. Amor. 2008. The IV 2 multimodal biometric database (including iris, 2D, 3D, stereoscopic, and talking face data), and
the IV 2-2007 evaluation campaign. In BTAS. 1–7.
[182]
P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek. 2005. Overview of the face
recognition grand challenge. In CVPR, Vol. 1. 947–954.
[183]
S. Pini, G. Borghi, R. Vezzani, D. Maltoni, and R. Cucchiara. 2021. A systematic comparison of depth map representations for face
recognition. Sensors 21, 3 (2021), 944.
[184]
C. R. Qi, L. Yi, H. Su, and L. J. Guibas. 2017. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS,
Vol. 30.
[185]
C. C. Queirolo, L. Silva, O. R. P. Bellon, and M. P. Segundo. 2010. 3D face recognition using simulated annealing and the surface
interpenetration measure. IEEE TPAMI 32, 2 (2010), 206–219.
[186] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black. 2018. Generating 3D faces using convolutional mesh autoencoders. In ECCV.
[187]
T. D. Russ, M. W. Koch, and C. Q. Little. 2005. A 2D range Hausdorff approach for 3D face recognition. In CVPR Workshops. 169–169.
[188]
C. Samir, A. Srivastava, and M. Daoudi. 2006. Three-dimensional face recognition using shapes of facial curves. IEEE TPAMI 28, 11
(2006), 1858–1863.
[189]
C. Samir, A. Srivastava, M. Daoudi, and E. Klassen. 2009. An intrinsic framework for analysis of facial surfaces. IJCV 82, 1 (2009),
80–95.
[190] G. Sandbach, S. Zafeiriou, M. Pantic, and L. Yin. 2012. Static and dynamic 3D facial expression recognition: A comprehensive survey.
Image and Vision Computing 30, 10 (2012), 683–697.
[191]
A. Savran, N. Alyüz, H. Dibeklioğlu, O. Çeliktutan, B. Gökberk, B. Sankur, and L. Akarun. 2008. Bosphorus database for 3D face analysis.
In Biometrics and Identity Management. 47–56.
[192] A. Scheenstra, A. Ruifrok, and R. Veltkamp. 2005. A survey of 3D face recognition methods. In AVBPA. 325–345.
[193]
F. Schro, D. Kalenichenko, and J. Philbin. 2015. FaceNet: A unied embedding for face recognition and clustering. In CVPR. 815–823.
[194]
M. P. Segundo, C. Queirolo, O. R. P. Bellon, and L. Silva. 2007. Automatic 3D facial segmentation and landmark detection. In ICIAP.
431–436.
[195]
S. Sharma and V. Kumar. 2020. Voxel-based 3D face reconstruction and its application to face recognition using sequential deep learning.
Multimedia Tools and Applications 79, 25-26 (July 2020), 17303–17330.
[196]
B. Shi, H. Zang, R. Zheng, and S. Zhan. 2019. An efficient 3D face recognition approach using frenet feature of iso-geodesic curves. JVCIR 59 (2019), 455–460.
[197]
D. Smeets, P. Claes, J. Hermans, D. Vandermeulen, and P. Suetens. 2012. A comparative study of 3D face recognition under expression
variations. IEEE TSMCC 42, 5 (2012), 710–727.
[198]
D. Smeets, P. Claes, D. Vandermeulen, and J. G. Clement. 2010. Objective 3D face recognition: Evolution, approaches and challenges.
Forensic Science International 201, 1-3 (2010), 125–132.
[199]
D. Smeets, T. Fabry, J. Hermans, D. Vandermeulen, and P. Suetens. 2009. Isometric deformation modeling using singular value
decomposition for 3D expression-invariant face recognition. In BTAS. 1–6.
[200]
D. Smeets, T. Fabry, J. Hermans, D. Vandermeulen, and P. Suetens. 2010. Fusion of an isometric deformation modeling approach using
spectral decomposition and a region-based approach using ICP for expression-invariant 3D face recognition. In ICPR. 1172–1175.
[201]
D. Smeets, J. Keustermans, D. Vandermeulen, and P. Suetens. 2013. meshSIFT: local surface features for 3D face recognition under
expression variations and partial data. CVIU 117, 2 (2013), 158–169.
[202]
S. Soltanpour, B. Boufama, and Q.M. J. Wu. 2017. A survey of local feature methods for 3D face recognition. Pattern Recognition 72
(2017), 391–406.
[203] S. Soltanpour and Q.M. J. Wu. 2017. High-order local normal derivative pattern (LNDP) for 3D face recognition. In ICIP. 2811–2815.
[204]
M. Song, D. Tao, S. Sun, C. Chen, and S. J. Maybank. 2014. Robust 3D face landmark localization based on local coordinate coding. IEEE
TIP 23, 12 (2014), 5108–5122.
[205] L. Spreeuwers. 2011. Fast and accurate 3D face recognition. IJCV 93, 3 (2011), 389–414.
[206]
L. Spreeuwers. 2015. Breaking the 99% barrier: optimisation of three-dimensional face recognition. IET Biometrics 4, 3 (2015), 169–178.
[207]
A. Srivastava, C. Samir, S. H. Joshi, and M. Daoudi. 2009. Elastic shape models for face analysis using curvilinear coordinates. Journal
of Mathematical Imaging and Vision 33, 2 (2009), 253–265.
[208]
H. Sun, N. Pears, and Y. Gu. 2022. Information Bottlenecked Variational Autoencoder for Disentangled 3D Facial Expression Modelling.
In WACV. 2334–2343.
[209]
Y. Tan, H. Lin, Z. Xiao, S. Ding, and H. Chao. 2019. Face recognition from sequential sparse 3D data via deep registration. In ICB. 1–8.
[210] Frank B. ter Haar and Remco C. Veltkamp. 2008. 3D Face Model Fitting for Recognition. In ECCV. 652–664.
[211]
G. Toderici, G. Evangelopoulos, T. Fang, T. Theoharis, and I. A. Kakadiaris. 2014. UHDB11 Database for 3D-2D face recognition. In
PSIVT. 73–86.
[212] F. Tombari, S. Salti, and L. D. Stefano. 2010. Unique signatures of histograms for local surface description. In ECCV. 356–369.
[213]
N. F. Troje and H. H. Bülthoff. 1996. Face recognition under varying poses: The role of texture and shape. Vision Research 36, 12 (1996),
1761–1771.
[214] E. Trucco and A. Verri. 1998. Introductory techniques for 3D computer vision.
[215]
F. Tsalakanidou, S. Malassiotis, and M. G. Strintzis. 2005. Face localization and authentication using color and depth images. IEEE TIP
14, 2 (2005), 152–168.
[216]
F. Tsalakanidou, S. Malassiotis, and M. G. Strintzis. 2007. A 3D face and hand biometric system for robust user-friendly authentication.
Pattern Recognition Letters 28, 16 (2007), 2238–2249.
[217]
F. Tsalakanidou, D. Tzovaras, and M. G. Strintzis. 2003. Use of depth and colour eigenfaces for face recognition. Pattern Recognition
Letters 24, 9 (2003), 1427–1435.
[218]
R. C. Veltkamp, S. V. Jole, H. Drira, B. B. Amor, M. Daoudi, H. Li, L. Chen, P. Claes, D. Smeets, J. Hermans, D. Vandermeulen, and P.
Suetens, et al. 2011. SHREC’11 track: 3D face models retrieval. In 3DOR. 89–95.
[219]
V. Vijayan, K. W. Bowyer, P. J. Flynn, D. Huang, L. Chen, M. Hansen, O. Ocegueda, S. K. Shah, and I. A. Kakadiaris. 2011. Twins 3D face
recognition challenge. In IJCB. 1–7.
[220]
Y. Wang and C. Chua. 2005. Face recognition from 2D and 3D images using 3D Gabor filters. Image and Vision Computing 23, 11 (2005),
1018–1028.
[221]
Y. Wang, C. Chua, and Y. Ho. 2002. Facial feature detection and face recognition from 2D and 3D images. Pattern Recognition Letters 23,
10 (2002), 1191–1202.
[222]
Y. Wang, J. Liu, and X. Tang. 2010. Robust 3D face recognition by local shape difference boosting. IEEE TPAMI 32, 10 (2010), 1858–1870.
[223]
Y. Wang, G. Pan, Z. Wu, and Y. Wang. 2006. Exploring facial expression effects in 3D face recognition using partial ICP. In ACCV.
581–590.
[224] Y. Wang, X. Tang, J. Liu, G. Pan, and R. Xiao. 2008. 3D face recognition by local shape difference boosting. In ECCV. 603–616.
[225]
Z. Wang, Z. Miao, Q.M. J. Wu, Y. Wan, and Z. Tang. 2014. Low-resolution face recognition: a review. The Visual Computer 30, 4 (2014),
359–386.
[226]
N. Werghi, C. Tortorici, S. Berretti, and A. D. Bimbo. 2016. Boosting 3D LBP-based face recognition by fusing shape and texture
descriptors on the mesh. IEEE TIFS 11, 5 (2016), 964–979.
[227]
C. Xu, S. Li, T. Tan, and L. Quan. 2009. Automatic 3D face recognition from depth and intensity Gabor features. Pattern Recognition 42,
9 (2009), 1895–1905.
[228]
C. Xu, T. Tan, S. Li, Y. Wang, and C. Zhong. 2006. Learning effective intrinsic features to boost 3D-based face recognition. In ECCV.
416–427.
[229]
C. Xu, T. Tan, Y. Wang, and L. Quan. 2006. Combining local features for robust nose location in 3D facial data. Pattern Recognition
Letters 27, 13 (2006), 1487–1494.
[230] C. Xu, Y. Wang, T. Tan, and L. Quan. 2004. A new attempt to face recognition using 3D eigenfaces. In ACCV, Vol. 2. 884–889.
[231]
C. Xu, Y. Wang, T. Tan, and L. Quan. 2004. Automatic 3D face recognition combining global geometric features with local shape
variation information. In FG. 308–313.
[232]
K. Xu, X. Wang, Z. Hu, and Z. Zhang. 2019. 3D face recognition based on twin neural network combining deep map and texture. In
ICCT. 1665–1668.
[233]
H. Yang, H. Zhu, Y. Wang, M. Huang, Q. Shen, R. Yang, and X. Cao. 2020. Facescape: a large-scale high quality 3d face dataset and
detailed riggable 3d face prediction. In CVPR. 601–610.
[234] B. Yin, Y. Sun, C. Wang, and Y. Ge. 2005. The BJUT-3D large-scale Chinese face database. Technical Report.
[235] L. Yin, X. Chen, Y. Sun, T. Worm, and M. Reale. 2008. A high-resolution 3D dynamic facial expression database. In FG. 1–6.
[236] L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato. 2006. A 3D facial expression database for facial behavior research. In FG. 211–216.
[237] X. Yu, Y. Gao, and J. Zhou. 2016. 3D face recognition under partial occlusions using radial strings. In ICIP. 3016–3020.
[238]
X. Yu, Y. Gao, and J. Zhou. 2017. Sparse 3D directional vertices vs continuous 3D curves: efficient 3D surface matching and its application
for single model face recognition. Pattern Recognition 65 (May 2017), 296–306.
[239]
S. Zafeiriou, M. Hansen, G. Atkinson, V. Argyriou, M. Petrou, M. Smith, and L. Smith. 2011. The photoface database. In CVPR Workshops.
132–139.
[240]
A. Zaharescu, E. Boyer, and R. Horaud. 2012. Keypoints and local descriptors of scalar functions on 2D manifolds. IJCV 100 (2012),
78–98.
[241] J. Zhang, D. Huang, Y. Wang, and J. Sun. 2016. Lock3DFace: A large-scale database of low-cost Kinect 3D faces. In ICB. 1–8.
[242]
L. Zhang, A. Razdan, G. Farin, J. Femiani, M. Bae, and C. Lockwood. 2006. 3D face authentication and recognition based on bilateral
symmetry analysis. The Visual Computer 22, 1 (2006), 43–55.
[243]
X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, and P. Liu. 2013. A high-resolution spontaneous 3D dynamic facial
expression database. In FG. 1–6.
[244]
X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, P. Liu, and J. M. Girard. 2014. BP4D-Spontaneous: a high-resolution
spontaneous 3D dynamic facial expression database. Image and Vision Computing 32, 10 (2014), 692–706.
[245]
Z. Zhang, C. Yu, H. Li, J. Sun, and F. Liu. 2020. Learning distribution independent latent representation for 3D face disentanglement. In
3DV. 848–857.
[246]
W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. 2003. Face recognition: a literature survey. Comput. Surveys 35, 4 (2003), 399–458.
[247]
X. Zhao, E. Dellandrea, L. Chen, and I. A. Kakadiaris. 2011. Accurate landmarking of three-dimensional facial data in the presence of
facial expressions and occlusions using a three-dimensional statistical facial feature model. IEEE TSMC 41, 5 (2011), 1417–1428.
[248] C. Zhong, Z. Sun, and T. Tan. 2007. Robust 3D face recognition using learned visual codebook. In CVPR. 1–6.
[249]
H. Zhou, A. Mian, L. Wei, D. Creighton, M. Hossny, and S. Nahavandi. 2014. Recent advances on singlemodal and multimodal face
recognition: A survey. IEEE THMS 44, 6 (2014), 701–716.
[250] S. Zhou and S. Xiao. 2018. 3D face recognition: A survey. HCIS 8, 1 (2018), 1–27.