3D Face Recognition: Two Decades of Progress and Prospects
YULAN GUO, Sun Yat-sen University and National University of Defense Technology, China
HANYUN WANG, Information Engineering University, China
LONGGUANG WANG, Aviation University of Air Force, China
YINJIE LEI, Sichuan University, China
LI LIU, National University of Defense Technology, China
MOHAMMED BENNAMOUN, University of Western Australia, Australia
3D face recognition has been extensively investigated in the last two decades due to its wide range of applications in many
areas such as security and forensics. Numerous methods have been proposed to deal with the challenges faced by 3D face
recognition such as facial expressions, pose variations and occlusions. These methods have achieved superior performances
on several small-scale datasets including FRGC v2.0, Bosphorus, BU-3DFE, and Gavab. However, deep learning based 3D face
recognition methods are still in their infancy due to the lack of large-scale 3D face datasets. To stimulate future research
in this area, we present a comprehensive review of the progress achieved by both traditional and deep learning based 3D
face recognition methods in the last two decades. Moreover, comparative results on several publicly available datasets under
dierent challenges of facial expressions, pose variations and occlusions are also presented.
CCS Concepts: • Computing methodologies → Computer vision; Artificial intelligence.
Additional Key Words and Phrases: 3D face recognition, local feature, deep learning, facial expression, pose variation, facial
occlusion
1 INTRODUCTION
The task of biometrics is to recognize a person based on their physiological (e.g., fingerprint, palmprint, iris, retina, and face) or behavioral characteristics (e.g., gait, handwriting, and voice) [107, 110]. Although different biometrics approaches have been intensively investigated for automatic human identification, face recognition is commonly considered as a major biometrics technique due to its universal availability, distinctiveness, permanence, non-contact collectability, and especially its non-invasiveness [28, 111]. Face recognition can be used in many areas including security, forensic, commercial, medical, education, and robotic applications [121, 198, 246].
Existing face recognition techniques can be broadly divided into 2D and 3D face recognition techniques according to the data modality. Most research efforts and commercial developments have focused on 2D face recognition due to its low cost and the wide availability of digital cameras [88, 121, 225]. As an alternative, 3D face recognition has a number of advantages compared to its 2D counterpart [84, 86, 198]. For instance, (i) 3D
data contain sucient geometrical information of a face without any projection from the 3D physical space to
the 2D imaging plane. (ii) 3D data is more invariant to illumination variations, the use of cosmetics and other
decorations. (iii) Facial pose can be more accurately estimated from 3D data compared to 2D images. Therefore,
3D face recognition has the potential to overcome several of the inherent challenges faced by 2D face recognition
algorithms and provides a solid alternative to the face recognition community [
241
]. With the advancement of
3D sensing devices (e.g., Microsoft Kinect, Iphone X depth camera) and computing devices (e.g., GPU), 3D face
recognition has become an emerging topic in last two decades [89].
Although 3D face recognition has several advantages compared to its 2D counterpart, it also faces several
challenges. First, the shape of a 3D face varies significantly under different expressions as the face is a non-rigid surface. Second, occlusion and clutter introduced by obscuring factors such as glasses, scarves, and hats increase the difficulty of face recognition. Third, the face can change gradually over time due to aging or change in
body health. Furthermore, the low quality (e.g., noise, holes) of 3D data acquired by low-cost 3D sensors poses
further challenges to 3D face recognition. Although a large number of algorithms have been proposed during the
last two decades, 3D face recognition is still far from real-world applications. It is therefore highly necessary to
comprehensively review the existing work and point out future research directions.
Several early survey papers on 3D face recognition appeared about ten years ago [1, 27, 28, 89, 121, 192].
Later, Smeets et al. [198] presented a concise review on 3D face recognition with a particular focus on facial expression issues. Islam et al. [107] presented a review on 3D ear and expression invariant face biometrics. Smeets et al. [197] introduced a comparative study of 3D face recognition under expression variations. Zhou et al. [249] reviewed several algorithms for single-modal and multi-modal face recognition. Three reviews [54, 72, 190] on 3D facial expression recognition rather than face recognition are also worth mentioning. Soltanpour et al. [202] summarized the state-of-the-art local feature based 3D face recognition methods published before 2017, and classified existing methods into three categories: keypoints-based, curve-based, and local surface-based methods. Zhou and Xiao [250] summarized the recent progress of 3D face recognition from three different aspects including pose-invariant recognition, expression-invariant recognition, and occlusion-invariant recognition. Dagnes et al. [56] focused on dealing with 3D face recognition with facial occlusions under non-cooperative and uncontrolled scenarios. Pini et al. [183] evaluated the effect of different 3D data representations (i.e., depth and normal images, voxels, point clouds), different deep learning-based models, and different pre-processing techniques for face recognition. They concluded that the methods based on normal images and point clouds perform and generalize better than other 2D and 3D alternatives. Li et al. [129] and Jing et al. [114] also reviewed current 3D face recognition methods from the aspects of traditional methods and deep learning-based methods. Although these papers provide good reviews on the progress of 3D face recognition, some advanced algorithms such as [61, 115, 238] proposed in recent years are not covered, especially those deep learning-based methods [118, 195].
The major contributions of this paper can be summarized as follows:
(i) This paper reviews the major 3D face recognition methods which have been proposed in the last two
decades. It can be used to help a reader understand the history, status, and future trend of 3D face recognition.
(ii) This paper provides a comprehensive review on both the traditional and the emerging deep learning-based
algorithms and adequately covers a large number of up-to-date 3D face recognition algorithms.
(iii) This paper specically discusses the approaches designed to deal with dierent nuisances that are faced by
a 3D face recognition system including facial expressions, pose variations, and occlusions.
(iv) A comprehensive comparison of existing algorithms on several publicly available datasets is also presented
in tabular forms under facial expression variations (Tables 2 and 3), pose variations (Table 4) and occlusions
(Table 5).
The rest of this paper is organized as follows. Section 2 describes the background concepts and terminology of
3D face recognition. Section 3 reviews several pre-processing approaches. Section 4 provides a comprehensive
survey of existing 3D face recognition methods. Section 5 introduces the recent trends in deep learning methods
for 3D face recognition. Section 6 presents a comprehensive comparison of existing algorithms under different
variations including facial expressions, pose variations, and occlusions. Finally, Section 7 concludes this paper.
2 TERMINOLOGY AND DATASETS
2.1 Terminology
3D face recognition usually includes two different tasks: face identification and face verification [1, 28, 89, 198].
The task of face identication is to compare a probe face against all gallery faces to obtain the identity of the
probe face. The performance of face identication is commonly measured by the Cumulative Match Curve (CMC),
where CMC plots the recognition rate with respect to dierent rank numbers. The Rank-1 Recognition Rate
(R1RR) is an important scalar metric for the evaluation of face identication.
The task of face verication (also called face authentication) is to compare a probe face against the gallery face
with a claimed identity. The performance of face verication is usually measured by the Receiving Operating
Characteristic (ROC) curve which plots the False Rejection Rate (FRR) versus the False Acceptance Rate (FAR)
[
63
]. FRR is the percentage of probes that have incorrectly been determined as non-match against the gallery
face with the same identity, FAR is the percentage of probes that have incorrectly been determined as a match
against the gallery face with a dierent identity. The Equal Error Rate (EER) and Verication Rate (VR) at 0.1%
FAR (VR@0.1%FAR) are two important scalar metrics for the evaluation of face verication. ERR is extracted
from the ROC curve where FAR is equal to FRR.
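To make these evaluation metrics concrete, the following is a minimal sketch (not part of any evaluation protocol cited above) that computes the EER and the VR at a target FAR directly from sets of genuine (same-identity) and impostor (different-identity) match scores, assuming that higher scores indicate better matches; all function and variable names are illustrative.

```python
import numpy as np

def verification_metrics(genuine, impostor, target_far=1e-3):
    """Compute the EER and the VR at a target FAR from raw match scores.

    genuine  : scores of same-identity comparisons (higher = more similar)
    impostor : scores of different-identity comparisons
    """
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    thresholds = np.unique(np.concatenate([genuine, impostor]))   # sweep all observed scores
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false acceptance rate
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejection rate
    # EER: the operating point where FAR and FRR are (approximately) equal.
    i = np.argmin(np.abs(far - frr))
    eer = (far[i] + frr[i]) / 2
    # VR at the target FAR, e.g. VR@0.1%FAR = 1 - FRR at the loosest threshold meeting the FAR.
    valid = np.where(far <= target_far)[0]
    vr = 1.0 - frr[valid[0]] if len(valid) else 0.0
    return eer, vr

# Toy usage with synthetic scores:
rng = np.random.default_rng(0)
eer, vr = verification_metrics(rng.normal(0.8, 0.1, 1000), rng.normal(0.3, 0.1, 10000))
```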
2.2 Datasets
A large number of datasets have been collected to test the performance of 3D face recognition algorithms since
the 1990s. Although several early datasets are available, such as the MaxPlank [213], USF HumanID 3D Face Database [20], XM2VTS [149], 3DRMA [16], FSU [100], Biometrics [39, 75], Gavab [157] and CASIA datasets,
we mainly list the datasets which have been introduced in the last 15 years (as shown in Table 1). The variations
contained in each dataset are also reported in Table 1, including pose variation (p), facial expression (e), occlusion
(o), time elapse (t), and illumination variation (i). The symbol ‘-’ is used where the corresponding information is
not provided. It is obvious that most datasets introduced before 2012 were collected by expensive but accurate
3D acquisition systems including Minolta Vivid, 3dMD and Di3D scanners. With the release of the low-cost
Microsoft Kinect sensor in 2011, the majority of datasets in recent years were collected by Kinect sensors, introducing new challenges to the 3D face recognition community, such as low resolution, high noise, and missing data. In the following subsections, we will briefly describe the FRGC dataset, the BU-3DFE dataset, the
Bosphorus dataset, the Gavab dataset, and the 4DFAB dataset.
2.2.1 FRGC dataset. The FRGC dataset contains 4950 3D facial scans of 466 subjects. All of these scans are
captured frontally with a Minolta Vivid 900/910 scanner at a resolution of 0.6mm in the x and y directions. This dataset is further divided into a training dataset (FRGC v1) and a validation dataset (FRGC v2.0). The training dataset contains 943 scans of 200 different individuals collected in the 2002-2003 academic year, and the validation
dataset contains 4007 scans of 466 individuals collected during the 2003-2004 academic year. The number of scans
per subject varies from 1 to 22. In addition, the validation dataset contains 2410 scans with neutral expression,
and 1597 facial scans with various facial expressions such as disgust, happiness, sadness, surprise, and anger.
2.2.2 BU-3DFE dataset. The BU-3DFE dataset contains 2500 3D facial scans of 100 subjects (44 males and 56 females) with different ages and ethnic/racial ancestries. For each subject, there is one scan with a neutral expression, and six basic non-neutral facial expressions (anger, disgust, fear, happiness, sadness, and surprise), each captured at four levels of intensity.
Table 1. Major 3D Face Datasets. The variations in each dataset are listed, including pose variation (p), facial expression
(e), occlusion (o), time elapse (t), and illumination variation (i). The availability of 2D texture images in each dataset is also
provided, where ‘Y’ and ‘N’ denote Yes and No, respectively.
No. Name Year # Subjects # Images Acquisition Variations Texture Res.
1 Gavab [157] 2004 61 549 Minolta Vivid sensor p, e, o N -
2 BJUT-3D [234] 2005 500 - Cyberware 3030 e Y High
3 FRGC v1 [182] 2005 200 943 Minolta Vivid 900/910 e Y High
4 FRGC v2.0 [182] 2005 466 4007 Minolta Vivid 900/910 e, t Y High
5 BU-3DFE [236] 2006 100 2500 3dMD e Y High
6 ND-2006 [66] 2006 888 13450 Minolta Vivid 910 e Y High
7 CASIA [228] 2006 123 4059 Minolta Vivid 910 p, e, i Y High
8 FRAV 3D [50] 2006 105 1696 Minolta Vivid 700 p, e, i, t Y High
9 ZJU-3DFED [223] 2006 40 360 InSpeck 3D MEGA Capturor DF e Y High
10 Bechman [102] 2007 475 - Cyberware 3030 e Y High
11 Bosphorus [191] 2008 105 4666 Inspeck Mega Capturor II 3D p, e, o N High
12 IV2 [181] 2008 300 2400 Minolta Vivid 7000 p, e, i Y High
13 BU-4DFE [235] 2008 101 606 videos Di3D System e Y High
14 York [99] 2008 350 5250 In-house 3D camera p, e Y Middle
15 Texas [90] 2010 118 1149 MU-2 stereo system e Y High
16 PhotoFace [239] 2011 261 7356 Photometric stereo e, t Y High
17 Houston [177] 2011 281 2150 3dMD p, e N High
18 UHDB11[211] 2014 23 1625 3dMD p, i Y High
19 SHREC 2011 [218] 2011 130 780 Roland and Escan scanners p N High
20 3D-TEC [219] 2011 107 214 Minolta Vivid 910 e Y High
21 UMB-DB [49] 2011 143 1473 Minolta Vivid 900 e, o Y High
22 Florence Superface [12, 13] 2012 50 50 videos 3dMD, Kinect p Y Low/high
23 Aalborg RGB-D Face [101] 2012 31 1581 Kinect p, e Y Low
24 3DMAD [64] 2013 17 255 videos Kinect t Y Low
25 Biwi Kinect Head Pose [71] 2013 20 over 15000 Kinect p Y Low
26 CurtinFaces [125] 2013 52 4784 Kinect p, e, o, i Y Low
27 EURECOM KinectFaceDB [155] 2014 52 936 Kinect p, e, o, t, i Y Low
28 BP4D-Spontanous Database [244] 2014 41 328 videos Di3D e Y High
29 FaceWarehouse [37] 2014 150 3000 Kinect e Y Low
30 HRRFaceD [146] 2014 18 20000 Kinect v2 p, o N Low
31 VT-KFER [5] 2015 32 1956 videos Kinect e Y Low
32 Lock3DFace [241] 2016 509 5711 videos Kinect v2 p, e, o, t, i Y Low
33 Pandora [24] 2017 22 110 sequences Kinect v2 p, o Y Low
34 COMA [186] 2018 12 20466 3dMD LLC, Atlanta e N High
35 4DFAB [44] 2018 180 1,835,513 DI4D, Kinect and grayscale camera e Y High
36 FaceScape [233] 2020 938 18760 Multi-view DSLR cameras e Y High
2.2.3 Bosphorus dataset. The Bosphorus dataset contains 4652 3D facial scans of 105 subjects (60 males and 45 females) with ages between 25 and 35. All of these scans are captured with an Inspeck Mega Capturor II 3D scanner at a resolution of 0.3mm in the x and y directions and a resolution of 0.4mm in the z direction. The number of scans per subject is between 31 and 54, and these scans are recorded under different expressions, poses, and occlusions. For facial expressions, the Bosphorus dataset contains six basic non-neutral facial expressions (anger, disgust, fear, happiness, sadness, and surprise) and 28 facial Action Units (20 Lower AUs, 5 Upper AUs, and 3 Combined AUs). For pose variations, the Bosphorus dataset contains seven yaw rotations (+10°, +20°, +30°, ±45°, and ±90°), four pitch rotations (strong and slight upwards/downwards), and two cross rotations incorporating both yaw and pitch rotations (+45° yaw and ±20° pitch). It should be emphasized
that all pose variation scans are captured with a neutral expression. For facial occlusions, there are four types
of occlusions: occlusion of the mouth with a hand, occlusion of the face with hair, occlusion of the left eye and forehead region with hands, and occlusion with glasses.
2.2.4 Gavab dataset. The Gavab dataset contains 549 3D facial scans of 61 adult Caucasian subjects (45 males
and 16 females). All of these scans are captured with a Minolta Vivid scanner at a resolution of 1.5mm in the image. These scans are recorded under different poses, expressions and occlusions. For each subject, there are two frontal facial scans with a neutral expression, four neutral-expression scans with a rotated posture of the face (looking-up (+35°), looking-down (-35°), left profile (-90°) and right profile (+90°)), and three frontal scans with non-neutral facial expressions (smile, laugh, and arbitrary expression).
2.2.5 4DFAB dataset. The 4DFAB dataset is a recently published large scale dynamic facial expression database,
and it contains 1,835,513 high-resolution 3D meshes of 180 subjects (120 males and 60 females) with ages between 5
and 75. The capturing system consists of a DI4D dynamic system for 4D face capturing and building, a microphone
for audio signal recording, a frontal grayscale camera for frontal face image recording, and a Kinect for RGB-D
data recording. To ensure that multi-modal facial data are captured simultaneously, all sensors are synchronized
with the DI4D system. The expressions of each subject include posed expressions, spontaneous expressions, and
other evident facial movements.
3 PREPROCESSING
Once a raw 3D face is obtained, preprocessing is required to make the 3D face suitable for face recognition.
Typical preprocessing operations include nose tip detection, data filtering, and pose normalization.
3.1 Nose Tip Detection
Nose tip detection can be used for several purposes in a 3D face recognition system. First, the nose tip can be used to accurately locate a 3D face in the raw 3D data [59, 139]. Second, the nose tip is a more distinctive landmark for detection than other facial parts (e.g., the eyes and cheeks) [59]. Besides, the nose tip can be used to guide the detection of other facial landmarks [139].
3.1.1 Curvature based Methods. These methods use different types of curvatures to locate potential landmarks, and then use heuristics to find the final landmarks (including the nose tip).
Colbry et al. [47] detected the nose tip as the point with the largest shape index that satisfies several heuristics (e.g., closest to the scanner). It is demonstrated that the median error of the detected landmarks is around 10mm. Chang et al. [43] detected the nose tip from a 3D face by checking the Gaussian and mean curvatures of the facial surface. Experimental results show that the nose tip landmark can successfully be detected in 99.4% of the 4485 facial images. Dibeklioğlu et al. [58] also used a Gaussian and mean curvatures based heuristic method to detect the nose tip. This method is not appropriate for 3D faces with a yaw larger than 45 degrees [177]. Colombo et al. [48] detected nose candidates by thresholding the mean curvature of a 3D face; these candidates were then filtered to obtain the nose tip using the triangle formed by the eyes and the nose. Lu et al. [142] used the shape index and heuristics to detect a set of candidate landmarks. Gupta et al. [90] first detected the nose tip by registering the query face to a 3D template face using the Iterative Closest Point (ICP) algorithm, and then used Gaussian curvature to refine the nose tip. This method is relatively computationally expensive.
These methods are intuitive, but they suffer from several limitations. First, pre-processing is required to perform accurate curvature estimation [179]. Second, these methods are sensitive to noise as the calculation of curvatures relies on the derivatives of a 3D surface [179]. Third, their applications are limited since a set of empirically designed heuristics is usually required.
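To make the surface quantities used by these methods concrete, the sketch below estimates the Gaussian curvature, the mean curvature, and the shape index of a facial range image from its first- and second-order derivatives under the Monge patch assumption z = f(x, y). It is a generic illustration rather than the procedure of any cited method, and in practice it would be preceded by smoothing because of the noise sensitivity noted above; the pixel spacing and the shape index convention (values in [0, 1]) are assumptions.

```python
import numpy as np

def curvature_maps(depth, spacing=1.0):
    """Mean/Gaussian curvature and shape index for a range image z = f(x, y)."""
    fy, fx = np.gradient(depth, spacing)          # first derivatives (rows = y, cols = x)
    fxy, fxx = np.gradient(fx, spacing)           # second derivatives
    fyy, _ = np.gradient(fy, spacing)
    denom = 1.0 + fx**2 + fy**2
    # Monge patch formulas for the Gaussian (K) and mean (H) curvature.
    K = (fxx * fyy - fxy**2) / denom**2
    H = ((1 + fx**2) * fyy - 2 * fx * fy * fxy + (1 + fy**2) * fxx) / (2 * denom**1.5)
    # Principal curvatures and the shape index mapped to [0, 1].
    disc = np.sqrt(np.maximum(H**2 - K, 0))
    k1, k2 = H + disc, H - disc
    shape_index = 0.5 - (1.0 / np.pi) * np.arctan2(k1 + k2, k1 - k2)
    return K, H, shape_index
```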
(a) Mian et al.[151] (b) Peng et al.[179] (c) Wang et al.[222]
Fig. 1. An illustration of nose tip detection methods.
3.1.2 Profile based Methods. These methods extract profiles from a 3D face and then detect the nose tip from these 2D profiles.
Segundo et al. [194] generated a profile curve and a median curve by calculating the maximum and median depth values for the points with the same y coordinate in a face range image. The nose tip is then defined as the peak of the profile curve and is further checked using both the profile and the median curves. Experimental results show that a detection rate of 99.95% is achieved on the FRGC v2.0 dataset. Mian et al. [151] cut a 3D face into several horizontal slices, and then inscribed a triangle inside a moving circle along each slice. The point with the largest triangle altitude along each slice is then considered as a nose tip candidate, which is further filtered using the Random Sample Consensus (RANSAC) approach. The remaining point with the largest triangle altitude is finally determined as the nose tip, as shown in Fig. 1(a). This method is very time-consuming [222], and it is limited to near-frontal faces with small yaw and pitch variations [177].
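As a simple illustration of the profile idea (in the spirit of Segundo et al. [194], but not their exact procedure), the sketch below builds a profile curve from the maximum depth of each row of a near-frontal range image and takes its most prominent peak as a nose tip candidate; the background value and the absence of the median-curve check are simplifying assumptions, and the candidate would normally be verified as described above.

```python
import numpy as np

def nose_tip_from_profile(depth, invalid=0.0):
    """Crude nose tip candidate for a near-frontal range image.

    depth   : 2D array where larger values mean closer to the sensor
    invalid : value marking background / missing pixels
    """
    masked = np.where(depth != invalid, depth, -np.inf)
    profile = masked.max(axis=1)        # profile curve: maximum depth per image row
    row = int(np.argmax(profile))       # most prominent peak of the profile
    col = int(np.argmax(masked[row]))   # closest point within that row
    return row, col

# Usage: r, c = nose_tip_from_profile(range_image)   # pixel coordinates of the candidate
```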
Faltemier et al. [68] rotated a 3D face around the vertical axis to obtain 37 profiles, and then matched each profile with two manually extracted nose models along the profile; the location with the minimal matching error is finally determined as the nose tip. A detection rate of over 96.5% is achieved on faces under pose, expression, and occlusion variations. However, this method is sensitive to scale variations. Peng et al. [179] rotated a 3D face 61 times to generate a set of left-most and right-most profiles. Nose tip candidates are detected by moving a circle along each face profile and checking the area of the circle enveloped inside the profile. Nose tip candidates are further filtered using a cardinal point fitness and a spike fitness, as shown in Fig. 1(b). This method achieves a detection rate of 99.43% on the FRGC v2.0 dataset and is also able to estimate the roll, yaw and pitch angles of a face.
Wang et al. [222] first obtained the central profile of a 3D face by intersecting the facial surface with its symmetry plane, and then determined the nose tip as the point on the central profile with the largest distance to the fitting plane of the facial surface, as shown in Fig. 1(c). It is demonstrated that 99.75% of nose tips are correctly detected on the FRGC v2.0 dataset with a 4mm tolerance error, which is better than [141]. Spreeuwers [205] first projected a 3D face onto its symmetry plane to obtain a profile, and then detected the point on the profile with the largest depth value. Next, a straight line is fitted to the profile around the detected point, and the nose tip is finally determined as the intersection between this fitted line and the line passing through the detected point along the depth direction. It is claimed that the detected nose tips are slightly more stable than those detected using the largest curvature or coordinate value.
3.1.3 Depth based Methods. These methods assume that the nose tip is the point closest to the sensor and detect
the nose tip using depth information.
Lu et al. [139, 141] first found the position with the maximum depth value in each row of a depth image. The column containing the largest number of these selected positions was used to determine the mid-line of the 3D face. The nose tip was then found along the mid-line using the gradient of the mid-line curve and the depth value. A nose tip localization accuracy of 5mm was achieved. However, this method can only work on frontal faces. To handle 3D faces with different frontal poses, Lu and Jain [140] rotated a face scan around the vertical axis and determined the nose tip candidates as the points with the largest depth value. These candidates were then filtered by checking the nose profile. This method still does not consider the pitch variation of a 3D face. Mohammadzade and Hatzinakos [156] first detected nose tip candidates using the depth information. They then trained a PCA space using a set of nose region surfaces. A candidate was considered as a nose tip if the distance between the nose region of that candidate and its projection onto the PCA space was smaller than a threshold. Experimental results showed that all nose tips in the FRGC dataset can be successfully detected.
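The candidate verification step of Mohammadzade and Hatzinakos [156] can be sketched as follows: a PCA subspace is learned from cropped nose-region depth patches, and a candidate is kept only if its surrounding patch is well reconstructed by that subspace. The patch representation, the number of components and the threshold are illustrative assumptions, not values from the original paper.

```python
import numpy as np

class NoseSubspace:
    """PCA subspace of nose-region depth patches used to verify nose tip candidates."""

    def __init__(self, n_components=20):
        self.n_components = n_components

    def fit(self, patches):                      # patches: (N, h, w) training nose regions
        X = patches.reshape(len(patches), -1).astype(float)
        self.mean_ = X.mean(axis=0)
        # Principal directions from the SVD of the centered training patches.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def reconstruction_error(self, patch):
        x = patch.reshape(-1).astype(float) - self.mean_
        projection = self.components_.T @ (self.components_ @ x)
        return np.linalg.norm(x - projection)

def is_nose_tip(model, patch, threshold=5.0):    # the threshold value is illustrative
    """A candidate is accepted when its patch lies close to the learned nose subspace."""
    return model.reconstruction_error(patch) < threshold
```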
3.1.4 Learning based Methods. These methods first learn a model from a set of training data around labelled nose tips, and then use the trained model for nose tip detection.
Xu et al. [229] first defined an effective energy to characterize the point distribution of a local surface and to select nose tip candidates; they then used the means and variances of the effective energy sets to train an SVM classifier, which finally determines the nose tip location. A correct detection rate of 99.3% is achieved on a dataset containing 280 faces. Mian et al. [150] extracted Haar-like features from a facial range image and its x and y gradient images to train the AdaBoost algorithm for nose detection. Multiple nose detection results in the three images are clustered and anthropometric ratios are used to remove outliers; a detection rate of 99.18% is reported on the FRGC v2.0 dataset. Wang et al. [221] first trained individual PCA subspaces for four landmarks including the nose tip using the point signature feature [46]. Each point on a query face is then projected onto the subspace and the one with the smallest reconstruction error is considered as the landmark. However, this method is computationally expensive. Zhao et al. [247] used a statistical model (i.e., PCA) to learn both the global variations in 3D face morphology and the local variations around each face landmark using both texture and geometry information. The landmarks (including the nose tip) are determined by maximizing a posterior probability. A localization error of less than 5mm is achieved on the FRGC dataset, but the method is very time-consuming. Besides, several methods for 3D facial landmark detection are also available in the literature, e.g., [29, 55, 70, 79, 143, 174, 180, 204], which are highly related to nose tip detection.
3.2 Data Filtering
Raw 3D facial scans usually contain spikes, holes and noise due to the low scanning quality [144]. Spikes are commonly detected by checking the discontinuity of points, and are removed by thresholding [63, 151] or median filtering [67, 185, 247]. Besides, holes can be found in 3D facial scans due to spike removal, self-occlusion, specular reflection of the local surface, and light absorption in dark areas. Small holes can be filled using linear interpolation [63, 156, 205], bilinear interpolation [177], or the linking of boundary edges [59]. Large holes can be inferred using the prior of face symmetry [205]. The noise in 3D facial scans can further be smoothed using different filtering methods, such as the 2D Wiener filter [156] and the bilateral smoothing filter [63]. Finally, resampling is usually performed on the cropped 3D face to ensure a uniform distribution of 3D facial points [151, 177].
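A minimal sketch of such a filtering pipeline, assuming the scan is stored as a range image with zeros marking missing data: spikes are detected as large deviations from a local median, the resulting (and pre-existing) small holes are filled by interpolation, and the remaining noise is smoothed. The window sizes, threshold and filters are illustrative choices, not those of any particular cited method.

```python
import numpy as np
from scipy.ndimage import median_filter, gaussian_filter
from scipy.interpolate import griddata

def clean_range_image(depth, spike_thresh=5.0, hole_value=0.0):
    """Spike removal, hole filling and smoothing for a raw facial range image."""
    z = depth.astype(float)
    valid = z != hole_value
    # 1. Spike removal: points far from the local median are treated as outliers.
    med = median_filter(z, size=5, mode='nearest')
    valid &= np.abs(z - med) <= spike_thresh
    # 2. Hole filling: interpolate missing pixels from the remaining valid ones.
    rows, cols = np.indices(z.shape)
    filled = griddata((rows[valid], cols[valid]), z[valid],
                      (rows, cols), method='linear', fill_value=np.nan)
    # 3. Light smoothing of the remaining noise.
    return gaussian_filter(np.nan_to_num(filled, nan=hole_value), sigma=1.0)
```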
3.3 Pose Normalization
To address the pose variations of facial scans, pose normalization is required by 3D face recognition algorithms working on pose-dependent features. Mian et al. [151] performed PCA on the points of a cropped 3D face to generate three principal axes, which are then used to form a rotation matrix for face pose normalization. The aligned face is then resampled and the pose normalization process is repeated until convergence. Experimental results on
the FRGC v2.0 dataset showed that the algorithm is robust to facial expressions and hair. This algorithm has been used in several 3D face recognition systems [85, 122, 123, 156]. Spreeuwers [205] used the vertical symmetry plane of a facial scan and the slope of the nose bridge to define an Intrinsic Coordinate System (ICS) for the face, and then aligned the face with the ICS to achieve pose normalization. Similarly, Wang et al. [222] used the nose tip, the nose bridge direction, and the unit normal of the symmetry plane to perform pose normalization for a 3D facial scan. Besides, pose normalization can be achieved using facial landmarks. Theoretically, a minimum of three landmarks on a face is sufficient to perform pose normalization [142, 154]. Furthermore, pose normalization can also be achieved by registering a 3D facial scan to a reference 3D face, which is usually an average face model in a canonical pose generated from training data [154].
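The PCA-based normalization of Mian et al. [151] can be sketched as follows: the principal axes of the cropped facial point cloud define a rotation to a canonical frame, and the step is repeated until the estimated rotation is close to the identity. The resampling between iterations and the sign-disambiguation details of the original method are omitted or simplified here.

```python
import numpy as np

def pose_normalize(points, max_iter=10, tol=1e-4):
    """Align a cropped 3D face (N x 3 points) with its principal axes (Mian-style sketch)."""
    pts = points - points.mean(axis=0)
    total_rotation = np.eye(3)
    for _ in range(max_iter):
        _, eigvecs = np.linalg.eigh(np.cov(pts.T))   # eigenvectors of the covariance matrix
        R = eigvecs[:, ::-1].T                       # rows = principal axes, decreasing variance
        # Resolve the sign ambiguity of each axis and keep a proper rotation (no reflection).
        signs = np.sign(R[np.arange(3), np.abs(R).argmax(axis=1)])
        R = R * signs[:, None]
        if np.linalg.det(R) < 0:
            R[2] *= -1
        pts = pts @ R.T
        total_rotation = R @ total_rotation
        if np.abs(R - np.eye(3)).max() < tol:        # converged: the new rotation is ~identity
            break
    return pts, total_rotation
```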
4 3D FACE RECOGNITION
According to the facial representation type, geometry based 3D face recognition algorithms can further be classified into landmark based, curve based, local patch based, and holistic methods.
4.1 Landmark based Algorithms
Gordon [81] used the left eye width, the right eye width, the eye separation, the total width of the eyes, the nose height, the nose width, the nose depth, the head width and the curvatures to generate a feature descriptor of a 3D face. The 3D face recognition experiments are performed by calculating the distances between these feature descriptors of 24 faces. Hüsken et al. [105] first extracted several facial landmarks (e.g., nose, eyes, and mouth) and then used Hierarchical Graph Matching (HGM) to perform 2D and 3D face recognition. It is observed that the fusion of the 2D and 3D modalities improves the results compared with a single modality.
Gupta et al. [90] used the Euclidean and geodesic distances between the 45 pairs (i.e., $\binom{10}{2} = 45$) of 10 anthropometric facial fiducial points as the feature of a 3D face. The stepwise linear discriminant analysis method is then used for feature selection and Fisher's Linear Discriminant Analysis (LDA) classifier is employed to perform 3D face recognition. Experimental results on the Texas 3D Face Recognition Database show that an EER of 1.98% and an R1RR of 96.8% are achieved with automatically detected fiducial points. However, the detection of these fiducial points requires the frontal upright position of a 3D face [90] and is also computationally expensive [59].
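To illustrate the kind of representation used by Gupta et al. [90], the sketch below converts a set of detected 3D fiducial points into a vector of pairwise Euclidean distances (10 landmarks yield the 45 pairs mentioned above); the geodesic counterparts would additionally require the facial mesh and are not shown, and the landmark ordering is an assumption.

```python
import numpy as np
from itertools import combinations

def pairwise_distance_feature(landmarks):
    """Pairwise Euclidean distances between detected 3D fiducial points.

    landmarks : (K, 3) array of landmark coordinates; K = 10 gives a 45-dimensional vector.
    """
    return np.array([np.linalg.norm(landmarks[i] - landmarks[j])
                     for i, j in combinations(range(len(landmarks)), 2)])

# Such vectors would then go through feature selection and an LDA classifier, as described above.
feature = pairwise_distance_feature(np.random.rand(10, 3))   # toy example
assert feature.shape == (45,)
```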
4.2 Curve based Algorithms
These methods extract curves or strips from 3D facial surfaces as feature representations for 3D face recognition. These methods can further be divided into profile-based and contour-based methods, where profiles represent open curves with starting and end points, and contours represent non-intersecting and closed curves with different lengths [144, 197]. The two major problems of these methods are curve extraction and representation matching [222].
4.2.1 Profile based Algorithms. These methods extract vertical profiles, horizontal profiles or radial curves from 3D facial surfaces for face representation.
Nagamine et al. [164] conducted the pioneering work to test the distinctiveness of three different types of profiles (i.e., vertical, horizontal and circular profiles) extracted from various locations on a 3D facial surface. It is observed that the vertical profiles in the central region of a face, the circular sections crossing the inner corners of the eyes and the part of the nose are highly distinctive. In contrast, the distinctiveness of horizontal profiles is relatively low. Beumier and Acheroy [16] extracted the central and the lateral profiles from a 3D face based on the
(a) Zhang et al. [242] (b) Drira et al. [59] (c) Lei et al. [123] (d) Samir et al. [189] (e) Srivastava et al. [207] (f) Berretti et al. [10]
Fig. 2. Examples of different curve based 3D face representations.
assumption of vertical facial symmetry. They then performed face recognition by comparing the curvatures on the profiles of two faces. Experiments are conducted on a 3D face dataset acquired by an in-house structured light system, and an EER of 7.25% is achieved. It is observed that the nose, eyes, moustaches and beards are challenging for 3D scanning. Besides, combining frontal and profile views can improve recognition performance. Beumier and Acheroy [17] further combined the 3D and grey-level data along the central and lateral profiles to improve the face recognition performance. Based on the assumption that a 3D face is symmetric, Pan et al. [171] proposed a robust symmetry plane detection method to extract facial profiles. Profiles are then matched using the Hausdorff distance for face recognition. Zhang et al. [242] used a symmetry profile (i.e., a vertical profile which passes through the nose tip), a forehead profile and a cheek profile to represent a 3D face, as shown in Fig. 2(a). The similarity between two 3D faces is then calculated as the weighted sum of the distances between these three corresponding profiles. However, this method is very sensitive to varying facial expressions.
Drira et al. [59] represented 3D facial surfaces with radial curves emanating from the nose tip by slicing the facial surface with several planes, as shown in Fig. 2(b). A Riemannian framework is then developed to analyze the elastic shapes of these curves and to match the shapes of facial surfaces. Besides, an occlusion detection and removal step is proposed based on the recursive-ICP algorithm. To handle missing data, a restoration step is further introduced using statistical estimation on shape manifolds of curves. Similarly, Lei et al. [123] proposed an Angular Radial Signature (ARS) for 3D face representation by emanating a set of curves from the nose tip at equal angular intervals, as shown in Fig. 2(c). Middle-level features are then extracted from ARSs using Kernel Principal Component Analysis (KPCA) and further fed into an SVM to perform face recognition. This method achieves good performance in terms of both recognition rate and efficiency. Yu et al. [237] represented 3D facial scans by an ordered ensemble of radial strings emanating from the nose tip in 2D space, and then matched two 3D facial scans through a string-to-string scheme based on dynamic programming. The inherent partial matching mechanism during radial string matching ultimately eliminates the impact of occlusions. Jribi et al. [116] proposed a multi-polar geodesic representation for 3D face recognition, which is invariant under the Special Euclidean group SE(3). Based on three reference points extracted from the nose tip and eye corners, a set of level curves on facial meshes are generated and then sampled uniformly to obtain a set of finite and ordered points. Finally, the principal curvatures computed on these points are used as the 3D face feature descriptor. Later, they re-parameterized three static parts around the nose and the two eyes with multi-polar geodesic representations [115]. For the static part around the nose, the nose tip and the two inner corners of the eyes are used to form the three reference points. For the static part around each eye, the center and the two outer corners of the eye are used to form the three reference points. Nassih et al. [165] took the geodesic distance as the feature of the facial curves defined by a set of manually selected points, and 3D face recognition is accomplished based on PCA and a random forest classifier.
4.2.2 Contour based Algorithms. These methods extract contours (i.e., level curves) from 3D facial surfaces for
face representation.
Samir et al. [188] represented a facial surface $S$ with a union of planar level curves of the height (i.e., depth) function, i.e., $S = \bigcup_{\lambda} C_\lambda$, where $C_\lambda = \{p \in S \mid \phi(p) = \lambda\}$ and $\phi(p)$ is the depth of point $p$. The similarity of two 3D facial surfaces is then calculated as the aggregated geodesic distance between their corresponding level curves. This work has clearly demonstrated the potential of geometric facial curves for 3D face recognition. However, the level curves are different for a face with different orientations. That is, this level curve representation is not completely invariant to rotation. Later, Samir et al. [189] represented a facial surface $S$ with a union of 3D level curves of a surface distance function from the nose tip, as shown in Fig. 2(d). This representation is invariant to rotation and translation. Numerical methods for the calculation of geodesic paths between facial surfaces in the Riemannian space are also provided. Note that the level curves can be affected by some facial expressions such as an open mouth. Besides, this method is unable to handle missing data (e.g., caused by occlusion or pose variations).
Similarly, Li et al. [128] generated a Deformation Invariant Image (DII) for a textured 3D face by sampling the intensity image with geodesic level curves (which are defined on the 3D surface). LDA is then performed on the DII representations and the Mahalanobis cosine distance between two facial representations is used to measure their similarity. Srivastava et al. [207] represented a facial surface $S$ using level curves defined in a Darcyan coordinate system, as shown in Fig. 2(e). The coordinate system is located at the nose tip, and its two coordinates specify the distances from the nose tip and from the symmetry plane of the face, respectively. Consequently, deformations or geodesic paths between 3D facial surfaces can be obtained by analyzing geodesics between level curves.
The level curve representation is further extended to a strip representation. For example, Berretti et al. [9, 10] represented each 3D face with a set of iso-surfaces generated by the points with the same geodesic distance from the nose tip, as shown in Fig. 2(f). They then encoded these iso-geodesics and their relationships as a graph representation using 3D Weighted Walkthroughs (3DWWs). 3D face recognition is achieved using a structural similarity defined on 3DWWs. It is claimed that partitioning a facial scan into iso-geodesic stripes approximates the local morphology of faces with facial expressions. This method is therefore robust to facial expressions, and it is also efficient for matching. Experimental results on the Gavab dataset show that R1RRs of 93.5% and 82% are achieved for neutral and non-neutral faces, respectively [9]. It is also reported that VRs@0.1%FAR of 96.31% and 80.87% are achieved for neutral and non-neutral faces, respectively [10]. However, this method requires that the mouths of all faces in the dataset are always open or always closed [10]. Shi et al. [196] first represented a 3D facial surface with iso-geodesic curves, and then extracted four kinds of Frenet frame based features for each point of the iso-geodesic curves.
Abbad et al. [2] decomposed each 3D facial surface into multiple Intrinsic Mode Functions (IMFs) and a residual using Surfaces Empirical Mode Decomposition (SEMD). The different scales of IMFs represent different levels of spatial oscillation modes of a surface, and the residual represents the lowest frequency component of the surface. Then, both the radial and the level facial curves are extracted from the 3D surface and each point on the extracted curves is described by the Wave Kernel Signature (WKS). Thus, each IMF and the residual surface can be represented by the radial curves, the level facial curves and their corresponding wave kernel signatures. For 3D face recognition, the similarity between surfaces (IMF and residual) at the same scale is finally computed based on the angle between feature vectors. In [92], different types of profiles and contours are evaluated to select a subset of facial curves for feature matching. An optimal combination of 8 curves achieves a Mean Average Precision (MAP) of 0.70 and a recognition rate of 92.5% on the Shape Retrieval Contest (SHREC'08) dataset.
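To make the level-curve idea concrete, the sketch below extracts planar level sets of the depth function from a cropped, pose-normalized range image, in the spirit of the height-function representation of Samir et al. [188]; their later surface-distance and geodesic formulations require mesh-based geodesic computations that are not shown. The level spacing and the use of scikit-image's contour finder are assumptions of this sketch.

```python
import numpy as np
from skimage import measure

def depth_level_curves(depth, n_levels=15):
    """Planar level curves of the height (depth) function of a cropped facial range image."""
    valid = depth[depth > 0]
    levels = np.linspace(valid.min(), valid.max(), n_levels + 2)[1:-1]
    curves = {}
    for level in levels:
        # Each level value may produce several open/closed contours in the image plane.
        curves[level] = measure.find_contours(depth, level)
    return curves
```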
4.2.3 Summary. Since curves sampled on a 3D face are denser than landmarks, they offer more geometric information about the 3D facial surface. Consequently, curve based methods are usually more discriminative than landmark based methods. Besides, curve based methods can encode the geometric information of a 3D face from
dierent areas of the face. Therefore, their robustness to facial expressions is boosted [
144
]. However, curved
based methods have also several limitations. First, these methods rely on the accurate localization of proles or
contours. Consequently, robust and accurate preprocessing of 3D faces, such as pose normalization and nose tip
detection, is highly required for the accurate localization of proles and contours. Second, these methods usually
sample an entire 3D facial surface with sparse curves, part of the surface information is lost. Consequently, their
discriminative power is still limited [197].
4.3 Local Patch based Algorithms
These methods extract local patches from 3D faces to handle the global shape variation of faces caused by pose changes, facial expression variations, noise and occlusions [11, 90, 151]. According to the type of classifiers, these methods can further be divided into sub-region feature matching, keypoint feature matching, surface registration, and machine learning based methods.
4.3.1 Sub-region Feature Matching based Algorithms. These methods first extract local patches from several pre-defined sub-regions which are usually less sensitive to facial expressions, and then calculate the similarity between two faces using the feature matching results of these extracted local patches.
Based on the signs of the mean and Gaussian curvatures, Moreno et al. [159] used HK segmentation [214] to isolate the regions of pronounced curvature from a 3D face, and then extracted 86 features (e.g., areas, distances, angles, and average curvature) from these regions. Finally, 35 discriminative features are selected to recognize 3D faces using the Euclidean distance between feature descriptors. Later, based on the signs of the mean and Gaussian curvatures, Moreno et al. [158] assigned each 3D point of the facial mesh a label describing the local shape of the surface. Then, thirty local geometrical features are selected as the most discriminating ones from a set of 86 features according to the Fisher coefficient. Face recognition is finally accomplished using a PCA or SVM classifier.
Xu et al. [231] considered that areas with larger shape variations are important to characterize individuals, and used four regions (mouth, nose, left eye and right eye) described through Gaussian-Hermite moments to represent the local shape variation information of 3D faces. Lin et al. [133] used LDA to learn the optimal weights for the fusion of the similarity scores obtained from multiple local regions of 3D faces. It is shown that the fusion of multiple regions can significantly improve face recognition performance under varying facial expressions. Zhong et al. [248] divided each image into several local patches and used the Gabor filter response vectors of each patch to generate a 3rd-order tensor. The tensors of all local patches are used to generate a number of sub-codebooks, which are further concatenated to form a Learned Visual Codebook (LVC). The $\ell_1$ distance based nearest neighbor (NN) classifier is performed on the LVCs for face recognition.
Spreeuwers [205] defined an intrinsic coordinate system for each face using the vertical symmetry plane of the face, the nose tip and the slope of the nose bridge. They then registered each face to the intrinsic coordinate system and proposed a 3D face classifier based on the fusion of several dependent region classifiers for overlapping regions, as shown in Fig. 3(a). For each region classifier, PCA-LDA is used to extract features from the range image of the face and the likelihood ratio is used as a matching score. The fusion is achieved using majority voting. Later, Spreeuwers improved this method by dealing with head motion, unreliable estimation of registration parameters, and the sensitivity to outliers during training. The verification rate at a FAR of 0.1% increases from 94.6% to 99.3% and the identification rate increases from 99.0% to 99.4% on the FRGC v2.0 dataset [206].
Alyuz et al. [6] divided the whole facial surface into four regions: the eye/forehead, nose, cheek, and mouth-chin regions, as shown in Fig. 3(b). The probe face is then registered with these four regions after a coarse registration with the average face model. Four independent sets of similarity measures between the probe and gallery faces are calculated, and then fused for face recognition. Hajati et al. [93] proposed a Patch Geodesic Distance (PGD) algorithm to transform the 2D texture map for 2.5D face recognition. Specifically, both the range image and the texture image are first partitioned into equal-sized square patches in a non-overlapping manner, as shown in Fig.
(a) Spreeuwers[205] (b) Alyuz et al.[6] (c) Hajati et al.[93]
Fig. 3. Examples of sub-region feature matching based methods.
3(c). To compute the PGD for all surface points, a local geodesic distance for each point within its patch and a global geodesic distance measuring the distance between patches in the partitioned 2.5D image are computed. Then, the 2D texture map is transformed according to the computed patch geodesic distances, and Pseudo-Zernike Moments (PZMs) are computed as a patch descriptor for each patch. The dissimilarity between a probe scan and a gallery scan is computed based on the PZMs and the location of each patch in the transformed texture map. Soltanpour et al. [203] extended the Local Derivative Pattern (LDP) to surface normal components and proposed a Local Normal Derivative Pattern (LNDP) descriptor to encode derivative direction variations. For 3D face recognition, each range image is first resized and then divided into several local patches. The histogram of the LNDP is extracted for each patch and then concatenated over all patches and different directions. The final descriptor consists of three histograms corresponding to the x, y and z channels of the normal components. The similarity between two facial surfaces is measured based on the common areas of the two histograms.
Emambakhsh et al. [62] extracted the nasal regions based on nose tip detection and face segmentation, and detected seven landmarks located on the sub-nasale, eye corners and nasal alar groove of the nasal regions. To reduce the sensitivity to noise and enable the extraction of multi-resolution directional region-based information from the nasal region, the normal vectors are derived from the depth map filtered by Gabor wavelets. Then, new keypoints are obtained by dividing the horizontal and vertical lines that connect the seven landmarks, and are described through spherical patches and nasal curves. Finally, patches and curves that are stable over different facial expressions are selected through a heuristic genetic algorithm. Ocegueda et al. [167, 168] constructed a graph from the 3D mesh of the face and utilized a Markov random field model to measure the probability of each vertex being discriminative or non-discriminative. Then, the authors extended this model and constructed a compact and robust feature consisting of 360 coefficients for face recognition.
4.3.2 Keypoint Feature Matching based Algorithms. These methods first extract a number of repeatable keypoints, and then represent the local patch around each keypoint using a surface feature descriptor. The similarity between two faces is finally calculated by matching these surface descriptors.
Wang and Chua [220] manually localized a few sparse feature points or evenly sampled a large number of dense feature points on a 3D facial scan, and then used the 3D Gabor filter and the 3D spherical Gabor filter to represent each feature point. The Least Trimmed Square Hausdorff Distance (LTS-HD) is finally used to address the partial matching problem between probe and gallery faces. Mian et al. [152] extracted a set of repeatable keypoints from locations on 3D facial surfaces with large shape variations. They then represented each keypoint with a pose invariant feature generated by fitting a surface with a uniform grid to the neighborhood of the keypoint. Local features of two 3D faces are matched to obtain two corresponding graphs. The similarity of two faces is finally
calculated as the similarity between their corresponding graphs. When using 3D data alone, this method achieves an R1RR of 93.5% and a VR@0.1%FAR of 97.4% on the FRGC v2.0 “Neutral versus All” experiment.
Huang et al. [103] first extracted multiscale extended Local Binary Patterns (eLBP) from the range image of a 3D face, resulting in several eLBP images. These eLBP images correspond to different scales and LBP attributes (i.e., the signs and absolute values of gray value differences). Then, the SIFT method [136] is applied to these eLBP images to detect keypoints and generate local feature descriptors. The similarity between a probe face and a gallery face is measured by the fusion of three similarities, i.e., the number of matched keypoint pairs, the similarity of the facial component constraint (i.e., the matching score between local features in several pre-defined subregions of the two faces), and the similarity of the facial configuration constraint based on graph matching [152]. It is claimed that this method is robust to facial expression variations, partial occlusions, and moderate pose changes. Because of its advantage of preserving the full 3D geometry of the shape, Werghi et al. [226] extended the mesh-LBP to face recognition. First, a plane based on the nose tip and the inner-corner landmark points is constructed, and an ordered and regularly spaced set of points on the plane is extracted. Then, the neighborhood facets around these grid points are defined and used to compute multi-resolution mesh-LBP descriptors. Finally, the histograms of these descriptors are integrated to represent the whole or a partial facial surface. In addition, the photometric channel can also be directly fused over the mesh support.
Smeets et al. [201] utilized the meshSIFT algorithm for 3D face recognition. Specifically, points with mean curvature extrema in scale space are first detected as salient points on the 3D facial surface. Second, canonical orientations for these salient points are calculated based on the normal vectors of each vertex. Third, each salient point is described through the concatenation of two histograms of shape indices and slant angles. The similarity between two facial surfaces is then computed based on the angle between their feature vectors. Berretti et al. [14]
extracted a number of 3D keypoints on a facial scan using the MeshDOG method [240], and represented the local surface around each keypoint using the meshHOG [240], Signature of Histograms of OrienTations (SHOT) [212], and Geometric Histogram (GH) descriptors. Face similarity is measured by the number of inliers refined by the RANSAC [74] algorithm. Berretti et al. [11] also extracted keypoints and their corresponding descriptors from the depth image of a 3D facial scan using the SIFT algorithm, and then a set of keypoint correspondences is generated by matching the SIFT descriptors of a probe face to the gallery faces. RANSAC based spatial constraints are imposed to remove outlier correspondences and the similarity between two faces is generated using the distances between facial curves connecting pairs of matched keypoints. Later, Berretti et al. [15] extended this approach through the selection of the optimal scale, and the selection of stable keypoints and the most discriminative features of the local descriptor. Compared with [14], the overall rank-1 recognition rate on Bosphorus improves from 93.4% to 94.5%, and the computational cost is reduced to 1/25. In addition, Berretti et al. [13] proposed a super-resolution approach [12] to construct a high-resolution facial model by iteratively registering a sequence of low-resolution 3D scans to a reference frame, and then performed face recognition using an approach similar to [11].
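The depth-image SIFT matching used by several of the methods above (e.g., Berretti et al. [11]) can be sketched with OpenCV as follows: keypoints and descriptors are extracted from normalized range images and matched with Lowe's ratio test, and the number of surviving correspondences serves as a simple similarity score. The normalization, the ratio threshold, and the omission of the RANSAC-based spatial filtering are simplifications of this sketch rather than details of the cited methods.

```python
import cv2
import numpy as np

def depth_to_u8(depth):
    """Normalize a facial range image to 8-bit so that OpenCV's SIFT can process it."""
    d = np.where(depth > 0, depth, depth[depth > 0].min())   # assumes background is stored as 0
    return cv2.normalize(d, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

def sift_similarity(depth_probe, depth_gallery, ratio=0.75):
    """Number of ratio-test-filtered SIFT correspondences between two range images."""
    sift = cv2.SIFT_create()
    _, des1 = sift.detectAndCompute(depth_to_u8(depth_probe), None)
    _, des2 = sift.detectAndCompute(depth_to_u8(depth_gallery), None)
    if des1 is None or des2 is None:
        return 0
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    # Lowe's ratio test keeps only distinctive correspondences; a spatial-consistency
    # check (e.g. RANSAC, as in the methods above) would typically follow.
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]
    return len(good)
```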
Li et al. [126] first detected repeatable points distributed over the entire facial region based on two principal curvatures. Then, each keypoint is described based on the Histogram of Multiple surface differential Quantities (HOMQ) descriptor which combines the Histogram of Gradient (HOG), the Histogram of Shape index (HOS), and the Histogram of Gradient of Shape index (HOGS) at the feature level. The 3D face recognition is achieved through a Sparse Representation based Classifier (SRC), which computes the accumulated sparse reconstruction error for all keypoints of a probe face. Guo et al. [85] extracted a few highly repeatable keypoints according to the geometric variation of the local surface around a keypoint, and described each keypoint through the Rotational Projection Statistics (RoPS) descriptor [86]. Face recognition is accomplished by combining local feature matching and 3D point cloud registration algorithms. Lei et al. [124] represented each facial scan with a set of local keypoints. Each keypoint is described based on Multiple Triangle Statistics (KMTS) which is robust to partial facial data, large facial expressions and pose variations. Then, a Two-Phase Weighted Collaborative
Representation Classication (TPWCRC) framework is proposed to deal with the face recognition problem.
Compared with other methods, this method pays more attention to partial data (missing parts, and occlusions)
and single training sample.
Hariri et al. [94] represented each 3D facial surface with a set of uniformly sampled feature points. Each feature point is the center of a patch with a fixed radius, and is characterized by the covariance of its geometric features. During matching, the probe facial surface is first aligned with the gallery surfaces using the ICP algorithm, and then a global similarity measure based on the geodesic distances on the manifold is computed between two surfaces. Yu et al. [238] represented a 3D facial mesh with a set of sparse 3D directional vertices (3D²V) and performed 3D face recognition using a set-to-set dissimilarity measure. Specifically, corner points are extracted from the ridge and valley curves to generate directional vertices. Each directional vertex is composed of its 3D coordinates (x, y, z) and two unit vectors pointing to its two neighboring vertices on the curve. The dissimilarity between two 3D²Vs is defined as the cost of a conversion process which makes these two 3D²Vs fully overlap. For the 3D face recognition task, the probe faces and gallery faces are first represented by sets of sparse 3D²Vs, and then the dissimilarity is computed using the Hausdorff distance (HD) or the iterative closest point (ICP) mechanisms. Gilani et al. [78] proposed to utilize dense correspondences between a large number of 3D faces to construct a Keypoint-based 3D Deformable Model (K3DM). Specifically, the faces in the dataset are first organized into a minimum spanning tree to increase the possibility of finding point matches between pairs of faces. Then, the dense correspondences are generated by an iterative process based on the currently established point matches. At each iteration, a 2D Delaunay triangulation on the X-Y plane is performed, and narrow surface patches defined on triangle edges between two parent/child nodes of the constructed tree are aligned using a non-rigid registration algorithm. The points on the narrow surface patches are extracted as keypoints based on the eigenvalues of the covariance matrix, and then matched by calculating the similarity between corresponding feature descriptors. The above process is repeated for all surface patches in a pair of faces and for all pairs of faces in the constructed tree. After obtaining the final set of point matches, dense points are then generated uniformly using a level set based sampling strategy and matched by calculating the similarity of feature vectors. The K3DM model is finally constructed based on these dense correspondences, and face recognition is performed by fitting the query face to the constructed model. Boumedinea et al. [25] constructed a dictionary based on the SURF descriptor for a dataset captured by Kinect, and conducted 3D face recognition using a KNN algorithm in the feature space.
4.3.3 Surface Registration based Algorithms. These methods divide each 3D facial surface into several local regions to handle facial expressions, and then perform surface registration between the corresponding local surfaces of two faces to generate multiple matching scores. These matching scores are finally fused to obtain the overall similarity between the two faces. Different approaches for surface segmentation, surface registration, and score fusion have been developed in the literature.
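A minimal sketch of this recipe is given below. Region segmentation is assumed to have been done already (probe_regions and gallery_regions are hypothetical lists of corresponding (N, 3) point arrays), a basic point-to-point ICP stands in for the more elaborate registration schemes cited below, and the sum rule is used for score fusion.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_rms(src, dst, iters=30):
    """Rigidly align src to dst with a basic point-to-point ICP and
    return the final root-mean-square nearest-neighbour distance."""
    tree = cKDTree(dst)
    src = src.copy()
    for _ in range(iters):
        _, idx = tree.query(src)                 # current closest-point correspondences
        matched = dst[idx]
        mu_s, mu_d = src.mean(0), matched.mean(0)
        H = (src - mu_s).T @ (matched - mu_d)    # cross-covariance for the Kabsch solution
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:                 # avoid reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        src = (src - mu_s) @ R.T + mu_d
    d, _ = tree.query(src)
    return float(np.sqrt((d ** 2).mean()))

def fused_score(probe_regions, gallery_regions):
    # sum rule over per-region registration errors (lower means more similar)
    return sum(icp_rms(p, g) for p, g in zip(probe_regions, gallery_regions))
```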
Chang et al. [41, 43] detected three regions around the nose and then matched each region independently from a probe face to gallery faces, as shown in Fig. 4(a). The three matching scores are combined to determine the identity of the probe face. Experimental results show that this method is more robust to facial expression variations than the holistic methods. Similarly, Faltemier et al. [65] segmented each facial surface into 7 regions and performed face recognition by fusing the matching scores. Later, Faltemier et al. [67] performed score-based fusion on 38 segmented regions of a facial scan. It is observed that the fusion of 28 regions using the Borda count and the consensus voting methods achieves the best performance. Mian et al. [153] registered the eyes-forehead and nose regions of a probe face to their corresponding regions of a gallery face individually (as shown in Fig. 4(b)). The matching scores are then fused to produce the face recognition result. It is reported that an identification rate of 100% and a verification rate of 99.42% are achieved on the UND Biometrics Database. It is also observed that the eyes-forehead is the most important region for 3D face recognition. This work is further extended in [151] by
HQWLUHIDFH UHJLRQ& UHJLRQ1 UHJLRQ,
(a) Chang et al.[43]
H\HVIRUHKHDG
QRVH
FKHHNV
(b) Mian et al.[153]
HQWLUHIDFH FLUFXODUQRVH
DUHD
HOOLSWLFDO
QRVHDUHD
XSSHUKHDG
(c) eirolo et al.[185]
Fig. 4. Surface registration based 3D face recognition methods.
including a rejection classier, which is based on the matching of the holistic 3D Spherical Face Representation
(SFR) and SIFT descriptors. Queirolo et al. [
185
] used four regions of a 3D face for face recognition, including
the entire face, the circular nose area, the elliptical nose area, and the upper head, as shown in Fig. 4(c). Surface
registration between corresponding regions of two faces is performed using a Simulated Annealing (SA) based
approach with the Surface Interpenetration Measure (SIM). Similarity score is obtained by fusing the SIM values
of four regions using the summing rule. It is observed that the entire face and the elliptical nose area produce the
best individual performance, while combining all regions achieves the best overall performance. These methods
are relatively robust to varying facial expressions as rigid or semi-rigid regions of faces are selected for surface
registration. However, these regions are selected heuristically and may not be the optimal choice [
222
]. Besides,
stable segmentation of these regions are also highly challenging [9].
In addition to the above methods, feature matching has been used to further improve the surface registration performance. Chua et al. [45] extracted a set of sample points from the rigid region of a probe face and used the point signature [46] to encode the local patch around each sample point. Possible transformations between two facial scans are generated by matching point signature features and then further verified by point cloud registration. The identity of the probe face is determined by the gallery face with the largest registration rate. Dibeklioglu et al. [57] estimated nasal regions based on curvature values and face recognition is accomplished through registration strategies. On the Bosphorus 2D/3D face database, the proposed method achieves a 94.10% recognition rate for frontal facial expressions and a 79.41% recognition rate for pose variations.
4.3.4 Machine Learning based Algorithms. The similarity between two faces can further be predicted by a machine
learning method, such as Support Vector Machine (SVM).
Wang et al. [221] combined the point signatures from 3D feature points and Gabor filter responses from 2D feature points to obtain an integrated feature, and then used an SVM to achieve face recognition. Cooke et al. [51] first applied 18 Log-Gabor filters on an image and then divided the image into 75 semi-independent observations using 25 square windows and 3 scales. These observations are classified individually using a modified Mahalanobis Cosine metric and then combined at the score level using an SVM. It is reported that this method is more robust to occlusions, distortions and facial expressions. Wang et al. [222, 224] proposed a Collective Shape Difference Classifier (CSDC) to achieve high performance in both recognition rate and computational efficiency. They first generated a Signed Shape Difference Map (SSDM) between two aligned 3D faces as an intermediate representation for shape comparison. Three features including the Haar-like feature, the Gabor feature, and the Local Binary Pattern (LBP) are then extracted from SSDMs to encode the local similarity between facial shapes. These features are further selected using a boosting algorithm to build three CSDCs, which are finally fused to perform 3D face recognition. This method is also very efficient, taking about 3.6s for a recognition against a gallery of 1000 faces. Li
et al. [130] utilized sparse representation and low-level geometric features for 3D face recognition. To collect such features, a uniform remeshing scheme is first applied across 3D faces. Then, all low-level geometric features are ranked according to their sensitivity to expressions. The features relatively insensitive to expressions form a descriptor, which is referred to as the Expression-Insensitive Descriptor (EID). For face recognition, both the gallery and probe faces are represented by EIDs, and face recognition is accomplished under the framework of sparse representation. Li et al. [125] utilized both depth and RGB images to perform face recognition based on multi-modal sparse coding techniques, and achieved state-of-the-art performance on the CurtinFaces dataset. Mantecón et al. [146] specifically designed a Depth Local Quantized Pattern (DLQP) descriptor to capture the depth characteristics of human faces, and then utilized an SVM classifier to perform face recognition.
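The "hand-crafted depth feature + SVM" recipe shared by several of these methods can be sketched as follows. The descriptor here is a plain LBP histogram rather than the specific SSDM or DLQP features of [222, 146], and depth_maps and labels are hypothetical training arrays of pre-aligned facial depth images.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_histogram(depth_map, P=8, R=1):
    """Uniform LBP codes of a depth image, pooled into a normalized histogram."""
    codes = local_binary_pattern(depth_map, P, R, method="uniform")
    hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
    return hist

def train_depth_svm(depth_maps, labels):
    X = np.stack([lbp_histogram(d) for d in depth_maps])
    clf = SVC(kernel="rbf", C=10.0)   # classifier predicting identities from depth features
    return clf.fit(X, labels)
```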
4.3.5 Summary. In local patch based methods, the 3D face is represented by discriminative feature descriptors extracted from the local geometric structures of sub-regions or of the neighborhoods of repeatable keypoints. Thus, local patch based methods are more robust to challenges such as facial expressions, occlusions and pose variations. However, there are also several considerations when designing a new local patch based method. First, the sub-regions or keypoints must be distributed evenly over the whole 3D face, so as to capture as much local structural information as possible. Second, the feature descriptors must be discriminative enough to describe the intrinsic geometric properties of the local patches, especially in the presence of facial expressions and pose variations. Third, the distance metrics or the classifier of feature descriptors must be specifically designed.
4.4 Holistic Algorithms
These algorithms perform 3D face recognition using the information of the whole face.
4.4.1 Statistical Algorithms. Chang et al. [40] applied the PCA technique to both 2D and 3D facial data for face recognition. An R1RR of 83.7% was achieved with the 3D modality and an R1RR of 92.8% with the 2D+3D fusion approach on a dataset containing 166 subjects. Pan et al. [170] parameterized a 3D facial surface into an isomorphic 2D planar circle to preserve the intrinsic geometrical properties. The relative depth values of facial points are mapped and eigenface analysis is performed on the mapped depth image. Mousavi et al. [160] considered the nose tip as the reference point, and normalized the 3D face shape into an image with a standard size. Then, two-dimensional PCA (2D PCA) is applied on the normalized image and the eigenvectors corresponding to the largest eigenvalues are used as the feature vectors of the 3D facial shape. Face recognition is finally conducted using an SVM classifier. Al-Osaimi et al. [4] learned the patterns of expression deformations from shape residues between non-neutral and neutral scan pairs through PCA. The eigenvectors corresponding to the top eigenvalues construct the subspace representing the large expression deformations. In the test stage, the shape residue between the probe face scan and the neutral scan is also projected onto the constructed subspace. The gallery scan with the minimal similarity measure is considered to be the match of the probe scan. Haar et al. [91] utilized PCA to model one neutral face and six neutral-to-expression models for the expressions of anger, disgust, fear, happiness, sadness and surprise. For face matching, all seven models are fitted to the scans in the dataset and three feature vectors of model coefficients are obtained to determine the similarity of faces. The PCA based 3D face recognition approach is also used in [8, 42, 98, 100, 147, 216, 217, 230].
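A minimal eigenface-style sketch of this PCA pipeline on depth images is shown below; depth_images (N, H, W) and labels are hypothetical arrays of pre-aligned, cropped facial depth maps, and a nearest-neighbour rule in the subspace stands in for the various classifiers used above.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_eigenfaces(depth_images, n_components=50):
    """Fit a PCA subspace ("eigenfaces") to flattened depth images."""
    X = depth_images.reshape(len(depth_images), -1).astype(np.float64)
    pca = PCA(n_components=n_components).fit(X)
    return pca, pca.transform(X)              # model + gallery features

def identify(pca, gallery_feats, gallery_labels, probe_image):
    q = pca.transform(probe_image.reshape(1, -1))
    dists = np.linalg.norm(gallery_feats - q, axis=1)   # nearest neighbour in the subspace
    return gallery_labels[int(np.argmin(dists))]
```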
Tsalakanidou et al. [215] first applied the Discrete Cosine Transform (DCT) to both the depth image and the color image of a face, and then used a Hidden Markov Model (HMM) to perform face verification. It is observed that a significant improvement can be obtained using both color and depth information. Cook et al. [52] first partitioned the information of a 3D facial image into frequency bands using the Discrete Wavelet Transform (DWT) or DCT, and then projected each band into a PCA or LDA subspace. The projections in that subspace are finally compared and fused using the Mahalanobis cosine metric. Xu et al. [227] utilized the information from depth and intensity images, and described each individual with local features extracted using a 2D Gabor filter. To reduce
the dimensionality of the extracted features, a novel hierarchical feature selection scheme based on LDA and AdaBoost learning is proposed to select the most effective and robust features. In addition, LDA has also been investigated for 3D face recognition in [97].
Mpiperis et al. [162] used bilinear models to model a 3D facial surface as the interaction of expression and identity components. They first used an elastically deformable model to establish correspondences between a set of 3D faces, and then used bilinear models to decouple the facial expression and identity components. Consequently, both expression-invariant face recognition and identity-invariant expression recognition can be jointly achieved. Huang et al. [104] used the histograms of geometrical features (e.g., depth, surface normal, gradient, and curvature) and 3D Local Binary Patterns (LBPs) to represent the depth image of a facial scan for face recognition. It is observed that the combination of these two features can improve the 3D face recognition performance. Liu et al. [135] characterized the details of a 3D facial surface by the energies contained in spherical harmonics with different frequencies. Specifically, the 3D facial point cloud is first aligned and projected onto spherical coordinates, and a 2D Surface Depth Map (SDM) of the 3D facial surface is then generated. Based on the SDM representation, each 3D facial surface is characterized by the energies at different frequencies of the spherical harmonics. The energies at the low frequencies capture the global shape of the facial surface, whereas the energies at the high frequencies capture the facial surface details. Finally, a subset of the most discriminative features is selected based on the training data for further classification. Smeets et al. [199] proposed an isometric deformation model based on the geodesic distance matrix to deal with expression variations. First, the region in which the vertices have a geodesic distance to the nose tip smaller than a predefined threshold is cropped. Then, that region is downsampled to the same number of points, and a set of eigenvectors corresponding to the largest eigenvalues of the Geodesic Distance Matrix (GDM) is considered as the expression-invariant and permutation-invariant shape descriptor for each face. Finally, the dissimilarity measure for face recognition is computed according to the mean normalized Manhattan distance. Later, Smeets et al. [200] combined the isometric deformation model and the region-based method (which uses only the region around the nose) to perform face recognition.
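A minimal sketch of such a GDM-based descriptor is given below. Geodesic distances are approximated by shortest paths over the mesh edge graph, the cropped mesh is assumed to be connected and already downsampled, and vertices (N, 3) and edges (M, 2) are hypothetical inputs; the published method [199] differs in its exact geodesic computation and normalization.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def gdm_descriptor(vertices, edges, k=20):
    """Leading eigenvectors of the geodesic distance matrix as a shape signature."""
    i, j = edges[:, 0], edges[:, 1]
    w = np.linalg.norm(vertices[i] - vertices[j], axis=1)     # edge lengths
    n = len(vertices)
    graph = csr_matrix((np.r_[w, w], (np.r_[i, j], np.r_[j, i])), shape=(n, n))
    gdm = shortest_path(graph, method="D", directed=False)    # graph geodesics (assumes a connected mesh)
    vals, vecs = np.linalg.eigh(gdm)                          # GDM is symmetric
    order = np.argsort(np.abs(vals))[::-1][:k]                # k largest-magnitude eigenvalues
    return vecs[:, order] * np.abs(vals[order])
```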
4.4.2 Surface Registration based Algorithms. These methods are usually time-consuming due to the use of surface
registration algorithms [9].
Cook et al. [53] first used the ICP algorithm to register a probe face and a gallery face; the registration errors are then modeled by Gaussian Mixture Models (GMMs) to differentiate intra-personal faces from extra-personal faces. Irfanoglu et al. [106] first automatically extracted several landmarks on a 3D face and then established dense point correspondences between a probe face and a gallery face using the TPS warping algorithm. The Euclidean norm between two registered 3D facial scans is used for recognition. Lu et al. [137] first detected multiple feature points on facial scans to achieve coarse registration between a probe face and a gallery face; fine registration is then performed using a hybrid ICP algorithm. A combined metric using surface matching, texture matching, and shape index matching is finally used for 3D face recognition. Lu et al. [138, 142] further integrated range and texture information for 3D face recognition using the ICP based surface registration algorithm. Faltemier et al. [69] used multi-instance enrollment to deal with facial expression variations for 3D face recognition. Particularly, a probe face is matched with multiple gallery faces of a subject using the ICP algorithm, and the minimum Root Mean Square (RMS) error is considered as the distance between the probe and the subject. It is reported that using multiple scans to enroll a person in the gallery can improve the face recognition performance. Mahoor et al. [145] first extracted ridge points on the facial surface based on principal curvatures and then constructed a 3D binary image called a ridge image based on these ridge points. Face recognition is finally accomplished through robust Hausdorff distance or iterative closest point (ICP) algorithms. ICP based surface registration algorithms have also been used in [140, 148, 173] for 3D face recognition. Besides, Russ et al. [187] used a Hausdorff distance based iterative registration algorithm to align two 3D facial scans for face recognition.
Lu et al. [141] first extracted a number of landmarks from each face, and learned 3D facial deformations from a control group containing neutral and non-neutral expression facial scans. Deformed models with synthesized expressions are then generated by transferring the deformations to the 3D neutral facial scans in the gallery. A probe face is finally recognized by matching the facial scan with the deformed models in the gallery. This method is able to perform face recognition under facial expression and pose variations. However, it is time-consuming and requires manual operation for landmark extraction. Gökberk et al. [80] systematically compared different face registration algorithms (including ICP and TPS), different 3D facial features (including point coordinates, surface normals, curvatures, depth images, and profile curves), and different decision-level fusion approaches (e.g., fixed rules, voting schemes, rank-based combination rules) for face recognition. It is observed that face registration without warping provides more discriminatory information, surface normals produce the best recognition performance among the features, and the fusion schemes further improve the recognition accuracy. Mohammadzade et al. [156] combined both the Euclidean distance and the normal distance to find the closest point pairs between the input face and the reference face. Based on the singular value decomposition (SVD) algorithm, the rotation matrix and the translation vector are computed from the cross correlation matrix. The obtained alignment matrix between the input face and the reference face results in a more accurate correspondence between their points. The above process is repeated until no more significant rotation is obtained. These accurately aligned point pairs are finally used for 3D face recognition based on discriminant analysis methods.
4.4.3 2D Parameterization based Algorithms. Bronstein et al. [30] considered facial expressions as isometric transformations and transformed a 3D face to a canonical image using Multi-Dimensional Scaling (MDS). 3D face recognition is then achieved using eigenforms of both the texture and the canonical images, or using the high-order moments of canonical images [32]. This method achieves accurate face recognition results under different facial expressions [30]. Later, Bronstein et al. [33] embedded a 3D facial surface into another face to perform partial isometry-invariant face recognition. Besides, Bronstein et al. [31, 34] generated an isometry-invariant representation by transforming a 3D face into a spherical canonical image, resulting in an improved recognition performance compared to flat embedding. These methods mainly work on frontal facial scans and assume that the mouth is closed under different facial expressions [141]. A limitation of canonical image based methods is that accurate model cropping and topological consistency are required for geodesic distance computation, and these methods are also very time-consuming [9].
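The canonical-form construction at the heart of these methods can be sketched in a few lines: pairwise geodesic distances (e.g., computed as in the GDM sketch above) are embedded with metric MDS, so that near-isometric expression deformations have little effect on the resulting coordinates. The (N, N) matrix geodesic_dists is a hypothetical input.

```python
from sklearn.manifold import MDS

def canonical_form(geodesic_dists, dim=3):
    """Embed pairwise geodesic distances into a low-dimensional 'canonical' shape."""
    mds = MDS(n_components=dim, dissimilarity="precomputed", random_state=0)
    return mds.fit_transform(geodesic_dists)   # (N, dim) expression-robust coordinates
```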
Passalis et al. [176] first fitted the Annotated Face Model (AFM) to a 3D facial scan and used the UV parameterization of the fitted AFM to obtain three deformation images. A wavelet transform is then used to extract a biometric signature from each deformation image, and face recognition is finally performed by comparing the biometric signatures of probe and gallery faces. Kakadiaris et al. [119] represented facial scans with an Annotated Face Model (AFM), which maps all vertices of the model's surface from $\mathbb{R}^3$ to $\mathbb{R}^2$ and vice versa based on a continuous global UV parameterization. Then, the fitted model is converted into a 2D geometry image to encode the surface information of the model. From the geometry image, a normal map image, which distributes the information evenly among its three components, is constructed. Finally, these images are analyzed using Haar and Pyramid transforms and the spectral coefficients are used for comparison between different subjects. Mpiperis et al. [161] proposed a geodesic polar parameterization for 3D facial surfaces. Specifically, a point on a 3D face is represented by a path length and a pole angle, where the path length is the geodesic distance between the pole (e.g., the nose tip) and the point, and the pole angle is the directed angle between the geodesic path linking the pole and the point and a reference geodesic path ending at the pole. Using this parameterization, a 2D representation (namely, geodesic polar images) can be obtained for each 3D face by mapping a specific surface attribute (e.g., curvature, depth). 3D face recognition is finally achieved by using the eigenface method on these geodesic polar images. This method can handle surface deformations caused by facial expressions. However, it needs to detect the lips for faces with an open mouth [34]. Al-Osaimi et al. [3] computed 11 local rank-0 tensor fields from two local neighborhoods of the
vertex for each mesh vertex, and computed 3 global rank-0 tensor fields from the cropped face. All of these tensor fields are invariant to rigid transformations, and are then integrated into multiple 2D histograms of the surface area. Finally, the PCA coefficients of the 2D histograms are concatenated into a single feature vector to represent the face surface. Dutta et al. [61] extracted features from the complementary components of range facial images. First, each range facial image is decomposed into four basic components according to the first-order partial derivatives along the X and Y axes, respectively. Then, four hybrid components are linearly generated based on these four basic components, and all eight components are fused through a genetic algorithm. To select useful features from the fused feature vectors, a two-stage particle swarm optimization (PSO) algorithm is adopted to maximize the recognition rate and minimize the number of features. Final face recognition is performed using an SVM classifier.
4.4.4 3D Morphable Model based Algorithms. The 3D Morphable Model (3DMM) has also been investigated as an intermediate means for 3D face recognition. For example, Amberg et al. [7] proposed an expression and pose invariant 3D Morphable Model (3DMM) by removing pose and expression components during the non-rigid ICP based 3DMM fitting process. Paysan et al. [178] utilized the generative Basel Face Model (BFM) to model face shapes and textures, and the similarity of two faces is measured according to the angle between the coefficients of the BFM in Mahalanobis space. Ter Haar and Veltkamp [210] first constructed a 3D Morphable Model based on the USF HumanID 3D Face Database [20], and each 3D face scan is fitted to the 3DMM through a global-to-local fitting scheme. To obtain a precise fit to the model, the authors also proposed to fit the face scan to seven predefined face components and blend the borders of these components through a post-processing step. Finally, they performed 3D face recognition on the UND datasets [42] using the distances between 15 facial landmarks and the distances between 135 sample points on three facial curves. The recognition results show that the method based on seven face components achieves the best performance, demonstrating that making the recognition method invariant to facial expressions can increase recognition performance. Blanz et al. [19] proposed to fit facial scans to the 3DMM by simultaneously optimizing the shape, texture, pose and illumination. 3D face recognition is performed using the scalar product between two 1000-dimensional coefficient vectors.
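Coefficient-space matching of this kind amounts to an angle (cosine) comparison of the fitted 3DMM coefficients after whitening by the model's per-component variances. The sketch below assumes fitting has already been done; coeffs_a, coeffs_b, and component_vars are hypothetical.

```python
import numpy as np

def mahalanobis_cosine(coeffs_a, coeffs_b, component_vars):
    """Cosine of the angle between two 3DMM coefficient vectors in Mahalanobis space."""
    a = coeffs_a / np.sqrt(component_vars)   # whiten each coefficient by its model variance
    b = coeffs_b / np.sqrt(component_vars)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```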
4.4.5 Summary. Although holistic methods have been extensively investigated in the past years, with acceptable performance achieved on different datasets, they can only handle situations where the whole facial surface is available. These methods cannot be used when some parts of the facial surface are missing, e.g., due to occlusions or pose variations. The evaluation results in Section 6 also demonstrate the above observations.
5 DEEP LEARNING FOR 3D FACE RECOGNITION
Due to the availability of massive training data, deep learning-based methods have shown remarkable performance in the field of 2D face recognition [82]. However, because of the lack of large-scale 3D face datasets, deep learning-based 3D face recognition techniques are still in their infancy. A 3D face can be represented with different types of representations, such as 2D representations (e.g., projected views, depth images) and 3D representations (e.g., point clouds, meshes, voxels). Different representations usually require different types of data processing and face recognition techniques. Therefore, we categorize these methods into learning based on 2D representations, learning based on 3D representations, and learning based on disentangled representations.
5.1 Learning based on 2D Representations
To fully utilize the achievements of 2D deep learning-based methods for 3D face recognition, many methods have been proposed to project 3D faces onto 2D images, and then utilize mature 2D deep learning-based face recognition techniques to perform 3D face recognition. Kim et al. [120] first pre-trained a convolutional neural network (CNN) on a large-scale 2D face dataset, and then fine-tuned the network with expression and pose
variations augmented 3D facial scans. Each 3D facial scan is orthogonally projected onto a 2D depth map, and hard occlusions are added by randomly removing patches from the converted depth map. The fine-tuned CNN is used as the feature extractor, and the similarity between the gallery and the probe set is finally computed based on the learned features. Li et al. [127] projected each 3D facial surface onto a 2D plane and three images of the normal components are estimated based on a local plane fitting method. Then, to generate a deep normal representation, each normal image is fed into a deep face net pre-trained on a 2D face dataset. Finally, a location-sensitive sparse representation classifier is proposed to emphasize the importance of different facial parts. Gilani et al. [77] proposed a Deep Landmark Identification Network (DLIN) with a binary classification loss to detect 11 facial landmarks. The training dataset with known locations of landmarks is synthetically generated using the commercial software FaceGen and contains 3D faces augmented with various shapes. Specifically, variations from age, masculinity/femininity, weight, height, four different facial expressions (surprise, happiness, fear, and disgust), and five different poses (frontal, ±15° in pitch and ±15° in roll) are considered. Each generated 3D face is converted to a spherical representation, and three channels (i.e., depth, azimuth and elevation) of images are generated as the input of DLIN. Based on five detected fiducial landmarks (i.e., the nose tip, the upper and lower lip centers, and the outer eye corners), each 3D face is segmented into five regions based on geodesic level set curves. Then, discriminative keypoints are extracted in each region and dense correspondences across faces are obtained to generate a Region based 3D Deformable Model (R3DM). 3D face recognition is performed by minimizing the cosine distance between the R3DM model parameters of probe and gallery faces. Borghi et al. [22] proposed a depth-based face verification network, JanusNet, which contains three Siamese modules (for depth, hybrid and RGB images) with the same architecture. Specifically, the depth and hybrid Siamese networks take depth image pairs as their input, while the RGB Siamese network takes RGB image pairs as its input. During the training phase, the hybrid Siamese network is trained based on the loss of the RGB Siamese network, which forces the features learned by the hybrid network to be similar to the features of the RGB network. During the test phase, the RGB Siamese network is not employed, i.e., the final face verification result is based only on the depth and hybrid Siamese networks. Later, Borghi et al. [23] took depth maps as the input of a fully convolutional network for 3D face recognition. Random horizontal flips with probability 0.5 and random rotations in the range of [-5°,+5°] are adopted to augment the training data. Xu et al. [232] fused the depth map and texture map
to learn features through a CNN, and performed 3D face recognition through a CNN-based twin neural network. Feng et al. [73] utilized two deep CNNs to learn features from 2D images converted from facial color images and point clouds, and the two learned features are then fused as the input to the face recognition network. Olivetti et al. [169] used a 2D image with three channels (depth, shape index, and curvedness) to represent a 3D facial surface, and then fed these images into a MobileNetV2 architecture to perform 3D face recognition. Three kinds of data augmentation strategies (clockwise rotation of 25°, counterclockwise rotation of 40°, and horizontal mirroring) are adopted during training. Dutta et al. [60] proposed to use an unsupervised deep learning framework for 3D face recognition. First, the input 3D point clouds are aligned to the frontal pose and converted to 2.5D depth images. Features are then learned using a sparse principal component analysis network (SpPCANet), and finally classified using a linear SVM based classifier. Hariri and Zaabi [95] proposed a lightweight deep residual feature quantization method for 3D face recognition. After preprocessing steps such as cropping and denoising, 3D faces are transformed into 2D depth images and fed into a pretrained ResNet-50 network [96]. Radial Basis Function (RBF) neurons are then applied to quantize the learned features and face recognition is finally performed using an SVM classifier.
To address the low-quality 3D face recognition problem, Tan et al. [209] proposed a face recognition framework specifically designed for low-quality 3D data. Based on ResNet [96], a deep registration network (DRNet) is proposed to align a sequence of low-quality data, and a deep convolutional network (FRNet) is proposed to learn deep features from high-quality dense 3D point clouds fused from sequentially registered sparse low-quality data. To simulate the actual distribution of low-quality moving faces from dense and clean facial scans for DRNet,
10% of random noisy points with Gaussian distribution N(0, 4) and random poses (roll angles in the range of [-45°,45°], pitch angles in the range of [-20°,20°], and yaw angles in the range of [-30°,30°]) are first added to the dense facial scans. Then, the augmented dense facial scans are projected onto a 2D plane (which is divided into 1000 grids), and the sparse facial scans are obtained by randomly selecting one point from each grid. The above augmentation process is repeated 6 times to obtain a sequence of sparse facial data. Mu et al. [163] proposed a lightweight CNN architecture, Led3D, for 3D face recognition on low-quality depth images, and constructed a finer and larger dataset for training the deep network. Led3D utilizes four convolutional layers and a Multi-Scale Feature Fusion (MSFF) module to learn a discriminative feature representation of low-quality face data. A Spatial Attention Vectorization (SAV) module is used to capture the importance of different spatial clues in a face. Experiments demonstrate that Led3D achieves state-of-the-art performance on the low-quality 3D face recognition dataset Lock3DFace [241], and can also operate at a very high speed of 136 fps on a Jetson TX2. To train the Led3D model, pose variations (pitch angles in the range of [-40°,40°] and yaw angles in the range of [-60°,60°] with an interval of 20°), shape jittering with random Gaussian noise, and shape scaling are adopted to augment the training data. Lin et al. [131] proposed a multi-quality fusion network, MQFNet, to enhance face recognition performance. First, high-quality facial depth images are generated from low-quality depth images based on the pix2pix network [108], and the training of the network is supervised by these high-quality depth images. Then, the generated high-quality depth images and their corresponding low-quality depth images are fed into a multi-quality fusion network with two identical pipelines to learn global discriminative facial representations. To train MQFNet, in addition to the same pose variations as in [163], scale augmentation and occlusion augmentation are also adopted in the data augmentation step.
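A minimal sketch of this kind of low-quality data simulation is given below: a dense scan is perturbed with Gaussian noise on a random subset of points, rotated to a random pose, and then sparsified by keeping one point per cell of a coarse X-Y grid. Only a yaw rotation is shown, and the default noise level, angle range, and grid size are illustrative rather than the exact settings of [209] or [163].

```python
import numpy as np

def simulate_low_quality(points, noise_ratio=0.1, sigma=2.0, grid=32, rng=None):
    """Turn a dense (N, 3) facial scan into a noisy, sparsely sampled scan."""
    rng = rng or np.random.default_rng()
    pts = points.copy()
    # add Gaussian noise to a random subset of the points
    idx = rng.choice(len(pts), int(noise_ratio * len(pts)), replace=False)
    pts[idx] += rng.normal(0.0, sigma, size=(len(idx), 3))
    # random yaw rotation about the vertical axis (pitch/roll are handled analogously)
    yaw = np.deg2rad(rng.uniform(-30.0, 30.0))
    c, s = np.cos(yaw), np.sin(yaw)
    pts = pts @ np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]]).T
    # sparsify: keep one point per occupied cell of a coarse X-Y grid
    lo = pts[:, :2].min(0)
    cell = ((pts[:, :2] - lo) / (np.ptp(pts[:, :2], axis=0) + 1e-9) * (grid - 1)).astype(int)
    _, keep = np.unique(cell[:, 0] * grid + cell[:, 1], return_index=True)
    return pts[keep]
```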
Cai et al. [36] constructed a combined training dataset from four public 3D datasets (FRGC v2.0, Bosphorus, BU-3DFE, and 3D-TEC) and three in-house datasets with data augmentation, and learned features from range images of four overlapping facial component patches based on improved residual networks [96]. The final representation of a facial surface is defined as the concatenation of the feature vectors from the four facial patches. During training, five random poses in each rotation angle in the range of ±10° and another five arbitrary poses are first adopted to augment the pose variations of the 3D faces. Then, transformation augmentations (including minor random affine transformation, projection transformation, twisting, and horizontal flipping) and a resolution augmentation simulating different z-axis resolutions are also applied to the training images. Gilani et al. [76] constructed a large-scale training dataset with 3.1 million 3D facial scans of 100K individuals for 3D face recognition. They proposed two methods to generate training scans. The first method selects pairs of 3D faces with the maximum shape difference from a real dataset containing 1,785 individuals and generates 3D faces of 90,100 new individuals by varying the expressions of the face pairs based on a dense correspondence method [78]. The second method selects pairs with a smaller shape difference from a synthetic dataset containing 300 individuals, and finally generates 3D faces of 8,120 new individuals. All scans generated through these two methods are then transformed by pose variations and large occlusions. The 3D faces generated by the first method have maximum inter-person variations, whereas the 3D faces generated by the second method have smaller inter-person variations. Based on this dataset, a deep convolutional neural network named FR3DNet is proposed for 3D face recognition. Each facial scan is converted into an image with three channels corresponding to the depth, azimuth and elevation angles of the normal vector. To evaluate the performance of FR3DNet for both 3D face identification and verification, the authors also constructed a large-scale test dataset, LS3DFace, by merging several existing public datasets, such as FRGC v2.0 and Bosphorus.
These methods directly utilize 2D deep learning based face recognition techniques and achieve satisfactory performance on current datasets. However, geometric information is partially lost when the 3D facial data is transformed into a 2D representation.
5.2 Learning based on 3D Representations
Unlike learning methods based on 2D representations, many recent works directly learn facial representations from 3D facial data. Lin et al. [132] first extracted several local feature tensors from 3D face meshes and then fed them into a deep neural network for 3D face recognition. Specifically, salient points are first detected based on the meshSIFT algorithm [201], and three local features (i.e., shape index, slant angle, and relative positions to the salient point) are then extracted for each salient point and concatenated to represent the 3D face. Next, a 2D similarity tensor image is obtained through local feature tensor matching and fed into a ResNet for 3D face classification. To address the lack of large-scale 3D face datasets, a large number of feature tensors are generated based on Voronoi diagrams instead of 3D face samples. Bhople et al. [18] directly took 3D facial point clouds as the input, and proposed a PointNet-CNN architecture to learn the global representation of the 3D face. Then, pairs of learned global features are fed into a Siamese network to calculate the similarity of the two input faces. PointFace [112] encodes pairs of input point clouds with two weight-sharing encoders. In the training stage, PointFace uses both a feature similarity loss and a softmax classification loss to obtain fine-grained representations, which minimizes the embedding distance between scans of the same individual and maximizes the embedding distance between different individuals. During training, one of several strategies (including random anisotropic scaling in the range [-0.66, 1.5], random translation in the range [-0.2, 0.2], and random rotation in the range [-90°, 90°] on yaw and [-30°, 30°] on pitch) is selected to augment the training data, or the training data is kept unchanged. In the test stage, the encoders produce embeddings for the probe and gallery faces, which are then used for 3D face recognition. To fully explore the advantages of contrastive learning and boost training, a pair selection strategy is also adopted to generate positive and negative pairs in the training stage. In [195], 3D face meshes are first voxelized at three different resolutions. Then, fuzzy C-means clustering is performed to unify the count of voxels into the same size, and a 3D voxel-based face reconstruction technique is applied to the clustered voxels. A deep learning framework consisting of variational autoencoders (VAEs) and a bidirectional long short-term memory (BiLSTM) network with a triplet loss is used to extract deep facial features. Finally, an SVM based classifier is used to perform gender, emotion, occlusion and person recognition. On the 4DFAB [44] dataset, 3D face recognition and verification are tested with a simple long short-term memory (LSTM) network. RP-Net [38] integrates the RoPS [86] descriptor into PointNet++ [184] to learn facial feature representations.
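The weight-sharing point-cloud embedding idea behind methods such as PointFace [112] can be illustrated with a small PyTorch sketch: a PointNet-style encoder maps each facial point cloud to a unit-length embedding, and verification thresholds the cosine similarity of two embeddings. The architecture, embedding size, and threshold are illustrative, not the published ones.

```python
import torch
import torch.nn as nn

class TinyPointEncoder(nn.Module):
    """A minimal PointNet-style encoder for (B, N, 3) facial point clouds."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.ReLU(),
        )
        self.head = nn.Linear(256, emb_dim)

    def forward(self, pts):
        x = self.mlp(pts.transpose(1, 2))       # per-point features: (B, 256, N)
        x = x.max(dim=2).values                 # symmetric max pooling over the points
        return nn.functional.normalize(self.head(x), dim=1)

def verify(encoder, cloud_a, cloud_b, threshold=0.6):
    """Decide whether two (N, 3) facial scans belong to the same person."""
    with torch.no_grad():
        ea = encoder(cloud_a.unsqueeze(0))
        eb = encoder(cloud_b.unsqueeze(0))
    return (ea * eb).sum().item() > threshold   # cosine similarity of unit embeddings
```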
Kacem et al. [117] proposed a dynamic 3D face verification network with a triplet loss. The local deformations in 3D face sequences are first encoded by Sparse Localised deformation Components (SPLOCs) [166], and then stacked into 2D arrays for temporal modeling. Finally, the stacked arrays are fed into a triplet loss network for the final facial embedding, and 3D face verification is performed by computing the cosine similarity between the output embeddings. Papadopoulos et al. [172] proposed a novel dynamic 3D face recognition framework named Face-GCN. First, 2D landmarks are extracted from 2D texture facial images and then mapped to 3D facial meshes to extract 3D facial landmarks. The midpoints of geodesic paths on the meshes between pairs of 3D landmarks are added as new 3D landmarks. Based on these landmarks, a spatial-temporal graph containing spatial edges and temporal edges is constructed. Spatial edges are used to connect landmarks according to a predefined neighborhood relationship, and temporal edges are used to connect the same landmarks across consecutive frames in the expression sequences. Finally, 3D face recognition is performed using a spatial-temporal graph convolution network. In the experiments, a cross-emotion protocol is adopted based on the dynamic 3D facial expression dataset BU4DFE [243], which takes three emotions for training and the other three for testing. Experimental results show that the proposed Face-GCN method achieves an average recognition accuracy of 88.45% on this challenging cross-emotion protocol.
Beneting from the achievements of 3D deep learning techniques [
87
], direct face representation learning from
3D data has developed quickly in recent years. However, the datasets used for the training of 3D face recognition
networks are still small, and the performance of existing networks is also limited.
5.3 Learning based on Disentangled Representations
Similar to the idea of decoupling identity attributes from other attributes such as poses and facial expressions, and modeling different attributes with linear combinations [7, 20, 21], some recent works learn non-linear latent representations based on deep learning [26, 113, 134, 186, 245], and perform 3D face recognition based on disentangled representations. Ranjan et al. [186] constructed a hierarchical Convolutional Mesh Autoencoder (CoMA) to learn non-linear representations for modeling 3D facial expressive variations. To train the CoMA network, a dataset consisting of 20,466 meshes with 12 classes of extreme expressions from 12 subjects is introduced. Experimental results demonstrate that CoMA achieves state-of-the-art performance with 75% fewer parameters than linear PCA models. Sun et al. [208] constructed two decoders to disentangle identity and expression latent representations in a variational autoencoder framework. The network is built on an attention based point cloud transformer [83] applied directly on unordered point clouds, and utilizes mutual information regularization on the identity decoder to better reconstruct the identity face. Based on these disentanglement learning achievements, Kacem et al. [118] first learned latent representations by applying a Graph Convolutional Autoencoder [186] on pairs of neutral and expressive facial meshes, and then translated expressive representations to neutral ones using a conditional Generative Adversarial Network (cGAN) [109]. Specifically, the latent representations of both neutral and expressive faces are first learned using spectral graph convolutions [35] at the encoding stage, and then mapped back to the same neutral face mesh at the decoding stage. To map the expressive latent representation to its corresponding neutral latent representation, a translation function is learned by constraining the output and the distribution of the output to be close to the counterpart of the neutral latent representation. Finally, the translated expressive latent representation and the neutral representation are fed into a network with two fully-connected layers to perform 3D face recognition.
In summary, disentanglement learning for 3D facial variations has achieved impressive results for 3D face recognition and has huge potential for many other applications due to its powerful representation ability.
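The core pattern of these methods, using only the identity part of a disentangled latent code for recognition, can be sketched as follows. The backbone below is a stand-in (the published methods use mesh or point-cloud transformers and graph convolutions), the input is assumed to be a flattened fixed-topology face with 1024 vertices, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class DisentangledEncoder(nn.Module):
    """One encoder, two latent heads: identity and expression codes."""
    def __init__(self, in_dim=3 * 1024, id_dim=64, expr_dim=16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                      nn.Linear(512, 256), nn.ReLU())
        self.id_head = nn.Linear(256, id_dim)       # identity latent code
        self.expr_head = nn.Linear(256, expr_dim)   # expression latent code

    def forward(self, flat_face):                   # flat_face: (1, in_dim)
        h = self.backbone(flat_face)
        return self.id_head(h), self.expr_head(h)

def identity_similarity(encoder, face_a, face_b):
    """Compare two faces using only their identity codes."""
    with torch.no_grad():
        id_a, _ = encoder(face_a)
        id_b, _ = encoder(face_b)
    return nn.functional.cosine_similarity(id_a, id_b).item()
```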
6 PERFORMANCE COMPARISON
In this section, comparative results of current approaches with respect to facial expressions, pose variations, and occlusions are presented. Note that, to achieve a fair comparison, all results are directly taken from the referenced works.
6.1 Comparative Results under Facial Expressions
In this section, the performance of current algorithms under facial expressions is evaluated on the FRGC v2.0, BU-3DFE, Bosphorus and Gavab datasets. As described in Section 2.2, all these datasets contain various types of non-neutral facial expressions. Specifically, the FRGC v2.0 dataset contains 1597 non-neutral scans with various types of facial expressions. Such a large number of non-neutral facial scans makes the FRGC v2.0 dataset very popular for evaluating the robustness of algorithms to facial expressions.
Comparative experiments are conducted under different experimental settings for the face verification and identification tasks. For the 'All vs All' experiment, both the gallery set and the probe set contain all scans in the dataset. For the 'Neutral vs All', 'Neutral vs Neutral', and 'Neutral vs Non-neutral' experiments, the gallery set usually contains one neutral scan for each subject, and the probe set contains all the remaining scans for the corresponding experiment. For example, in the FRGC v2.0 dataset, the gallery set contains 466 neutral scans, while the probe set contains 1944 neutral scans, 1597 non-neutral scans, and 3541 scans in the 'Neutral vs Neutral', 'Neutral vs Non-neutral', and 'Neutral vs All' experiments, respectively. Some works [167, 205] use the first image of each subject to form the gallery set. However, not all of the first images in FRGC v2.0 are neutral images. Therefore, the experimental results from these works are not included in this paper.
Table 2. VR at 0.1% FAR results under facial expressions. 'A vs A', 'N vs A', 'N vs N', and 'N vs NN' stand for 'All vs All', 'Neutral vs All', 'Neutral vs Neutral', and 'Neutral vs Non-neutral', respectively.
Methods Modality Dataset ROC I ROC II ROC III A vs A N vs A N vs NN N vs N
Landmark based
Algorithms [105] 3D FRGC V2.0 - - 86.9% - - - -
[105] 3D+2D FRGC V2.0 - - 96.8% - - - -
Curve based
Algorithms
[10] 3D FRGC V2.0 - - - 81.2% 95.5% 91.4% 97.7%
[59] 3D FRGC V2.0 - - 97.14% 93.96% - - -
[123] 3D FRGC V2.0 - - 96.7% - - 97.8% -
[116] 3D FRGC V2.0 - - 99.9% 98.9% 99.9% 98.9% 99.9%
[115] 3D FRGC V2.0 - - 96.2% 99.7% 99.5% 99.7% 99.8%
[115] 3D BU-3DFE - - - 99.6% 99.5% - -
[115] 3D Bosphorus - - - - 99.1% 98.9% 99.9%
Local Patch based
Algorithms
[51] 3D FRGC V2.0 93.71% 92.91% 92.01% 92.31% 95.81% - -
[65] 3D FRGC V2.0 - - 88.8% 87.5% 89.0% - 97.1%
[151] 3D FRGC V2.0 - - - - 98.5% 97.0% 99.4%
[151] 3D+2D FRGC V2.0 - - - - 99.3% 98.3% 99.7%
[133] 3D FRGC V2.0 91.5% 91.0% 90.0% - - - -
[152] 3D FRGC V2.0 - - - - 97.4% 92.7% 99.9%
[152] 3D+2D FRGC V2.0 - - - - 98.6% 96.6% 99.9%
[67] 3D FRGC V2.0 - - 94.8% 93.2% 98.1% - -
[222] 3D FRGC V2.0 97.97% 98.01% 98.04% 98.13% 98.61% - -
[185] 3D FRGC V2.0 - - 96.6% 96.5% - - -
[205] 3D FRGC V2.0 94.6% 94.6% 94.6% 94.6% - - -
[168] 3D FRGC V2.0 96.2% 95.7% 95.2% - - - -
[103] 3D FRGC V2.0 95.1% 95.1% 95.0% 94.2% 98.4% 97.2% 99.6%
[167] 3D FRGC V2.0 96.2% 95.7% 95.2% - - - -
[201] 3D FRGC V2.0 - 78.97% 77.24% - - - -
[15] 3D FRGC V2.0 - - 86.6% - - - -
[206] 3D FRGC V2.0 99.3% 99.3% 99.3% 99.3% - - -
[62] 3D FRGC V2.0 - - 93.5% - - - -
[85] 3D FRGC V2.0 - - - - 99.01% 97.18% 99.9%
[124] 3D FRGC V2.0 - - - - 98.3% 96% 99.9%
[238] (HD) 3D FRGC V2.0 - - 91.1% - - - -
[238] (ICP) 3D FRGC V2.0 - - 94.5% - - - -
[124] 3D BU-3DFE - - - - 94.0% - -
[78] 3D FRGC V2.0 - - - - 98.7% 96.6% 99.9%
Holistic
Algorithms
[148] 3D+2D FRGC V2.0 - - - 93.5% 95.8% - 99.2%
[119] 3D FRGC V2.0 97.3% 97.2% 97.0% - - - -
[3] 3D FRGC V2.0 - - - - - - 95.37%
[4] 3D FRGC V2.0 94.55% 94.12% 94.05% - 98.14% 97.73% 98.35%
[227] 3D+2D FRGC V2.0 - - 95.3% - 97.5% - -
[145] 3D FRGC V2.0 90.69% 88.5% 85.75% - - - -
[91] 3D FRGC V2.0 - - - 87% - - -
[135] 3D FRGC V2.0 - - - 90% - - -
[156] 3D FRGC V2.0 - - 99.2% 99.6% - - -
[91] 3D BU-3DFE - - - 82% - - -
[91] 3D Gavab - - - 80% - - -
[135] 3D Bosphorus - - - 81.4% - - -
Deep Learning
based Algorithms
[36] 3D FRGC V2.0 - - 100% - 100% 100% 100%
[36] 3D Bosphorus - - - - 98.39% 98.30% 100%
[36] 3D BU-3DFE - - - - 98.92% - -
For face verication, the VRs at 0.1% FAR in dierent experimental settings are evaluated. The evaluation
results are shown in Table 2. For the FRGC v2.0 dataset, three additional experiments (ROC I, ROC II, and ROC
III) are also conducted. ROC I means that gallery and probe scans are collected within a semester, ROC II means
that gallery and probe scans are collected within a year, and ROC III means that gallery and probe scans are
collected in dierent semesters. For face identication, R1RR in dierent experimental settings is evaluated. The
evaluation results are shown in Table 3.
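For reference, both reported metrics are straightforward to compute from raw similarity scores; the minimal sketch below shows the verification rate (VR) at a fixed FAR from genuine/impostor score arrays and the rank-1 recognition rate (R1RR) from a probe-vs-gallery similarity matrix. Variable names are hypothetical.

```python
import numpy as np

def vr_at_far(genuine_scores, impostor_scores, far=1e-3):
    """Verification rate at the threshold that accepts a fraction `far` of impostors."""
    thr = np.quantile(impostor_scores, 1.0 - far)
    return float((genuine_scores > thr).mean())

def rank1_rate(similarity, probe_labels, gallery_labels):
    """Rank-1 recognition rate from a (num_probes, num_gallery) similarity matrix."""
    best = np.asarray(gallery_labels)[similarity.argmax(axis=1)]
    return float((best == np.asarray(probe_labels)).mean())
```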
Several observations can be derived from Tables 2 and 3:
• Local patch based algorithms have attracted the most research interest, and have achieved almost the best performance under different experimental settings. This is mainly because local patch based algorithms can use local surface features to handle facial expressions.
• Compared to the 'Neutral vs Neutral' experiments, the performance of current algorithms in the 'Neutral vs Non-neutral' experiments on existing datasets still needs to be improved. In addition, the types of facial expressions in existing datasets are limited. Thus, more attention should be paid to the design of facial expression challenges.
• Deep learning based 3D face recognition algorithms have developed very slowly due to the lack of large-scale datasets. Meanwhile, the performance of existing algorithms on small-scale datasets is saturated. Therefore, more large-scale datasets rich in facial expressions are needed.
6.2 Comparative Results under Pose Variations
The robustness of current algorithms to pose variations is evaluated on the Gavab and Bosphorus datasets. As described in Section 2.2, the Gavab dataset contains four types of pose variations, and the Bosphorus dataset contains 13 types of pose variations. Comparative experiments are conducted on the face identification task in terms of R1RR, and the results are shown in Table 4. For these two datasets, the gallery set contains one frontal scan with a neutral expression for each subject. For the Gavab dataset, the probe sets of 'looking down', 'looking up', 'right' and 'left' contain scans under these specific poses. For the Bosphorus dataset, the scans with pose rotations are further divided into four subsets: Yaw Rotations, Yaw Rotations 90, Pitch Rotations and Cross Rotations. 'Overall' in Table 4 means that all subsets of the probe set are used for these two datasets.
Several observations can be derived from Table 4:
• Local patch based methods are the most frequently used methods to deal with pose variations. This is because local patch based methods utilize only the local structural information for face recognition, which is more robust to pose variations than other types of methods.
• Current algorithms perform well on pitch rotations on both datasets. Note that the subset of pitch rotations of the Bosphorus dataset corresponds to the 'looking down' and 'looking up' subsets of the Gavab dataset.
• The performance on the subsets of yaw rotations is worse than on the subsets of pitch rotations and cross rotations. Current algorithms perform worst on the yaw rotation subsets of the Bosphorus dataset and the 'right' and 'left' subsets of the Gavab dataset.
In summary, pose variations with yaw rotations still raise great challenges for current 3D face recognition algorithms.
6.3 Comparative Results under Occlusions
The robustness of current algorithms to occlusions is evaluated on the Bosphorus dataset. As described in Section 2.2, the Bosphorus dataset contains four types of occlusions. Comparative experiments are conducted on the face identification task in terms of R1RR, and the results are shown in Table 5. Specifically, the gallery set contains one scan with a neutral expression for each subject, and the probe set contains 381 facial scans with occlusions.
Table 3. Identification results under facial expressions. A vs A’, ‘N vs A’, ‘N vs N’, and ‘N vs NN’ stand for ‘All vs All’, ‘Neutral
vs All’, ‘Neutral vs Neutral’, and ‘Neutral vs Non-neutral’, respectively.
Methods Modality Dataset R1RR (ROC III) R1RR (A vs A) R1RR (N vs A) R1RR (N vs NN) R1RR (N vs N)
Curve based
Algorithms
[59] 3D FRGC V2.0 - - 97.7% 96.8% 99.2%
[116] 3D FRGC V2.0 - 98.0% 96.9% 94.3% 95.9%
[115] 3D FRGC V2.0 - 99.3% 99.5% 99.6% 99.8%
[115] 3D Bosphorus - - 99.0% 99.0% 99.9%
[115] 3D BU-3DFE - 99.0% 99.6% - -
[59] 3D Gavab - - 96.99% 94.54% 100%
Local Patch based
Algorithms
[51] 3D FRGC V2.0 - - 94.63% - -
[151] 3D+2D FRGC V2.0 - - 97.37% 95.37% 99.02%
[152] 3D FRGC V2.0 - - 93.5% 86.7% 99.0%
[152] 3D+2D FRGC V2.0 - - 96.1% 92.1% 99.4%
[67] 3D FRGC V2.0 - - 98.1% - -
[222] 3D FRGC V2.0 - - 98.39% - -
[185] 3D FRGC V2.0 99.6% 99.7% - - -
[103] 3D FRGC V2.0 - - 97.6% 95.1% 99.2%
[11] 3D FRGC V2.0 - - 95.6% 92.8% 97.3%
[201] 3D FRGC V2.0 87.19% - - - -
[62] 3D FRGC V2.0 - - 97.9% 98.5% 98.45%
[85] 3D FRGC V2.0 - - 97.0% 94.0% 99.4%
[124] 3D FRGC V2.0 - - 96.3% 92.2% 99.6%
[203] 3D FRGC V2.0 - - 98.1% - -
[103] 3D Bosphorus - - 97.0% - -
[201] 3D Bosphorus - - 93.66% - -
[14] 3D Bosphorus - - 93.4% - 97.9%
[15] 3D Bosphorus - - 94.5% - 98.5%
[126] 3D Bosphorus - - 96.6% 98.8% -
[62] 3D Bosphorus - - 95.35% - -
[203] 3D Bosphorus - - 97.3% - -
[78] 3D Bosphorus - - 98.6% - -
[93] 3D BU-3DFE - - 84.8% - -
[14] 3D BU-3DFE - - 87.5% - -
[11] 3D Gavab - - - 96.17% 100%
[14] 3D Gavab - - - 94% 100%
[124] 3D Gavab - - 96.99% 95.08% 100%
[78] 3D FRGC V2.0 - - 98.5% 96.9% 99.9%
[94] 3D Gavab - - 97.81% 100% 100%
Holistic
Algorithms
[3] 3D FRGC V2.0 - - - - 93.78%
[4] 3D FRGC V2.0 - - 96.52% 95.2% 97.58%
[91] 3D FRGC V2.0 - 97% - - -
[161] 3D BU-3DFE - - 84.4% - -
[91] 3D BU-3DFE - 100% - - -
[145] 3D Gavab - - - - 95%
[91] 3D Gavab - 98% - - -
Deep Learning
based Algorithms
[127] 3D FRGC V2.0 - - 98.01% 96.29% 99.39%
[36] 3D FRGC V2.0 - - 99.94% 99.88% 100%
[120] 3D Bosphorus - - 99.24% 99.2% 100%
[127] 3D Bosphorus - - - 97.6% -
[36] 3D Bosphorus - - 99.75% 99.73% 100%
[77] 3D Bosphorus - - 98.1% 99.0% -
[120] 3D BU-3DFE - - 93% - -
[127] 3D BU-3DFE - - 96.1% - -
[36] 3D BU-3DFE - - 99.88% - -
Table 4. Comparative results under pose variations on the Gavab and Bosphorus datasets. 'd' means looking down, 'u' means looking up, 'r' means a sideways scan from the right, 'l' means a sideways scan from the left, 'yr' means yaw rotation, 'pr' means pitch rotation, 'cr' means cross rotation, and 'o' means overall scans. The types of methods (i.e., curve based (denoted by 'C'), local patch based (denoted by 'L'), holistic (denoted by 'H'), and deep learning based (denoted by 'D')) are also included in the tables.
(a) Evaluation Results on the Gavab Dataset
Method Type d u r l o
[9] C 93.3% 92.8% - - -
[145] H 85.3% 88.6% - - -
[103] L 96.72% 96.72% 78.69% 93.44% 91.39%
[59] C 100% 98.36% 70.49% 86.89% 96.99%
[11] L 96.72% 98.36% 81.97% 93.44% -
[14] L 95.1% 96.7% 83.6% 93.4%
[124] L 98.36% 98.36% - - -
[94] L 99.18% 98.36% 81.96% 83.60% -
(b) Evaluation Results on the Bosphorus Dataset
Method Type yr yr 90 pr cr o
[93] L - - - - 69.1%
[14] L 81.6% 45.7% 98.3% 93.4% -
[15] L 82.6% - 98.8% 95.3% -
[126] L 84.1% 47.1% 99.5% 99.1% 91.1%
[124] L 83.8% 47.4% 98.3% 98.6% 90.6%
[78] L 99.8% 95.2% 100% 99.1% 99.0%
[77] D 94.8% 86.2% 100% 98.6% 95.7%
From Table 5, we can observe that most current methods use local patch based approaches to handle occlusions. Local patch based methods utilize the local structural information of the 3D facial surface by extracting sub-regions or keypoints, and then perform face recognition by matching elaborately designed descriptors of these local structures. For example, Alyuz et al. [6] divided a whole 3D facial surface into four regions, and matched these four regions independently. In [85, 124, 126], a 3D facial surface is represented by a set of repeatable keypoints with elaborately designed descriptors, and face recognition is then accomplished based on these keypoints. In contrast, landmark based methods rely on anthropometric facial fiducial points on a face, curve based methods rely on geometric profiles or contours of a face, and holistic methods rely on the completeness of a 3D facial surface. Compared with these three types of methods, local patch based methods are more robust to occlusions.
7 CONCLUSION AND FUTURE WORK
This paper has presented a survey of the state-of-the-art 3D face recognition methods of the last twenty years. A comprehensive survey of preprocessing techniques such as nose tip detection, data filtering and pose normalization, as well as 3D face recognition methods, has been conducted. The performance of the current methods is evaluated on several challenging datasets under the taxonomy of facial expressions, pose variations and occlusions. In summary, the following conclusions can be made:
(i) For 3D face recognition, local patch based algorithms have attracted more attention than other types of methods. Benefiting from the development of local feature description methods in computer vision, local patch based algorithms can capture the details of 3D facial surfaces, and thus achieve more robust performance.
Table 5. Comparative results under occlusions on the Bosphorus dataset. The gallery set contains 105 scans (one neutral scan per person), and the probe set contains 381 scans with occlusions. The types of methods (i.e., curve based (denoted by ‘C’), local patch based (denoted by ‘L’), and deep learning based (denoted by ‘D’)) are also included in the table.
Method Type Eye Mouth Glasses Hair Overall
[6] L 93.6% 93.6% 97.8% 89.6% 94.12%
[59] C 97.1% 78% 94.2% 81% 87%
[14] L - - - - 93.2%
[15] L - - - - 95.8%
[126] L 100.0% 100.0% 100.0% 95.5% 99.2%
[85] L 96.19% 96.19% 99.04% 95.52% 96.85%
[124] L 90.5% 94.3% 96.2% 88.1% 92.7%
[78] L 99.0% 96.1% 100% 97.3% 98.1%
[77] D 100.0% 97.8% 100.0% 97.1% 98.9%
(ii) More attention has been paid to dealing with facial expressions, and less to the challenges of pose variations and occlusions. From the evaluation results, we can observe that current methods achieve promising results under facial expressions, while their performance under pose variations still needs to be improved, especially
for side scans from left or right. In addition, the datasets designed for occlusions and the occlusion types are
limited. More datasets containing various types of occlusions should be constructed in the future.
(iii) Due to the lack of large-scale datasets for the training of deep neural networks, deep learning based 3D
face recognition methods have developed very slowly compared to their 2D counterparts. As the largest 3D face dataset with real individuals, ND-2006 [66] only contains 13,450 scans of 888 individuals, which is much smaller than 2D face datasets such as the ones used by FaceNet [193] and VGG-Face [175]. That is mainly because it requires more effort to collect a large-scale 3D face dataset than a 2D face dataset, which can easily be obtained by crawling the web [76]. Although some methods have been proposed in recent years, they either utilize models pre-trained from
2D face datasets, or construct their own 3D datasets through face generation or data augmentation techniques.
For example, Gilani et al. [76] constructed a large-scale training dataset with 3.1 million 3D facial scans of 100K individuals for 3D face recognition. However, most of these individuals were generated synthetically [78]. Therefore, a large-scale 3D face dataset with real individuals is still highly needed by the community.
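As a small illustration of the data augmentation route mentioned above, the sketch below applies a random rigid perturbation, sensor-like noise, and a synthetic occlusion to a raw 3D face scan stored as an N×3 point array. The perturbation ranges, and the assumption that coordinates are in millimetres, are arbitrary illustrative choices rather than settings taken from any cited work.

```python
import numpy as np

def augment_scan(points, rng=None):
    """Return a randomly perturbed copy of an (N, 3) 3D face scan."""
    rng = rng or np.random.default_rng()
    # Random yaw rotation of up to +/- 30 degrees about the vertical axis.
    theta = rng.uniform(-np.pi / 6, np.pi / 6)
    rot = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(theta), 0.0, np.cos(theta)]])
    out = points @ rot.T
    # Additive Gaussian noise to mimic depth-sensor noise (assumed mm scale).
    out = out + rng.normal(scale=0.5, size=out.shape)
    # Remove a random spherical region to mimic occlusion or missing data.
    centre = out[rng.integers(len(out))]
    keep = np.linalg.norm(out - centre, axis=1) > 15.0
    return out[keep]
```

Augmented copies of enrolled scans can enlarge the training set of a deep network, although they cannot substitute for genuinely new identities.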
(iv) Most of these deep learning based methods convert 3D facial surfaces into 2D maps, and then utilize existing
2D face recognition networks to learn deep features. However, geometric information is lost during the conversion
from 3D data to 2D maps. Actually, an increasing number of deep learning methods have been proposed in the
last ve years to directly work on point clouds for various 3D vision tasks such as shape classication, object
detection, and point cloud segmentation [
87
]. Deep learning based 3D face recognition methods which work
directly with point clouds is a promising research direction.
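The following PyTorch sketch shows the basic form such a method can take: a PointNet-style encoder (shared per-point MLP followed by order-invariant max pooling) that maps a raw point cloud to a unit-length identity embedding, with matching done by cosine similarity. It is a minimal illustration under assumed layer sizes and names, not the architecture of any specific method discussed in this survey.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointFaceEncoder(nn.Module):
    """PointNet-style encoder mapping a (B, N, 3) face scan to a unit-length
    identity embedding, typically trained with a softmax or triplet loss."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.BatchNorm1d(256), nn.ReLU())
        self.head = nn.Linear(256, embed_dim)

    def forward(self, pts):
        feats = self.point_mlp(pts.transpose(1, 2))   # (B, 256, N)
        global_feat = feats.max(dim=2).values         # symmetric pooling over points
        return F.normalize(self.head(global_feat), dim=1)

# Matching: cosine similarity between probe and gallery embeddings.
encoder = PointFaceEncoder()
probe = torch.randn(1, 2048, 3)      # toy 3D face scan with 2048 points
gallery = torch.randn(1, 2048, 3)
score = (encoder(probe) * encoder(gallery)).sum(dim=1)
```

In practice, such encoders are trained on large labelled sets of scans (real or synthetic) and are often combined with sampling and local grouping strategies such as those of PointNet++ [184].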
(v) Disentanglement learning exhibits excellent performance for dealing with 3D facial variations such as poses
and expressions. It decouples the neutral latent representation from other facial attributes and thus makes 3D face
recognition highly robust to these facial attributes. Due to its powerful representation ability, disentanglement
learning also has great potential for other applications such as face reconstruction.
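A minimal sketch of this idea is given below, assuming each 3D face has already been registered to a common mesh topology and flattened into a vector of vertex coordinates. The encoder splits the latent code into an identity part and an expression part; reconstruction losses against the input and, when available, the subject's neutral scan push expression information out of the identity code, which is then used for matching. The class name, layer sizes, and code dimensions are assumptions for illustration, not those of any published model.

```python
import torch
import torch.nn as nn

class DisentangledFaceAE(nn.Module):
    """Toy autoencoder that separates identity and expression codes of a 3D face
    given as a flattened vector of registered vertex coordinates."""
    def __init__(self, in_dim, id_dim=64, exp_dim=32):
        super().__init__()
        self.id_dim = id_dim
        self.encoder = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                     nn.Linear(512, id_dim + exp_dim))
        self.decoder = nn.Sequential(nn.Linear(id_dim + exp_dim, 512), nn.ReLU(),
                                     nn.Linear(512, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        z_id, z_exp = z[:, :self.id_dim], z[:, self.id_dim:]
        recon = self.decoder(torch.cat([z_id, z_exp], dim=1))
        # "Neutralised" face: identity code with the expression code zeroed out.
        # A loss between this output and the subject's neutral scan encourages
        # z_id to be expression-invariant, so z_id can be matched directly.
        neutral = self.decoder(torch.cat([z_id, torch.zeros_like(z_exp)], dim=1))
        return z_id, recon, neutral
```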
Based on these observations and recent technological developments, several promising directions are worth considering for future work, for example:
(i) Bridging 3D face recognition and generative AI. Generative AI has demonstrated its capability in many
areas, with several systems already being introduced to the area of 3D model generation. For example, CLIP-Mesh
is able to generate textured 3D meshes from text descriptions, and Lumirithmic can generate 3D head meshes from
facial scans. Generative AI systems could potentially be used to mitigate the shortage of 3D facial data, providing deep learning algorithms with rich generated 3D face models covering different identities, poses, occlusions, expressions, and ethnicities.
(ii) Leveraging foundation models for 3D face recognition. Foundation models have dominated research in Natural Language Processing (NLP) and shown promising potential in computer vision tasks. With the help
of multi-modality foundation models, it is possible to boost the performance of 3D face recognition. However,
designing a foundation model architecture that is suitable for 3D data processing is still an open question. How to
leverage the knowledge embedded in other modalities (e.g., text, 2D face images) to improve 3D face recognition
performance also remains largely unexplored.
(iii) Achieving cross-resolution, cross-age, and cross-sensor 3D face recognition. It is very common that the
probe 3D faces and gallery 3D faces are acquired with different sensors at different times, and are represented with different resolutions and noise levels. How to achieve accurate, robust, and efficient 3D face recognition under such conditions is still an unsolved problem.
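One simple step in this direction is to bring probe and gallery scans to a comparable point density before matching, for example by voxel-grid downsampling as sketched below. This is a generic preprocessing illustration rather than a complete solution; the voxel size (assumed to be in millimetres) would need to be tuned to the sensors involved.

```python
import numpy as np

def voxel_downsample(points, voxel=2.0):
    """Replace all points falling in the same voxel by their centroid, so scans
    from different sensors end up with comparable point densities."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    counts = np.bincount(inverse).astype(float)
    centroids = np.zeros((counts.size, 3))
    for dim in range(3):
        centroids[:, dim] = np.bincount(inverse, weights=points[:, dim]) / counts
    return centroids
```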
ACKNOWLEDGMENTS
This work was partially supported by the National Key Research and Development Program of China (No.
2021YFB3100800), the National Natural Science Foundation of China (No. U20A20185, 61972435, 42271457,
62276176), the Guangdong Basic and Applied Basic Research Foundation (2022B1515020103), the Shenzhen
Science and Technology Program (No. RCYX20200714114641140), and the Australian Research Council (Grants
DP210101682 and DP210102674).
REFERENCES
[1]
A. F. Abate, M. Nappi, D. Riccio, and G. Sabatino. 2007. 2D and 3D face recognition: A survey. Pattern Recognition Letters 28, 14 (2007),
1885–1906.
[2]
A. Abbad, K. Abbad, and H. Tairi. 2018. 3D face recognition: Multi-scale strategy based on geometric and local descriptors. Computers
& Electrical Engineering 70 (2018), 525–537.
[3]
F.R. Al-Osaimi, M. Bennamoun, and A. Mian. 2008. Integration of local and global geometrical cues for 3D face recognition. Pattern
Recognition 41, 3 (2008), 1030–1040.
[4]
F. Al-Osaimi, M. Bennamoun, and A. Mian. 2009. An expression deformation approach to non-rigid 3D face recognition. IJCV 81, 3
(2009), 302–316.
[5]
S. Aly, A. Trubanova, L. Abbott, S. White, and A. Youssef. 2015. VT-KFER: A Kinect-based RGBD + Time dataset for spontaneous and
non-spontaneous facial expression recognition. In ICB. 90–97.
[6] N. Alyuz, B. Gokberk, and L. Akarun. 2008. A 3D face recognition system for expression and occlusion invariance. In BTAS. 1–7.
[7] B. Amberg, R. Knothe, and T. Vetter. 2008. Expression invariant 3D face recognition with a morphable model. In FG. 1–6.
[8]
C. BenAbdelkader and P. A. Griffin. 2005. Comparing and combining depth and texture cues for face recognition. Image and Vision
Computing 23, 3 (2005), 339–352.
[9]
S. Berretti, A. D. Bimbo, and P. Pala. 2006. Description and retrieval of 3D face models using iso-geodesic stripes. In ACM MIR. 13–22.
[10] S. Berretti, A. D. Bimbo, and P. Pala. 2010. 3D face recognition using isogeodesic stripes. IEEE TPAMI 32, 12 (2010), 2162–2177.
[11]
S. Berretti, A. D. Bimbo, and P. Pala. 2013. Sparse matching of salient facial curves for recognition of 3D faces with missing parts. IEEE
TIFS 8, 2 (2013), 374–389.
[12] S. Berretti, A. Del Bimbo, and P. Pala. 2012. Superfaces: A super-resolution model for 3D faces. In ECCV Workshops. 73–82.
[13]
S. Berretti, P. Pala, and A. D. Bimbo. 2014. Face recognition by super-resolved 3D models from consumer depth cameras. IEEE TIFS 9, 9
(2014), 1436–1449.
[14]
S. Berretti, N. Werghi, A. D. Bimbo, and P. Pala. 2013. Matching 3D face scans using interest points and local histogram descriptors.
Computers & Graphics 37, 5 (2013), 509–525.
[15]
S. Berretti, N. Werghi, A. D. Bimbo, and P. Pala. 2014. Selecting stable keypoints and local descriptors for person identification using
3D face scans. The Visual Computer (2014), 1–18.
[16] C. Beumier and M. Acheroy. 2000. Automatic 3D face authentication. Image and Vision Computing 18, 4 (2000), 315–321.
[17]
C. Beumier and M. Acheroy. 2001. Face verification from 3D and grey level clues. Pattern Recognition Letters 22, 12 (2001), 1321–1329.
[18]
A. R. Bhople, A. M. Shrivastava, and S. Prakasha. 2020. Point cloud based deep convolutional neural network for 3D face recognition.
Multimedia Tools and Applications (2020), 1–23.
[19] Volker Blanz, Kristina Scherbaum, and Hans-Peter Seidel. 2007. Fitting a Morphable Model to 3D Scans of Faces. In ICCV. 1–8.
[20] V. Blanz and T. Vetter. 1999. A morphable model for the synthesis of 3D faces. In SIGGRAPH. 187–194.
[21]
J. Booth, A. Roussos, S. Zafeiriou, A. Ponniah, and D. Dunaway. 2016. A 3D morphable model learnt from 10,000 Faces. In CVPR.
5543–5552.
[22]
G. Borghi, S. Pini, F. Grazioli, R. Vezzani, and R. Cucchiara. 2018. Face verification from depth using privileged information. In BMVC.
303.
[23] G. Borghi, S. Pini, R. Vezzani, and R. Cucchiara. 2019. Driver face verification with depth maps. Sensors 19, 15 (2019), 3361.
[24]
G. Borghi, M. Venturelli, R. Vezzani, and R. Cucchiara. 2017. POSEidon: face-from-depth for driver pose estimation. In CVPR. 5494–5503.
[25]
A. Y. Boumedine, S. Bentaieb, and A. Ouamri. 2022. An improved KNN classifier for 3D face recognition based on SURF descriptors.
Journal of Applied Security Research 0, 0 (2022), 1–19.
[26]
G. Bouritsas, S. Bokhnyak, S. Ploumpis, S. Zafeiriou, and M. Bronstein. 2019. Neural 3D morphable models: spiral convolutional
networks for 3D shape representation learning and generation. In ICCV. 7212–7221.
[27] K. W. Bowyer, K. Chang, and P. Flynn. 2004. A survey of approaches to three-dimensional face recognition. In ICPR. 358–361.
[28]
K. W. Bowyer, K. Chang, and P. Flynn. 2006. A survey of approaches and challenges in 3D and multi-modal 3D+2D face recognition.
CVIU 101, 1 (2006), 1–15.
[29]
M. D. Breitenstein, D. Kuettel, T. Weise, L. V. Gool, and H. Pfister. 2008. Real-time face pose estimation from single range images. In
CVPR. 1–8.
[30] A. M. Bronstein, M. M. Bronstein, and R. Kimmel. 2003. Expression-invariant 3D face recognition. In AVBPA. 62–70.
[31]
A. M. Bronstein, M. M. Bronstein, and R. Kimmel. 2005. Expression-invariant face recognition via spherical embedding. In ICIP, Vol. 3.
III–756.
[32] A. M. Bronstein, M. M. Bronstein, and R. Kimmel. 2005. Three-dimensional face recognition. IJCV 64, 1 (2005), 5–30.
[33] A. M. Bronstein, M. M. Bronstein, and R. Kimmel. 2006. Robust expression-invariant face recognition from partially missing data. In
ECCV. 396–408.
[34]
A. M. Bronstein, M. M. Bronstein, and R. Kimmel. 2007. Expression-invariant representations of faces. IEEE TIP 16, 1 (2007), 188–197.
[35] J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun. 2013. Spectral networks and locally connected networks on graphs. In ICLR.
[36]
Y. Cai, Y. Lei, M. Yang, Z. You, and S. Shan. 2019. A fast and robust 3D face recognition approach based on deeply learned face
representation. Neurocomputing 363 (2019), 375–397.
[37]
C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. 2014. Facewarehouse: a 3D facial expression database for visual computing. IEEE
TVCG 20, 3 (2014), 413–425.
[38]
Y. Cao, S. Liu, P. Zhao, and H. Zhu. 2022. RP-Net: A pointNet++ 3D face recognition algorithm integrating RoPS local descriptor. IEEE
Access 10 (2022), 91245–91252.
[39] K. Chang, K. Bowyer, and P. Flynn. 2003. Face recognition using 2D and 3D facial data. In MMUA. 25–32.
[40] K. I. Chang, K. W. Bowyer, and P. J. Flynn. 2003. Multimodal 2D and 3D biometrics for face recognition. In AMFG. 187–194.
[41]
K. I. Chang, K. W. Bowyer, and P. J. Flynn. 2005. Adaptive rigid multi-region selection for handling expression variation in 3D face
recognition. In CVPR Workshops. 157–157.
[42]
K. I. Chang, K. W. Bowyer, and P. J. Flynn. 2005. An evaluation of multimodal 2D+ 3D face biometrics. IEEE TPAMI 27, 4 (2005),
619–624.
[43]
K. I. Chang, K. W. Bowyer, and P. J. Flynn. 2006. Multiple nose region matching for 3D face recognition under varying facial expression.
IEEE TPAMI 28, 10 (2006), 1695–1700.
[44]
S. Cheng, I. Kotsia, M. Pantic, and S. Zafeiriou. 2018. 4dfab: A large scale 4d database for facial expression analysis and biometric
applications. In CVPR. 5117–5126.
[45] C. Chua, F. Han, and Y. Ho. 2000. 3D human face recognition using point signature. In FG. 233–238.
[46] C. Chua and R. Jarvis. 1997. Point signatures: a new representation for 3D object recognition. IJCV 25, 1 (1997), 63–85.
[47] D. Colbry, G. Stockman, and A. Jain. 2005. Detection of anchor points for 3D face verification. In CVPR Workshops. 118–118.
[48]
A. Colombo, C. Cusano, and R. Schettini. 2006. 3D face detection using curvature analysis. Pattern Recognition 39, 3 (2006), 444–455.
[49] A. Colombo, C. Cusano, and R. Schettini. 2011. UMB-DB: A database of partially occluded 3D faces. In ICCV Workshops. 2113–2119.
[50] C. Conde, A. Serrano, and E. Cabello. 2006. Multimodal 2D, 2.5D & 3D Face Verification. In ICIP. IEEE, 2061–2064.
[51] J. Cook, V. Chandran, and C. Fookes. 2006. 3D face recognition using log-gabor templates. In BMVC. 769–778.
[52] J. Cook, V. Chandran, and S. Sridharan. 2007. Multiscale representation for 3D face recognition. IEEE TIFS 2, 3 (2007), 529–536.
[53]
J. Cook, V. Chandran, S. Sridharan, and C. Fookes. 2004. Face recognition from 3D data using iterative closest point algorithm and
gaussian mixture models. In 3DimPVT. 502–509.
[54]
C. A. Corneanu, M. O. Simón, J. F. Cohn, and S. E. Guerrero. 2016. Survey on RGB, 3D, thermal, and multimodal approaches for facial
expression recognition: history, trends, and aect-related applications. IEEE TPAMI 38, 8 (2016), 1548–1568.
[55]
C. Creusot, N. Pears, and J. Austin. 2013. A machine-learning approach to keypoint detection and landmarking on 3D meshes. IJCV
102, 1-3 (2013), 146–179.
[56]
N. Dagnes, E. Vezzetti, F. Marcolin, and S. Tornincasa. 2018. Occlusion detection and restoration techniques for 3D face recognition: a
literature review. Machine Vision & Applications 29, 5 (2018), 789–813.
[57]
H. Dibeklioğlu, B. Gökberk, and L. Akarun. 2009. Nasal region-based 3D face recognition under pose and expression variations. In
Advances in Biometrics. 309–318.
[58]
H. Dibeklioglu, A. A. Salah, and L. Akarun. 2008. 3D facial landmarking under expression, pose, and occlusion variations. In BTAS. 1–6.
[59]
H. Drira, B. B. Amor, A. Srivastava, M. Daoudi, and R. Slama. 2013. 3D face recognition under expressions, occlusions, and pose
variations. IEEE TPAMI 35, 9 (2013), 2270–2283.
[60]
K. Dutta, D. Bhattacharjee, and M. Nasipuri. 2020. SpPCANet: a simple deep learning-based feature extraction approach for 3D face
recognition. Multimedia Tools and Applications (2020), 1–24.
[61]
K. Dutta, D. Bhattacharjee, M. Nasipuri, and O. Krejcar. 2021. Complement component face space for 3D face recognition from range
images. Applied Intelligence 51, 4 (April 2021), 2500–2517.
[62]
M. Emambakhsh and A. Evans. 2016. Nasal patches and curves for expression-robust 3D face recognition. IEEE TPAMI 39, 5 (2016),
995–1007.
[63] N. Erdogmus and J. Dugelay. 2014. 3D assisted face recognition: dealing with expression variations. IEEE TIFS 9, 5 (2014), 826–838.
[64] N. Erdogmus and S. Marcel. 2013. Spoong in 2D face recognition with 3D masks and anti-spoong with kinect. In BTAS. 1–6.
[65] T. Faltemier, K. Bowyer, and P. Flynn. 2006. 3D face recognition with region committee voting. In 3DimPVT. 318–325.
[66]
T. C. Faltemier, K. W. Bowyer, and P. J. Flynn. 2007. Using a multi-instance enrollment representation to improve 3D face recognition.
In BTAS. 1–6.
[67] T. C. Faltemier, K. W. Bowyer, and P. J. Flynn. 2008. A region ensemble for 3D face recognition. IEEE TIFS 3, 1 (2008), 62–73.
[68] T. C. Faltemier, K. W. Bowyer, and P. J. Flynn. 2008. Rotated prole signatures for robust 3D feature detection. In FG. 1–7.
[69]
T. C. Faltemier, K. W. Bowyer, and P. J. Flynn. 2008. Using multi-instance enrollment to improve performance of 3D face recognition.
CVIU 112, 2 (2008), 114–125.
[70]
X. Fan, Q. Jia, K. Huyan, X. Gu, and Z. Luo. 2016. 3D facial landmark localization using texture regression via conformal mapping.
Pattern Recognition Letters 83 (2016), 395–402.
[71]
G. Fanelli, M. Dantone, J. Gall, A. Fossati, and L. V. Gool. 2013. Random forests for real time 3D face analysis. IJCV 101, 3 (2013),
437–458.
[72]
T. Fang, X. Zhao, O. Ocegueda, S. K. Shah, and I. A. Kakadiaris. 2011. 3D facial expression recognition: A perspective on promises
and challenges. In FG Workshops. 603–610.
[73]
J. Feng, Q. Guo, Y. Guan, M. Wu, X. Zhang, and C. Ti. 2019. 3D face recognition method based on deep convolutional neural network.
In ICSICCS. 123–130.
[74]
M. A. Fischler and R. C. Bolles. 1981. Random sample consensus: A paradigm for model fitting with applications to image analysis and
automated cartography. Commun. ACM 24, 6 (1981), 381–395.
[75]
P. J. Flynn, K. W. Bowyer, and P. J. Phillips. 2003. Assessment of time dependency in face recognition: An initial study. In AVBPA.
44–51.
[76] S. Z. Gilani and A. Mian. 2018. Learning from millions of 3D scans for large-scale 3D face recognition. In CVPR. 1896–1905.
[77]
S. Z. Gilani, A. Mian, and P. Eastwood. 2017. Deep, dense and accurate 3D face correspondence for generating population specific
deformable models. Pattern Recognition 69 (2017), 238–250.
[78] S. Z. Gilani, A. Mian, F. Shafait, and I. Reid. 2018. Dense 3D face correspondence. IEEE TPAMI 40, 7 (2018), 1584–1598.
[79]
S. Z. Gilani, F. Shafait, and A. Mian. 2015. Shape-based automatic detection of a large number of 3D facial landmarks. In CVPR.
4639–4648.
[80]
B. Gokberk and L. Akarun. 2006. Comparative analysis of decision-level fusion algorithms for 3D face recognition. In ICPR, Vol. 3.
1018–1021.
[81] G. G. Gordon. 1992. Face recognition based on depth and curvature features. In CVPR. 808–810.
[82] G. Guo and N. Zhang. 2019. A survey on deep learning based face recognition. CVIU 189 (2019), 102805.
[83]
M. Guo, J. Cai, Z. Liu, T. Mu, R. R. Martin, and S. Hu. 2021. Pct: Point cloud transformer. Computational Visual Media 7, 2 (2021),
187–199.
[84]
Y. Guo, M. Bennamoun, F. Sohel, M. Lu, and J. Wan. 2014. 3D object recognition in cluttered scenes with local surface features: A
survey. IEEE TPAMI 36, 11 (2014), 2270–2287.
[85]
Y. Guo, Y. Lei, L. Liu, Y. Wang, M. Bennamoun, and F. Sohel. 2016. EI3D: Expression-invariant 3D face recognition based on feature and
shape matching. Pattern Recognition Letters 83 (2016), 403–412.
[86]
Y. Guo, F. Sohel, M. Bennamoun, M. Lu, and J. Wan. 2013. Rotational projection statistics for 3D local surface description and object
recognition. IJCV 105, 1 (2013), 63–86.
[87]
Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun. 2021. Deep learning for 3D point clouds: A survey. IEEE TPAMI 43, 12
(2021), 4338–4364.
[88]
S. Gupta, J. K. Aggarwal, M. K. Markey, and A. C. Bovik. 2007. 3D face recognition founded on the structural diversity of human faces.
In CVPR. 1–7.
[89]
S. Gupta, M. K. Markey, and A. C. Bovik. 2007. Advances and challenges in 3D and 2D+3D human face recognition. Pattern Recognition
in Biology (2007), 63–103.
[90] S. Gupta, M. K. Markey, and A. C. Bovik. 2010. Anthropometric 3D face recognition. IJCV 90, 3 (2010), 331–349.
[91]
F. B. T. Haar and R.C. Veltkamp. 2010. Expression modeling for expression-invariant face recognition. Computers & Graphics 34, 3
(2010), 231–241.
[92] F. B. T. Haar and R. C. Veltkamp. 2009. A 3D face matching framework for facial curves. Graphical Models 71, 2 (2009), 77–91.
[93]
F. Hajati, A. A. Raie, and Y. Gao. 2012. 2.5D face recognition using patch geodesic moments. Pattern Recognition 45, 3 (2012), 969–982.
[94]
W. Hariri, H. Tabia, N. Farah, A. Benouareth, and D. Declercq. 2016. 3D face recognition using covariance based descriptors. Pattern
Recognition Letters 78 (2016), 1–7.
[95] W. Hariri and M. Zaabi. 2021. Deep residual feature quantization for 3D face recognition.
[96] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
[97]
T. Heseltine, N. Pears, and J. Austin. 2004. Three-dimensional face recognition: a shersurface approach. In Image Analysis and
Recognition. 684–691.
[98]
T. Heseltine, N. Pears, and J. Austin. 2004. Three-dimensional face recognition: an eigensurface approach. In ICIP, Vol. 2. 1421–1424.
[99]
T. Heseltine, N. Pears, and J. Austin. 2008. Three-dimensional face recognition using combinations of surface feature map subspace
components. Image and Vision Computing 26, 3 (2008), 382–396.
[100]
C. Hesher, A. Srivastava, and G. Erlebacher. 2003. A novel technique for face recognition using range imaging. In ISSPA, Vol. 2. 201–204.
[101]
R. I. Hg, P. Jasek, C. Rodal, K. Nasrollahi, T. B. Moeslund, and G. Tranchet. 2012. An RGB-D database using Microsoft’s Kinect for
Windows for face detection. In SITIS. 42–46.
[102]
Y. Hu, Z. Zhang, X. Xu, Y. Fu, and T. S. Huang. 2007. Building large scale 3D face database for face analysis. In Multimedia Content
Analysis and Mining. 343–350.
[103]
D. Huang, M. Ardabilian, Y. Wang, and L. Chen. 2012. 3D face recognition using eLBP-based facial description and local feature hybrid
matching. IEEE TIFS 7, 5 (2012), 1551–1565.
[104]
Y. Huang, Y. Wang, and T. Tan. 2006. Combining statistics of geometrical and correlative features for 3D face recognition. In BMVC.
879–888.
[105]
M. Husken, M. Brauckmann, S. Gehlen, and C. von der Malsburg. 2005. Strategies and benefits of fusion of 2D and 3D face recognition.
In CVPR Workshops. 174–174.
[106]
M.O. Irfanoglu, B. Gokberk, and L. Akarun. 2004. 3D shape-based face recognition using automatically registered facial surfaces. In
ICPR, Vol. 4. 183–186.
[107]
S. M. S. Islam, M. Bennamoun, R. A. Owens, and R. Davies. 2012. A review of recent advances in 3D ear and expression invariant face
biometrics. Comput. Surveys 44, 3 (2012), 14.
[108]
P. Isola, J. Zhu, T. Zhou, and A. A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In CVPR. 5967–5976.
[109]
P. Isola, J. Zhu, T. Zhou, and A. A. Efros. 2017. Image-to-image translation with conditional adversarial networks. In CVPR. 5967–5976.
[110]
A. K. Jain, K. Nandakumar, and A. Ross. 2016. 50 years of biometric research: accomplishments, challenges, and opportunities. Pattern
Recognition Letters 79 (2016), 80–105.
[111] A. K. Jain, A. Ross, and S. Prabhakar. 2004. An introduction to biometric recognition. IEEE TCSVT 14, 1 (2004), 4–20.
[112]
C. Jiang, S. Lin, W. Chen, F. Liu, and L. Shen. 2022. PointFace: point cloud encoder based feature embedding for 3D face recognition.
IEEE TBIOM (2022), 1–1.
[113] Z. Jiang, Q. Wu, K. Chen, and J. Zhang. 2019. Disentangled representation learning for 3D face shape. In CVPR. 11949–11958.
[114] Y. Jing, X. Lu, and S. Gao. 2021. 3D face recognition: A survey.
[115]
M. Jribi, S. Mathlouthi, and F. Ghorbel. 2021. A geodesic multipolar parameterization-based representation for 3D face recognition.
Signal Processing: Image Communication 99 (Nov. 2021), 116464.
[116]
M. Jribi, A. Rihani, A. B. Khlifa, and F. Ghorbel. 2019. An SE(3) invariant description for 3D face recognition. Image and Vision
Computing 89 (Sept. 2019), 106–119.
[117]
A. Kacem, H. B. Abdesslam, K. Cherenkova, and D. Aouada. 2021. Space-time triplet loss network for dynamic 3D face verification. In
ICPR. 82–90.
[118]
A. Kacem, K. Cherenkova, and D. Aouada. 2022. Disentangled face identity representations for joint 3D face recognition and
neutralisation. In ICVR. 438–443.
[119]
I. A. Kakadiaris, G. Passalis, G. Toderici, M. N. Murtuza, Y. Lu, N. Karampatziakis, and T. Theoharis. 2007. Three-dimensional face
recognition in the presence of facial expressions: An annotated deformable model approach. IEEE TPAMI 29, 4 (2007), 640–649.
[120] D. Kim, M. Hernandez, J. Choi, and G. Medioni. 2017. Deep 3D face identification. In IJCB. 133–142.
[121]
J. Kittler, A. Hilton, M. Hamouz, and J. Illingworth. 2005. 3D assisted face recognition: A survey of 3D imaging, modelling and recognition
approaches. In CVPR Workshops. 114–114.
[122]
Y. Lei, M. Bennamoun, and A. A. El-Sallam. 2013. An efficient 3D face recognition approach based on the fusion of novel local low-level
features. Pattern Recognition 46, 1 (2013), 24–37.
[123]
Y. Lei, M. Bennamoun, M. Hayat, and Y. Guo. 2014. An efficient 3D face recognition approach using local geometrical signatures.
Pattern Recognition 47, 2 (2014), 509–524.
[124]
Y. Lei, Y. Guo, M. Hayat, M. Bennamoun, and X. Zhou. 2016. A two-phase weighted collaborative representation for 3D partial face
recognition with single sample. Pattern Recognition 52, 4 (2016), 218–237.
[125]
B. Li, A. S. Mian, W. Liu, and A. Krishna. 2013. Using Kinect for face recognition under varying poses, expressions, illumination and
disguise. In WACV. 186–192.
[126]
H. Li, D. Huang, J. M. Morvan, Y. Wang, and L. Chen. 2015. Towards 3D face recognition in the real: a registration-free approach using
fine-grained matching of 3D keypoint descriptors. IJCV 113, 2 (2015), 128–142.
[127]
H. Li, J. Sun, and L. Chen. 2017. Location-sensitive sparse representation of deep normal patterns for expression-robust 3D Face
Recognition. IJCB (2017).
[128]
L. Li, C. Xu, W. Tang, and C. Zhong. 2008. 3D face recognition by constructing deformation invariant image. Pattern Recognition Letters
29, 10 (2008), 1596–1602.
[129]
M. Li, B. Huang, and G. Tian. 2022. A comprehensive survey on 3D face recognition methods. Engineering Applications of Artificial
Intelligence 110 (April 2022), 104669.
[130] X. Li, T. Jia, and H. Zhang. 2009. Expression-insensitive 3D face recognition using sparse representation. In CVPR. 2575–2582.
[131]
S. Lin, C. Jiang, F. Liu, and L. Shen. 2021. High quality facial data synthesis and fusion for 3D low-quality face recognition. In IJCB. 1–8.
[132] S. Lin, F. Liu, Y. Liu, and L. Shen. 2019. Local feature tensor based deep learning for 3D face recognition. In FG. 1–5.
[133]
W. Lin, K. Wong, N. Boston, and Y. Hu. 2007. 3D face recognition under expression variations using similarity metrics fusion. In ICME.
727–730.
[134] F. Liu, L. Tran, and X. Liu. 2019. 3D face modeling from diverse raw scan data. In ICCV. 9407–9417.
[135]
P. Liu, Y. Wang, D. Huang, Z. Zhang, and L. Chen. 2013. Learning the spherical harmonic features for 3D face recognition. IEEE TIP 22,
3 (2013), 914–925.
[136] D. G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. IJCV 60, 2 (2004), 91–110.
[137] X. Lu, D. Colbry, and A. K. Jain. 2004. Matching 2.5D scans for face recognition. In ICBA. 30–36.
[138] X. Lu and A. K. Jain. 2005. Integrating range and texture information for 3D face recognition. In IEEE WACV, Vol. 1. 156–163.
[139] X. Lu and A. K. Jain. 2005. Multimodal facial feature extraction for automatic 3D face recognition. Technical Report (2005).
[140] X. Lu and A. K. Jain. 2006. Automatic feature extraction for multiview 3D face recognition. In FG. 585–590.
[141] X. Lu and A. K. Jain. 2008. Deformation modeling for robust 3D face matching. IEEE TPAMI 30, 8 (2008), 1346–1357.
[142] X. Lu, A. K. Jain, and D. Colbry. 2006. Matching 2.5D face scans to 3D models. IEEE TPAMI 28, 1 (2006), 31–43.
[143]
M. A. de Jong, A. Wollstein, C. Ruff, D. Dunaway, P. Hysi, T. Spector, F. Liu, W. Niessen, M. J. Koudstaal, M. Kayser, E. B. Wolvius, and S. Böhringer. 2016. An automatic 3D facial landmarking algorithm using 2D gabor wavelets. IEEE TIP 25, 2 (2016), 580–588.
[144]
M. Bennamoun, F. Sohel, and Y. Guo. 2015. Feature selection for 2D and 3D face recognition. Wiley Encyclopedia of Electrical and
Electronics Engineering (2015).
[145]
M. H. Mahoor and M. Abdel-Mottaleb. 2009. Face recognition based on 3D ridge images obtained from range data. Pattern Recognition
42, 3 (2009), 445–451.
[146]
T. Mantecon, C. R. del Bianco, F. Jaureguizar, and N. García. 2014. Depth-based face recognition using local quantized patterns adapted
for range data. In ICIP. 293–297.
[147] I. Marras, S. Zafeiriou, and G. Tzimiropoulos. 2012. Robust learning from normals for 3D face recognition. In ECCV. 230–239.
[148]
T. Maurer, D. Guigonis, I. Maslov, B. Pesenti, A. Tsaregorodtsev, D. West, and G. Medioni. 2005. Performance of Geometrix ActiveID™
3D face recognition engine on the FRGC data. In CVPR Workshops. 154–154.
[149]
K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre. 1999. XM2VTSDB: The extended M2VTS database. In AVBPA, Vol. 964. 965–966.
[150] A. Mian. 2011. Robust realtime feature detection in raw 3D face images. In WACV. 220–226.
[151]
A. S. Mian, M. Bennamoun, and R. Owens. 2007. An efficient multimodal 2D-3D hybrid approach to automatic face recognition. IEEE
TPAMI 29, 11 (2007), 1927–1943.
[152]
A. S. Mian, M. Bennamoun, and R. Owens. 2008. Keypoint detection and local feature matching for textured 3D face recognition. IJCV
79, 1 (2008), 1–12.
[153]
A. S. Mian, M. Bennamoun, and R. A. Owens. 2005. Region-based matching for robust 3D face recognition. In BMVC, Vol. 5. 199–208.
[154] A. S. Mian and N. Pears. 2012. 3D face recognition. In 3D Imaging, Analysis and Applications. 311–366.
[155] R. Min, N. Kose, and J. Dugelay. 2014. KinectFaceDB: A Kinect database for face recognition. IEEE TSMC 44, 11 (2014), 1534–1548.
[156]
H. Mohammadzade and D. Hatzinakos. 2013. Iterative closest normal point for 3D face recognition. IEEE TPAMI 35, 2 (2013), 381–397.
[157] A.B. Moreno and A. Sanchez. 2004. GavabDB: a 3D face database. In COST275 Workshop on Biometrics on the Internet. 75–80.
[158]
A. B. Moreno, Á. Sanchez, J. F. Velez, and F. J. Diaz. 2005. Face recognition using 3D local geometrical features: PCA vs. SVM. In ISPA.
185–190.
[159] A. B. Moreno, A. Sánchez, J. F. Vélez, and F. J. Díaz. 2003. Face recognition using 3D surface-extracted descriptors. In IMVIP, Vol. 2.
[160] M. H. Mousavi, K. Faez, and A. Asghari. 2008. Three dimensional face recognition using SVM classifier. In ICIS. 208–213.
[161]
I. Mpiperis, S. Malassiotis, and M. G. Strintzis. 2007. 3D face recognition with the geodesic polar representation. IEEE TIFS 2, 3 (2007),
537–547.
[162]
I. Mpiperis, S. Malassiotis, and M. G. Strintzis. 2008. Bilinear models for 3D face and facial expression recognition. IEEE TIFS 3, 3 (2008),
498–511.
[163]
G. Mu, D. Huang, G. Hu, J. Sun, and Y. Wang. 2019. Led3D: A lightweight and efficient deep approach to recognizing low-quality 3D
faces. In CVPR. 5766–5775.
[164] T. Nagamine, T. Uemura, and I. Masuda. 1992. 3D facial image analysis for human identication. In ICPR. 324–327.
[165]
B. Nassih, A. Amine, M. Ngadi, Y. Azdoud, D. Naji, and N. Hmina. 2021. An efficient three-dimensional face recognition system based
random forest and geodesic curves. Computational Geometry 97 (2021), 101758.
[166] T. Neumann, K. Varanasi, S. Wenger, M. Wacker, M. Magnor, and C. Theobalt. 2013. Sparse localized deformation components. ACM
TOG 32, 6 (2013).
[167]
O. Ocegueda, T. Fang, S. K. Shah, and I. A. Kakadiaris. 2013. 3D face discriminant analysis using Gauss-Markov posterior marginals.
IEEE TPAMI 35, 3 (2013), 728–739.
[168] O. Ocegueda, S. K. Shah, and I. A. Kakadiaris. 2011. Which parts of the face give out your identity?. In CVPR. 641–648.
[169]
E. C. Olivetti, J. Ferretti, G. Cirrincione, F. Nonis, S. Tornincasa, and F. Marcolin. 2019. Deep CNN for 3D face recognition. In Design
Tools and Methods in Industrial Engineering. 665–674.
[170] G. Pan, S. Han, Z. Wu, and Y. Wang. 2005. 3D face recognition using mapped depth images. In CVPR Workshops. 175–175.
[171] G. Pan, Y. Wu, Z. Wu, and W. Liu. 2003. 3D Face recognition by prole and surface matching. In IJCNN, Vol. 3. 2169–2174.
[172]
K. Papadopoulos, A. Kacem, A. E. R. Shabayek, and D. Aouada. 2022. Face-GCN: a graph convolutional network for 3D dynamic face
recognition. In ICVR. 454–458.
[173]
T. Papatheodorou and D. Rueckert. 2004. Evaluation of automatic 4D face recognition using surface and texture registration. In FG.
321–326.
[174]
C. Papazov, T. K. Marks, and M. Jones. 2015. Real-Time 3D head pose and facial landmark estimation from depth images using triangular
surface patch features. In CVPR. 4722–4730.
[175] O. M. Parkhi, A. Vedaldi, and A. Zisserman. 2015. Deep face recognition. In BMVC. 41.1–41.12.
[176]
G. Passalis, I.A. Kakadiaris, T. Theoharis, G. Toderici, and N. Murtuza. 2005. Evaluation of 3D face recognition in the presence of facial
expressions: an annotated deformable model approach. In CVPR Workshops. 171–171.
[177]
G. Passalis, P. Perakis, T. Theoharis, and I. A. Kakadiaris. 2011. Using facial symmetry to handle pose variations in real-world 3D face
recognition. IEEE TPAMI 33, 10 (2011), 1938–1951.
[178]
P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. 2009. A 3D face model for pose and illumination invariant face recognition.
In AVSS. 296–301.
[179]
X. Peng, M. Bennamoun, and A. S. Mian. 2011. A training-free nose tip detection method from face range images. Pattern Recognition
44, 3 (2011), 544–558.
[180]
P. Perakis, G. Passalis, T. Theoharis, and I. A. Kakadiaris. 2013. 3D facial landmark detection under large yaw and expression variations.
IEEE TPAMI 35, 7 (2013), 1552–1564.
[181]
D. Petrovska-Delacretaz, S. Lelandais, J. Colineau, L. Chen, B. Dorizzi, M. Ardabilian, E. Krichen, M. Mellakh, A. Chaari, S. Guerfi, J.
D’Hose, and B. Amor. 2008. The IV 2 multimodal biometric database (including iris, 2D, 3D, stereoscopic, and talking face data), and
the IV 2-2007 evaluation campaign. In BTAS. 1–7.
[182]
P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek. 2005. Overview of the face
recognition grand challenge. In CVPR, Vol. 1. 947–954.
[183]
S. Pini, G. Borghi, R. Vezzani, D. Maltoni, and R. Cucchiara. 2021. A systematic comparison of depth map representations for face
recognition. Sensors 21, 3 (2021), 944.
[184]
C. R. Qi, L. Yi, H. Su, and L. J. Guibas. 2017. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS,
Vol. 30.
[185]
C. C. Queirolo, L. Silva, O. R. P. Bellon, and M. P. Segundo. 2010. 3D face recognition using simulated annealing and the surface
interpenetration measure. IEEE TPAMI 32, 2 (2010), 206–219.
[186] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black. 2018. Generating 3D faces using convolutional mesh autoencoders. In ECCV.
[187]
T. D. Russ, M. W. Koch, and C. Q. Little. 2005. A 2D range Hausdorff approach for 3D face recognition. In CVPR Workshops. 169–169.
[188]
C. Samir, A. Srivastava, and M. Daoudi. 2006. Three-dimensional face recognition using shapes of facial curves. IEEE TPAMI 28, 11
(2006), 1858–1863.
[189]
C. Samir, A. Srivastava, M. Daoudi, and E. Klassen. 2009. An intrinsic framework for analysis of facial surfaces. IJCV 82, 1 (2009),
80–95.
[190] G. Sandbach, S. Zafeiriou, M. Pantic, and L. Yin. 2012. Static and dynamic 3D facial expression recognition: A comprehensive survey.
Image and Vision Computing 30, 10 (2012), 683–697.
[191]
A. Savran, N. Alyüz, H. Dibeklioğlu, O. Çeliktutan, B. Gökberk, B. Sankur, and L. Akarun. 2008. Bosphorus database for 3D face analysis.
In Biometrics and Identity Management. 47–56.
[192] A. Scheenstra, A. Ruifrok, and R. Veltkamp. 2005. A survey of 3D face recognition methods. In AVBPA. 325–345.
[193]
F. Schro, D. Kalenichenko, and J. Philbin. 2015. FaceNet: A unied embedding for face recognition and clustering. In CVPR. 815–823.
[194]
M. P. Segundo, C. Queirolo, O. R. P. Bellon, and L. Silva. 2007. Automatic 3D facial segmentation and landmark detection. In ICIAP.
431–436.
[195]
S. Sharma and V. Kumar. 2020. Voxel-based 3D face reconstruction and its application to face recognition using sequential deep learning.
Multimedia Tools and Applications 79, 25-26 (July 2020), 17303–17330.
[196]
B. Shi, H. Zang, R. Zheng, and S. Zhan. 2019. An efficient 3D face recognition approach using frenet feature of iso-geodesic curves. JVCIR 59 (2019), 455–460.
[197]
D. Smeets, P. Claes, J. Hermans, D. Vandermeulen, and P. Suetens. 2012. A comparative study of 3D face recognition under expression
variations. IEEE TSMCC 42, 5 (2012), 710–727.
[198]
D. Smeets, P. Claes, D. Vandermeulen, and J. G. Clement. 2010. Objective 3D face recognition: Evolution, approaches and challenges.
Forensic Science International 201, 1-3 (2010), 125–132.
[199]
D. Smeets, T. Fabry, J. Hermans, D. Vandermeulen, and P. Suetens. 2009. Isometric deformation modeling using singular value
decomposition for 3D expression-invariant face recognition. In BTAS. 1–6.
[200]
D. Smeets, T. Fabry, J. Hermans, D. Vandermeulen, and P. Suetens. 2010. Fusion of an isometric deformation modeling approach using
spectral decomposition and a region-based approach using ICP for expression-invariant 3D face recognition. In ICPR. 1172–1175.
[201]
D. Smeets, J. Keustermans, D. Vandermeulen, and P. Suetens. 2013. meshSIFT: local surface features for 3D face recognition under
expression variations and partial data. CVIU 117, 2 (2013), 158–169.
[202]
S. Soltanpour, B. Boufama, and Q.M. J. Wu. 2017. A survey of local feature methods for 3D face recognition. Pattern Recognition 72
(2017), 391–406.
[203] S. Soltanpour and Q.M. J. Wu. 2017. High-order local normal derivative pattern (LNDP) for 3D face recognition. In ICIP. 2811–2815.
[204]
M. Song, D. Tao, S. Sun, C. Chen, and S. J. Maybank. 2014. Robust 3D face landmark localization based on local coordinate coding. IEEE
TIP 23, 12 (2014), 5108–5122.
[205] L. Spreeuwers. 2011. Fast and accurate 3D face recognition. IJCV 93, 3 (2011), 389–414.
[206]
L. Spreeuwers. 2015. Breaking the 99% barrier: optimisation of three-dimensional face recognition. IET Biometrics 4, 3 (2015), 169–178.
[207]
A. Srivastava, C. Samir, S. H. Joshi, and M. Daoudi. 2009. Elastic shape models for face analysis using curvilinear coordinates. Journal
of Mathematical Imaging and Vision 33, 2 (2009), 253–265.
[208]
H. Sun, N. Pears, and Y. Gu. 2022. Information Bottlenecked Variational Autoencoder for Disentangled 3D Facial Expression Modelling.
In WACV. 2334–2343.
[209]
Y. Tan, H. Lin, Z. Xiao, S. Ding, and H. Chao. 2019. Face recognition from sequential sparse 3D data via deep registration. In ICB. 1–8.
[210] Frank B. ter Haar and Remco C. Veltkamp. 2008. 3D Face Model Fitting for Recognition. In ECCV. 652–664.
[211]
G. Toderici, G. Evangelopoulos, T. Fang, T. Theoharis, and I. A. Kakadiaris. 2014. UHDB11 Database for 3D-2D face recognition. In
PSIVT. 73–86.
[212] F. Tombari, S. Salti, and L. D. Stefano. 2010. Unique signatures of histograms for local surface description. In ECCV. 356–369.
[213]
N. F. Troje and H. H. Bülthoff. 1996. Face recognition under varying poses: The role of texture and shape. Vision Research 36, 12 (1996),
1761–1771.
[214] E. Trucco and A. Verri. 1998. Introductory techniques for 3D computer vision.
[215]
F. Tsalakanidou, S. Malassiotis, and M. G. Strintzis. 2005. Face localization and authentication using color and depth images. IEEE TIP
14, 2 (2005), 152–168.
[216]
F. Tsalakanidou, S. Malassiotis, and M. G. Strintzis. 2007. A 3D face and hand biometric system for robust user-friendly authentication.
Pattern Recognition Letters 28, 16 (2007), 2238–2249.
[217]
F. Tsalakanidou, D. Tzovaras, and M. G. Strintzis. 2003. Use of depth and colour eigenfaces for face recognition. Pattern Recognition
Letters 24, 9 (2003), 1427–1435.
[218]
R. C. Veltkamp, S. V. Jole, H. Drira, B. B. Amor, M. Daoudi, H. Li, L. Chen, P. Claes, D. Smeets, J. Hermans, D. Vandermeulen, and P.
Suetens, et al. 2011. SHREC’11 track: 3D face models retrieval. In 3DOR. 89–95.
[219]
V. Vijayan, K. W. Bowyer, P. J. Flynn, D. Huang, L. Chen, M. Hansen, O. Ocegueda, S. K. Shah, and I. A. Kakadiaris. 2011. Twins 3D face
recognition challenge. In IJCB. 1–7.
[220]
Y. Wang and C. Chua. 2005. Face recognition from 2D and 3D images using 3D Gabor filters. Image and Vision Computing 23, 11 (2005),
1018–1028.
[221]
Y. Wang, C. Chua, and Y. Ho. 2002. Facial feature detection and face recognition from 2D and 3D images. Pattern Recognition Letters 23,
10 (2002), 1191–1202.
[222]
Y. Wang, J. Liu, and X. Tang. 2010. Robust 3D face recognition by local shape difference boosting. IEEE TPAMI 32, 10 (2010), 1858–1870.
[223]
Y. Wang, G. Pan, Z. Wu, and Y. Wang. 2006. Exploring facial expression effects in 3D face recognition using partial ICP. In ACCV.
581–590.
[224] Y. Wang, X. Tang, J. Liu, G. Pan, and R. Xiao. 2008. 3D face recognition by local shape difference boosting. In ECCV. 603–616.
[225]
Z. Wang, Z. Miao, Q.M. J. Wu, Y. Wan, and Z. Tang. 2014. Low-resolution face recognition: a review. The Visual Computer 30, 4 (2014),
359–386.
[226]
N. Werghi, C. Tortorici, S. Berretti, and A. D. Bimbo. 2016. Boosting 3D LBP-based face recognition by fusing shape and texture
descriptors on the mesh. IEEE TIFS 11, 5 (2016), 964–979.
[227]
C. Xu, S. Li, T. Tan, and L. Quan. 2009. Automatic 3D face recognition from depth and intensity Gabor features. Pattern Recognition 42,
9 (2009), 1895–1905.
[228]
C. Xu, T. Tan, S. Li, Y. Wang, and C. Zhong. 2006. Learning effective intrinsic features to boost 3D-based face recognition. In ECCV.
416–427.
[229]
C. Xu, T. Tan, Y. Wang, and L. Quan. 2006. Combining local features for robust nose location in 3D facial data. Pattern Recognition
Letters 27, 13 (2006), 1487–1494.
[230] C. Xu, Y. Wang, T. Tan, and L. Quan. 2004. A new attempt to face recognition using 3D eigenfaces. In ACCV, Vol. 2. 884–889.
[231]
C. Xu, Y. Wang, T. Tan, and L. Quan. 2004. Automatic 3D face recognition combining global geometric features with local shape
variation information. In FG. 308–313.
[232]
K. Xu, X. Wang, Z. Hu, and Z. Zhang. 2019. 3D face recognition based on twin neural network combining deep map and texture. In
ICCT. 1665–1668.
[233]
H. Yang, H. Zhu, Y. Wang, M. Huang, Q. Shen, R. Yang, and X. Cao. 2020. Facescape: a large-scale high quality 3d face dataset and
detailed riggable 3d face prediction. In CVPR. 601–610.
[234] B. Yin, Y. Sun, C. Wang, and Y. Ge. 2005. The BJUT-3D large-scale Chinese face database. Technical Report.
[235] L. Yin, X. Chen, Y. Sun, T. Worm, and M. Reale. 2008. A high-resolution 3D dynamic facial expression database. In FG. 1–6.
[236] L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato. 2006. A 3D facial expression database for facial behavior research. In FG. 211–216.
[237] X. Yu, Y. Gao, and J. Zhou. 2016. 3D face recognition under partial occlusions using radial strings. In ICIP. 3016–3020.
[238]
X. Yu, Y. Gao, and J. Zhou. 2017. Sparse 3D directional vertices vs continuous 3D curves: efficient 3D surface matching and its application
for single model face recognition. Pattern Recognition 65 (May 2017), 296–306.
[239]
S. Zafeiriou, M. Hansen, G. Atkinson, V. Argyriou, M. Petrou, M. Smith, and L. Smith. 2011. The photoface database. In CVPR Workshops.
132–139.
[240]
A. Zaharescu, E. Boyer, and R. Horaud. 2012. Keypoints and local descriptors of scalar functions on 2D manifolds. IJCV 100 (2012),
78–98.
[241] J. Zhang, D. Huang, Y. Wang, and J. Sun. 2016. Lock3DFace: A large-scale database of low-cost Kinect 3D faces. In ICB. 1–8.
[242]
L. Zhang, A. Razdan, G. Farin, J. Femiani, M. Bae, and C. Lockwood. 2006. 3D face authentication and recognition based on bilateral
symmetry analysis. The Visual Computer 22, 1 (2006), 43–55.
[243]
X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, and P. Liu. 2013. A high-resolution spontaneous 3D dynamic facial
expression database. In FG. 1–6.
[244]
X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, P. Liu, and J. M. Girard. 2014. BP4D-Spontaneous: a high-resolution
spontaneous 3D dynamic facial expression database. Image and Vision Computing 32, 10 (2014), 692–706.
[245]
Z. Zhang, C. Yu, H. Li, J. Sun, and F. Liu. 2020. Learning distribution independent latent representation for 3D face disentanglement. In
3DV. 848–857.
[246]
W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. 2003. Face recognition: a literature survey. Comput. Surveys 35, 4 (2003), 399–458.
[247]
X. Zhao, E. Dellandrea, L. Chen, and I. A. Kakadiaris. 2011. Accurate landmarking of three-dimensional facial data in the presence of
facial expressions and occlusions using a three-dimensional statistical facial feature model. IEEE TSMC 41, 5 (2011), 1417–1428.
[248] C. Zhong, Z. Sun, and T. Tan. 2007. Robust 3D face recognition using learned visual codebook. In CVPR. 1–6.
[249]
H. Zhou, A. Mian, L. Wei, D. Creighton, M. Hossny, and S. Nahavandi. 2014. Recent advances on singlemodal and multimodal face
recognition: A survey. IEEE THMS 44, 6 (2014), 701–716.
[250] S. Zhou and S. Xiao. 2018. 3D face recognition: A survey. HCIS 8, 1 (2018), 1–27.