Face Recognition: From Traditional to Deep
Learning Methods
Daniel Sáez Trigueros, Li Meng
School of Engineering and Technology
University of Hertfordshire
Hatfield AL10 9AB, UK
Margaret Hartnett
GBG plc
London E14 9QD, UK
Abstract—Starting in the seventies, face recognition has be-
come one of the most researched topics in computer vision and
biometrics. Traditional methods based on hand-crafted features
and traditional machine learning techniques have recently been
superseded by deep neural networks trained with very large
datasets. In this paper we provide a comprehensive and up-
to-date literature review of popular face recognition methods
including both traditional (geometry-based, holistic, feature-
based and hybrid methods) and deep learning methods.
I. INTRODUCTION
Face recognition refers to the technology capable of iden-
tifying or verifying the identity of subjects in images or
videos. The first face recognition algorithms were developed
in the early seventies [1], [2]. Since then, their accuracy
has improved to the point that nowadays face recognition
is often preferred over other biometric modalities that have
traditionally been considered more robust, such as fingerprint
or iris recognition [3]. One of the differential factors that
make face recognition more appealing than other biometric
modalities is its non-intrusive nature. For example, fingerprint
recognition requires users to place a finger in a sensor, iris
recognition requires users to get significantly close to a cam-
era, and speaker recognition requires users to speak out loud.
In contrast, modern face recognition systems only require users
to be within the field of view of a camera (provided that they
are within a reasonable distance from the camera). This makes
face recognition the most user friendly biometric modality. It
also means that the range of potential applications of face
recognition is wider, as it can be deployed in environments
where the users are not expected to cooperate with the system,
such as in surveillance systems. Other common applications
of face recognition include access control, fraud detection,
identity verification and social media.
Face recognition is one of the most challenging biometric
modalities when deployed in unconstrained environments due
to the high variability that face images present in the real
world (these types of face images are commonly referred to
as faces in-the-wild). Some of these variations include head
poses, aging, occlusions, illumination conditions, and facial
expressions. Examples of these are shown in Figure 1.
Face recognition techniques have shifted significantly over
the years. Traditional methods relied on hand-crafted features,
such as edges and texture descriptors, combined with machine
learning techniques, such as principal component analysis,
linear discriminant analysis or support vector machines.

Fig. 1: Typical variations found in faces in-the-wild. (a) Head
pose. (b) Age. (c) Illumination. (d) Facial expression. (e)
Occlusion.

The
difficulty of engineering features that were robust to the
different variations encountered in unconstrained environments
made researchers focus on specialised methods for each
type of variation, e.g. age-invariant methods [4], [5], pose-
invariant methods [6], illumination-invariant methods [7], [8],
etc. Recently, traditional face recognition methods have been
superseded by deep learning methods based on convolutional
neural networks (CNNs). The main advantage of deep learning
methods is that they can be trained with very large datasets to
learn the best features to represent the data. The availability
of faces in-the-wild on the web has allowed the collection
of large-scale datasets of faces [9], [10], [11], [12], [13],
[14], [15] containing real-world variations. CNN-based face
recognition methods trained with these datasets have achieved
very high accuracy as they are able to learn features that
are robust to the real-world variations present in the face
images used during training. Moreover, the rise in popularity
of deep learning methods for computer vision has accelerated
face recognition research, as CNNs are being used to solve
many other computer vision tasks, such as object detection
and recognition, segmentation, optical character recognition,
facial expression analysis, age estimation, etc.

Fig. 2: Face recognition building blocks.
Face recognition systems are usually composed of the
following building blocks:
1) Face detection. A face detector finds the position of the
faces in an image (if any) and returns the coordinates of
a bounding box for each one of them. This is illustrated
in Figure 3a.
2) Face alignment. The goal of face alignment is to scale
and crop face images in the same way using a set of
reference points located at fixed locations in the image.
This process typically requires finding a set of facial
landmarks using a landmark detector and, in the case of a
simple 2D alignment, finding the best affine transforma-
tion that fits the reference points. Figures 3b and 3c show
two face images aligned using the same set of reference
points. More complex 3D alignment algorithms (e.g.
[16]) can also achieve face frontalisation, i.e. changing
the pose of a face to frontal.
3) Face representation. At the face representation stage,
the pixel values of a face image are transformed into a
compact and discriminative feature vector, also known
as a template. Ideally, all the faces of the same subject
should map to similar feature vectors.
4) Face matching. In the face matching building block,
two templates are compared to produce a similarity score
that indicates the likelihood that they belong to the same
subject.
Face representation is arguably the most important compo-
nent of a face recognition system and the focus of the literature
review in Section II.
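As an illustration of how these building blocks interact, a minimal Python sketch of the pipeline is given below. The detector, aligner and feature extractor are placeholder callables standing in for any concrete method; they are not components prescribed by any specific system discussed in this paper.

```python
import numpy as np

def recognition_pipeline(image_a, image_b, detector, aligner, extractor):
    """Compare the first face found in each of two images and return a similarity score.

    `detector`, `aligner` and `extractor` are hypothetical callables standing in for
    any face detector, face alignment method and face representation model."""
    templates = []
    for image in (image_a, image_b):
        boxes = detector(image)              # 1) face detection: list of bounding boxes
        face = aligner(image, boxes[0])      # 2) face alignment: cropped, aligned face
        template = extractor(face)           # 3) face representation: feature vector (template)
        templates.append(template / np.linalg.norm(template))
    # 4) face matching: cosine similarity between the two templates
    return float(np.dot(templates[0], templates[1]))
```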
Fig. 3: (a) Bounding boxes found by a face detector. (b) and
(c) Aligned faces and reference points.
II. LITERATURE REVIEW
Early research on face recognition focused on methods that
used image processing techniques to match simple features de-
scribing the geometry of the faces. Even though these methods
only worked under very constrained settings, they showed that
it is possible to use computers to automatically recognise faces.
After that, statistical subspace methods such as principal
component analysis (PCA) and linear discriminant analysis
(LDA) gained popularity. These methods are referred to as
holistic since they use the entire face region as an input. At
the same time, progress in other computer vision domains led
to the development of local feature extractors that are able to
describe the texture of an image at different locations. Feature-
based approaches to face recognition consist of matching these
local features across face images. Holistic and feature-based
methods were further developed and combined into hybrid
methods. Face recognition systems based on hybrid methods
remained the state-of-the-art until recently, when deep learning
emerged as the leading approach to most computer vision
applications, including face recognition. The rest of this paper
provides a summary of some of the most representative re-
search works on each of the aforementioned types of methods.
A. Geometry-based Methods
Kelly’s [1] and Kanade’s [2] PhD theses in the early
seventies are considered the first research works on automatic
face recognition. They proposed the use of specialised edge
and contour detectors to find the location of a set of facial land-
marks and to measure relative positions and distances between
them. The accuracy of these early systems was demonstrated
on very small databases of faces (a database of 10 subjects was
used in [1] and a database of 20 subjects was used in [2]). In
[17], a geometry-based method similar to [2] was compared
with a method that represents face images as gradient images.
The authors showed that comparing gradient images provided
better recognition accuracy than comparing geometry-based
features. However, the geometry-based method was faster and
needed less memory. The feasibility of using facial landmarks
and their geometry for face recognition was thoroughly stud-
ied in [18]. Specifically, they proposed a method based on
measuring the Procrustes distance [19] between two sets of
facial landmarks and a method based on measuring ratios
of distances between facial landmarks. The authors argued
that even though other methods that extract more information
from the face (e.g. holistic methods) could achieve greater
recognition accuracy, the proposed geometry-based methods
were faster and could be used in combination with other
methods to develop hybrid methods.

Fig. 4: Top 5 eigenfaces computed using the ORL database of
faces [31], sorted from most variance (left) to least variance
(right).

Geometry-based methods
have proven more effective in 3D face recognition thanks to
the depth information encoded in 3D landmarks [20], [21].
Geometry-based methods were crucial during the early days
of face recognition research. They can be used as a fast
alternative to (or in combination with) the more advanced
methods described in the rest of this review.
B. Holistic Methods
Holistic methods represent faces using the entire face re-
gion. Many of these methods work by projecting face images
onto a low-dimensional space that discards superfluous details
and variations not needed for the recognition task. One of the
most popular approaches in this category is based on PCA.
The idea, first proposed in [22], [23], is to apply PCA to a
set of training face images in order to find the eigenvectors
that account for the most variance in the data distribution. In
this context, the eigenvectors are typically called eigenfaces
due to their resemblance to real faces, as shown in Figure 4.
New faces can be projected onto the subspace spanned by the
eigenfaces to obtain the weights of the linear combination of
eigenfaces needed to reconstruct them. This idea was used
in [24] to identify faces by comparing the weights of new
faces to the weights of faces in a gallery set. A probabilistic
version of this approach based on a Bayesian analysis of image
differences was proposed in [25]. In this method, two sets
of eigenfaces were used to model intra-personal and inter-
personal variations separately. Many other variations of the
original eigenfaces method have been proposed. For example,
a nonlinear extension of PCA based on kernel methods,
namely kernel PCA [26], was proposed in [27]; independent
component analysis (ICA) [28], a generalisation of PCA that
can capture high-order dependencies between pixels, was
proposed in [29]; and a two-dimensional PCA based on 2D
image matrices instead of 1D vectors was proposed in [30].
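The eigenfaces computation can be summarised with the following minimal NumPy sketch; dataset loading and gallery/probe handling are assumed, and variable names are illustrative rather than taken from the cited works.

```python
import numpy as np

def compute_eigenfaces(train_faces, num_components=5):
    """train_faces: (n_images, n_pixels) matrix of vectorised face images."""
    mean_face = train_faces.mean(axis=0)
    centred = train_faces - mean_face
    # Principal directions of the training data via SVD (rows of vt are eigenfaces)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return mean_face, vt[:num_components]

def project(face, mean_face, eigenfaces):
    """Weights of the linear combination of eigenfaces that reconstructs one face."""
    return eigenfaces @ (face - mean_face)

# Identification as in [24]: compare the weights of a probe face to gallery weights,
# e.g. with the Euclidean distance (gallery_faces and probe are assumed to be given):
# mean_face, eigenfaces = compute_eigenfaces(gallery_faces)
# gallery_w = np.array([project(f, mean_face, eigenfaces) for f in gallery_faces])
# probe_w = project(probe, mean_face, eigenfaces)
# best_match = np.argmin(np.linalg.norm(gallery_w - probe_w, axis=1))
```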
One issue with PCA-based approaches is that the projection
maximises the variance across all the images in the training
set. This implies that the top eigenvectors might have a
negative impact on the recognition accuracy since they might
correspond to intra-personal variations that are not relevant
for the recognition task (e.g. illumination, pose or expression).
Holistic methods based on linear discriminant analysis (LDA),
also called Fisher discriminant analysis, [32] have been pro-
posed to solve this issue [33], [34], [35], [36]. The main idea
behind LDA is to use the class labels to find a projection
matrix $W$ that maximises the variance between classes while
minimising the variance within classes:

$$W = \arg\max_{W} \frac{|W^T S_b W|}{|W^T S_w W|} \quad (1)$$

where $S_w$ and $S_b$ are the within-class and between-class
scatter matrices defined as follows:

$$S_w = \sum_{k}^{K} \sum_{x_j \in C_k} (x_j - \mu_k)(x_j - \mu_k)^T \quad (2)$$

$$S_b = \sum_{k}^{K} (\mu - \mu_k)(\mu - \mu_k)^T \quad (3)$$

where $x_j$ represents a data sample, $\mu_k$ is the mean of class $C_k$,
$\mu$ is the overall mean and $K$ is the number of classes in the
dataset. The solution to Equation 1 can be found by computing
the eigenvectors of the separation matrix $S = S_w^{-1} S_b$. Similar
to PCA, LDA can be used for dimensionality reduction by
selecting a subset of eigenvectors corresponding to the largest
eigenvalues. Even though LDA is considered a more suitable
technique for face recognition than PCA, pure LDA-based
methods are prone to overfitting when the within-class scatter
matrix $S_w$ is not correctly estimated [35], [36]. This happens
when the input data is high-dimensional and not many samples
per class are available during training. In the extreme case, $S_w$
becomes singular and $W$ cannot be computed [33]. For this
reason, it is common to reduce the dimensionality of the data
with PCA before applying LDA [33], [35], [36]. LDA has also
been extended to the nonlinear case using kernels [37], [38]
and to probabilistic LDA [39].
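A direct NumPy sketch of Equations 1-3 is given below. As discussed above, in practice PCA would be applied first to avoid a singular $S_w$; that preprocessing is omitted here for brevity, and a pseudo-inverse is used instead.

```python
import numpy as np

def fisher_lda(X, y, num_components):
    """X: (n_samples, n_dims) data, y: integer class labels. Returns projection matrix W."""
    overall_mean = X.mean(axis=0)
    n_dims = X.shape[1]
    Sw = np.zeros((n_dims, n_dims))  # within-class scatter, Equation 2
    Sb = np.zeros((n_dims, n_dims))  # between-class scatter, Equation 3
    for k in np.unique(y):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        Sw += (Xk - mu_k).T @ (Xk - mu_k)
        diff = (overall_mean - mu_k)[:, None]
        Sb += diff @ diff.T
    # Solve Equation 1 via the eigenvectors of S = Sw^{-1} Sb (largest eigenvalues first)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-eigvals.real)
    return eigvecs[:, order[:num_components]].real  # columns are the projection directions
```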
Support vector machines (SVMs) have also been used as
holistic methods for face recognition. In [40], the task was
formulated as a two-class problem by training an SVM with
image differences. More specifically, the two classes are the
within-class difference set, which contains all the differences
between images of the same class, and the between-class
difference set, which contains all the differences between
images of distinct classes (this formulation is similar to the
probabilistic PCA proposed in [25]). In addition, [40] modified
the traditional SVM formulation by adding a parameter to
control the operating point of the system. In [41], a separate
SVM was trained for each class. The authors experimented
with SVMs trained with PCA projections and with LDA pro-
jections. It was found that this SVM approach only gives better
performance compared with simple Euclidean distance when
trained with PCA projections, since LDA already encodes the
discriminant information needed to recognise faces.
An approach related to PCA and LDA is the locality
preserving projections (LPP) method proposed in [42]. While
PCA and LDA preserve the global structure of the image
space (maximising variance and discriminant information re-
spectively), LPP aims to preserve the local structure of the
image space. This means that the projection learnt by LPP
maps images with similar local information to neighbouring
points in the LPP subspace. For example, two images of the
same person with open and closed mouth would be mapped
to similar points using LPP, but not necessarily with PCA
or LDA. This approach was shown to be superior to PCA
and LDA on multiple datasets. Further improvements were
achieved in [43] by making the LPP basis vectors orthogonal.
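The following NumPy/SciPy sketch illustrates the LPP projection, assuming the data has already been reduced with PCA so that the generalised eigenproblem is well conditioned; the number of neighbours and the heat-kernel width are illustrative parameters, not values from [42].

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lpp(X, num_components=10, n_neighbours=5, t=1.0):
    """X: (n_samples, n_dims) data (assumed already PCA-reduced).
    Returns a projection matrix of shape (n_dims, num_components)."""
    dists = cdist(X, X)
    # Adjacency graph with heat-kernel weights between each sample and its nearest neighbours
    W = np.zeros_like(dists)
    for i in range(len(X)):
        neighbours = np.argsort(dists[i])[1:n_neighbours + 1]
        W[i, neighbours] = np.exp(-dists[i, neighbours] ** 2 / t)
    W = np.maximum(W, W.T)                      # make the graph symmetric
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    # Generalised eigenproblem X^T L X a = lambda X^T D X a; smallest eigenvalues preserve locality
    eigvals, eigvecs = eigh(X.T @ L @ X, X.T @ D @ X)
    return eigvecs[:, :num_components]
```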
Another popular family of holistic methods is based on
sparse representation of faces. The idea, first proposed in
[44] as sparse representation-based classification (SRC), is to
represent faces using a linear combination of training images:
$$y = A x_0 \quad (4)$$

where $y$ is a test image, $A$ is a matrix containing all the
training images and $x_0$ is a vector of sparse coefficients. By
enforcing sparsity in the representation, most of the nonzero
coefficients belong to training images from the correct class.
At test time, the coefficients belonging to each class are
used to reconstruct the image, and the class that achieves the
lowest reconstruction error is considered the correct one. The
robustness of this approach to image corruptions like noise or
occlusions can be increased by adding a term of sparse error
coefficients $e_0$ to the linear combination:

$$y = A x_0 + e_0 \quad (5)$$

where the nonzero entries of $e_0$ correspond to corrupted pixels.
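A minimal sketch of SRC following Equation 4 is shown below, using scikit-learn's Lasso as a stand-in for the L1 solver; the solver choice and the regularisation strength are illustrative and not those used in [44].

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(y, A, labels, alpha=0.01):
    """Sparse representation-based classification (Equation 4).

    y: test image as a 1D pixel vector, A: (n_pixels, n_train) matrix whose columns
    are vectorised training images, labels: class label of each training image."""
    # Sparse coefficients x0 such that y ~ A x0 (Lasso used here as the L1 solver)
    x0 = Lasso(alpha=alpha, max_iter=5000).fit(A, y).coef_
    errors = {}
    for c in np.unique(labels):
        xc = np.where(labels == c, x0, 0.0)      # keep only the coefficients of class c
        errors[c] = np.linalg.norm(y - A @ xc)   # per-class reconstruction error
    return min(errors, key=errors.get)           # class with the lowest reconstruction error
```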
Many variations of this approach have been proposed for
increased robustness and reduced computational complexity.
For example, the discriminative K-SVD algorithm was pro-
posed in [45] to select a more compact and discriminative set
of training images to reconstruct the images; in [46], SRC
was extended by using a Markov random field to model the
prior assumption about the spatial continuity of the occluded
regions; and in [47], it was proposed to weight each pixel in the
image independently to achieve better reconstructed images.
More recently, inspired by probabilistic PCA [25], the joint
Bayesian method [48] has been proposed. In this method,
instead of using image differences as in [25], a face image is
represented as the sum of two independent Gaussian variables
representing intra-personal and inter-personal variations. Using
this method, an accuracy of 92.4% was achieved on the
challenging Labeled Faces in the Wild (LFW) dataset [49].
This is the highest accuracy reported by a holistic method on
this dataset.
Holistic methods have been of paramount importance to
the development of real-world face recognition systems, as
evidenced by the large number of approaches proposed in the
literature. In the next subsection, a popular family of methods
that evolved as an alternative to holistic methods, namely
feature-based methods, is discussed.
C. Feature-based Methods
Feature-based methods refer to methods that leverage local
features extracted at different locations in a face image.
Unlike geometry-based methods, feature-based methods focus
on extracting discriminative features rather than computing
their geometry¹. Feature-based methods tend to be more robust
than holistic methods when dealing with faces presenting local
variations (e.g. facial expression or illumination).

¹Technically, geometry-based methods can be seen as a special case of
feature-based methods, since many feature-based methods also leverage the
geometry of the extracted features.

For example,
consider two face images of the same subject in which the only
difference between them is that the person’s eyes are closed in
one of them. In a feature-based method, only the coefficients
of the feature vectors that correspond to features extracted
around the eyes will differ between the two images. On the
other hand, in a holistic method, all the coefficients of the
feature vectors might differ. Moreover, many of the descriptors
used in feature-based methods are designed to be invariant to
different variations (e.g. scaling, rotation or translation).
One of the first feature-based methods was the modular
eigenfaces method proposed in [50], an extension of the
original eigenfaces technique. In this method, PCA was inde-
pendently applied to different local regions in the face image to
produce sets of eigenfeatures. Although [50] showed that both
eigenfeatures and eigenfaces can achieve the same accuracy,
the eigenfeatures approach provided better accuracy when only
a few eigenvectors were used.
A feature-based method that uses binary edge features was
proposed in [51]. Their main contribution was to improve the
Hausdorff distance that was used in [52] to compare binary
images. The Hausdorff distance measures proximity between
two sets of points by looking at the greatest distance from a
point in one set to the closest point in the other set. In the
modified Hausdorff distance proposed in [51], each point in
one set has to be near some point in the other set. It was
argued that this property makes the method more robust to
small, non-rigid local distortions. A variation of this method
proposed line edge maps (LEMs) for face representation [53].
LEMs provide a compact face representation since edges are
encoded as line segments, i.e. only the coordinates of the end
points are used. A line segment Hausdorff distance was also
proposed in this work to match LEMs. The proposed distance
discourages matching lines with different orientations, is
robust to line displacements, and incorporates a measure of the
difference between the number of lines found in two LEMs.
A very popular feature-based method was the elastic bunch
graph matching (EBGM) method [54], an extension of the
dynamic link architecture proposed in [55]. In this method,
a face is represented using a graph of nodes. The nodes
contain Gabor wavelet coefficients [56] extracted around a
set of predefined facial landmarks. During training, a face
bunch graph (FBG) model is created by stacking the manually
located nodes of each training image. When a test face image
is presented, a new graph is created and fitted to the facial
landmarks by searching for the most similar nodes in the
FBG model. Two images can be compared by measuring the
similarity between their graph nodes. A version of this method
that uses histograms of oriented gradients (HOG) [57], [58]
instead of Gabor wavelet features was proposed in [59]. This
method outperforms the original EBGM method thanks to
the increased robustness of HOG descriptors to changes in
illumination, rotation and small displacements.
With the development of local feature descriptors in other
computer vision applications [60], the popularity of feature-
based methods for face recognition increased.

Fig. 5: (a) Face image divided into 4×4 local regions. (b)
Histograms of LBP descriptors computed from each local
region.

In [61], his-
tograms of LBP descriptors were extracted from local regions
independently, as shown in Figure 5, and concatenated to
form a global feature vector. Additionally, they measured
the similarity between two feature vectors $\mathbf{a}$ and $\mathbf{b}$ using a
weighted Chi-square distance:

$$\chi^2(\mathbf{a}, \mathbf{b}) = \sum_i w_i \frac{(a_i - b_i)^2}{a_i + b_i} \quad (6)$$

where $w_i$ is a weight that controls the contribution of the $i$-th
coefficient of the feature vectors. As shown in [62], many
variations of this method have been proposed to improve face
recognition accuracy and to tackle other related tasks such
as face detection, facial expression analysis and demographic
classification. For example, LBP descriptors extracted from
Gabor feature maps, known as LGBP descriptors, were pro-
posed in [63], [64]; a rotation invariant LBP descriptor that
applies Fourier transform to LBP histograms was proposed in
[65]; and a variation of LBP called local derivative pattern
(LDP) was proposed in [66] to extract high-order local infor-
mation by encoding directional pattern features.
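The regional LBP histograms of [61] and the weighted Chi-square distance of Equation 6 can be sketched as follows, using scikit-image's LBP implementation; the grid size and the region weights are illustrative assumptions.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_regional_histograms(image, grid=(4, 4), n_points=8, radius=1):
    """Concatenate LBP histograms computed over a grid of local regions (as in [61])."""
    lbp = local_binary_pattern(image, n_points, radius, method="uniform")
    n_bins = n_points + 2                         # number of uniform LBP codes
    hists = []
    for row in np.array_split(lbp, grid[0], axis=0):
        for region in np.array_split(row, grid[1], axis=1):
            hist, _ = np.histogram(region, bins=n_bins, range=(0, n_bins), density=True)
            hists.append(hist)
    return np.concatenate(hists)

def weighted_chi_square(a, b, weights, eps=1e-10):
    """Weighted Chi-square distance between two feature vectors (Equation 6)."""
    return np.sum(weights * (a - b) ** 2 / (a + b + eps))
```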
Scale-invariant feature transform (SIFT) descriptors [67]
have also been extensively used for face recognition. Three
different methodologies for matching SIFT descriptors across
face images were proposed in [68]: (i) computing the distances
between all pairs of SIFT descriptors and using the minimum
distance as a similarity score; (ii) similar to (i) but SIFT
descriptors around the eyes and the mouth are compared inde-
pendently, and the average of the two minimum distances is
used as a similarity score; and (iii) computing SIFT descriptors
over a regular grid and using the average distance between
the corresponding pairs of descriptors as a similarity score.
The best recognition accuracy was obtained using the third
method. A related method [69] proposed the use of speeded
up robust features (SURF) [70] instead of SIFT. In this
work, the authors observed that dense feature extraction over a
regular grid provides the best results. In [71], two variations of
SIFT were proposed, namely, the volume-SIFT which removes
unreliable keypoints based on their scale, and the partial-
descriptor-SIFT which finds keypoints at large scales and
near face boundaries. Compared to the original SIFT, both
approaches were shown to improve face recognition accuracy.
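As an illustration, matching strategy (i) of [68] can be sketched with OpenCV's SIFT implementation as follows; it is assumed that at least one keypoint is detected in each grayscale face image.

```python
import cv2
import numpy as np

def min_sift_distance(image_a, image_b):
    """Similarity score based on the minimum distance between any pair of SIFT
    descriptors extracted from two grayscale face images (strategy (i) of [68])."""
    sift = cv2.SIFT_create()
    _, desc_a = sift.detectAndCompute(image_a, None)
    _, desc_b = sift.detectAndCompute(image_b, None)
    # Pairwise Euclidean distances between all descriptor pairs
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    return -float(dists.min())   # negate so that a larger value means more similar
```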
Some feature-based methods have focused on learning local
features from training samples. For example, in [72], unsu-
pervised learning techniques (K-means [73], PCA tree [74]
and random-projection tree [74]) were used to encode local
microstructures of faces into a set of discrete codes. The
discrete codes were then grouped into histograms at different
facial regions. The final local descriptors were computed by
applying PCA to each histogram. A learning-based descriptor
with similarities to LBP was proposed in [75]. Specifically,
this descriptor consists of a differential pattern generated by
subtracting the centre pixel of a local 3×3 region from its
neighbouring pixels, and a Gaussian mixture model trained
to compute high-order statistics of the differential pattern.
Another LBP-like descriptor that has a learning stage
was proposed in [76]. In this work, LDA was used to (i)
learn a filter that when applied to an image enhances the
discriminative ability of the differential patterns, and (ii) learn
a set of weights that are assigned to the neighbouring pixels
within each local region to reflect their contribution to the
differential pattern.
Feature-based methods have been shown to provide more
robustness to different types of variations than holistic meth-
ods. However, some of the advantages of holistic methods are
lost (e.g. discarding non-discriminant information and more
compact representations). Hybrid methods that combine both
of these approaches are discussed next.
D. Hybrid Methods
Hybrid methods combine techniques from holistic and
feature-based methods. Before deep learning became
widespread, most state-of-the-art face recognition systems
were based on hybrid methods. Some hybrid methods simply
combine two different techniques without any interaction
between them. For example, in the modular eigenfaces
work [50] covered earlier, the authors experimented with
a combined representation using both eigenfaces and
eigenfeatures and achieved better accuracy than using either
of these two methods alone. However, the most popular
hybrid approach is to extract local features (e.g. LBP, SIFT)
and project them onto a lower-dimensional and discriminative
subspace (e.g. using PCA or LDA) as shown in Figure 6.
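A minimal sketch of this hybrid pipeline is given below, using the regional LBP histograms from the earlier sketch as local features and scikit-learn's PCA and LDA as the subspace methods; the number of PCA dimensions is an illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_hybrid_subspace(train_features, train_labels, pca_dims=200):
    """train_features: (n_images, n_dims) local descriptors (e.g. regional LBP
    histograms); returns the fitted PCA and LDA transforms."""
    pca = PCA(n_components=pca_dims).fit(train_features)
    lda = LinearDiscriminantAnalysis().fit(pca.transform(train_features), train_labels)
    return pca, lda

def hybrid_template(features, pca, lda):
    """Project the local features of one face onto the discriminative PCA+LDA subspace."""
    return lda.transform(pca.transform(features.reshape(1, -1)))[0]
```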
Several hybrid methods that use Gabor wavelet features
combined with different subspace methods have been pro-
posed [77], [78], [79]. In these methods, Gabor kernels of
different orientations and scales are convolved with an image
and their outputs are concatenated into a feature vector. The
feature vector is then downsampled to reduce its dimensional-
ity. In [77], the feature vector was further processed using the
enhanced linear discriminant model proposed in [80]. PCA
followed by ICA were applied to the downsampled feature
vector in [78], and the probabilistic reasoning model from
[80] was used to classify whether two images belong to the
same subject. In [79], kernel PCA with polynomial kernels was
applied to the feature vector to encode high-order statistics.

Fig. 6: Typical hybrid face representation.

All these hybrid methods were shown to provide better accuracy
than using Gabor wavelet features alone.
LBP descriptors have been a key component in many
hybrid methods. In [81], an image was divided into non-
overlapping regions and LBP descriptors were extracted at
multiple resolutions. The LBP coefficients at each region
were concatenated into regional feature vectors and projected
onto PCA+LDA subspaces. This approach was extended to
colour images in [82]. Laplacian PCA, an extension of PCA,
was shown to outperform standard PCA and kernel PCA
when applied to LBP descriptors in [83]. Two novel patch
versions of LBP, namely three-patch LBP (TPLBP) and four-
patch LBP (FPLBP), were combined with LDA and SVMs in
[84]. The proposed TPLBP and FPLBP descriptors can boost
face recognition accuracy by encoding similarities between
neighbouring patches of pixels. More recently, [85] proposed
a high-dimensional face representation by densely extracting
multi-scale LBP (MLBP) descriptors around facial landmarks.
The high-dimensional feature vector (100K-dim) was reduced
to 400 dimensions by PCA and a final discriminative feature
vector was learnt using joint Bayesian. In their experiments,
[85] showed that extracting high-dimensional features can
increase face recognition accuracy by 6-7% when going
from 1K to 100K dimensions. The main drawback of this
approach is the high computational costs needed to perform a
dimensionality reduction of such magnitude. For this reason,
they proposed to approximate the PCA and joint Bayesian
transformations with a sparse linear projection matrix $B$ by
solving the following optimisation problem:

$$\min_{B} \|Y - B^T X\|_2^2 + \lambda \|B\|_1 \quad (7)$$

where the first term is a reconstruction error between the ma-
trix $X$ of high-dimensional feature vectors and the matrix $Y$
of projected low-dimensional feature vectors; the second term
enforces sparsity in the projection matrix $B$; and $\lambda$ balances
the contribution of each term. Another recent method proposed
a multi-task learning approach based on a discriminative
Gaussian process latent variable model, named GaussianFace
[86]. This method extended the Gaussian process approach
proposed in [87] and incorporated a computationally more
efficient version of kernel LDA to learn a face representation
from LBP descriptors that can exploit data from multiple
source domains. Using this method, an accuracy of 98.52%
was achieved on the LFW dataset. This is competitive with
the accuracy achieved by many deep learning methods.
Some hybrid methods have proposed to use a combination
of different local features. For example, Gabor wavelet and
LBP features were used in [88]. The authors argued that
these two types of features capture complementary informa-
tion. While LBP descriptors capture small appearance details,
Gabor wavelet features encode facial shape over a broader
range of scales. PCA was applied independently to the feature
vectors containing the Gabor wavelet coefficients and the
LBP coefficients to reduce their dimensionality. The final face
representation was obtained by concatenating the two PCA-
transformed feature vectors and applying a subspace method
similar to kernel LDA, namely kernel discriminative common
vector [89]. Another method that uses Gabor wavelet and
LBP features was proposed in [90]. In this method, faces
were represented by applying PCA+LDA to regions containing
histograms of LGBP descriptors [64]. A multi-feature system
was proposed in [8] to tackle face recognition under difficult
illumination conditions. Three contributions were made in this
work: (i) a preprocessing pipeline that reduces the effect of
illumination variation; (ii) an extension of LBP, called local
ternary patterns (LTP), which is more discriminant and less
sensitive to noise in uniform regions; and (iii) an architecture
that combines sets of Gabor wavelet and LBP/LTP features
followed by kernel LDA, score normalisation and score fusion.
A related method [91] proposed a novel descriptor robust to
blur that extends local phase quantization (LPQ) descriptors
[92] to multiple scales (MLPQ). In addition, a kernel fusion
technique was used to combine MLPQ descriptors with MLBP
descriptors in the kernel LDA framework. In [5], an age
invariant face recognition system was proposed based on dense
extraction of SIFT and multi-scale LBP descriptors combined
with a novel multi-feature discriminant analysis (MFDA). The
MFDA technique uses random subspace sampling [93] to
construct multiple lower-dimensional feature subspaces, and
bagging [94] to select subsets of training samples for LDA
that contain inter-class pairs near the classification boundary to
increase the discriminative ability of the representation. Dense
SIFT descriptors were also used in [95] as texture features,
and combined with shape features in the form of relative
distances between pairs of facial landmarks. This combination
of shape and texture features was further processed using
multiple PCA+LDA transformations.
To conclude this subsection, other types of hybrid methods
that do not follow the pipeline described in Figure 6 are
reviewed. In [96], low-level local features (image intensities in
RGB and HSV colour spaces, edge magnitudes, and gradient
directions) were used to compute high-level visual features
by training attribute and simile binary SVM classifiers. The
attribute classifiers detect describable attributes of faces such
as gender, race and age. On the other hand, the simile
classifiers detect non-describable attributes by measuring the
similarity of different parts of a face to a limited set of
reference subjects. To compare two images, the outputs of all
the attribute and simile classifiers for both images are fed to
an SVM classifier. A method similar to the simile classifiers
from [96] was proposed in [97]. The main differences are
that [97] used a large number of simple one-vs-one classifiers
instead of the more complex one-vs-all classifiers used in [96],
and that SIFT descriptors were used as the low-level features.
Two metric learning approaches for face identification were
proposed in [98]. The first one, called logistic discriminant
metric learning (LDML) is based on the idea that the distance
between positive pairs (belonging to the same subject) should
be smaller than the distance between negative pairs (belonging
to different subjects). The second one, called marginalised
kNN (MkNN), uses a k-nearest neighbour classifier to find
how many positive neighbour pairs can be formed from the
neighbours of the two compared vectors. Both methods were
trained on pairs of vectors of SIFT descriptors computed at
fixed points on the face (corners of the mouth, eyes and nose).
Hybrid methods offer the best of holistic and feature-based
methods. Their main limitation is the choice of good features
that can fully extract the information needed to recognise a
face. Some approaches have tried to overcome this issue by
combining different types of features whereas others have
introduced a learning stage to improve the discriminative
ability of the features. Deep learning methods, discussed next,
take these ideas further by training end-to-end systems that
can learn a large number of features that are optimal for the
recognition task.
E. Deep Learning Methods
Convolutional neural networks (CNNs) are the most com-
mon type of deep learning method for face recognition. The
main advantage of deep learning methods is that they can be
trained with large amounts of data to learn a face represen-
tation that is robust to the variations present in the training
data. In this way, instead of designing specialised features
that are robust to different types of intra-class variations (e.g.
illumination, pose, facial expression, age, etc.), CNNs can
learn them from training data. The main drawback of deep
learning methods is that they need to be trained with very
large datasets that contain enough variations to generalise to
unseen samples. Fortunately, several large-scale face datasets
containing in-the-wild face images have recently been released
into the public domain [9], [10], [11], [12], [13], [14], [15]
to train CNN models. Apart from learning discriminative
features, neural networks can reduce dimensionality and be
trained as classifiers or using metric learning approaches.
CNNs are considered end-to-end trainable systems that do not
need to be combined with any other specific methods.
CNN models for face recognition can be trained using
different approaches. One of them consists of treating the
problem as a classification one, wherein each subject in the
training set corresponds to a class. After training, the model
can be used to recognise subjects that are not present in the
training set by discarding the classification layer and using the
features of the previous layer as the face representation [99].
In the deep learning literature, these features are commonly
referred to as bottleneck features. Following this first training
stage, the model can be further trained using other techniques
to optimise the bottleneck features for the target application
(e.g. using joint Bayesian [9] or fine-tuning the CNN model
with a different loss function [10]). Another common approach
to learning face representation is to directly learn bottleneck
features by optimising a distance metric between pairs of faces
[100], [101] or triplets of faces [102].
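The classification-based approach can be sketched in PyTorch as follows; the architecture is a toy example for illustration only and does not correspond to any of the networks discussed in this section.

```python
import torch
import torch.nn as nn

class FaceCNN(nn.Module):
    """Toy CNN with a bottleneck layer followed by a classification layer."""
    def __init__(self, num_subjects, embedding_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embedding_dim),           # bottleneck features
        )
        self.classifier = nn.Linear(embedding_dim, num_subjects)

    def forward(self, x):
        return self.classifier(self.features(x))   # logits for the softmax loss

    def embed(self, x):
        return self.features(x)                     # face representation used at test time

# Training uses the softmax loss (cross-entropy over subject labels):
#   loss = nn.CrossEntropyLoss()(model(images), subject_ids)
# At test time the classification layer is discarded and templates extracted with
# model.embed(x) are compared with, e.g., cosine similarity.
```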
The idea of using neural networks for face recognition is
not new. An early method based on a probabilistic decision-
based neural network (PDBNN) [103] was proposed in 1997
for face detection, eye localisation and face recognition. The
face recognition PDBNN was divided into one fully-connected
subnet per training subject to reduce the number of hidden
units and avoid overfitting. Two PDBNNs were trained using
intensity and edge features respectively and their outputs
were combined to give a final classification decision. Another
early method [104] proposed to use a combination of a self-
organising map (SOM) and a convolutional neural network. A
self-organising map [105] is a type of neural network trained in
an unsupervised way that projects the input data onto a lower-
dimensional space that preserves the topological properties of
the input space (i.e. inputs that are nearby in the original
space are also nearby in the output space). Note that neither
of these two early methods was trained end-to-end (edge
features were used in [103] and a SOM in [104]), and that
the proposed neural network architectures were shallow. An
end-to-end face recognition CNN was proposed in [100]. This
method used a siamese architecture trained with a contrastive
loss function [106]. The contrastive loss implements a metric
learning procedure that aims to minimise the distance between
pairs of feature vectors corresponding to the same subject
while maximising the distance between pairs of feature vectors
corresponding to different subjects. The CNN architecture
used in this method was also shallow and was trained with
small datasets.
None of the methods mentioned above achieved ground-
breaking results, mainly due to the low capacity of the
networks used and the relatively small datasets available for
training at the time. It was not until these models were scaled
up and trained with large amounts of data [107] that the first
deep learning methods for face recognition [99], [9] became
the state-of-the-art. In particular, Facebook’s DeepFace [99],
one of the first CNN-based approaches for face recognition
that used a high capacity model, achieved an accuracy of
97.35% on the LFW benchmark, reducing the error of the
previous state-of-the-art by 27%. The authors trained a CNN
with softmax loss² using a dataset containing 4.4 million faces
from 4,030 subjects. Two novel contributions were made in
this work: (i) an effective facial alignment system based on
explicit 3D modelling of faces, and (ii) a CNN architecture
containing locally connected layers [108], [109] that (unlike
regular convolutional layers) can learn different features from
each region in an image. Concurrently, the DeepID system
[9] achieved similar results by training 60 different CNNs
on patches comprising ten regions, three scales and RGB or
grey channels. During testing, 160 bottleneck features were ex-
tracted from each patch and its horizontally flipped counterpart
²We refer to softmax loss as the combination of the softmax activation
function and the cross-entropy loss used to train classifiers.
TABLE I: Public large-scale face datasets.
Dataset                Images     Subjects   Images per subject
CelebFaces+ [9]        202,599    10,177     19.9
UMDFaces [14]          367,920    8,501      43.3
CASIA-WebFace [10]     494,414    10,575     46.8
VGGFace [11]           2.6M       2,622      1,000
VGGFace2 [15]          3.31M      9,131      362.6
MegaFace [13]          4.7M       672,057    7
MS-Celeb-1M [12]       10M        100,000    100
to form a 19,200-dimensional feature vector (160 × 2 × 60).
Similar to [99], the proposed CNN architecture also used
locally connected layers. The verification result was obtained
by training a joint Bayesian classifier [48] on the 19,200-
dimensional feature vectors extracted by the CNNs. The sys-
tem was trained on a dataset containing 202,599 face images
of 10,177 celebrities [9].
There are three main factors that affect the accuracy of
CNN-based methods for face recognition: training data, CNN
architecture, and loss function. As in most deep learning
applications, large training sets are needed to prevent overfit-
ting. In general, CNNs trained for classification become more
accurate as the number of samples per class increases. This is
because the CNN model is able to learn more robust features
when it is exposed to more intra-class variations. However, in
face recognition we are interested in extracting features that
generalise to subjects not present in the training set. Hence,
the datasets used for face recognition need to also contain a
large number of subjects so that the model is exposed to more
inter-class variations. The effect that the number of subjects
in a dataset has in face recognition accuracy was studied in
[110]. In this work, a large dataset was first sorted by the
number of images per subject in decreasing order. Then, a
CNN was trained with different subsets of training data by
gradually increasing the number of subjects. The best accuracy
was obtained when the first 10,000 subjects with the most
images were used for training. Adding more subjects decreased
the accuracy since very few images were available for each
extra subject. Another study [111] investigated whether wider
datasets are better than deeper datasets or vice versa (a dataset
is considered wider than another if it contains more subjects;
similarly, a dataset is considered deeper than another if it
contains more images per subject). From this study, it was
concluded that given the same number of images, wider
datasets provide better accuracy. The authors argued that this
is due to the fact that wider datasets contain more inter-class
variations and, therefore, generalise better to unseen subjects.
Table I shows some of the most common public datasets used
to train CNNs for face recognition.
CNN architectures for face recognition have been inspired
by those achieving state-of-the-art accuracy on the ImageNet
Large Scale Visual Recognition Challenge (ILSVRC). For
example, a version of the VGG network [112] with 16 layers
was used in [11], and a similar but smaller network was used
in [10]. In [102], two different types of CNN architectures
were explored: VGG style networks [112] and GoogleNet style
networks [113].

Fig. 7: Original residual block proposed in [114].

Even though both types of networks achieved
comparable accuracy, the GoogleNet style networks had 20
times fewer parameters. More recently, residual networks
(ResNets) [114] have become the preferred choice for many
object recognition tasks, including face recognition [115],
[116], [117], [118], [119], [120], [121]. The main novelty of
ResNets is the introduction of a building block that uses a
shortcut connection to learn a residual mapping, as shown in
Figure 7. The use of shortcut connections allows the training
of much deeper architectures as they facilitate the flow of
information across layers. A thorough study of different CNN
architectures was carried out in [121]. The best trade-off
between accuracy, speed and model size was obtained with
a 100-layer ResNet with a residual block similar to the one
proposed in [122].
The choice of loss function for training CNN-based methods
has been the most recent active area of research in face
recognition. Even though CNNs trained with softmax loss
have been very successful [99], [9], [10], [123], it has been
argued that the use of this loss function does not generalise
well to subjects not present in the training set. This is because
the softmax loss encourages the network to learn features that increase
inter-class differences (to be able to separate the classes in
the training set) but does not necessarily reduce intra-class
variations. Several methods have been proposed to mitigate
this issue. A simple approach is to optimise the bottleneck
features using a discriminative subspace method such as joint
Bayesian [48], as done in [9], [124], [125], [126], [10], [127].
Another approach is to use metric learning. For example, a
pairwise contrastive loss was used as the only supervisory
signal in [100], [101] and combined with a classification loss
in [124], [125], [126]. One of the most popular metric learning
approaches for face recognition is the triplet loss function
[128], first used in [102] for the face recognition task. The aim
of the triplet loss is to separate the distance between positive
pairs from the distance between negative pairs by a margin.
More formally, for each triplet $i$ the following condition needs
to be satisfied [102]:

$$\|f(x_a) - f(x_p)\|_2^2 + \alpha < \|f(x_a) - f(x_n)\|_2^2 \quad (8)$$

where $x_a$ is an anchor image, $x_p$ is an image of the same
subject, $x_n$ is an image of a different subject, $f$ is a mapping
learnt by a model and $\alpha$ is a margin that is enforced between
positive and negative pairs. In practice, CNNs trained with
triplet loss converge slower than with softmax loss due to the
large number of triplets (or pairs in the case of contrastive
loss) needed to cover the entire training set. Although this
problem can be alleviated by selecting hard triplets (i.e. triplets
that violate the margin condition) during training [102], it is
common to train with softmax loss in a first training stage and
then fine-tune bottleneck features with triplet loss in a second
training stage [11], [129], [130]. Some variations of the triplet
loss have been proposed. For example, in [129], the dot prod-
uct was used as a similarity measure instead of the Euclidean
distance; a probabilistic triplet loss was proposed in [130];
and a modified triplet loss that also minimises the standard
deviation of the distributions of positive and negative scores
was proposed in [131], [132]. An alternative loss function used
to learn discriminative features is the centre loss proposed
in [133]. The goal of the centre loss is to minimise the
distances between bottleneck features and their corresponding
class centres. By jointly training with softmax and centre
loss, it was shown that the features learnt by a CNN could
effectively increase inter-personal variations (softmax loss)
and reduce intra-personal variations (centre loss). The centre
loss has the advantage of being more efficient and easier to
implement than the contrastive and triplet losses since it does
not require forming pairs or triplets during training. Another
related metric learning method is the range loss proposed in
[134] for improving training with unbalanced datasets. The
range loss has two components. The intra-class component
of the loss minimises the k-largest distances between samples
of the same class, and the inter-class component of the loss
maximises the distance between the closest two class centres
in each training batch. By using these extreme cases, the range
loss uses the same information from each class, regardless of
how many samples per class are available. Similar to the centre
loss, the range loss needs to be combined with softmax loss
to prevent the loss from degrading to zero [133].
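A minimal PyTorch sketch of a hinge-based triplet loss derived from Equation 8 is given below; the margin value is illustrative, and the embeddings are assumed to be produced by the model $f$.

```python
import torch

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge form of Equation 8: the squared distance to the positive plus a margin
    should be smaller than the squared distance to the negative.
    anchor, positive, negative: (batch, dim) embeddings f(x)."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # ||f(x_a) - f(x_p)||^2
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # ||f(x_a) - f(x_n)||^2
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```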
One of the difficulties that arise when combining different
loss functions is finding the correct balance between each
term. Recently, several approaches have proposed to modify
the softmax loss so that it can learn discriminative features
with no need to combine it with other losses. One approach
that has been shown to increase the discriminative ability of the
bottleneck features is feature normalisation [115], [118]. For
example, [115] proposed to normalise the features to have unit
L2-norm and [118] proposed to normalise the features to have
zero mean and unit variance. A very successful development
has been the introduction of a margin in the decision boundary
between each class in the softmax loss [135]. For simplicity,
consider binary classification with softmax loss. In this case,
Fig. 8: Effect of introducing a margin m in the decision
boundary between two classes. (a) Softmax loss. (b) Softmax
loss with margin.
TABLE II: Decision boundaries for different variations of the
softmax loss with margin. Note that the decision boundaries
are for class 1 in a binary classification case.
Type of softmax margin                 Decision boundary
Multiplicative angular margin [116]    $\|x\|(\cos(m\theta_1) - \cos\theta_2) = 0$
Additive cosine margin [119], [120]    $s(\cos\theta_1 - m - \cos\theta_2) = 0$
Additive angular margin [121]          $s(\cos(\theta_1 + m) - \cos\theta_2) = 0$
the decision boundary between each class (if the biases are
zero) is given by:

$$\|x\|(\|W_1\|\cos\theta_1 - \|W_2\|\cos\theta_2) = 0 \quad (9)$$

where $x$ is a feature vector, $W_1$ and $W_2$ are the weights
corresponding to each class and $\theta_1$ and $\theta_2$ are the angles
between $x$ and $W_1$ and $W_2$ respectively. By introducing
a multiplicative margin $m$ in Equation 9, the two decision
boundaries become more stringent:

$$\|x\|(\|W_1\|\cos(m\theta_1) - \|W_2\|\cos\theta_2) = 0 \text{ for class 1} \quad (10)$$

$$\|x\|(\|W_1\|\cos\theta_1 - \|W_2\|\cos(m\theta_2)) = 0 \text{ for class 2} \quad (11)$$
As shown in Figure 8, the margin can effectively increase the
separation between classes and their intra-class compactness.
Several alternative approaches have been proposed depending
on how the margin is incorporated into the loss [116], [119],
[120], [121]. For example, in [116] the weight vectors were
normalised to have unit norm so that the decision boundary
only depends on the angles θ1and θ2. In [119], [120],
an additive cosine margin was proposed. Compared to the
multiplicative margin [135], [116], the additive margin is
easier to implement and optimise. In this work, apart from
normalising the weight vectors, the feature vectors were also
normalised and scaled as done in [115]. An alternative additive
margin was proposed in [121] which keeps the advantages of
[119], [120] but has a better geometric interpretation since the
margin is added to the angle and not to the cosine. Table II
summarises the decision boundaries for the different variations
of the softmax loss with margin. These approaches are the
current state-of-the-art in face recognition.
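As an illustration, the additive angular margin (last row of Table II) can be sketched in PyTorch as follows; both features and classifier weights are L2-normalised so that the logits become scaled cosines, and the scale s and margin m values shown are illustrative.

```python
import torch
import torch.nn.functional as F

def margin_softmax_logits(features, weights, labels, s=64.0, m=0.5):
    """Additive angular margin logits (last row of Table II).

    features: (batch, dim) bottleneck features, weights: (num_classes, dim)
    classifier weights, labels: (batch,) ground-truth class indices."""
    cos = F.normalize(features) @ F.normalize(weights).t()   # cos(theta) for every class
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, weights.size(0)).bool()
    # Add the margin m to the angle of the ground-truth class only
    cos_margin = torch.where(target, torch.cos(theta + m), cos)
    return s * cos_margin    # fed to the usual cross-entropy (softmax) loss
```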
III. CONCLUSIONS
We have seen how face recognition has followed the same
transition as many other computer vision applications. Tra-
ditional methods based on hand-engineered features that pro-
vided state-of-the-art accuracy only a few years ago have been
replaced by deep learning methods based on CNNs. Indeed,
face recognition systems based on CNNs have become the
standard due to the significant accuracy improvement achieved
over other types of methods. Moreover, it is straightforward
to scale-up these systems to achieve even higher accuracy by
increasing the size of the training sets and/or the capacity of
the networks. However, collecting large amounts of labelled
face images is expensive, and very deep CNN architectures
are slow to train and deploy. Generative adversarial networks
(GANs) [136] are a promising solution to the first issue. Re-
cent works on GANs with face images include facial attributes
manipulation [137], [138], [139], [140], [141], [142], [143],
[144], [145], [146], facial expression editing [147], [148],
[142], generation of novel identities [149], face frontalisation
[150], [151] and face ageing [152], [153]. It is expected
that these advancements will be used to generate additional
training images without requiring millions of face images to
be labelled. To address the second issue, more efficient archi-
tectures such as MobileNets [154], [155] are being developed
and used for real-time face recognition on devices with limited
computational resources [156].
REFERENCES
[1] M. D. Kelly, “Visual identification of people by computer.,” tech. rep.,
STANFORD UNIV CALIF DEPT OF COMPUTER SCIENCE, 1970.
[2] T. Kanade, “Picture processing by computer complex and recogni-
tion of human faces,” PhD Thesis, Kyoto University, 1973.
[3] K. Delac and M. Grgic, “A survey of biometric recognition methods,” in
46th International Symposium Electronics in Marine, vol. 46, pp. 16–
18, 2004.
[4] U. Park, Y. Tong, and A. K. Jain, “Age-invariant face recognition,
IEEE transactions on pattern analysis and machine intelligence,
vol. 32, no. 5, pp. 947–954, 2010.
[5] Z. Li, U. Park, and A. K. Jain, “A discriminative model for age
invariant face recognition,IEEE transactions on information forensics
and security, vol. 6, no. 3, pp. 1028–1037, 2011.
[6] C. Ding and D. Tao, “A comprehensive survey on pose-invariant face
recognition,” ACM Transactions on intelligent systems and technology
(TIST), vol. 7, no. 3, p. 37, 2016.
[7] D.-H. Liu, K.-M. Lam, and L.-S. Shen, “Illumination invariant face
recognition,” Pattern Recognition, vol. 38, no. 10, pp. 1705–1716,
2005.
[8] X. Tan and B. Triggs, “Enhanced local texture feature sets for face
recognition under difficult lighting conditions,IEEE transactions on
image processing, vol. 19, no. 6, pp. 1635–1650, 2010.
[9] Y. Sun, X. Wang, and X. Tang, “Deep learning face representation from
predicting 10,000 classes,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 1891–1898, 2014.
[10] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from
scratch,” arXiv preprint arXiv:1411.7923, 2014.
[11] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al., “Deep face recogni-
tion.,” in BMVC, vol. 1, p. 6, 2015.
[12] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A
dataset and benchmark for large-scale face recognition,” in European
Conference on Computer Vision, pp. 87–102, Springer, 2016.
[13] A. Nech and I. Kemelmacher-Shlizerman, “Level playing field for
million scale face recognition,” in 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 3406–3415, IEEE, 2017.
[14] A. Bansal, A. Nanduri, C. D. Castillo, R. Ranjan, and R. Chellappa,
“Umdfaces: An annotated face dataset for training deep networks,
in Biometrics (IJCB), 2017 IEEE International Joint Conference on,
pp. 464–473, IEEE, 2017.
[15] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “Vggface2:
A dataset for recognising faces across pose and age,” arXiv preprint
arXiv:1710.08092, 2017.
[16] T. Hassner, S. Harel, E. Paz, and R. Enbar, “Effective face frontalization
in unconstrained images,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 4295–4304, 2015.
[17] R. Brunelli and T. Poggio, “Face recognition: Features versus tem-
plates,” IEEE transactions on pattern analysis and machine intelli-
gence, vol. 15, no. 10, pp. 1042–1052, 1993.
[18] J. Shi, A. Samal, and D. Marx, “How effective are landmarks and
their geometry for face recognition?,” Computer vision and image
understanding, vol. 102, no. 2, pp. 117–133, 2006.
[19] I. L. Dryden and K. V. Mardia, Statistical shape analysis, vol. 4. Wiley
Chichester, 1998.
[20] F. Daniyal, P. Nair, and A. Cavallaro, “Compact signatures for 3d
face recognition under varying expressions,” in Advanced Video and
Signal Based Surveillance, 2009. AVSS’09. Sixth IEEE International
Conference on, pp. 302–307, IEEE, 2009.
[21] S. Gupta, M. K. Markey, and A. C. Bovik, “Anthropometric 3d face
recognition,” International journal of computer vision, vol. 90, no. 3,
pp. 331–349, 2010.
[22] L. Sirovich and M. Kirby, “Low-dimensional procedure for the char-
acterization of human faces,” Josa a, vol. 4, no. 3, pp. 519–524, 1987.
[23] M. Kirby and L. Sirovich, “Application of the karhunen-loeve proce-
dure for the characterization of human faces,” IEEE Transactions on
Pattern analysis and Machine intelligence, vol. 12, no. 1, pp. 103–108,
1990.
[24] M. Turk and A. Pentland, “Eigenfaces for recognition,Journal of
cognitive neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[25] B. Moghaddam, W. Wahid, and A. Pentland, “Beyond eigenfaces:
Probabilistic matching for face recognition,” in Automatic Face and
Gesture Recognition, 1998. Proceedings. Third IEEE International
Conference on, pp. 30–35, IEEE, 1998.
[26] B. Schölkopf, A. Smola, and K.-R. Müller, “Kernel principal com-
ponent analysis,” in International Conference on Artificial Neural
Networks, pp. 583–588, Springer, 1997.
[27] K. I. Kim, K. Jung, and H. J. Kim, “Face recognition using kernel
principal component analysis,” IEEE signal processing letters, vol. 9,
no. 2, pp. 40–42, 2002.
[28] P. Comon, “Independent component analysis, a new concept?,” Signal
processing, vol. 36, no. 3, pp. 287–314, 1994.
[29] M. S. Bartlett, “Independent component representations for face recog-
nition,” in Face Image Analysis by Unsupervised Learning, pp. 39–67,
Springer, 2001.
[30] J. Yang, D. Zhang, A. F. Frangi, and J.-y. Yang, “Two-dimensional pca:
a new approach to appearance-based face representation and recogni-
tion,” IEEE transactions on pattern analysis and machine intelligence,
vol. 26, no. 1, pp. 131–137, 2004.
[31] F. S. Samaria and A. C. Harter, “Parameterisation of a stochastic model
for human face identification,” in Applications of Computer Vision,
1994., Proceedings of the Second IEEE Workshop on, pp. 138–142,
IEEE, 1994.
[32] R. A. Fisher, “The statistical utilization of multiple measurements,”
Annals of Human Genetics, vol. 8, no. 4, pp. 376–386, 1938.
[33] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs.
fisherfaces: Recognition using class specific linear projection,” IEEE
Transactions on pattern analysis and machine intelligence, vol. 19,
no. 7, pp. 711–720, 1997.
[34] K. Etemad and R. Chellappa, “Discriminant analysis for recognition
of human face images,” JOSA A, vol. 14, no. 8, pp. 1724–1733, 1997.
[35] W. Zhao, A. Krishnaswamy, R. Chellappa, D. L. Swets, and J. Weng,
“Discriminant analysis of principal components for face recognition,”
in Face Recognition, pp. 73–85, Springer, 1998.
[36] W. Zhao, R. Chellappa, and P. J. Phillips, Subspace linear discriminant
analysis for face recognition. Citeseer, 1999.
[37] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller,
“Fisher discriminant analysis with kernels,” in Neural networks for
signal processing IX, 1999. Proceedings of the 1999 IEEE signal
processing society workshop, pp. 41–48, IEEE, 1999.
[38] Q. Liu, R. Huang, H. Lu, and S. Ma, “Face recognition using kernel-
based fisher discriminant analysis,” in Automatic Face and Gesture
Recognition, 2002. Proceedings. Fifth IEEE International Conference
on, pp. 197–201, IEEE, 2002.
[39] S. Ioffe, “Probabilistic linear discriminant analysis,” in European
Conference on Computer Vision, pp. 531–542, Springer, 2006.
[40] P. J. Phillips, “Support vector machines applied to face recognition,”
in Advances in Neural Information Processing Systems, pp. 803–809,
1999.
[41] K. Jonsson, J. Kittler, Y. Li, and J. Matas, “Support vector machines
for face authentication,” Image and Vision Computing, vol. 20, no. 5-6,
pp. 369–375, 2002.
[42] X. He, S. Yan, Y. Hu, P. Niyogi, and H.-J. Zhang, “Face recognition
using laplacianfaces,” IEEE transactions on pattern analysis and ma-
chine intelligence, vol. 27, no. 3, pp. 328–340, 2005.
[43] D. Cai, X. He, J. Han, and H.-J. Zhang, “Orthogonal laplacianfaces
for face recognition,” IEEE transactions on image processing, vol. 15,
no. 11, pp. 3608–3614, 2006.
[44] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face
recognition via sparse representation,” IEEE transactions on pattern
analysis and machine intelligence, vol. 31, no. 2, pp. 210–227, 2009.
[45] Q. Zhang and B. Li, “Discriminative k-svd for dictionary learning in
face recognition,” in Computer Vision and Pattern Recognition (CVPR),
2010 IEEE Conference on, pp. 2691–2698, IEEE, 2010.
[46] Z. Zhou, A. Wagner, H. Mobahi, J. Wright, and Y. Ma, “Face
recognition with contiguous occlusion using markov random fields,”
in Computer Vision, 2009 IEEE 12th International Conference on,
pp. 1050–1057, IEEE, 2009.
[47] H. Jia and A. M. Martinez, “Face recognition with occlusions in the
training and testing sets,” in Automatic Face & Gesture Recognition,
2008. FG’08. 8th IEEE International Conference on, pp. 1–6, IEEE,
2008.
[48] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, “Bayesian face
revisited: A joint formulation,” in European Conference on Computer
Vision, pp. 566–579, Springer, 2012.
[49] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled
faces in the wild: A database for studying face recognition in uncon-
strained environments,” tech. rep., Technical Report 07-49, University
of Massachusetts, Amherst, 2007.
[50] A. Pentland, B. Moghaddam, T. Starner, et al., “View-based and
modular eigenspaces for face recognition,” 1994.
[51] B. Takacs, “Comparing face images using the modified hausdorff
distance,” Pattern Recognition, vol. 31, no. 12, pp. 1873–1881, 1998.
[52] D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge, “Compar-
ing images using the hausdorff distance,” IEEE Transactions on pattern
analysis and machine intelligence, vol. 15, no. 9, pp. 850–863, 1993.
[53] Y. Gao and M. K. Leung, “Face recognition using line edge map,” IEEE
transactions on pattern analysis and machine intelligence, vol. 24,
no. 6, pp. 764–779, 2002.
[54] L. Wiskott, N. Krüger, N. Kuiger, and C. Von Der Malsburg, “Face
recognition by elastic bunch graph matching,” IEEE Transactions on
pattern analysis and machine intelligence, vol. 19, no. 7, pp. 775–779,
1997.
[55] M. Lades, J. C. Vorbruggen, J. Buhmann, J. Lange, C. Von Der Mals-
burg, R. P. Wurtz, and W. Konen, “Distortion invariant object recogni-
tion in the dynamic link architecture,” IEEE Transactions on computers,
vol. 42, no. 3, pp. 300–311, 1993.
[56] T. S. Lee, “Image representation using 2d gabor wavelets,” IEEE
Transactions on Pattern Analysis & Machine Intelligence, no. 10,
pp. 959–971, 1996.
[57] W. T. Freeman and M. Roth, “Orientation histograms for hand gesture
recognition,” in International workshop on automatic face and gesture
recognition, vol. 12, pp. 296–301, 1995.
[58] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in Computer Vision and Pattern Recognition, 2005. CVPR
2005. IEEE Computer Society Conference on, vol. 1, pp. 886–893,
IEEE, 2005.
[59] A. Albiol, D. Monzo, A. Martin, J. Sastre, and A. Albiol, “Face
recognition using hog–ebgm,” Pattern Recognition Letters, vol. 29,
no. 10, pp. 1537–1543, 2008.
[60] K. Mikolajczyk and C. Schmid, “A performance evaluation of local
descriptors,” IEEE transactions on pattern analysis and machine intel-
ligence, vol. 27, no. 10, pp. 1615–1630, 2005.
[61] T. Ahonen, A. Hadid, and M. Pietikäinen, “Face description with local
binary patterns: Application to face recognition,” IEEE transactions on
pattern analysis and machine intelligence, vol. 28, no. 12, pp. 2037–
2041, 2006.
[62] D. Huang, C. Shan, M. Ardabilian, Y. Wang, and L. Chen, “Local
binary patterns and its application to facial image analysis: a survey,”
IEEE Transactions on Systems, Man, and Cybernetics, Part C (Appli-
cations and Reviews), vol. 41, no. 6, pp. 765–781, 2011.
[63] W. Zhang, S. Shan, H. Zhang, W. Gao, and X. Chen, “Multi-resolution
histograms of local variation patterns (mhlvp) for robust face recogni-
tion,” in International Conference on Audio-and Video-Based Biometric
Person Authentication, pp. 937–944, Springer, 2005.
[64] W. Zhang, S. Shan, W. Gao, X. Chen, and H. Zhang, “Local gabor
binary pattern histogram sequence (lgbphs): a novel non-statistical
model for face representation and recognition,” in Computer Vision,
2005. ICCV 2005. Tenth IEEE International Conference on, vol. 1,
pp. 786–791, IEEE, 2005.
[65] T. Ahonen, J. Matas, C. He, and M. Pietikäinen, “Rotation invariant
image description with local binary pattern histogram fourier features,”
in Scandinavian Conference on Image Analysis, pp. 61–70, Springer,
2009.
[66] B. Zhang, Y. Gao, S. Zhao, and J. Liu, “Local derivative pattern versus
local binary pattern: face recognition with high-order local pattern
descriptor,” IEEE transactions on image processing, vol. 19, no. 2,
pp. 533–544, 2010.
[67] D. G. Lowe, “Object recognition from local scale-invariant features,”
in Computer vision, 1999. The proceedings of the seventh IEEE
international conference on, vol. 2, pp. 1150–1157, IEEE, 1999.
[68] M. Bicego, A. Lagorio, E. Grosso, and M. Tistarelli, “On the use of
sift features for face authentication,” in Computer Vision and Pattern
Recognition Workshop, 2006. CVPRW’06. Conference on, pp. 35–35,
IEEE, 2006.
[69] P. Dreuw, P. Steingrube, H. Hanselmann, H. Ney, and G. Aachen, “Surf-
face: Face recognition under viewpoint consistency constraints.,” in
BMVC, pp. 1–11, 2009.
[70] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust
features,” in European conference on computer vision, pp. 404–417,
Springer, 2006.
[71] C. Geng and X. Jiang, “Face recognition using sift features,” in
Image Processing (ICIP), 2009 16th IEEE International Conference
on, pp. 3313–3316, IEEE, 2009.
[72] Z. Cao, Q. Yin, X. Tang, and J. Sun, “Face recognition with learning-
based descriptor,” in Computer Vision and Pattern Recognition (CVPR),
2010 IEEE Conference on, pp. 2707–2714, IEEE, 2010.
[73] S. Lloyd, “Least squares quantization in pcm,” IEEE transactions on
information theory, vol. 28, no. 2, pp. 129–137, 1982.
[74] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma, “Learning the
structure of manifolds using random projections,” in Advances in
Neural Information Processing Systems, pp. 473–480, 2008.
[75] G. Sharma, S. ul Hussain, and F. Jurie, “Local higher-order statistics
(lhs) for texture categorization and facial analysis,” in European
Conference on Computer Vision, pp. 1–12, Springer, 2012.
[76] Z. Lei, M. Pietikäinen, and S. Z. Li, “Learning discriminant face
descriptor,” IEEE Transactions on Pattern Analysis and Machine In-
telligence, vol. 36, no. 2, pp. 289–302, 2014.
[77] C. Liu and H. Wechsler, “Gabor feature based classification using the
enhanced fisher linear discriminant model for face recognition,” IEEE
Transactions on Image processing, vol. 11, no. 4, pp. 467–476, 2002.
[78] C. Liu and H. Wechsler, “Independent component analysis of gabor
features for face recognition,” IEEE transactions on Neural Networks,
vol. 14, no. 4, pp. 919–928, 2003.
[79] C. Liu, “Gabor-based kernel pca with fractional power polynomial
models for face recognition,” IEEE transactions on pattern analysis
and machine intelligence, vol. 26, no. 5, pp. 572–581, 2004.
[80] C. Liu and H. Wechsler, “Robust coding schemes for indexing and
retrieval from large face databases,” IEEE Transactions on image
processing, vol. 9, no. 1, pp. 132–137, 2000.
[81] C.-H. Chan, J. Kittler, and K. Messer, “Multi-scale local binary
pattern histograms for face recognition,” in International conference
on biometrics, pp. 809–818, Springer, 2007.
[82] C.-H. Chan, J. Kittler, and K. Messer, “Multispectral local binary
pattern histogram for component-based color face verification,” in
Biometrics: Theory, Applications, and Systems, 2007. BTAS 2007. First
IEEE International Conference on, pp. 1–7, IEEE, 2007.
[83] D. Zhao, Z. Lin, and X. Tang, “Laplacian pca and its applications,” in
Computer Vision, 2007. ICCV 2007. IEEE 11th International Confer-
ence on, pp. 1–8, IEEE, 2007.
[84] L. Wolf, T. Hassner, and Y. Taigman, “Descriptor based methods in the
wild,” in Workshop on faces in’real-life’images: Detection, alignment,
and recognition, 2008.
[85] D. Chen, X. Cao, F. Wen, and J. Sun, “Blessing of dimensionality:
High-dimensional feature and its efficient compression for face veri-
fication,” in Computer Vision and Pattern Recognition (CVPR), 2013
IEEE Conference on, pp. 3025–3032, IEEE, 2013.
[86] C. Lu and X. Tang, “Surpassing human-level face verification perfor-
mance on lfw with gaussianface.,” in AAAI, pp. 3811–3819, 2015.
[87] R. Urtasun and T. Darrell, “Discriminative gaussian process latent vari-
able model for classification,” in Proceedings of the 24th international
conference on Machine learning, pp. 927–934, ACM, 2007.
[88] X. Tan and B. Triggs, “Fusing gabor and lbp feature sets for kernel-
based face recognition,” in International Workshop on Analysis and
Modeling of Faces and Gestures, pp. 235–249, Springer, 2007.
[89] H. Cevikalp, M. Neamtu, and M. Wilkes, “Discriminative common
vector method with kernels,” IEEE Transactions on Neural Networks,
vol. 17, no. 6, pp. 1550–1565, 2006.
[90] S. Shan, W. Zhang, Y. Su, X. Chen, and W. Gao, “Ensemble of
piecewise fda based on spatial histograms of local (gabor) binary
patterns for face recognition,” in Pattern Recognition, 2006. ICPR
2006. 18th International Conference on, vol. 3, IEEE, 2006.
[91] C. H. Chan, M. A. Tahir, J. Kittler, and M. Pietikäinen, “Multiscale
local phase quantization for robust component-based face recognition
using kernel fusion of multiple descriptors,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 35, no. 5, pp. 1164–
1177, 2013.
[92] E. Rahtu, J. Heikkilä, V. Ojansivu, and T. Ahonen, “Local phase
quantization for blur-insensitive image analysis,” Image and Vision
Computing, vol. 30, no. 8, pp. 501–512, 2012.
[93] T. K. Ho, “The random subspace method for constructing decision
forests,” IEEE transactions on pattern analysis and machine intelli-
gence, vol. 20, no. 8, pp. 832–844, 1998.
[94] L. Breiman, “Bagging predictors,” Machine learning, vol. 24, no. 2,
pp. 123–140, 1996.
[95] D. Sáez-Trigueros, H. Hertlein, L. Meng, and M. Hartnett, “Shape
and texture combined face recognition for detection of forged id doc-
uments,” in Information and Communication Technology, Electronics
and Microelectronics (MIPRO), 2016 39th International Convention
on, pp. 1343–1348, IEEE, 2016.
[96] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, “Attribute
and simile classifiers for face verification,” in Computer Vision, 2009
IEEE 12th International Conference on, pp. 365–372, IEEE, 2009.
[97] T. Berg and P. N. Belhumeur, “Tom-vs-pete classifiers and identity-
preserving alignment for face verification.,” in BMVC, vol. 2, p. 7,
Citeseer, 2012.
[98] M. Guillaumin, J. Verbeek, and C. Schmid, “Is that you? metric
learning approaches for face identification,” in Computer Vision, 2009
IEEE 12th international conference on, pp. 498–505, IEEE, 2009.
[99] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the
gap to human-level performance in face verification,” in Proceedings
of the IEEE conference on computer vision and pattern recognition,
pp. 1701–1708, 2014.
[100] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric
discriminatively, with application to face verification,” in Computer
Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer
Society Conference on, vol. 1, pp. 539–546, IEEE, 2005.
[101] H. Fan, Z. Cao, Y. Jiang, Q. Yin, and C. Doudou, “Learning deep face
representation,” arXiv preprint arXiv:1403.2802, 2014.
[102] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified
embedding for face recognition and clustering,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, pp. 815–
823, 2015.
[103] S.-H. Lin, S.-Y. Kung, and L.-J. Lin, “Face recognition/detection by
probabilistic decision-based neural network,” IEEE transactions on
neural networks, vol. 8, no. 1, pp. 114–132, 1997.
[104] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, “Face recogni-
tion: A convolutional neural-network approach,” IEEE transactions on
neural networks, vol. 8, no. 1, pp. 98–113, 1997.
[105] T. Kohonen, “The self-organizing map,” Neurocomputing, vol. 21,
no. 1-3, pp. 1–6, 1998.
[106] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature
verification using a ‘siamese’ time delay neural network,” in Advances
in Neural Information Processing Systems, pp. 737–744, 1994.
[107] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural
information processing systems, pp. 1097–1105, 2012.
[108] K. Gregor and Y. LeCun, “Emergence of complex-like cells in a
temporal product network with local receptive fields,” arXiv preprint
arXiv:1006.0448, 2010.
[109] G. B. Huang, H. Lee, and E. Learned-Miller, “Learning hierarchical
representations for face verification with convolutional deep belief
networks,” in Computer Vision and Pattern Recognition (CVPR), 2012
IEEE Conference on, pp. 2518–2525, IEEE, 2012.
[110] E. Zhou, Z. Cao, and Q. Yin, “Naive-deep face recognition: Touching
the limit of lfw benchmark or not?,” arXiv preprint arXiv:1501.04690,
2015.
[111] A. Bansal, C. Castillo, R. Ranjan, and R. Chellappa, “The
do’s and don’ts for cnn-based face verification,” arXiv preprint
arXiv:1705.07426, vol. 5, 2017.
[112] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[113] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, A. Rabinovich, et al., “Going deeper with
convolutions,” CVPR, 2015.
[114] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, pp. 770–778, 2016.
[115] R. Ranjan, C. D. Castillo, and R. Chellappa, “L2-constrained
softmax loss for discriminative face verification,” arXiv preprint
arXiv:1703.09507, 2017.
[116] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep
hypersphere embedding for face recognition,” in The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2017.
[117] Y. Wu, H. Liu, J. Li, and Y. Fu, “Deep face recognition with center
invariant loss,” in Proceedings of the on Thematic Workshops of ACM
Multimedia 2017, pp. 408–414, ACM, 2017.
[118] A. Hasnat, J. Bohné, J. Milgram, S. Gentric, and L. Chen, “Deepvisage:
Making face recognition simple yet with powerful generalization
skills,” arXiv preprint arXiv:1703.08388, 2017.
[119] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, and W. Liu,
“Cosface: Large margin cosine loss for deep face recognition,” arXiv
preprint arXiv:1801.09414, 2018.
[120] F. Wang, W. Liu, H. Liu, and J. Cheng, “Additive margin softmax for
face verification,” arXiv preprint arXiv:1801.05599, 2018.
[121] J. Deng, J. Guo, and S. Zafeiriou, “Arcface: Additive angular margin
loss for deep face recognition,” arXiv preprint arXiv:1801.07698, 2018.
[122] Y. Yamada, M. Iwamura, and K. Kise, “Deep pyramidal resid-
ual networks with separated stochastic depth,” arXiv preprint
arXiv:1612.01230, 2016.
[123] X. Wu, R. He, and Z. Sun, “A lightened cnn for deep face representa-
tion,” in 2015 IEEE Conference on IEEE Computer Vision and Pattern
Recognition (CVPR), vol. 4, 2015.
[124] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face rep-
resentation by joint identification-verification,” in Advances in neural
information processing systems, pp. 1988–1996, 2014.
[125] Y. Sun, X. Wang, and X. Tang, “Deeply learned face representations are
sparse, selective, and robust,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 2892–2900, 2015.
[126] Y. Sun, D. Liang, X. Wang, and X. Tang, “Deepid3: Face recognition
with very deep neural networks,” arXiv preprint arXiv:1502.00873,
2015.
[127] J.-C. Chen, V. M. Patel, and R. Chellappa, “Unconstrained face
verification using deep cnn features,” in Applications of Computer
Vision (WACV), 2016 IEEE Winter Conference on, pp. 1–9, IEEE, 2016.
[128] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large
margin nearest neighbor classification,” Journal of Machine Learning
Research, vol. 10, no. Feb, pp. 207–244, 2009.
[129] S. Sankaranarayanan, A. Alavi, and R. Chellappa, “Triplet similarity
embedding for face verification,” arXiv preprint arXiv:1602.03418,
2016.
[130] S. Sankaranarayanan, A. Alavi, C. D. Castillo, and R. Chellappa,
“Triplet probabilistic embedding for face verification and clustering,”
in Biometrics Theory, Applications and Systems (BTAS), 2016 IEEE
8th International Conference on, pp. 1–8, IEEE, 2016.
[131] B. Kumar, G. Carneiro, I. Reid, et al., “Learning local image descriptors
with deep siamese and triplet convolutional networks by minimising
global loss functions,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 5385–5394, 2016.
[132] D. S. Trigueros, L. Meng, and M. Hartnett, “Enhancing convolutional
neural networks for face recognition with occlusion maps and batch
triplet loss,” Image and Vision Computing, 2018.
[133] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature
learning approach for deep face recognition,” in European Conference
on Computer Vision, pp. 499–515, Springer, 2016.
[134] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao, “Range loss for
deep face recognition with long-tail,” arXiv preprint arXiv:1611.08976,
2016.
[135] W. Liu, Y. Wen, Z. Yu, and M. Yang, “Large-margin softmax loss for
convolutional neural networks.,” in ICML, pp. 507–516, 2016.
[136] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,”
in Advances in neural information processing systems, pp. 2672–2680,
2014.
[137] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther,
“Autoencoding beyond pixels using a learned similarity metric,” arXiv
preprint arXiv:1512.09300, 2015.
[138] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez,
“Invertible conditional gans for image editing,” arXiv preprint
arXiv:1611.06355, 2016.
[139] A. Brock, T. Lim, J. M. Ritchie, and N. Weston, “Neural photo
editing with introspective adversarial networks,” arXiv preprint
arXiv:1609.07093, 2016.
[140] W. Shen and R. Liu, “Learning residual images for face attribute
manipulation,” in 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 1225–1233, IEEE, 2017.
[141] Y. Lu, Y.-W. Tai, and C.-K. Tang, “Conditional cyclegan for attribute
guided face image generation,” arXiv preprint arXiv:1705.09966, 2017.
[142] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan:
Unified generative adversarial networks for multi-domain image-to-
image translation,” arXiv preprint arXiv:1711.09020, 2017.
[143] W. Yin, Y. Fu, L. Sigal, and X. Xue, “Semi-latent gan: Learning to
generate and modify facial images from attributes,” arXiv preprint
arXiv:1704.02166, 2017.
[144] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and
D. Samaras, “Neural face editing with intrinsic image disentangling,”
in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE
Conference on, pp. 5444–5453, IEEE, 2017.
[145] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, et al.,
“Fader networks: Manipulating images by sliding attributes,” in Ad-
vances in Neural Information Processing Systems, pp. 5969–5978,
2017.
[146] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “Arbitrary fa-
cial attribute editing: Only change what you want,” arXiv preprint
arXiv:1711.10678, 2017.
[147] Y. Zhou and B. E. Shi, “Photorealistic facial expression synthesis
by the conditional difference adversarial autoencoder,” arXiv preprint
arXiv:1708.09126, 2017.
[148] H. Ding, K. Sricharan, and R. Chellappa, “Exprgan: Facial expres-
sion editing with controllable expression intensity,” arXiv preprint
arXiv:1709.03842, 2017.
[149] C. Donahue, A. Balsubramani, J. McAuley, and Z. C. Lipton, “Se-
mantically decomposing the latent spaces of generative adversarial
networks,” arXiv preprint arXiv:1705.07904, 2017.
[150] R. Huang, S. Zhang, T. Li, R. He, et al., “Beyond face rotation: Global
and local perception gan for photorealistic and identity preserving
frontal view synthesis,” arXiv preprint arXiv:1704.04086, 2017.
[151] L. Tran, X. Yin, and X. Liu, “Representation learning by rotating your
faces,” arXiv preprint arXiv:1705.11136, 2017.
[152] G. Antipov, M. Baccouche, and J.-L. Dugelay, “Face aging
with conditional generative adversarial networks,” arXiv preprint
arXiv:1702.01983, 2017.
[153] Z. Zhang, Y. Song, and H. Qi, “Age progression/regression by condi-
tional adversarial autoencoder,” in The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), vol. 2, 2017.
[154] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convo-
lutional neural networks for mobile vision applications,” arXiv preprint
arXiv:1704.04861, 2017.
[155] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen,
“Inverted residuals and linear bottlenecks: Mobile networks for classifi-
cation, detection and segmentation,” arXiv preprint arXiv:1801.04381,
2018.
[156] S. Chen, Y. Liu, X. Gao, and Z. Han, “Mobilefacenets: Efficient
cnns for accurate real-time face verification on mobile devices,” arXiv
preprint arXiv:1804.07573, 2018.