Face Recognition: From Traditional to Deep
Learning Methods
Daniel Sáez Trigueros, Li Meng
School of Engineering and Technology
University of Hertfordshire
Hatfield AL10 9AB, UK
Margaret Hartnett
GBG plc
London E14 9QD, UK
Abstract—Starting in the seventies, face recognition has be-
come one of the most researched topics in computer vision and
biometrics. Traditional methods based on hand-crafted features
and traditional machine learning techniques have recently been
superseded by deep neural networks trained with very large
datasets. In this paper we provide a comprehensive and up-
to-date literature review of popular face recognition methods
including both traditional (geometry-based, holistic, feature-
based and hybrid methods) and deep learning methods.
I. INTRODUCTION
Face recognition refers to the technology capable of iden-
tifying or verifying the identity of subjects in images or
videos. The first face recognition algorithms were developed
in the early seventies [1], [2]. Since then, their accuracy
has improved to the point that nowadays face recognition
is often preferred over other biometric modalities that have
traditionally been considered more robust, such as fingerprint
or iris recognition [3]. One of the differential factors that
make face recognition more appealing than other biometric
modalities is its non-intrusive nature. For example, fingerprint
recognition requires users to place a finger in a sensor, iris
recognition requires users to get significantly close to a cam-
era, and speaker recognition requires users to speak out loud.
In contrast, modern face recognition systems only require users
to be within the field of view of a camera (provided that they
are within a reasonable distance from the camera). This makes
face recognition the most user friendly biometric modality. It
also means that the range of potential applications of face
recognition is wider, as it can be deployed in environments
where the users are not expected to cooperate with the system,
such as in surveillance systems. Other common applications
of face recognition include access control, fraud detection,
identity verification and social media.
Face recognition is one of the most challenging biometric
modalities when deployed in unconstrained environments due
to the high variability that face images present in the real
world (these types of face images are commonly referred to
as faces in-the-wild). Some of these variations include head
poses, aging, occlusions, illumination conditions, and facial
expressions. Examples of these are shown in Figure 1.
Face recognition techniques have shifted significantly over
the years. Traditional methods relied on hand-crafted features,
such as edges and texture descriptors, combined with machine
learning techniques, such as principal component analysis,
linear discriminant analysis or support vector machines.

Fig. 1: Typical variations found in faces in-the-wild. (a) Head
pose. (b) Age. (c) Illumination. (d) Facial expression. (e)
Occlusion.

The
difficulty of engineering features that were robust to the
different variations encountered in unconstrained environments
made researchers focus on specialised methods for each
type of variation, e.g. age-invariant methods [4], [5], pose-
invariant methods [6], illumination-invariant methods [7], [8],
etc. Recently, traditional face recognition methods have been
superseded by deep learning methods based on convolutional
neural networks (CNNs). The main advantage of deep learning
methods is that they can be trained with very large datasets to
learn the best features to represent the data. The availability
of faces in-the-wild on the web has allowed the collection
of large-scale datasets of faces [9], [10], [11], [12], [13],
[14], [15] containing real-world variations. CNN-based face
recognition methods trained with these datasets have achieved
very high accuracy as they are able to learn features that
are robust to the real-world variations present in the face
images used during training. Moreover, the rise in popularity
of deep learning methods for computer vision has accelerated
face recognition research, as CNNs are being used to solve
many other computer vision tasks, such as object detection
and recognition, segmentation, optical character recognition,
facial expression analysis, age estimation, etc.

Fig. 2: Face recognition building blocks.
Face recognition systems are usually composed of the
following building blocks:
1) Face detection. A face detector finds the position of the
faces in an image (if any) and returns the coordinates of
a bounding box for each one of them. This is illustrated
in Figure 3a.
2) Face alignment. The goal of face alignment is to scale
and crop face images in the same way using a set of
reference points located at fixed locations in the image.
This process typically requires finding a set of facial
landmarks using a landmark detector and, in the case of a
simple 2D alignment, finding the best affine transforma-
tion that fits the reference points. Figures 3b and 3c show
two face images aligned using the same set of reference
points. More complex 3D alignment algorithms (e.g.
[16]) can also achieve face frontalisation, i.e. changing
the pose of a face to frontal.
3) Face representation. At the face representation stage,
the pixel values of a face image are transformed into a
compact and discriminative feature vector, also known
as a template. Ideally, all the faces of the same subject
should map to similar feature vectors.
4) Face matching. In the face matching building block,
two templates are compared to produce a similarity score
that indicates the likelihood that they belong to the same
subject.
Face representation is arguably the most important compo-
nent of a face recognition system and the focus of the literature
review in Section II.
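As an illustration of how these building blocks interact, a minimal Python sketch of the pipeline is given below. The detector, aligner and feature extractor are placeholder callables standing in for any concrete method; they are not components prescribed by any specific system discussed in this paper.

```python
import numpy as np

def recognition_pipeline(image_a, image_b, detector, aligner, extractor):
    """Compare the first face found in each of two images and return a similarity score.

    `detector`, `aligner` and `extractor` are hypothetical callables standing in for
    any face detector, face alignment method and face representation model."""
    templates = []
    for image in (image_a, image_b):
        boxes = detector(image)              # 1) face detection: list of bounding boxes
        face = aligner(image, boxes[0])      # 2) face alignment: cropped, aligned face
        template = extractor(face)           # 3) face representation: feature vector (template)
        templates.append(template / np.linalg.norm(template))
    # 4) face matching: cosine similarity between the two templates
    return float(np.dot(templates[0], templates[1]))
```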
Fig. 3: (a) Bounding boxes found by a face detector. (b) and
(c) Aligned faces and reference points.
II. LITERATURE REVIEW
Early research on face recognition focused on methods that
used image processing techniques to match simple features de-
scribing the geometry of the faces. Even though these methods
only worked under very constrained settings, they showed that
it is possible to use computers to automatically recognise faces.
After that, statistical subspace methods such as principal
component analysis (PCA) and linear discriminant analysis
(LDA) gained popularity. These methods are referred to as
holistic since they use the entire face region as an input. At
the same time, progress in other computer vision domains led
to the development of local feature extractors that are able to
describe the texture of an image at different locations. Feature-
based approaches to face recognition consist of matching these
local features across face images. Holistic and feature-based
methods were further developed and combined into hybrid
methods. Face recognition systems based on hybrid methods
remained the state-of-the-art until recently, when deep learning
emerged as the leading approach to most computer vision
applications, including face recognition. The rest of this paper
provides a summary of some of the most representative re-
search works on each of the aforementioned types of methods.
A. Geometry-based Methods
Kelly’s [1] and Kanade’s [2] PhD theses in the early
seventies are considered the first research works on automatic
face recognition. They proposed the use of specialised edge
and contour detectors to find the location of a set of facial land-
marks and to measure relative positions and distances between
them. The accuracy of these early systems was demonstrated
on very small databases of faces (a database of 10 subjects was
used in [1] and a database of 20 subjects was used in [2]). In
[17], a geometry-based method similar to [2] was compared
with a method that represents face images as gradient images.
The authors showed that comparing gradient images provided
better recognition accuracy than comparing geometry-based
features. However, the geometry-based method was faster and
needed less memory. The feasibility of using facial landmarks
and their geometry for face recognition was thoroughly stud-
ied in [18]. Specifically, they proposed a method based on
measuring the Procrustes distance [19] between two sets of
facial landmarks and a method based on measuring ratios
of distances between facial landmarks. The authors argued
that even though other methods that extract more information
from the face (e.g. holistic methods) could achieve greater
recognition accuracy, the proposed geometry-based methods
were faster and could be used in combination with other
methods to develop hybrid methods.

Fig. 4: Top 5 eigenfaces computed using the ORL database of
faces [31], sorted from most variance (left) to least variance
(right).

Geometry-based methods
have proven more effective in 3D face recognition thanks to
the depth information encoded in 3D landmarks [20], [21].
Geometry-based methods were crucial during the early days
of face recognition research. They can be used as a fast
alternative to (or in combination with) the more advanced
methods described in the rest of this review.
B. Holistic Methods
Holistic methods represent faces using the entire face re-
gion. Many of these methods work by projecting face images
onto a low-dimensional space that discards superfluous details
and variations not needed for the recognition task. One of the
most popular approaches in this category is based on PCA.
The idea, first proposed in [22], [23], is to apply PCA to a
set of training face images in order to find the eigenvectors
that account for the most variance in the data distribution. In
this context, the eigenvectors are typically called eigenfaces
due to their resemblance to real faces, as shown in Figure 4.
New faces can be projected onto the subspace spanned by the
eigenfaces to obtain the weights of the linear combination of
eigenfaces needed to reconstruct them. This idea was used
in [24] to identify faces by comparing the weights of new
faces to the weights of faces in a gallery set. A probabilistic
version of this approach based on a Bayesian analysis of image
differences was proposed in [25]. In this method, two sets
of eigenfaces were used to model intra-personal and inter-
personal variations separately. Many other variations of the
original eigenfaces method have been proposed. For example,
a nonlinear extension of PCA based on kernel methods,
namely kernel PCA [26], was proposed in [27]; independent
component analysis (ICA) [28], a generalisation of PCA that
can capture high-order dependencies between pixels, was
proposed in [29]; and a two-dimensional PCA based on 2D
image matrices instead of 1D vectors was proposed in [30].
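The eigenfaces computation can be summarised with the following minimal NumPy sketch; dataset loading and gallery/probe handling are assumed, and variable names are illustrative rather than taken from the cited works.

```python
import numpy as np

def compute_eigenfaces(train_faces, num_components=5):
    """train_faces: (n_images, n_pixels) matrix of vectorised face images."""
    mean_face = train_faces.mean(axis=0)
    centred = train_faces - mean_face
    # Principal directions of the training data via SVD (rows of vt are eigenfaces)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return mean_face, vt[:num_components]

def project(face, mean_face, eigenfaces):
    """Weights of the linear combination of eigenfaces that reconstructs one face."""
    return eigenfaces @ (face - mean_face)

# Identification as in [24]: compare the weights of a probe face to gallery weights,
# e.g. with the Euclidean distance (gallery_faces and probe are assumed to be given):
# mean_face, eigenfaces = compute_eigenfaces(gallery_faces)
# gallery_w = np.array([project(f, mean_face, eigenfaces) for f in gallery_faces])
# probe_w = project(probe, mean_face, eigenfaces)
# best_match = np.argmin(np.linalg.norm(gallery_w - probe_w, axis=1))
```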
One issue with PCA-based approaches is that the projection
maximises the variance across all the images in the training
set. This implies that the top eigenvectors might have a
negative impact on the recognition accuracy since they might
correspond to intra-personal variations that are not relevant
for the recognition task (e.g. illumination, pose or expression).
Holistic methods based on linear discriminant analysis (LDA),
also called Fisher discriminant analysis, [32] have been pro-
posed to solve this issue [33], [34], [35], [36]. The main idea
behind LDA is to use the class labels to find a projection
matrix $W$ that maximises the variance between classes while
minimising the variance within classes:

$$W = \arg\max_{W} \frac{|W^T S_b W|}{|W^T S_w W|} \quad (1)$$

where $S_w$ and $S_b$ are the within-class and between-class
scatter matrices defined as follows:

$$S_w = \sum_{k}^{K} \sum_{x_j \in C_k} (x_j - \mu_k)(x_j - \mu_k)^T \quad (2)$$

$$S_b = \sum_{k}^{K} (\mu - \mu_k)(\mu - \mu_k)^T \quad (3)$$

where $x_j$ represents a data sample, $\mu_k$ is the mean of class $C_k$,
$\mu$ is the overall mean and $K$ is the number of classes in the
dataset. The solution to Equation 1 can be found by computing
the eigenvectors of the separation matrix $S = S_w^{-1} S_b$. Similar
to PCA, LDA can be used for dimensionality reduction by
selecting a subset of eigenvectors corresponding to the largest
eigenvalues. Even though LDA is considered a more suitable
technique for face recognition than PCA, pure LDA-based
methods are prone to overfitting when the within-class scatter
matrix $S_w$ is not correctly estimated [35], [36]. This happens
when the input data is high-dimensional and not many samples
per class are available during training. In the extreme case, $S_w$
becomes singular and $W$ cannot be computed [33]. For this
reason, it is common to reduce the dimensionality of the data
with PCA before applying LDA [33], [35], [36]. LDA has also
been extended to the nonlinear case using kernels [37], [38]
and to probabilistic LDA [39].
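A direct NumPy sketch of Equations 1-3 is given below. As discussed above, in practice PCA would be applied first to avoid a singular $S_w$; that preprocessing is omitted here for brevity, and a pseudo-inverse is used instead.

```python
import numpy as np

def fisher_lda(X, y, num_components):
    """X: (n_samples, n_dims) data, y: integer class labels. Returns projection matrix W."""
    overall_mean = X.mean(axis=0)
    n_dims = X.shape[1]
    Sw = np.zeros((n_dims, n_dims))  # within-class scatter, Equation 2
    Sb = np.zeros((n_dims, n_dims))  # between-class scatter, Equation 3
    for k in np.unique(y):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        Sw += (Xk - mu_k).T @ (Xk - mu_k)
        diff = (overall_mean - mu_k)[:, None]
        Sb += diff @ diff.T
    # Solve Equation 1 via the eigenvectors of S = Sw^{-1} Sb (largest eigenvalues first)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-eigvals.real)
    return eigvecs[:, order[:num_components]].real  # columns are the projection directions
```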
Support vector machines (SVMs) have also been used as
holistic methods for face recognition. In [40], the task was
formulated as a two-class problem by training an SVM with
image differences. More specifically, the two classes are the
within-class difference set, which contains all the differences
between images of the same class, and the between-class
difference set, which contains all the differences between
images of distinct classes (this formulation is similar to the
probabilistic PCA proposed in [25]). In addition, [40] modified
the traditional SVM formulation by adding a parameter to
control the operating point of the system. In [41], a separate
SVM was trained for each class. The authors experimented
with SVMs trained with PCA projections and with LDA pro-
jections. It was found that this SVM approach only gives better
performance compared with simple Euclidean distance when
trained with PCA projections, since LDA already encodes the
discriminant information needed to recognise faces.
An approach related to PCA and LDA is the locality
preserving projections (LPP) method proposed in [42]. While
PCA and LDA preserve the global structure of the image
space (maximising variance and discriminant information re-
spectively), LPP aims to preserve the local structure of the
image space. This means that the projection learnt by LPP
maps images with similar local information to neighbouring
points in the LPP subspace. For example, two images of the
same person with open and closed mouth would be mapped
to similar points using LPP, but not necessarily with PCA
or LDA. This approach was shown to be superior to PCA
and LDA on multiple datasets. Further improvements were
achieved in [43] by making the LPP basis vectors orthogonal.
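The following NumPy/SciPy sketch illustrates the LPP projection, assuming the data has already been reduced with PCA so that the generalised eigenproblem is well conditioned; the number of neighbours and the heat-kernel width are illustrative parameters, not values from [42].

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lpp(X, num_components=10, n_neighbours=5, t=1.0):
    """X: (n_samples, n_dims) data (assumed already PCA-reduced).
    Returns a projection matrix of shape (n_dims, num_components)."""
    dists = cdist(X, X)
    # Adjacency graph with heat-kernel weights between each sample and its nearest neighbours
    W = np.zeros_like(dists)
    for i in range(len(X)):
        neighbours = np.argsort(dists[i])[1:n_neighbours + 1]
        W[i, neighbours] = np.exp(-dists[i, neighbours] ** 2 / t)
    W = np.maximum(W, W.T)                      # make the graph symmetric
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    # Generalised eigenproblem X^T L X a = lambda X^T D X a; smallest eigenvalues preserve locality
    eigvals, eigvecs = eigh(X.T @ L @ X, X.T @ D @ X)
    return eigvecs[:, :num_components]
```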
Another popular family of holistic methods is based on
sparse representation of faces. The idea, first proposed in
[44] as sparse representation-based classification (SRC), is to
represent faces using a linear combination of training images:
$$y = A x_0 \quad (4)$$

where $y$ is a test image, $A$ is a matrix containing all the
training images and $x_0$ is a vector of sparse coefficients. By
enforcing sparsity in the representation, most of the nonzero
coefficients belong to training images from the correct class.
At test time, the coefficients belonging to each class are
used to reconstruct the image, and the class that achieves the
lowest reconstruction error is considered the correct one. The
robustness of this approach to image corruptions like noise or
occlusions can be increased by adding a term of sparse error
coefficients $e_0$ to the linear combination:

$$y = A x_0 + e_0 \quad (5)$$

where the nonzero entries of $e_0$ correspond to corrupted pixels.
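A minimal sketch of SRC following Equation 4 is shown below, using scikit-learn's Lasso as a stand-in for the L1 solver; the solver choice and the regularisation strength are illustrative and not those used in [44].

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(y, A, labels, alpha=0.01):
    """Sparse representation-based classification (Equation 4).

    y: test image as a 1D pixel vector, A: (n_pixels, n_train) matrix whose columns
    are vectorised training images, labels: class label of each training image."""
    # Sparse coefficients x0 such that y ~ A x0 (Lasso used here as the L1 solver)
    x0 = Lasso(alpha=alpha, max_iter=5000).fit(A, y).coef_
    errors = {}
    for c in np.unique(labels):
        xc = np.where(labels == c, x0, 0.0)      # keep only the coefficients of class c
        errors[c] = np.linalg.norm(y - A @ xc)   # per-class reconstruction error
    return min(errors, key=errors.get)           # class with the lowest reconstruction error
```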
Many variations of this approach have been proposed for
increased robustness and reduced computational complexity.
For example, the discriminative K-SVD algorithm was pro-
posed in [45] to select a more compact and discriminative set
of training images to reconstruct the images; in [46], SRC
was extended by using a Markov random field to model the
prior assumption about the spatial continuity of the occluded
regions; and in [47], it was proposed to weight each pixel in the
image independently to achieve better reconstructed images.
More recently, inspired by probabilistic PCA [25], the joint
Bayesian method [48] has been proposed. In this method,
instead of using image differences as in [25], a face image is
represented as the sum of two independent Gaussian variables
representing intra-personal and inter-personal variations. Using
this method, an accuracy of 92.4% was achieved on the
challenging Labeled Faces in the Wild (LFW) dataset [49].
This is the highest accuracy reported by a holistic method on
this dataset.
Holistic methods have been of paramount importance to
the development of real-world face recognition systems, as
evidenced by the large number of approaches proposed in the
literature. In the next subsection, a popular family of methods
that evolved as an alternative to holistic methods, namely
feature-based methods, is discussed.
C. Feature-based Methods
Feature-based methods refer to methods that leverage local
features extracted at different locations in a face image.
Unlike geometry-based methods, feature-based methods focus
on extracting discriminative features rather than computing
their geometry¹. Feature-based methods tend to be more robust
than holistic methods when dealing with faces presenting local
variations (e.g. facial expression or illumination).

¹Technically, geometry-based methods can be seen as a special case of
feature-based methods, since many feature-based methods also leverage the
geometry of the extracted features.

For example,
consider two face images of the same subject in which the only
difference between them is that the person’s eyes are closed in
one of them. In a feature-based method, only the coefficients
of the feature vectors that correspond to features extracted
around the eyes will differ between the two images. On the
other hand, in a holistic method, all the coefficients of the
feature vectors might differ. Moreover, many of the descriptors
used in feature-based methods are designed to be invariant to
different variations (e.g. scaling, rotation or translation).
One of the first feature-based methods was the modular
eigenfaces method proposed in [50], an extension of the
original eigenfaces technique. In this method, PCA was inde-
pendently applied to different local regions in the face image to
produce sets of eigenfeatures. Although [50] showed that both
eigenfeatures and eigenfaces can achieve the same accuracy,
the eigenfeatures approach provided better accuracy when only
a few eigenvectors were used.
A feature-based method that uses binary edge features was
proposed in [51]. Their main contribution was to improve the
Hausdorff distance that was used in [52] to compare binary
images. The Hausdorff distance measures proximity between
two sets of points by looking at the greatest distance from a
point in one set to the closest point in the other set. In the
modified Hausdorff distance proposed in [51], each point in
one set has to be near some point in the other set. It was
argued that this property makes the method more robust to
small, non-rigid local distortions. A variation of this method
proposed line edge maps (LEMs) for face representation [53].
LEMs provide a compact face representation since edges are
encoded as line segments, i.e. only the coordinates of the end
points are used. A line segment Hausdorff distance was also
proposed in this work to match LEMs. The proposed distance
discourages matching lines with different orientations, is
robust to line displacements, and incorporates a measure of the
difference between the number of lines found in two LEMs.
A very popular feature-based method was the elastic bunch
graph matching (EBGM) method [54], an extension of the
dynamic link architecture proposed in [55]. In this method,
a face is represented using a graph of nodes. The nodes
contain Gabor wavelet coefficients [56] extracted around a
set of predefined facial landmarks. During training, a face
bunch graph (FBG) model is created by stacking the manually
located nodes of each training image. When a test face image
is presented, a new graph is created and fitted to the facial
landmarks by searching for the most similar nodes in the
FBG model. Two images can be compared by measuring the
similarity between their graph nodes. A version of this method
that uses histograms of oriented gradients (HOG) [57], [58]
instead of Gabor wavelet features was proposed in [59]. This
method outperforms the original EBGM method thanks to
the increased robustness of HOG descriptors to changes in
illumination, rotation and small displacements.
With the development of local feature descriptors in other
computer vision applications [60], the popularity of feature-
based methods for face recognition increased.

Fig. 5: (a) Face image divided into 4×4 local regions. (b)
Histograms of LBP descriptors computed from each local
region.

In [61], his-
tograms of LBP descriptors were extracted from local regions
independently, as shown in Figure 5, and concatenated to
form a global feature vector. Additionally, they measured
the similarity between two feature vectors $\mathbf{a}$ and $\mathbf{b}$ using a
weighted Chi-square distance:

$$\chi^2(\mathbf{a}, \mathbf{b}) = \sum_i w_i \frac{(a_i - b_i)^2}{a_i + b_i} \quad (6)$$

where $w_i$ is a weight that controls the contribution of the $i$-th
coefficient of the feature vectors. As shown in [62], many
variations of this method have been proposed to improve face
recognition accuracy and to tackle other related tasks such
as face detection, facial expression analysis and demographic
classification. For example, LBP descriptors extracted from
Gabor feature maps, known as LGBP descriptors, were pro-
posed in [63], [64]; a rotation invariant LBP descriptor that
applies Fourier transform to LBP histograms was proposed in
[65]; and a variation of LBP called local derivative pattern
(LDP) was proposed in [66] to extract high-order local infor-
mation by encoding directional pattern features.
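The regional LBP histograms of [61] and the weighted Chi-square distance of Equation 6 can be sketched as follows, using scikit-image's LBP implementation; the grid size and the region weights are illustrative assumptions.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_regional_histograms(image, grid=(4, 4), n_points=8, radius=1):
    """Concatenate LBP histograms computed over a grid of local regions (as in [61])."""
    lbp = local_binary_pattern(image, n_points, radius, method="uniform")
    n_bins = n_points + 2                         # number of uniform LBP codes
    hists = []
    for row in np.array_split(lbp, grid[0], axis=0):
        for region in np.array_split(row, grid[1], axis=1):
            hist, _ = np.histogram(region, bins=n_bins, range=(0, n_bins), density=True)
            hists.append(hist)
    return np.concatenate(hists)

def weighted_chi_square(a, b, weights, eps=1e-10):
    """Weighted Chi-square distance between two feature vectors (Equation 6)."""
    return np.sum(weights * (a - b) ** 2 / (a + b + eps))
```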
Scale-invariant feature transform (SIFT) descriptors [67]
have also been extensively used for face recognition. Three
different methodologies for matching SIFT descriptors across
face images were proposed in [68]: (i) computing the distances
between all pairs of SIFT descriptors and using the minimum
distance as a similarity score; (ii) similar to (i) but SIFT
descriptors around the eyes and the mouth are compared inde-
pendently, and the average of the two minimum distances is
used as a similarity score; and (iii) computing SIFT descriptors
over a regular grid and using the average distance between
the corresponding pairs of descriptors as a similarity score.
The best recognition accuracy was obtained using the third
method. A related method [69] proposed the use of speeded
up robust features (SURF) [70] instead of SIFT. In this
work, the authors observed that dense feature extraction over a
regular grid provides the best results. In [71], two variations of
SIFT were proposed, namely, the volume-SIFT which removes
unreliable keypoints based on their scale, and the partial-
descriptor-SIFT which finds keypoints at large scales and
near face boundaries. Compared to the original SIFT, both
approaches were shown to improve face recognition accuracy.
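As an illustration, matching strategy (i) of [68] can be sketched with OpenCV's SIFT implementation as follows; it is assumed that at least one keypoint is detected in each grayscale face image.

```python
import cv2
import numpy as np

def min_sift_distance(image_a, image_b):
    """Similarity score based on the minimum distance between any pair of SIFT
    descriptors extracted from two grayscale face images (strategy (i) of [68])."""
    sift = cv2.SIFT_create()
    _, desc_a = sift.detectAndCompute(image_a, None)
    _, desc_b = sift.detectAndCompute(image_b, None)
    # Pairwise Euclidean distances between all descriptor pairs
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    return -float(dists.min())   # negate so that a larger value means more similar
```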
Some feature-based methods have focused on learning local
features from training samples. For example, in [72], unsu-
pervised learning techniques (K-means [73], PCA tree [74]
and random-projection tree [74]) were used to encode local
microstructures of faces into a set of discrete codes. The
discrete codes were then grouped into histograms at different
facial regions. The final local descriptors were computed by
applying PCA to each histogram. A learning-based descriptor
with similarities to LBP was proposed in [75]. Specifically,
this descriptor consists of a differential pattern generated by
subtracting the centre pixel of a local 3×3 region from its
neighbouring pixels, and a Gaussian mixture model trained
to compute high-order statistics of the differential pattern.
Another LBP-like descriptor that has a learning stage
was proposed in [76]. In this work, LDA was used to (i)
learn a filter that when applied to an image enhances the
discriminative ability of the differential patterns, and (ii) learn
a set of weights that are assigned to the neighbouring pixels
within each local region to reflect their contribution to the
differential pattern.
Feature-based methods have been shown to provide more
robustness to different types of variations than holistic meth-
ods. However, some of the advantages of holistic methods are
lost (e.g. discarding non-discriminant information and more
compact representations). Hybrid methods that combine both
of these approaches are discussed next.
D. Hybrid Methods
Hybrid methods combine techniques from holistic and
feature-based methods. Before deep learning became
widespread, most state-of-the-art face recognition systems
were based on hybrid methods. Some hybrid methods simply
combine two different techniques without any interaction
between them. For example, in the modular eigenfaces
work [50] covered earlier, the authors experimented with
a combined representation using both eigenfaces and
eigenfeatures and achieved better accuracy than using either
of these two methods alone. However, the most popular
hybrid approach is to extract local features (e.g. LBP, SIFT)
and project them onto a lower-dimensional and discriminative
subspace (e.g. using PCA or LDA) as shown in Figure 6.
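A minimal sketch of this hybrid pipeline is given below, using the regional LBP histograms from the earlier sketch as local features and scikit-learn's PCA and LDA as the subspace methods; the number of PCA dimensions is an illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_hybrid_subspace(train_features, train_labels, pca_dims=200):
    """train_features: (n_images, n_dims) local descriptors (e.g. regional LBP
    histograms); returns the fitted PCA and LDA transforms."""
    pca = PCA(n_components=pca_dims).fit(train_features)
    lda = LinearDiscriminantAnalysis().fit(pca.transform(train_features), train_labels)
    return pca, lda

def hybrid_template(features, pca, lda):
    """Project the local features of one face onto the discriminative PCA+LDA subspace."""
    return lda.transform(pca.transform(features.reshape(1, -1)))[0]
```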
Several hybrid methods that use Gabor wavelet features
combined with different subspace methods have been pro-
posed [77], [78], [79]. In these methods, Gabor kernels of
different orientations and scales are convolved with an image
and their outputs are concatenated into a feature vector. The
feature vector is then downsampled to reduce its dimensional-
ity. In [77], the feature vector was further processed using the
enhanced linear discriminant model proposed in [80]. PCA
followed by ICA were applied to the downsampled feature
vector in [78], and the probabilistic reasoning model from
[80] was used to classify whether two images belong to the
same subject. In [79], kernel PCA with polynomial kernels was
applied to the feature vector to encode high-order statistics.

Fig. 6: Typical hybrid face representation.

All these hybrid methods were shown to provide better accuracy
than using Gabor wavelet features alone.
LBP descriptors have been a key component in many
hybrid methods. In [81], an image was divided into non-
overlapping regions and LBP descriptors were extracted at
multiple resolutions. The LBP coefficients at each region
were concatenated into regional feature vectors and projected
onto PCA+LDA subspaces. This approach was extended to
colour images in [82]. Laplacian PCA, an extension of PCA,
was shown to outperform standard PCA and kernel PCA
when applied to LBP descriptors in [83]. Two novel patch
versions of LBP, namely three-patch LBP (TPLBP) and four-
patch LBP (FPLBP), were combined with LDA and SVMs in
[84]. The proposed TPLBP and FPLBP descriptors can boost
face recognition accuracy by encoding similarities between
neighbouring patches of pixels. More recently, [85] proposed
a high-dimensional face representation by densely extracting
multi-scale LBP (MLBP) descriptors around facial landmarks.
The high-dimensional feature vector (100K-dim) was reduced
to 400 dimensions by PCA and a final discriminative feature
vector was learnt using joint Bayesian. In their experiments,
[85] showed that extracting high-dimensional features can
increase face recognition accuracy by 6-7% when going
from 1K to 100K dimensions. The main drawback of this
approach is the high computational costs needed to perform a
dimensionality reduction of such magnitude. For this reason,
they proposed to approximate the PCA and joint Bayesian
transformations with a sparse linear projection matrix $B$ by
solving the following optimisation problem:

$$\min_{B} \|Y - B^T X\|_2^2 + \lambda \|B\|_1 \quad (7)$$

where the first term is a reconstruction error between the ma-
trix $X$ of high-dimensional feature vectors and the matrix $Y$
of projected low-dimensional feature vectors; the second term
enforces sparsity in the projection matrix $B$; and $\lambda$ balances
the contribution of each term. Another recent method proposed
a multi-task learning approach based on a discriminative
Gaussian process latent variable model, named GaussianFace
[86]. This method extended the Gaussian process approach
proposed in [87] and incorporated a computationally more
efficient version of kernel LDA to learn a face representation
from LBP descriptors that can exploit data from multiple
source domains. Using this method, an accuracy of 98.52%
was achieved on the LFW dataset. This is competitive with
the accuracy achieved by many deep learning methods.
Some hybrid methods have proposed to use a combination
of different local features. For example, Gabor wavelet and
LBP features were used in [88]. The authors argued that
these two types of features capture complementary informa-
tion. While LBP descriptors capture small appearance details,
Gabor wavelet features encode facial shape over a broader
range of scales. PCA was applied independently to the feature
vectors containing the Gabor wavelet coefficients and the
LBP coefficients to reduce their dimensionality. The final face
representation was obtained by concatenating the two PCA-
transformed feature vectors and applying a subspace method
similar to kernel LDA, namely kernel discriminative common
vector [89]. Another method that uses Gabor wavelet and
LBP features was proposed in [90]. In this method, faces
were represented by applying PCA+LDA to regions containing
histograms of LGBP descriptors [64]. A multi-feature system
was proposed in [8] to tackle face recognition under difficult
illumination conditions. Three contributions were made in this
work: (i) a preprocessing pipeline that reduces the effect of
illumination variation; (ii) an extension of LBP, called local
ternary patterns (LTP), which is more discriminant and less
sensitive to noise in uniform regions; and (iii) an architecture
that combines sets of Gabor wavelet and LBP/LTP features
followed by kernel LDA, score normalisation and score fusion.
A related method [91] proposed a novel descriptor robust to
blur that extends local phase quantization (LPQ) descriptors
[92] to multiple scales (MLPQ). In addition, a kernel fusion
technique was used to combine MLPQ descriptors with MLBP
descriptors in the kernel LDA framework. In [5], an age
invariant face recognition system was proposed based on dense
extraction of SIFT and multi-scale LBP descriptors combined
with a novel multi-feature discriminant analysis (MFDA). The
MFDA technique uses random subspace sampling [93] to
construct multiple lower-dimensional feature subspaces, and
bagging [94] to select subsets of training samples for LDA
that contain inter-class pairs near the classification boundary to
increase the discriminative ability of the representation. Dense
SIFT descriptors were also used in [95] as texture features,
and combined with shape features in the form of relative
distances between pairs of facial landmarks. This combination
of shape and texture features was further processed using
multiple PCA+LDA transformations.
To conclude this subsection, other types of hybrid methods
that do not follow the pipeline described in Figure 6 are
reviewed. In [96], low-level local features (image intensities in
RGB and HSV colour spaces, edge magnitudes, and gradient
directions) were used to compute high-level visual features
by training attribute and simile binary SVM classifiers. The
attribute classifiers detect describable attributes of faces such
as gender, race and age. On the other hand, the simile
classifiers detect non-describable attributes by measuring the
similarity of different parts of a face to a limited set of
reference subjects. To compare two images, the outputs of all
the attribute and simile classifiers for both images are fed to
an SVM classifier. A method similar to the simile classifiers
from [96] was proposed in [97]. The main differences are
that [97] used a large number of simple one-vs-one classifiers
instead of the more complex one-vs-all classifiers used in [96],
and that SIFT descriptors were used as the low-level features.
Two metric learning approaches for face identification were
proposed in [98]. The first one, called logistic discriminant
metric learning (LDML) is based on the idea that the distance
between positive pairs (belonging to the same subject) should
be smaller than the distance between negative pairs (belonging
to different subjects). The second one, called marginalised
kNN (MkNN), uses a k-nearest neighbour classifier to find
how many positive neighbour pairs can be formed from the
neighbours of the two compared vectors. Both methods were
trained on pairs of vectors of SIFT descriptors computed at
fixed points on the face (corners of the mouth, eyes and nose).
Hybrid methods offer the best of holistic and feature-based
methods. Their main limitation is the choice of good features
that can fully extract the information needed to recognise a
face. Some approaches have tried to overcome this issue by
combining different types of features whereas others have
introduced a learning stage to improve the discriminative
ability of the features. Deep learning methods, discussed next,
take these ideas further by training end-to-end systems that
can learn a large number of features that are optimal for the
recognition task.
E. Deep Learning Methods
Convolutional neural networks (CNNs) are the most com-
mon type of deep learning method for face recognition. The
main advantage of deep learning methods is that they can be
trained with large amounts of data to learn a face represen-
tation that is robust to the variations present in the training
data. In this way, instead of designing specialised features
that are robust to different types of intra-class variations (e.g.
illumination, pose, facial expression, age, etc.), CNNs can
learn them from training data. The main drawback of deep
learning methods is that they need to be trained with very
large datasets that contain enough variations to generalise to
unseen samples. Fortunately, several large-scale face datasets
containing in-the-wild face images have recently been released
into the public domain [9], [10], [11], [12], [13], [14], [15]
to train CNN models. Apart from learning discriminative
features, neural networks can reduce dimensionality and be
trained as classifiers or using metric learning approaches.
CNNs are considered end-to-end trainable systems that do not
need to be combined with any other specific methods.
CNN models for face recognition can be trained using
different approaches. One of them consists of treating the
problem as a classification one, wherein each subject in the
training set corresponds to a class. After training, the model
can be used to recognise subjects that are not present in the
training set by discarding the classification layer and using the
features of the previous layer as the face representation [99].
In the deep learning literature, these features are commonly
referred to as bottleneck features. Following this first training
stage, the model can be further trained using other techniques
to optimise the bottleneck features for the target application
(e.g. using joint Bayesian [9] or fine-tuning the CNN model
with a different loss function [10]). Another common approach
to learning face representation is to directly learn bottleneck
features by optimising a distance metric between pairs of faces
[100], [101] or triplets of faces [102].
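The classification-based approach can be sketched in PyTorch as follows; the architecture is a toy example for illustration only and does not correspond to any of the networks discussed in this section.

```python
import torch
import torch.nn as nn

class FaceCNN(nn.Module):
    """Toy CNN with a bottleneck layer followed by a classification layer."""
    def __init__(self, num_subjects, embedding_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embedding_dim),           # bottleneck features
        )
        self.classifier = nn.Linear(embedding_dim, num_subjects)

    def forward(self, x):
        return self.classifier(self.features(x))   # logits for the softmax loss

    def embed(self, x):
        return self.features(x)                     # face representation used at test time

# Training uses the softmax loss (cross-entropy over subject labels):
#   loss = nn.CrossEntropyLoss()(model(images), subject_ids)
# At test time the classification layer is discarded and templates extracted with
# model.embed(x) are compared with, e.g., cosine similarity.
```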
The idea of using neural networks for face recognition is
not new. An early method based on a probabilistic decision-
based neural network (PDBNN) [103] was proposed in 1997
for face detection, eye localisation and face recognition. The
face recognition PDBNN was divided into one fully-connected
subnet per training subject to reduce the number of hidden
units and avoid overfitting. Two PDBNNs were trained using
intensity and edge features respectively and their outputs
were combined to give a final classification decision. Another
early method [104] proposed to use a combination of a self-
organising map (SOM) and a convolutional neural network. A
self-organising map [105] is a type of neural network trained in
an unsupervised way that projects the input data onto a lower-
dimensional space that preserves the topological properties of
the input space (i.e. inputs that are nearby in the original
space are also nearby in the output space). Note that neither
of these two early methods was trained end-to-end (edge
features were used in [103] and a SOM in [104]), and that
the proposed neural network architectures were shallow. An
end-to-end face recognition CNN was proposed in [100]. This
method used a siamese architecture trained with a contrastive
loss function [106]. The contrastive loss implements a metric
learning procedure that aims to minimise the distance between
pairs of feature vectors corresponding to the same subject
while maximising the distance between pairs of feature vectors
corresponding to different subjects. The CNN architecture
used in this method was also shallow and was trained with
small datasets.
None of the methods mentioned above achieved ground-
breaking results, mainly due to the low capacity of the
networks used and the relatively small datasets available for
training at the time. It was not until these models were scaled
up and trained with large amounts of data [107] that the first
deep learning methods for face recognition [99], [9] became
the state-of-the-art. In particular, Facebook’s DeepFace [99],
one of the first CNN-based approaches for face recognition
that used a high capacity model, achieved an accuracy of
97.35% on the LFW benchmark, reducing the error of the
previous state-of-the-art by 27%. The authors trained a CNN
with softmax loss² using a dataset containing 4.4 million faces
from 4,030 subjects. Two novel contributions were made in
this work: (i) an effective facial alignment system based on
explicit 3D modelling of faces, and (ii) a CNN architecture
containing locally connected layers [108], [109] that (unlike
regular convolutional layers) can learn different features from
each region in an image. Concurrently, the DeepID system
[9] achieved similar results by training 60 different CNNs
on patches comprising ten regions, three scales and RGB or
grey channels. During testing, 160 bottleneck features were ex-
tracted from each patch and its horizontally flipped counterpart
²We refer to softmax loss as the combination of the softmax activation
function and the cross-entropy loss used to train classifiers.
TABLE I: Public large-scale face datasets.
Dataset                Images     Subjects   Images per subject
CelebFaces+ [9]        202,599    10,177     19.9
UMDFaces [14]          367,920    8,501      43.3
CASIA-WebFace [10]     494,414    10,575     46.8
VGGFace [11]           2.6M       2,622      1,000
VGGFace2 [15]          3.31M      9,131      362.6
MegaFace [13]          4.7M       672,057    7
MS-Celeb-1M [12]       10M        100,000    100
to form a 19,200-dimensional feature vector (160 × 2 × 60).
Similar to [99], the proposed CNN architecture also used
locally connected layers. The verification result was obtained
by training a joint Bayesian classifier [48] on the 19,200-
dimensional feature vectors extracted by the CNNs. The sys-
tem was trained on a dataset containing 202,599 face images
of 10,177 celebrities [9].
There are three main factors that affect the accuracy of
CNN-based methods for face recognition: training data, CNN
architecture, and loss function. As in most deep learning
applications, large training sets are needed to prevent overfit-
ting. In general, CNNs trained for classification become more
accurate as the number of samples per class increases. This is
because the CNN model is able to learn more robust features
when it is exposed to more intra-class variations. However, in
face recognition we are interested in extracting features that
generalise to subjects not present in the training set. Hence,
the datasets used for face recognition need to also contain a
large number of subjects so that the model is exposed to more
inter-class variations. The effect that the number of subjects
in a dataset has in face recognition accuracy was studied in
[110]. In this work, a large dataset was first sorted by the
number of images per subject in decreasing order. Then, a
CNN was trained with different subsets of training data by
gradually increasing the number of subjects. The best accuracy
was obtained when the first 10,000 subjects with the most
images were used for training. Adding more subjects decreased
the accuracy since very few images were available for each
extra subject. Another study [111] investigated whether wider
datasets are better than deeper datasets or vice versa (a dataset
is considered wider than another if it contains more subjects;
similarly, a dataset is considered deeper than another if it
contains more images per subject). From this study, it was
concluded that given the same number of images, wider
datasets provide better accuracy. The authors argued that this
is due to the fact that wider datasets contain more inter-class
variations and, therefore, generalise better to unseen subjects.
Table I shows some of the most common public datasets used
to train CNNs for face recognition.
CNN architectures for face recognition have been inspired
by those achieving state-of-the-art accuracy on the ImageNet
Large Scale Visual Recognition Challenge (ILSVRC). For
example, a version of the VGG network [112] with 16 layers
was used in [11], and a similar but smaller network was used
in [10]. In [102], two different types of CNN architectures
were explored: VGG style networks [112] and GoogleNet style
networks [113].

Fig. 7: Original residual block proposed in [114].

Even though both types of networks achieved
comparable accuracy, the GoogleNet style networks had 20
times fewer parameters. More recently, residual networks
(ResNets) [114] have become the preferred choice for many
object recognition tasks, including face recognition [115],
[116], [117], [118], [119], [120], [121]. The main novelty of
ResNets is the introduction of a building block that uses a
shortcut connection to learn a residual mapping, as shown in
Figure 7. The use of shortcut connections allows the training
of much deeper architectures as they facilitate the flow of
information across layers. A thorough study of different CNN
architectures was carried out in [121]. The best trade-off
between accuracy, speed and model size was obtained with
a 100-layer ResNet with a residual block similar to the one
proposed in [122].
The choice of loss function for training CNN-based methods
has been the most recent active area of research in face
recognition. Even though CNNs trained with softmax loss
have been very successful [99], [9], [10], [123], it has been
argued that the use of this loss function does not generalise
well to subjects not present in the training set. This is because
the softmax loss encourages the network to learn features that increase
inter-class differences (to be able to separate the classes in
the training set) but does not necessarily reduce intra-class
variations. Several methods have been proposed to mitigate
this issue. A simple approach is to optimise the bottleneck
features using a discriminative subspace method such as joint
Bayesian [48], as done in [9], [124], [125], [126], [10], [127].
Another approach is to use metric learning. For example, a
pairwise contrastive loss was used as the only supervisory
signal in [100], [101] and combined with a classification loss
in [124], [125], [126]. One of the most popular metric learning
approaches for face recognition is the triplet loss function
[128], first used in [102] for the face recognition task. The aim
of the triplet loss is to separate the distance between positive
pairs from the distance between negative pairs by a margin.
More formally, for each triplet $i$ the following condition needs
to be satisfied [102]:

$$\|f(x_a) - f(x_p)\|_2^2 + \alpha < \|f(x_a) - f(x_n)\|_2^2 \quad (8)$$

where $x_a$ is an anchor image, $x_p$ is an image of the same
subject, $x_n$ is an image of a different subject, $f$ is a mapping
learnt by a model and $\alpha$ is a margin that is enforced between
positive and negative pairs. In practice, CNNs trained with
triplet loss converge slower than with softmax loss due to the
large number of triplets (or pairs in the case of contrastive
loss) needed to cover the entire training set. Although this
problem can be alleviated by selecting hard triplets (i.e. triplets
that violate the margin condition) during training [102], it is
common to train with softmax loss in a first training stage and
then fine-tune bottleneck features with triplet loss in a second
training stage [11], [129], [130]. Some variations of the triplet
loss have been proposed. For example, in [129], the dot prod-
uct was used as a similarity measure instead of the Euclidean
distance; a probabilistic triplet loss was proposed in [130];
and a modified triplet loss that also minimises the standard
deviation of the distributions of positive and negative scores
was proposed in [131], [132]. An alternative loss function used
to learn discriminative features is the centre loss proposed
in [133]. The goal of the centre loss is to minimise the
distances between bottleneck features and their corresponding
class centres. By jointly training with softmax and centre
loss, it was shown that the features learnt by a CNN could
effectively increase inter-personal variations (softmax loss)
and reduce intra-personal variations (centre loss). The centre
loss has the advantage of being more efficient and easier to
implement than the contrastive and triplet losses since it does
not require forming pairs or triplets during training. Another
related metric learning method is the range loss proposed in
[134] for improving training with unbalanced datasets. The
range loss has two components. The intra-class component
of the loss minimises the k-largest distances between samples
of the same class, and the inter-class component of the loss
maximises the distance between the closest two class centres
in each training batch. By using these extreme cases, the range
loss uses the same information from each class, regardless of
how many samples per class are available. Similar to the centre
loss, the range loss needs to be combined with softmax loss
to prevent the loss from degrading to zero [133].
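A minimal PyTorch sketch of a hinge-based triplet loss derived from Equation 8 is given below; the margin value is illustrative, and the embeddings are assumed to be produced by the model $f$.

```python
import torch

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge form of Equation 8: the squared distance to the positive plus a margin
    should be smaller than the squared distance to the negative.
    anchor, positive, negative: (batch, dim) embeddings f(x)."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # ||f(x_a) - f(x_p)||^2
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # ||f(x_a) - f(x_n)||^2
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```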
One of the difficulties that arise when combining different
loss functions is finding the correct balance between each
term. Recently, several approaches have proposed to modify
the softmax loss so that it can learn discriminative features
with no need to combine it with other losses. One approach
that has been shown to increase the discriminative ability of the
bottleneck features is feature normalisation [115], [118]. For
example, [115] proposed to normalise the features to have unit
L2-norm and [118] proposed to normalise the features to have
zero mean and unit variance. A very successful development
has been the introduction of a margin in the decision boundary
between each class in the softmax loss [135]. For simplicity,
consider binary classification with softmax loss. In this case,
Fig. 8: Effect of introducing a margin m in the decision
boundary between two classes. (a) Softmax loss. (b) Softmax
loss with margin.
TABLE II: Decision boundaries for different variations of the
softmax loss with margin. Note that the decision boundaries
are for class 1 in a binary classification case.
Type of softmax margin                 Decision boundary
Multiplicative angular margin [116]    $\|x\|(\cos(m\theta_1) - \cos\theta_2) = 0$
Additive cosine margin [119], [120]    $s(\cos\theta_1 - m - \cos\theta_2) = 0$
Additive angular margin [121]          $s(\cos(\theta_1 + m) - \cos\theta_2) = 0$
the decision boundary between each class (if the biases are
zero) is given by:

$$\|x\|(\|W_1\|\cos\theta_1 - \|W_2\|\cos\theta_2) = 0 \quad (9)$$

where $x$ is a feature vector, $W_1$ and $W_2$ are the weights
corresponding to each class and $\theta_1$ and $\theta_2$ are the angles
between $x$ and $W_1$ and $W_2$ respectively. By introducing
a multiplicative margin $m$ in Equation 9, the two decision
boundaries become more stringent:

$$\|x\|(\|W_1\|\cos(m\theta_1) - \|W_2\|\cos\theta_2) = 0 \text{ for class 1} \quad (10)$$

$$\|x\|(\|W_1\|\cos\theta_1 - \|W_2\|\cos(m\theta_2)) = 0 \text{ for class 2} \quad (11)$$
As shown in Figure 8, the margin can effectively increase the
separation between classes and their intra-class compactness.
Several alternative approaches have been proposed depending
on how the margin is incorporated into the loss [116], [119],
[120], [121]. For example, in [116] the weight vectors were
normalised to have unit norm so that the decision boundary
only depends on the angles θ1and θ2. In [119], [120],
an additive cosine margin was proposed. Compared to the
multiplicative margin [135], [116], the additive margin is
easier to implement and optimise. In this work, apart from
normalising the weight vectors, the feature vectors were also
normalised and scaled as done in [115]. An alternative additive
margin was proposed in [121] which keeps the advantages of
[119], [120] but has a better geometric interpretation since the
margin is added to the angle and not to the cosine. Table II
summarises the decision boundaries for the different variations
of the softmax loss with margin. These approaches are the
current state-of-the-art in face recognition.
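As an illustration, the additive angular margin (last row of Table II) can be sketched in PyTorch as follows; both features and classifier weights are L2-normalised so that the logits become scaled cosines, and the scale s and margin m values shown are illustrative.

```python
import torch
import torch.nn.functional as F

def margin_softmax_logits(features, weights, labels, s=64.0, m=0.5):
    """Additive angular margin logits (last row of Table II).

    features: (batch, dim) bottleneck features, weights: (num_classes, dim)
    classifier weights, labels: (batch,) ground-truth class indices."""
    cos = F.normalize(features) @ F.normalize(weights).t()   # cos(theta) for every class
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, weights.size(0)).bool()
    # Add the margin m to the angle of the ground-truth class only
    cos_margin = torch.where(target, torch.cos(theta + m), cos)
    return s * cos_margin    # fed to the usual cross-entropy (softmax) loss
```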
III. CONCLUSIONS
We have seen how face recognition has followed the same
transition as many other computer vision applications. Tra-
ditional methods based on hand-engineered features that pro-
vided state-of-the-art accuracy only a few years ago have been
replaced by deep learning methods based on CNNs. Indeed,
face recognition systems based on CNNs have become the
standard due to the significant accuracy improvement achieved
over other types of methods. Moreover, it is straightforward
to scale-up these systems to achieve even higher accuracy by
increasing the size of the training sets and/or the capacity of
the networks. However, collecting large amounts of labelled
face images is expensive, and very deep CNN architectures
are slow to train and deploy. Generative adversarial networks
(GANs) [136] are a promising solution to the first issue. Re-
cent works on GANs with face images include facial attributes
manipulation [137], [138], [139], [140], [141], [142], [143],
[144], [145], [146], facial expression editing [147], [148],
[142], generation of novel identities [149], face frontalisation
[150], [151] and face ageing [152], [153]. It is expected
that these advancements will be used to generate additional
training images without requiring millions of face images to
be labelled. To address the second issue, more efficient archi-
tectures such as MobileNets [154], [155] are being developed
and used for real-time face recognition on devices with limited
computational resources [156].
REFERENCES
[1] M. D. Kelly, “Visual identification of people by computer.,” tech. rep.,
STANFORD UNIV CALIF DEPT OF COMPUTER SCIENCE, 1970.
[2] T. Kanade, “Picture processing by computer complex and recogni-
tion of human faces,” PhD Thesis, Kyoto University, 1973.
[3] K. Delac and M. Grgic, “A survey of biometric recognition methods,” in
46th International Symposium Electronics in Marine, vol. 46, pp. 16–
18, 2004.
[4] U. Park, Y. Tong, and A. K. Jain, “Age-invariant face recognition,
IEEE transactions on pattern analysis and machine intelligence,
vol. 32, no. 5, pp. 947–954, 2010.
[5] Z. Li, U. Park, and A. K. Jain, “A discriminative model for age
invariant face recognition,IEEE transactions on information forensics
and security, vol. 6, no. 3, pp. 1028–1037, 2011.
[6] C. Ding and D. Tao, “A comprehensive survey on pose-invariant face
recognition,” ACM Transactions on intelligent systems and technology
(TIST), vol. 7, no. 3, p. 37, 2016.
[7] D.-H. Liu, K.-M. Lam, and L.-S. Shen, “Illumination invariant face
recognition,” Pattern Recognition, vol. 38, no. 10, pp. 1705–1716,
2005.
[8] X. Tan and B. Triggs, “Enhanced local texture feature sets for face
recognition under difficult lighting conditions,IEEE transactions on
image processing, vol. 19, no. 6, pp. 1635–1650, 2010.
[9] Y. Sun, X. Wang, and X. Tang, “Deep learning face representation from
predicting 10,000 classes,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 1891–1898, 2014.
[10] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from
scratch,” arXiv preprint arXiv:1411.7923, 2014.
[11] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al., “Deep face recogni-
tion.,” in BMVC, vol. 1, p. 6, 2015.
[12] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A
dataset and benchmark for large-scale face recognition,” in European
Conference on Computer Vision, pp. 87–102, Springer, 2016.
[13] A. Nech and I. Kemelmacher-Shlizerman, “Level playing field for
million scale face recognition,” in 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 3406–3415, IEEE, 2017.
[14] A. Bansal, A. Nanduri, C. D. Castillo, R. Ranjan, and R. Chellappa,
“Umdfaces: An annotated face dataset for training deep networks,
in Biometrics (IJCB), 2017 IEEE International Joint Conference on,
pp. 464–473, IEEE, 2017.
[15] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “Vggface2:
A dataset for recognising faces across pose and age,” arXiv preprint
arXiv:1710.08092, 2017.
[16] T. Hassner, S. Harel, E. Paz, and R. Enbar, “Effective face frontalization
in unconstrained images,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 4295–4304, 2015.
[17] R. Brunelli and T. Poggio, “Face recognition: Features versus tem-
plates,” IEEE transactions on pattern analysis and machine intelli-
gence, vol. 15, no. 10, pp. 1042–1052, 1993.
[18] J. Shi, A. Samal, and D. Marx, “How effective are landmarks and
their geometry for face recognition?,” Computer vision and image
understanding, vol. 102, no. 2, pp. 117–133, 2006.
[19] I. L. Dryden and K. V. Mardia, Statistical shape analysis, vol. 4. Wiley
Chichester, 1998.
[20] F. Daniyal, P. Nair, and A. Cavallaro, “Compact signatures for 3d
face recognition under varying expressions,” in Advanced Video and
Signal Based Surveillance, 2009. AVSS’09. Sixth IEEE International
Conference on, pp. 302–307, IEEE, 2009.
[21] S. Gupta, M. K. Markey, and A. C. Bovik, “Anthropometric 3d face
recognition,” International journal of computer vision, vol. 90, no. 3,
pp. 331–349, 2010.
[22] L. Sirovich and M. Kirby, “Low-dimensional procedure for the char-
acterization of human faces,” Josa a, vol. 4, no. 3, pp. 519–524, 1987.
[23] M. Kirby and L. Sirovich, “Application of the karhunen-loeve proce-
dure for the characterization of human faces,” IEEE Transactions on
Pattern analysis and Machine intelligence, vol. 12, no. 1, pp. 103–108,
1990.
[24] M. Turk and A. Pentland, “Eigenfaces for recognition,Journal of
cognitive neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[25] B. Moghaddam, W. Wahid, and A. Pentland, “Beyond eigenfaces:
Probabilistic matching for face recognition,” in Automatic Face and
Gesture Recognition, 1998. Proceedings. Third IEEE International
Conference on, pp. 30–35, IEEE, 1998.
[26] B. Schölkopf, A. Smola, and K.-R. Müller, “Kernel principal com-
ponent analysis,” in International Conference on Artificial Neural
Networks, pp. 583–588, Springer, 1997.
[27] K. I. Kim, K. Jung, and H. J. Kim, “Face recognition using kernel
principal component analysis,” IEEE signal processing letters, vol. 9,
no. 2, pp. 40–42, 2002.
[28] P. Comon, “Independent component analysis, a new concept?,” Signal
processing, vol. 36, no. 3, pp. 287–314, 1994.
[29] M. S. Bartlett, “Independent component representations for face recog-
nition,” in Face Image Analysis by Unsupervised Learning, pp. 39–67,
Springer, 2001.
[30] J. Yang, D. Zhang, A. F. Frangi, and J.-y. Yang, “Two-dimensional pca:
a new approach to appearance-based face representation and recogni-
tion,” IEEE transactions on pattern analysis and machine intelligence,
vol. 26, no. 1, pp. 131–137, 2004.
[31] F. S. Samaria and A. C. Harter, “Parameterisation of a stochastic model
for human face identification,” in Applications of Computer Vision,
1994., Proceedings of the Second IEEE Workshop on, pp. 138–142,
IEEE, 1994.
[32] R. A. Fisher, “The statistical utilization of multiple measurements,”
Annals of Human Genetics, vol. 8, no. 4, pp. 376–386, 1938.
[33] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs.
fisherfaces: Recognition using class specific linear projection,” IEEE
Transactions on pattern analysis and machine intelligence, vol. 19,
no. 7, pp. 711–720, 1997.
[34] K. Etemad and R. Chellappa, “Discriminant analysis for recognition
of human face images,” JOSA A, vol. 14, no. 8, pp. 1724–1733, 1997.
[35] W. Zhao, A. Krishnaswamy, R. Chellappa, D. L. Swets, and J. Weng,
“Discriminant analysis of principal components for face recognition,”
in Face Recognition, pp. 73–85, Springer, 1998.
[36] W. Zhao, R. Chellappa, and P. J. Phillips, Subspace linear discriminant
analysis for face recognition. Citeseer, 1999.
[37] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller,
“Fisher discriminant analysis with kernels,” in Neural networks for
signal processing IX, 1999. Proceedings of the 1999 IEEE signal
processing society workshop, pp. 41–48, IEEE, 1999.
[38] Q. Liu, R. Huang, H. Lu, and S. Ma, “Face recognition using kernel-
based fisher discriminant analysis,” in Automatic Face and Gesture
Recognition, 2002. Proceedings. Fifth IEEE International Conference
on, pp. 197–201, IEEE, 2002.
[39] S. Ioffe, “Probabilistic linear discriminant analysis,” in European
Conference on Computer Vision, pp. 531–542, Springer, 2006.
[40] P. J. Phillips, “Support vector machines applied to face recognition,”
in Advances in Neural Information Processing Systems, pp. 803–809,
1999.
[41] K. Jonsson, J. Kittler, Y. Li, and J. Matas, “Support vector machines
for face authentication,” Image and Vision Computing, vol. 20, no. 5-6,
pp. 369–375, 2002.
[42] X. He, S. Yan, Y. Hu, P. Niyogi, and H.-J. Zhang, “Face recognition
using laplacianfaces,” IEEE transactions on pattern analysis and ma-
chine intelligence, vol. 27, no. 3, pp. 328–340, 2005.
[43] D. Cai, X. He, J. Han, and H.-J. Zhang, “Orthogonal laplacianfaces
for face recognition,” IEEE transactions on image processing, vol. 15,
no. 11, pp. 3608–3614, 2006.
[44] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face
recognition via sparse representation,” IEEE transactions on pattern
analysis and machine intelligence, vol. 31, no. 2, pp. 210–227, 2009.
[45] Q. Zhang and B. Li, “Discriminative k-svd for dictionary learning in
face recognition,” in Computer Vision and Pattern Recognition (CVPR),
2010 IEEE Conference on, pp. 2691–2698, IEEE, 2010.
[46] Z. Zhou, A. Wagner, H. Mobahi, J. Wright, and Y. Ma, “Face
recognition with contiguous occlusion using markov random fields,”
in Computer Vision, 2009 IEEE 12th International Conference on,
pp. 1050–1057, IEEE, 2009.
[47] H. Jia and A. M. Martinez, “Face recognition with occlusions in the
training and testing sets,” in Automatic Face & Gesture Recognition,
2008. FG’08. 8th IEEE International Conference on, pp. 1–6, IEEE,
2008.
[48] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, “Bayesian face
revisited: A joint formulation,” in European Conference on Computer
Vision, pp. 566–579, Springer, 2012.
[49] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled
faces in the wild: A database for studying face recognition in uncon-
strained environments,” tech. rep., Technical Report 07-49, University
of Massachusetts, Amherst, 2007.
[50] A. Pentland, B. Moghaddam, T. Starner, et al., “View-based and
modular eigenspaces for face recognition,” 1994.
[51] B. Takacs, “Comparing face images using the modified hausdorff
distance,” Pattern Recognition, vol. 31, no. 12, pp. 1873–1881, 1998.
[52] D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge, “Compar-
ing images using the hausdorff distance,” IEEE Transactions on pattern
analysis and machine intelligence, vol. 15, no. 9, pp. 850–863, 1993.
[53] Y. Gao and M. K. Leung, “Face recognition using line edge map,” IEEE
transactions on pattern analysis and machine intelligence, vol. 24,
no. 6, pp. 764–779, 2002.
[54] L. Wiskott, N. Krüger, N. Kuiger, and C. Von Der Malsburg, “Face
recognition by elastic bunch graph matching,” IEEE Transactions on
pattern analysis and machine intelligence, vol. 19, no. 7, pp. 775–779,
1997.
[55] M. Lades, J. C. Vorbruggen, J. Buhmann, J. Lange, C. Von Der Mals-
burg, R. P. Wurtz, and W. Konen, “Distortion invariant object recogni-
tion in the dynamic link architecture,” IEEE Transactions on computers,
vol. 42, no. 3, pp. 300–311, 1993.
[56] T. S. Lee, “Image representation using 2d gabor wavelets,” IEEE
Transactions on Pattern Analysis & Machine Intelligence, no. 10,
pp. 959–971, 1996.
[57] W. T. Freeman and M. Roth, “Orientation histograms for hand gesture
recognition,” in International workshop on automatic face and gesture
recognition, vol. 12, pp. 296–301, 1995.
[58] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in Computer Vision and Pattern Recognition, 2005. CVPR
2005. IEEE Computer Society Conference on, vol. 1, pp. 886–893,
IEEE, 2005.
[59] A. Albiol, D. Monzo, A. Martin, J. Sastre, and A. Albiol, “Face
recognition using hog–ebgm,” Pattern Recognition Letters, vol. 29,
no. 10, pp. 1537–1543, 2008.
[60] K. Mikolajczyk and C. Schmid, “A performance evaluation of local
descriptors,” IEEE transactions on pattern analysis and machine intel-
ligence, vol. 27, no. 10, pp. 1615–1630, 2005.
[61] T. Ahonen, A. Hadid, and M. Pietikäinen, “Face description with local
binary patterns: Application to face recognition,” IEEE transactions on
pattern analysis and machine intelligence, vol. 28, no. 12, pp. 2037–
2041, 2006.
[62] D. Huang, C. Shan, M. Ardabilian, Y. Wang, and L. Chen, “Local
binary patterns and its application to facial image analysis: a survey,”
IEEE Transactions on Systems, Man, and Cybernetics, Part C (Appli-
cations and Reviews), vol. 41, no. 6, pp. 765–781, 2011.
[63] W. Zhang, S. Shan, H. Zhang, W. Gao, and X. Chen, “Multi-resolution
histograms of local variation patterns (mhlvp) for robust face recogni-
tion,” in International Conference on Audio-and Video-Based Biometric
Person Authentication, pp. 937–944, Springer, 2005.
[64] W. Zhang, S. Shan, W. Gao, X. Chen, and H. Zhang, “Local gabor
binary pattern histogram sequence (lgbphs): a novel non-statistical
model for face representation and recognition,” in Computer Vision,
2005. ICCV 2005. Tenth IEEE International Conference on, vol. 1,
pp. 786–791, IEEE, 2005.
[65] T. Ahonen, J. Matas, C. He, and M. Pietikäinen, “Rotation invariant
image description with local binary pattern histogram fourier features,”
in Scandinavian Conference on Image Analysis, pp. 61–70, Springer,
2009.
[66] B. Zhang, Y. Gao, S. Zhao, and J. Liu, “Local derivative pattern versus
local binary pattern: face recognition with high-order local pattern
descriptor,” IEEE transactions on image processing, vol. 19, no. 2,
pp. 533–544, 2010.
[67] D. G. Lowe, “Object recognition from local scale-invariant features,”
in Computer vision, 1999. The proceedings of the seventh IEEE
international conference on, vol. 2, pp. 1150–1157, IEEE, 1999.
[68] M. Bicego, A. Lagorio, E. Grosso, and M. Tistarelli, “On the use of
sift features for face authentication,” in Computer Vision and Pattern
Recognition Workshop, 2006. CVPRW’06. Conference on, pp. 35–35,
IEEE, 2006.
[69] P. Dreuw, P. Steingrube, H. Hanselmann, H. Ney, and G. Aachen, “Surf-
face: Face recognition under viewpoint consistency constraints.,” in
BMVC, pp. 1–11, 2009.
[70] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust
features,” in European conference on computer vision, pp. 404–417,
Springer, 2006.
[71] C. Geng and X. Jiang, “Face recognition using sift features,” in
Image Processing (ICIP), 2009 16th IEEE International Conference
on, pp. 3313–3316, IEEE, 2009.
[72] Z. Cao, Q. Yin, X. Tang, and J. Sun, “Face recognition with learning-
based descriptor,” in Computer Vision and Pattern Recognition (CVPR),
2010 IEEE Conference on, pp. 2707–2714, IEEE, 2010.
[73] S. Lloyd, “Least squares quantization in pcm,” IEEE transactions on
information theory, vol. 28, no. 2, pp. 129–137, 1982.
[74] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma, “Learning the
structure of manifolds using random projections,” in Advances in
Neural Information Processing Systems, pp. 473–480, 2008.
[75] G. Sharma, S. ul Hussain, and F. Jurie, “Local higher-order statistics
(lhs) for texture categorization and facial analysis,” in European
Conference on Computer Vision, pp. 1–12, Springer, 2012.
[76] Z. Lei, M. Pietikäinen, and S. Z. Li, “Learning discriminant face
descriptor,” IEEE Transactions on Pattern Analysis and Machine In-
telligence, vol. 36, no. 2, pp. 289–302, 2014.
[77] C. Liu and H. Wechsler, “Gabor feature based classification using the
enhanced fisher linear discriminant model for face recognition,” IEEE
Transactions on Image processing, vol. 11, no. 4, pp. 467–476, 2002.
[78] C. Liu and H. Wechsler, “Independent component analysis of gabor
features for face recognition,” IEEE transactions on Neural Networks,
vol. 14, no. 4, pp. 919–928, 2003.
[79] C. Liu, “Gabor-based kernel pca with fractional power polynomial
models for face recognition,” IEEE transactions on pattern analysis
and machine intelligence, vol. 26, no. 5, pp. 572–581, 2004.
[80] C. Liu and H. Wechsler, “Robust coding schemes for indexing and
retrieval from large face databases,” IEEE Transactions on image
processing, vol. 9, no. 1, pp. 132–137, 2000.
[81] C.-H. Chan, J. Kittler, and K. Messer, “Multi-scale local binary
pattern histograms for face recognition,” in International conference
on biometrics, pp. 809–818, Springer, 2007.
[82] C.-H. Chan, J. Kittler, and K. Messer, “Multispectral local binary
pattern histogram for component-based color face verification,” in
Biometrics: Theory, Applications, and Systems, 2007. BTAS 2007. First
IEEE International Conference on, pp. 1–7, IEEE, 2007.
[83] D. Zhao, Z. Lin, and X. Tang, “Laplacian pca and its applications,” in
Computer Vision, 2007. ICCV 2007. IEEE 11th International Confer-
ence on, pp. 1–8, IEEE, 2007.
[84] L. Wolf, T. Hassner, and Y. Taigman, “Descriptor based methods in the
wild,” in Workshop on faces in’real-life’images: Detection, alignment,
and recognition, 2008.
[85] D. Chen, X. Cao, F. Wen, and J. Sun, “Blessing of dimensionality:
High-dimensional feature and its efficient compression for face veri-
fication,” in Computer Vision and Pattern Recognition (CVPR), 2013
IEEE Conference on, pp. 3025–3032, IEEE, 2013.
[86] C. Lu and X. Tang, “Surpassing human-level face verification perfor-
mance on lfw with gaussianface.,” in AAAI, pp. 3811–3819, 2015.
[87] R. Urtasun and T. Darrell, “Discriminative gaussian process latent vari-
able model for classification,” in Proceedings of the 24th international
conference on Machine learning, pp. 927–934, ACM, 2007.
[88] X. Tan and B. Triggs, “Fusing gabor and lbp feature sets for kernel-
based face recognition,” in International Workshop on Analysis and
Modeling of Faces and Gestures, pp. 235–249, Springer, 2007.
[89] H. Cevikalp, M. Neamtu, and M. Wilkes, “Discriminative common
vector method with kernels,” IEEE Transactions on Neural Networks,
vol. 17, no. 6, pp. 1550–1565, 2006.
[90] S. Shan, W. Zhang, Y. Su, X. Chen, and W. Gao, “Ensemble of
piecewise fda based on spatial histograms of local (gabor) binary
patterns for face recognition,” in Pattern Recognition, 2006. ICPR
2006. 18th International Conference on, vol. 3, IEEE, 2006.
[91] C. H. Chan, M. A. Tahir, J. Kittler, and M. Pietikäinen, “Multiscale
local phase quantization for robust component-based face recognition
using kernel fusion of multiple descriptors,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 35, no. 5, pp. 1164–
1177, 2013.
[92] E. Rahtu, J. Heikkilä, V. Ojansivu, and T. Ahonen, “Local phase
quantization for blur-insensitive image analysis,” Image and Vision
Computing, vol. 30, no. 8, pp. 501–512, 2012.
[93] T. K. Ho, “The random subspace method for constructing decision
forests,” IEEE transactions on pattern analysis and machine intelli-
gence, vol. 20, no. 8, pp. 832–844, 1998.
[94] L. Breiman, “Bagging predictors,” Machine learning, vol. 24, no. 2,
pp. 123–140, 1996.
[95] D. Sáez-Trigueros, H. Hertlein, L. Meng, and M. Hartnett, “Shape
and texture combined face recognition for detection of forged id doc-
uments,” in Information and Communication Technology, Electronics
and Microelectronics (MIPRO), 2016 39th International Convention
on, pp. 1343–1348, IEEE, 2016.
[96] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, “Attribute
and simile classifiers for face verification,” in Computer Vision, 2009
IEEE 12th International Conference on, pp. 365–372, IEEE, 2009.
[97] T. Berg and P. N. Belhumeur, “Tom-vs-pete classifiers and identity-
preserving alignment for face verification.,” in BMVC, vol. 2, p. 7,
Citeseer, 2012.
[98] M. Guillaumin, J. Verbeek, and C. Schmid, “Is that you? metric
learning approaches for face identification,” in Computer Vision, 2009
IEEE 12th international conference on, pp. 498–505, IEEE, 2009.
[99] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the
gap to human-level performance in face verification,” in Proceedings
of the IEEE conference on computer vision and pattern recognition,
pp. 1701–1708, 2014.
[100] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric
discriminatively, with application to face verification,” in Computer
Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer
Society Conference on, vol. 1, pp. 539–546, IEEE, 2005.
[101] H. Fan, Z. Cao, Y. Jiang, Q. Yin, and C. Doudou, “Learning deep face
representation,” arXiv preprint arXiv:1403.2802, 2014.
[102] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified
embedding for face recognition and clustering,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, pp. 815–
823, 2015.
[103] S.-H. Lin, S.-Y. Kung, and L.-J. Lin, “Face recognition/detection by
probabilistic decision-based neural network,” IEEE transactions on
neural networks, vol. 8, no. 1, pp. 114–132, 1997.
[104] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, “Face recogni-
tion: A convolutional neural-network approach,” IEEE transactions on
neural networks, vol. 8, no. 1, pp. 98–113, 1997.
[105] T. Kohonen, “The self-organizing map,” Neurocomputing, vol. 21,
no. 1-3, pp. 1–6, 1998.
[106] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature
verification using a ‘siamese’ time delay neural network,” in Advances
in Neural Information Processing Systems, pp. 737–744, 1994.
[107] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural
information processing systems, pp. 1097–1105, 2012.
[108] K. Gregor and Y. LeCun, “Emergence of complex-like cells in a
temporal product network with local receptive fields,” arXiv preprint
arXiv:1006.0448, 2010.
[109] G. B. Huang, H. Lee, and E. Learned-Miller, “Learning hierarchical
representations for face verification with convolutional deep belief
networks,” in Computer Vision and Pattern Recognition (CVPR), 2012
IEEE Conference on, pp. 2518–2525, IEEE, 2012.
[110] E. Zhou, Z. Cao, and Q. Yin, “Naive-deep face recognition: Touching
the limit of lfw benchmark or not?,” arXiv preprint arXiv:1501.04690,
2015.
[111] A. Bansal, C. Castillo, R. Ranjan, and R. Chellappa, “The
do’s and don’ts for cnn-based face verification,” arXiv preprint
arXiv:1705.07426, vol. 5, 2017.
[112] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[113] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, A. Rabinovich, et al., “Going deeper with
convolutions,” CVPR, 2015.
[114] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, pp. 770–778, 2016.
[115] R. Ranjan, C. D. Castillo, and R. Chellappa, “L2-constrained
softmax loss for discriminative face verification,” arXiv preprint
arXiv:1703.09507, 2017.
[116] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep
hypersphere embedding for face recognition,” in The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2017.
[117] Y. Wu, H. Liu, J. Li, and Y. Fu, “Deep face recognition with center
invariant loss,” in Proceedings of the on Thematic Workshops of ACM
Multimedia 2017, pp. 408–414, ACM, 2017.
[118] A. Hasnat, J. Bohné, J. Milgram, S. Gentric, and L. Chen, “Deepvisage:
Making face recognition simple yet with powerful generalization
skills,” arXiv preprint arXiv:1703.08388, 2017.
[119] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, and W. Liu,
“Cosface: Large margin cosine loss for deep face recognition,” arXiv
preprint arXiv:1801.09414, 2018.
[120] F. Wang, W. Liu, H. Liu, and J. Cheng, “Additive margin softmax for
face verification,” arXiv preprint arXiv:1801.05599, 2018.
[121] J. Deng, J. Guo, and S. Zafeiriou, “Arcface: Additive angular margin
loss for deep face recognition,” arXiv preprint arXiv:1801.07698, 2018.
[122] Y. Yamada, M. Iwamura, and K. Kise, “Deep pyramidal resid-
ual networks with separated stochastic depth,” arXiv preprint
arXiv:1612.01230, 2016.
[123] X. Wu, R. He, and Z. Sun, “A lightened cnn for deep face representa-
tion,” in 2015 IEEE Conference on IEEE Computer Vision and Pattern
Recognition (CVPR), vol. 4, 2015.
[124] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face rep-
resentation by joint identification-verification,” in Advances in neural
information processing systems, pp. 1988–1996, 2014.
[125] Y. Sun, X. Wang, and X. Tang, “Deeply learned face representations are
sparse, selective, and robust,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 2892–2900, 2015.
[126] Y. Sun, D. Liang, X. Wang, and X. Tang, “Deepid3: Face recognition
with very deep neural networks,” arXiv preprint arXiv:1502.00873,
2015.
[127] J.-C. Chen, V. M. Patel, and R. Chellappa, “Unconstrained face
verification using deep cnn features,” in Applications of Computer
Vision (WACV), 2016 IEEE Winter Conference on, pp. 1–9, IEEE, 2016.
[128] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large
margin nearest neighbor classification,” Journal of Machine Learning
Research, vol. 10, no. Feb, pp. 207–244, 2009.
[129] S. Sankaranarayanan, A. Alavi, and R. Chellappa, “Triplet similarity
embedding for face verification,” arXiv preprint arXiv:1602.03418,
2016.
[130] S. Sankaranarayanan, A. Alavi, C. D. Castillo, and R. Chellappa,
“Triplet probabilistic embedding for face verification and clustering,”
in Biometrics Theory, Applications and Systems (BTAS), 2016 IEEE
8th International Conference on, pp. 1–8, IEEE, 2016.
[131] B. Kumar, G. Carneiro, I. Reid, et al., “Learning local image descriptors
with deep siamese and triplet convolutional networks by minimising
global loss functions,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 5385–5394, 2016.
[132] D. S. Trigueros, L. Meng, and M. Hartnett, “Enhancing convolutional
neural networks for face recognition with occlusion maps and batch
triplet loss,” Image and Vision Computing, 2018.
[133] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature
learning approach for deep face recognition,” in European Conference
on Computer Vision, pp. 499–515, Springer, 2016.
[134] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao, “Range loss for
deep face recognition with long-tail,” arXiv preprint arXiv:1611.08976,
2016.
[135] W. Liu, Y. Wen, Z. Yu, and M. Yang, “Large-margin softmax loss for
convolutional neural networks.,” in ICML, pp. 507–516, 2016.
[136] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,”
in Advances in neural information processing systems, pp. 2672–2680,
2014.
[137] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther,
“Autoencoding beyond pixels using a learned similarity metric,” arXiv
preprint arXiv:1512.09300, 2015.
[138] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez,
“Invertible conditional gans for image editing,” arXiv preprint
arXiv:1611.06355, 2016.
[139] A. Brock, T. Lim, J. M. Ritchie, and N. Weston, “Neural photo
editing with introspective adversarial networks,” arXiv preprint
arXiv:1609.07093, 2016.
[140] W. Shen and R. Liu, “Learning residual images for face attribute
manipulation,” in 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 1225–1233, IEEE, 2017.
[141] Y. Lu, Y.-W. Tai, and C.-K. Tang, “Conditional cyclegan for attribute
guided face image generation,” arXiv preprint arXiv:1705.09966, 2017.
[142] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan:
Unified generative adversarial networks for multi-domain image-to-
image translation,” arXiv preprint arXiv:1711.09020, 2017.
[143] W. Yin, Y. Fu, L. Sigal, and X. Xue, “Semi-latent gan: Learning to
generate and modify facial images from attributes,” arXiv preprint
arXiv:1704.02166, 2017.
[144] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and
D. Samaras, “Neural face editing with intrinsic image disentangling,”
in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE
Conference on, pp. 5444–5453, IEEE, 2017.
[145] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, et al.,
“Fader networks: Manipulating images by sliding attributes,” in Ad-
vances in Neural Information Processing Systems, pp. 5969–5978,
2017.
[146] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “Arbitrary fa-
cial attribute editing: Only change what you want,” arXiv preprint
arXiv:1711.10678, 2017.
[147] Y. Zhou and B. E. Shi, “Photorealistic facial expression synthesis
by the conditional difference adversarial autoencoder,” arXiv preprint
arXiv:1708.09126, 2017.
[148] H. Ding, K. Sricharan, and R. Chellappa, “Exprgan: Facial expres-
sion editing with controllable expression intensity,” arXiv preprint
arXiv:1709.03842, 2017.
[149] C. Donahue, A. Balsubramani, J. McAuley, and Z. C. Lipton, “Se-
mantically decomposing the latent spaces of generative adversarial
networks,” arXiv preprint arXiv:1705.07904, 2017.
[150] R. Huang, S. Zhang, T. Li, R. He, et al., “Beyond face rotation: Global
and local perception gan for photorealistic and identity preserving
frontal view synthesis,” arXiv preprint arXiv:1704.04086, 2017.
[151] L. Tran, X. Yin, and X. Liu, “Representation learning by rotating your
faces,” arXiv preprint arXiv:1705.11136, 2017.
[152] G. Antipov, M. Baccouche, and J.-L. Dugelay, “Face aging
with conditional generative adversarial networks,” arXiv preprint
arXiv:1702.01983, 2017.
[153] Z. Zhang, Y. Song, and H. Qi, “Age progression/regression by condi-
tional adversarial autoencoder,” in The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), vol. 2, 2017.
[154] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convo-
lutional neural networks for mobile vision applications,” arXiv preprint
arXiv:1704.04861, 2017.
[155] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen,
“Inverted residuals and linear bottlenecks: Mobile networks for classifi-
cation, detection and segmentation,” arXiv preprint arXiv:1801.04381,
2018.
[156] S. Chen, Y. Liu, X. Gao, and Z. Han, “Mobilefacenets: Efficient
cnns for accurate real-time face verification on mobile devices,” arXiv
preprint arXiv:1804.07573, 2018.