IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. ?, NO. ?, ? ? 1
Efficient Group-n Encoding and Decoding for
Facial Age Estimation
Zichang Tan, Jun Wan, Member, IEEE, Zhen Lei, Senior Member, IEEE, Ruicong Zhi, Guodong Guo,
Senior Member, IEEE, and Stan Z. Li, Fellow, IEEE
Abstract—Different ages are closely related, especially adjacent ages, because aging is a slow and extremely non-stationary process with much randomness. To explore the relationship between the real age and its adjacent ages, an Age Group-n Encoding (AGEn) method is proposed in this paper. In our model, adjacent ages are grouped into the same group and each age corresponds to n groups. The ages grouped into the same group are regarded as an independent class in the training stage. On this basis, the original age estimation problem can be transformed into a series of binary classification sub-problems, and a deep Convolutional Neural Network (CNN) with multiple classifiers is designed to cope with them. A Local Age Decoding (LAD) strategy is further presented to accelerate the prediction process, which locally decodes the estimated age value from the ordinal classifiers. Besides, to alleviate the imbalanced data learning problem of each classifier, a penalty factor is inserted into the unified objective function to favor the minority class. Compared with state-of-the-art methods on the FG-NET, MORPH II, CACD and Chalearn LAP 2015 databases, the proposed method achieves the best performance.
Index Terms—Age estimation, deep learning, convolutional neural network, age grouping, data imbalance
1 INTRODUCTION
Human age estimation is an important component of face attribute analysis [1], with many real-world applications such as business intelligence, human computer interaction (HCI) and visual surveillance [2], [3], [4], [5]. However, human age is still hard to estimate precisely from a single face image, even though the problem has been extensively studied for many years.
The facial aging process is filled with randomness and is not stationary for everyone. The randomness has many sources, such as different diets, living or working environments and, most importantly, genes. All of those factors can affect human aging to some degree and lead to differences in appearance. In the real world, people of the same age may look different, appearing slightly older or younger compared to each other. On the other hand, faces of close ages look similar [6] because of the slow and gradual aging process; sometimes it is hard to judge which of two faces of close ages is older. So, there is a strong correlation between age classes, especially for adjacent ages.
Most previous methods estimated age by casting it as a classification problem [5], [7], [8], [9] or a regression problem [10], [11], [12], [13], [14]. For age classification, each age class is assumed to be independent of the others, which ignores the relationship between different classes. In contrast, regression treats age as a continuous value and employs regression methods to predict age from extracted features, such as Partial Least Squares (PLS) [15], Canonical Correlation Analysis (CCA) [16] and Support Vector Regression (SVR) [17]. However, those methods do not involve any aging information, either.

Z. Tan, J. Wan, Z. Lei and S.Z. Li are with the Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Room 1402, Intelligent Building, 95 Zhongguancun Donglu, Haidian District, Beijing 100190, China. Z. Tan is also with the University of Chinese Academy of Sciences, Beijing, China. (e-mail: tanzichang2016@ia.ac.cn, {jun.wan,zlei,szli}@nlpr.ia.ac.cn).
Ruicong Zhi is with the School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, P. R. China. e-mail: zhirc@ustb.edu.cn.
G. Guo is with the Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506 USA. e-mail: guodong.guo@mail.wvu.edu.
Manuscript received February 28, 2017; revised September 4, 2017; accepted November 17, 2017.
Due to the randomness of the aging process, the mapping between a face and its real age is ambiguous rather than exact. This is particularly evident for senior people: we may say that a man looks to be in his late thirties, but we can never be sure about his exact age just from his appearance. Thus, assigning each face a single age label seems difficult because of the strong correlation among age classes, especially among adjacent classes. Furthermore, training with several adjacent ages together may be more helpful for age estimation than treating each age as an independent class.
Inspired by this, we group the face images within a specific age range and then regard each age group as an independent class in the training stage. Our age grouping method is inspired by [18] but with crucial differences. Unlike [18], which groups the ages multiple times and each time divides them into non-overlapping groups, our method conducts the age division only once, dividing all ages into overlapping age groups. We carefully design the grouping strategy to encode ages into age groups so that each age corresponds to a unique set of groups. Based on this, the exact age can be recovered by decoding the group classification results according to the mapping relation between ages and age groups. Therefore, our method can be implemented in
Fig. 1. The pipeline of our framework for age estimation. It consists of two stages: training and testing. In the training stage, the training images with different scales are first processed by face detection, alignment and cropping. All images are aligned according to the centers of the two eyes and the upper lip. Then all training images are grouped by the Age Group-n Encoding strategy, where images from adjacent ages are grouped into the same group. After that, the training images are used to train the CNN. In the testing stage, the test image is first processed in the same way as in the training stage. Then the processed image is input into the trained CNN, and age group classification is employed to obtain the probability of each group. Finally, the predicted age is obtained by decoding the group classification results.
a single network rather than an ensemble of networks [18]
or cascaded networks [19].
Using the novel grouping method, we can transform the age estimation problem into a series of binary classification problems, where each classifier determines whether the face image belongs to the corresponding group or not. A CNN with multiple output layers is employed in our approach. Unlike [20], [21], our method aims to explore the relationship between adjacent ages based on age group classification, while the approaches of [20], [21] mainly exploit the relative order relation among age labels. Besides, each classifier of the network in [20], [21] acts as a comparator that determines whether or not the age of the input face is greater than a given value, while each classifier of our network aims to distinguish the images within each age group.
For each binary classifier, the number of training images belonging to the corresponding group is far smaller than the number belonging to the others (imbalanced data learning), because we group images only within a small age range. A viable solution to the imbalanced data problem is to modify the algorithm via cost-sensitive learning [22], [23]. In this paper, we modify our training algorithm by employing a penalty factor that shifts the bias of each classifier to favor the minority class, which increases the contribution of the minority class in the learning stage.
The proposed age estimation framework is shown in Fig. 1, and the source code and models are available at the website¹. The main contributions of our work include:
1) A novel age grouping strategy called Age Group-n Encoding (AGEn) is proposed, where adjacent ages are grouped into the same group and each age corresponds to n groups. Moreover, unlike employing an ensemble of multiple networks to obtain the exact age due to grouping the ages multiple times [18], only a single network (see Fig. 1) is used to make the prediction with our age division.
1. http://www.cbsr.ia.ac.cn/users/zctan/projects/AgeEncodingDecoding/main.htm
2) To accelerate the prediction process, a Local Age Decoding (LAD) strategy is proposed to obtain the predicted age by locally decoding the outputs of the binary classifiers.
3) Inspired by previous works [22], [23], we extend the cost-sensitive learning strategy used in traditional methods (e.g., Cost-Sensitive Dataspace Weighting with Adaptive Boosting [23] and Cost-Sensitive Decision Trees [23]) to the objective function of the proposed CNN framework for age estimation, which effectively deals with the imbalanced data problem caused by age grouping.
4) Our method achieves the state-of-the-art results on
multiple datasets, including FG-NET [24], MORPH
II [25], CACD [26] and Chalearn LAP 2015 databases
[27].
2 RELATED WORK
Human age estimation has been studied extensively for over 20 years. The earliest work on age estimation was possibly reported by Kwon et al. [29] in the 1990s, which used hand-crafted features to judge the coarse age range of face images, such as baby, young adult and senior adult. However, only dozens of face images were analyzed in their work, and at that time the lack of a large-scale age dataset hindered the development of age estimation technology. With the joint efforts of many scholars from all over the world, large age datasets such as the FG-NET [30], MORPH II [25] and CACD [26] databases are now available to the community, and they remain the most popular age datasets.
With the development of facial analysis technology, researchers started to predict the exact age rather than simply
Fig. 2. The architecture of the proposed network. Our network is based on the VGG-16 network [28] and takes a BGR face image of size 224 × 224 as input. The network contains two fully connected layers, the latter of which produces a feature vector for age group classification. After that, the network branches into T output layers, where each layer acts as a binary classifier that judges whether the input image belongs to the corresponding age group or not. Moreover, all convolutional layers are followed by ReLU non-linearities.
estimate the coarse age range from face images. A large number of methods have been proposed for age estimation, such as Active Appearance Models (AAM) [31], AGing pattErn Subspace (AGES) [7], [32], age manifold [10], [33], [34], and methods with local features [8], [35], [36]. Among those local features, Biologically Inspired Features (BIF) [8] have the most outstanding ability for age estimation. After features are extracted by local image descriptors, classification or regression methods are employed to obtain the predicted age, such as BIF+SVM [8], BIF+SVR [8] and BIF+CCA [12]. More recently, Geng et al. [6], [37] labeled each face image with a label distribution rather than a single age label, so that both the real age and its adjacent ages contribute to the learning. The works in [38], [39] also integrate the idea of label distribution into deep learning frameworks and achieve promising performance.
Recently, deep learning has achieved a lot of success in age estimation. Yi et al. [14] deployed many parallel CNNs with multi-scale face images for age estimation. Malli et al. [18] estimated apparent ages with age grouping to account for multiple labels per image; however, this work needs an ensemble of models to predict the exact age, which is relatively tedious. Antipov et al. [40] developed a children-specialized deep learning method for apparent age estimation and achieved the best performance at the Chalearn Looking At People (LAP) challenge 2016. Niu et al. [21] cast age estimation as an ordinal regression problem with a multiple-output CNN, which achieved the state-of-the-art result on the MORPH II database. Zhu et al. [19] first used an age group classifier to acquire the coarse age range of a face image with a CNN, and then employed multiple local age estimators to predict the exact age. Liu et al. [41] exploited a general-to-special transfer learning scheme for age estimation based on GoogleNet [42]. Rothe et al. [9] proposed a Deep EXpectation (DEX) method for apparent age estimation based on the VGG-16 architecture [28] and won first place at the Chalearn LAP challenge 2015. However, DEX only conducts a refinement that fuses all age information in the prediction phase, neglecting the correlation between different ages in the training stage.
In this work, the correlation between adjacent ages is explored by grouping and training adjacent ages together. Different from previous grouping-based methods, which estimate the age of a facial image through an ensemble of models or cascaded structures, the proposed method estimates age with a single network based on well-designed group-n encoding and decoding processes. To the best of our knowledge, this is the first work to conduct age estimation with a single network based on age group classification.

Fig. 3. Example of grouping results with Age Group-3 Encoding for the age set {0, 1, ..., 100}. There are 103 groups in total and each age corresponds to 3 groups.
3 OUR METHOD
The pipeline of our method for age estimation is shown in Fig. 1. Our method mainly consists of fine-grained age grouping, age group classification and age decoding. The full procedure is given in Algorithm 1.
3.1 Fine-grained Age Grouping
Unlike previous age grouping methods where each age corresponds to one group, we introduce a novel age grouping method called Age Group-n Encoding (AGEn) for age estimation, where each face image is assigned to n groups. The grouping rules are:
1. Given the age set $Y = \{l_0, l_1, \ldots, l_K\}$, we group the ages into $T$ ($T = K + n$) groups. Note that $l_0$ and $l_K$ are the minimum and maximum ages, respectively, and $l_0 < l_1 < \cdots < l_K$.
2. Age $l_i$ is assigned to groups $i, i+1, \ldots, i+n-1$, so that each age corresponds to n groups. Each group includes at least one and at most n ages.
Figure 3 gives a grouping example for $K = 100$ and $n = 3$ with age set $\{0, 1, \ldots, 100\}$. According to our grouping rules, each age is encoded into a unique group set, which is essential for the prediction stage, a decoding process from groups back to age. To facilitate later parts of the paper, $C_a = \{c_0, c_1, \ldots, c_{n-1}\}$ denotes the indices of the groups that age $a$ belongs to, and $S_t$ denotes the set of ages that the $t$-th group includes. For example, as shown in Fig. 3, $C_1 = \{1, 2, 3\}$ indicates that age 1 is assigned to groups 1, 2 and 3, and $S_3 = \{1, 2, 3\}$ denotes that group 3 consists of ages 1, 2 and 3.
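As a concrete illustration, the two grouping rules above can be sketched in a few lines of Python (the function name and data structures are ours, not from the paper's released code):

```python
def agen_encode(ages, n):
    """Age Group-n Encoding (AGEn): assign each age to n overlapping groups.

    ages: sorted list [l_0, ..., l_K]. Returns (C, S) where C[a] is the set
    of group indices that age a belongs to, and S[t] is the set of ages
    contained in group t.
    """
    K = len(ages) - 1
    T = K + n                                   # total number of groups
    # Rule 2: age l_i goes to groups i, i+1, ..., i+n-1.
    C = {ages[i]: set(range(i, i + n)) for i in range(K + 1)}
    # Invert the mapping to obtain the ages contained in each group.
    S = {t: {a for a, groups in C.items() if t in groups} for t in range(T)}
    return C, S

# Fig. 3: ages 0..100 with n = 3 give T = 103 groups.
C, S = agen_encode(list(range(101)), n=3)
print(C[1])   # {1, 2, 3}: age 1 is assigned to groups 1, 2 and 3
print(S[3])   # {1, 2, 3}: group 3 consists of ages 1, 2 and 3
```

Since every age maps to a distinct set of n consecutive group indices, the encoding is invertible, which is exactly what the decoding stage of Section 3.3 relies on.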
3.2 Age Group Classification
The network architecture of age group classification, called
Multiple Outputs CNN (MO-CNN), is illustrated in Fig. 2.
The network includes multiple output layers, where each
output layer corresponds to a binary classification task that
judges whether the input sample belongs to the age group
or not. Assume we have a training set with $N$ samples, where each sample carries a chronological age label and $T$ age group labels, with $T = K + n$ when the Age Group-n Encoding strategy is employed. Each sample is represented as $(x_i, y_i, \{g_i^t\}_{t=0}^{T-1})$, where $x_i \in \mathbb{R}^d$ is the $i$-th sample, $y_i \in Y$ is the age label of $x_i$ and $g_i^t \in G = \{0, 1\}$ is the age group label indicating whether the $i$-th sample belongs to age group $t$ or not: if $x_i$ belongs to age group $t$, $g_i^t = 1$; otherwise, $g_i^t = 0$. As shown in Fig. 2, the network extracts a high-level feature $x_i^l$ through a sequence of non-linear mappings with a set of parameters $W = \{W_i\}_{i=0}^{l}$, where $W_i$ represents the weights of layer $i$. With the shared representation $x_i^l$, we conduct the group classifications via multiple binary classifiers with parameters $\bar{W} = \{\bar{W}_t\}_{t=0}^{T-1}$, where $\bar{W}_t$ denotes the weights of the $t$-th classifier. Thus, the parameters of the whole network are denoted $\{W, \bar{W}\}$.
For each classifier, the cross-entropy loss is used as the loss function; thus the objective function of the $t$-th classifier can be written as

$$J_t = -\frac{1}{N}\sum_{i=0}^{N-1}\sum_{m=0}^{1} 1\{g_i^t = m\}\,\log p(g_i^t = m \mid x_i, W, \bar{W}) \quad (1)$$

where $p(g_i^t = m \mid x_i, W, \bar{W}) = \frac{\exp\{(\bar{W}_t^m)^T x_i^l\}}{\sum_j \exp\{(\bar{W}_t^j)^T x_i^l\}}$ is the softmax function and $\bar{W}_t^j$ denotes the $j$-th column of the parameter matrix $\bar{W}_t$ of the $t$-th task.
However, the data distribution is extremely unbalanced for each classifier, and training unevenly could jeopardize the whole model. Each sample in a binary classifier has two states: belonging to the group (a positive sample) or not (a negative sample). As shown in Fig. 4, the number of positive samples is much smaller than that of negative samples. To alleviate the imbalanced data learning problem, we impose penalty factors that penalize positive and negative samples to different degrees for each task. The penalty coefficients are represented as $\rho = \{\rho_t^0, \rho_t^1\}_{t=0}^{T-1}$, where $\rho_t^0$ is the penalty coefficient for negative samples and $\rho_t^1$ for positive samples. Thus, the objective function of the $t$-th task is

$$J_t = -\frac{1}{N}\sum_{i=0}^{N-1}\sum_{m=0}^{1} 1\{g_i^t = m\}\,\rho_t^m \log p(g_i^t = m \mid x_i, W, \bar{W}) \quad (2)$$

Therefore, we can balance the contributions of positive and negative samples by adjusting the magnitudes of the penalty coefficients.
We have $T$ binary classification tasks altogether, and each task corresponds to an output layer. Let $\alpha_t$ denote the importance of the $t$-th task; the objective function of the whole CNN can then be written as

$$J = -\frac{1}{N}\sum_{i=0}^{N-1}\sum_{t=0}^{T-1}\sum_{m=0}^{1} \alpha_t\, 1\{g_i^t = m\}\,\rho_t^m \log p(g_i^t = m \mid x_i, W, \bar{W}) \quad (3)$$

In the training process, we apply stochastic gradient descent (SGD) [43] to search for suitable parameters $\{W, \bar{W}\}$ for our MO-CNN.

Fig. 4. The distribution of positive and negative samples for each age group on the MORPH II training set with AGE3, AGE9 and AGE15. When grouped by AGE3, the distribution is extremely uneven and the negative samples outnumber the positive samples many times over. The number of positive samples in the middle groups increases as n rises, but the imbalance remains serious in the marginal groups.
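The objective in Eq. (3) is straightforward to compute once the T softmax outputs are available; the following NumPy sketch (array shapes and names are our own, not the authors' implementation) makes the roles of the task weights α_t and the penalty coefficients ρ_t^m explicit:

```python
import numpy as np

def mo_cnn_loss(logits, g, alpha, rho):
    """Unified objective of Eq. (3) for one mini-batch.

    logits: (N, T, 2) raw scores of the T binary output layers.
    g:      (N, T) integer group labels g_i^t in {0, 1}.
    alpha:  (T,) importance of each task.
    rho:    (T, 2) penalty coefficients [rho_t^0, rho_t^1] per task.
    """
    N, T = g.shape
    # Per-classifier softmax, as in Eq. (1).
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)          # (N, T, 2)
    # Probability assigned to the observed label m = g_i^t; the indicator
    # in Eq. (3) selects exactly one term per (i, t).
    p_true = np.take_along_axis(p, g[..., None], axis=-1)[..., 0]  # (N, T)
    # Cost-sensitive weight alpha_t * rho_t^{g_i^t}.
    w = alpha[None, :] * rho[np.arange(T)[None, :], g]             # (N, T)
    return -(w * np.log(p_true)).sum() / N
```

With uniform α and ρ this reduces to a plain sum of the cross-entropy losses of Eq. (1); raising ρ_t^1 up-weights the minority (positive) samples of task t, which is how the penalty factor shifts the classifier bias toward the minority class.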
3.3 Age Decoding
We have elaborated a CNN with multiple binary classifiers to determine which groups a face image belongs to. However, only an ambiguous age range can be acquired with this classification framework, so a decoding stage is further developed to obtain the exact age by exploiting the specific mapping relation between ages and age groups. The age decoding stage is explained below.
The objective function, Eq. (3), can be rewritten as

$$J = -\frac{1}{N}\log \prod_{i=0}^{N-1}\prod_{t=0}^{T-1}\prod_{m=0}^{1} p(g_i^t = m \mid x_i, W, \bar{W})^{\alpha_t \rho_t^m 1\{g_i^t = m\}} \quad (4)$$

Removing the negative logarithm and the averaging factor from Eq. (4), our learning procedure actually maximizes

$$p(G \mid X, W, \bar{W}) = \prod_{i=0}^{N-1}\prod_{t=0}^{T-1}\prod_{m=0}^{1} p(g_i^t = m \mid x_i, W, \bar{W})^{\alpha_t \rho_t^m 1\{g_i^t = m\}} \quad (5)$$
where $X = \{x_i\}_{i=0}^{N-1}$ and $G = \{\{g_i^t\}_{t=0}^{T-1}\}_{i=0}^{N-1}$ are the whole dataset and the corresponding group labels, respectively.
In Section 3.1, we used the index set $C_a$ to represent the groups that face images with age $a$ belong to. With this index set, Eq. (5) can be rewritten as

$$p(G \mid X, W, \bar{W}) = \prod_{i=0}^{N-1}\Big[\prod_{t \in C_{y_i}} p(g_i^t = 1 \mid x_i, W, \bar{W})^{\alpha_t \rho_t^1} \cdot \prod_{t \in \bar{C}_{y_i}} p(g_i^t = 0 \mid x_i, W, \bar{W})^{\alpha_t \rho_t^0}\Big] \quad (6)$$
Note that $C_{y_i}$ represents the groups that the face image with age $y_i$ belongs to, and $\bar{C}_{y_i}$ is the complementary set of $C_{y_i}$. It is assumed that the samples are independent of each other. Therefore, we can define the probability that a face image belongs to age $a$ as

$$P(a \mid x_i, W, \bar{W}) = \frac{1}{Z}\prod_{t \in C_a} p(g_i^t = 1 \mid x_i, W, \bar{W})^{\alpha_t \rho_t^1} \cdot \prod_{t \in \bar{C}_a} p(g_i^t = 0 \mid x_i, W, \bar{W})^{\alpha_t \rho_t^0} \quad (7)$$
where $Z$ is the normalization factor ensuring $\sum_{a \in Y} P(a \mid x_i, W, \bar{W}) = 1$. In the training stage, our learning procedure aims to make the probability $P(a \mid x_i, W, \bar{W})$ reach its maximum when $a$ equals the real age label $y_i$. Therefore, the predicted age $y_i^*$ for image $x_i$ is

$$y_i^* = \arg\max_{a \in Y} P(a \mid x_i, W, \bar{W}) \quad (8)$$
Our age decoding method finds the maximum of $P(a \mid x_i, W, \bar{W})$ over the whole age set $Y$ and takes the corresponding age as the final estimate. This is called Global Age Decoding (GAD). However, it leads to an enormous computational burden because it computes the probability for every age and then finds the maximum. Actually, we can get the coarse age range from the age group classification results and then use Local Age Decoding (LAD) to recover the exact age with reduced computational complexity. Assume that group $m$ has the maximal probability $p(g_i^m = 1 \mid x_i, W, \bar{W})$ for image $x_i$, which indicates that $x_i$ is most likely to belong to group $m$. LAD then only compares the probabilities of the ages in $S_m$:

$$y_i^* = \arg\max_{a \in S_m} P(a \mid x_i, W, \bar{W}) \quad (9)$$

We compare GAD and LAD in Section 5.3, which shows that LAD is more efficient.
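In log-space the normalizer Z drops out of the argmax, so LAD can be sketched as follows (a toy re-implementation under our own naming, not the released code; p_pos holds the T positive-class probabilities p(g^t = 1 | x) produced by the network):

```python
import numpy as np

def local_age_decode(p_pos, C, S, alpha, rho):
    """Local Age Decoding, Eq. (9).

    p_pos: (T,) probabilities p(g^t = 1 | x) from the T binary classifiers.
    C:     dict age -> set of its n group indices (from AGEn).
    S:     dict group -> set of ages it contains.
    alpha, rho: task weights and penalty coefficients of Eq. (7).
    """
    log_p1 = np.log(p_pos)          # log p(g^t = 1 | x)
    log_p0 = np.log1p(-p_pos)       # log p(g^t = 0 | x)
    m = int(np.argmax(p_pos))       # most confident group
    T = len(p_pos)

    def log_score(a):
        # Unnormalized log P(a | x) of Eq. (7); Z cancels in the argmax.
        return sum(alpha[t] * (rho[t][1] * log_p1[t] if t in C[a]
                               else rho[t][0] * log_p0[t])
                   for t in range(T))

    # GAD would take the argmax over the whole age set Y;
    # LAD restricts the search to the ages of the best group, S_m.
    return max(S[m], key=log_score)
```

For an output peaked on the groups of one age, the decoder recovers that age while scoring only the (at most n) candidate ages in S_m, which is where the speed-up over GAD comes from.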
4 EXPERIMENTS
In this section, we first introduce the databases and ex-
plain some training details about our experiments. Then we
present the experimental results.
Algorithm 1 The algorithm of the proposed method
Input: the training data $D = \{(x_i, y_i)\}_{i=0}^{N-1}$, and the test data $D^* = \{x_i^*\}_{i=0}^{M-1}$.
Output: the predictions $\{y_i^*\}_{i=0}^{M-1}$ for the test data.
1: Conduct age grouping for the training data $D$ with AGEn, and obtain the group labels $\{\{g_i^t\}_{t=0}^{T-1}\}_{i=0}^{N-1}$, the age group index set $C_a$ for each age $a$ and the age set $S_t$ for each group $t$.
2: Train MO-CNN with $\{(x_i, \{g_i^t\}_{t=0}^{T-1})\}_{i=0}^{N-1}$ to search for the optimal network parameters $\{W, \bar{W}\}$.
3: for $i = 0, 1, \ldots, M-1$ do
4:    Input the face image $x_i^*$ into MO-CNN.
5:    Obtain $\{\{p(g_i^t = m \mid x_i^*, W, \bar{W})\}_{m=0}^{1}\}_{t=0}^{T-1}$.
6:    $m \leftarrow \arg\max_t\, p(g_i^t = 1 \mid x_i^*, W, \bar{W})$.
7:    for $a \in S_m$ do
8:        Compute $P(a \mid x_i^*, W, \bar{W})$ according to Eq. (7).
9:    end for
10:   $y_i^* \leftarrow \arg\max_{a \in S_m} P(a \mid x_i^*, W, \bar{W})$.
11: end for
12: return the predictions $\{y_i^*\}_{i=0}^{M-1}$.
TABLE 1
Summary of the databases used in our experiments. The table contains the age range and the number of images of each database and its splits. Non-face images (e.g., the tattoo images in the MORPH database) are removed in our experiments and are not counted in this table.

Database (total images, age range)   Split                   Images
MORPH (55244, 16-77)
  80-20 protocol (5493)              Train (80% images)      4395
                                     Test (20% images)       1098
  S1-S2-S3 protocol (55244)          S1                      10634
                                     S2                      10634
                                     S3                      33976
FG-NET (1002, 0-69)                  Train                   990 (avg.)
                                     Test                    12 (avg.)
CACD (162941, 14-62)                 Train (1800 celebs)     144792
                                     Val (80 celebs)         7585
                                     Test (120 celebs)       10564
Chalearn LAP 2015 (4691, 3-85)       Train                   2476
                                     Validation              1136
                                     Test                    1079
Chalearn LAP 2016 (7591, 1-89)       Train                   4113
                                     Validation              1500
                                     Test                    1978
IMDB-WIKI (523051, 0-100)            Train                   297163
                                     Val                     10000
4.1 Databases
For real age estimation, we evaluate the proposed method on the FG-NET [24], MORPH II [25] and CACD [26] databases, under both controlled and uncontrolled environments. We also evaluate the performance of the proposed method for apparent age estimation on the Chalearn LAP datasets [27], [44]. The IMDB-WIKI database [5], [9] is also introduced to pretrain our network when evaluating our model on the FG-NET, MORPH and Chalearn LAP datasets. A summary of those databases is given in Table 1, including the age range, the size of each database and its corresponding splits. Fig. 5 shows some sample images from each database. Below, we give a brief introduction to those databases and the test protocols.

Fig. 5. Sample images from the Chalearn LAP, FG-NET, MORPH, CACD and IMDB-WIKI databases. The value below each image is its age label. The FG-NET database includes some old photos (grayscale images), as shown in the second row. The face images of the Chalearn LAP and MORPH databases are taken from ordinary people, while the images of the CACD and IMDB-WIKI databases are from celebrities; this difference is easily seen in the figure. Additionally, the CACD database contains some noise; for example, the second image of this database is wrongly labeled. IMDB-WIKI contains even more noise, such as images with more than one face (see the second image of the IMDB-WIKI examples) or no face (see the last image).
FG-NET The FG-NET dataset contains 1002 color or grayscale face images of 82 subjects. The images were taken in a totally uncontrolled environment, with large variations in lighting, pose and expression. When evaluating on this dataset, we adopt the leave-one-person-out (LOPO) cross-validation strategy following the setup of [5], [33], [45], [46], and report the average performance over the 82 splits.
MORPH II This is probably the largest database with precise age labels and ethnicity information. It includes about 55 thousand face images with ages ranging from 16 to 77 years. In our experiments, we employ two typical protocols for evaluation on MORPH:
According to the test protocol² provided by Yi et al. [14], the MORPH dataset is split into three non-overlapping subsets S1, S2 and S3, obeying the construction rules detailed on the website above. All experiments are run twice: 1) training on S1 and testing on S2+S3; 2) training on S2 and testing on S1+S3. Table 1 shows the number of images in each subset; in either run, the number of training images is about a quarter of the number of testing images. For simplicity, we call this the S1-S2-S3 protocol.
2. http://www.cbsr.ia.ac.cn/users/dyi/agr.html
Following the experimental settings in [21], [45], [46], [47], a subset of 5493 images of people of Caucasian descent is used to reduce the cross-race influence. We randomly split this subset into two non-overlapping parts: 80% of the images for training and 20% for testing. The numbers of training and testing images are also given in Table 1; the number of testing images is a quarter of the number of training images. We call this the 80-20 protocol for convenience.
CACD The Cross-Age Celebrity Dataset (CACD) is the largest public cross-age database, with the celebrity list drawn from the Internet Movie Database (IMDb). The images were collected from search engines using celebrity names and years (2004-2013) as keywords, yielding more than 160 thousand images of 2000 celebrities. However, the database contains much noise because each age was simply estimated from the query year and the birth year of the celebrity. We split the database into three subsets: 1800 noisy celebrities for training, where the number of images is large but the age labels are less precise; 80 cleaned celebrities for validation; and 120 cleaned celebrities for testing, where the images are manually checked and noisy images are removed.
Chalearn LAP The Chalearn LAP challenge is the first competition for apparent age estimation. Each image is labeled by at least 10 users and the average age is used as the final annotation; the dataset also provides the standard deviation of each age label. For the first edition of the Chalearn LAP challenge (2015) [27], the organizers collected 4691 images, split into three subsets: 2476 images for training, 1136 for validation and 1079 for testing. For the second edition (2016) [44], the dataset was extended to 7591 images, with 4113 for training, 1500 for validation and 1978 for testing. In addition to the larger number of images, most ages in the 2016 dataset are not integers and the standard deviations cover a larger range. Some sample images are given in Fig. 5.
IMDB-WIKI IMDB-WIKI [5], [9], which contains 523051 images in total, is the largest dataset for age estimation as far as we know; its images are crawled from celebrity pages on IMDb³ and Wikipedia⁴. However, this dataset contains much noise. Each age label is simply calculated from the date of birth of the corresponding celebrity and the year the photo was taken, so the accuracy of the annotations cannot be guaranteed when a wrong timestamp occurs or the image comes from the wrong celebrity. Additionally, tiny faces, multiple faces or non-face images also occur in the dataset, as shown in Fig. 5. Even though this dataset is not suitable for evaluation, it is still a good dataset for pretraining, because the majority of the annotations are correct. To use the dataset effectively, we select about 300 thousand images according to the settings in [5], where all
3. www.imdb.com
4. en.wikipedia.org
TABLE 2
MAE results for a variety of n and ρ1 on the validation sets. (a) Validation set of S1 on MORPH II. (b) Validation set of S2 on MORPH II. (c) Validation set of the CACD database. (d) Validation set of the Chalearn LAP 2015 database. (e) Validation set of the Chalearn LAP 2016 database.

(a) ρ1 \ n    5      7      9      11
 1          3.61   3.45   3.41   3.32
 2          3.38   3.30   3.21   3.30
 3          3.26   3.28   3.23   3.38
 4          3.63   3.32   3.34   3.45

(b) ρ1 \ n    5      7      9      11
 1          3.44   3.39   3.22   3.27
 2          3.26   3.18   3.17   3.17
 3          3.20   3.22   3.25   3.25
 4          3.18   3.19   3.17   3.32

(c) ρ1 \ n    5      7      9      11
 1          5.52   5.33   5.23   5.32
 2          5.26   5.34   5.43   5.33
 3          5.25   5.41   5.54   5.50
 4          5.45   5.50   5.63   5.49

(d) ρ1 \ n    5      7      9      11
 1          4.97   4.97   4.87   4.90
 2          4.93   4.89   4.86   4.88
 3          4.96   5.04   5.03   4.86
 4          4.99   4.96   4.91   4.99

(e) ρ1 \ n    5      7      9      11
 1          5.41   5.15   4.98   5.01
 2          5.08   4.94   4.99   5.04
 3          5.15   4.97   5.02   5.08
 4          5.08   5.14   5.06   5.15
Fig. 6. The network used for parameter searching. The network is based on AlexNet [43], with the last layer likewise replaced by multiple binary classifiers. More details of the convolution and pooling layers are shown in the figure.
Fig. 7. (a) and (b) show the last two layers of the networks of DEX and
VGG+Euclidean, respectively. The architectures of the lower layers of
DEX and VGG+Euclidean are the same as our network's.
non-face images and part of the images with multiple faces are
removed. Moreover, as shown in Table 1, the selected
images are randomly divided into two parts: 10000 images
for validation and the rest for training.
4.2 Preprocessing and Experimental Setting
Face Alignment Face alignment is helpful for age estimation.
First, all images are processed by a face detector [48],
and a few non-face images are removed, for example,
tattoo images in the Morph II database. Then, active shape
models (ASM) [49] are used to detect facial landmarks, and
all faces are aligned according to the centers of the eyes and
the upper lip. After that, all images are cropped to the size
of 224 × 224 and then fed into the network. Some aligned
images are shown in Fig. 9.
Data Augmentation When evaluating on the FG-NET,
MORPH II and Chalearn LAP databases, the training images
are extremely insufficient. For example, fewer than five
thousand images are used for training when evaluating on
the Morph dataset with the 80-20 protocol. The training set
of the Chalearn LAP 2015 dataset contains no more than three
thousand images, which is even more inadequate. Therefore,
increasing the number of training samples is necessary to improve
the performance. Usually, there are two ways to expand the
training data. One is to enrich the training set with other
datasets; for example, we usually pretrain the network on
other, larger datasets to improve its performance. The other
is to add virtual image samples. The first is a well-known
technique, so we mainly introduce the method used to
generate virtual images in our experiments. Here, we augment
the training images with flipping, rotating by ±5° and ±10°,
and adding Gaussian white noise with variances of 0.001,
0.005, 0.01, 0.015 and 0.02. The total number of images is
increased by 36 times after augmentation. However, data
augmentation is only conducted for FG-NET, MORPH II and
the Chalearn LAP datasets, since it is not necessary for the
CACD database.
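The 36× factor is consistent with six geometric variants (the original, a horizontal flip, and the four rotations) combined with six noise variants (clean plus the five noise levels); this factoring is our assumption, not stated explicitly in the text. A minimal NumPy sketch enumerating the combination (rotations are only tagged, since applying them requires an image library such as scipy.ndimage):

```python
import itertools
import numpy as np

def augment(image):
    """Enumerate 36 augmented copies of one training image:
    6 geometric variants x 6 noise variants (an assumed factoring
    of the paper's "increased by 36 times")."""
    geometric = ["orig", "flip", "rot+5", "rot-5", "rot+10", "rot-10"]
    variances = [0.0, 0.001, 0.005, 0.01, 0.015, 0.02]  # 0.0 = no noise
    copies = []
    for geo, var in itertools.product(geometric, variances):
        img = image.astype(float).copy()
        if geo == "flip":
            img = np.fliplr(img)
        # The +-5 and +-10 degree rotations would be applied with an
        # image library (e.g. scipy.ndimage.rotate); tagged only here.
        if var > 0.0:
            img = img + np.random.normal(0.0, np.sqrt(var), img.shape)
        copies.append((geo, var, img))
    return copies

copies = augment(np.zeros((224, 224)))
print(len(copies))  # 36
```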
Experimental Setting We train the deep network with
a weight decay of 0.0005 and a momentum of 0.9. The
learning rate starts from 0.001 and is reduced by a factor of
10 as the number of iterations increases. We set
α_t = 0.1 for all tasks. The AGE7 grouping strategy is used
when experimenting on the Chalearn 2016 dataset and AGE9
is used for the others. Moreover, we set ρ1_t to 1 for the
experiments on the CACD dataset and to 2 for the rest of the
experiments. More details on the settings of AGEn and the
parameters of the balance strategy can be found in Section 4.4.
Our algorithm is implemented within the Caffe framework
[50] on a TITAN X GPU. For all experiments, the VGG-16
network is first initialized with the weights from training on
the ImageNet dataset. For some experiments, the network
is additionally pretrained on the IMDB-WIKI dataset, which
we make explicit in the text.
4.3 Evaluation Metrics
For real age estimation, the Mean Absolute Error (MAE)
and Cumulative Score (CS) are usually used as evaluation
metrics. MAE indicates the mean absolute error between the
predicted result and the ground truth over the testing set, and it
is calculated as

MAE = (1/m) Σ_{i=0}^{m−1} |y'_i − y_i|    (10)

where y'_i denotes the predicted age for the i-th image and m
is the number of testing face images. MAE is the most
frequently used evaluation metric, and obviously, a lower
MAE means a better performance. CS(n) is computed
as follows

CS(n) = m_{e≤n} / m    (11)

where m_{e≤n} represents the total number of test images
whose absolute error between the predicted result and
the ground truth is not greater than n years. Obviously, the
higher the CS(n), the better the performance.
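The two metrics in Eqs. (10) and (11) amount to a few lines of NumPy; the sample predictions below are made-up values purely for illustration:

```python
import numpy as np

def mae(pred, truth):
    # Eq. (10): mean absolute error over the m test images.
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return np.mean(np.abs(pred - truth))

def cs(pred, truth, n):
    # Eq. (11): fraction of test images whose absolute error <= n years.
    err = np.abs(np.asarray(pred, float) - np.asarray(truth, float))
    return np.mean(err <= n)

preds, labels = [23, 31, 47, 60], [25, 30, 45, 52]  # hypothetical values
print(mae(preds, labels))    # 3.25
print(cs(preds, labels, 2))  # 0.75
```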
Fig. 8. (a) CS comparisons on FG-NET. (b) CS comparisons on MORPH II with 80-20 protocol when training with 80% images and testing with
20% images. (c) CS comparisons on MORPH II with S1-S2-S3 protocol. The experiments are repeated twice: 1) training with S2 and testing with
S1+S3; 2) training with S1 and testing with S2+S3, and the average CS performance is reported. (d) CS comparisons on CACD when training with
1800 celebrities and testing with 120 celebrities.
TABLE 3
The comparisons between the proposed method and other
state-of-the-art methods on the MORPH II database with the S1-S2-S3
protocol.

Method                 Train Set  Test Set  MAE   Avg. MAE
Ours (IMDB-WIKI)       S1         S2+S3     2.82  2.70
                       S2         S1+S3     2.58
Ours                   S1         S2+S3     3.04  2.86
                       S2         S1+S3     2.68
Soft softmax [38]      S1         S2+S3     3.14  3.03
  (IMDB-WIKI)          S2         S1+S3     2.92
Soft softmax [38]      S1         S2+S3     3.24  3.14
                       S2         S1+S3     3.03
Multi-scale CNN [14]   S1         S2+S3     3.72  3.63
                       S2         S1+S3     3.54
BIF+KCCA [12]          S1         S2+S3     4.00  3.98
                       S2         S1+S3     3.95
BIF+KPLS [11]          S1         S2+S3     4.07  4.04
                       S2         S1+S3     4.01
BIF+rCCA [12]          S1         S2+S3     4.43  4.42
                       S2         S1+S3     4.40
BIF+PLS [11]           S1         S2+S3     4.58  4.56
                       S2         S1+S3     4.54
CNN [51]               S1         S2+S3     4.64  4.60
                       S2         S1+S3     4.55
BIF+KSVM [12]          S1         S2+S3     4.89  4.91
                       S2         S1+S3     4.92
BIF+LSVM [12]          S1         S2+S3     5.06  5.09
                       S2         S1+S3     5.12
BIF+CCA [12]           S1         S2+S3     5.39  5.37
                       S2         S1+S3     5.35
For apparent age estimation, the ϵ-error, proposed by the
Chalearn LAP competition, is used as a quantitative measure.
The ϵ-error is computed as

ϵ = 1 − exp(−(x − µ)² / (2σ²)).    (12)

It not only measures the error between the predicted value
x and the average labeled age µ, but also takes into
consideration the standard deviation σ of the annotations. The
final ϵ-error is the average over all predictions. Of course,
a lower ϵ-error means a better performance, and it reaches 0
when a perfect prediction is achieved.
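Eq. (12) can be checked numerically in a few lines (the per-image measure only; averaging over all predictions is as stated above):

```python
import math

def epsilon_error(x, mu, sigma):
    # Eq. (12): 1 - exp(-(x - mu)^2 / (2 sigma^2)).
    return 1.0 - math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# A perfect prediction scores 0; a fixed error is penalized less when
# the annotators themselves disagreed more (larger sigma).
print(epsilon_error(30.0, 30.0, 3.0))            # 0.0
print(round(epsilon_error(33.0, 30.0, 3.0), 3))  # 0.393, i.e. 1 - exp(-0.5)
```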
4.4 Parameters Discussion
As shown in Fig. 11, the distributions of MORPH II and
CACD databases differ greatly. We believe that the optimal
TABLE 4
The results on Morph II database with 80-20 protocol and FG-NET
database. Our method achieves the state-of-the-art performance on
both databases.
Method Morph II FG-NET
Human workers [52] 6.30 4.70
AGES [7] 8.83 6.77
MTWGP [53] 6.28 4.83
CA-SVR [46] 5.88 4.67
OHRank [45] 5.69 4.85
DLA [47] 4.77 4.26
VGG+SVR [54] 3.45
VGG+Euclidean 3.49 4.77
VGG+Euclidean (IMDB-WIKI) 3.15 4.30
DEX [5] 3.25 4.63
DEX (IMDB-WIKI) [5] 2.68 3.09
Ours 2.93 4.34
Ours (IMDB-WIKI) 2.52 2.96
parameters of the model are closely related to the training data
distributions. In this subsection, we find an appropriate age
grouping range n for the age grouping strategy and penalty
coefficient ρ for the data balance strategy by conducting
experiments on the validation sets with a variety of n and ρ.
The penalty coefficient is ρ = {ρ0_t, ρ1_t}_{t=0}^{T−1}. We assume that it
is the same for all tasks, so it can be written as ρ = {ρ0, ρ1}.
However, searching over both ρ0 and ρ1 would take a lot of effort. For
the sake of simplicity, we set ρ0 = 1 and only change the
value of ρ1 in the parameter searching process.
The CACD and Chalearn LAP 2015 & 2016 datasets offer
validation sets. Thus, we directly evaluate the model
on their validation sets to choose the appropriate parameters.
However, since no validation set is offered in the MORPH II
dataset, we randomly select 2000 images from its training set
as a validation set. These images are therefore not used
for training in the parameter searching process. Random
selection also ensures that the distribution of the training data
remains unchanged. Since training with the VGG-16 network
consumes a lot of time, we conduct the experiments with a
shallower network, based on AlexNet [43], which is shown
in Fig. 6.
The results on the validation sets are shown in Table 2.
From the results, we adopt the AGE9 strategy with ρ1 =
2, 1 and 2 for the Morph II, CACD and Chalearn LAP
2015 datasets, respectively. Moreover, AGE7 and ρ1 = 2
are adopted for the Chalearn LAP 2016 dataset. The FG-NET
database contains too few images, and all of its images would
TABLE 5
Comparisons with the state-of-the-art methods on the Chalearn LAP 2015 dataset. The proposed method achieves the state-of-the-art
performance. (↓: the smaller the better.)

Rank  Team                 Validation Set1     Test Set2           Pretrain Set                         Network    Num. of Networks
                           MAE↓    ϵ-error↓    MAE↓    ϵ-error↓
–     Ours                 3.21    0.28        2.94    0.263547    IMDB-WIKI                            VGG-16     8
1     CVL ETHZ [5], [9]    3.25    0.28        –       0.264975    IMDB-WIKI                            VGG-16     20
2     ICT-VIPL [41]        3.33    0.29        –       0.270685    FG-NET, Morph, CACD, et al.          GoogleNet  8
3     WVU CVL [19]         –       0.31        –       0.294835    FG-NET, Morph, CACD, et al.          GoogleNet  5
4     SEU NJU [55]         –       0.34        –       0.305763    FG-NET, Morph, Adience [56], et al.  GoogleNet  6
–     human reference      –       –           –       0.34        –                                    –          –
5     UMD                  –       –           –       0.373352    –                                    –          –
6     Enjuto               –       –           –       0.374390    –                                    –          –
7     Sungbin Choi         –       –           –       0.420554    –                                    –          –
8     Lab219A              –       –           –       0.499181    –                                    –          –
9     Bogazici             –       –           –       0.524055    –                                    –          –
10    Notts CVLab          –       –           –       0.594248    –                                    –          –

1 The performance on the validation set is tested based on a single network.
2 The performance on the test set is evaluated by an ensemble of multiple networks, where the number of networks used is shown in the last
column of the table.
TABLE 6
Comparisons with the state-of-the-art methods on the Chalearn LAP 2016 dataset. (↓: the smaller the better.)

Rank  Team              Test Set            Pretrain Set                                   Network  Num. of Networks
                        MAE↓    ϵ-error↓
–     Ours              3.82    0.3100      IMDB-WIKI                                      VGG-16   1
1     OrangeLabs [40]   –       0.2411      cleaned IMDB-WIKI, a private children dataset  VGG-16   14
2     palm seu [57]     –       0.3214      IMDB-WIKI                                      VGG-16   4
3     cmp+ETH [58]      –       0.3361      IMDB-WIKI                                      VGG-16   10
4     WYU CVL           –       0.3405      –                                              –        –
5     ITU SiMiT [18]    –       0.3668      IMDB-WIKI                                      VGG-16   3
6     Bogazici [59]     –       0.3740      –                                              VGG-16   8
7     MIPAL SNU         –       0.4569      –                                              –        –
8     DeepAge           –       0.4573      –                                              –        –
be used for evaluation with the LOPO (leave-one-person-out) strategy.
Therefore, we use n = 9 and ρ1 = 2 for FG-NET based on our experience,
since these two parameters perform well in most cases.
We can see that AGE9 is a relatively stable grouping
strategy, and the model achieves promising results with
the AGE9 strategy on most validation sets. For the grouping range
n, when n is smaller, the relationship between adjacent
ages cannot be explored thoroughly, and the imbalanced
data problem between images belonging to a group or not
is more serious because each group includes fewer images.
When n is bigger, the images within a group show greater
diversity, which is harmful to the model. Thus, the
AGE9 strategy may perform well because n = 9 is an
appropriate grouping value that achieves a good tradeoff
between the above two aspects.
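The tradeoff can be made concrete with the membership structure of AGEn. The sketch below reflects our reading of the encoding (group k covering ages [k, k+n−1], clipped at the boundaries of the age range, so that each interior age falls into exactly n consecutive groups); the exact window convention is an assumption, not code from the paper:

```python
def age_groups(age, n, min_age=0, max_age=100):
    """Indices of the groups covering `age` under a sliding-window
    reading of AGEn: group k covers ages [k, k + n - 1], so an
    interior age lies in n consecutive groups (fewer near the
    boundaries of [min_age, max_age])."""
    first = max(min_age, age - n + 1)        # earliest window containing `age`
    last = min(age, max_age - n + 1)         # latest window containing `age`
    return list(range(first, last + 1))

print(age_groups(30, 9))       # [22, 23, 24, 25, 26, 27, 28, 29, 30]
print(len(age_groups(30, 9)))  # 9: an interior age belongs to n groups
```

A larger n means each group pools more images (easing the per-classifier imbalance) but also spans a wider, more diverse age range, which is the tension discussed above.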
4.5 Comparisons
4.5.1 Real Age Estimation
In this section, we conduct comprehensive evaluations of the
proposed method on the Morph, FG-NET and CACD datasets
for real age estimation.
Results on MORPH II with S1-S2-S3 Protocol. The
proposed method achieves an average MAE of 2.86 without
pretraining on any additional age dataset. It reduces the
MAE by 0.17 compared with the previous state-of-the-art
result reported in [38] (see Table 3). To the best of our
knowledge, this is the first reported MAE below 3 years
under this protocol. Pretraining on the IMDB-WIKI dataset
further improves the performance, achieving an MAE
of 2.70 years. The CS results are shown in Fig. 8, and our
method achieves the best performance.
Results on MORPH II with 80-20 Protocol. Usually, age
estimation can be treated as a classification or a regression
problem. We take two baseline methods, one for age classification
and one for age regression, for comparison under this protocol. For age
classification, each age is regarded as an independent class.
We take Deep EXpectation (DEX) [5], [9] as the baseline
method for age classification. DEX is one of the most popular
methods for age estimation and won the first prize of
the ChaLearn Looking At People ICCV 2015 challenge [60].
For age regression, we take the classic regression method
as the baseline for comparison, where the Euclidean loss is
employed as the loss function. For a fair comparison, the
network architectures of DEX and the regression-based method
are the same as our MO-CNN's except for the output layer,
as shown in Fig. 7.
From Table 4, our method achieves the state-of-the-art
performance with an MAE of 2.93 when directly finetuning
on the Morph dataset. As far as we know, it is also the first work
that reduces the MAE to under 3 years without finetuning
on an additional age dataset. To further improve the performance,
the network is first finetuned on the IMDB-WIKI
dataset before finetuning on the Morph dataset; the proposed
method then achieves an MAE of 2.52 years, which improves
the previous state-of-the-art result by 0.18 years. Besides, the
CS comparisons with the state-of-the-art methods are shown
in Fig. 8, where our approach again shows its superiority.
TABLE 7
The comparisons on CACD dataset.
Method Train Set Test Set Avg. MAE
Ours 1800 celebs 120 celebs 4.68
DEX [5] 1800 celebs 120 celebs 4.79
VGG+Euclidean 1800 celebs 120 celebs 5.08
Results on FG-NET. Because the FG-NET dataset contains
only 1002 images, we first pretrain our network on the IMDB-WIKI
dataset and then finetune on FG-NET. The two baseline
methods are also included for comparison. As shown
in Table 4, our method achieves the state-of-the-art performance
on the FG-NET database with an average MAE of 2.96,
which improves the previous state-of-the-art result by 0.13.
The CS comparisons are shown in Fig. 8, and the proposed
method also performs better than the other methods.
Results on CACD. Only a few works conduct evaluations
on the CACD database because of its noise. Here, we
compare our result with two baseline methods,
VGG+Euclidean regression and DEX. The comparisons are
shown in Table 7. Our method achieves the best performance
with the lowest MAE of 4.68 years. When CS is taken
as the criterion, our method also performs much better than
the other methods, as shown in Fig. 8. This indicates that our
method is capable of estimating age from face images in the
wild. Note that we do not finetune our network on the IMDB-WIKI
dataset because some images are duplicated between IMDB-WIKI
and CACD.
4.5.2 Apparent Age Estimation
In this section, the evaluation on the Chalearn LAP datasets is
presented.
Results on Chalearn LAP 2015. As a competition dataset
for apparent age estimation, the Chalearn LAP dataset differs
from other public datasets. Following the tricks used
in [5], [9], [41], we finetune our network on both the training
and validation sets after finetuning on a large additional
age dataset, i.e., the IMDB-WIKI dataset. In the test phase, each
image is flipped and then rotated by 0°, ±5°; thus each
image is tested 6 times and those predictions are averaged.
Note that all results in this paper, except those on the Chalearn LAP
dataset, are based on a single test image. To further improve
the performance, an ensemble of 8 networks
is employed, and we take the average of their predictions
as the final estimated age. The ensemble technique is
only used when evaluating on the test set of the Chalearn LAP
dataset. We also report the performance on the validation
set with only finetuning on the training set.
The experimental results are shown in Table 5. The
proposed method achieves a better performance than the other
teams with a final ϵ-error of 0.263547. On the validation set, our
method also achieves a lower MAE and ϵ-error based on a
single network. Because many tricks are employed in
this evaluation, more training details are presented in Section
5.1.
Results on Chalearn LAP 2016. Different from Chalearn
LAP 2015, most ages in the Chalearn LAP 2016 dataset are not
integers. If we train the network with rounded ages, much
information would be sacrificed. To reduce the information
loss, we follow the works [40], [41] and encode each age label y
Fig. 10. The visualization of age and age group probabilities.
TABLE 8
The comparisons between GAD and LAD. The experiment is
conducted on Morph II dataset with 80-20 protocol.
Methods GAD LAD
Avg. MAE 2.52004 2.52004
Time (per image) 51.7ms 4.6ms
and its corresponding deviation σ into a label distribution.
The distribution is a set of probabilities representing the
description degrees of their corresponding labels, which is
defined as follows:

P(l_i) = (1 / Z_{y,σ}) · (1 / √(2πσ²)) · exp(−(l_i − y)² / (2σ²)),  i = 0, ..., K    (13)

where Z_{y,σ} is the normalization factor related to the age label
y and its deviation σ. We generate a random age label for
each image according to its label distribution and regard the
random age label as the ground truth label in the training
stage. Other experimental settings are the same as Chalearn
LAP 2015's.
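A minimal sketch of Eq. (13) and of sampling a training label from it, assuming integer candidate labels 0..K with K = 100; the 1/√(2πσ²) factor is constant across labels, so it can be folded into the normalization constant Z:

```python
import math
import random

def label_distribution(y, sigma, K=100):
    # Eq. (13): discretized Gaussian centered at the (possibly
    # non-integer) apparent age y, normalized so it sums to 1.
    raw = [math.exp(-((l - y) ** 2) / (2.0 * sigma ** 2)) for l in range(K + 1)]
    z = sum(raw)  # plays the role of Z_{y,sigma}
    return [p / z for p in raw]

def sample_age(y, sigma, K=100, rng=random):
    # Draw a random integer age label according to the distribution;
    # this sampled label serves as the ground truth for one training pass.
    dist = label_distribution(y, sigma, K)
    return rng.choices(range(K + 1), weights=dist, k=1)[0]

dist = label_distribution(27.4, 2.0)
print(abs(sum(dist) - 1.0) < 1e-9)             # True: normalized
print(max(range(101), key=lambda l: dist[l]))  # 27, the nearest integer age
```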
We find that the performance of the methods [18], [40],
[57], [58] varies between the validation set and the test set; for example,
OrangeLabs's method did not achieve the best performance
on the validation set but outperformed the other methods by a
large margin on the test set. Therefore, we only conduct the
evaluation on the test set for a consistent comparison. The
comparisons are reported in Table 6. Our method achieves
an epsilon error of 0.3100 on the test set
based on a single network, which is the second best result,
next only to OrangeLabs's [40]. OrangeLabs's method achieves
a better performance mainly for the following
reasons: first, they pretrained their network on a cleaned
IMDB-WIKI dataset that was arranged and annotated by
26 persons over a few days; second, they manually
collected a private dataset with a considerable quantity of
images of children, and trained 3 separate models
for estimating the apparent ages of children using this children
dataset; third, they used an ensemble of multiple models to
boost the performance.
4.6 Computation Time Analysis
We train an age group classification network that treats adjacent
ages as an independent class. Then a decoding process
(LAD or GAD) is used to obtain the probability of each
age. In this subsection, we mainly analyze the accuracy
and computational efficiency of the GAD and LAD
methods. The comparative experiments are conducted on the
MORPH II database on a CPU, and we only compare the
time consumed in the decoding phase. In decoding, there
Fig. 9. The original and aligned images of Chalearn LAP, FG-NET, Morph and CACD databases. The predicted ages of both good and bad
estimation are given in the figure. Note that the predicted age on Chalearn LAP dataset is not an integer due to the averaging of the predictions of
the augmented testing images and an ensemble of networks.
TABLE 9
Some training details of our method on the Chalearn dataset. All results are obtained by finetuning on both the training and validation sets,
and then testing on the test set.

Crop size of     Data augmentation  IMDB-WIKI    Data augmentation  Num. of   MAE   ϵ-error
training images  on training set    pretraining  on testing set     networks
224×224          No                 No           No                 1         4.64  0.4027
224×224          Yes                No           No                 1         4.30  0.3709
224×224          Yes                Yes          No                 1         3.08  0.2789
224×224          Yes                Yes          Yes                1         2.97  0.2669
224×224          Yes                Yes          Yes                8         2.94  0.2635
TABLE 10
The comparisons between with and without the grouping and decoding components on the Morph II dataset under the 80-20 protocol.

      With grouping and decoding   Without grouping and decoding
ρ1    –                            1     2     3     4     5     6     7     8     9     10    11
MAE   2.52*                        2.89  2.82  2.79  2.78  2.74  2.73  2.70  2.75  2.71  2.71  2.76

* The result is obtained using AGE9 and ρ1 of 2.
are only two terms changed between the probabilities of a
and a+1 according to Eq. (7). To avoid a decrease in performance
due to rounding errors in the continuous calculation
process, P(a), P(a+1), ... are not calculated sequentially.
Instead, we compute P(a) for each a in the whole age set
Y (GAD) or in the age group set S_m with the maximal
classification probability (LAD). We find that LAD spends less
time while achieving the same performance as GAD. As
shown in Table 8, LAD only needs 4.6 ms to analyze one face
image while GAD needs 51.7 ms, which speeds up the
decoding by about 10 times. The visualization of
the age probabilities with both LAD and GAD is shown
in Fig. 10 (for a randomly selected sample from the test set). The
decoding results are virtually identical.
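Since Eq. (7) is defined earlier in the paper, the sketch below is only an illustrative stand-in, not the paper's exact formulation: each binary classifier k is assumed to output the probability that the age lies in group [k, k+n−1], a candidate age is scored by a naive log-likelihood over all classifiers, and GAD searches the whole age range while LAD restricts the search to the ages covered by the most confident group:

```python
import math

def log_score(a, probs, n):
    """Hypothetical log-probability of age `a` given per-group classifier
    outputs `probs` (probs[k] ~ P(age in group k), group k covering ages
    [k, k + n - 1]); a stand-in for the paper's Eq. (7)."""
    s = 0.0
    for k, p in enumerate(probs):
        inside = k <= a <= k + n - 1
        p = min(max(p, 1e-9), 1.0 - 1e-9)  # clamp to avoid log(0)
        s += math.log(p) if inside else math.log(1.0 - p)
    return s

def decode(probs, n, local=True):
    if local:  # LAD: search only ages covered by the most confident group
        k = max(range(len(probs)), key=probs.__getitem__)
        candidates = range(k, k + n)
    else:      # GAD: search the entire age range
        candidates = range(len(probs) + n - 1)
    return max(candidates, key=lambda a: log_score(a, probs, n))

# Synthetic outputs peaking around age 30 (groups 22..30 all contain 30).
probs = [0.9 if 22 <= k <= 30 else 0.1 for k in range(92)]
print(decode(probs, 9, local=True), decode(probs, 9, local=False))  # 30 30
```

On such peaked outputs both searches return the same prediction, consistent with the identical MAEs in Table 8, while LAD evaluates only n candidates instead of the whole age range.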
5 DISCUSSION
5.1 Exploring Training Details
Many tricks are employed when evaluating on the
Chalearn LAP dataset, e.g., pretraining, data augmentation
and an ensemble of networks. In this section, a step-by-step
investigation is conducted to explore the contribution of each
trick.
As shown in Table 9, pretraining on the IMDB-WIKI
dataset is very helpful and reduces the ϵ-error
from 0.3709 to 0.2789. This significant improvement shows
that the IMDB-WIKI dataset is still useful even though
it contains much noise. Our data augmentation on the
Fig. 11. The distribution of training sets.
training set also makes a great contribution: the ϵ-error
drops by about 0.032 with the training data augmentation.
It is worth noting that the proposed method achieves
an ϵ-error of 0.2669 with a single network, which is very
close to the best result of the Chalearn LAP competition [5], [9].
5.2 Detailed comparison with DEX
To compare with the DEX method thoroughly, we re-implement
DEX with the same experimental settings, where both face
alignment and data augmentation are used. The network of the
re-implemented DEX method is the same as ours except for the
last layer, as shown in Fig. 7. We conduct the comparisons on the
FG-NET, Morph II and CACD datasets. When experimenting
on the FG-NET and Morph II datasets, the networks are first
pretrained on the IMDB-WIKI dataset. As shown in Table
11, our method still performs better than the DEX method
on those datasets when adopting the same experimental
settings. Furthermore, we also implement our method with
the same training settings as DEX's. Besides selecting the
part of the IMDB-WIKI images with little noise for
pretraining, Rothe et al. [5] also equalized the age
distribution of the selected images to improve the model's
generalization capability. However, they did not make the
list of pretraining images public to the community. Therefore,
for a fair comparison, we did not pretrain the model
on the IMDB-WIKI dataset when conducting experiments with
the same training settings as DEX's. The comparisons on the
FG-NET, Morph and CACD datasets are shown in Table
11; our method achieves a better performance. No matter
which training settings are employed, our method shows
superiority over DEX.
TABLE 11
The comparisons between our method and DEX on the FG-NET, Morph II
and CACD datasets. Note that here we adopt the 80-20 protocol when
evaluating on the Morph II dataset.

        Our training settings         DEX's training settings
Method  FG-NET  Morph II  CACD       FG-NET  Morph II  CACD
Ours    2.96    2.52      4.68       4.30    3.01      4.73
DEX     3.01    2.66      4.75       4.63    3.25      4.79
5.3 Ablation Study
In this section, we conduct an ablation analysis of
the grouping and decoding components of the proposed
method. We train the network with multiple classifiers but
without the grouping component, where each classifier
determines whether the input image belongs to the corresponding
age or not. With the grouping stage removed, the predicted
age can be directly obtained via the maximum probability
over the classifiers, so the decoding stage is also dropped.
To make a fair comparison, we also conduct
experiments with a variety of ρ1 values to find an appropriate
one. As shown in Table 10, the minimum MAE only
reaches 2.70 when the grouping and decoding components are
dropped. This means that the model without grouping and
decoding components performs 0.18 years worse than the
model with those components. When the grouping and decoding
components are dropped, each age is regarded as a single age
group and the relationship between adjacent ages cannot be explored
either. All of this results in a decrease in performance. From
this perspective, the grouping and decoding components are
of critical importance to our method.
6 CONCLUSION
In this paper, we propose a deep learning solution for age
estimation based on a single network to account for aging
randomness. First, an age group-n encoding strategy is
proposed to group ages, where adjacent ages are grouped
into the same group and each group is regarded as an
independent class. Then, age group classification is imple-
mented in a CNN with multiple outputs and we recover the
exact age for each face image by decoding the classification
results. Moreover, we modify our algorithm to address the
imbalanced data learning problem. Finally, the evaluations
on multiple age databases show that the proposed method
achieves the state-of-the-art performance.
ACKNOWLEDGMENTS
This work was supported by the National Key Research
and Development Plan (Grant No.2016YFC0801002), the
Chinese National Natural Science Foundation Projects
61502491, 61473291, 61572501, 61572536, 61673052, Sci-
ence and Technology Development Fund of Macau (No.
112/2014/A3, 151/2017/A, 152/2017/A), NVIDIA GPU do-
nation program and AuthenMetric R&D Funds. Zichang
Tan and Jun Wan contributed equally to this paper. Jun Wan
is the corresponding author.
REFERENCES
[1] A. K. Jain and S. Z. Li, Handbook of face recognition. Springer, 2005.
[2] M. Fairhurst, Age Factors in Biometric Processing. Institution of
Engineering and Technology, 2013.
[3] Y. Ma and G. Qian, Intelligent video surveillance: systems and technol-
ogy. CRC Press, 2009.
[4] C. Shan, F. Porikli, T. Xiang, and S. Gong, Video Analytics for
Business Intelligence. Springer, 2012, vol. 1.
[5] R. Rothe, R. Timofte, and L. Van Gool, “Deep expectation
of real and apparent age from a single image without facial
landmarks,” International Journal of Computer Vision, Aug 2016.
[Online]. Available: https://doi.org/10.1007/s11263-016-0940-3
[6] X. Geng, C. Yin, and Z. Zhou, “Facial age estimation by learning
from label distributions,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 35, no. 10, pp. 2401–2412, 2013.
[7] X. Geng, Z. Zhou, and K. Smithmiles, “Automatic age estimation
based on facial aging patterns,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, vol. 29, no. 12, pp. 2234–2240, 2007.
[8] G. Guo, G. Mu, Y. Fu, and T. Huang, “Human age estimation using
bio-inspired features,” in Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on, 2009, pp. 112–119.
[9] R. Rothe, R. Timofte, and L. V. Gool, “Dex: Deep expectation of
apparent age from a single image,” in IEEE International Conference
on Computer Vision Workshops (ICCVW), December 2015.
[10] Y. Fu and T. S. Huang, “Human age estimation with regression on
discriminative aging manifold,” IEEE Transactions on Multimedia,
vol. 10, no. 4, pp. 578–584, 2008.
[11] G. Guo and G. Mu, “Simultaneous dimensionality reduction and
human age estimation via kernel partial least squares regression,”
in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE
Conference on. IEEE, 2011, pp. 657–664.
[12] G. Guo. and G. Mu., “Joint estimation of age, gender and ethnicity:
Cca vs. pls,” in Automatic Face and Gesture Recognition (FG), 2013
10th IEEE International Conference and Workshops on. IEEE, 2013,
pp. 1–6.
[13] T. Liu, Z. Lei, J. Wan, and S. Z. Li, “Dfdnet: discriminant face
descriptor network for facial age estimation,” in Chinese Conference
on Biometric Recognition. Springer, 2015, pp. 649–658.
[14] D. Yi, Z. Lei, and S. Z. Li, “Age estimation by multi-scale convolu-
tional network,” in Computer Vision–ACCV 2014. Springer, 2014,
pp. 144–158.
[15] P. Geladi and B. R. Kowalski, “Partial least-squares regression: a
tutorial,” Analytica Chimica Acta, vol. 185, no. 86, pp. 1–17, 1986.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. ?, NO. ?, ? ? 13
[16] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical
correlation analysis: an overview with application to learning
methods.” Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.
[17] D. Basak, S. Pal, and D. C. Patranabis, “Support vector regression,”
Neural Information Processing-Letters and Reviews, vol. 11, no. 10, pp.
203–224, 2007.
[18] R. Can Malli, M. Aygun, and H. Kemal Ekenel, “Apparent age
estimation using ensemble of deep learning models,” in The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR) Work-
shops, June 2016.
[19] Y. Zhu, Y. Li, G. Mu, and G. Guo, “A study on apparent age
estimation,” in The IEEE International Conference on Computer Vision
(ICCV) Workshops, December 2015.
[20] H. F. Yang, B. Y. Lin, K. Y. Chang, and C. S. Chen, “Automatic
age estimation from face images via deep ranking,” in Proc. British
Machine Vision Conference, 2015.
[21] Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua, “Ordinal regres-
sion with multiple output cnn for age estimation,” in The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), June
2016.
[22] R. Longadge and S. Dongre, “Class imbalance problem in data
mining review,” arXiv preprint arXiv:1305.1707, 2013.
[23] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE
Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp.
1263–1284, Sept 2009.
[24] “The fg-net aging database,” http://www.fgnet.rsunit.com/.
[25] K. Ricanek and T. Tesafaye, “Morph: a longitudinal image
database of normal adult age-progression,” in International Confer-
ence on Automatic Face and Gesture Recognition, 2006, pp. 341–345.
[26] B. C. Chen, C. S. Chen, and W. H. Hsu, “Face recognition and
retrieval using cross-age reference coding with cross-age celebrity
dataset,” IEEE Transactions on Multimedia, vol. 17, no. 6, pp. 804–
815, June 2015.
[27] S. Escalera, J. Fabian, P. Pardo, X. Baró, J. Gonzalez, H. J. Escalante,
D. Misevic, U. Steiner, and I. Guyon, “Chalearn looking at people
2015: Apparent age and cultural event recognition datasets and
results,” in Proceedings of the IEEE International Conference on Com-
puter Vision Workshops, 2015, pp. 1–9.
[28] K. Simonyan and A. Zisserman, “Very deep convolutional networks
for large-scale image recognition,” arXiv preprint
arXiv:1409.1556, 2014.
[29] Y. H. Kwon and N. da Vitoria Lobo, “Age classification from facial images,”
Computer Vision and Image Understanding, vol. 74, no. 1, pp. 1–21,
1999.
[30] A. Lanitis, C. Draganova, and C. Christodoulou, “Comparing
different classifiers for automatic age estimation,” Systems, Man,
and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 34,
no. 1, pp. 621–628, 2004.
[31] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance
models,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, vol. 23, no. 6, pp. 681–685, 2001.
[32] X. Geng, Z. H. Zhou, Y. Zhang, G. Li, and H. Dai, “Learning from
facial aging patterns for automatic age estimation,” in ACM Inter-
national Conference on Multimedia, Santa Barbara, Ca, Usa, October,
2006, pp. 307–316.
[33] G. Guo, Y. Fu, C. R. Dyer, and T. S. Huang, “Image-based human
age estimation by manifold learning and locally adjusted robust
regression.” IEEE Transactions on Image Processing A Publication of
the IEEE Signal Processing Society, vol. 17, no. 7, pp. 1178–1188,
2008.
[34] Y. Fu, Y. Xu, and T. S. Huang, “Estimating human age by manifold
analysis of face pictures and regression on aging features,” in IEEE
International Conference on Multimedia and Expo, 2007, pp. 1383–
1386.
[35] F. Gao and H. Ai, “Face age classification on consumer images
with gabor feature and fuzzy lda method,” in Advances in biomet-
rics. Springer, 2009, pp. 132–141.
[36] A. Günay and V. V. Nabiyev, “Automatic age classification with
lbp,” in Computer and Information Sciences, 2008. ISCIS’08. 23rd
International Symposium on. IEEE, 2008, pp. 1–4.
[37] X. Geng, Q. Wang, and Y. Xia, “Facial age estimation by adaptive
label distribution learning,” in International Conference on Pattern
Recognition, 2014, pp. 4465–4470.
[38] Z. Tan, Z. Shuai, W. Jun, L. Zhen, and S. Z. Li, “Age estimation
based on a single network with soft softmax of aging modeling,”
in Computer Vision–ACCV 2016, 2016.
[39] Z. Huo, X. Yang, C. Xing, Y. Zhou, P. Hou, J. Lv, and X. Geng,
“Deep age distribution learning for apparent age estimation,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition Workshops, 2016, pp. 17–24.
[40] G. Antipov, M. Baccouche, S.-A. Berrani, and J.-L. Dugelay, “Ap-
parent age estimation from face images combining general and
children-specialized deep learning models,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition Work-
shops, 2016, pp. 96–104.
[41] X. Liu, S. Li, M. Kan, J. Zhang, S. Wu, W. Liu, H. Han, S. Shan,
and X. Chen, “Agenet: Deeply learned regressor and classifier
for robust apparent age estimation,” in The IEEE International
Conference on Computer Vision (ICCV) Workshops, December 2015.
[42] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with
convolutions,” in The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2015.
[43] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classifi-
cation with deep convolutional neural networks,” in Advances in
neural information processing systems, 2012, pp. 1097–1105.
[44] S. Escalera, M. Torres Torres, B. Martinez, X. Baro, H. Jair Escalante,
I. Guyon, G. Tzimiropoulos, C. Corneanu, M. Oliu, M. Ali Bagheri,
and M. Valstar, “Chalearn looking at people and faces
of the world: Face analysis workshop and challenge 2016,” in The
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Workshops, June 2016.
[45] K. Y. Chang, C. S. Chen, and Y. P. Hung, “Ordinal hyperplanes
ranker with cost sensitivities for age estimation,” in Computer
Vision and Pattern Recognition, 2011, pp. 585–592.
[46] K. Chen, S. Gong, T. Xiang, and C. L. Chen, “Cumulative attribute
space for age and crowd density estimation,” in IEEE Conference
on Computer Vision and Pattern Recognition, 2013, pp. 2467–2474.
[47] X. Wang, R. Guo, and C. Kambhamettu, “Deeply-learned feature
for age estimation,” in IEEE Winter Conference on Applications of
Computer Vision, 2015, pp. 534–541.
[48] P. Viola and M. Jones, “Rapid object detection using a boosted
cascade of simple features,” in Computer Vision and Pattern Recogni-
tion, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society
Conference on, vol. 1. IEEE, 2001, pp. I–I.
[49] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, “Active
shape models–their training and application,” Computer Vision and
Image Understanding, vol. 61, no. 1, pp. 38–59, 1995.
[50] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture
for fast feature embedding,” in Proceedings of the 22nd ACM inter-
national conference on Multimedia. ACM, 2014, pp. 675–678.
[51] M. Yang, S. Zhu, F. Lv, and K. Yu, “Correspondence driven
adaptation for human profile recognition,” in Computer Vision and
Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011,
pp. 505–512.
[52] H. Han, C. Otto, X. Liu, and A. K. Jain, “Demographic estimation
from face images: Human vs. machine performance,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, vol. 37, no. 6,
pp. 1148–1161, 2015.
[53] Y. Zhang and D.-Y. Yeung, “Multi-task warped gaussian process
for personalized age estimation,” in Computer Vision and Pattern
Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2622–
2629.
[54] R. Rothe, R. Timofte, and L. Van Gool, “Some like it hot-visual
guidance for preference prediction,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2016, pp.
5553–5561.
[55] X. Yang, B. B. Gao, C. Xing, Z. W. Huo, X. S. Wei, Y. Zhou, J. Wu,
and X. Geng, “Deep label distribution learning for apparent age
estimation,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition Workshops, 2015, pp. 344–350.
[56] G. Levi and T. Hassner, “Age and gender classification using con-
volutional neural networks,” in The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) Workshops, June 2015.
[57] Z. Huo, X. Yang, C. Xing, Y. Zhou, P. Hou, J. Lv, and X. Geng,
“Deep age distribution learning for apparent age estimation,” in
The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) Workshops, June 2016.
[58] M. Uricar, R. Timofte, R. Rothe, J. Matas, and L. Van Gool,
“Structured output svm prediction of apparent age, gender and
smile from deep features,” in The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) Workshops, June 2016.
[59] F. Gurpinar, H. Kaya, H. Dibeklioglu, and A. Salah, “Kernel elm
and cnn based facial age estimation,” in The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) Workshops, June
2016.
[60] X. Baro, J. Gonzalez, J. Fabian, M. A. Bautista, M. Oliu, H. J.
Escalante, I. Guyon, and S. Escalera, “Chalearn looking at people
2015 challenges: Action spotting and cultural event recognition,”
in IEEE Conference on Computer Vision and Pattern Recognition
Workshops, 2015, pp. 1–9.
Zichang Tan received the B.E. degree from the Department of Automation, Huazhong University of Science and Technology (HUST), Wuhan, China, in 2016. He was named an outstanding graduate of the college upon graduation. He is currently pursuing a Ph.D. degree at the Institute of Automation, Chinese Academy of Sciences (CASIA). His main research interests include deep learning, face attribute analysis and face recognition.
Jun Wan received his B.S. degree from the China University of Geosciences, Beijing, China, in 2008, and the Ph.D. degree from the Institute of Information Science, Beijing Jiaotong University, Beijing, China, in 2015. Since January 2015, he has been an assistant professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA). He received the 2012 ChaLearn One-Shot-Learning Gesture Challenge Award, sponsored by Microsoft, at ICPR 2012. He also received the 2013 and 2014 Best Paper Awards from the Institute of Information Science, Beijing Jiaotong University. His main research interests include computer vision and machine learning, especially gesture and action recognition and facial attribute analysis (i.e., age estimation, facial expression, gender and race classification). He has published papers in top journals such as JMLR, TPAMI, TIP, and TCYB. He has served as a reviewer for several top journals and conferences, such as JMLR, TPAMI, TIP, TMM, TSMC, PR, ICPR 2016, CVPR 2017, ICCV 2017, and FG 2017.
Zhen Lei received the B.S. degree in automation from the University of Science and Technology of China in 2005, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, in 2010, where he is currently an Associate Professor. He has published over 100 papers in international journals and conferences. His research interests are in computer vision, pattern recognition, image processing, and face recognition in particular. He served as an Area Chair of the International Joint Conference on Biometrics in 2014, the IAPR/IEEE International Conference on Biometrics in 2015, 2016, and 2018, and the IEEE International Conference on Automatic Face and Gesture Recognition in 2015.
Ruicong Zhi received the Ph.D. degree in signal and information processing from Beijing Jiaotong University in 2010. From 2008 to 2009, she visited the Sound and Image Processing Laboratory, Royal Institute of Technology (KTH), as a joint Ph.D. student. She is currently an associate professor in the School of Computer and Communication Engineering, University of Science and Technology Beijing. She has published more than 50 papers and holds six patents. She has received more than ten awards, including a National Excellent Doctoral Dissertation Award nomination and the Prize of Science and Technology of Beijing. Her research interests include facial and behavior analysis, emotion analysis, image processing and pattern recognition.
Guodong Guo (M’07-SM’07) received the B.E. degree in automation from Tsinghua University, Beijing, China, the Ph.D. degree in pattern recognition and intelligent control from the Chinese Academy of Sciences, Beijing, China, and the Ph.D. degree in computer science from the University of Wisconsin-Madison, Madison, WI, USA. He is an Associate Professor with the Department of Computer Science and Electrical Engineering, West Virginia University (WVU), Morgantown, WV, USA. In the past, he visited and worked in several places, including INRIA, Sophia Antipolis, France; Ritsumeikan University, Kyoto, Japan; Microsoft Research, Beijing, China; and North Carolina Central University. He authored a book, Face, Expression, and Iris Recognition Using Learning-based Approaches (2008), co-edited a book, Support Vector Machines Applications (2014), and published about 100 technical papers. His research interests include computer vision, machine learning, and multimedia. He received the North Carolina State Award for Excellence in Innovation in 2008, Outstanding Researcher (2013-2014) at CEMR, WVU, and New Researcher of the Year (2010-2011) at CEMR, WVU. He was selected as the “People’s Hero of the Week” by BSJB under the Minority Media and Telecommunications Council (MMTC) on July 29, 2013. Two of his papers were selected as “The Best of FG’13” and “The Best of FG’15”, respectively.
Stan Z. Li received the B.Eng. degree from Hunan University, China, the M.Eng. degree from the National University of Defense Technology, China, and the Ph.D. degree from Surrey University, United Kingdom. He is currently a professor and the director of the Center for Biometrics and Security Research (CBSR), Institute of Automation, Chinese Academy of Sciences (CASIA). He was a researcher at Microsoft Research Asia from 2000 to 2004. Prior to that, he was an associate professor at Nanyang Technological University, Singapore. His research interests include pattern recognition and machine learning, image and vision processing, face recognition, biometrics, and intelligent video surveillance. He has published more than 200 papers in international journals and conferences, and authored and edited eight books. He was an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and is acting as the editor-in-chief of the Encyclopedia of Biometrics. He served as a program co-chair for the International Conference on Biometrics in 2007 and 2009, and has been involved in organizing other international conferences and workshops in his fields of research interest. He was elevated to IEEE Fellow for his contributions to the fields of face recognition, pattern recognition and computer vision, and he is a member of the IEEE Computer Society.