IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. ?, NO. ?, ? ? 1
Efficient Group-n Encoding and Decoding for
Facial Age Estimation
Zichang Tan, Jun Wan, Member, IEEE, Zhen Lei, Senior Member, IEEE, Ruicong Zhi, Guodong Guo,
Senior Member, IEEE, and Stan Z. Li, Fellow, IEEE
Abstract—Different ages are closely related, especially adjacent ages, because aging is a slow and extremely non-stationary process with much randomness. To explore the relationship between the real age and its adjacent ages, an Age Group-n Encoding (AGEn) method is proposed in this paper. In our model, adjacent ages are grouped into the same group and each age corresponds to n groups. The ages grouped into the same group are regarded as an independent class in the training stage. On this basis, the original age estimation problem can be transformed into a series of binary classification sub-problems, and a deep Convolutional Neural Network (CNN) with multiple classifiers is designed to cope with them. A Local Age Decoding (LAD) strategy is further presented to accelerate the prediction process, which locally decodes the estimated age value from the ordinal classifiers. Besides, to alleviate the imbalanced data learning problem of each classifier, a penalty factor is inserted into the unified objective function to favor the minority class. Compared with state-of-the-art methods on the FG-NET, MORPH II, CACD and Chalearn LAP 2015 databases, the proposed method achieves the best performance.
Index Terms—Age estimation, deep learning, convolutional neural network, age grouping, data imbalance
1 INTRODUCTION
Human age estimation is an important component of face attribute analysis [1], with many real-world applications such as business intelligence, human computer interaction (HCI) and visual surveillance [2], [3], [4], [5]. However, human age is still hard to estimate precisely from a single face image, even though the problem has been extensively studied for many years.
The facial aging process is filled with randomness and is not stationary for everyone. The randomness has many sources, such as different diets, living or working environments and, most importantly, genes. All of those factors can affect human aging to some degree and lead to differences in appearance. In the real world, people of the same age may look different, appearing slightly older or younger compared to each other. On the other hand, faces of close ages look similar [6] because of the slow and gradual aging process; sometimes it is hard to judge which of two faces of close ages is older. So, there is a strong correlation between age classes, especially for adjacent ages.
Most previous methods estimated age by casting it as a classification problem [5], [7], [8], [9] or a regression problem [10], [11], [12], [13], [14]. For age classification, each age class is assumed to be independent of the others, which ignores the relationship between different classes. In contrast, regression treats age as a continuous value and employs regression methods to predict age from extracted features, such as Partial Least Squares (PLS) [15], Canonical Correlation Analysis (CCA) [16] and Support Vector Regression (SVR) [17]. However, those methods do not involve any aging information, either.

Z. Tan, J. Wan, Z. Lei and S.Z. Li are with the Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Room 1402, Intelligent Building, 95 Zhongguancun Donglu, Haidian District, Beijing 100190, China. Z. Tan is also with the University of Chinese Academy of Sciences, Beijing, China. (e-mail: tanzichang2016@ia.ac.cn, {jun.wan,zlei,szli}@nlpr.ia.ac.cn).
Ruicong Zhi is with the School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, P. R. China. e-mail: zhirc@ustb.edu.cn.
G. Guo is with the Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506 USA. e-mail: guodong.guo@mail.wvu.edu.
Manuscript received February 28, 2017; revised September 4, 2017; accepted November 17, 2017.
Due to the randomness of the aging process, the mapping between a face and its real age is ambiguous rather than exact. This is particularly evident for senior people: we may say that a man looks to be in his late thirties, but we can never be sure about his exact age just from his appearance. Thus, assigning each face a single age label seems difficult because of the strong correlation among age classes, especially among adjacent classes. Furthermore, training with several adjacent ages together may be more helpful for age estimation than treating each age as an independent class.
Inspired by this, we group the face images within a specific age range and then regard each age group as an independent class in the training stage. Our age grouping method is inspired by [18] but with crucial differences. Unlike [18], which groups the ages multiple times and each time divides them into non-overlapping groups, our method conducts the age division only once, dividing all ages into overlapping age groups. We carefully design the grouping strategy to encode ages into age groups so that each age corresponds to a unique set of groups. Based on this, the exact age can be recovered by decoding the group classification results according to the mapping relation between ages and age groups. Therefore, our method can be implemented in
Fig. 1. The pipeline of our framework for age estimation. It consists of two stages: training and testing. In the training stage, the training images with different scales are first processed by face detection, alignment and cropping. All images are aligned according to the centers of the two eyes and the upper lip. Then all training images are grouped by the Age Group-n Encoding strategy, where images from adjacent ages are grouped into the same group. After that, the training images are used to train the CNN. In the testing stage, the test image is first processed in the same way as in the training stage. Then the processed image is input into the trained CNN, and age group classification is employed to obtain the probability of each group. Finally, the predicted age is obtained by decoding the group classification results.
a single network rather than an ensemble of networks [18]
or cascaded networks [19].
Using the novel grouping method, we can transform the age estimation problem into a series of binary classification problems, where each classifier determines whether the face image belongs to the corresponding group or not. A CNN with multiple output layers is employed in our approach. Unlike [20], [21], our method aims to explore the relationship between adjacent ages based on age group classification, while the approaches of [20], [21] mainly exploit the relative order relation among age labels. Besides, each classifier of the network in [20], [21] acts as a comparator that determines whether or not the age of the input face is greater than a given value, while each classifier of our network aims to distinguish the images within each age group.
For each binary classifier, the number of training images belonging to the corresponding group is far smaller than the number belonging to the others (imbalanced data learning), because we group images only within a small age range. A viable solution to the imbalanced data problem is to modify the algorithm via cost-sensitive learning [22], [23]. In this paper, we modify our training algorithm by employing a penalty factor that shifts the bias of each classifier to favor the minority class, which increases the contribution of the minority class in the learning stage.
The proposed age estimation framework is shown in Fig. 1, and the source code and models are available at the website¹. The main contributions of our work include:
1) A novel age grouping strategy called Age Group-n Encoding (AGEn) is proposed, where adjacent ages are grouped into the same group and each age corresponds to n groups. Moreover, unlike employing an ensemble of multiple networks to obtain the exact age due to grouping the ages multiple times [18], only a single network (see Fig. 1) is used to make the prediction with our age division.
1. http://www.cbsr.ia.ac.cn/users/zctan/projects/AgeEncodingDecoding/main.htm
2) To accelerate the prediction process, a Local Age Decoding (LAD) strategy is proposed to obtain the predicted age by locally decoding the outputs of the binary classifiers.
3) Inspired by previous works [22], [23], we extend the cost-sensitive learning strategy used in traditional methods (e.g., Cost-Sensitive Dataspace Weighting with Adaptive Boosting [23] and Cost-Sensitive Decision Trees [23]) to the objective function of the proposed CNN framework for age estimation, which effectively deals with the imbalanced data problem caused by age grouping.
4) Our method achieves the state-of-the-art results on
multiple datasets, including FG-NET [24], MORPH
II [25], CACD [26] and Chalearn LAP 2015 databases
[27].
2 RELATED WORK
Human age estimation has been studied extensively for over 20 years. The earliest work on age estimation was possibly reported by Kwon et al. [29] in the 1990s, which used hand-crafted features to judge the coarse age range of face images, such as baby, young adult and senior adult. However, only dozens of face images were analyzed in their work, and at that time the lack of a large-scale age dataset hindered the development of age estimation technology. With the joint efforts of many scholars from all over the world, large age datasets such as the FG-NET [30], MORPH II [25] and CACD [26] databases are now available to the community, and they remain the most popular age datasets.
With the development of facial analysis technology, researchers started to predict the exact age rather than simply
Fig. 2. The architecture of the proposed network. Our network is based on the VGG-16 network [28] and takes a BGR face image of size 224 × 224 as input. The network contains two fully connected layers, the latter of which produces a feature vector for age group classification. After that, the network branches into T output layers, where each layer acts as a binary classifier that judges whether the input image belongs to the corresponding age group or not. Moreover, all convolutional layers are followed by ReLU non-linearities.
estimate the coarse age range from face images. A large number of methods have been proposed for age estimation, such as Active Appearance Models (AAM) [31], AGing pattErn Subspace (AGES) [7], [32], age manifold [10], [33], [34], and methods with local features [8], [35], [36]. Among those local features, Biologically Inspired Features (BIF) [8] have the most outstanding ability for age estimation. After features are extracted by local image descriptors, classification or regression methods are employed to obtain the predicted age, such as BIF+SVM [8], BIF+SVR [8] and BIF+CCA [12]. More recently, Geng et al. [6], [37] labeled each face image with a label distribution rather than a single age label, so that both the real age and its adjacent ages contribute to the learning. The works in [38], [39] also integrate the idea of label distribution into deep learning frameworks and achieve promising performance.
Recently, deep learning has achieved a lot of success in age estimation. Yi et al. [14] deployed many parallel CNNs with multi-scale face images for age estimation. Malli et al. [18] estimated apparent ages with age grouping to account for multiple labels per image; however, this work needs an ensemble of models to predict the exact age, which is relatively tedious. Antipov et al. [40] developed a children-specialized deep learning method for apparent age estimation and achieved the best performance at the Chalearn Looking At People (LAP) challenge 2016. Niu et al. [21] cast age estimation as an ordinal regression problem with a multiple-output CNN, which achieved the state-of-the-art result on the MORPH II database. Zhu et al. [19] first used an age group classifier to acquire the coarse age range of a face image with a CNN, and then employed multiple local age estimators to predict the exact age. Liu et al. [41] exploited a general-to-special transfer learning scheme for age estimation based on GoogleNet [42]. Rothe et al. [9] proposed a Deep EXpectation (DEX) method for apparent age estimation based on the VGG-16 architecture [28] and won first place at the Chalearn LAP challenge 2015. However, DEX only conducts a refinement that fuses all age information in the prediction phase, neglecting the correlation between different ages in the training stage.
In this work, the correlation between adjacent ages is explored by grouping and training adjacent ages together. Different from previous grouping-based methods, which estimate the age of a facial image through an ensemble of models or cascaded structures, the proposed method estimates age with a single network based on well-designed group-n encoding and decoding processes. To the best of our knowledge, this is the first work to conduct age estimation with a single network based on age group classification.

Fig. 3. Example of grouping results with Age Group-3 Encoding for the age set {0, 1, ..., 100}. There are 103 groups in total and each age corresponds to 3 groups.
3 OUR METHOD
The pipeline of our method for age estimation is shown in Fig. 1. Our method mainly consists of fine-grained age grouping, age group classification and age decoding. The full procedure is given in Algorithm 1.
3.1 Fine-grained Age Grouping
Unlike previous age grouping methods where each age corresponds to one group, we introduce a novel age grouping method called Age Group-n Encoding (AGEn) for age estimation, where each face image is assigned to n groups. The grouping rules are:
1. Given the age set $Y = \{l_0, l_1, \ldots, l_K\}$, we group the ages into $T$ ($T = K + n$) groups. Note that $l_0$ and $l_K$ are the minimum and maximum ages, respectively, and $l_0 < l_1 < \cdots < l_K$.
2. Age $l_i$ is assigned to groups $i, i+1, \ldots, i+n-1$, so that each age corresponds to n groups. Each group includes at least one and at most n ages.
Figure 3 gives a grouping example for $K = 100$ and $n = 3$ with age set $\{0, 1, \ldots, 100\}$. According to our grouping rules, each age is encoded into a unique group set, which is essential for the prediction stage, a decoding process from groups back to age. To facilitate later parts of the paper, $C_a = \{c_0, c_1, \ldots, c_{n-1}\}$ denotes the indices of the groups that age $a$ belongs to, and $S_t$ denotes the set of ages that the $t$-th group includes. For example, as shown in Fig. 3, $C_1 = \{1, 2, 3\}$ indicates that age 1 is assigned to groups 1, 2 and 3, and $S_3 = \{1, 2, 3\}$ denotes that group 3 consists of ages 1, 2 and 3.
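As a concrete illustration, the two grouping rules above can be sketched in a few lines of Python (the function name and data structures are ours, not from the paper's released code):

```python
def agen_encode(ages, n):
    """Age Group-n Encoding (AGEn): assign each age to n overlapping groups.

    ages: sorted list [l_0, ..., l_K]. Returns (C, S) where C[a] is the set
    of group indices that age a belongs to, and S[t] is the set of ages
    contained in group t.
    """
    K = len(ages) - 1
    T = K + n                                   # total number of groups
    # Rule 2: age l_i goes to groups i, i+1, ..., i+n-1.
    C = {ages[i]: set(range(i, i + n)) for i in range(K + 1)}
    # Invert the mapping to obtain the ages contained in each group.
    S = {t: {a for a, groups in C.items() if t in groups} for t in range(T)}
    return C, S

# Fig. 3: ages 0..100 with n = 3 give T = 103 groups.
C, S = agen_encode(list(range(101)), n=3)
print(C[1])   # {1, 2, 3}: age 1 is assigned to groups 1, 2 and 3
print(S[3])   # {1, 2, 3}: group 3 consists of ages 1, 2 and 3
```

Since every age maps to a distinct set of n consecutive group indices, the encoding is invertible, which is exactly what the decoding stage of Section 3.3 relies on.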
3.2 Age Group Classification
The network architecture of age group classification, called
Multiple Outputs CNN (MO-CNN), is illustrated in Fig. 2.
The network includes multiple output layers, where each
output layer corresponds to a binary classification task that
judges whether the input sample belongs to the age group
or not. Assume we have a training set with $N$ samples, where each sample carries a chronological age label and $T$ age group labels, with $T = K + n$ when the Age Group-n Encoding strategy is employed. Each sample is represented as $(x_i, y_i, \{g_i^t\}_{t=0}^{T-1})$, where $x_i \in \mathbb{R}^d$ is the $i$-th sample, $y_i \in Y$ is the age label of $x_i$ and $g_i^t \in G = \{0, 1\}$ is the age group label indicating whether the $i$-th sample belongs to age group $t$ or not: if $x_i$ belongs to age group $t$, $g_i^t = 1$; otherwise, $g_i^t = 0$. As shown in Fig. 2, the network extracts a high-level feature $x_i^l$ through a sequence of non-linear mappings with a set of parameters $W = \{W_i\}_{i=0}^{l}$, where $W_i$ represents the weights of layer $i$. With the shared representation $x_i^l$, we conduct the group classifications via multiple binary classifiers with parameters $\bar{W} = \{\bar{W}_t\}_{t=0}^{T-1}$, where $\bar{W}_t$ denotes the weights of the $t$-th classifier. Thus, the parameters of the whole network are denoted $\{W, \bar{W}\}$.
For each classifier, the cross-entropy loss is used as the loss function; thus the objective function of the $t$-th classifier can be written as

$$J_t = -\frac{1}{N}\sum_{i=0}^{N-1}\sum_{m=0}^{1} 1\{g_i^t = m\}\,\log p(g_i^t = m \mid x_i, W, \bar{W}) \quad (1)$$

where $p(g_i^t = m \mid x_i, W, \bar{W}) = \frac{\exp\{(\bar{W}_t^m)^T x_i^l\}}{\sum_j \exp\{(\bar{W}_t^j)^T x_i^l\}}$ is the softmax function and $\bar{W}_t^j$ denotes the $j$-th column of the parameter matrix $\bar{W}_t$ of the $t$-th task.
However, the data distribution is extremely unbalanced for each classifier, and training unevenly could jeopardize the whole model. Each sample in a binary classifier has two states: belonging to the group (a positive sample) or not (a negative sample). As shown in Fig. 4, the number of positive samples is much smaller than that of negative samples. To alleviate the imbalanced data learning problem, we impose penalty factors that penalize positive and negative samples to different degrees for each task. The penalty coefficients are represented as $\rho = \{\rho_t^0, \rho_t^1\}_{t=0}^{T-1}$, where $\rho_t^0$ is the penalty coefficient for negative samples and $\rho_t^1$ for positive samples. Thus, the objective function of the $t$-th task is

$$J_t = -\frac{1}{N}\sum_{i=0}^{N-1}\sum_{m=0}^{1} 1\{g_i^t = m\}\,\rho_t^m \log p(g_i^t = m \mid x_i, W, \bar{W}) \quad (2)$$

Therefore, we can balance the contributions of positive and negative samples by adjusting the magnitudes of the penalty coefficients.
We have $T$ binary classification tasks altogether, and each task corresponds to an output layer. Let $\alpha_t$ denote the importance of the $t$-th task; the objective function of the whole CNN can then be written as

$$J = -\frac{1}{N}\sum_{i=0}^{N-1}\sum_{t=0}^{T-1}\sum_{m=0}^{1} \alpha_t\, 1\{g_i^t = m\}\,\rho_t^m \log p(g_i^t = m \mid x_i, W, \bar{W}) \quad (3)$$

In the training process, we apply stochastic gradient descent (SGD) [43] to search for suitable parameters $\{W, \bar{W}\}$ for our MO-CNN.

Fig. 4. The distribution of positive and negative samples for each age group on the MORPH II training set with AGE3, AGE9 and AGE15. When grouped by AGE3, the distribution is extremely uneven and the negative samples outnumber the positive samples many times over. The number of positive samples in the middle groups increases as n rises, but the imbalance remains serious in the marginal groups.
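The objective in Eq. (3) is straightforward to compute once the T softmax outputs are available; the following NumPy sketch (array shapes and names are our own, not the authors' implementation) makes the roles of the task weights α_t and the penalty coefficients ρ_t^m explicit:

```python
import numpy as np

def mo_cnn_loss(logits, g, alpha, rho):
    """Unified objective of Eq. (3) for one mini-batch.

    logits: (N, T, 2) raw scores of the T binary output layers.
    g:      (N, T) integer group labels g_i^t in {0, 1}.
    alpha:  (T,) importance of each task.
    rho:    (T, 2) penalty coefficients [rho_t^0, rho_t^1] per task.
    """
    N, T = g.shape
    # Per-classifier softmax, as in Eq. (1).
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)          # (N, T, 2)
    # Probability assigned to the observed label m = g_i^t; the indicator
    # in Eq. (3) selects exactly one term per (i, t).
    p_true = np.take_along_axis(p, g[..., None], axis=-1)[..., 0]  # (N, T)
    # Cost-sensitive weight alpha_t * rho_t^{g_i^t}.
    w = alpha[None, :] * rho[np.arange(T)[None, :], g]             # (N, T)
    return -(w * np.log(p_true)).sum() / N
```

With uniform α and ρ this reduces to a plain sum of the cross-entropy losses of Eq. (1); raising ρ_t^1 up-weights the minority (positive) samples of task t, which is how the penalty factor shifts the classifier bias toward the minority class.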
3.3 Age Decoding
We have elaborated a CNN with multiple binary classifiers to determine which groups a face image belongs to. However, only an ambiguous age range can be acquired with this classification framework, so a decoding stage is further developed to obtain the exact age by exploiting the specific mapping relation between ages and age groups. The age decoding stage is explained below.
The objective function, Eq. (3), can be rewritten as

$$J = -\frac{1}{N}\log \prod_{i=0}^{N-1}\prod_{t=0}^{T-1}\prod_{m=0}^{1} p(g_i^t = m \mid x_i, W, \bar{W})^{\alpha_t \rho_t^m 1\{g_i^t = m\}} \quad (4)$$

Removing the negative logarithm and the averaging factor from Eq. (4), our learning procedure actually maximizes

$$p(G \mid X, W, \bar{W}) = \prod_{i=0}^{N-1}\prod_{t=0}^{T-1}\prod_{m=0}^{1} p(g_i^t = m \mid x_i, W, \bar{W})^{\alpha_t \rho_t^m 1\{g_i^t = m\}} \quad (5)$$
where $X = \{x_i\}_{i=0}^{N-1}$ and $G = \{\{g_i^t\}_{t=0}^{T-1}\}_{i=0}^{N-1}$ are the whole dataset and the corresponding group labels, respectively.
In Section 3.1, we used the index set $C_a$ to represent the groups that face images with age $a$ belong to. With this index set, Eq. (5) can be rewritten as

$$p(G \mid X, W, \bar{W}) = \prod_{i=0}^{N-1}\Big[\prod_{t \in C_{y_i}} p(g_i^t = 1 \mid x_i, W, \bar{W})^{\alpha_t \rho_t^1} \cdot \prod_{t \in \bar{C}_{y_i}} p(g_i^t = 0 \mid x_i, W, \bar{W})^{\alpha_t \rho_t^0}\Big] \quad (6)$$
Note that $C_{y_i}$ represents the groups that the face image with age $y_i$ belongs to, and $\bar{C}_{y_i}$ is the complementary set of $C_{y_i}$. It is assumed that the samples are independent of each other. Therefore, we can define the probability that a face image belongs to age $a$ as

$$P(a \mid x_i, W, \bar{W}) = \frac{1}{Z}\prod_{t \in C_a} p(g_i^t = 1 \mid x_i, W, \bar{W})^{\alpha_t \rho_t^1} \cdot \prod_{t \in \bar{C}_a} p(g_i^t = 0 \mid x_i, W, \bar{W})^{\alpha_t \rho_t^0} \quad (7)$$
where $Z$ is the normalization factor ensuring $\sum_{a \in Y} P(a \mid x_i, W, \bar{W}) = 1$. In the training stage, our learning procedure aims to make the probability $P(a \mid x_i, W, \bar{W})$ reach its maximum when $a$ equals the real age label $y_i$. Therefore, the predicted age $y_i^*$ for image $x_i$ is

$$y_i^* = \arg\max_{a \in Y} P(a \mid x_i, W, \bar{W}) \quad (8)$$
Our age decoding method finds the maximum of $P(a \mid x_i, W, \bar{W})$ over the whole age set $Y$ and takes the corresponding age as the final estimate. This is called Global Age Decoding (GAD). However, it leads to an enormous computational burden because it computes the probability for every age and then finds the maximum. Actually, we can get the coarse age range from the age group classification results and then use Local Age Decoding (LAD) to recover the exact age with reduced computational complexity. Assume that group $m$ has the maximal probability $p(g_i^m = 1 \mid x_i, W, \bar{W})$ for image $x_i$, which indicates that $x_i$ is most likely to belong to group $m$. LAD then only compares the probabilities of the ages in $S_m$:

$$y_i^* = \arg\max_{a \in S_m} P(a \mid x_i, W, \bar{W}) \quad (9)$$

We compare GAD and LAD in Section 5.3, which shows that LAD is more efficient.
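In log-space the normalizer Z drops out of the argmax, so LAD can be sketched as follows (a toy re-implementation under our own naming, not the released code; p_pos holds the T positive-class probabilities p(g^t = 1 | x) produced by the network):

```python
import numpy as np

def local_age_decode(p_pos, C, S, alpha, rho):
    """Local Age Decoding, Eq. (9).

    p_pos: (T,) probabilities p(g^t = 1 | x) from the T binary classifiers.
    C:     dict age -> set of its n group indices (from AGEn).
    S:     dict group -> set of ages it contains.
    alpha, rho: task weights and penalty coefficients of Eq. (7).
    """
    log_p1 = np.log(p_pos)          # log p(g^t = 1 | x)
    log_p0 = np.log1p(-p_pos)       # log p(g^t = 0 | x)
    m = int(np.argmax(p_pos))       # most confident group
    T = len(p_pos)

    def log_score(a):
        # Unnormalized log P(a | x) of Eq. (7); Z cancels in the argmax.
        return sum(alpha[t] * (rho[t][1] * log_p1[t] if t in C[a]
                               else rho[t][0] * log_p0[t])
                   for t in range(T))

    # GAD would take the argmax over the whole age set Y;
    # LAD restricts the search to the ages of the best group, S_m.
    return max(S[m], key=log_score)
```

For an output peaked on the groups of one age, the decoder recovers that age while scoring only the (at most n) candidate ages in S_m, which is where the speed-up over GAD comes from.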
4 EXPERIMENTS
In this section, we first introduce the databases and ex-
plain some training details about our experiments. Then we
present the experimental results.
Algorithm 1 The algorithm of the proposed method
Input: the training data $D = \{(x_i, y_i)\}_{i=0}^{N-1}$, and the test data $D^* = \{x_i^*\}_{i=0}^{M-1}$.
Output: the predictions $\{y_i^*\}_{i=0}^{M-1}$ for the test data.
1: Conduct age grouping for the training data $D$ with AGEn, and obtain the group labels $\{\{g_i^t\}_{t=0}^{T-1}\}_{i=0}^{N-1}$, the age group index set $C_a$ for each age $a$ and the age set $S_t$ for each group $t$.
2: Train MO-CNN with $\{(x_i, \{g_i^t\}_{t=0}^{T-1})\}_{i=0}^{N-1}$ to search for the optimal network parameters $\{W, \bar{W}\}$.
3: for $i = 0, 1, \ldots, M-1$ do
4:    Input the face image $x_i^*$ into MO-CNN.
5:    Obtain $\{\{p(g_i^t = m \mid x_i^*, W, \bar{W})\}_{m=0}^{1}\}_{t=0}^{T-1}$.
6:    $m \leftarrow \arg\max_t\, p(g_i^t = 1 \mid x_i^*, W, \bar{W})$.
7:    for $a \in S_m$ do
8:        Compute $P(a \mid x_i^*, W, \bar{W})$ according to Eq. (7).
9:    end for
10:   $y_i^* \leftarrow \arg\max_{a \in S_m} P(a \mid x_i^*, W, \bar{W})$.
11: end for
12: return the predictions $\{y_i^*\}_{i=0}^{M-1}$.
TABLE 1
Summary of the databases used in our experiments. The table contains the age range and the number of images of each database and its splits. Non-face images (e.g., the tattoo images in the MORPH database) are removed in our experiments and are not counted in this table.

Database (total images, age range)   Split                   Images
MORPH (55244, 16-77)
  80-20 protocol (5493)              Train (80% images)      4395
                                     Test (20% images)       1098
  S1-S2-S3 protocol (55244)          S1                      10634
                                     S2                      10634
                                     S3                      33976
FG-NET (1002, 0-69)                  Train                   990 (avg.)
                                     Test                    12 (avg.)
CACD (162941, 14-62)                 Train (1800 celebs)     144792
                                     Val (80 celebs)         7585
                                     Test (120 celebs)       10564
Chalearn LAP 2015 (4691, 3-85)       Train                   2476
                                     Validation              1136
                                     Test                    1079
Chalearn LAP 2016 (7591, 1-89)       Train                   4113
                                     Validation              1500
                                     Test                    1978
IMDB-WIKI (523051, 0-100)            Train                   297163
                                     Val                     10000
4.1 Databases
For real age estimation, we evaluate the proposed method on the FG-NET [24], MORPH II [25] and CACD [26] databases, under both controlled and uncontrolled environments. We also evaluate the performance of the proposed method for apparent age estimation on the Chalearn LAP datasets [27], [44]. The IMDB-WIKI database [5], [9] is also introduced to pretrain our network when evaluating our model on the FG-NET, MORPH and Chalearn LAP datasets. A summary of those databases is given in Table 1, including the age range, the size of each database and its corresponding splits. Fig. 5 shows some sample images from each database. Below, we give a brief introduction to those databases and the test protocols.

Fig. 5. Sample images from the Chalearn LAP, FG-NET, MORPH, CACD and IMDB-WIKI databases. The value below each image is its age label. The FG-NET database includes some old photos (grayscale images), as shown in the second row. The face images of the Chalearn LAP and MORPH databases are taken from ordinary people, while the images of the CACD and IMDB-WIKI databases are from celebrities; this difference is easily seen in the figure. Additionally, the CACD database contains some noise; for example, the second image of this database is wrongly labeled. IMDB-WIKI contains even more noise, such as images with more than one face (see the second image of the IMDB-WIKI examples) or no face (see the last image).
FG-NET The FG-NET dataset contains 1002 color or grayscale face images of 82 subjects. The images were taken in a totally uncontrolled environment, with large variations in lighting, pose and expression. When evaluating on this dataset, we adopt the leave-one-person-out (LOPO) cross-validation strategy following the setup of [5], [33], [45], [46], and report the average performance over the 82 splits.
MORPH II This is probably the largest database with precise age labels and ethnicity information. It includes about 55 thousand face images with ages ranging from 16 to 77 years. In our experiments, we employ two typical protocols for evaluation on MORPH:
According to the test protocol² provided by Yi et al. [14], the MORPH dataset is split into three non-overlapping subsets S1, S2 and S3, obeying the construction rules detailed on the website above. All experiments are run twice: 1) training on S1 and testing on S2+S3; 2) training on S2 and testing on S1+S3. Table 1 shows the number of images in each subset; in either run, the number of training images is about a quarter of the number of testing images. For simplicity, we call this the S1-S2-S3 protocol.
2. http://www.cbsr.ia.ac.cn/users/dyi/agr.html
Following the experimental settings in [21], [45], [46], [47], a subset of 5493 images of people of Caucasian descent is used to reduce the cross-race influence. We randomly split this subset into two non-overlapping parts: 80% of the images for training and 20% for testing. The numbers of training and testing images are also given in Table 1; the number of testing images is a quarter of the number of training images. We call this the 80-20 protocol for convenience.
CACD The Cross-Age Celebrity Dataset (CACD) is the largest public cross-age database, with the celebrity list drawn from the Internet Movie Database (IMDb). The images were collected from search engines using celebrity names and years (2004-2013) as keywords, yielding more than 160 thousand images of 2000 celebrities. However, the database contains much noise because each age was simply estimated from the query year and the birth year of the celebrity. We split the database into three subsets: 1800 noisy celebrities for training, where the number of images is large but the age labels are less precise; 80 cleaned celebrities for validation; and 120 cleaned celebrities for testing, where the images are manually checked and noisy images are removed.
Chalearn LAP The Chalearn LAP challenge is the first competition for apparent age estimation. Each image is labeled by at least 10 users and the average age is used as the final annotation; the dataset also provides the standard deviation of each age label. For the first edition of the Chalearn LAP challenge (2015) [27], the organizers collected 4691 images, split into three subsets: 2476 images for training, 1136 for validation and 1079 for testing. For the second edition (2016) [44], the dataset was extended to 7591 images, with 4113 for training, 1500 for validation and 1978 for testing. In addition to the larger number of images, most ages in the 2016 dataset are not integers and the standard deviations cover a larger range. Some sample images are given in Fig. 5.
IMDB-WIKI IMDB-WIKI [5], [9], which contains 523051 images in total, is the largest dataset for age estimation as far as we know; its images are crawled from celebrity pages on IMDb³ and Wikipedia⁴. However, this dataset contains much noise. Each age label is simply calculated from the date of birth of the corresponding celebrity and the year the photo was taken, so the accuracy of the annotations cannot be guaranteed when a wrong timestamp occurs or the image comes from the wrong celebrity. Additionally, tiny faces, multiple faces or non-face images also occur in the dataset, as shown in Fig. 5. Even though this dataset is not suitable for evaluation, it is still a good dataset for pretraining, because the majority of the annotations are correct. To use the dataset effectively, we select about 300 thousand images according to the settings in [5], where all
3. www.imdb.com
4. en.wikipedia.org
TABLE 2
MAE results for a variety of n and ρ1 on the validation sets. (a) Validation set of S1 on MORPH II. (b) Validation set of S2 on MORPH II. (c) Validation set of the CACD database. (d) Validation set of the Chalearn LAP 2015 database. (e) Validation set of the Chalearn LAP 2016 database.

(a) ρ1 \ n    5      7      9      11
 1          3.61   3.45   3.41   3.32
 2          3.38   3.30   3.21   3.30
 3          3.26   3.28   3.23   3.38
 4          3.63   3.32   3.34   3.45

(b) ρ1 \ n    5      7      9      11
 1          3.44   3.39   3.22   3.27
 2          3.26   3.18   3.17   3.17
 3          3.20   3.22   3.25   3.25
 4          3.18   3.19   3.17   3.32

(c) ρ1 \ n    5      7      9      11
 1          5.52   5.33   5.23   5.32
 2          5.26   5.34   5.43   5.33
 3          5.25   5.41   5.54   5.50
 4          5.45   5.50   5.63   5.49

(d) ρ1 \ n    5      7      9      11
 1          4.97   4.97   4.87   4.90
 2          4.93   4.89   4.86   4.88
 3          4.96   5.04   5.03   4.86
 4          4.99   4.96   4.91   4.99

(e) ρ1 \ n    5      7      9      11
 1          5.41   5.15   4.98   5.01
 2          5.08   4.94   4.99   5.04
 3          5.15   4.97   5.02   5.08
 4          5.08   5.14   5.06   5.15
Fig. 6. The network used for parameter searching. The network is based on AlexNet [43], with the last layer likewise replaced by multiple binary classifiers. More details of the convolution and pooling layers are shown in the figure.
Fig. 7. (a) and (b) show the last two layers of the networks of DEX and
VGG+Euclidean, respectively. The architectures of the lower layers of
DEX and VGG+Euclidean are the same as our network's.
non-face images and part of the images with multiple faces are
removed. Moreover, as shown in Table 1, the selected
images are randomly divided into two parts: 10000 images
for validation and the rest for training.
4.2 Preprocessing and Experimental Setting
Face Alignment Face alignment is helpful for age estimation.
First, all images are processed by a face detector [48],
and a few non-face images are removed, for example,
tattoo images in the Morph II database. Then, active shape
models (ASM) [49] are used to detect facial landmarks, and
all faces are aligned according to the centers of the eyes and
the upper lip. After that, all images are cropped to the size
of 224 × 224 and then fed into the network. Some aligned
images are shown in Fig. 9.
Data Augmentation When evaluating on the FG-NET,
MORPH II and Chalearn LAP databases, the training images
are extremely insufficient. For example, fewer than five
thousand images are used for training when evaluating on
the Morph dataset with the 80-20 protocol. The training set
of the Chalearn LAP 2015 dataset contains no more than three
thousand images, which is even more inadequate. Therefore,
increasing the number of training samples is necessary to improve
the performance. Usually, there are two ways to expand the
training data. One is to enrich the training set with other
datasets; for example, we usually pretrain the network on
other, larger datasets to improve its performance. The other
is to add virtual image samples. The first is a well-known
technique, so we mainly introduce the method used to
generate virtual images in our experiments. Here, we augment
the training images with flipping, rotating by ±5° and ±10°,
and adding Gaussian white noise with variances of 0.001,
0.005, 0.01, 0.015 and 0.02. The total number of images is
increased by 36 times after augmentation. However, data
augmentation is only conducted for FG-NET, MORPH II and
the Chalearn LAP datasets, since it is not necessary for the
CACD database.
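The 36× factor is consistent with six geometric variants (the original, a horizontal flip, and the four rotations) combined with six noise variants (clean plus the five noise levels); this factoring is our assumption, not stated explicitly in the text. A minimal NumPy sketch enumerating the combination (rotations are only tagged, since applying them requires an image library such as scipy.ndimage):

```python
import itertools
import numpy as np

def augment(image):
    """Enumerate 36 augmented copies of one training image:
    6 geometric variants x 6 noise variants (an assumed factoring
    of the paper's "increased by 36 times")."""
    geometric = ["orig", "flip", "rot+5", "rot-5", "rot+10", "rot-10"]
    variances = [0.0, 0.001, 0.005, 0.01, 0.015, 0.02]  # 0.0 = no noise
    copies = []
    for geo, var in itertools.product(geometric, variances):
        img = image.astype(float).copy()
        if geo == "flip":
            img = np.fliplr(img)
        # The +-5 and +-10 degree rotations would be applied with an
        # image library (e.g. scipy.ndimage.rotate); tagged only here.
        if var > 0.0:
            img = img + np.random.normal(0.0, np.sqrt(var), img.shape)
        copies.append((geo, var, img))
    return copies

copies = augment(np.zeros((224, 224)))
print(len(copies))  # 36
```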
Experimental Setting We train the deep network with
a weight decay of 0.0005 and a momentum of 0.9. The
learning rate starts from 0.001 and is reduced by a factor of
10 as the number of iterations increases. We set
α_t = 0.1 for all tasks. The AGE7 grouping strategy is used
when experimenting on the Chalearn 2016 dataset and AGE9
is used for the others. Moreover, we set ρ1_t to 1 for the
experiments on the CACD dataset and to 2 for the rest of the
experiments. More details on the settings of AGEn and the
parameters of the balance strategy can be found in Section 4.4.
Our algorithm is implemented within the Caffe framework
[50] on a TITAN X GPU. For all experiments, the VGG-16
network is first initialized with the weights from training on
the ImageNet dataset. For some experiments, the network
is additionally pretrained on the IMDB-WIKI dataset, which
we make explicit in the text.
4.3 Evaluation Metrics
For real age estimation, the Mean Absolute Error (MAE)
and Cumulative Score (CS) are usually used as evaluation
metrics. MAE indicates the mean absolute error between the
predicted result and the ground truth over the testing set, and it
is calculated as

MAE = (1/m) Σ_{i=0}^{m−1} |y'_i − y_i|    (10)

where y'_i denotes the predicted age for the i-th image and m
is the number of testing face images. MAE is the most
frequently used evaluation metric, and obviously, a lower
MAE means a better performance. CS(n) is computed
as follows

CS(n) = m_{e≤n} / m    (11)

where m_{e≤n} represents the total number of test images
whose absolute error between the predicted result and
the ground truth is not greater than n years. Obviously, the
higher the CS(n), the better the performance.
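The two metrics in Eqs. (10) and (11) amount to a few lines of NumPy; the sample predictions below are made-up values purely for illustration:

```python
import numpy as np

def mae(pred, truth):
    # Eq. (10): mean absolute error over the m test images.
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return np.mean(np.abs(pred - truth))

def cs(pred, truth, n):
    # Eq. (11): fraction of test images whose absolute error <= n years.
    err = np.abs(np.asarray(pred, float) - np.asarray(truth, float))
    return np.mean(err <= n)

preds, labels = [23, 31, 47, 60], [25, 30, 45, 52]  # hypothetical values
print(mae(preds, labels))    # 3.25
print(cs(preds, labels, 2))  # 0.75
```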
Fig. 8. (a) CS comparisons on FG-NET. (b) CS comparisons on MORPH II with 80-20 protocol when training with 80% images and testing with
20% images. (c) CS comparisons on MORPH II with S1-S2-S3 protocol. The experiments are repeated twice: 1) training with S2 and testing with
S1+S3; 2) training with S1 and testing with S2+S3, and the average CS performance is reported. (d) CS comparisons on CACD when training with
1800 celebrities and testing with 120 celebrities.
TABLE 3
The comparisons between the proposed method and other
state-of-the-art methods on the MORPH II database with the S1-S2-S3
protocol.

Method                 Train Set  Test Set  MAE   Avg. MAE
Ours (IMDB-WIKI)       S1         S2+S3     2.82  2.70
                       S2         S1+S3     2.58
Ours                   S1         S2+S3     3.04  2.86
                       S2         S1+S3     2.68
Soft softmax [38]      S1         S2+S3     3.14  3.03
  (IMDB-WIKI)          S2         S1+S3     2.92
Soft softmax [38]      S1         S2+S3     3.24  3.14
                       S2         S1+S3     3.03
Multi-scale CNN [14]   S1         S2+S3     3.72  3.63
                       S2         S1+S3     3.54
BIF+KCCA [12]          S1         S2+S3     4.00  3.98
                       S2         S1+S3     3.95
BIF+KPLS [11]          S1         S2+S3     4.07  4.04
                       S2         S1+S3     4.01
BIF+rCCA [12]          S1         S2+S3     4.43  4.42
                       S2         S1+S3     4.40
BIF+PLS [11]           S1         S2+S3     4.58  4.56
                       S2         S1+S3     4.54
CNN [51]               S1         S2+S3     4.64  4.60
                       S2         S1+S3     4.55
BIF+KSVM [12]          S1         S2+S3     4.89  4.91
                       S2         S1+S3     4.92
BIF+LSVM [12]          S1         S2+S3     5.06  5.09
                       S2         S1+S3     5.12
BIF+CCA [12]           S1         S2+S3     5.39  5.37
                       S2         S1+S3     5.35
For apparent age estimation, the ϵ-error, proposed by the
Chalearn LAP competition, is used as a quantitative measure.
The ϵ-error is computed as

ϵ = 1 − exp(−(x − µ)² / (2σ²)).    (12)

It not only measures the error between the predicted value
x and the average labeled age µ, but also takes into
consideration the standard deviation σ of the annotations. The
final ϵ-error is the average over all predictions. Of course,
a lower ϵ-error means a better performance, and it reaches 0
when a perfect prediction is achieved.
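Eq. (12) can be checked numerically in a few lines (the per-image measure only; averaging over all predictions is as stated above):

```python
import math

def epsilon_error(x, mu, sigma):
    # Eq. (12): 1 - exp(-(x - mu)^2 / (2 sigma^2)).
    return 1.0 - math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# A perfect prediction scores 0; a fixed error is penalized less when
# the annotators themselves disagreed more (larger sigma).
print(epsilon_error(30.0, 30.0, 3.0))            # 0.0
print(round(epsilon_error(33.0, 30.0, 3.0), 3))  # 0.393, i.e. 1 - exp(-0.5)
```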
4.4 Parameters Discussion
As shown in Fig. 11, the distributions of MORPH II and
CACD databases differ greatly. We believe that the optimal
TABLE 4
The results on Morph II database with 80-20 protocol and FG-NET
database. Our method achieves the state-of-the-art performance on
both databases.
Method Morph II FG-NET
Human workers [52] 6.30 4.70
AGES [7] 8.83 6.77
MTWGP [53] 6.28 4.83
CA-SVR [46] 5.88 4.67
OHRank [45] 5.69 4.85
DLA [47] 4.77 4.26
VGG+SVR [54] 3.45
VGG+Euclidean 3.49 4.77
VGG+Euclidean (IMDB-WIKI) 3.15 4.30
DEX [5] 3.25 4.63
DEX (IMDB-WIKI) [5] 2.68 3.09
Ours 2.93 4.34
Ours (IMDB-WIKI) 2.52 2.96
parameters of the model are closely related to the training data
distributions. In this subsection, we find an appropriate age
grouping range n for the age grouping strategy and penalty
coefficient ρ for the data balance strategy by conducting
experiments on the validation sets with a variety of n and ρ.
The penalty coefficient is ρ = {ρ0_t, ρ1_t}_{t=0}^{T−1}. We assume that it
is the same for all tasks, so it can be written as ρ = {ρ0, ρ1}.
However, searching over both ρ0 and ρ1 would take a lot of effort. For
the sake of simplicity, we set ρ0 = 1 and only change the
value of ρ1 in the parameter searching process.
The CACD and Chalearn LAP 2015 & 2016 datasets offer
validation sets. Thus, we directly evaluate the model
on their validation sets to choose the appropriate parameters.
However, since no validation set is offered in the MORPH II
dataset, we randomly select 2000 images from its training set
as a validation set. These images are therefore not used
for training in the parameter searching process. Random
selection also ensures that the distribution of the training data
remains unchanged. Since training with the VGG-16 network
consumes a lot of time, we conduct the experiments with a
shallower network, based on AlexNet [43], which is shown
in Fig. 6.
The results on the validation sets are shown in Table 2.
From the results, we adopt the AGE9 strategy with ρ1 =
2, 1 and 2 for the Morph II, CACD and Chalearn LAP
2015 datasets, respectively. Moreover, AGE7 and ρ1 = 2
are adopted for the Chalearn LAP 2016 dataset. The FG-NET
database contains too few images, and all of its images would
TABLE 5
Comparisons with the state-of-the-art methods on the Chalearn LAP 2015 dataset. The proposed method achieves the state-of-the-art
performance. (↓: the smaller the better.)

Rank  Team                 Validation Set1     Test Set2           Pretrain Set                         Network    Num. of Networks
                           MAE↓    ϵ-error↓    MAE↓    ϵ-error↓
–     Ours                 3.21    0.28        2.94    0.263547    IMDB-WIKI                            VGG-16     8
1     CVL ETHZ [5], [9]    3.25    0.28        –       0.264975    IMDB-WIKI                            VGG-16     20
2     ICT-VIPL [41]        3.33    0.29        –       0.270685    FG-NET, Morph, CACD, et al.          GoogleNet  8
3     WVU CVL [19]         –       0.31        –       0.294835    FG-NET, Morph, CACD, et al.          GoogleNet  5
4     SEU NJU [55]         –       0.34        –       0.305763    FG-NET, Morph, Adience [56], et al.  GoogleNet  6
–     human reference      –       –           –       0.34        –                                    –          –
5     UMD                  –       –           –       0.373352    –                                    –          –
6     Enjuto               –       –           –       0.374390    –                                    –          –
7     Sungbin Choi         –       –           –       0.420554    –                                    –          –
8     Lab219A              –       –           –       0.499181    –                                    –          –
9     Bogazici             –       –           –       0.524055    –                                    –          –
10    Notts CVLab          –       –           –       0.594248    –                                    –          –

1 The performance on the validation set is tested based on a single network.
2 The performance on the test set is evaluated by an ensemble of multiple networks, where the number of networks used is shown in the last
column of the table.
TABLE 6
Comparisons with the state-of-the-art methods on the Chalearn LAP 2016 dataset. (↓: the smaller the better.)

Rank  Team              Test Set            Pretrain Set                                   Network  Num. of Networks
                        MAE↓    ϵ-error↓
–     Ours              3.82    0.3100      IMDB-WIKI                                      VGG-16   1
1     OrangeLabs [40]   –       0.2411      cleaned IMDB-WIKI, a private children dataset  VGG-16   14
2     palm seu [57]     –       0.3214      IMDB-WIKI                                      VGG-16   4
3     cmp+ETH [58]      –       0.3361      IMDB-WIKI                                      VGG-16   10
4     WYU CVL           –       0.3405      –                                              –        –
5     ITU SiMiT [18]    –       0.3668      IMDB-WIKI                                      VGG-16   3
6     Bogazici [59]     –       0.3740      –                                              VGG-16   8
7     MIPAL SNU         –       0.4569      –                                              –        –
8     DeepAge           –       0.4573      –                                              –        –
be used for evaluation with the LOPO (leave-one-person-out) strategy.
Therefore, we use n = 9 and ρ1 = 2 for FG-NET based on our experience,
since these two parameters perform well in most cases.
We can see that AGE9 is a relatively stable grouping
strategy, and the model achieves promising results with
the AGE9 strategy on most validation sets. For the grouping range
n, when n is smaller, the relationship between adjacent
ages cannot be explored thoroughly, and the imbalanced
data problem between images belonging to a group or not
is more serious because each group includes fewer images.
When n is bigger, the images within a group show greater
diversity, which is harmful to the model. Thus, the
AGE9 strategy may perform well because n = 9 is an
appropriate grouping value that achieves a good tradeoff
between the above two aspects.
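The tradeoff can be made concrete with the membership structure of AGEn. The sketch below reflects our reading of the encoding (group k covering ages [k, k+n−1], clipped at the boundaries of the age range, so that each interior age falls into exactly n consecutive groups); the exact window convention is an assumption, not code from the paper:

```python
def age_groups(age, n, min_age=0, max_age=100):
    """Indices of the groups covering `age` under a sliding-window
    reading of AGEn: group k covers ages [k, k + n - 1], so an
    interior age lies in n consecutive groups (fewer near the
    boundaries of [min_age, max_age])."""
    first = max(min_age, age - n + 1)        # earliest window containing `age`
    last = min(age, max_age - n + 1)         # latest window containing `age`
    return list(range(first, last + 1))

print(age_groups(30, 9))       # [22, 23, 24, 25, 26, 27, 28, 29, 30]
print(len(age_groups(30, 9)))  # 9: an interior age belongs to n groups
```

A larger n means each group pools more images (easing the per-classifier imbalance) but also spans a wider, more diverse age range, which is the tension discussed above.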
4.5 Comparisons
4.5.1 Real Age Estimation
In this section, we conduct comprehensive evaluations of the
proposed method on the Morph, FG-NET and CACD datasets
for real age estimation.
Results on MORPH II with S1-S2-S3 Protocol. The
proposed method achieves an average MAE of 2.86 without
pretraining on any additional age dataset. It reduces the
MAE by 0.17 compared with the previous state-of-the-art
result reported in [38] (see Table 3). To the best of our
knowledge, this is the first reported MAE below 3 years
under this protocol. Pretraining on the IMDB-WIKI dataset
further improves the performance, achieving an MAE
of 2.70 years. The CS results are shown in Fig. 8, and our
method achieves the best performance.
Results on MORPH II with 80-20 Protocol. Usually, age
estimation can be treated as a classification or a regression
problem. We take two baseline methods, one for age classification
and one for age regression, for comparison under this protocol. For age
classification, each age is regarded as an independent class.
We take Deep EXpectation (DEX) [5], [9] as the baseline
method for age classification. DEX is one of the most popular
methods for age estimation and won the first prize of
the ChaLearn Looking At People ICCV 2015 challenge [60].
For age regression, we take the classic regression method
as the baseline for comparison, where the Euclidean loss is
employed as the loss function. For a fair comparison, the
network architectures of DEX and the regression-based method
are the same as our MO-CNN's except for the output layer,
as shown in Fig. 7.
From Table 4, our method achieves the state-of-the-art
performance with an MAE of 2.93 when directly finetuning
on the Morph dataset. As far as we know, it is also the first work
that reduces the MAE to under 3 years without finetuning
on an additional age dataset. To further improve the performance,
the network is first finetuned on the IMDB-WIKI
dataset before finetuning on the Morph dataset; the proposed
method then achieves an MAE of 2.52 years, which improves
the previous state-of-the-art result by 0.18 years. Besides, the
CS comparisons with the state-of-the-art methods are shown
in Fig. 8, where our approach again shows its superiority.
TABLE 7
The comparisons on CACD dataset.
Method Train Set Test Set Avg. MAE
Ours 1800 celebs 120 celebs 4.68
DEX [5] 1800 celebs 120 celebs 4.79
VGG+Euclidean 1800 celebs 120 celebs 5.08
Results on FG-NET. Because the FG-NET dataset contains
only 1002 images, we first pretrain our network on the IMDB-WIKI
dataset and then finetune on FG-NET. The two baseline
methods are also included for comparison. As shown
in Table 4, our method achieves the state-of-the-art performance
on the FG-NET database with an average MAE of 2.96,
which improves the previous state-of-the-art result by 0.13.
The CS comparisons are shown in Fig. 8, and the proposed
method also performs better than the other methods.
Results on CACD. Only a few works conduct evaluations
on the CACD database because of its noise. Here, we
compare our result with two baseline methods,
VGG+Euclidean regression and DEX. The comparisons are
shown in Table 7. Our method achieves the best performance
with the lowest MAE of 4.68 years. When CS is taken
as the criterion, our method also performs much better than
the other methods, as shown in Fig. 8. This indicates that our
method is capable of estimating age from face images in the
wild. Note that we do not finetune our network on the IMDB-WIKI
dataset because some images are duplicated between IMDB-WIKI
and CACD.
4.5.2 Apparent Age Estimation
In this section, the evaluation on the Chalearn LAP datasets is
presented.
Results on Chalearn LAP 2015. As a competition dataset
for apparent age estimation, the Chalearn LAP dataset differs
from other public datasets. Following the tricks used
in [5], [9], [41], we finetune our network on both the training
and validation sets after finetuning on a large additional
age dataset, i.e., the IMDB-WIKI dataset. In the test phase, each
image is flipped and then rotated by 0°, ±5°; thus each
image is tested 6 times and those predictions are averaged.
Note that all results in this paper, except those on the Chalearn LAP
dataset, are based on a single test image. To further improve
the performance, an ensemble of 8 networks
is employed, and we take the average of their predictions
as the final estimated age. The ensemble technique is
only used when evaluating on the test set of the Chalearn LAP
dataset. We also report the performance on the validation
set with only finetuning on the training set.
The experimental results are shown in Table 5. The
proposed method achieves a better performance than the other
teams with a final ϵ-error of 0.263547. On the validation set, our
method also achieves a lower MAE and ϵ-error based on a
single network. Because many tricks are employed in
this evaluation, more training details are presented in Section
5.1.
Results on Chalearn LAP 2016. Different from Chalearn
LAP 2015, most ages in the Chalearn LAP 2016 dataset are not
integers. If we train the network with rounded ages, much
information would be sacrificed. To reduce the information
loss, we follow the works [40], [41] and encode each age label y
Fig. 10. The visualization of age and age group probabilities.
TABLE 8
The comparisons between GAD and LAD. The experiment is
conducted on Morph II dataset with 80-20 protocol.
Methods GAD LAD
Avg. MAE 2.52004 2.52004
Time (per image) 51.7ms 4.6ms
and its corresponding deviation σ into a label distribution.
The distribution is a set of probabilities representing the
description degrees of their corresponding labels, which is
defined as follows:

P(l_i) = (1 / Z_{y,σ}) · (1 / √(2πσ²)) · exp(−(l_i − y)² / (2σ²)),  i = 0, ..., K    (13)

where Z_{y,σ} is the normalization factor related to the age label
y and its deviation σ. We generate a random age label for
each image according to its label distribution and regard the
random age label as the ground truth label in the training
stage. Other experimental settings are the same as Chalearn
LAP 2015's.
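A minimal sketch of Eq. (13) and of sampling a training label from it, assuming integer candidate labels 0..K with K = 100; the 1/√(2πσ²) factor is constant across labels, so it can be folded into the normalization constant Z:

```python
import math
import random

def label_distribution(y, sigma, K=100):
    # Eq. (13): discretized Gaussian centered at the (possibly
    # non-integer) apparent age y, normalized so it sums to 1.
    raw = [math.exp(-((l - y) ** 2) / (2.0 * sigma ** 2)) for l in range(K + 1)]
    z = sum(raw)  # plays the role of Z_{y,sigma}
    return [p / z for p in raw]

def sample_age(y, sigma, K=100, rng=random):
    # Draw a random integer age label according to the distribution;
    # this sampled label serves as the ground truth for one training pass.
    dist = label_distribution(y, sigma, K)
    return rng.choices(range(K + 1), weights=dist, k=1)[0]

dist = label_distribution(27.4, 2.0)
print(abs(sum(dist) - 1.0) < 1e-9)             # True: normalized
print(max(range(101), key=lambda l: dist[l]))  # 27, the nearest integer age
```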
We find that the performance of the methods [18], [40],
[57], [58] varies between the validation set and the test set; for example,
OrangeLabs's method did not achieve the best performance
on the validation set but outperformed the other methods by a
large margin on the test set. Therefore, we only conduct the
evaluation on the test set for a consistent comparison. The
comparisons are reported in Table 6. Our method achieves
an epsilon error of 0.3100 on the test set
based on a single network, which is the second best result,
next only to OrangeLabs's [40]. OrangeLabs's method achieves
a better performance mainly for the following
reasons: first, they pretrained their network on a cleaned
IMDB-WIKI dataset that was arranged and annotated by
26 persons over a few days; second, they manually
collected a private dataset with a considerable quantity of
images of children, and trained 3 separate models
for estimating the apparent ages of children using this children
dataset; third, they used an ensemble of multiple models to
boost the performance.
4.6 Computation Time Analysis
We train an age group classification network that treats adjacent
ages as an independent class. Then a decoding process
(LAD or GAD) is used to obtain the probability of each
age. In this subsection, we mainly analyze the accuracy
and computational efficiency of the GAD and LAD
methods. The comparative experiments are conducted on the
MORPH II database on a CPU, and we only compare the
time consumed in the decoding phase. In decoding, there
Fig. 9. The original and aligned images of Chalearn LAP, FG-NET, Morph and CACD databases. The predicted ages of both good and bad
estimation are given in the figure. Note that the predicted age on Chalearn LAP dataset is not an integer due to the averaging of the predictions of
the augmented testing images and an ensemble of networks.
TABLE 9
Some training details of our method on the Chalearn dataset. All results are obtained by finetuning on both the training and validation sets,
and then testing on the test set.

Crop size of     Data augmentation  IMDB-WIKI    Data augmentation  Num. of   MAE   ϵ-error
training images  on training set    pretraining  on testing set     networks
224×224          No                 No           No                 1         4.64  0.4027
224×224          Yes                No           No                 1         4.30  0.3709
224×224          Yes                Yes          No                 1         3.08  0.2789
224×224          Yes                Yes          Yes                1         2.97  0.2669
224×224          Yes                Yes          Yes                8         2.94  0.2635
TABLE 10
The comparisons between with and without the grouping and decoding components on the Morph II dataset under the 80-20 protocol.

      With grouping and decoding   Without grouping and decoding
ρ1    –                            1     2     3     4     5     6     7     8     9     10    11
MAE   2.52*                        2.89  2.82  2.79  2.78  2.74  2.73  2.70  2.75  2.71  2.71  2.76

* The result is obtained using AGE9 and ρ1 of 2.
are only two terms changed between the probabilities of a
and a+1 according to Eq. (7). To avoid a decrease in performance
due to rounding errors in the continuous calculation
process, P(a), P(a+1), ... are not calculated sequentially.
Instead, we compute P(a) for each a in the whole age set
Y (GAD) or in the age group set S_m with the maximal
classification probability (LAD). We find that LAD spends less
time while achieving the same performance as GAD. As
shown in Table 8, LAD only needs 4.6 ms to analyze one face
image while GAD needs 51.7 ms, which speeds up the
decoding by about 10 times. The visualization of
the age probabilities with both LAD and GAD is shown
in Fig. 10 (for a randomly selected sample from the test set). The
decoding results are virtually identical.
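Since Eq. (7) is defined earlier in the paper, the sketch below is only an illustrative stand-in, not the paper's exact formulation: each binary classifier k is assumed to output the probability that the age lies in group [k, k+n−1], a candidate age is scored by a naive log-likelihood over all classifiers, and GAD searches the whole age range while LAD restricts the search to the ages covered by the most confident group:

```python
import math

def log_score(a, probs, n):
    """Hypothetical log-probability of age `a` given per-group classifier
    outputs `probs` (probs[k] ~ P(age in group k), group k covering ages
    [k, k + n - 1]); a stand-in for the paper's Eq. (7)."""
    s = 0.0
    for k, p in enumerate(probs):
        inside = k <= a <= k + n - 1
        p = min(max(p, 1e-9), 1.0 - 1e-9)  # clamp to avoid log(0)
        s += math.log(p) if inside else math.log(1.0 - p)
    return s

def decode(probs, n, local=True):
    if local:  # LAD: search only ages covered by the most confident group
        k = max(range(len(probs)), key=probs.__getitem__)
        candidates = range(k, k + n)
    else:      # GAD: search the entire age range
        candidates = range(len(probs) + n - 1)
    return max(candidates, key=lambda a: log_score(a, probs, n))

# Synthetic outputs peaking around age 30 (groups 22..30 all contain 30).
probs = [0.9 if 22 <= k <= 30 else 0.1 for k in range(92)]
print(decode(probs, 9, local=True), decode(probs, 9, local=False))  # 30 30
```

On such peaked outputs both searches return the same prediction, consistent with the identical MAEs in Table 8, while LAD evaluates only n candidates instead of the whole age range.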
5 DISCUSSION
5.1 Exploring Training Details
Many tricks are employed when evaluating on the
Chalearn LAP dataset, e.g., pretraining, data augmentation
and an ensemble of networks. In this section, a step-by-step
investigation is conducted to explore the contribution of each
trick.
As shown in Table 9, pretraining on the IMDB-WIKI
dataset is very helpful and reduces the ϵ-error
from 0.3709 to 0.2789. This significant improvement shows
that the IMDB-WIKI dataset is still useful even though
it contains much noise. Our data augmentation on the
Fig. 11. The distribution of training sets.
training set also makes a great contribution: the ϵ-error
drops by about 0.032 with the training data augmentation.
It is worth noting that the proposed method achieves
an ϵ-error of 0.2669 with a single network, which is very
close to the best result of the Chalearn LAP competition [5], [9].
5.2 Detailed comparison with DEX
To compare with the DEX method thoroughly, we re-implement
DEX with the same experimental settings, where both face
alignment and data augmentation are used. The network of the
re-implemented DEX method is the same as ours except for the
last layer, as shown in Fig. 7. We conduct the comparisons on the
FG-NET, Morph II and CACD datasets. When experimenting
on the FG-NET and Morph II datasets, the networks are first
pretrained on the IMDB-WIKI dataset. As shown in Table
11, our method still performs better than the DEX method
on those datasets when adopting the same experimental
settings. Furthermore, we also implement our method with
the same training settings as DEX's. Besides selecting the
part of the IMDB-WIKI images with little noise for
pretraining, Rothe et al. [5] also equalized the age
distribution of the selected images to improve the model's
generalization capability. However, they did not make the
list of pretraining images public to the community. Therefore,
for a fair comparison, we did not pretrain the model
on the IMDB-WIKI dataset when conducting experiments with
the same training settings as DEX's. The comparisons on the
FG-NET, Morph and CACD datasets are shown in Table
11; our method achieves a better performance. No matter
which training settings are employed, our method shows
superiority over DEX.
TABLE 11
The comparisons between our method and DEX on the FG-NET, Morph II
and CACD datasets. Note that here we adopt the 80-20 protocol when
evaluating on the Morph II dataset.

        Our training settings         DEX's training settings
Method  FG-NET  Morph II  CACD       FG-NET  Morph II  CACD
Ours    2.96    2.52      4.68       4.30    3.01      4.73
DEX     3.01    2.66      4.75       4.63    3.25      4.79
5.3 Ablation Study
In this section, we conduct an ablation analysis of
the grouping and decoding components of the proposed
method. We train the network with multiple classifiers but
without the grouping component, where each classifier
determines whether the input image belongs to the corresponding
age or not. With the grouping stage removed, the predicted
age can be directly obtained via the maximum probability
over the classifiers, so the decoding stage is also dropped.
To make a fair comparison, we also conduct
experiments with a variety of ρ1 values to find an appropriate
one. As shown in Table 10, the minimum MAE only
reaches 2.70 when the grouping and decoding components are
dropped. This means that the model without grouping and
decoding components performs 0.18 years worse than the
model with those components. When the grouping and decoding
components are dropped, each age is regarded as a single age
group and the relationship between adjacent ages cannot be explored
either. All of this results in a decrease in performance. From
this perspective, the grouping and decoding components are
of critical importance to our method.
6 CONCLUSION
In this paper, we propose a deep learning solution for age
estimation based on a single network to account for aging
randomness. First, an age group-n encoding strategy is
proposed to group ages, where adjacent ages are grouped
into the same group and each group is regarded as an
independent class. Then, age group classification is imple-
mented in a CNN with multiple outputs and we recover the
exact age for each face image by decoding the classification
results. Moreover, we modify our algorithm to address the
imbalanced data learning problem. Finally, the evaluations
on multiple age databases show that the proposed method
achieves the state-of-the-art performance.
ACKNOWLEDGMENTS
This work was supported by the National Key Research
and Development Plan (Grant No.2016YFC0801002), the
Chinese National Natural Science Foundation Projects
61502491, 61473291, 61572501, 61572536, 61673052, Sci-
ence and Technology Development Fund of Macau (No.
112/2014/A3, 151/2017/A, 152/2017/A), NVIDIA GPU do-
nation program and AuthenMetric R&D Funds. Zichang
Tan and Jun Wan contributed equally to this paper. Jun Wan
is the corresponding author.
REFERENCES
[1] A. K. Jain and S. Z. Li, Handbook of face recognition. Springer, 2005.
[2] M. Fairhurst, Age Factors in Biometric Processing. Institution of
Engineering and Technology, 2013.
[3] Y. Ma and G. Qian, Intelligent video surveillance: systems and technol-
ogy. CRC Press, 2009.
[4] C. Shan, F. Porikli, T. Xiang, and S. Gong, Video Analytics for
Business Intelligence. Springer, 2012, vol. 1.
[5] R. Rothe, R. Timofte, and L. Van Gool, “Deep expectation
of real and apparent age from a single image without facial
landmarks,” International Journal of Computer Vision, Aug 2016.
[Online]. Available: https://doi.org/10.1007/s11263-016-0940-3
[6] X. Geng, C. Yin, and Z. Zhou, “Facial age estimation by learning
from label distributions,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 35, no. 10, pp. 2401–2412, 2013.
[7] X. Geng, Z. Zhou, and K. Smithmiles, “Automatic age estimation
based on facial aging patterns,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, vol. 29, no. 12, pp. 2234–2240, 2007.
[8] G. Guo, G. Mu, Y. Fu, and T. Huang, “Human age estimation using
bio-inspired features,” in Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on, 2009, pp. 112–119.
[9] R. Rothe, R. Timofte, and L. V. Gool, “Dex: Deep expectation of
apparent age from a single image,” in IEEE International Conference
on Computer Vision Workshops (ICCVW), December 2015.
[10] Y. Fu and T. S. Huang, “Human age estimation with regression on
discriminative aging manifold,” IEEE Transactions on Multimedia,
vol. 10, no. 4, pp. 578–584, 2008.
[11] G. Guo and G. Mu, “Simultaneous dimensionality reduction and
human age estimation via kernel partial least squares regression,”
in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE
Conference on. IEEE, 2011, pp. 657–664.
[12] G. Guo. and G. Mu., “Joint estimation of age, gender and ethnicity:
Cca vs. pls,” in Automatic Face and Gesture Recognition (FG), 2013
10th IEEE International Conference and Workshops on. IEEE, 2013,
pp. 1–6.
[13] T. Liu, Z. Lei, J. Wan, and S. Z. Li, “Dfdnet: discriminant face
descriptor network for facial age estimation,” in Chinese Conference
on Biometric Recognition. Springer, 2015, pp. 649–658.
[14] D. Yi, Z. Lei, and S. Z. Li, “Age estimation by multi-scale convolu-
tional network,” in Computer Vision–ACCV 2014. Springer, 2014,
pp. 144–158.
[15] P. Geladi and B. R. Kowalski, “Partial least-squares regression: a
tutorial,” Analytica Chimica Acta, vol. 185, no. 86, pp. 1–17, 1986.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. ?, NO. ?, ? ? 13
[16] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical
correlation analysis: an overview with application to learning
methods.” Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.
[17] D. Basak, S. Pal, and D. C. Patranabis, “Support vector regression,”
Neural Information Processing-Letters and Reviews, vol. 11, no. 10, pp.
203–224, 2007.
[18] R. Can Malli, M. Aygun, and H. Kemal Ekenel, “Apparent age
estimation using ensemble of deep learning models,” in The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR) Work-
shops, June 2016.
[19] Y. Zhu, Y. Li, G. Mu, and G. Guo, “A study on apparent age
estimation,” in The IEEE International Conference on Computer Vision
(ICCV) Workshops, December 2015.
[20] H. F. Yang, B. Y. Lin, K. Y. Chang, and C. S. Chen, “Automatic
age estimation from face images via deep ranking,” in Proc. British
Machine Vision Conference, 2015.
[21] Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua, “Ordinal regres-
sion with multiple output cnn for age estimation,” in The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), June
2016.
[22] R. Longadge and S. Dongre, “Class imbalance problem in data
mining review,” arXiv preprint arXiv:1305.1707, 2013.
[23] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE
Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp.
1263–1284, Sept 2009.
[24] “The fg-net aging database,” http://www.fgnet.rsunit.com/.
[25] K. Ricanek and T. Tesafaye, “Morph: a longitudinal image
database of normal adult age-progression,” in International Confer-
ence on Automatic Face and Gesture Recognition, 2006, pp. 341–345.
[26] B. C. Chen, C. S. Chen, and W. H. Hsu, “Face recognition and
retrieval using cross-age reference coding with cross-age celebrity
dataset,” IEEE Transactions on Multimedia, vol. 17, no. 6, pp. 804–
815, June 2015.
[27] S. Escalera, J. Fabian, P. Pardo, X. Baró, J. Gonzalez, H. J. Escalante,
D. Misevic, U. Steiner, and I. Guyon, “Chalearn looking at people
2015: Apparent age and cultural event recognition datasets and
results,” in Proceedings of the IEEE International Conference on Com-
puter Vision Workshops, 2015, pp. 1–9.
[28] K. Simonyan and A. Zisserman, “Very deep convolutional networks
for large-scale image recognition,” arXiv preprint
arXiv:1409.1556, 2014.
[29] Y. H. Kwon and N. da Vitoria Lobo, “Age classification from facial images,”
Computer Vision and Image Understanding, vol. 74, no. 1, pp. 1–21,
1999.
[30] A. Lanitis, C. Draganova, and C. Christodoulou, “Comparing
different classifiers for automatic age estimation,” Systems, Man,
and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 34,
no. 1, pp. 621–628, 2004.
[31] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance
models,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, vol. 23, no. 6, pp. 681–685, 2001.
[32] X. Geng, Z. H. Zhou, Y. Zhang, G. Li, and H. Dai, “Learning from
facial aging patterns for automatic age estimation,” in ACM Inter-
national Conference on Multimedia, Santa Barbara, Ca, Usa, October,
2006, pp. 307–316.
[33] G. Guo, Y. Fu, C. R. Dyer, and T. S. Huang, “Image-based human
age estimation by manifold learning and locally adjusted robust
regression.” IEEE Transactions on Image Processing A Publication of
the IEEE Signal Processing Society, vol. 17, no. 7, pp. 1178–1188,
2008.
[34] Y. Fu, Y. Xu, and T. S. Huang, “Estimating human age by manifold
analysis of face pictures and regression on aging features,” in IEEE
International Conference on Multimedia and Expo, 2007, pp. 1383–
1386.
[35] F. Gao and H. Ai, “Face age classification on consumer images
with gabor feature and fuzzy lda method,” in Advances in biomet-
rics. Springer, 2009, pp. 132–141.
[36] A. Günay and V. V. Nabiyev, “Automatic age classification with
lbp,” in Computer and Information Sciences, 2008. ISCIS’08. 23rd
International Symposium on. IEEE, 2008, pp. 1–4.
[37] X. Geng, Q. Wang, and Y. Xia, “Facial age estimation by adaptive
label distribution learning,” in International Conference on Pattern
Recognition, 2014, pp. 4465–4470.
[38] Z. Tan, Z. Shuai, W. Jun, L. Zhen, and S. Z. Li, “Age estimation
based on a single network with soft softmax of aging modeling,”
in Computer Vision–ACCV 2016, 2016.
[39] Z. Huo, X. Yang, C. Xing, Y. Zhou, P. Hou, J. Lv, and X. Geng,
“Deep age distribution learning for apparent age estimation,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition Workshops, 2016, pp. 17–24.
[40] G. Antipov, M. Baccouche, S.-A. Berrani, and J.-L. Dugelay, “Ap-
parent age estimation from face images combining general and
children-specialized deep learning models,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition Work-
shops, 2016, pp. 96–104.
[41] X. Liu, S. Li, M. Kan, J. Zhang, S. Wu, W. Liu, H. Han, S. Shan,
and X. Chen, “Agenet: Deeply learned regressor and classifier
for robust apparent age estimation,” in The IEEE International
Conference on Computer Vision (ICCV) Workshops, December 2015.
[42] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with
convolutions,” in The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2015.
[43] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classifi-
cation with deep convolutional neural networks,” in Advances in
neural information processing systems, 2012, pp. 1097–1105.
[44] S. Escalera, M. Torres Torres, B. Martinez, X. Baro, H. Jair Escalante,
I. Guyon, G. Tzimiropoulos, C. Corneanu, M. Oliu, M. Ali Bagheri,
and M. Valstar, “Chalearn looking at people and faces
of the world: Face analysis workshop and challenge 2016,” in The
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Workshops, June 2016.
[45] K. Y. Chang, C. S. Chen, and Y. P. Hung, “Ordinal hyperplanes
ranker with cost sensitivities for age estimation,” in Computer
Vision and Pattern Recognition, 2011, pp. 585–592.
[46] K. Chen, S. Gong, T. Xiang, and C. L. Chen, “Cumulative attribute
space for age and crowd density estimation,” in IEEE Conference
on Computer Vision and Pattern Recognition, 2013, pp. 2467–2474.
[47] X. Wang, R. Guo, and C. Kambhamettu, “Deeply-learned feature
for age estimation,” in IEEE Winter Conference on Applications of
Computer Vision, 2015, pp. 534–541.
[48] P. Viola and M. Jones, “Rapid object detection using a boosted
cascade of simple features,” in Computer Vision and Pattern Recogni-
tion, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society
Conference on, vol. 1. IEEE, 2001, pp. I–I.
[49] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, “Active
shape models–their training and application,” Computer Vision and
Image Understanding, vol. 61, no. 1, pp. 38–59, 1995.
[50] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture
for fast feature embedding,” in Proceedings of the 22nd ACM inter-
national conference on Multimedia. ACM, 2014, pp. 675–678.
[51] M. Yang, S. Zhu, F. Lv, and K. Yu, “Correspondence driven
adaptation for human profile recognition,” in Computer Vision and
Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011,
pp. 505–512.
[52] H. Han, C. Otto, X. Liu, and A. K. Jain, “Demographic estimation
from face images: Human vs. machine performance,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, vol. 37, no. 6,
pp. 1148–1161, 2015.
[53] Y. Zhang and D.-Y. Yeung, “Multi-task warped gaussian process
for personalized age estimation,” in Computer Vision and Pattern
Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2622–
2629.
[54] R. Rothe, R. Timofte, and L. Van Gool, “Some like it hot-visual
guidance for preference prediction,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2016, pp.
5553–5561.
[55] X. Yang, B. B. Gao, C. Xing, Z. W. Huo, X. S. Wei, Y. Zhou, J. Wu,
and X. Geng, “Deep label distribution learning for apparent age
estimation,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition Workshops, 2015, pp. 344–350.
[56] G. Levi and T. Hassner, “Age and gender classification using con-
volutional neural networks,” in The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) Workshops, June 2015.
[57] Z. Huo, X. Yang, C. Xing, Y. Zhou, P. Hou, J. Lv, and X. Geng,
“Deep age distribution learning for apparent age estimation,” in
The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) Workshops, June 2016.
[58] M. Uricar, R. Timofte, R. Rothe, J. Matas, and L. Van Gool,
“Structured output svm prediction of apparent age, gender and
smile from deep features,” in The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) Workshops, June 2016.
[59] F. Gurpinar, H. Kaya, H. Dibeklioglu, and A. Salah, “Kernel elm
and cnn based facial age estimation,” in The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) Workshops, June
2016.
[60] X. Baro, J. Gonzalez, J. Fabian, M. A. Bautista, M. Oliu, H. J.
Escalante, I. Guyon, and S. Escalera, “Chalearn looking at people
2015 challenges: Action spotting and cultural event recognition,”
in IEEE Conference on Computer Vision and Pattern Recognition
Workshops, 2015, pp. 1–9.
Zichang Tan received the B.E. degree from the Department of Automation, Huazhong University of Science and Technology (HUST), Wuhan, China, in 2016. He was named an outstanding graduate of the college upon graduation. He is currently pursuing a Ph.D. degree at the Institute of Automation, Chinese Academy of Sciences (CASIA). His main research interests include deep learning, face attribute analysis and face recognition.
Jun Wan received his B.S. degree from the China University of Geosciences, Beijing, China, in 2008, and the Ph.D. degree from the Institute of Information Science, Beijing Jiaotong University, Beijing, China, in 2015. Since January 2015, he has been an assistant professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA). He received the 2012 ChaLearn One-Shot-Learning Gesture Challenge Award, sponsored by Microsoft, at ICPR 2012. He also received the 2013 and 2014 Best Paper Awards from the Institute of Information Science, Beijing Jiaotong University. His main research interests include computer vision and machine learning, especially gesture and action recognition and facial attribute analysis (i.e., age estimation, facial expression, gender and race classification). He has published papers in top journals such as JMLR, TPAMI, TIP, and TCYB. He has served as a reviewer for several top journals and conferences, such as JMLR, TPAMI, TIP, TMM, TSMC, PR, ICPR 2016, CVPR 2017, ICCV 2017, and FG 2017.
Zhen Lei received the B.S. degree in automation from the University of Science and Technology of China in 2005, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, in 2010, where he is currently an Associate Professor. He has published over 100 papers in international journals and conferences. His research interests are in computer vision, pattern recognition, image processing, and face recognition in particular. He served as an Area Chair of the International Joint Conference on Biometrics in 2014, the IAPR/IEEE International Conference on Biometrics in 2015, 2016, and 2018, and the IEEE International Conference on Automatic Face and Gesture Recognition in 2015.
Ruicong Zhi received the Ph.D. degree in signal and information processing from Beijing Jiaotong University in 2010. From 2008 to 2009, she visited the Sound and Image Processing Laboratory, Royal Institute of Technology (KTH), as a joint Ph.D. student. She is currently an associate professor in the School of Computer and Communication Engineering, University of Science and Technology Beijing. She has published more than 50 papers and holds six patents. She has received more than ten awards, including a National Excellent Doctoral Dissertation Award nomination and the Prize of Science and Technology of Beijing. Her research interests include facial and behavior analysis, emotion analysis, image processing and pattern recognition.
Guodong Guo (M’07-SM’07) received the B.E. degree in automation from Tsinghua University, Beijing, China, the Ph.D. degree in pattern recognition and intelligent control from the Chinese Academy of Sciences, Beijing, China, and the Ph.D. degree in computer science from the University of Wisconsin-Madison, Madison, WI, USA. He is an Associate Professor with the Department of Computer Science and Electrical Engineering, West Virginia University (WVU), Morgantown, WV, USA. In the past, he visited and worked in several places, including INRIA, Sophia Antipolis, France; Ritsumeikan University, Kyoto, Japan; Microsoft Research, Beijing, China; and North Carolina Central University. He authored a book, Face, Expression, and Iris Recognition Using Learning-based Approaches (2008), co-edited a book, Support Vector Machines Applications (2014), and published about 100 technical papers. His research interests include computer vision, machine learning, and multimedia. He received the North Carolina State Award for Excellence in Innovation in 2008, Outstanding Researcher (2013-2014) at CEMR, WVU, and New Researcher of the Year (2010-2011) at CEMR, WVU. He was selected as the “People’s Hero of the Week” by BSJB under the Minority Media and Telecommunications Council (MMTC) on July 29, 2013. Two of his papers were selected as “The Best of FG’13” and “The Best of FG’15”, respectively.
Stan Z. Li received the B.Eng. degree from Hunan University, China, the M.Eng. degree from the National University of Defense Technology, China, and the Ph.D. degree from Surrey University, United Kingdom. He is currently a professor and the director of the Center for Biometrics and Security Research (CBSR), Institute of Automation, Chinese Academy of Sciences (CASIA). He was a researcher at Microsoft Research Asia from 2000 to 2004. Prior to that, he was an associate professor at Nanyang Technological University, Singapore. His research interests include pattern recognition and machine learning, image and vision processing, face recognition, biometrics, and intelligent video surveillance. He has published more than 200 papers in international journals and conferences, and authored and edited eight books. He was an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and is acting as the editor-in-chief of the Encyclopedia of Biometrics. He served as a program co-chair for the International Conference on Biometrics in 2007 and 2009, and has been involved in organizing other international conferences and workshops in his fields of research interest. He was elevated to IEEE Fellow for his contributions to the fields of face recognition, pattern recognition and computer vision, and he is a member of the IEEE Computer Society.