Citation: Chen, H., Hu, G., Lei, Z., Chen, Y., Robertson, N., & Li, S. (2019). Attention-Based Two-Stream Convolutional Networks for Face Spoofing Detection. IEEE Transactions on Information Forensics and Security. https://doi.org/10.1109/TIFS.2019.2922241
Attention-Based Two-Stream Convolutional
Networks for Face Spoofing Detection
Haonan Chen1*, Guosheng Hu2*, Zhen Lei3, Yaowu Chen1,4, Neil M. Robertson2 and Stan Z. Li3
1Zhejiang Provincial Key Laboratory for Network Multimedia Technologies, Zhejiang University
2Queen's University Belfast
3Institute of Automation, Chinese Academy of Sciences
*Equal Contribution
4Corresponding author (cyw@mail.bme.zju.edu.cn)
Abstract—Since the human face preserves the richest information for recognizing individuals, face recognition has been widely investigated and has achieved great success in various applications over the past decades. However, face spoofing attacks (e.g. face video replay attacks) remain a threat to modern face recognition systems. Though many effective anti-spoofing methods have been proposed, we find that the performance of many existing methods is degraded by illumination variations. This motivates us to develop illumination-invariant methods for anti-spoofing. In this paper, we propose a two-stream convolutional neural network (TSCNN) which works on two complementary spaces: RGB space (the original imaging space) and multi-scale retinex (MSR) space (an illumination-invariant space). Specifically, RGB space contains detailed facial textures yet is sensitive to illumination; MSR space is invariant to illumination yet contains less detailed facial information. In addition, MSR images effectively capture high-frequency information, which is discriminative for face spoofing detection. Images from the two spaces are fed to the TSCNN to learn discriminative features for anti-spoofing. To effectively fuse the features from the two sources (RGB and MSR), we propose an attention-based fusion method, which can effectively capture the complementarity of the two features. We evaluate the proposed framework on several databases, i.e. CASIA-FASD, REPLAY-ATTACK and OULU, and achieve very competitive performance. To further verify the generalization capacity of the proposed strategies, we conduct cross-database experiments, and the results show the great effectiveness of our method.
Index Terms—Face spoofing, multi-scale retinex, deep learning,
attention model, feature fusion.
I. INTRODUCTION
COMPARED with traditional authentication approaches such as passwords, verification codes and secret questions, biometric authentication is more user-friendly. Since the human face preserves rich information for recognizing individuals, the face has become the most popular biometric cue, offering excellent identity recognition performance. Person identification can now easily use face images captured at a distance, without physical contact, by the camera of a mobile device, e.g. a mobile phone.
As face recognition systems become increasingly popular with the spread of mobile phones, their security weaknesses become increasingly conspicuous. For example, owing to the popularity of social networks, it is quite easy to obtain a person's face image from the Internet and use it to
attack a face recognition system. Hence, face spoofing detection has drawn considerable attention and has motivated a great number of studies in the past few years.
In general, there are mainly four types of face spoofing
attacks: photo attack, masking attack, video replay attack and
3D attack. Due to the high cost of masking attacks and 3D attacks, the photo attack and the video replay attack are the two most common. Photo and video replay
attacks can be launched with still face images and videos of
the user in front of the camera, which are actually recaptured
from the real ones. Obviously, the recaptured image is of
lower quality compared with the real one in the same capture
conditions. The lower quality of attacks can result from: lack
of high frequency information [1]–[5], image banding or moire
effects [6], [7], video noise signatures, etc. Clearly, these image quality degradation factors can serve as useful cues to distinguish real faces from fake ones.
Face spoofing detection, which is also called face liveness
detection, has been designed to counter different types of
spoofing attacks. Face spoofing detection usually works as a
preprocessing step of the face recognition systems to judge
whether the face image is acquired from a real person or a
printed photo (replay video). Therefore, face spoofing detec-
tion is actually a binary classification problem.
To counter the face spoofing attacks, there are mainly
four solutions available in the research literature: (1) micro-
texture based methods, (2) image quality based methods, (3)
motion based methods, and (4) reflectance based methods.
For (1), local micro-texture features have been demonstrated to be a useful cue against photo and video attacks. Early texture-based methods fed hand-crafted features extracted from the facial texture to classifiers [8]–[12]. With the development of deep learning, CNNs [13]–[15] have been utilized to learn discriminative features for face spoofing detection. For (2), the low imaging quality of fake images offers useful clues [1]–[7], e.g. the loss of high-frequency information; these clues have been used successfully for spoofing detection. For (3), motion-based methods mainly include physiological-reaction-based [16]–[18] and physical-movement-based [19], [20] approaches. Motion-based methods may become less effective against video replay attacks, which can present facial motions. For (4), the reflectance of the face image is
the facial motions. For (4), reflectance of the face image is
another widely used cue for liveness detection because the
lighting reflectance from real face (3D) and attacking (mostly
2
Fig. 1. Motivation of the fusion of RGB (Col 1) and MSR (Col 3) images. The
individual feature scores of RGB (Col 2) and MSR (Col 4) and fused scores
(Col 5) are shown. The fused scores are improved compared with individual
scores.
2D, such as photo and replay attacks) face is very different
[1], [21], [22].
In this work, we propose a novel deep-learning-based micro-texture-based (MTB) method. Existing MTB methods usually process and analyze the input images in the original RGB color space. However, RGB images are sensitive to illumination, so RGB-based MTB methods can suffer reduced performance under varying illumination. This motivates us to develop an illumination-robust MTB method. Therefore, we propose a two-stream convolutional neural network (TSCNN) which is trained on two complementary spaces: RGB space (the original space) and multi-scale retinex (MSR) [23] space (an illumination-invariant space).
First, both RGB and MSR images contain discriminative
information: RGB images can be used to train end-to-end
discriminative CNNs for spoofing detection; MSR can capture
high-frequency information, and this information has been verified to be particularly effective for spoofing detection. Second, RGB and
MSR images are complementary: RGB space contains the
detailed facial information yet is sensitive to illumination;
MSR is invariant to illumination yet contains less detailed
facial information. In the framework of TSCNN, the RGB and
MSR images are fed to two CNNs (two branches of TSCNN)
separately and generate two features which are discriminative
for anti-spoofing. To effectively fuse these two features, we propose a learning-based fusion method inspired by the attention mechanism [24], detailed in Section III-C. Unlike commonly used fusion methods, e.g. feature averaging, our attention-based fusion can adaptively weight the features to obtain better fused-feature performance. Fig. 1 shows the complementarity of RGB and MSR and the importance of the feature fusion. Our contributions can be summarized as follows:
• We propose a two-stream CNN (TSCNN) which accepts two complementary sources of information (RGB and MSR images) as input. To our knowledge, we are the first to investigate the fusion of these two discriminative clues (RGB and MSR) for face anti-spoofing.
• To adaptively and effectively fuse the two features generated by the TSCNN, we propose an attention-based fusion method. The proposed fusion method helps the TSCNN generalize well to images under various lighting conditions.
• We conduct extensive evaluations on three popular anti-spoofing databases: CASIA-FASD, REPLAY-ATTACK and OULU. The results show the effectiveness of the proposed strategies. In addition, we run cross-database experiments with very competitive results, showing the great generalization capacity of the proposed method.
II. RELATED WORK
A. Face Spoofing Detection
In recent years, various methods have been proposed for
face spoofing detection. In this section, we briefly review the
existing anti-spoofing methods.
Texture Based Methods Texture based methods focus on
exploring different texture-based features for face spoofing
detection. The features can be simply classified as: hand-
crafted features and deep learning based features.
We first introduce hand-crafted feature based methods. Based on the idea that specific frequency bands preserve most of the texture information of real faces, the work in [3] employed various difference-of-Gaussian filters to select a favorable frequency band for detection. Texture features used in face detection and face recognition tasks can be transferred to face spoofing detection and perform quite well.
Apart from hand-crafted features, deep learning based features, in particular CNN features, have achieved great success in recent years. In this category, a CNN learns the discriminative features for liveness detection, with a large amount of training data guiding the CNN towards an effective feature. [25] extracts local texture features and depth features from face images and fuses them for face spoofing detection. Furthermore, an LSTM-CNN architecture [26] was proposed to fuse the predictions over multiple frames of a video, which proved effective for video face spoofing detection.
Image Quality Based Methods Methods in this category are motivated by the fact that photos and replayed videos are likely to suffer image quality degradation in the recapture process. The method in [1] analyzes attack photos in the 2D Fourier spectrum, showing interesting results; however, the performance might drop for higher-quality image data. Moreover, in [5], an image quality based method was proposed that applies chromatic moment, specular reflection, blurriness and color diversity features.
Motion Based Methods This type of method aims to use physiological reaction motions such as eye blinking, lip movements and head motions to distinguish a real face from a fake one. In [20], movements of different facial parts were extracted as features for this task. Though physiological-sign-based methods have shown satisfactory performance in countering printed photo attacks with user cooperation, they may become less effective against video replay attacks. [27] advances a method for facial anti-spoofing by applying dynamic mode decomposition (DMD), which can conveniently represent the temporal information of a replayed video as a single image with the same dimensions as the video frames. This motion-based method is shown to be less time-consuming and more accurate.
Reflectance Based Methods The reflectance differences
between the real and fake faces, in particular for the print
attack and replay attack, can offer important information for
face spoofing detection. The reflectance cue from a single
image has been used to detect face spoofing [1], [22]. [28] utilizes the different multi-spectral reflectance distributions to distinguish real and fake faces based on the Lambertian model.
Multi-Feature Fusion Based Methods The fusion of multiple features shows improved accuracy compared to individual features. [29] proposed to fuse a video motion feature and a texture feature to determine the authenticity of a face: the authors obtain a motion image from the face video and an LBP feature from the last frame, fuse them, and use linear discriminant analysis (LDA) for classification. [9] extracts texture features from three multi-scale filtering methods, then the resulting features are concatenated to form the fused feature for classification.
Other Methods Apart from the aforementioned methods, additional hardware can also be employed for face spoofing detection, exploiting cues that cannot be obtained from images directly captured by a standard camera, such as 3D depth information [30]–[32] and multi-spectral or infrared (IR) images. [30] proposed a method for face liveness detection based on 3D projective invariants. In [31], the authors proposed to recover sparse 3D shapes of face images to counter different kinds of photo attacks.
Summary The methods introduced above can usually achieve promising anti-spoofing performance in the intra-database scenario; however, it is still challenging to achieve strong performance in the inter-database scenario. The degraded generalization capacity results from many cross-database factors: different capture devices, different imaging environments, different illuminations, different facial poses, etc. In this work, we propose an anti-spoofing method which is illumination-robust, generalizes well to environments with and without strong illumination, and achieves promising cross-database performance.
B. Multi-Scale Retinex
Much related research has been conducted to simulate the human visual system using different luminance algorithms. Land and McCann [33] proposed a lightness model, the Retinex theory, to measure the lightness reflection in an image. Since then, the Retinex algorithm has been successfully applied to image enhancement [34], [35]. [36] introduced a model called Single Scale Retinex (SSR), which applies a Gaussian filter to normalize the illumination of the source image. The work in [37] focused on the filter of the SSR, employing an improved SSR with a guided filter, and achieved promising image enhancement performance. The performance of the SSR algorithm is highly dependent on the parameter of the Gaussian filter. To overcome this limitation, a multi-scale Retinex (MSR) model [23], which weights the outputs of several SSRs, was proposed. [38] proposed a novel MSR based on adaptive weights to aggregate the SSRs and applied it to image contrast enhancement. In our work, we apply MSR because: (1) MSR can separate an image into an illumination component and a reflectance component, and the illumination-free reflectance component is used for liveness detection; (2) the MSR algorithm can be regarded as an optimized high-pass filter, so it effectively preserves the high-frequency components which are discriminative between real and fake faces.
C. Feature Fusion
Existing fusion methods fall into two groups: early fusion (feature-level fusion) and late fusion (score-level fusion). Feature aggregation and subspace learning are forms of early fusion. Aggregation approaches usually perform simple element-wise averaging or concatenation [39]. Subspace learning methods aim to project the concatenated feature into a subspace that makes the best use of the complementarity of the features. Late fusion fuses the scores predicted by different classifiers, by averaging [40] or by stacking another classifier on the results [41]. For deep learning tasks, researchers usually use simple methods for fusing deep features, such as score fusion, feature averaging, etc. In our work, we propose an attention-based fusion method, aiming to make the best use of the features to be fused.
D. Visual Attention Model
Visual attention is a powerful mechanism that enables perception to focus on the important parts that offer more information. To combine spatial and temporal information, [42] employed an end-to-end deep neural network. In [43], the authors proposed a novel visual attention model to integrate different spatial features, including color, orientation and luminance-orientation features, which can reflect the regions of interest of the human visual system. Different attention mechanisms have been employed for computer vision tasks, including action recognition [44], emotion recognition [45] and image classification [46]. On the whole, the attention model is usually used for aggregating features extracted from different images. Inspired by the great success of attention models, we apply an attention model to fuse our features derived from RGB images and MSR images.
III. METHODOLOGY
Spoofing detection is actually a binary (real vs. fake face)
classification problem. In the deep learning era, a natural solution to this task is to feed the input RGB images to a carefully
designed CNN with classification loss (softmax and cross
entropy loss) for end-to-end training. This CNN-based frame-
work has been widely investigated by [25], [26], [47]–[50].
Despite the strong nonlinear feature learning capacity of
deep learning, the performance of anti-spoofing degrades when
the input images are captured by different devices, under dif-
ferent lighting, etc. In this work, we aim to train a CNN which
generalizes better to various environments, mainly various
lightings.
RGB images are sensitive to illumination variations yet contain very detailed facial texture information. Motivated by the extensive research on (single-scale and multi-scale) Retinex images, we find that the Retinex image (we use Multi-Scale Retinex, MSR, in this work) is invariant to illumination yet loses only minor facial texture. Thus, in this work, we propose a
Fig. 2. (A) is the overall pipeline; In (B), every single block represents one SSR module. The outputs of all SSR modules are weighted with scale parameters
to form MSR; (C) illustrates the work flow of attention-based fusion.
two-stream CNN (TSCNN) which trains two separate CNNs
accepting RGB images and MSR images as input respectively.
To effectively fuse RGB feature and MSR feature, we propose
an attention based fusion method.
In this section, firstly, we introduce the theory of the Retinex
to explain the reason why MSR image is discriminative for
anti-spoofing. After that, the complementarity of the RGB
and MSR features is analyzed and the proposed TSCNN is
detailed. Last, we introduce our attention-based feature fusion
method.
A. The Retinex Theory
Assumption Retinex theory was first proposed by Land and McCann in 1971 [33]. The word 'Retinex' is a portmanteau of 'retina' and 'cortex', reflecting how the human visual system works. The Retinex theory is based on the assumption that the color of an object is determined by its ability to reflect light of different wavelengths, and is not affected by non-uniform illumination. The theory separates the source image S(x, y) into two parts: the reflectance R(x, y) and the illumination L(x, y). In particular, R(x, y) and L(x, y) contain different frequency components: R(x, y) concentrates the high-frequency components, while L(x, y) tends towards the low-frequency components. We formulate Retinex by Eq. (1):

S(x, y) = R(x, y) · L(x, y)    (1)

where x and y are image pixel coordinates.
Motivation L(x, y) and R(x, y) represent the illumination and reflectance (facial skin texture in our task) components respectively. L(x, y) is determined by the light source, while R(x, y) is determined by the properties of the surface of the captured object, i.e. the face in our application. Illumination is clearly not relevant to most classification tasks, including face spoofing detection; thus the separation of illumination and reflectance (texture) is important, because only the separated reflectance is used for illumination-invariant classification. Since Retinex theory aims to perform this separation, Retinex is used in this work for illumination-invariant face spoofing detection.
Computation For the convenience of calculation, Eq. (1) is usually transformed into the logarithmic domain:

log[S(x, y)] = log[R(x, y)] + log[L(x, y)]    (2)

where log[S(x, y)], log[R(x, y)], and log[L(x, y)] are written as s(x, y), r(x, y), and l(x, y) for convenience.
Since s(x, y) is the logarithmic form of the original image, we can calculate the Retinex output r(x, y) by estimating l(x, y). Thus, the performance of the Retinex is determined by the estimation of l(x, y), and selecting an appropriate method to estimate l(x, y) is a crucial step for illumination normalization.
Summarizing previous work on the Retinex, the illumination image can be estimated from the source image using a center/surround Retinex. Single-scale Retinex (SSR) [36] is a center/surround Retinex and is formulated as Eq. (3):

r(x, y) = s(x, y) − log[S(x, y) ∗ F(x, y)]    (3)

where F(x, y) denotes the surround function and the symbol '∗' is the convolution operation. There are several forms of the surround function, which determine the behavior of the SSR. The work [36] shows that a Gaussian filter works well for illumination normalization:

G(x, y) = K · e^{−(x² + y²)/c}    (4)

where c is the scale parameter of the Gaussian surround function, whose value is determined empirically, and K is selected to satisfy:

∬ F(x, y) dx dy = 1    (5)

Letting G(x, y) be the surround function F(x, y), Eq. (3) can be rewritten as:

r(x, y) = s(x, y) − log[S(x, y) ∗ G(x, y)]    (6)
Large illumination discontinuities produce halo effects which are often visible. This limitation motivates extending SSR to a more balanced method, multi-scale retinex (MSR) [23], which superposes the outputs of several SSRs with small, middle, and large scale parameters at certain weights, as shown in Fig. 2 (B). Specifically,

r_MSR(x, y) = Σ_{i=1}^{k} w_i { log[S(x, y)] − log[S(x, y) ∗ G_i(x, y)] }    (7)
Summary Retinex (MSR in our work) is used for face spoofing detection for two reasons. (1) MSR can separate illumination and reflectance; in this work, we use the reflectance image (the MSR image) to train a CNN for illumination-invariant face spoofing detection. (2) Since the fake face image is, in many cases, a recaptured image, it may lose some high-frequency information compared to a genuine one. Thus, high-frequency information can serve as a discriminative clue for anti-spoofing, and the MSR algorithm can be viewed as an optimized high-pass filter that captures this high-frequency information for spoofing detection.
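As a concrete illustration of Eqs. (3)–(7), a minimal NumPy/OpenCV sketch of the SSR/MSR computation is given below. The scale values and the equal weights w_i = 1/k are illustrative assumptions, not necessarily the settings used in our experiments.

```python
import cv2
import numpy as np

def single_scale_retinex(img, c):
    """SSR (Eq. 6): r = log(S) - log(S * G_c); c is used here as the Gaussian sigma."""
    img = img.astype(np.float64) + 1.0          # avoid log(0)
    blurred = cv2.GaussianBlur(img, (0, 0), c)  # S * G_c (Gaussian surround)
    return np.log(img) - np.log(blurred)

def multi_scale_retinex(img, scales=(15, 80, 250), weights=None):
    """MSR (Eq. 7): weighted sum of SSR outputs at several scales."""
    if weights is None:
        weights = [1.0 / len(scales)] * len(scales)
    msr = np.zeros_like(img, dtype=np.float64)
    for w, c in zip(weights, scales):
        msr += w * single_scale_retinex(img, c)
    # Stretch to [0, 255] so the result can be used as a network input image.
    msr = (msr - msr.min()) / (msr.max() - msr.min() + 1e-8)
    return (msr * 255).astype(np.uint8)

# Usage: grayscale face crop -> MSR image fed to the MSR stream.
# face_gray = cv2.imread("face.png", cv2.IMREAD_GRAYSCALE)
# msr_img = multi_scale_retinex(face_gray)
```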
B. Two Stream Convolutional Neural Network (TSCNN)
In this section, we introduce our anti-spoofing framework: the TSCNN. Specifically, the original RGB images are converted to MSR images offline. The two image sources (RGB and MSR) are separately fed to two CNNs for end-to-end training with a cross-entropy binary classification loss. The two learned features (derived from the RGB and MSR images) are then fused using a learned attention mechanism. In the remaining parts of this section, we detail each component of our framework.
Complementarity of RGB and MSR Images RGB color
space is commonly used for capturing and displaying color
images. The advantage of the use of RGB images is clear:
RGB images can naturally capture detailed facial texture
which is discriminative for spoofing detection. However, the
disadvantage of RGB image is that it is very sensitive to
illumination variation. The intrinsic reason is that RGB space
has high correlation between the three color channels, making
it rather difficult to separate the luminance and chrominance
information. Because the luminance conditions of face images
in real world are different and the separation of luminance
(illumination) and chrominance (skin color) is rather difficult,
the features learned from RGB space tend to be affected by
illumination.
The MSR algorithm can produce an illumination-invariant face image by removing the illumination effects, as introduced in Section III-A. Thus, the MSR face image preserves the micro-texture information of the facial skin without illumination effects. Apart from this illumination-invariant merit, MSR images also provide discriminative information for spoofing detection. Specifically, the MSR algorithm removes the low-frequency components (illumination) from the original image and keeps the high-frequency ones (texture details). This high-frequency information is discriminative for spoofing detection because real faces have rich facial texture details, while fake faces, in particular recaptured faces, lose some of these details.
As analyzed above, RGB and MSR images are comple-
mentary because: RGB images contain detailed facial texture
yet are sensitive to illuminations; while MSR images contain
less detailed texture yet are illumination invariant. In addition,
MSR images can keep high frequency information, which is
also discriminative for spoofing detection.
Two-stream Architecture Our method is motivated by the
fact that both RGB and MSR features are discriminative for
face spoofing detection. It is natural to train CNNs using
these two sources of information. Therefore, in this work we propose a two-stream convolutional neural network (TSCNN), as shown in Fig. 2 (A). The TSCNN consists of two identical sub-networks with different inputs (RGB and MSR images), and extracts the learned features, derived from the RGB and MSR images, after the last convolution layer of the two sub-
networks. Given one input image/frame, we use MTCNN [51]
for face and landmark detection. Then the detected faces are
aligned using affine transformation. The RGB stream operates
on single RGB frames extracted from a video sequence. For
the MSR stream, the single RGB frames (processed to gray
scale first) are converted to MSR images as shown in Fig.2-
(B). Then MSR images are fed to the MSR subnetwork for
training. Each stream is based on the same network, in this
work, we use two successful networks (MobileNet [52] and
ResNet-18 [53]). To effectively fuse the features from two
streams, we propose an attention based fusion block, shown
in Fig.2-(C), which will be detailed in Section III-C.
To formulate the TSCNN framework M, we introduce a quadruplet M = (E_RGB, E_MSR, F, C). Here E_RGB and E_MSR are the feature extractors for the RGB and MSR streams respectively, F is a fusion function and C is the classifier. A feature extractor is a mapping E : I → f that takes an input image (either RGB or MSR) I and outputs a D-dimensional feature f.
Both extracted features f_RGB and f_MSR must have the same dimension D to be compatible for early (feature) fusion. In particular, f_RGB and f_MSR can be obtained via different extractors (CNNs), as long as the feature dimension is the same.
The fusion function F aggregates f_RGB and f_MSR into a fused feature v via F:
v = F(f_RGB, f_MSR)    (8)
The fused feature is then fed into a classifier C. Thus, the TSCNN can be formulated as an optimization problem:

min_w (1/N) Σ_{i=1}^{N} l[ C(F(f_RGB, f_MSR)), y ]    (9)

where l(·,·) is a loss function, N is the number of samples, and y is the one-hot label vector.
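For clarity, the following PyTorch sketch shows one possible instantiation of the quadruplet M = (E_RGB, E_MSR, F, C) of Eqs. (8)–(9), using torchvision's ResNet-18 for both extractors. The class name, the channel-replication assumption for the MSR input and the simple averaging fallback for F are illustrative assumptions; the attention fusion used as F in our framework is sketched in Section III-C.

```python
import torch
import torch.nn as nn
from torchvision import models

class TwoStreamNet(nn.Module):
    """Sketch of the TSCNN quadruplet (E_RGB, E_MSR, F, C) of Eqs. (8)-(9)."""
    def __init__(self, feat_dim=512, fusion=None):
        super().__init__()
        # Two ImageNet-pretrained ResNet-18 extractors; the final fc layer is
        # dropped so each stream outputs a 512-D feature.
        self.e_rgb = nn.Sequential(*list(models.resnet18(pretrained=True).children())[:-1])
        self.e_msr = nn.Sequential(*list(models.resnet18(pretrained=True).children())[:-1])
        self.fusion = fusion                        # F: e.g. the attention fusion of Sec. III-C
        self.classifier = nn.Linear(feat_dim, 2)    # C: real vs. fake

    def forward(self, x_rgb, x_msr):
        # x_msr: the single-channel MSR image is assumed to be replicated to
        # three channels before being fed to the network.
        f_rgb = self.e_rgb(x_rgb).flatten(1)        # f_RGB
        f_msr = self.e_msr(x_msr).flatten(1)        # f_MSR
        v = self.fusion(f_rgb, f_msr) if self.fusion else 0.5 * (f_rgb + f_msr)
        return self.classifier(v)                   # logits for the cross-entropy loss
```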
Backbone Deep Networks CNNs have been successfully
applied to face anti-spoofing [25], [26], [47]–[49]. Most ex-
isting works trained their CNN models from scratch using the
existing face anti-spoofing databases, which are quite small and captured in limited environments. Since CNNs are data-hungry models, such small training sets might lead to overfitting.
To overcome overfitting and improve the performance of many
computer vision tasks, model finetuning/pretraining from big
image classification database, usually ImageNet [54], is an
effective way. In this work, we used two backbone networks
pretrained on ImageNet, i.e. MobileNet [52] (lighter, less
accurate) and ResNet-18 [53] (heavier, more accurate) for
spoofing detection.
To adapt the MobileNet and ResNet-18 models to our face anti-spoofing problem, we fine-tune the pretrained models using the face spoofing databases. The 2-class cross-entropy loss, i.e. Eq. (10), is used for binary classification (real vs. fake faces). The outputs of the bottleneck layers of the MobileNet (1024-D) and ResNet-18 (512-D) models serve as the features for anti-spoofing.
C = −(1/N) Σ_{i=1}^{N} [ y_i ln ŷ_i + (1 − y_i) ln(1 − ŷ_i) ]    (10)

where i is the index of a training sample, N is the number of training samples, ŷ_i is the predicted value of the i-th sample, and y_i is the label of the i-th sample.
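A hedged sketch of this fine-tuning step is shown below: the pretrained streams are optimized end-to-end with the 2-class cross-entropy loss of Eq. (10). SGD with momentum is assumed as the optimizer; the learning rate, momentum, epoch count and batch size follow the values reported in Section IV-B.

```python
import torch

def finetune(model, train_loader, epochs=50, lr=1e-4, momentum=0.9):
    """Fine-tune the two-stream model with the 2-class cross-entropy loss (Eq. 10)."""
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    model.train()
    for _ in range(epochs):
        for x_rgb, x_msr, labels in train_loader:  # labels: 0 = attack, 1 = real face
            optimizer.zero_grad()
            loss = criterion(model(x_rgb, x_msr), labels)
            loss.backward()
            optimizer.step()
    return model
```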
C. Attention-Based Feature Fusion
Feature fusion is important for performance improvement
in many computer vision tasks. Improper fusion methods
can make the fused feature work worse than the individual features. In the deep learning era, fusion methods including score
averaging, feature concatenation, feature averaging, feature
max pooling and feature min pooling are normally used. In
our anti-spoofing task, we find these fusion methods cannot
explore deeply the interplay of features from different sources,
therefore, we propose an attention-based fusion method as
shown in Fig.2-(C).
The proposed attention-based fusion method is actually a general framework which can be used in many deep learning based fusion scenarios, including the fusion of RGB and MSR features. Given a set of features {f_i, i = 1, ..., N}, we learn a corresponding set of weights {w_i, i = 1, ..., N} to generate the aggregated feature v:

v = Σ_{i=1}^{N} w_i f_i    (11)
Clearly, the key part of our attention method is to learn the weights {w_i} of Eq. (11). Note that our method reduces to feature average fusion if w_i = 1/N, showing the generality of our method. In our task of spoofing detection, N = 2, and the features to be fused are f_RGB and f_MSR.
Rather than learning w_i directly, we learn a kernel q which has the same dimensionality as f_i. q is used to score the feature vectors via a dot product:

d_i = qᵀ f_i    (12)

This produces a scalar d_i representing the significance of the corresponding feature. To convert the significances into weights w_i subject to Σ_i w_i = 1, we pass d_i through a softmax operator, which yields all-positive weights w_i:

w_i = e^{d_i} / Σ_j e^{d_j}    (13)
Obviously, the aggregation result v does not depend on the number of input features f_i. The only parameter to learn is the filter kernel q, which is easy to train via standard backpropagation and stochastic gradient descent.
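The attention fusion of Eqs. (11)–(13) can be expressed as a small learnable module; the PyTorch sketch below is a minimal illustration (the module name, interface and initialization are assumptions).

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse N features of dimension D via a learned kernel q (Eqs. 11-13)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.q = nn.Parameter(torch.randn(feat_dim))   # kernel q, same dimension as f_i

    def forward(self, *features):
        # features: N tensors of shape (batch, feat_dim), e.g. (f_RGB, f_MSR)
        feats = torch.stack(features, dim=1)           # (batch, N, D)
        d = feats @ self.q                             # d_i = q^T f_i, shape (batch, N)
        w = torch.softmax(d, dim=1).unsqueeze(-1)      # w_i of Eq. (13)
        return (w * feats).sum(dim=1)                  # v = sum_i w_i f_i, Eq. (11)

# Usage with the two streams of the TSCNN:
# fusion = AttentionFusion(feat_dim=512)
# v = fusion(f_rgb, f_msr)
```

With N = 2 this adds only D extra parameters, and setting q = 0 recovers feature averaging, which matches the observation that average fusion is a special case of the proposed method.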
IV. EXPERIMENTS
In this section, we conduct extensive experiments to evaluate our method. We first give a brief introduction of the three benchmark databases in Section IV-A. After that, we present the experimental settings of our method in Section IV-B so that other researchers can reproduce our results. The following sections (Sections IV-C to IV-G) present the results on the three databases. In particular, the results on CASIA-FASD are reported for the seven test scenarios.
A. Benchmark Database
In this subsection, to assess the effectiveness of our pro-
posed anti-spoofing technique, an experimental evaluation on
the CASIA Face Anti-Spoofing Database [55], the REPLAY-
ATTACK database [56] and the OULU database [57] is
provided. These three datasets consist of real client accesses
and different types of attacks, which are captured in different
imaging qualities with different cameras. In the following paragraphs, we briefly introduce each database.
1) The CASIA Face Anti-Spoofing Database (CASIA
FASD): The CASIA Face Anti-Spoofing Database is divided
into a training set consisting of 20 subjects and a test set containing 30 individuals (see Fig. 3). The fake faces were
made by capturing the genuine faces. Three different cameras
are used in this database to collect the videos with various
imaging qualities: low, normal, and high. In addition, the
individuals were asked to blink and not to keep still in the
videos to collect abundant frames for detection. Three types
of face attacks were designed as follows: 1) Warped Photo
Attack: A high resolution (1920 ×1080) image, which is
recorded by a Sony NEX-5 camera, was used to print a
photo. The attacker simulates facial motion by warping the photo in a warped photo attack. 2) Cut Photo Attack: The high resolution printed photos are then used for the cut photo attacks. In this scenario, an attacker hides behind the photo and exhibits eye blinking through the holes of the eye region, which was cut out before the attack. In addition, the attacker can put an intact photo behind the cut photo, aligning the eye region with the holes and moving the intact photo up and down slightly to simulate eye blinking. 3) Video Attack: In this attack, high resolution videos are displayed on an iPad and captured by a camera.

Fig. 3. Samples from the CASIA FASD. From top to bottom: low, normal and high quality images. From left to right: real faces and warped photo, cut photo and video replay attacks.

Fig. 4. Samples from the REPLAY-ATTACK database. The first row presents images taken in the controlled scenario, while the second row corresponds to images from the adverse scenario. From left to right: real faces and high definition, mobile and print attacks.

Fig. 5. Samples from the OULU-NPU database. From top to bottom are the three sessions with different acquisition conditions. From left to right: real faces, print attack 1, print attack 2, video attack 1 and video attack 2.
2) REPLAY-ATTACK Database: The REPLAY-ATTACK
Database consists of video recordings of real accesses and
attack attempts to 50 clients (see, Fig.4). There are 1200
videos taken by the webcam on a MacBook with the resolution
320 ×240 under two illumination conditions: 1) controlled
condition with a uniform background and light supplied by
a fluorescent lamp, 2) adverse condition with non-uniform
background and the day-light. For performance evaluation,
the data set is divided into three subsets of training (360
videos), development (360 videos), and testing (480 videos).
To generate the fake faces, high resolution videos were taken
for each person using a Canon PowerShot camera and an
iPhone 3GS camera, under the same illumination conditions.
Three types of attacks were designed: (1) Print Attacks: High
resolution pictures were printed on A4 paper and recaptured
by cameras; (2) Mobile Attacks: High resolution pictures
and videos were displayed on the screen of an iPhone 3GS
and recaptured by cameras; (3) High Definition Attacks: the pictures and videos were displayed on the screen of an iPad with a resolution of 1024 × 768.
3) OULU-NPU Database: OULU-NPU face presentation
attack database consists of 4950 real access and attack videos
that were recorded using front facing cameras of six different
mobile phones (see Fig. 5). The real videos and attack materials were collected in three sessions with different illumination conditions. The attack types considered in the OULU-NPU
database are print and video-replay. These attacks were created
using two printers (Printer 1 and 2) and two display devices
(Display 1 and 2). The videos of the real accesses and attacks,
corresponding to the 55 subjects, are divided into three subject-
disjoint subsets for training, development and testing with 20,
15 and 20 users, respectively.
B. Experimental Settings
In our experiments, we followed the protocols associated with each of the three databases, which allows a fair comparison with other state-of-the-art methods. For
CASIA FASD, the model parameters are trained and tuned
using the training set and the results are reported in terms of
Equal Error Rate (EER) on the test set. Since the REPLAY-
ATTACK database provides a validation set, the results are
given in terms of EER on the validation set and the Half Total
Error Rate (HTER) on the test set following the official test
protocol. EER is achieved at the point where the false rejection
rate (FRR) is equal to the false acceptance rate (FAR). To compute
HTER, we first compute EER and the corresponding threshold
on the validation set. Then HTER can be calculated via the
threshold on the test set.
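As an illustration of this protocol, a minimal sketch of the EER/HTER computation is given below; the score convention (higher score means more likely to be a real face) and the helper names are assumptions.

```python
import numpy as np

def far_frr(scores, labels, thr):
    """FAR: attacks accepted as real; FRR: real faces rejected. labels: 1 = real, 0 = attack."""
    far = np.mean(scores[labels == 0] >= thr)
    frr = np.mean(scores[labels == 1] < thr)
    return far, frr

def eer_threshold(dev_scores, dev_labels):
    """Find the threshold where FAR is closest to FRR on the development set (EER point)."""
    thrs = np.unique(dev_scores)
    gaps = [abs(np.subtract(*far_frr(dev_scores, dev_labels, t))) for t in thrs]
    return thrs[int(np.argmin(gaps))]

def hter(test_scores, test_labels, thr):
    """HTER: average of FAR and FRR on the test set at the dev-set EER threshold."""
    far, frr = far_frr(test_scores, test_labels, thr)
    return 0.5 * (far + frr)
```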
Following [58], we evaluate our method on OULU-NPU
database with two metrics: Attack Presentation Classification
Error Rate (APCER) (Eq. (14)) and Bona Fide Presentation
Classification Error Rate (BPCER) (Eq. (15)).
APCER_PAI = (1/N_PAI) Σ_{i=1}^{N_PAI} (1 − Res_i)    (14)

BPCER = ( Σ_{i=1}^{N_BF} Res_i ) / N_BF    (15)
where N_PAI is the number of attack presentations for a given presentation attack instrument (PAI) and N_BF is the total number of bona fide presentations. Res_i takes the value 1 if the i-th presentation is predicted as an attack, and 0 if it is predicted as bona fide. These two metrics correspond to the False Acceptance Rate (FAR) and False Rejection Rate (FRR) commonly used in the PAD literature [58], [59]. In addition, we use the average of the APCER and the BPCER, called the Average Classification Error Rate (ACER), to measure the overall performance.
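A minimal sketch of Eqs. (14)–(15) and the ACER is shown below; here res is a binary array of predictions (1 = classified as attack), pai_ids identifies the presentation attack instrument of each attack sample, and is_bona_fide marks the bona fide presentations. These names are assumptions for illustration.

```python
import numpy as np

def apcer(res, pai_ids, target_pai):
    """Eq. (14): fraction of attacks of a given PAI wrongly accepted as bona fide."""
    mask = (pai_ids == target_pai)
    return float(np.mean(1 - res[mask]))     # res_i = 1 means predicted "attack"

def bpcer(res, is_bona_fide):
    """Eq. (15): fraction of bona fide presentations classified as attacks."""
    return float(np.mean(res[is_bona_fide]))

def acer(apcer_value, bpcer_value):
    """ACER: average of the APCER and the BPCER (overall performance measure)."""
    return 0.5 * (apcer_value + bpcer_value)
```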
For operational systems, the metrics we used (EER,
HTER, APCER and BPCER) cannot quantify verification
performance. Following the Face Recognition Vendor Test
(FRVT) and the common metrics of face recognition, the
Receiver Operating Characteristic (ROC) is used to measure
the performance of liveness detection. To clearly visualize
the TPR@FAR=0.1 and TPR@FAR=0.01 in the figures, the
logarithmic coordinates are used for the X-axis of the ROC
curves.
To be consistent with many previous works, preprocessing steps are needed, consisting of frame sampling and face alignment. Since these three databases consist of videos, we extract frames from each video. After that, MTCNN [51] is used for face detection and landmark detection. The detected faces are then aligned to a size of 128 × 128. For every aligned face, we conduct data augmentation including horizontal flipping, random rotation (0-20 degrees), and random cropping (114 × 114).
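The augmentation described above might be written with torchvision transforms roughly as follows; note that RandomRotation(20) samples from [-20, 20] degrees, a slight deviation from the stated 0-20 degree range, and the remaining settings are assumptions.

```python
from torchvision import transforms

# Augmentation for each 128x128 aligned face crop (Section IV-B): horizontal
# flip, random rotation and a 114x114 random crop. The MSR image is assumed
# to be generated from the original frame before augmentation.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(20),   # rotates within [-20, 20] degrees
    transforms.RandomCrop(114),
    transforms.ToTensor(),
])
```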
For each database, we use the training set to fine-tune the MobileNet and ResNet-18 models with the cross-entropy loss, and the test set and validation set are used to evaluate the performance.
For the learning parameters, we set the momentum to 0.9 and the learning rate to 0.0001 for training the network. We observe that training converges after 50 epochs with a batch size of 128.
C. Results of CASIA-FASD
The CASIA-FASD is split into the training set comprised
of 20 subjects and the test set containing 30 individuals. For
each of the seven attacking scenarios, the data should then
be selected from the corresponding training and test sets for
model training and evaluation.
Different color spaces might lead to different performance
of anti-spoofing [48], though RGB color is the most widely
used. To explore the effect of color space, we conduct
experiments and compare the performance of three color
spaces: RGB, HSV and YCbCr. All training settings for the three color spaces are kept the same. Specifically, the original input images/frames in the database are converted to MSR images. Then the images in each color space, together with the MSR images, are fed to our TSCNN. The spoofing detection results (EER, the lower
the better) based on MobileNet and ResNet-18 are reported
in Table I. The ROC curves are shown in Fig.6-(a) and
the attention Fusion results in terms of TPR@FAR=0.1 and
TPR@FAR=0.01 are presented in Table VII.
Results: (1) From the results on the seven scenarios, RGB and YCbCr generally outperform the HSV color space using both ResNet-18 and MobileNet, and the results of RGB and YCbCr are quite similar.
(2) We can see that RGB, HSV and YCbCr features all work
better than MSR features for both MobileNet (4.931%, 5.134%
and 5.091% vs. 9.531%) and ResNet-18 (3.437%, 4.831% and
3.635% vs. 7.883%).
(3) The fusion of MSR and RGB features works better than
the fusion of MSR and HSV or MSR and YCbCr, for both MobileNet (4.175% vs. 5.061% and 4.339%) and ResNet-18 (3.145% vs. 4.661% and 4.761%). (4) The fusion of MSR and RGB features works
better than either individual feature for MobileNet (fusion: 4.175% vs. RGB: 4.931% and MSR: 9.513%). The same conclusion can be drawn for the ResNet-18 fusion. As for why RGB fuses better with MSR than HSV and YCbCr do, we believe that MSR reduces the impact of illumination, while RGB preserves the detailed facial textures; HSV and YCbCr, in contrast, are already based on a separation of luminance and chrominance, which makes them less complementary to MSR. This verifies the complementarity of RGB and MSR images.
(5) From Table VII, not surprisingly, the overall results on CASIA-FASD with ResNet (99.71% and 85.33%) are better than those with MobileNet (98.95% and 82.51%).
D. Results of REPLAY-ATTACK and OULU-NPU
REPLAY-ATTACK and OULU-NPU are divided into three
subsets: training, test and development. The training set is used
to train a classifier or feature extractor while the development
set is typically employed to adjust parameters of the classifier.
The test set is used for result evaluation. In this experiment,
we follow the experimental settings of CASIA-FASD and use
MobileNet and ResNet-18 for evaluation.
From Table II and Fig.6-b, we can see the fusion of MSR
and RGB works better than individual ones in terms of EER
(fusion: 0.131% vs RGB: 0.384% and MSR: 7.365%) and
HTER (fusion: 0.254% vs RGB: 1.561% and MSR: 8.584%)
on REPLAY-ATTACK database using MobileNet. The same
conclusion can be found for ResNet-18. From Table VII,
the overall results of REPLAY-ATTACK using MobileNet
(99.42% and 99.13%) are better than that with ResNet-18
(99.21% and 98.59%). In addition, we further fuse the fused
MobileNet features (RGB+MSR) and fused ResNet-18 fea-
tures (RGB+MSR). Because feature dimensionality of original
MobileNet (1024D) and ResNet-18 (512D) is different, we
change the bottleneck layer of the MobileNet to be of 512D
to conduct our attention-based fusion. From Table II, we can
see this further fusion works better than ResNet fusion, but
slightly worse than the MobileNet fusion.
To further verify the effectiveness of the fusion of RGB and
MSR on illumination variations, we conduct the experiment on
REPLAY-ATTACK database which contains two illumination
conditions: 1) controlled condition with a uniform background
and light supplied by a fluorescent lamp, 2) adverse condition
with non-uniform background and the day-light. To discuss
TABLE I
EER (%) OF THREE COLOR SPACES AND MSR FEATURES ON THE CASIA-FASD DATABASE IN SEVEN SCENARIOS
Attack Scenarios Low Normal High Warped Cut Video Overall
LBP
RGB 15.301 8.996 6.412 8.551 6.011 5.661 7.802
MSR 10.690 10.302 5.331 7.609 8.091 8.701 9.003
RGB+MSR Fusion 8.996 9.330 5.981 7.604 6.771 4.390 7.408
MobileNet
RGB 10.610 4.606 5.260 5.934 3.978 3.846 4.931
HSV 8.714 5.884 6.995 3.723 4.709 4.682 5.143
YCbCr 8.441 4.993 4.519 6.410 5.792 3.904 5.091
MSR 7.056 8.129 5.818 9.828 5.126 9.833 9.531
RGB+MSR Fusion 6.745 4.068 3.258 5.258 2.453 2.647 4.175
HSV+MSR Fusion 7.633 4.982 5.601 4.679 4.510 4.511 5.061
YCbCr+MSR Fusion 7.003 5.120 3.227 4.031 6.001 3.799 4.339
ResNet
RGB 4.021 5.851 1.703 5.019 1.941 2.679 3.437
HSV 6.341 2.291 5.815 3.459 2.992 4.578 4.831
YCbCr 7.441 2.185 1.713 4.249 3.329 3.716 3.635
MSR 6.793 6.270 10.098 7.665 5.087 9.531 7.883
RGB+MSR Fusion 3.545 2.170 2.785 4.419 2.572 4.931 3.145
HSV+MSR Fusion 5.319 2.907 4.886 3.299 2.555 4.931 4.661
YCbCr+MSR Fusion 6.178 3.099 4.690 4.003 3.133 3.999 4.761
Fig. 6. ROC curves on REPLAY-ATTACK and CASIA-FASD databases. (a) ROC curves on CASIA-FASD with ResNet under different color spaces and
MSR. (b) ROC curves on REPLAY-ATTACK with MobileNet with LBP and CNNs.
the improvements under different lightings, we divide the database into two parts, adverse illumination and controlled illumination, and run the experiments separately. From Table III and Fig. 7-(a), MSR features give better results than RGB features under adverse illumination (stronger lighting), showing the robustness of MSR to strong lighting. On the other hand, RGB outperforms MSR features under controlled illumination (close to neutral lighting), showing that RGB has a strong capacity to maintain the texture details under neutral illumination. After fusion, the results are improved under both adverse and controlled illumination. Thus the fusion of MSR and RGB can effectively handle various lightings and improve the performance.
For the OULU-NPU database, we follow [58] and use four metrics: we present the EER on the development set and the APCER, BPCER and ACER on the test set.
Table IV and Table VII show the results of the RGB, MSR
Fig. 7. ROC curves on REPLAY-ATTACK and CASIA-FASD databases. (a) ROC curves on REPLAY-ATTACK with MobileNet under different
illuminations. (b) ROC curves on CASIA-FASD with ResNet under different fusion methods.
TABLE II
EER (%) AND HTER (%) OF RGB AND MSR FEATURES ON THE REPLAY-ATTACK DATABASE
Method REPLAY-ATTACK
EER HTER
LBP RGB 3.990 4.788
LBP MSR 4.701 5.060
MobileNet RGB 0.384 1.561
ResNet RGB 0.628 2.038
MobileNet MSR 7.365 8.584
ResNet MSR 8.350 9.576
LBP Attention Fusion 3.491 4.903
MobileNet Attention Fusion 0.131 0.254
ResNet Attention Fusion 0.210 0.389
ResNet + MobileNet Attention Fusion 0.177 0.293
TABLE III
EER (%) AND HTER (%) OF RGB AND MSR FEATURES UNDER ADVERSE ILLUMINATION AND CONTROLLED ILLUMINATION ON THE REPLAY-ATTACK DATABASE
Method    Adverse EER    Adverse HTER    Controlled EER    Controlled HTER
MobileNet RGB 0.451 1.971 0.140 1.107
ResNet RGB 0.705 2.444 0.411 1.677
MobileNet MSR 7.660 8.621 6.138 7.218
ResNet MSR 8.720 9.031 7.993 8.930
MobileNet Attention Fusion 0.165 1.299 0.093 0.097
ResNet Attention Fusion 0.285 1.433 0.169 1.310
and fusion features based on MobileNet and ResNet-18. In
terms of ACER and EER, we can see the fusion of RGB and
MSR performs better than individual ones. For most results in
four protocols, the fusion of features significantly outperforms
individual features.
The consistent improvement of feature fusion shows the
effectiveness of the use of two information sources: RGB and
MSR. As shown in Table II and Table IV, the popular networks
(MobileNet and ResNet-18) achieve competitive performance on the REPLAY-ATTACK and OULU-NPU databases.
TABLE IV
EER (%), APCER (%), BPCER (%) AND ACER (%) OF RGB AND MSR FEATURES ON THE OULU-NPU DATABASE
Prot. Methods Dev Test
EER(%) APCER(%) BPCER(%) ACER(%)
1
MobileNet RGB 6.1 9.6 6.2 7.9
ResNet RGB 2.3 3.5 8.7 6.1
MobileNet MSR 10.5 10.6 9.4 10.0
ResNet MSR 5.7 7.5 9.3 8.4
MobileNet Attention Fusion 5.2 3.9 9.5 6.7
ResNet Attention Fusion 2.1 5.1 6.7 5.9
2
MobileNet RGB 5.7 6.5 10.7 8.6
ResNet RGB 2.7 3.7 8.1 5.9
MobileNet MSR 9.6 8.9 9.9 9.4
ResNet MSR 4.3 3.8 11.6 7.8
MobileNet Attention Fusion 5.1 3.6 9.0 6.3
ResNet Attention Fusion 2.0 7.6 2.2 4.9
3
MobileNet RGB 5.3±0.5 3.5±1.8 9.3±2.6 6.4±3.7
ResNet RGB 2.7±0.8 9.3±0.8 5.7±1.2 7.2±2.6
MobileNet MSR 10.8±1.2 6.9±2.5 12.3±0.9 9.7±1.9
ResNet MSR 4.6±0.8 8.3±1.9 9.4±1.8 8.7±2.1
MobileNet Attention Fusion 5.1±0.3 8.7±4.5 5.3±2.3 6.3±2.2
ResNet Attention Fusion 1.9±0.4 3.9±2.8 7.3±1.1 5.6±1.6
4
MobileNet RGB 6.3±0.4 12.3±7.5 9.7±2.6 10.3±3.1
ResNet RGB 2.6±0.5 17.9±9.1 10.1±5.5 14.9±6.4
MobileNet MSR 11.8±1.8 24.7±10.5 21.3±12.8 22.0±11.6
ResNet MSR 6.6±0.7 19.6±9.1 16.2±8.8 17.1±8.1
MobileNet Attention Fusion 6.1±0.7 10.9±4.6 12.7±5.1 11.3±3.9
ResNet Attention Fusion 2.3±0.3 11.3±3.9 9.7±4.8 9.8±4.2
E. Attention based fusion results
As mentioned above, the RGB feature mainly captures the micro-texture of the facial skin across all frequencies, while the MSR feature focuses on the high frequencies, which reduces the influence of illumination. Table I, Table II and Table IV have verified the effectiveness of the fusion of these two features (RGB and MSR). In this section, we further explore this effectiveness.
TABLE V
EER (%) OF DIFFERENT FUSION METHODS ON THE CASIA-FASD DATABASE IN SEVEN SCENARIOS
Attack Scenarios Low Normal High Warped Cut Video Overall
MobileNet
Concatenated Features 7.808 3.473 5.957 5.364 3.267 4.479 5.191
Score Average 10.611 4.612 5.312 5.934 3.971 3.877 4.953
Feature Average 8.086 3.311 5.819 5.253 3.333 4.278 5.108
Feature Max 8.048 3.410 6.017 5.347 3.321 4.529 5.201
Feature Min 7.820 3.458 5.380 5.149 3.267 4.064 4.887
Attention Fusion 6.745 4.068 3.258 5.258 2.453 2.647 4.175
ResNet
Concatenated Features 5.568 3.099 4.302 4.092 2.516 3.143 3.380
Score Average 5.902 2.969 3.830 4.202 2.658 3.224 3.332
Feature Average 6.242 3.291 4.689 3.935 2.929 3.956 3.895
Feature Max 5.846 4.039 4.536 4.331 3.091 4.198 4.189
Feature Min 7.244 2.825 4.941 4.280 3.030 3.984 4.157
Attention Fusion 3.545 2.170 2.785 4.419 2.572 4.931 3.145
First, we show some qualitative results via visualization.
Compared with average feature fusion, which weights different features equally, attention fusion has the flexibility to adaptively weight the features in an asymmetric way. Therefore, our attention-based fusion has the potential to obtain better weights, leading to better performance. Fig. 8-(A) shows this asymmetric weighting mechanism of our attention-based fusion method. The samples in Fig. 8-(A) are selected from the REPLAY-ATTACK database, which covers two imaging lighting conditions: adverse illumination (uneven, complicated lighting) and controlled illumination (even, neutral lighting).
From the samples in Fig. 8, we can see that the weights for MSR and RGB are adaptively asymmetric. Under adverse illumination (uneven, complicated lighting), the weights of the MSR images are higher than those of the RGB ones because MSR images are more illumination-invariant. Under controlled illumination, unsurprisingly, the RGB images gain higher weights. Fig. 8 (B) shows some samples under different illuminations with three scores (RGB, MSR and their fusion). Some samples are misclassified by the individual RGB or MSR scores, but the fused scores lead to correct decisions, showing the effectiveness of the fusion of RGB and MSR, in particular under various illuminations.
Second, we show some quantitative results. Specifically, we compare the proposed attention-based fusion method with popular feature fusion methods including score averaging, feature concatenation, feature averaging, feature max pooling and feature min pooling. The fusion results are presented separately for the different databases.
ent databases.
Table V shows the results of CASIA-FASD with the seven
scenarios. In addition, Fig.7-(b) shows the ROC curves of
the popular feature fusion methods using MobileNet. The proposed attention-based fusion method achieves the lowest overall EER ('Overall'): 4.175% (MobileNet) and 3.145% (ResNet-18), showing the superiority of our fusion method over the others. For MobileNet and ResNet-18, the 2nd and 3rd best-performing fusion methods are {'Feature Min', 'Score Average'} and {'Score Average', 'Concatenated Features'}, respectively.
Table VI shows the fusion results on REPLAY-ATTACK and OULU-NPU. We can see that our attention-based fusion
works consistently better than all other fusion methods on
both REPLAY-ATTACK (EER and HTER) and OULU-NPU
(EER). The promising performance results from the fact that
attention-based fusion can adaptively weight the RGB and
MSR features.
TABLE VI
EER (%) AND HTER (%) OF DIFFERENT FUSION METHODS ON THE REPLAY-ATTACK AND OULU-NPU DATABASES
Methods REPLAY-ATTACK OULU-NPU
EER HTER EER
MobileNet
Concatenated Features 0.412 0.381 6.381
Score Average 0.363 0.360 6.472
Feature Average 0.396 0.395 7.549
Feature Max 0.310 0.294 8.317
Feature Min 0.574 0.565 9.841
Attention Fusion 0.131 0.254 5.692
ResNet
Concatenated Features 0.841 0.668 4.518
Score Average 1.278 1.178 9.565
Feature Average 0.873 0.725 5.358
Feature Max 0.958 0.906 4.964
Feature Min 0.579 0.490 2.578
Attention Fusion 0.210 0.389 2.021
F. Comparisons with State-of-the-art
Table VIII presents the comparisons of our approach with
the state-of-the-art methods for face spoofing detection. In gen-
eral, the proposed algorithm outperforms many competitors,
demonstrating the effectiveness of fusing the RGB feature and the MSR feature with an attention model.
For REPLAY-ATTACK database, the proposed method
achieves the best (MobileNet+Attention) and 2nd best
(ResNet-18+Attention) performance in terms of EER, show-
ing the effectiveness of the fusion of two clues (RGB and
MSR). In terms of HTER, our method (MobileNet+Attention)
achieves the 2nd best performance, slightly behind Bottleneck feature fusion + NN [50]. However, our method greatly
outperforms [50] in terms of EER.
For CASIA-FASD database, it can be seen in Table VIII that
we also achieve the best (ResNet-18 + Attention) and 2nd best
(MobileNet + Attention) performance in terms of EER.
For OULU-NPU database, as shown in Table IX, we can
achieve 2nd best performance for most results under the four
protocols, while the method of [63] works best, which uses
the additional information of 3D depth shape and rPPG (The
rPPG signal provides temporal information about face liveness,
Fig. 8. Results on REPLAY-ATTACK database. (A) Attention fusion weights (numbers in the boxes) showing the importance of RGB and MSR. Samples
cover 2 imaging lightness conditions: adverse illumination (Row 1 and 2) and controlled illumination (Row 3 and 4). (B) Three prediction scores: RGB, MSR
and the fusion of them (numbers in the boxes). The red and green boxes indicate the wrong and correct predictions respectively.
TABLE VII
TPR@FAR=0.1 AND TPR@FAR=0.01 OF THE ATTENTION FUSION RESULTS ON THE CASIA-FASD, REPLAY-ATTACK AND OULU-NPU DATABASES
Database Methods Protocol TPR@FAR=0.1 TPR@FAR=0.01
CASIA-FASD ResNet Attention Fusion overall 99.71% 85.33%
MobileNet Attention Fusion overall 98.95% 82.51%
REPLAY-ATTACK ResNet Attention Fusion overall 99.21% 98.59%
MobileNet Attention Fusion overall 99.42% 99.13%
OULU-NPU
ResNet Attention Fusion
Prot.1 94.15% 83.44%
Prot.2 95.11% 86.78%
Prot.3 93.59%±0.5% 84.39%±0.4%
Prot.4 93.09%±0.4% 83.69%±0.5%
MobileNet Attention Fusion
Prot.1 98.94% 96.74%
Prot.2 99.10% 96.86%
Prot.3 98.41%±0.6% 96.04%±0.5%
Prot.4 97.83%±0.4% 95.22%±0.6%
which is related to the intensity changes of facial skin over
time).
To summarize, our method achieves very strong perfor-
mance across all three benchmark databases, showing the
merits of the proposed method.
G. Cross-Database Comparisons
The spoofing faces of different databases are captured using
different devices under different environments (e.g. lighting).
Therefore, it is interesting to evaluate our strategy in a cross-
database protocol to verify its generalization capacity. We
conducted a cross-database evaluation between CASIA-FASD
and REPLAY-ATTACK. To be specific, the cross-database protocol
trains and tunes the classifier on one database and tests it on the
other. The generalization ability of the system in this case
is manifested by the HTER obtained on the validation and test
sets. The countermeasure was trained and tuned on either CASIA-
FASD or REPLAY-ATTACK, and then tested on
the other database. The results are reported in Table X and
compared with the state-of-the-art techniques under this cross-
database protocol.
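In other words, the operating threshold is fixed on the development data of the training database (e.g. at its EER point) and applied unchanged to the other database, where HTER = (FAR + FRR)/2 is reported. A minimal sketch of this evaluation, assuming hypothetical score and label arrays, is:

    import numpy as np

    def eer_threshold(scores, labels):
        # labels: 1 = real face, 0 = attack; higher score = more likely real
        best_t, best_gap = None, np.inf
        for t in np.unique(scores):
            far = np.mean(scores[labels == 0] >= t)   # attacks accepted
            frr = np.mean(scores[labels == 1] < t)    # real faces rejected
            if abs(far - frr) < best_gap:
                best_gap, best_t = abs(far - frr), t
        return best_t

    def hter(scores, labels, threshold):
        far = np.mean(scores[labels == 0] >= threshold)
        frr = np.mean(scores[labels == 1] < threshold)
        return (far + frr) / 2

    # threshold tuned on database A (dev set), HTER reported on database B (test set)
    # t = eer_threshold(dev_scores_A, dev_labels_A)
    # cross_db_hter = hter(test_scores_B, test_labels_B, t)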
Due to the domain shift (different imaging environments)
between databases, the performance of all the anti-spoofing
methods drops. Compared with the state-of-the-art methods,
TABLE VIII
COMPARISON BETWEEN THE PROPOSED COUNTERMEASURE AND
STATE-OF-THE-ART METHODS ON THE REPLAY-ATTACK AND
CASIA-FASD DATABASES IN TERMS OF EER (%) AND HTER (%)
Methods REPLAY-ATTACK CASIA-FASD
EER HTER EER
Motion [60] 11.6 11.7 26.6
LBP [56] 13.9 13.8 18.2
LBP-TOP [61] 7.90 7.60 10.00
CDD [62] - - 11.8
DOG [3] - - 17.0
DMD [27] 5.3 3.8 21.8
IQA [4] - 15.2 32.4
CNN [14] 6.10 2.10 7.40
IDA [5] - 7.4 -
Motion + LBP [29] 4.50 5.11 -
Color-LBP [10] 0.40 2.90 6.20
Bottleneck feature
fusion + NN [50] 0.83 0.00 5.83
Ours (MobileNet
+ Attention) 0.131 0.254 4.175
Ours (ResNet-18
+ Attention) 0.210 0.389 3.145
TABLE IX
COMPARISON BETWEEN THE PROPOSED COUNTERMEASURE AND
STATE-OF-THE-ART METHODS ON THE OULU-NPU DATABASE IN TERMS OF
EER (%), APCER (%), BPCER (%) AND ACER (%)
Prot. Methods Dev Test
EER(%) APCER(%) BPCER(%) ACER(%)
1
CpqD [58] 0.6 2.9 10.8 6.9
GRADANT [58] 1.1 1.3 12.5 6.9
Depth + rPPG [63] - 1.6 1.6 1.6
MobileNet
Attention Fusion 5.2 3.9 9.5 6.7
ResNet
Attention Fusion 2.1 5.1 6.7 5.9
2
MixedFASNet [58] 1.3 9.7 2.5 6.1
GRADANT [58] 0.9 3.1 1.9 2.5
Depth + rPPG [63] - 2.7 2.7 2.7
MobileNet
Attention Fusion 5.1 3.6 9.0 6.3
ResNet
Attention Fusion 2.0 7.6 2.2 4.9
3
MixedFASNet [58] 1.4±0.5 5.3±6.7 7.8±5.5 6.5±4.6
GRADANT [58] 0.9±0.4 2.6±3.9 5.0±5.3 3.8±2.4
Depth + rPPG [63] - 2.7±1.3 3.1±1.7 2.9±1.5
MobileNet
Attention Fusion 5.1±0.3 8.7±4.5 5.3±2.3 6.3±2.2
ResNet
Attention Fusion 1.9±0.4 3.9±2.8 7.3±1.1 5.6±1.6
4
Massy HNU [58] 1.0±0.4 35.8±35.3 8.3±4.1 22.1±17.6
GRADANT [58] 1.1±0.3 5.0±4.5 15.0±7.1 10.0±5.0
Depth + rPPG [63] - 9.3±5.6 10.4±6.0 9.5±6.0
MobileNet
Attention Fusion 6.1±0.7 10.9±4.6 12.7±5.1 11.3±3.9
ResNet
Attention Fusion 2.3±0.3 11.3±3.9 9.7±4.8 9.8±4.2
TABLE X
INTER-DATABASE TEST RESULTS IN TERMS OF HTER (%) ON THE
CASIA-FASD AND REPLAY-ATTACK DATABASES
Method   Train: CASIA-FASD / Test: REPLAY-ATTACK   Train: REPLAY-ATTACK / Test: CASIA-FASD
Motion [60] 50.2% 47.9%
LBP [56] 55.9% 57.6%
LBP-TOP [61] 49.7% 60.6%
Motion-Mag [64] 50.1% 47.0%
Spectral cubes [22] 34.4% 45.5%
CNN [14] 48.5% 39.6%
Color-LBP [10] 47.0% 39.6%
Colour Texture [8] 30.3% 37.7%
Depth + rPPG [63] 27.6% 28.4%
Deep-Learning [13] 48.2% 45.4%
KSA [65] 33.1% 32.1%
Frame difference [66] 50.25% 43.05%
Ours (MobileNet +
Attention) 30.0% 33.4%
Ours (ResNet-18 +
Attention) 36.2% 34.7%
TABLE XI
INTER-DATABASE TEST RESULTS FOR RGB FEATURES IN TERMS OF
MAXIMUM MEAN DISCREPANCY ON THE CASIA-FASD AND
REPLAY-ATTACK DATABASES
Model Train Val MMD
Resnet18 RGB
CASIA-FASD CASIA-FASD 0.7653
CASIA-FASD REPLAY-ATTACK 1.4561
REPLAY-ATTACK REPLAY-ATTACK 0.6871
REPLAY-ATTACK CASIA-FASD 1.3484
Mobilenet RGB
CASIA-FASD CASIA-FASD 0.8654
CASIA-FASD REPLAY-ATTACK 1.3276
REPLAY-ATTACK REPLAY-ATTACK 0.7469
REPLAY-ATTACK CASIA-FASD 1.2765
TABLE XII
INTER-DATABASE TEST RESULTS FOR MSR FEATURES IN TERMS OF
MAXIMUM MEAN DISCREPANCY ON THE CASIA-FASD AND
REPLAY-ATTACK DATABASES
Model Train Val MMD
Resnet18 MSR
CASIA-FASD CASIA-FASD 0.9831
CASIA-FASD REPLAY-ATTACK 1.8746
REPLAY-ATTACK REPLAY-ATTACK 0.6541
REPLAY-ATTACK CASIA-FASD 1.0133
Mobilenet MSR
CASIA-FASD CASIA-FASD 0.8655
CASIA-FASD REPLAY-ATTACK 1.7749
REPLAY-ATTACK REPLAY-ATTACK 0.8811
REPLAY-ATTACK CASIA-FASD 1.1661
TABLE XIII
INTER-DATABASE TEST RESULTS FOR RGB AND MSR FUSION
FEATURES IN TERMS OF MAXIMUM MEAN DISCREPANCY ON THE
CASIA-FASD AND REPLAY-ATTACK DATABASES
Model Train Val MMD
Resnet18
RGB + MSR Fusion
CASIA-FASD CASIA-FASD 0.6215
CASIA-FASD REPLAY-ATTACK 1.2511
REPLAY-ATTACK REPLAY-ATTACK 0.7003
REPLAY-ATTACK CASIA-FASD 1.1295
Mobilenet
RGB + MSR Fusion
CASIA-FASD CASIA-FASD 0.6619
CASIA-FASD REPLAY-ATTACK 1.3518
REPLAY-ATTACK REPLAY-ATTACK 0.7139
REPLAY-ATTACK CASIA-FASD 1.0551
our method (MobileNet + Attention) achieves the 2nd best
performance (30.0% and 33.4%), slightly worse than the
best one [63] (27.6% and 28.4%). However, [63] uses more
auxiliary information (3D face shape, rPPG signals) than our
method.
To explore the reasons for the performance drop in the cross-
database evaluation, we adopt a standard distribution distance
metric, maximum mean discrepancy (MMD) [67], to
measure the domain shift between the source and target
feature distributions.
\mathrm{MMD}(F_T, F_V) = \left\| \frac{1}{|F_T|} \sum_{f_t \in F_T} \phi(f_t) - \frac{1}{|F_V|} \sum_{f_v \in F_V} \phi(f_v) \right\|    (16)
As shown in Eq. (16), we define a representation
φ(·), which operates on the training-set features f_t ∈ F_T and
the validation-set features f_v ∈ F_V. The larger the value of MMD,
the larger the domain shift.
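With a linear (identity) mapping φ, Eq. (16) reduces to the distance between the empirical feature means of the two sets, which can be computed directly from the extracted CNN features. The sketch below is our own illustration under that simplification (the feature matrices are hypothetical):

    import numpy as np

    def mmd_linear(feats_train, feats_val):
        # feats_train: (n_train, dim) features F_T extracted on the training database
        # feats_val:   (n_val, dim)   features F_V extracted on the validation database
        return np.linalg.norm(feats_train.mean(axis=0) - feats_val.mean(axis=0))

    # e.g. penultimate-layer ResNet-18 features on CASIA-FASD vs. REPLAY-ATTACK;
    # a larger value indicates a larger domain shift
    # print(mmd_linear(feats_casia, feats_replay))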
From the results in Tables XI, XII and XIII, we can see that: (1)
When we train and test on the same database, the MMD is
smaller than when we train and test on different databases, for both
MobileNet and ResNet-18.
(2) Since CASIA-FASD covers seven scenarios, the MMD obtained by
training on CASIA-FASD and testing on REPLAY-ATTACK is larger
than that obtained by training on REPLAY-ATTACK and testing on
CASIA-FASD.
(3) The fusion of RGB and MSR features reduces the cross-database
MMD compared with either feature alone, for both
MobileNet and ResNet-18.
V. CONCLUSION
In this paper, we proposed an attention-based two-stream
convolutional network for face spoofing detection that distin-
guishes real from fake faces. The proposed approach extracts
complementary features (RGB and MSR) via
CNN models (MobileNet and ResNet-18) and then employs
an attention-based fusion method to fuse the two features.
The adaptively weighted features contain more discriminative
information under various lighting conditions.
We evaluated our face anti-spoofing approach on three chal-
lenging databases, i.e. CASIA-FASD, REPLAY-ATTACK and
OULU-NPU, and achieved competitive performance in
both intra-database and inter-database settings. The experiments on
fusion methods show that the attention model achieves
promising results for feature fusion, and the cross-database evalu-
ations show the effectiveness of fusing RGB and MSR
information.
ACKNOWLEDGEMENT
The authors would like to thank the journal reviewers for
their valuable suggestions. This work was supported in part by
the National Natural Science Foundation of China (61876072,
61876178, 61872367, 61572501) and the Fundamental Re-
search Funds for the Central Universities.
REFERENCES
[1] J. Li, Y. Wang, and A. K. Jain, “Live face detection based on the analysis
of fourier spectra,” Proc Spie, vol. 5404, pp. 296–303, 2004.
[2] X. Tan, Y. Li, J. Liu, and L. Jiang, “Face liveness detection from a single
image with sparse low rank bilinear discriminative model,” in Computer
Vision - ECCV 2010 - 11th European Conference on Computer Vision,
Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part VI,
2010, pp. 504–517.
[3] Z. Zhang, J. Yan, S. Liu, Z. Lei, D. Yi, and S. Z. Li, “A face antispoofing
database with diverse attacks,” in Iapr International Conference on
Biometrics, 2012, pp. 26–31.
[4] J. Galbally and S. Marcel, “Face anti-spoofing based on general image
quality assessment,” in 22nd International Conference on Pattern Recog-
nition, ICPR 2014, Stockholm, Sweden, August 24-28, 2014, 2014, pp.
1173–1178.
[5] D. Wen, H. Han, and A. K. Jain, “Face spoof detection with image
distortion analysis,” IEEE Trans. Information Forensics and Security,
vol. 10, no. 4, pp. 746–761, 2015.
[6] J. Yan, Z. Zhang, Z. Lei, D. Yi, and S. Z. Li, “Face liveness detection
by exploring multiple scenic clues,” in 12th International Conference
on Control Automation Robotics & Vision, ICARCV 2012, Guangzhou,
China, December 5-7, 2012, 2012, pp. 188–193.
[7] K. Patel, H. Han, A. K. Jain, and G. Ott, “Live face video vs. spoof
face video: Use of moiré patterns to detect replay video attacks,” in
International Conference on Biometrics, ICB 2015, Phuket, Thailand,
19-22 May, 2015, 2015, pp. 98–105.
[8] Z. Boulkenafet, J. Komulainen, and A. Hadid, “Face spoofing detection
using colour texture analysis,” IEEE Trans. Information Forensics and
Security, vol. 11, no. 8, pp. 1818–1830, 2016.
[9] Z. Boulkenafet, J. Komulainen, X. Feng, and A. Hadid, “Scale space
texture analysis for face anti-spoofing,” in International Conference on
Biometrics, ICB 2016, Halmstad, Sweden, June 13-16, 2016, 2016, pp.
1–6.
[10] Z. Boulkenafet, J. Komulainen, and A. Hadid, “Face anti-spoofing based
on color texture analysis,” in 2015 IEEE International Conference on
Image Processing, ICIP 2015, Quebec City, QC, Canada, September
27-30, 2015, 2015, pp. 2636–2640.
[11] T. Ojala, M. Pietikäinen, and T. Mäenpää, “Multiresolution gray-scale
and rotation invariant texture classification with local binary patterns,
IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 971–987,
2002.
[12] J. Määttä, A. Hadid, and M. Pietikäinen, “Face spoofing detection from
single images using micro-texture analysis,” in 2011 IEEE International
Joint Conference on Biometrics, IJCB 2011, Washington, DC, USA,
October 11-13, 2011, 2011, pp. 1–7.
[13] D. Menotti, G. Chiachia, A. da Silva Pinto, W. R. Schwartz, H. Pedrini,
A. X. Falcão, and A. Rocha, “Deep representations for iris, face, and
fingerprint spoofing detection,” IEEE Trans. Information Forensics and
Security, vol. 10, no. 4, pp. 864–879, 2015.
[14] J. Yang, Z. Lei, and S. Z. Li, “Learn convolutional neural network for
face anti-spoofing,” CoRR, vol. abs/1408.5601, 2014.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in Neural In-
formation Processing Systems 25: 26th Annual Conference on Neural
Information Processing Systems 2012. Proceedings of a meeting held
December 3-6, 2012, Lake Tahoe, Nevada, United States., 2012, pp.
1106–1114.
[16] G. Pan, L. Sun, Z. Wu, and S. Lao, “Eyeblink-based anti-spoofing in
face recognition from a generic webcamera,” in IEEE 11th International
Conference on Computer Vision, ICCV 2007, Rio de Janeiro, Brazil,
October 14-20, 2007, 2007, pp. 1–8.
[17] G. Pan, L. Sun, Z. Wu, and Y. Wang, “Monocular camera-based
face liveness detection by combining eyeblink and scene context,
Telecommunication Systems, vol. 47, no. 3-4, pp. 215–225, 2011.
[18] L. Sun, G. Pan, Z. Wu, and S. Lao, “Blinking-based live face de-
tection using conditional random fields,” in Advances in Biometrics,
International Conference, ICB 2007, Seoul, Korea, August 27-29, 2007,
Proceedings, 2007, pp. 252–260.
[19] A. Anjos, M. M. Chakka, and S. Marcel, “Motion-based counter-
measures to photo attacks in face recognition,” Iet Biometrics, vol. 3,
no. 3, pp. 147–158, 2014.
[20] K. Kollreider, H. Fronthaler, M. I. Faraj, and J. Bigun, “Real-time face
detection and motion analysis with application in “liveness” assessment,
IEEE Transactions on Information Forensics Security, vol. 2, no. 3, pp.
548–558, 2015.
[21] Y. Kim, J. Na, S. Yoon, and J. Yi, “Masked fake face detection using
radiance measurements,” J Opt Soc Am A Opt Image Sci Vis, vol. 26,
no. 4, pp. 760–766, 2009.
[22] A. da Silva Pinto, H. Pedrini, W. R. Schwartz, and A. Rocha, “Face
spoofing detection through visual codebooks of spectral temporal cubes,”
IEEE Trans. Image Processing, vol. 24, no. 12, pp. 4726–4740, 2015.
[23] D. J. Jobson, Z. Rahman, and G. A. Woodell, “A multiscale retinex
for bridging the gap between color images and the human observation
of scenes,” IEEE Trans. Image Processing, vol. 6, no. 7, pp. 965–976,
1997.
[24] J. Yang, P. Ren, D. Zhang, D. Chen, F. Wen, H. Li, and G. Hua,
“Neural aggregation network for video face recognition,” in 2017 IEEE
Conference on Computer Vision and Pattern Recognition, CVPR 2017,
Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 5216–5225.
[25] Y. Atoum, Y. Liu, A. Jourabloo, and X. Liu, “Face anti-spoofing
using patch and depth-based cnns,” in 2017 IEEE International Joint
Conference on Biometrics, IJCB 2017, Denver, CO, USA, October 1-4,
2017, 2017, pp. 319–328.
[26] Z. Xu, S. Li, and W. Deng, “Learning temporal features using LSTM-
CNN architecture for face anti-spoofing,” in 3rd IAPR Asian Conference
on Pattern Recognition, ACPR 2015, Kuala Lumpur, Malaysia, Novem-
ber 3-6, 2015, 2015, pp. 141–145.
[27] S. Tirunagari, N. Poh, D. Windridge, A. Iorliam, N. Suki, and A. T. Ho,
“Detection of face spoofing using visual dynamics,” IEEE transactions
on information forensics and security, vol. 10, no. 4, pp. 762–777, 2015.
[28] Z. Zhang, D. Yi, Z. Lei, and S. Z. Li, “Face liveness detection
by learning multispectral reflectance distributions,” in Ninth IEEE
International Conference on Automatic Face and Gesture Recognition
(FG 2011), Santa Barbara, CA, USA, 21-25 March 2011, 2011, pp.
436–441. [Online]. Available: https://doi.org/10.1109/FG.2011.5771438
[29] J. Komulainen, A. Hadid, M. Pietikäinen, A. Anjos, and S. Marcel,
“Complementary countermeasures for detecting scenic face spoofing
attacks,” in International Conference on Biometrics, ICB 2013, 4-7 June,
2013, Madrid, Spain, 2013, pp. 1–7.
[30] M. De Marsico, M. Nappi, D. Riccio, and J. Dugelay, “Moving face
spoofing detection via 3d projective invariants,” in 5th IAPR Interna-
tional Conference on Biometrics, ICB 2012, New Delhi, India, March
29 - April 1, 2012, 2012, pp. 73–78.
[31] T. Wang, J. Yang, Z. Lei, S. Liao, and S. Z. Li, “Face liveness detection
using 3d structure recovered from a single camera,” in International
Conference on Biometrics, ICB 2013, 4-7 June, 2013, Madrid, Spain,
2013, pp. 1–6.
[32] Y. Wang, F. Nian, T. Li, Z. Meng, and K. Wang, “Robust face anti-
spoofing with depth information,” J. Visual Communication and Image
Representation, vol. 49, pp. 332–337, 2017.
[33] E. H. Land and J. J. Mccann, “Lightness and retinex theory, J Opt Soc
Am, vol. 61, no. 1, pp. 1–11, 1971.
[34] D. Choi, I. H. Jang, M. H. Kim, and N. C. Kim, “Color image
enhancement based on single-scale retinex with a jnd-based nonlinear
filter,” in International Symposium on Circuits and Systems (ISCAS
2007), 27-20 May 2007, New Orleans, Louisiana, USA, 2007, pp. 3948–
3951.
[35] G. Zhang, D. Sun, P. Yan, H. Zhao, and Z. Li, “A LDCT image
contrast enhancement algorithm based on single-scale retinex theory,
in 2008 International Conferences on Computational Intelligence for
Modelling, Control and Automation (CIMCA 2008), Intelligent Agents,
Web Technologies and Internet Commerce (IAWTIC 2008), Innovation
in Software Engineering (ISE 2008), 10-12 December 2008, Vienna,
Austria, 2008, pp. 181–186.
[36] D. J. Jobson, Z. Rahman, and G. A. Woodell, “Properties and perfor-
mance of a center/surround retinex,” IEEE Trans. Image Processing,
vol. 6, no. 3, pp. 451–462, 1997.
[37] S. J. Xie, Y. Lu, S. Yoon, J. C. Yang, and D. S. Park, “Intensity variation
normalization for finger vein recognition using guided filter based singe
scale retinex,” Sensors, vol. 15, no. 7, pp. 17 089–17 105, 2015.
[38] C. Lee, J. Shih, C. Lien, and C. Han, “Adaptive multiscale retinex
for image contrast enhancement,” in Ninth International Conference on
Signal-Image Technology & Internet-Based Systems, SITIS 2013, Kyoto,
Japan, December 2-5, 2013, 2013, pp. 43–50.
[39] E. Park, X. Han, T. L. Berg, and A. C. Berg, “Combining multiple
sources of knowledge in deep cnns for action recognition,” in 2016
IEEE Winter Conference on Applications of Computer Vision, WACV
2016, Lake Placid, NY, USA, March 7-10, 2016, 2016, pp. 1–8.
[40] K. Simonyan and A. Zisserman, “Two-stream convolutional networks
for action recognition in videos,” in Advances in Neural Information
Processing Systems 27: Annual Conference on Neural Information
Processing Systems 2014, December 8-13 2014, Montreal, Quebec,
Canada, 2014, pp. 568–576.
[41] D. H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, no. 2,
pp. 241–259, 1992.
[42] Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan, “See the forest for
the trees: Joint spatial and temporal recurrent neural networks for video-
based person re-identification,” in 2017 IEEE Conference on Computer
Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July
21-26, 2017, 2017, pp. 6776–6785.
[43] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual
attention for rapid scene analysis,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 20, no. 11, pp. 1254–1259, 1998.
[44] R. Girdhar and D. Ramanan, “Attentional pooling for action recog-
nition,” in Advances in Neural Information Processing Systems 30:
Annual Conference on Neural Information Processing Systems 2017,
4-9 December 2017, Long Beach, CA, USA, 2017, pp. 33–44.
[45] A. Gupta, D. Agrawal, H. Chauhan, J. Dolz, and M. Pedersoli,
“An attention model for group-level emotion recognition,” CoRR, vol.
abs/1807.03380, 2018.
[46] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and
X. Tang, “Residual attention network for image classification,” in 2017
IEEE Conference on Computer Vision and Pattern Recognition, CVPR
2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 6450–6458.
[47] R. F. Nogueira, R. de Alencar Lotufo, and R. C. Machado, “Fingerprint
liveness detection using convolutional neural networks,” IEEE Trans.
Information Forensics and Security, vol. 11, no. 6, pp. 1206–1213, 2016.
[48] L. Li, X. Feng, X. Jiang, Z. Xia, and A. Hadid, “Face anti-spoofing via
deep local binary patterns,” in 2017 IEEE International Conference on
Image Processing, ICIP 2017, Beijing, China, September 17-20, 2017,
2017, pp. 101–105.
[49] H. Li, P. He, S. Wang, A. Rocha, X. Jiang, and A. C. Kot, “Learning
generalized deep feature representation for face anti-spoofing,” IEEE
Trans. Information Forensics and Security, vol. 13, no. 10, pp. 2639–
2652, 2018.
[50] L. Feng, L. Po, Y. Li, X. Xu, F. Yuan, T. C. Cheung, and K. Cheung,
“Integration of image quality and motion cues for face anti-spoofing:
A neural network approach,” J. Visual Communication and Image
Representation, vol. 38, pp. 451–460, 2016.
[51] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and
alignment using multi-task cascaded convolutional networks,” CoRR,
vol. abs/1604.02878, 2016.
[52] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand,
M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural
networks for mobile vision applications,” CoRR, vol. abs/1704.04861,
2017.
[53] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2016.
[54] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, “Imagenet: A large-
scale hierarchical image database,” in 2009 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR 2009),
20-25 June 2009, Miami, Florida, USA, 2009, pp. 248–255.
[55] Z. Zhang, J. Yan, S. Liu, Z. Lei, D. Yi, and S. Z. Li, “A face antispoofing
database with diverse attacks,” in 5th IAPR International Conference on
Biometrics, ICB 2012, New Delhi, India, March 29 - April 1, 2012,
2012, pp. 26–31.
[56] I. Chingovska, A. Anjos, and S. Marcel, “On the effectiveness of local
binary patterns in face anti-spoofing,” in 2012 BIOSIG - Proceedings
of the International Conference of Biometrics Special Interest Group,
Darmstadt, Germany, September 6-7, 2012, 2012, pp. 1–7.
[57] Z. Boulkenafet, J. Komulainen, L. Li, X. Feng, and A. Hadid, “OULU-
NPU: A mobile face presentation attack database with real-world
variations,” in 12th IEEE International Conference on Automatic Face
& Gesture Recognition, FG 2017, Washington, DC, USA, May 30 - June
3, 2017, 2017, pp. 612–618.
[58] Z. Boulkenafet, J. Komulainen, Z. Akhtar, A. Benlamoudi, D. Samai,
S. E. Bekhouche, A. Ouafi, F. Dornaika, A. Taleb-Ahmed, L. Qin,
F. Peng, L. B. Zhang, M. Long, S. Bhilare, V. Kanhangad, A. Costa-
Pazo, E. Vázquez-Fernández, D. Perez-Cabo, J. J. Moreira-Perez,
D. González-Jiménez, A. Mohammadi, S. Bhattacharjee, S. Marcel,
S. Volkova, Y. Tang, N. Abe, L. Li, X. Feng, Z. Xia, X. Jiang,
S. Liu, R. Shao, P. C. Yuen, W. R. Almeida, F. A. Andaló, R. Padilha,
G. Bertocco, W. Dias, J. Wainer, R. da Silva Torres, A. Rocha, M. A.
Angeloni, G. Folego, A. Godoy, and A. Hadid, “A competition on
generalized software-based face presentation attack detection in mobile
scenarios,” in 2017 IEEE International Joint Conference on Biometrics,
IJCB 2017, Denver, CO, USA, October 1-4, 2017, 2017, pp. 688–696.
[59] R. Ramachandra and C. Busch, “Presentation attack detection methods
for face recognition systems: A comprehensive survey,” ACM Comput.
Surv., vol. 50, no. 1, pp. 8:1–8:37, Mar. 2017. [Online]. Available:
http://doi.acm.org/10.1145/3038924
[60] A. Anjos and S. Marcel, “Counter-measures to photo attacks in face
recognition: A public database and a baseline,” in 2011 IEEE Inter-
national Joint Conference on Biometrics, IJCB 2011, Washington, DC,
USA, October 11-13, 2011, 2011, pp. 1–7.
[61] T. F. Pereira, J. Komulainen, A. Anjos, J. M. D. Martino, A. Hadid,
M. Pietikäinen, and S. Marcel, “Face liveness detection using dynamic
texture,” EURASIP J. Image and Video Processing, vol. 2014, p. 2, 2014.
[62] J. Yang, Z. Lei, S. Liao, and S. Z. Li, “Face liveness detection
with component dependent descriptor,” in International Conference on
Biometrics, ICB 2013, 4-7 June, 2013, Madrid, Spain, 2013, pp. 1–6.
[63] Y. Liu, A. Jourabloo, and X. Liu, “Learning deep models for face anti-
spoofing: Binary or auxiliary supervision,” CoRR, vol. abs/1803.11097,
2018.
[64] S. Bharadwaj, T. I. Dhamecha, M. Vatsa, and R. Singh, “Computa-
tionally efficient face spoofing detection with motion magnification,” in
IEEE Conference on Computer Vision and Pattern Recognition, CVPR
Workshops 2013, Portland, OR, USA, June 23-28, 2013, 2013, pp. 105–
110.
[65] H. Li, W. Li, H. Cao, S. Wang, F. Huang, and A. C. Kot, “Unsuper-
vised domain adaptation for face anti-spoofing,” IEEE Transactions on
Information Forensics and Security, vol. 13, no. 7, pp. 1794–1809, 2018.
[66] A. Benlamoudi, K. E. Aiadi, A. Ouafi, D. Samai, and M. Oussalah, “Face
antispoofing based on frame difference and multilevel representation,
J. Electronic Imaging, vol. 26, no. 4, p. 43007, 2017.
[67] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf,
and A. J. Smola, “Integrating structured biological data by kernel
maximum mean discrepancy,” Bioinformatics, vol. 22, no. 14, pp. e49–
e57, 2006.
Haonan Chen received the B.S. degree from Zhe-
jiang University in 2014. He is currently pursuing
the Ph.D. degree with Zhejiang University. His re-
search interests are deep learning, pattern recognition
and biometrics (mainly face recognition).
Guosheng Hu is an honorary lecturer at Queen's
University Belfast and a Senior Researcher at AnyVi-
sion. He was a postdoctoral researcher in the LEAR
team, Inria Grenoble Rhone-Alpes, France from
May 2015 to May 2016. He finished his PhD in
Centre for Vision, Speech and Signal Processing,
University of Surrey, UK in June, 2015. His research
interests include deep learning, pattern recognition,
biometrics (mainly face recognition), and graphics.
Zhen Lei received the BS degree in automation from
the University of Science and Technology of China,
in 2005, and the PhD degree from the Institute of
Automation, Chinese Academy of Sciences, in 2010,
where he is currently an associate professor. He
has published more than 100 papers in international
journals and conferences. His research interests are
in computer vision, pattern recognition, image pro-
cessing, and face recognition in particular. He served
as an area chair of the International Joint Conference
on Biometrics in 2014, the IAPR/IEEE International
Conference on Biometric in 2015, 2016, 2018, and the IEEE International
Conference on Automatic Face and Gesture Recognition in 2015. He is a
senior member of the IEEE.
Yaowu Chen received the Ph.D. degree from Zhe-
jiang University, Hangzhou, China, in 1998. He is
currently a Professor and the Director of the Institute
of Advanced Digital Technologies and Instrumenta-
tion, Zhejiang University. His major research fields
are embedded system, multimedia system, and net-
working.
Neil M. Robertson is Professor and Director of Re-
search for Image and Vision Systems in the Centre
for Data Sciences and Scalable Computing at
Queen's University Belfast, UK. He researches
underpinning machine learning methods for visual
analytics. His principal research focus is face and
activity recognition in video. He started his career in
the UK Scientific Civil Service with DERA (2000-
2002) and QinetiQ (2002-2007). Neil was the 1851
Royal Commission Fellow at Oxford University
(2003-2006) in the Robotics Research Group. His
autonomous systems, defence and security research is extensive including UK
major research programmes and doctoral training centres.
Stan Z.Li received the BEng degree from Hunan
University, China, the MEng degree from National
University of Defense Technology, China, and the
PhD degree from Surrey University, United King-
dom. He is currently a professor and the director
of Center for Biometrics and Security Research
(CBSR), Institute of Automation, Chinese Academy
of Sciences (CASIA). He was with Microsoft Re-
search Asia as a researcher from 2000 to 2004. Prior
to that, he was an associate professor in the Nanyang
Technological University, Singapore. His research
interests include pattern recognition and machine learning, image and vision
processing, face recognition, biometrics, and intelligent video surveillance. He
has published more than 300 papers in international journals and conferences,
and authored and edited eight books. He was an associate editor of the IEEE
Transactions on Pattern Analysis and Machine Intelligence and is acting as the
editor-in-chief for the Encyclopedia of Biometrics. He served as a program co-
chair for the International Conference on Biometrics 2007, 2009, 2013, 2014,
2015, 2016 and 2018, and has been involved in organizing other international
conferences and workshops in the fields of his research interest. He was
elevated to IEEE fellow for his contributions to the fields of face recognition,
pattern recognition and computer vision and he is a member of the IEEE
Computer Society.