
Accurate and Resource-Efficient Lipreading with Efficientnetv2 and Transformers

Alexandros Koumparoulis, Gerasimos Potamianos
Electrical and Computer Engineering Department, University of Thessaly, Volos, Greece
ABSTRACT
We present a novel resource-efficient end-to-end architecture for
lipreading that achieves state-of-the-art results on a popular and
challenging benchmark. In particular, we make the following contributions: First, inspired by the recent success of the EfficientNet architecture in image classification and by our earlier work on resource-efficient lipreading models (MobiLipNet), we introduce EfficientNets to the lipreading task. Second, we show that the 3D front-end currently most popular in the literature contains a max-pool layer that prevents networks from reaching superior performance, and we propose its removal. Finally, we improve our system’s back-end
robustness by including a Transformer encoder. We evaluate our
proposed system on the “Lipreading In-The-Wild” (LRW) corpus, a
database containing short video segments from BBC TV broadcasts.
The proposed network (T-variant) attains 88.53% word accuracy,
a 0.17% absolute improvement over the current state-of-the-art,
while being five times less computationally intensive. Further, an
up-scaled version of our model (L-variant) achieves 89.52%, a new
state-of-the-art result on the LRW corpus.
Index Terms: EfficientNet, Transformers, Lipreading.
1. INTRODUCTION
Visual speech recognition (VSR) models have progressed signifi-
cantly, thanks to advances in deep learning algorithms. Typically,
these are based on accurate but computationally intensive architec-
tures, such as convolutional neural networks (CNNs) and recurrent
or self-attention layers. For example, the 2D CNN of [1] requires
11.22×10⁹ floating-point operations (FLOPs) to process a single
video frame. In resource-constrained scenarios, such deep learning-
based models are impractical due to their computational profile.
Recently, a small number of works [2–5] have focused on im-
proving VSR model efficiency, in order to enable broader application
of this technology. Proposed architectures reduce computational re-
quirements by replacing standard convolutions with grouped ones,
such as depthwise and pointwise. In these cases, the gains in FLOPs
are significant: for example, our MobiLipNetV2 [3] is 37 times more
efficient than a 3D-ResNet. However, when presented with challeng-
ing real-world data, efficient VSR models still lag in recognition per-
formance. For example, the best ShuffleNetV2 model in [5] achieves
a word accuracy (WAcc) of 85.5% on the LRW corpus [6], trailing
the current state-of-the-art of 88.36% [7] on the task.
Motivated by the above, in this paper we attempt to eliminate
the performance discrepancy between conventional and resource-
efficient VSR systems. To this end, we propose a novel neural net-
work architecture, focusing on both recognition accuracy and re-
source efficiency. In particular, we make the following contributions:
First, we introduce a new VSR model based on the architecture
of EfficientNet [8, 9]. This resource-efficient model also relies on pointwise/depthwise convolutions. However, its main departure from similar architectures like MobileNetV3 [10] is compound scaling, where the model’s depth, width, and resolution are scaled together, resulting in a family of models with different accuracy/efficiency trade-offs. Here we present three
configurations: a tiny one (denoted by “T”), targeting resource-
constrained applications, and two larger ones (denoted by “M”
and “L”). To differentiate between them, we append the variant-
type letter at the end of the model name, e.g. EfficientNetV2-T.
Our experiments show that these models yield superior perfor-
mance over all other VSR architectures. To our knowledge, this
represents the first use of EfficientNets in VSR.
Second, we systematically study the 3D front-end of the VSR
model. This is a crucial module, as it captures the short-term dy-
namics of the mouth region, and it has been proven advantageous
over simpler 2D front-ends [3, 11,12]. Specifically, we assume
a 3D front-end with a single convolution layer, and we focus
on the dimensions of the 3D convolution kernel and whether a
max-pooling layer should be used downstream. Our experiments
show that smaller kernels are as effective as larger ones; however, max-pooling combined with non-unit convolution stride hurts recognition, adversely affecting the majority of recent VSR systems that employ it within their 3D front-ends.
Third, we propose a robust back-end that combines a Trans-
former encoder with a temporal convolutional network (TCN).
TCNs are known to perform on-par with recurrent architec-
tures [5, 7], while being easier to train. However, one disadvan-
tage is their fixed receptive field. For this reason, we pre-process
the input features using a Transformer encoder, allowing us to
better handle sequences of different size. This yields a nearly
1% absolute WAcc improvement on the LRW corpus.
We evaluate our proposed system for speaker-independent word
VSR on the LRW corpus [6], a very popular lipreading dataset [2,
5–7, 12–25], thus allowing extensive comparisons. We report that
our introduced EfficientNetV2-T model is more accurate than the
currently best CNN-based network [7], while having a significantly
smaller computational cost. Further, our larger EfficientNetV2-L
configuration achieves the new state-of-the-art of 89.52% WAcc on
LRW, namely a 1.16% absolute improvement over [7].
2. THE PROPOSED VSR SYSTEM AND ITS MODULES
This section introduces our proposed VSR system and details its
modules. A schematic system overview is provided in Fig. 1a.
2.1. 3D Front-end
The mouth-region video frames are first processed by a 3D front-end in order to capture short-term spatio-temporal motion. The 3D front-end complements the back-end temporal classifier [11], which primarily captures long-term dynamics.

Fig. 1: Overview of the proposed VSR architecture, employing the EfficientNetV2-T model: (a) entire system; (b) Inverted-Bottleneck (MBConv) module with SiLU activation; (c) Fused-MBConv module, where the first pointwise and depthwise convolutions have been merged into a single regular convolution layer; (d) Transformer-encoder layer [26], along with the hyper-parameters employed; (e) TCN module. In our architecture we have used M = N = K and dilation 2^L, where L is the index of the TCN module (0-3).
The dominant 3D front-end in the literature [12, 21, 22, 27] consists of a convolutional layer with 3-dimensional (3D) kernels of size 5×7×7 (time/width/height) and stride 1×2×2, followed by batch normalization (BN) and rectified linear units (ReLU). The extracted feature maps are passed through a spatio-temporal max-pool layer with a kernel of size 1×3×3 and stride 1×2×2.
Due to the non-unit stride in convolution and max-pool layers,
the output spatial dimensions are four times smaller than the input.
We postulate this design decision was mostly driven by practical constraints: by significantly reducing the output dimensions, the network could fit in GPU memory and train faster. However, excessive and early downsampling discards crucial information during feature extraction that is not retained by the large 3D convolution kernel. For this reason, we remove the max-pool layer and decrease the convolution kernel size to 3×5×5.
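Below is a minimal PyTorch sketch of the proposed 3D front-end, assuming grayscale mouth ROIs and the 24 output channels shown in Fig. 1a; module and variable names are hypothetical, and the max-pool layer is simply omitted.

```python
import torch
import torch.nn as nn

class FrontEnd3D(nn.Module):
    """Proposed 3D front-end: a single 3D convolution with BN and ReLU, and no max-pool."""
    def __init__(self, out_channels: int = 24):
        super().__init__()
        # 3x5x5 kernel (time/height/width) with stride 1x2x2: only a 2x spatial downsample.
        self.conv = nn.Conv3d(1, out_channels, kernel_size=(3, 5, 5),
                              stride=(1, 2, 2), padding=(1, 2, 2), bias=False)
        self.bn = nn.BatchNorm3d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, T, H, W) grayscale mouth ROIs, e.g. (B, 1, 29, 88, 88) for LRW
        return self.act(self.bn(self.conv(x)))

# Example: 29 frames of 88x88 ROIs -> feature maps of shape (B, 24, 29, 44, 44)
feats = FrontEnd3D()(torch.randn(2, 1, 29, 88, 88))
```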
2.2. EfficientNetV2
Most resource-efficient CNNs [10, 28–30] rely on introducing new
basic-block modules that are computationally leaner. EfficientNets
deviate from this pattern. First, neural architecture search (NAS)
is utilized to obtain a baseline model that has good trade-off on
accuracy and FLOPs. The baseline model is then scaled up with a
compound scaling strategy (input resolution/width/depth), obtaining
a family of models with different efficiency/accuracy ratios. In this
work, we retain the input resolution fixed (88×88 pixels) and scale
the networks in the other two dimensions. We proceed with describ-
ing the two basic modules used in EfficientNetV2-based models.
Inverted Residual Bottleneck (MBConv): The inverted residual
with linear bottleneck module was first introduced in [29] (shown in
Fig. 1b). Here, we describe the variant used in the EfficientNet ar-
chitectures. First, a low-dimensional compressed 2D input is expanded (increasing the number of channels, M→L) with a pointwise (PW) convolution according to ratio ρ (L = M·ρ), and then filtered with a depthwise (DW) K×K convolution kernel (L→L). Channel-wise attention is applied using the squeeze-and-excitation (SE) mechanism [31] (reduction ratio r = 1/24), and the result is finally compressed back with a pointwise convolution (L→N). The depthwise convolution may have stride greater than one, decreasing the output dimensions. In case of unit stride, a residual connection is also applied (shown with a dashed line). EfficientNetV2 uses only K = 3. The SiLU activation [32] is also applied on the first two convolution layers and inside the SE mechanism.
To provide numerical examples of required FLOPs, we assume input of size M = 16, L = 128 (ρ = 8), N = 32, H = W = 32, DW kernel size K = 3, and unit stride. The first PW convolution layer requires M·L·H·W multiplications (2.09M FLOPs, 2048 parameters), the DW one costs K²·L·H·W (1.17M FLOPs, 1152 parameters), and the second PW convolution costs L·N·H·W (4.19M FLOPs, 4096 parameters). The SE mechanism contains two anti-symmetrical fully-connected layers, each requiring L·(L·r) multiplications (1.3K FLOPs, 1376 parameters). The total cost of a single such module is therefore 7.45M FLOPs and 8K parameters.
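As an illustration of the module just described, here is a simplified PyTorch sketch of an MBConv block (not the exact EfficientNetV2 implementation; class names, the SE squeeze width, and padding choices are assumptions):

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels: int, reduced: int):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, reduced, 1)   # squeeze
        self.fc2 = nn.Conv2d(reduced, channels, 1)   # excite
        self.act, self.gate = nn.SiLU(), nn.Sigmoid()

    def forward(self, x):
        s = x.mean((2, 3), keepdim=True)             # global average pooling
        return x * self.gate(self.fc2(self.act(self.fc1(s))))

class MBConv(nn.Module):
    def __init__(self, m: int, n: int, expansion: int = 4, k: int = 3,
                 stride: int = 1, se_ratio: float = 1 / 24):
        super().__init__()
        l = m * expansion
        self.use_residual = (stride == 1 and m == n)
        self.expand = nn.Sequential(                 # PW expansion: M -> L
            nn.Conv2d(m, l, 1, bias=False), nn.BatchNorm2d(l), nn.SiLU())
        self.depthwise = nn.Sequential(              # DW KxK filtering: L -> L
            nn.Conv2d(l, l, k, stride, padding=k // 2, groups=l, bias=False),
            nn.BatchNorm2d(l), nn.SiLU())
        self.se = SqueezeExcite(l, max(1, int(l * se_ratio)))
        self.project = nn.Sequential(                # PW projection: L -> N (linear)
            nn.Conv2d(l, n, 1, bias=False), nn.BatchNorm2d(n))

    def forward(self, x):
        out = self.project(self.se(self.depthwise(self.expand(x))))
        return x + out if self.use_residual else out

# Numerical setup from the text: M=16, rho=8 (so L=128), N=32, 32x32 feature maps.
y = MBConv(16, 32, expansion=8)(torch.randn(1, 16, 32, 32))  # -> (1, 32, 32, 32)
```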
Fused-MBConv: The MBConv module relies on depthwise con-
volution for spatial filtering. Depthwise convolutions have fewer pa-
rameters and FLOPs than regular convolutions, but they often can-
not fully utilize modern accelerators, especially in the early lay-
ers where spatial dimensions are large. To better utilize mobile or
server accelerators, Fused-MBConv was recently proposed [9]. It
fuses the expansion pointwise and depthwise convolutions of MBConv into a single regular convolution and lacks an SE mechanism, as shown in Fig. 1c. When applied in early stages, Fused-MBConv improves training speed with a small overhead in parameters and FLOPs; however, it is not as effective at later stages [9].
For the same numerical setup as in MBConv, the regular convolution costs M·K²·L·H·W multiplications (18M FLOPs, 18K parameters), and the projection pointwise convolution costs L·N·H·W (4.19M FLOPs, 4096 parameters). In total, the module requires 22.19M FLOPs and 23K parameters. While this cost is roughly three times that of MBConv, for smaller channel counts and larger spatial dimensions Fused-MBConv is more efficient on parallel hardware, where MBConv is bottlenecked by memory bandwidth rather than compute.
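A corresponding sketch of the Fused-MBConv block, under the same assumptions as the MBConv example above (a simplified illustration; special cases of the official implementation, such as ρ = 1, are not handled):

```python
import torch
import torch.nn as nn

class FusedMBConv(nn.Module):
    """Expansion PW and DW convolutions replaced by one regular KxK convolution; no SE."""
    def __init__(self, m: int, n: int, expansion: int = 4, k: int = 3, stride: int = 1):
        super().__init__()
        l = m * expansion
        self.use_residual = (stride == 1 and m == n)
        self.fused = nn.Sequential(                  # regular KxK convolution: M -> L
            nn.Conv2d(m, l, k, stride, padding=k // 2, bias=False),
            nn.BatchNorm2d(l), nn.SiLU())
        self.project = nn.Sequential(                # PW projection: L -> N (linear)
            nn.Conv2d(l, n, 1, bias=False), nn.BatchNorm2d(n))

    def forward(self, x):
        out = self.project(self.fused(x))
        return x + out if self.use_residual else out

# Same numerical setup as in the text: M=16, rho=8 (L=128), N=32, 32x32 feature maps.
y = FusedMBConv(16, 32, expansion=8)(torch.randn(1, 16, 32, 32))  # -> (1, 32, 32, 32)
```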
Table 1: Hyper-parameters of the Baseline and T/M/L EfficientNet variants. Operators MBConv and Fused-MBConv are explained in Section 2.2. All modules use spatial kernels of size 1×3×3. For all SE layers, r = 1/24. For variants T (α=0.66, β=1.0) and M (α=1.0, β=1.1) the #Layers and #Channels columns, respectively, are not shown, as they equal those of the Baseline.

Stage | Module | Stride | Baseline #Channels / #Layers | T (α=0.66) #Channels | M (β=1.1) #Layers | L (α=1.1, β=1.2) #Channels / #Layers
1 | Fused-MBConv (ρ=1) | 1 | 16 / 1 | 8 | 2 | 24 / 2
2 | Fused-MBConv (ρ=4) | 2 | 32 / 2 | 16 | 3 | 48 / 4
3 | Fused-MBConv (ρ=4) | 2 | 48 / 2 | 32 | 3 | 64 / 4
4 | MBConv (ρ=4, SE) | 2 | 96 / 3 | 56 | 4 | 128 / 6
5 | MBConv (ρ=6, SE) | 1 | 112 / 5 | 64 | 6 | 120 / 6
6 | MBConv (ρ=6, SE) | 2 | 192 / 8 | 112 | 9 | 208 / 10
7 | Conv 1×1 → Pooling → GLU | 1 | 768→384 / 1 | 768→384 | 1 | 768→384 / 1
Table 2: Comparison of ShuffleNetV2 (0.5×) [5] and our EfficientNetV2-T VSR systems with four 3D front-end variants, in terms of recognition performance (in WAcc, %, on the LRW test set) and efficiency (in per-frame FLOPs and parameters).

Conv-kernel size | Front-end FLOPs (×10⁹) | Max Pool | ShuffleNetV2 (0.5×) [5]: WAcc (%) / FLOPs Total (×10⁹) / FLOPs CNN (×10⁹) / Params (×10⁶) | EfficientNetV2-T: WAcc (%) / FLOPs Total (×10⁹) / FLOPs CNN (×10⁹) / Params (×10⁶)
3×3×3 | 0.04 | ✗ | 80.82 / 0.66 / 0.46 / 5.012 | 88.38 / 1.56 / 1.12 / 8.96
3×3×3 | 0.04 | ✓ | 79.50 / 0.32 / 0.12 / 5.012 | 86.19 / 0.80 / 0.36 / 8.96
3×5×5 | 0.10 | ✗ | 81.04 / 0.72 / 0.46 / 5.013 | 88.52 / 1.62 / 1.12 / 8.96
3×5×5 | 0.10 | ✓ | 79.51 / 0.39 / 0.12 / 5.013 | 86.67 / 0.86 / 0.36 / 8.96
Network Topology and Scaling: For our EfficientNet-based VSR
systems we start with the Baseline model of EfficientNetV2, shown
in the fourth column of Table 1. In contrast to the original Baseline,
we remove the first layer (Stage = 0), also known as stem layer, since
its functionality has been replaced by the 3D front-end (Section 2.1).
Further, all network variants share the same last Stage (7): a PW convolution expands the number of channels (X→768, where X is the number of channels output by Stage 6), a spatial averaging operation aggregates the spatial dimensions, and finally a GLU [33] reduces the final number of channels to 384. Apart from these changes, our
Baseline model is the same as the original [9].
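A sketch of this shared last stage, assuming the GLU halves the 768 channels to 384 by gating (names and the example spatial size are illustrative):

```python
import torch
import torch.nn as nn

class LastStage(nn.Module):
    """Stage 7: pointwise convolution to 768 channels, spatial averaging, GLU down to 384."""
    def __init__(self, in_channels: int, mid_channels: int = 768):
        super().__init__()
        self.pw = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        self.glu = nn.GLU(dim=1)       # splits the 768 channels into two halves of 384 and gates

    def forward(self, x):
        # x: (B*T, C, H, W) per-frame feature maps from Stage 6
        x = self.pw(x)                 # X -> 768
        x = x.mean(dim=(2, 3))         # spatial average pooling -> (B*T, 768)
        return self.glu(x)             # -> (B*T, 384) per-frame feature vectors

# e.g. T-variant: Stage 6 outputs 112 channels; the 3x3 spatial size here is illustrative.
vec = LastStage(in_channels=112)(torch.randn(29, 112, 3, 3))  # -> (29, 384)
```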
All network variants apply the same modules as the Baseline.
However, each variant is a scaled version of the baseline model.
Each module is parameterized by the channel width (α) that con-
trols how many channels will be used, as well as by the depth mul-
tiplier (β) that controls how many times the number of layers will
be scaled in each stage. We retain a fixed input size (88×88 pixels)
for all variants. We consider three scaling setups. The T (from tiny)
variant (α=0.66, β=1.0) is a configuration with a reduced number of channels and serves as our main network, targeting a computational cost of roughly one Giga-FLOP per frame. In addition, we create two
more variants, M (medium, α= 1.0,β= 1.1) and L (large, α= 1.1,
β= 1.2), to study the performance effect of scaling each axis (mod-
ule channels and module depth).
2.3. Transformer
After feeding the video through the 3D front-end and the 2D Effi-
cientNet, the output features are further processed by a Transformer
encoder [26] (shown in Fig. 1d). The Transformer encoder first
passes the input sequence through a multihead-attention layer [26]
and subsequently through two fully-connected layers. The attention
mechanism in the Transformer allows the network to dynamically
discard irrelevant information, for example the end of the previous
word or the start of the next one. Further, because it is fully parallelizable, it is fast to compute, a desirable property that recurrent architectures lack. The Transformer retains the number of input channels (384).
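A minimal sketch of such an encoder layer using PyTorch's built-in module, with the hyper-parameters of Fig. 1d (d_model = 384, 8 heads, a 128-dimensional feed-forward block, dropout 0.1); the exact normalization placement in the paper's layer may differ:

```python
import torch
import torch.nn as nn

# One Transformer-encoder layer: multi-head self-attention followed by a
# 384 -> 128 -> 384 feed-forward block, with LayerNorm and dropout (cf. Fig. 1d).
encoder_layer = nn.TransformerEncoderLayer(
    d_model=384, nhead=8, dim_feedforward=128,
    dropout=0.1, activation="relu", batch_first=True)

feats = torch.randn(2, 29, 384)   # (batch, T=29 frames, 384-dim features)
out = encoder_layer(feats)        # same shape: (2, 29, 384)
```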
2.4. TCN and Final Classifier
Finally, the output of the Transformer is fed to a 4-layer TCN, sim-
ilar to the ones used in [5]. Recently, TCNs have been used with
great success in VSR systems [5, 7, 21], serving as a drop-in replacement for recurrent architectures such as GRUs or LSTMs. They are easier to train and, for tasks like ours (LRW), have proved as accurate as recurrent models. Each TCN module (shown in Fig. 1e) applies two 1D convolutions with a 3×1 kernel and dilation δ = 2^L, where L is the module index (0-3). Further, each convolution layer pads its input appropriately in order to maintain the input length.
Table 3: Comparison of back-end classifier variations. The first two row-groups investigate a leaner TCN (named TCN-S) and whether a Transformer encoder is beneficial. The last row-group investigates fewer heads in the multi-head attention mechanism.

Model | WAcc (%) | Params (×10⁶) | FLOPs (×10⁹) | Inference time (ms/frame)
EfficientNetV2-T + TCN-S | 87.27 | 6.73 | 1.49 | 2.90
EfficientNetV2-T + TCN-S + Transformer (8 heads) | 88.14 | 7.42 | 1.51 | 2.93
EfficientNetV2-T + TCN | 87.52 | 8.89 | 1.60 | 2.98
EfficientNetV2-T + TCN + Transformer (8 heads) | 88.52 | 8.96 | 1.62 | 3.01
EfficientNetV2-T + TCN + Transformer (1 head) | 88.33 | 8.96 | 1.62 | 3.01
EfficientNetV2-T + TCN + Transformer (4 heads) | 88.47 | 8.96 | 1.62 | 3.01
Each convolution is followed by BN and PReLU. Finally, the hidden output of the last TCN module (L=3) is temporally averaged, fed to the final classification head (a fully-connected layer mapping 463→500), and a softmax normalizes it into a probability distribution.
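A simplified PyTorch sketch of this back-end (hypothetical names; the residual connection inside each module and the 384→463 channel projection are assumptions, and the actual TCN of [5] differs in such details):

```python
import torch
import torch.nn as nn

class TCNModule(nn.Module):
    """One TCN module (cf. Fig. 1e): two dilated 1D convolutions, each with BN, PReLU, dropout."""
    def __init__(self, channels: int, level: int):
        super().__init__()
        d = 2 ** level                                # dilation 2^L for module index L = 0..3
        def block():
            return nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, padding=d, dilation=d),
                nn.BatchNorm1d(channels), nn.PReLU(), nn.Dropout(0.2))
        self.net = nn.Sequential(block(), block())
        self.out_act = nn.PReLU()

    def forward(self, x):                             # x: (B, C, T); padding keeps T unchanged
        return self.out_act(x + self.net(x))          # residual connection (an assumption)

class BackEnd(nn.Module):
    """Transformer output (B, T, 384) -> 4 TCN modules -> temporal average -> 500-way classifier."""
    def __init__(self, in_dim: int = 384, channels: int = 463, num_classes: int = 500):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, channels, kernel_size=1)  # 384 -> 463 channels (assumed)
        self.tcn = nn.Sequential(*[TCNModule(channels, l) for l in range(4)])
        self.fc = nn.Linear(channels, num_classes)              # 463 -> 500

    def forward(self, x):
        x = self.proj(x.transpose(1, 2))              # (B, 463, T)
        x = self.tcn(x).mean(dim=2)                   # temporal average -> (B, 463)
        return self.fc(x).log_softmax(dim=-1)         # log-probabilities over the 500 words

log_probs = BackEnd()(torch.randn(2, 29, 384))        # -> (2, 500)
```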
3. EXPERIMENTAL SETUP
3.1. Database
We train and evaluate our proposed architecture on the challenging
LRW database [6]. It consists of short audiovisual speech segments,
extracted automatically from BBC TV broadcasts. The task is to rec-
ognize 500 distinct words from continuous speech. The database is
challenging due to its high variability with respect to the number of
speakers, naturally varying head-pose (near-frontal), and noisy back-
ground (real-world conditions). Words are not pre-segmented, and
there may be co-articulation from preceding and subsequent ones.
Moreover, there exist word pairs that share most visemes, for exam-
ple nouns in singular and plural forms (e.g. thing / things), verbs in
both present and past tenses (e.g. happen / happened / happening),
and homophones (e.g. whether / weather). All clips have a fixed
duration of 1.16 sec (29 frames at a 25 Hz rate). The training set
consists of up to 1000 utterances per target-word, while the valida-
tion and testing sets both contain 50 utterances per word. Samples
are shown in Fig. 2.
3.2. Visual Front-end
We use the same visual front-end as the one in [5], so we only briefly
overview its operation here. LRW videos contain tightly cropped
face images, thus the face detection step is skipped. First, 68 facial
landmarks are detected and tracked using [34]; these are interpolated in case of a detection failure in a frame. Subsequently, using the detected landmarks, the faces are aligned to a neutral reference frame, and a region-of-interest (ROI) of size 88×88 pixels containing the mouth area is extracted from each frame.

Fig. 2: Top row: example frames from the LRW corpus. Bottom row: corresponding ROIs after applying the visual front-end.
3.3. Training Setup
We follow the training hyper-parameters of [9]. We use two GPUs, each with a batch size of 80 videos for EfficientNetV2-T and 40 videos for the EfficientNetV2-M and EfficientNetV2-L variants. We employ the RMSProp optimizer with decay 0.9 and momentum 0.9, BN momentum 0.99, and weight decay 1e-5. The learning rate is first warmed up from 1e-6 to 0.18 over three epochs and then decayed by a factor of 0.97 every 2.4 epochs. We use an exponential moving average of the model weights with a 0.9999 decay rate, dropout (p = 0.3), and stochastic depth [35] with a 0.8 survival probability. During training, we apply data augmentation at the segment level by employing RandAugment [36].
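A sketch of how this optimization setup could be expressed in PyTorch (hyper-parameters taken from the text; the warm-up and decay are written as a per-epoch LambdaLR, while EMA, stochastic depth, and RandAugment are omitted; function and argument names are hypothetical):

```python
import torch

def build_optimizer_and_scheduler(model, warmup_epochs=3, peak_lr=0.18, start_lr=1e-6):
    # RMSProp with decay (alpha) 0.9, momentum 0.9, and weight decay 1e-5, as in the text.
    opt = torch.optim.RMSprop(model.parameters(), lr=peak_lr, alpha=0.9,
                              momentum=0.9, weight_decay=1e-5)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                       # linear warm-up from 1e-6 to 0.18
            frac = epoch / warmup_epochs
            return (start_lr + frac * (peak_lr - start_lr)) / peak_lr
        return 0.97 ** ((epoch - warmup_epochs) / 2.4)  # decay by 0.97 every 2.4 epochs

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched

# Usage: opt, sched = build_optimizer_and_scheduler(model); call sched.step() once per epoch.
```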
4. RESULTS
We investigate a number of variations of our EfficientNetV2-T VSR
architecture. First, we consider the role of the max-pool layer in the
3D front-end. Second, we investigate how the width of the TCN
module affects model accuracy, whether the presence of the Trans-
former encoder is beneficial, and which number of attention heads
leads to the best performance. Finally, we experiment with larger
EfficientNets (M and L variants).
3D Front-end: We perform controlled experiments where the
hyper-parameters of the 3D front-end are varied. In particular, we
test two different convolution kernel sizes (3×3×3 and 3×5×5) and, for each, two variants, with or without the max-pool layer. To ensure that our results transfer to other architectures too, we include results for the ShuffleNetV2 (0.5×) [5] network as well. As we can see from the results in Table 2, adding a max-pool layer to the 3D front-end has a detrimental effect on network performance: all networks without it outperform those with it. A negative side-effect of removing it, though, is the increased computational cost. Further, in our experience, networks without the max-pool layer also begin to converge faster than those with it.
Transformer and TCN: Transformers have been tried before on
the LRW database without much success [22]. However, when com-
bined with other components (such as TCN), they are effective. We
perform three experiments to understand the effect of TCN width,
the effect of Transformer presence, and finally whether a smaller
number of attention heads leads to better performance. We sum-
marize our results in Table 3. The first two rows contain results for
the TCN-S variant, which has 386 channels instead of the 463 used in
TCN. As we can see, the wider TCN shows a 0.25% absolute word
accuracy improvement. Further, in both cases (TCN-S and TCN),
the presence of the Transformer is beneficial, leading to significant WAcc improvements (0.87% and 1.0% absolute, respectively). In contrast, a smaller number of heads (1 or 4) degrades WAcc only slightly compared to the 8-head variant.
Fig. 3: Recognition performance on the LRW test set (in WAcc, %)
vs. efficiency (in per-frame FLOPs) of the proposed EfficientNet-
based VSR system (three variants), other resource-efficient models
(ShuffleNetV2 and MobiVSR) and ResNet-based VSR systems (see
also Table 4).
Table 4: EfficientNetV2 results for all three variants (T, M, L) on the LRW test set (lower part of the table). Inference time is measured on a single core of an i7-8750H CPU using PyTorch. Other literature results on the same benchmark are summarized in the upper part.

Model | WAcc (%) | Params (×10⁶) | FLOPs (×10⁹) | Inference time (ms/frame)
VGG [6] | 61.10
ResNet-34 + BiLSTM [12] | 83.00
ResNet-18 + 2×BiLSTM + word-boundary [13] | 88.08
Multi-Grained ResNet-34 + Conv + BiConvLSTM [14] | 83.34
Two-Stream ResNet-18 + BiLSTM [15] | 84.07
ResNet-18 + Bi-GRU + Policy Gradient [16] | 83.60
ResNet-18 + STFM [17] | 83.70
2×ResNet-18 + Bi-GRU [18] | 84.13 | 7.95 | 14.5
ResNet-18 + 3×Bi-GRU + MI [19] | 84.41
ResNet-18 + BiGRU + Face-cutout [20] | 85.02
ResNet-18 + Multi-Scale TCN [21] | 85.30
MobiVSR-1 [2] | 72.20 | 4.5 | 10.75
ResNet-18 + MS-TCN + MS-KD [5] | 87.90 | 36.4 | 10.31
ResNet-18 + MS-TCN + MS-KD (ensemble) [5] | 88.50 | 36.4 | 10.31
ShuffleNetV2 + MS-TCN + KD [5] | 85.50 | 28.8 | 2.23
SE-ResNet-18 + BiGRU [22] | 85.00
SE-ResNet-18 + BiGRU + word-boundary [22] | 88.40
ResNet-18 + HPConv [23] | 86.83
ResNet-18 + HPConv + word-boundary [23] | 88.60
2×ResNet-50 + SlowFast [37] | 84.40
ResNet-34 + GCN + BiGRU [38] | 84.25
ResNet-18 + Transformer [24] | 87.32
ResNet-18 + Dense-TCN [7] | 88.36
ResNet-18 + Conformer + Bimodal-KD [39] | 88.10
ALSOS-ResNet-18 + MS-TCN [40] | 87.01
3D-ResNet18 + BiGRU [41] | 86.23
Spatio-Temporal Attention + KD + word-boundary [25] | 88.64
EfficientNetV2-T + TCN + Transformer (8 heads) | 88.52 | 8.96 | 1.62 | 3.01
EfficientNetV2-M + TCN + Transformer (8 heads) | 89.01 | 13.68 | 5.10 | 7.07
EfficientNetV2-L + TCN + Transformer (8 heads) | 89.52 | 15.47 | 5.87 | 7.89
EfficientNet: We summarize literature results in the upper section
of Table 4, while in the lower section we present results for our three
EfficientNetV2-based variants (T/M/L). Along with WAcc (%), we
also include model size (number of parameters) and computational
requirements (FLOPs and inference time per frame), where avail-
able. We consider as the state-of-the-art on LRW the 88.36% WAcc
reported in [7]. Note that some systems in the literature exceed this
performance; however, these either exploit word-boundary information (which must be provided at test time), are ensembles of multiple networks, or are trained on additional data resources.
Our EfficientNetV2-T outperforms all tabulated literature models that do not exploit word-boundary information, while being four times leaner than the ResNet-18 of [5]. Compared to other resource-efficient VSR models
like ShuffleNetV2 [5], it yields a 3% absolute WAcc improvement,
while also being slightly faster. Scaling the width or depth of the
network yields further improvements. Indeed, the EfficientNetV2-M
variant achieves 0.49% absolute improvement over the T-variant, and
EfficientNetV2-L achieves a new state-of-the-art result of 89.52%,
a 1.16% absolute WAcc improvement over [7]. A summary of such
VSR system comparisons can also be viewed in Fig. 3.
5. CONCLUSIONS
In this paper, we focused on resource-efficient end-to-end deep-learning based lipreading, making three contributions to efficient VSR architectures. First, we presented EfficientNetV2-T, a lean
model that outperforms other resource-efficient VSR systems, and
subsequently EfficientNetV2-L, an up-scaled version that improves
the state-of-the-art WAcc result on LRW by 1.16% absolute. Finally,
we identified and resolved a deficiency in the leading 3D front-end
of the literature, and we showed that combining a Transformer
encoder with a TCN is beneficial.
6. REFERENCES
[1] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Lip
reading sentences in the wild,” in Proc. CVPR, 2017.
[2] N. Shrivastava, A. Saxena, Y. Kumar, R. R. Shah, D. Mahata, and A. Stent, “MobiVSR: A visual speech recognition solution for mobile devices,” CoRR, arXiv:1905.03968v3, 2019.
[3] A. Koumparoulis and G. Potamianos, “MobiLipNet: Resource-efficient deep learning based lipreading,” in Proc. Interspeech, 2019.
[4] A. Koumparoulis, G. Potamianos, S. Thomas, and E. S. Morais, “Resource-adaptive deep learning for visual speech recognition,” in Proc. Interspeech, 2020.
[5] P. Ma, B. Martinez, S. Petridis, and M. Pantic, “Towards practical lipreading with distilled and efficient models,” in Proc. ICASSP, 2021.
[6] J. S. Chung and A. Zisserman, “Lip reading in the wild,” in
Proc. ACCV, 2016.
[7] P. Ma, Y. Wang, J. Shen, S. Petridis, and M. Pantic, “Lip-
reading with densely connected temporal convolutional net-
works,” in Proc. WACV, 2021.
[8] M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proc. ICML, 2019.
[9] M. Tan and Q. V. Le, “EfficientNetV2: Smaller models and
faster training,” in Proc. ICML, 2021.
[10] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan,
W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and
H. Adam, “Searching for MobileNetV3,” in Proc. ICCV, 2019.
[11] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, “LipNet: End-to-end sentence-level lipreading,” CoRR, arXiv:1611.01599v2, 2016.
[12] T. Stafylakis and G. Tzimiropoulos, “Combining residual net-
works with LSTMs for lipreading,” in Proc. Interspeech, 2017.
[13] T. Stafylakis, M. H. Khan, and G. Tzimiropoulos, “Pushing
the boundaries of audiovisual word recognition using residual
networks and LSTMs,” Comp. Vision Image Unders., 176-177:
22–32, 2018.
[14] C. Wang, “Multi-grained spatio-temporal modeling for lip-
reading,” in Proc. BMVC, 2019.
[15] X. Weng and K. Kitani, “Learning spatio-temporal features
with two-stream deep 3D CNNs for lipreading,” in Proc.
BMVC, 2019.
[16] M. Luo, S. Yang, S. Shan, and X. Chen, “Pseudo-convolutional policy gradient for sequence-to-sequence lip-reading,” in Proc. FG, 2020.
[17] X. Zhang, F. Cheng, and S. Wang, “Spatio-temporal fusion based convolutional sequence learning for lip reading,” in Proc. ICCV, 2019.
[18] J. Xiao, S. Yang, Y. Zhang, S. Shan, and X. Chen, “Deformation flow based two-stream network for lip reading,” in Proc. FG, 2020.
[19] X. Zhao, S. Yang, S. Shan, and X. Chen, “Mutual information maximization for effective lip reading,” in Proc. FG, 2020.
[20] Y. Zhang, S. Yang, J. Xiao, S. Shan, and X. Chen, “Can we
read speech beyond the lips? Rethinking RoI selection for deep
visual speech recognition,” in Proc. FG, 2020.
[21] B. Martinez, P. Ma, S. Petridis, and M. Pantic, “Lipreading using temporal convolutional networks,” in Proc. ICASSP, 2020.
[22] D. Feng, S. Yang, S. Shan, and X. Chen, “Learn an effective
lip reading model without pains,” CoRR, arXiv:2011.07557,
2020.
[23] H. Chen, J. Du, Y. Hu, L.-R. Dai, B.-C. Yin, and C.-H.
Lee, “Automatic lip-reading with hierarchical pyramidal con-
volution and self-attention for image sequences with no word
boundaries,” in Proc. Interspeech, 2021.
[24] M. Luo, S. Yang, X. Chen, Z. Liu, and S. Shan, “Synchronous
bidirectional learning for multilingual lip reading,” in Proc.
BMVC, 2020.
[25] S. Elashmawy, M. Ramsis, H. M. Eraqi, F. Eldeshnawy, H. Mabrouk, O. Abugabal, and N. Sakr, “Spatio-temporal attention mechanism and knowledge distillation for lip reading,” CoRR, arXiv:2108.03543, 2021.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all
you need,” in Proc. NeurIPS, 2017.
[27] S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic, “End-to-end audiovisual speech recognition,” in Proc. ICASSP, 2018.
[28] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Effi-
cient convolutional neural networks for mobile vision applica-
tions,” CoRR, arXiv:1704.04861v1, 2017.
[29] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C.
Chen, “MobileNetV2: Inverted residuals and linear bottle-
necks,” in Proc. CVPR, 2018.
[30] X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An ex-
tremely efficient convolutional neural network for mobile de-
vices,” in Proc. CVPR, 2018.
[31] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation net-
works,” in Proc. CVPR, 2018.
[32] S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-weighted linear
units for neural network function approximation in reinforce-
ment learning,” Neural Networks, 107: 3–11, 2018.
[33] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proc. ICML, 2017.
[34] S. Ren, X. Cao, Y. Wei, and J. Sun, “Face alignment at 3000 fps via regressing local binary features,” in Proc. CVPR, 2014.
[35] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger,
“Deep networks with stochastic depth,” in Proc. ECCV, 2016.
[36] E. D. Cubuk, B. Zoph, J. Shlens, and Q. Le, “RandAugment:
Practical automated data augmentation with a reduced search
space,” in Proc. NeurIPS, 2020.
[37] P. Wiriyathammabhum, “SpotFast networks with memory aug-
mented lateral transformers for lipreading,” in Proc. ICONIP,
2020.
[38] H. Liu, Z. Chen, and B. Yang, “Lip graph assisted audio-visual
speech recognition using bidirectional synchronous fusion,” in
Proc. Interspeech, 2020.
[39] P. Ma, R. Mira, S. Petridis, B. W. Schuller, and M. Pan-
tic, “LiRA: Learning visual speech representations from audio
through self-supervision,” in Proc. Interspeech, 2021.
[40] D. Tsourounis, D. Kastaniotis, and S. Fotopoulos, “Lip read-
ing by alternating between spatiotemporal and spatial convolu-
tions,” J. Imaging, 7(5):91, 2021.
[41] M. Hao, M. Mamut, N. Yadikar, A. Aysa, and K. Ubul, “How
to use time information effectively? Combining with time shift
module for lipreading,” in Proc. ICASSP, 2021.