
Accurate and Resource-Efficient Lipreading with Efficientnetv2 and Transformers

Alexandros Koumparoulis, Gerasimos Potamianos
Electrical and Computer Engineering Department, University of Thessaly, Volos, Greece
ABSTRACT
We present a novel resource-efficient end-to-end architecture for
lipreading that achieves state-of-the-art results on a popular and
challenging benchmark. In particular, we make the following contributions: First, inspired by the recent success of the EfficientNet architecture in image classification and by our earlier work on resource-efficient lipreading models (MobiLipNet), we introduce EfficientNets to the lipreading task. Second, we show that the 3D front-end currently most popular in the literature contains a max-pool layer that prevents networks from reaching superior performance, and we propose its removal. Finally, we improve our system’s back-end
robustness by including a Transformer encoder. We evaluate our
proposed system on the “Lipreading In-The-Wild” (LRW) corpus, a
database containing short video segments from BBC TV broadcasts.
The proposed network (T-variant) attains 88.53% word accuracy,
a 0.17% absolute improvement over the current state-of-the-art,
while being five times less computationally intensive. Further, an
up-scaled version of our model (L-variant) achieves 89.52%, a new
state-of-the-art result on the LRW corpus.
Index Terms: EfficientNet, Transformers, Lipreading.
1. INTRODUCTION
Visual speech recognition (VSR) models have progressed signifi-
cantly, thanks to advances in deep learning algorithms. Typically,
these are based on accurate but computationally intensive architec-
tures, such as convolutional neural networks (CNNs) and recurrent
or self-attention layers. For example, the 2D CNN of [1] requires
11.22×10⁹ floating-point operations (FLOPs) to process a single
video frame. In resource-constrained scenarios, such deep learning-
based models are impractical due to their computational profile.
Recently, a small number of works [2–5] have focused on im-
proving VSR model efficiency, in order to enable broader application
of this technology. Proposed architectures reduce computational re-
quirements by replacing standard convolutions with grouped ones,
such as depthwise and pointwise. In these cases, the gains in FLOPs
are significant: for example, our MobiLipNetV2 [3] is 37 times more
efficient than a 3D-ResNet. However, when presented with challeng-
ing real-world data, efficient VSR models still lag in recognition per-
formance. For example, the best ShuffleNetV2 model in [5] achieves
a word accuracy (WAcc) of 85.5% on the LRW corpus [6], trailing
the current state-of-the-art of 88.36% [7] on the task.
Motivated by the above, in this paper we attempt to eliminate
the performance discrepancy between conventional and resource-
efficient VSR systems. To this end, we propose a novel neural net-
work architecture, focusing on both recognition accuracy and re-
source efficiency. In particular, we make the following contributions:
First, we introduce a new VSR model based on the architecture
of EfficientNet [8, 9]. This resource-efficient model also relies on pointwise/depthwise convolutions. However, its main departure from similar architectures like MobileNetV3 [10] is compound scaling, where the model’s depth, width, and resolution are scaled together, resulting in a family of models with different accuracy/efficiency trade-offs. Here we present three
configurations: a tiny one (denoted by “T”), targeting resource-
constrained applications, and two larger ones (denoted by “M”
and “L”). To differentiate between them, we append the variant-
type letter at the end of the model name, e.g. EfficientNetV2-T.
Our experiments show that these models yield superior perfor-
mance over all other VSR architectures. To our knowledge, this
represents the first use of EfficientNets in VSR.
Second, we systematically study the 3D front-end of the VSR
model. This is a crucial module, as it captures the short-term dy-
namics of the mouth region, and it has been proven advantageous
over simpler 2D front-ends [3, 11,12]. Specifically, we assume
a 3D front-end with a single convolution layer, and we focus
on the dimensions of the 3D convolution kernel and whether a
max-pooling layer should be used downstream. Our experiments
show that smaller kernels are as effective as larger ones; however, max-pooling combined with non-unit convolution stride hurts recognition, adversely affecting the majority of recent VSR systems that employ it within their 3D front-ends.
Third, we propose a robust back-end that combines a Trans-
former encoder with a temporal convolutional network (TCN).
TCNs are known to perform on-par with recurrent architec-
tures [5, 7], while being easier to train. However, one disadvan-
tage is their fixed receptive field. For this reason, we pre-process
the input features using a Transformer encoder, allowing us to
better handle sequences of different size. This yields a nearly
1% absolute WAcc improvement on the LRW corpus.
We evaluate our proposed system for speaker-independent word
VSR on the LRW corpus [6], a very popular lipreading dataset [2,
5–7, 12–25], thus allowing extensive comparisons. We report that
our introduced EfficientNetV2-T model is more accurate than the
currently best CNN-based network [7], while having a significantly
smaller computational cost. Further, our larger EfficientNetV2-L
configuration achieves the new state-of-the-art of 89.52% WAcc on
LRW, namely a 1.16% absolute improvement over [7].
2. THE PROPOSED VSR SYSTEM AND ITS MODULES
This section introduces our proposed VSR system and details its
modules. A schematic system overview is provided in Fig. 1a.
2.1. 3D Front-end
The mouth-region video frames are first processed by a 3D front-end in order to capture short-term spatio-temporal motion. The 3D front-end complements the back-end temporal classifier [11], which primarily captures long-term dynamics.

Fig. 1: Overview of the proposed VSR architecture, employing the EfficientNetV2-T model: (a) entire system; (b) Inverted-Bottleneck (MBConv) module with SiLU activation; (c) Fused-MBConv module, where the first pointwise and depthwise convolutions have been merged into a single regular convolution layer; (d) Transformer-encoder layer [26], along with the hyper-parameters employed; (e) TCN module. In our architecture we have used M = N = K and dilation 2^L, where L is the index of the TCN module (0-3).
The dominant 3D front-end in the literature [12, 21, 22, 27] consists of a convolutional layer with 3-dimensional (3D) kernels of size 5×7×7 (time/width/height) and stride 1×2×2, followed by batch normalization (BN) and rectified linear units (ReLU). The extracted feature maps are passed through a spatio-temporal max-pool layer with a kernel of size 1×3×3 and stride 1×2×2.
Due to the non-unit stride in convolution and max-pool layers,
the output spatial dimensions are four times smaller than the input.
We postulate this design decision was mostly driven by practical constraints: by significantly reducing the output dimensions, the network could fit in GPU memory and train faster. However, excessive and early downsampling discards crucial information during feature extraction that is not retained by the large 3D convolution kernel. For this reason, we remove the max-pool layer and decrease the convolution kernel size to 3×5×5.
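Below is a minimal PyTorch sketch of the proposed 3D front-end, assuming grayscale mouth ROIs and the 24 output channels shown in Fig. 1a; module and variable names are hypothetical, and the max-pool layer is simply omitted.

```python
import torch
import torch.nn as nn

class FrontEnd3D(nn.Module):
    """Proposed 3D front-end: a single 3D convolution with BN and ReLU, and no max-pool."""
    def __init__(self, out_channels: int = 24):
        super().__init__()
        # 3x5x5 kernel (time/height/width) with stride 1x2x2: only a 2x spatial downsample.
        self.conv = nn.Conv3d(1, out_channels, kernel_size=(3, 5, 5),
                              stride=(1, 2, 2), padding=(1, 2, 2), bias=False)
        self.bn = nn.BatchNorm3d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, T, H, W) grayscale mouth ROIs, e.g. (B, 1, 29, 88, 88) for LRW
        return self.act(self.bn(self.conv(x)))

# Example: 29 frames of 88x88 ROIs -> feature maps of shape (B, 24, 29, 44, 44)
feats = FrontEnd3D()(torch.randn(2, 1, 29, 88, 88))
```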
2.2. EfficientNetV2
Most resource-efficient CNNs [10, 28–30] rely on introducing new
basic-block modules that are computationally leaner. EfficientNets
deviate from this pattern. First, neural architecture search (NAS)
is utilized to obtain a baseline model that has good trade-off on
accuracy and FLOPs. The baseline model is then scaled up with a
compound scaling strategy (input resolution/width/depth), obtaining
a family of models with different efficiency/accuracy ratios. In this
work, we retain the input resolution fixed (88×88 pixels) and scale
the networks in the other two dimensions. We proceed with describ-
ing the two basic modules used in EfficientNetV2-based models.
Inverted Residual Bottleneck (MBConv): The inverted residual
with linear bottleneck module was first introduced in [29] (shown in
Fig. 1b). Here, we describe the variant used in the EfficientNet ar-
chitectures. First, a low-dimensional compressed 2D input is expanded (increasing the number of channels, M→L) with a pointwise (PW) convolution according to ratio ρ (L = M·ρ), and then filtered with a depthwise (DW) K×K convolution kernel (L→L). Channel-wise attention is applied using the squeeze-and-excitation (SE) mechanism [31] (reduction ratio r = 1/24), and the result is finally compressed back with a pointwise convolution (L→N). The depthwise convolution may have stride greater than one, decreasing the output dimensions. In case of unit stride, a residual connection is also applied (shown with a dashed line). EfficientNetV2 uses only K = 3. The SiLU activation [32] is also applied on the first two convolution layers and inside the SE mechanism.
To provide numerical examples of required FLOPs, we assume input of size M = 16, L = 128 (ρ = 8), N = 32, H = W = 32, DW kernel size K = 3, and unit stride. The first PW convolution layer requires M·L·H·W multiplications (2.09M FLOPs, 2048 parameters), the DW one costs K²·L·H·W (1.17M FLOPs, 1152 parameters), and the second PW convolution costs L·N·H·W (4.19M FLOPs, 4096 parameters). The SE mechanism contains two anti-symmetrical fully-connected layers, each requiring L·(L·r) multiplications (1.3K FLOPs, 1376 parameters). The total cost of a single such module is therefore 7.45M FLOPs and 8K parameters.
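As an illustration of the module just described, here is a simplified PyTorch sketch of an MBConv block (not the exact EfficientNetV2 implementation; class names, the SE squeeze width, and padding choices are assumptions):

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels: int, reduced: int):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, reduced, 1)   # squeeze
        self.fc2 = nn.Conv2d(reduced, channels, 1)   # excite
        self.act, self.gate = nn.SiLU(), nn.Sigmoid()

    def forward(self, x):
        s = x.mean((2, 3), keepdim=True)             # global average pooling
        return x * self.gate(self.fc2(self.act(self.fc1(s))))

class MBConv(nn.Module):
    def __init__(self, m: int, n: int, expansion: int = 4, k: int = 3,
                 stride: int = 1, se_ratio: float = 1 / 24):
        super().__init__()
        l = m * expansion
        self.use_residual = (stride == 1 and m == n)
        self.expand = nn.Sequential(                 # PW expansion: M -> L
            nn.Conv2d(m, l, 1, bias=False), nn.BatchNorm2d(l), nn.SiLU())
        self.depthwise = nn.Sequential(              # DW KxK filtering: L -> L
            nn.Conv2d(l, l, k, stride, padding=k // 2, groups=l, bias=False),
            nn.BatchNorm2d(l), nn.SiLU())
        self.se = SqueezeExcite(l, max(1, int(l * se_ratio)))
        self.project = nn.Sequential(                # PW projection: L -> N (linear)
            nn.Conv2d(l, n, 1, bias=False), nn.BatchNorm2d(n))

    def forward(self, x):
        out = self.project(self.se(self.depthwise(self.expand(x))))
        return x + out if self.use_residual else out

# Numerical setup from the text: M=16, rho=8 (so L=128), N=32, 32x32 feature maps.
y = MBConv(16, 32, expansion=8)(torch.randn(1, 16, 32, 32))  # -> (1, 32, 32, 32)
```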
Fused-MBConv: The MBConv module relies on depthwise con-
volution for spatial filtering. Depthwise convolutions have fewer pa-
rameters and FLOPs than regular convolutions, but they often can-
not fully utilize modern accelerators, especially in the early lay-
ers where spatial dimensions are large. To better utilize mobile or
server accelerators, Fused-MBConv was recently proposed [9]. It
fuses the expansion pointwise and depthwise convolutions of MBConv into a single regular convolution and lacks an SE mechanism, as shown in Fig. 1c. When applied in early stages, Fused-MBConv improves training speed with a small overhead in parameters and FLOPs; however, it is not as effective at later stages [9].
For the same numerical setup as in MBConv, the regular convolution costs M·K²·L·H·W multiplications (18M FLOPs, 18K parameters), and the projection pointwise convolution costs L·N·H·W (4.19M FLOPs, 4096 parameters). In total, the module requires 22.19M FLOPs and 23K parameters. While this cost is roughly three times that of MBConv, for smaller channel counts and larger spatial dimensions Fused-MBConv is more efficient on parallel hardware, where MBConv is bottlenecked by memory bandwidth rather than compute.
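A corresponding sketch of the Fused-MBConv block, under the same assumptions as the MBConv example above (a simplified illustration; special cases of the official implementation, such as ρ = 1, are not handled):

```python
import torch
import torch.nn as nn

class FusedMBConv(nn.Module):
    """Expansion PW and DW convolutions replaced by one regular KxK convolution; no SE."""
    def __init__(self, m: int, n: int, expansion: int = 4, k: int = 3, stride: int = 1):
        super().__init__()
        l = m * expansion
        self.use_residual = (stride == 1 and m == n)
        self.fused = nn.Sequential(                  # regular KxK convolution: M -> L
            nn.Conv2d(m, l, k, stride, padding=k // 2, bias=False),
            nn.BatchNorm2d(l), nn.SiLU())
        self.project = nn.Sequential(                # PW projection: L -> N (linear)
            nn.Conv2d(l, n, 1, bias=False), nn.BatchNorm2d(n))

    def forward(self, x):
        out = self.project(self.fused(x))
        return x + out if self.use_residual else out

# Same numerical setup as in the text: M=16, rho=8 (L=128), N=32, 32x32 feature maps.
y = FusedMBConv(16, 32, expansion=8)(torch.randn(1, 16, 32, 32))  # -> (1, 32, 32, 32)
```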
Table 1: Hyper-parameters of the Baseline and T/M/L EfficientNet variants. Operators MBConv and Fused-MBConv are explained in Section 2.2. All modules use spatial kernels of size 1×3×3. For all SE layers, r = 1/24. For variants T (α=0.66, β=1.0) and M (α=1.0, β=1.1) the #Layers and #Channels columns, respectively, are not shown, as they equal those of the Baseline.

Stage | Module | Stride | Baseline #Channels / #Layers | T (α=0.66) #Channels | M (β=1.1) #Layers | L (α=1.1, β=1.2) #Channels / #Layers
1 | Fused-MBConv (ρ=1) | 1 | 16 / 1 | 8 | 2 | 24 / 2
2 | Fused-MBConv (ρ=4) | 2 | 32 / 2 | 16 | 3 | 48 / 4
3 | Fused-MBConv (ρ=4) | 2 | 48 / 2 | 32 | 3 | 64 / 4
4 | MBConv (ρ=4, SE) | 2 | 96 / 3 | 56 | 4 | 128 / 6
5 | MBConv (ρ=6, SE) | 1 | 112 / 5 | 64 | 6 | 120 / 6
6 | MBConv (ρ=6, SE) | 2 | 192 / 8 | 112 | 9 | 208 / 10
7 | Conv 1×1 → Pooling → GLU | 1 | 768→384 / 1 | 768→384 | 1 | 768→384 / 1
Table 2: Comparison of ShuffleNetV2 (0.5×) [5] and our EfficientNetV2-T VSR systems with four 3D front-end variants, in terms of recognition performance (in WAcc, %, on the LRW test set) and efficiency (in per-frame FLOPs and parameters).

Conv-kernel size | Front-end FLOPs (×10⁹) | Max Pool | ShuffleNetV2 (0.5×) [5]: WAcc (%) / FLOPs Total (×10⁹) / FLOPs CNN (×10⁹) / Params (×10⁶) | EfficientNetV2-T: WAcc (%) / FLOPs Total (×10⁹) / FLOPs CNN (×10⁹) / Params (×10⁶)
3×3×3 | 0.04 | ✗ | 80.82 / 0.66 / 0.46 / 5.012 | 88.38 / 1.56 / 1.12 / 8.96
3×3×3 | 0.04 | ✓ | 79.50 / 0.32 / 0.12 / 5.012 | 86.19 / 0.80 / 0.36 / 8.96
3×5×5 | 0.10 | ✗ | 81.04 / 0.72 / 0.46 / 5.013 | 88.52 / 1.62 / 1.12 / 8.96
3×5×5 | 0.10 | ✓ | 79.51 / 0.39 / 0.12 / 5.013 | 86.67 / 0.86 / 0.36 / 8.96
Network Topology and Scaling: For our EfficientNet-based VSR
systems we start with the Baseline model of EfficientNetV2, shown
in the fourth column of Table 1. In contrast to the original Baseline,
we remove the first layer (Stage = 0), also known as stem layer, since
its functionality has been replaced by the 3D front-end (Section 2.1).
Further, all network variants share the same last Stage (7): a PW convolution expands the number of channels (X→768, where X is the number of channels output by Stage 6), a spatial averaging operation aggregates the spatial dimensions, and finally a GLU [33] reduces the final number of channels to 384. Apart from these changes, our
Baseline model is the same as the original [9].
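A sketch of this shared last stage, assuming the GLU halves the 768 channels to 384 by gating (names and the example spatial size are illustrative):

```python
import torch
import torch.nn as nn

class LastStage(nn.Module):
    """Stage 7: pointwise convolution to 768 channels, spatial averaging, GLU down to 384."""
    def __init__(self, in_channels: int, mid_channels: int = 768):
        super().__init__()
        self.pw = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        self.glu = nn.GLU(dim=1)       # splits the 768 channels into two halves of 384 and gates

    def forward(self, x):
        # x: (B*T, C, H, W) per-frame feature maps from Stage 6
        x = self.pw(x)                 # X -> 768
        x = x.mean(dim=(2, 3))         # spatial average pooling -> (B*T, 768)
        return self.glu(x)             # -> (B*T, 384) per-frame feature vectors

# e.g. T-variant: Stage 6 outputs 112 channels; the 3x3 spatial size here is illustrative.
vec = LastStage(in_channels=112)(torch.randn(29, 112, 3, 3))  # -> (29, 384)
```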
All network variants apply the same modules as the Baseline.
However, each variant is a scaled version of the baseline model.
Each module is parameterized by the channel width (α) that con-
trols how many channels will be used, as well as by the depth mul-
tiplier (β) that controls how many times the number of layers will
be scaled in each stage. We retain a fixed input size (88×88 pixels)
for all variants. We consider three scaling setups. The T (from tiny)
variant (α=0.66, β=1.0) is a configuration with a reduced number of channels and serves as our main network, targeting a computational cost of roughly one Giga-FLOP per frame. In addition, we create two
more variants, M (medium, α= 1.0,β= 1.1) and L (large, α= 1.1,
β= 1.2), to study the performance effect of scaling each axis (mod-
ule channels and module depth).
2.3. Transformer
After feeding the video through the 3D front-end and the 2D Effi-
cientNet, the output features are further processed by a Transformer
encoder [26] (shown in Fig. 1d). The Transformer encoder first
passes the input sequence through a multihead-attention layer [26]
and subsequently through two fully-connected layers. The attention
mechanism in the Transformer allows the network to dynamically
discard irrelevant information, for example the end of the previous
word or the start of the next one. Further, because it is fully parallelizable, it is fast to compute, a desirable property that recurrent architectures lack. The Transformer retains the number of input channels (384).
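A minimal sketch of such an encoder layer using PyTorch's built-in module, with the hyper-parameters of Fig. 1d (d_model = 384, 8 heads, a 128-dimensional feed-forward block, dropout 0.1); the exact normalization placement in the paper's layer may differ:

```python
import torch
import torch.nn as nn

# One Transformer-encoder layer: multi-head self-attention followed by a
# 384 -> 128 -> 384 feed-forward block, with LayerNorm and dropout (cf. Fig. 1d).
encoder_layer = nn.TransformerEncoderLayer(
    d_model=384, nhead=8, dim_feedforward=128,
    dropout=0.1, activation="relu", batch_first=True)

feats = torch.randn(2, 29, 384)   # (batch, T=29 frames, 384-dim features)
out = encoder_layer(feats)        # same shape: (2, 29, 384)
```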
2.4. TCN and Final Classifier
Finally, the output of the Transformer is fed to a 4-layer TCN, sim-
ilar to the ones used in [5]. Recently, TCNs have been used with
great success in VSR systems [5, 7, 21], serving as a drop-in replacement for recurrent architectures such as GRUs or LSTMs. They are easier to train and, for tasks like ours (LRW), have proved as accurate as recurrent models. Each TCN module (shown in Fig. 1e) applies two 1D convolutions with a 3×1 kernel and dilation δ = 2^L, where L is the module index (0-3). Further, each convolution layer pads its input appropriately in order to maintain the input length.
Table 3: Comparison of back-end classifier variations. The first two row-groups investigate a leaner TCN (named TCN-S) and whether a Transformer encoder is beneficial. The last row-group investigates fewer heads in the multi-head attention mechanism.

Model | WAcc (%) | Params (×10⁶) | FLOPs (×10⁹) | Inference time (ms/frame)
EfficientNetV2-T + TCN-S | 87.27 | 6.73 | 1.49 | 2.90
EfficientNetV2-T + TCN-S + Transformer (8 heads) | 88.14 | 7.42 | 1.51 | 2.93
EfficientNetV2-T + TCN | 87.52 | 8.89 | 1.60 | 2.98
EfficientNetV2-T + TCN + Transformer (8 heads) | 88.52 | 8.96 | 1.62 | 3.01
EfficientNetV2-T + TCN + Transformer (1 head) | 88.33 | 8.96 | 1.62 | 3.01
EfficientNetV2-T + TCN + Transformer (4 heads) | 88.47 | 8.96 | 1.62 | 3.01
Each convolution is followed by BN and PReLU. Finally, the hidden output of the last TCN module (L=3) is temporally averaged, fed to the final classification head (a fully-connected layer mapping 463→500), and a softmax normalizes it into a probability distribution.
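A simplified PyTorch sketch of this back-end (hypothetical names; the residual connection inside each module and the 384→463 channel projection are assumptions, and the actual TCN of [5] differs in such details):

```python
import torch
import torch.nn as nn

class TCNModule(nn.Module):
    """One TCN module (cf. Fig. 1e): two dilated 1D convolutions, each with BN, PReLU, dropout."""
    def __init__(self, channels: int, level: int):
        super().__init__()
        d = 2 ** level                                # dilation 2^L for module index L = 0..3
        def block():
            return nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, padding=d, dilation=d),
                nn.BatchNorm1d(channels), nn.PReLU(), nn.Dropout(0.2))
        self.net = nn.Sequential(block(), block())
        self.out_act = nn.PReLU()

    def forward(self, x):                             # x: (B, C, T); padding keeps T unchanged
        return self.out_act(x + self.net(x))          # residual connection (an assumption)

class BackEnd(nn.Module):
    """Transformer output (B, T, 384) -> 4 TCN modules -> temporal average -> 500-way classifier."""
    def __init__(self, in_dim: int = 384, channels: int = 463, num_classes: int = 500):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, channels, kernel_size=1)  # 384 -> 463 channels (assumed)
        self.tcn = nn.Sequential(*[TCNModule(channels, l) for l in range(4)])
        self.fc = nn.Linear(channels, num_classes)              # 463 -> 500

    def forward(self, x):
        x = self.proj(x.transpose(1, 2))              # (B, 463, T)
        x = self.tcn(x).mean(dim=2)                   # temporal average -> (B, 463)
        return self.fc(x).log_softmax(dim=-1)         # log-probabilities over the 500 words

log_probs = BackEnd()(torch.randn(2, 29, 384))        # -> (2, 500)
```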
3. EXPERIMENTAL SETUP
3.1. Database
We train and evaluate our proposed architecture on the challenging
LRW database [6]. It consists of short audiovisual speech segments,
extracted automatically from BBC TV broadcasts. The task is to rec-
ognize 500 distinct words from continuous speech. The database is
challenging due to its high variability with respect to the number of
speakers, naturally varying head-pose (near-frontal), and noisy back-
ground (real-world conditions). Words are not pre-segmented, and
there may be co-articulation from preceding and subsequent ones.
Moreover, there exist word pairs that share most visemes, for exam-
ple nouns in singular and plural forms (e.g. thing / things), verbs in
both present and past tenses (e.g. happen / happened / happening),
and homophones (e.g. whether / weather). All clips have a fixed
duration of 1.16 sec (29 frames at a 25 Hz rate). The training set
consists of up to 1000 utterances per target-word, while the valida-
tion and testing sets both contain 50 utterances per word. Samples
are shown in Fig. 2.
3.2. Visual Front-end
We use the same visual front-end as the one in [5], so we only briefly
overview its operation here. LRW videos contain tightly cropped
face images, thus the face detection step is skipped. First, 68 facial
landmarks are detected and tracked using [34]; these are interpolated in case of a detection failure in a frame. Subsequently, using the detected landmarks, the faces are aligned to a neutral reference frame, and a region-of-interest (ROI) of size 88×88 pixels containing the mouth area is extracted from each frame.

Fig. 2: Top row: example frames from the LRW corpus. Bottom row: corresponding ROIs after applying the visual front-end.
3.3. Training Setup
We follow the training hyper-parameters of [9]. We use two GPUs, each with a batch size of 80 videos for EfficientNetV2-T and 40 videos for the EfficientNetV2-M and EfficientNetV2-L variants. We employ the RMSProp optimizer with decay 0.9 and momentum 0.9, BN momentum 0.99, and weight decay 1e-5. The learning rate is first warmed up from 1e-6 to 0.18 over three epochs and then decayed by a factor of 0.97 every 2.4 epochs. We use an exponential moving average of the model weights with a 0.9999 decay rate, dropout (p = 0.3), and stochastic depth [35] with a 0.8 survival probability. During training, we apply data augmentation at the segment level by employing RandAugment [36].
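A sketch of how this optimization setup could be expressed in PyTorch (hyper-parameters taken from the text; the warm-up and decay are written as a per-epoch LambdaLR, while EMA, stochastic depth, and RandAugment are omitted; function and argument names are hypothetical):

```python
import torch

def build_optimizer_and_scheduler(model, warmup_epochs=3, peak_lr=0.18, start_lr=1e-6):
    # RMSProp with decay (alpha) 0.9, momentum 0.9, and weight decay 1e-5, as in the text.
    opt = torch.optim.RMSprop(model.parameters(), lr=peak_lr, alpha=0.9,
                              momentum=0.9, weight_decay=1e-5)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                       # linear warm-up from 1e-6 to 0.18
            frac = epoch / warmup_epochs
            return (start_lr + frac * (peak_lr - start_lr)) / peak_lr
        return 0.97 ** ((epoch - warmup_epochs) / 2.4)  # decay by 0.97 every 2.4 epochs

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched

# Usage: opt, sched = build_optimizer_and_scheduler(model); call sched.step() once per epoch.
```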
4. RESULTS
We investigate a number of variations of our EfficientNetV2-T VSR
architecture. First, we consider the role of the max-pool layer in the
3D front-end. Second, we investigate how the width of the TCN
module affects model accuracy, whether the presence of the Trans-
former encoder is beneficial, and which number of attention heads
leads to the best performance. Finally, we experiment with larger
EfficientNets (M and L variants).
3D Front-end: We perform controlled experiments where the
hyper-parameters of the 3D front-end are varied. In particular, we
test two different convolution kernel sizes (3×3×3 and 3×5×5) and, for each, two variants, with or without the max-pool layer. To ensure that our results transfer to other architectures too, we include results for the ShuffleNetV2 (0.5×) [5] network as well. As we can see from the results in Table 2, adding a max-pool layer to the 3D front-end has a detrimental effect on network performance: all networks without it outperform those with it. A negative side-effect of removing it, though, is the increased computational cost. Further, in our experience, networks without the max-pool layer also begin to converge faster than those with it.
Transformer and TCN: Transformers have been tried before on
the LRW database without much success [22]. However, when com-
bined with other components (such as TCN), they are effective. We
perform three experiments to understand the effect of TCN width,
the effect of Transformer presence, and finally whether a smaller
number of attention heads leads to better performance. We sum-
marize our results in Table 3. The first two rows contain results for
the TCN-S variant, which has 386 channels instead of the 463 used in
TCN. As we can see, the wider TCN shows a 0.25% absolute word
accuracy improvement. Further, in both cases (TCN-S and TCN),
the presence of the Transformer is beneficial, leading to significant WAcc improvements (0.87% and 1.0% absolute, respectively). In contrast, a smaller number of heads (1 or 4) degrades WAcc only slightly compared to the 8-head variant.
Fig. 3: Recognition performance on the LRW test set (in WAcc, %)
vs. efficiency (in per-frame FLOPs) of the proposed EfficientNet-
based VSR system (three variants), other resource-efficient models
(ShuffleNetV2 and MobiVSR) and ResNet-based VSR systems (see
also Table 4).
Table 4: EfficientNetV2 results for all three variants (T, M, L) on the LRW test set (lower part of the table). Inference time is measured on a single core of an i7-8750H CPU using PyTorch. Other literature results on the same benchmark are summarized in the upper part.

Model | WAcc (%) | Params (×10⁶) | FLOPs (×10⁹) | Inference time (ms/frame)
VGG [6] | 61.10
ResNet-34 + BiLSTM [12] | 83.00
ResNet-18 + 2×BiLSTM + word-boundary [13] | 88.08
Multi-Grained ResNet-34 + Conv + BiConvLSTM [14] | 83.34
Two-Stream ResNet-18 + BiLSTM [15] | 84.07
ResNet-18 + Bi-GRU + Policy Gradient [16] | 83.60
ResNet-18 + STFM [17] | 83.70
2×ResNet-18 + Bi-GRU [18] | 84.13 | 7.95 | 14.5
ResNet-18 + 3×Bi-GRU + MI [19] | 84.41
ResNet-18 + BiGRU + Face-cutout [20] | 85.02
ResNet-18 + Multi-Scale TCN [21] | 85.30
MobiVSR-1 [2] | 72.20 | 4.5 | 10.75
ResNet-18 + MS-TCN + MS-KD [5] | 87.90 | 36.4 | 10.31
ResNet-18 + MS-TCN + MS-KD (ensemble) [5] | 88.50 | 36.4 | 10.31
ShuffleNetV2 + MS-TCN + KD [5] | 85.50 | 28.8 | 2.23
SE-ResNet-18 + BiGRU [22] | 85.00
SE-ResNet-18 + BiGRU + word-boundary [22] | 88.40
ResNet-18 + HPConv [23] | 86.83
ResNet-18 + HPConv + word-boundary [23] | 88.60
2×ResNet-50 + SlowFast [37] | 84.40
ResNet-34 + GCN + BiGRU [38] | 84.25
ResNet-18 + Transformer [24] | 87.32
ResNet-18 + Dense-TCN [7] | 88.36
ResNet-18 + Conformer + Bimodal-KD [39] | 88.10
ALSOS-ResNet-18 + MS-TCN [40] | 87.01
3D-ResNet18 + BiGRU [41] | 86.23
Spatio-Temporal Attention + KD + word-boundary [25] | 88.64
EfficientNetV2-T + TCN + Transformer (8 heads) | 88.52 | 8.96 | 1.62 | 3.01
EfficientNetV2-M + TCN + Transformer (8 heads) | 89.01 | 13.68 | 5.10 | 7.07
EfficientNetV2-L + TCN + Transformer (8 heads) | 89.52 | 15.47 | 5.87 | 7.89
EfficientNet: We summarize literature results in the upper section
of Table 4, while in the lower section we present results for our three
EfficientNetV2-based variants (T/M/L). Along with WAcc (%), we
also include model size (number of parameters) and computational
requirements (FLOPs and inference time per frame), where avail-
able. We consider as the state-of-the-art on LRW the 88.36% WAcc
reported in [7]. Note that some systems in the literature exceed this
performance; however, these either exploit word-boundary information (which must be provided at test time), are ensembles of multiple networks, or are trained on additional data resources.
Our EfficientNetV2-T outperforms all tabulated literature models that do not exploit word-boundary information, while being four times leaner than the ResNet-18 of [5]. Compared to other resource-efficient VSR models
like ShuffleNetV2 [5], it yields a 3% absolute WAcc improvement,
while also being slightly faster. Scaling the width or depth of the
network yields further improvements. Indeed, the EfficientNetV2-M
variant achieves 0.49% absolute improvement over the T-variant, and
EfficientNetV2-L achieves a new state-of-the-art result of 89.52%,
a 1.16% absolute WAcc improvement over [7]. A summary of such
VSR system comparisons can also be viewed in Fig. 3.
5. CONCLUSIONS
In this paper, we focused on resource-efficient end-to-end deep-learning based lipreading, making three contributions to efficient VSR architectures. First, we presented EfficientNetV2-T, a lean
model that outperforms other resource-efficient VSR systems, and
subsequently EfficientNetV2-L, an up-scaled version that improves
the state-of-the-art WAcc result on LRW by 1.16% absolute. Finally,
we identified and resolved a deficiency in the leading 3D front-end
of the literature, and we showed that combining a Transformer
encoder with a TCN is beneficial.
6. REFERENCES
[1] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Lip
reading sentences in the wild,” in Proc. CVPR, 2017.
[2] N. Shrivastava, A. Saxena, Y. Kumar, R. R. Shah, D. Mahata, and A. Stent, “MobiVSR: A visual speech recognition solution for mobile devices,” CoRR, arXiv:1905.03968v3, 2019.
[3] A. Koumparoulis and G. Potamianos, “MobiLipNet: Resource-efficient deep learning based lipreading,” in Proc. Interspeech, 2019.
[4] A. Koumparoulis, G. Potamianos, S. Thomas, and E. S. Morais, “Resource-adaptive deep learning for visual speech recognition,” in Proc. Interspeech, 2020.
[5] P. Ma, B. Martinez, S. Petridis, and M. Pantic, “Towards practical lipreading with distilled and efficient models,” in Proc. ICASSP, 2021.
[6] J. S. Chung and A. Zisserman, “Lip reading in the wild,” in
Proc. ACCV, 2016.
[7] P. Ma, Y. Wang, J. Shen, S. Petridis, and M. Pantic, “Lip-
reading with densely connected temporal convolutional net-
works,” in Proc. WACV, 2021.
[8] M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proc. ICML, 2019.
[9] M. Tan and Q. V. Le, “EfficientNetV2: Smaller models and
faster training,” in Proc. ICML, 2021.
[10] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan,
W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and
H. Adam, “Searching for MobileNetV3,” in Proc. ICCV, 2019.
[11] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, “LipNet: End-to-end sentence-level lipreading,” CoRR, arXiv:1611.01599v2, 2016.
[12] T. Stafylakis and G. Tzimiropoulos, “Combining residual net-
works with LSTMs for lipreading,” in Proc. Interspeech, 2017.
[13] T. Stafylakis, M. H. Khan, and G. Tzimiropoulos, “Pushing
the boundaries of audiovisual word recognition using residual
networks and LSTMs,” Comp. Vision Image Unders., 176-177:
22–32, 2018.
[14] C. Wang, “Multi-grained spatio-temporal modeling for lip-
reading,” in Proc. BMVC, 2019.
[15] X. Weng and K. Kitani, “Learning spatio-temporal features
with two-stream deep 3D CNNs for lipreading,” in Proc.
BMVC, 2019.
[16] M. Luo, S. Yang, S. Shan, and X. Chen, “Pseudo-convolutional policy gradient for sequence-to-sequence lip-reading,” in Proc. FG, 2020.
[17] X. Zhang, F. Cheng, and S. Wang, “Spatio-temporal fusion based convolutional sequence learning for lip reading,” in Proc. ICCV, 2019.
[18] J. Xiao, S. Yang, Y. Zhang, S. Shan, and X. Chen, “Deformation flow based two-stream network for lip reading,” in Proc. FG, 2020.
[19] X. Zhao, S. Yang, S. Shan, and X. Chen, “Mutual information maximization for effective lip reading,” in Proc. FG, 2020.
[20] Y. Zhang, S. Yang, J. Xiao, S. Shan, and X. Chen, “Can we
read speech beyond the lips? Rethinking RoI selection for deep
visual speech recognition,” in Proc. FG, 2020.
[21] B. Martinez, P. Ma, S. Petridis, and M. Pantic, “Lipreading using temporal convolutional networks,” in Proc. ICASSP, 2020.
[22] D. Feng, S. Yang, S. Shan, and X. Chen, “Learn an effective
lip reading model without pains,” CoRR, arXiv:2011.07557,
2020.
[23] H. Chen, J. Du, Y. Hu, L.-R. Dai, B.-C. Yin, and C.-H.
Lee, “Automatic lip-reading with hierarchical pyramidal con-
volution and self-attention for image sequences with no word
boundaries,” in Proc. Interspeech, 2021.
[24] M. Luo, S. Yang, X. Chen, Z. Liu, and S. Shan, “Synchronous
bidirectional learning for multilingual lip reading,” in Proc.
BMVC, 2020.
[25] S. Elashmawy, M. Ramsis, H. M. Eraqi, F. Eldeshnawy, H. Mabrouk, O. Abugabal, and N. Sakr, “Spatio-temporal attention mechanism and knowledge distillation for lip reading,” CoRR, arXiv:2108.03543, 2021.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all
you need,” in Proc. NeurIPS, 2017.
[27] S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic, “End-to-end audiovisual speech recognition,” in Proc. ICASSP, 2018.
[28] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Effi-
cient convolutional neural networks for mobile vision applica-
tions,” CoRR, arXiv:1704.04861v1, 2017.
[29] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C.
Chen, “MobileNetV2: Inverted residuals and linear bottle-
necks,” in Proc. CVPR, 2018.
[30] X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An ex-
tremely efficient convolutional neural network for mobile de-
vices,” in Proc. CVPR, 2018.
[31] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation net-
works,” in Proc. CVPR, 2018.
[32] S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-weighted linear
units for neural network function approximation in reinforce-
ment learning,” Neural Networks, 107: 3–11, 2018.
[33] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proc. ICML, 2017.
[34] S. Ren, X. Cao, Y. Wei, and J. Sun, “Face alignment at 3000 fps via regressing local binary features,” in Proc. CVPR, 2014.
[35] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger,
“Deep networks with stochastic depth,” in Proc. ECCV, 2016.
[36] E. D. Cubuk, B. Zoph, J. Shlens, and Q. Le, “RandAugment:
Practical automated data augmentation with a reduced search
space,” in Proc. NeurIPS, 2020.
[37] P. Wiriyathammabhum, “SpotFast networks with memory aug-
mented lateral transformers for lipreading,” in Proc. ICONIP,
2020.
[38] H. Liu, Z. Chen, and B. Yang, “Lip graph assisted audio-visual
speech recognition using bidirectional synchronous fusion,” in
Proc. Interspeech, 2020.
[39] P. Ma, R. Mira, S. Petridis, B. W. Schuller, and M. Pan-
tic, “LiRA: Learning visual speech representations from audio
through self-supervision,” in Proc. Interspeech, 2021.
[40] D. Tsourounis, D. Kastaniotis, and S. Fotopoulos, “Lip read-
ing by alternating between spatiotemporal and spatial convolu-
tions,” J. Imaging, 7(5):91, 2021.
[41] M. Hao, M. Mamut, N. Yadikar, A. Aysa, and K. Ubul, “How
to use time information effectively? Combining with time shift
module for lipreading,” in Proc. ICASSP, 2021.