Driver Distraction Detection Using Octave-Like
Convolutional Neural Network
Penghua Li, Yifeng Yang, Radu Grosu, Member, IEEE, Guodong Wang, Rui Li, Yuehong Wu, and Zeng Huang
Abstract—This study proposes a lightweight convolutional neural network with an octave-like convolution mixed block, called OLCMNet, for detecting driver distraction under a limited computational budget. The OLCM block uses point-wise convolution (PC) to expand feature maps into two sets of branches. In the low-frequency branches, we perform average pooling, depth-wise convolution (DC), and upsampling to obtain a low-resolution low-frequency feature map, reducing spatial redundancy and connection density. In the high-frequency branches, the expanded feature map with the original resolution is fed to the DC operator, gaining an apposite receptive field to capture fine details. The feature concatenation of the low-frequency and high-frequency branches is encoded sequentially by a squeeze-and-excitation (SE) module and a PC operator, realizing global feature information fusion. Introducing another SE module at the last stage, the OLCMNet facilitates further sensitive information exchange between layers. In addition, with an augmented reality head-up display (ARHUD) platform, we create a Lilong Distracted Driving Behavior (LDDB) Dataset through a series of on-road experiments. This dataset contains 14808 videos collected from an infrared camera, covering six driving behaviors of 2468 participants. We manually annotate these videos at five frames per second, obtaining a total of 267378 images. Compared with the existing methods, the embedded hardware platform experiments indicate that OLCMNet hits acceptable trade-offs, namely, 89.53% accuracy on the StateFarm Dataset and 95.98% accuracy on the LDDB Dataset when the latency is 32.8±4.6 ms.
Index Terms—Naturalistic driving, octave-like convolution, lightweight neural network.
I. INTRODUCTION
DRIVER distraction is a significant problem that affects driving safety: 80% of crashes and 65% of near-crashes involve driver distraction [1]. According to research by the National Highway Traffic Safety Administration (NHTSA), distraction falls into four categories,
Manuscript received December 3, 2020; revised April 20, 2021; accepted
May 28, 2021. This work was supported in part by the Ministry of Education
China Mobile Research Fund under Grant MCM20180404, in part by the
National State Key Laboratory Foundation under Grant 6142006200405, and
in part by the Project of Innovation Research Group of Universities in
Chongqing under Grant 202006. The Associate Editor for this article was
S. A. Birrell. (Corresponding author: Penghua Li.)
Penghua Li, Yifeng Yang, and Rui Li are with the College of Automa-
tion, Chongqing University of Posts and Telecommunications, Chongqing
400065, China (e-mail: liph@cqupt.edu.cn; s180301020@stu.cqupt.edu.cn;
lirui@cqupt.edu.cn).
Radu Grosu and Guodong Wang are with the Institute of Computer
Engineering, Vienna University of Technology, 1040 Vienna, Austria (e-mail:
radu.grosu@tuwien.ac.at; guodong.wang@tuwien.ac.at).
Yuehong Wu is with the Chongqing Lilong Automobile Research Institute,
Chongqing 401123, China (e-mail: wuyuehong@li-long.com.cn).
Zeng Huang is with Chongqing Chang’an Automobile Company Ltd.,
Chongqing 400023, China (e-mail: huangzeng@changan.com.cn).
Digital Object Identifier 10.1109/TITS.2021.3086411
i.e., visual distraction, auditory distraction, biomechanical dis-
traction, and cognitive distraction [2].
Over the past two decades, based on but not limited to the
above distraction categories, many naturalistic driving studies
(NDSs) [3]–[6] and simulated driving studies (SDSs) [7]–[9]
have further established the correlation between driver dis-
traction and degraded driving performance [3]. The SDSs
use simulated vehicle data to establish a human-like driver
model [7], leveraging electrocardiogram (ECG) [8] or elec-
troencephalogram (EEG) [9] to understand the driver behav-
iors. Although there is a correlation between simulated and
naturalistic driving behaviors, the difference between these two
types of driving behaviors is not negligible. Furthermore, indi-
rect physiological measurements inevitably introduce detection
errors [5].
In contrast, the NDSs, using the continuous recording of
driving information under real-world driving conditions, pro-
vide opportunities to assess driving risks [6]. Inspired by recent
developments of the convolutional neural network (CNN),
most NDSs attempt to employ video data to capture distracted
driver information. The work in [10] creates a dataset com-
posed of face-view videos in the Strategic Highway Research
Program (SHRP2) and utilizes the supervised descent-based
facial landmark tracking algorithm to detect driver cell-
phone usage with 93.9% accuracy. The later work [11]
applies a multiple scale faster-RCNN to SHRP2 videos for
cellphone usage detection and the intelligent vehicles and
applications (VIVA) challenge database for steering wheel
detection. The experiments present 94.6% and 93% accuracy
for VIVA and SHRP2 datasets, respectively. Borghi et al. [12]
present a regressive neural network called POSEidon for head
localization and pose estimation. The experiments carried on
Biwi Kinect Head Pose, ICT-3DHP, and Pandora datasets
demonstrate POSEidon is fast enough to process 30 frames
per second in a real-time fashion. Recent research reports
that a modified VGG-16 [13] classifies ten distraction behav-
iors such as talking to a passenger, drinking, etc., achieving
95.54% accuracy with parameters reduced from 140M in
the original VGG-16 to 15M only. A similar study using
VGG-19, which has a larger size than VGG-16, is reported in [14],
which shows 99% average accuracy for detection tasks in [13].
Xing et al. [15] leverage a deep feed-forward NN to detect
seven driving behaviors such as normal driving, cellphone
answering, etc., hitting an average of more than 80% accuracy.
Then they improve their work using CNN [16]. The AlexNet,
GoogLeNet, and ResNet50 are pre-trained for such seven
driving behaviors, achieving 81.6%, 78.6%, and 74.9% accu-
racy. Based on these pre-trained models, the binary detection
achieves 91.4% accuracy.
Although good results have been reported for the approaches
mentioned above, their application for driver distraction detec-
tion needs further validation with the following concerns.
Sample diversity is of paramount importance for the
generalization of neural networks [17]. For assessing
the performance of proposed methods, most distraction
studies only use samples that include few drivers, e.g.,
SHRP2 Database (41 drivers) [10], [11], VIVA Hand
Database (50 drivers), Pandora Database (22 drivers)
[11], Biwi Kinect Head Pose Database (20 drivers) [12],
StateFarm’s Dataset (81 drivers) [18] and self-collected
dataset (5 drivers) [15], [16]. The scarce sample diversity
makes published results less practical in a real-world
application.
Many approaches apply backbone networks with ample
size to detect distraction, e.g., original VGG-16 (140M),
improved VGG-16 (15M) [13], VGG-19 (143.68M) [14],
AlexNet (62.38M) and ResNet50 (19.35M) [16]. How-
ever, these networks need to transmit their data back to a computer, or even a server, to evaluate driving distraction, making such methods challenging to apply on an in-vehicle device with limited computational capability.
Recent efforts have been spent on improving the effi-
ciency of backbone CNNs, e.g., reducing inherent redun-
dancy of dense model parameters [19] or the channel
dimension of feature maps [20]. However, few related
studies directly use these lightweight networks to detect
driver distraction up to now. Besides, the designed light-
weight CNNs, e.g., MobileNet [20] and ShuffleNet [21],
share the commonplaces of using a single convolution
kernel for each layer, which leads to bottlenecks in fea-
ture expression, with the consequence of lacking higher
accuracy for in-vehicle applications.
This study aims to tackle these concerns above. Our
contributions are summarized as follows. (1) A distraction
dataset composed of 14808 videos focusing on 2468 par-
ticipants’ six driving behaviors is built to guarantee sam-
ple diversity. (2) A lightweight CNN with an octave-like
convolution mixed (OLCM) block, called OLCMNet,1 is
specially designed for detecting driver distraction under a
limited computational budget. (3) We apply the SOTA light-
weight networks, e.g., MobileNet [20] and ShuffleNet [21],
to detect driver distraction. We also conduct a systematic
comparison among those lightweight networks, the backbone
networks (ResNet [22], DenseNet [23], etc.) and the proposed
OLCMNet. Compared with the existing methods reform-
ing vanilla convolution, our method pays more attention to
improving the network’s topology. Due to its unique topology
modification, the proposed OLCMNet can facilitate sensitive
information exchange more flexibly and learn the image’s
multi-scale representation more lightly.
1For the OLCMNet’s source codes and the labeled StateFarm Dataset, please
refer to https://github.com/Lipenghua-CQ/OLCMNet.
Fig. 1. The ARHUD development platform, where (a) presents the cameras
and HUD projection devices in driver’s cabin, (b) presents the embedded
hardware processing image data and information fusion.
II. DATA COLLECTION
The distraction videos are collected in a series of on-road
experiments conducted on a platform established for develop-
ing an augmented reality head-up display (ARHUD) system.
Based on these video data, we create a distraction dataset
called Lilong Distracted Driving Behaviours (LDDB) Dataset
(Table I).
A. Data Collection Platform
Our first-generation ARHUD development platform (Fig. 1)
aims to verify individual functions in the designed plan, not
prototype production. The ARHUD platform includes four
cameras and a HUD device mounted in the driving cabin
(Fig. 1 (a)) and the embedded hardware mounted in the trunk
(Fig. 1 (b)).
Cameras C-I and C-II, C-III, and C-IV collect image data of the driver's eyes, the distracting behavior, and the road environment in front of the vehicle, respectively. The TX2-I
is used to run a pupil tracking algorithm and OLCMNet,
while the TX2-II performs an obstacle recognition algorithm that fuses vision and radar data. The IMAX8 receives the results
from both TX2s, then calls virtual-real registration and other
information fusion algorithms to drive the HUD projection.
Note that camera C-III is an infrared camera with a resolution of 1280 × 720 pixels, a frame rate of 30 Hz, and a field of view (FOV) of 100 degrees. It has two 850 nm infrared LEDs for light compensation and supports automatic switching of IR-CUT filters.
B. Data Acquisition Details
The LDDB Dataset is collected on the non-public roads in
Yuzui Industrial Park, Yubei District, Chongqing City, China.
The non-public roads include some predefined routes located
in different areas, e.g., urban testing roads, rural testing roads,
and factory testing roads, approximately 10 kilometers in total
length. We conduct the data collection from 9 to 11:30 am
and from 2:00 to 5:00 pm every day, over six months from
May to November 2019 except for holidays and days when
exceptional circumstances make collection impossible. The
environmental conditions (time of day, weather, mainly sunny
and rainy) also vary during the data collection. The people
participating in the collection comprise 2468 volunteers, whose ages range from 20 to 60 years (32.20±10.44), with a male/female ratio of 6:4.
TABLE I
THE DETAILS OF THE LDDB DATASET

Fig. 2. The design of OLCMNet, where (a)–(e) are the architectures of the OLCMNet, OLCM block, SE module, DC kernel, and PC kernel, respectively.

The LDDB Dataset covers six driving behaviors, i.e., safe driving, calling, smoking, drinking, reaching behind, and texting. The videos of all behaviors except two (reaching behind and texting, collected when the vehicle is static) are acquired when the participants drive slowly. The automobile's speed
depends on the driver’s driving proficiency but is not allowed
to exceed 40km/h. To ensure safety during the collection
process, an experimenter sitting in the co-driving position can
step on the secondary brake when a dangerous situation occurs.
For data authenticity, we ask the participants to perform six
driving behaviors according to their habits, causing different
duration shown in Table I. In each vehicle trial, a participant
completes six behaviors under the guidance of an experimenter
sitting in the co-driver position, driving a certain distance
over a total period of about 15 to 20 minutes. Note that we do not record the exact total duration, since it is an empirical figure and much of it is unimportant, e.g., the preparation time spent briefing a participant and the switch time while waiting for the next participant. Another
experimenter supervises the data collection, sitting on the back
seat’s right side in the vehicle. A handheld trigger connected
to TX2-I allows this experimenter to annotate video streams
from camera C-III when the video recording finishes for one
of six behaviors.
The video recording codes are written in Python-based
OpenCV and loaded into TX2-I. We follow [15], [18], [24] to
save the storage costs, compressing the data’s resolution from
1280 ×720 ×3 to 640 ×360 ×3 at 20 frames per second. This
video data is manually labeled frame by frame with ground truth for the six driving behaviors. To reduce the similarity of images captured from these videos, the actual image capture rate is reduced from 20 frames per second to 5 frames per second. More than 44000 images per class, selected from the 14808 videos, constitute the full dataset.
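As a rough illustration of this capture pipeline, the following sketch (written with Python/OpenCV, which the recording scripts also use; the file paths and the choice of keeping every fourth frame of a 20 fps clip are our assumptions) shows how 5 frames per second could be sampled from the compressed recordings.

```python
import cv2

def extract_frames(video_path, out_dir, src_fps=20, target_fps=5):
    """Keep every (src_fps // target_fps)-th frame of a recorded clip."""
    step = src_fps // target_fps              # 20 fps -> 5 fps gives step = 4
    cap = cv2.VideoCapture(video_path)
    kept = idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:                            # end of the video stream
            break
        if idx % step == 0:
            frame = cv2.resize(frame, (640, 360))   # storage resolution used above
            cv2.imwrite(f"{out_dir}/frame_{kept:06d}.jpg", frame)
            kept += 1
        idx += 1
    cap.release()
    return kept

# Hypothetical usage: n = extract_frames("c1_calling_0001.avi", "./frames/calling")
```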
III. METHODOLOGIES
The designed OLCMNet (Fig. 2(a)) consists of the
head, feature extraction, and last stages. In the OLCMNet,
we develop an OLCM building block (Fig. 2(b)) to reduce
spatial redundancy and connection density. This block focuses
on topology modification rather than vanilla convolution oper-
ator improvement like octave convolution [19]. Compared
with the existing lightweight efforts, the proposed OLCMNet
demonstrates its novelty by the following three aspects.
The OLCM block uses point-wise convolution (PC)
(Fig. 2(d)) to expand feature maps into two sets
of branches. In the low-frequency branches, we use
the average pooling (AP) to obtain a low-resolution
low-frequency feature map to reduce spatial redundancy
and improve subsequent operations’ computational effi-
ciency. Then we perform depth-wise convolution (DC)
operation (Fig. 2(c)) and upsampling for the follow-
ing feature extraction and information fusion. On the
other hand, we keep the original feature map in the
high-frequency branch as the DC operation’s inputs.
The OLCM block introduces the squeeze-and-excitation
(SE) module [25] (Fig. 2(e)) and PC operation for global
information fusion instead of performing information
exchange in the high and low-frequency groups like
octave convolution [19]. Specifically, we adopt global
average pooling (GAP) to obtain global information
from the embedded concatenation for each branch, then
create a bottleneck with two sets of PCs to emphasize
informative features and suppress less useful ones. The
PC operation after the SE module serves to squeeze and fuse information (a minimal code sketch of such an SE module is given after this overview).
The OLCMNet adds a SE module into its last stage,
which is different from MobileNetV3 [20], which only synthesizes global spatial features during its last stage.
Such modification facilitates sensitive information
exchange between layers further and provides a higher
classification accuracy.
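To make the role of the SE module concrete, a minimal sketch in the tf.keras functional style (consistent with the TensorFlow/Keras toolchain used later for training) is given below; the reduction ratio of 4 is our illustrative assumption, not a value reported in this paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_module(x, reduction=4):
    """Squeeze-and-excitation [25]: GAP -> bottleneck -> channel re-weighting."""
    channels = x.shape[-1]
    # Squeeze: global average pooling collapses the spatial dimensions.
    s = layers.GlobalAveragePooling2D()(x)
    # Excitation: a two-layer bottleneck produces per-channel gates in (0, 1).
    s = layers.Dense(channels // reduction, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)
    # Reshape to (1, 1, C) and rescale the input feature map channel-wise.
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])
```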
TABLE II
RELATED MATHEMATICAL SYMBOLS AND THEIR REPRESENTATIONS
The design of OLCMNet is as follows. The related symbols
and their representation are given in Table II.
A. Head Stage
Let $U$ be the input image. After downsampling the spatial resolution and expanding the channels, the feature map at the head stage, $F_h$, can be obtained by:

$$F_h = \sigma\left(K_h \otimes U\right) \tag{1}$$

where $\sigma$, $K_h$, and $\otimes$ represent the h-swish activation function, the vanilla convolution kernels, and the convolution operation, respectively. Note that the h-swish function is given as $x \cdot \mathrm{ReLU6}(x+3)/6$ [20].
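For reference, a one-line sketch of this activation (assuming the standard MobileNetV3 form x · ReLU6(x + 3)/6) is:

```python
import tensorflow as tf

def h_swish(x):
    """h-swish(x) = x * ReLU6(x + 3) / 6, the activation used in Eq. (1)."""
    return x * tf.nn.relu6(x + 3.0) / 6.0
```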
B. Feature Extraction Stage
Let $O^n_{in}$ and $O^n_{out}$ be the input and output feature maps of the $n$th OLCM block at the feature extraction stage, respectively. Obviously, $O^1_{in} = F_h$ and $O^{n+1}_{in} = O^n_{out}$. In a specific OLCM operation, $O^n_{in}$ is split into $M$ branches by $M$ PC operators, which yields an expanded input feature map $O^n_m$ with $m = 1, 2$. The calculation of $O^n_m$ is described as:

$$O^n_m = \sigma\left(\dot{K}^n_m \otimes O^n_{in}\right) \tag{2}$$

where $\dot{K}^n_m$ represents the kernels of the PC in the $m$th branch of the $n$th OLCM block.

$O^n_m$ can then be learned in a low-frequency and a high-frequency manner in the following operations. For the low-frequency learning, an average pooling operation is used to down-sample $O^n_m$, giving a low-frequency input feature map $O^n_{low}$:

$$O^n_{low} = \mathrm{pool}\left(O^n_m, Z_A, S_A\right) \tag{3}$$

Here, $Z_A$ and $S_A$ denote the kernel size and the stride size. Note that both $Z_A$ and $S_A$ are set to 2 in this study. Then, a DC operator is applied to $O^n_{low}$ to extract the $p$th low-frequency output feature map $O^n_{low_p}$:

$$O^n_{low_p} = \sigma\left(\hat{K}^n_{low_p} \otimes O^n_{low}\right) \tag{4}$$

where $\hat{K}^n_{low_p}$ denotes the kernels in the $p$th low-frequency path of the $n$th OLCM block. Note that $p = \{1, 2, \ldots, p_{max}\}$ and $p_{max}$ is set to 2 in this study. To enable subsequent information fusion of feature maps with different spatial resolutions, $O^n_{low_p}$ is up-sampled, which yields a feature map with the original resolution, $\bar{O}^n_{low_p}$:

$$\bar{O}^n_{low_p} = \mathrm{upsample}\left(O^n_{low_p}, \lambda\right) \tag{5}$$

where $\lambda$ is the up-sampling factor of the nearest-neighbor interpolation and is set to 2 in this study.

For the high-frequency learning, $O^n_m$ is taken directly as the input feature tensor. Keeping the spatial resolution of this tensor unchanged, a high-frequency output feature map of branch $m$ in the $n$th block, $O^n_{high_q}$, can be obtained by a DC operator:

$$O^n_{high_q} = \sigma\left(\hat{K}^n_{high_q} \otimes O^n_m\right) \tag{6}$$

Here, $\hat{K}^n_{high_q}$ denotes the kernels of the DC in the $q$th high-frequency path of the $n$th OLCM block. Note that $q = \{1, 2, \ldots, q_{max}\}$ and $q_{max}$ is set to 1 in this study.

After learning the different frequency information, all branches are concatenated to form a fusion feature map $O^n_{concat}$:

$$O^n_{concat} = \mathrm{Concat}\left(\bar{O}^n_{low_p}, O^n_{high_q}\right) \tag{7}$$

Then a SE module is adopted to learn the more important feature channels, which helps to selectively emphasize informative features and suppress less useful ones. The SE operation is applied to $O^n_{concat}$ to obtain the filtered feature map $\tilde{O}^n_{concat}$:

$$\tilde{O}^n_{concat} = F_{SE}\left(O^n_{concat}\right) \tag{8}$$

where $F_{SE}(\cdot)$ is the SE operation, whose details are given in [25]. Ultimately, a PC with a linear activation function is adopted to fuse the multi-scale information between channels and to compress the channels, so that the final output of the $n$th OLCM block is obtained by:

$$O^n_{out} = \dot{K}^n_{out} \otimes \tilde{O}^n_{concat} \tag{9}$$

where $\dot{K}^n_{out}$ denotes the kernels of the PC at the end of the $n$th OLCM block.
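For concreteness, a minimal tf.keras sketch of one OLCM block following Eqs. (2)–(9), under one plausible reading of the branch layout (two low-frequency paths with 3 × 3 and 5 × 5 DCs, one 3 × 3 high-frequency path, and a pooling/up-sampling factor of 2), is given below. It reuses the `h_swish` and `se_module` sketches above; the channel arguments are illustrative placeholders rather than the values of Table III, and even spatial dimensions are assumed so that pooling and up-sampling match.

```python
from tensorflow.keras import layers

def olcm_block(x, expand_ch, out_ch):
    """One OLCM block: PC expansion -> low/high-frequency DC branches -> SE -> PC fusion."""
    # Eq. (2): point-wise convolutions expand the input into two branch groups.
    b_low = layers.Conv2D(expand_ch, 1, activation=h_swish)(x)
    b_high = layers.Conv2D(expand_ch, 1, activation=h_swish)(x)

    # Eqs. (3)-(5): low-frequency paths = average pooling -> DC (3x3, 5x5) -> up-sampling.
    low = layers.AveragePooling2D(pool_size=2, strides=2)(b_low)
    low_1 = layers.DepthwiseConv2D(3, padding="same", activation=h_swish)(low)
    low_2 = layers.DepthwiseConv2D(5, padding="same", activation=h_swish)(low)
    low_1 = layers.UpSampling2D(size=2)(low_1)
    low_2 = layers.UpSampling2D(size=2)(low_2)

    # Eq. (6): the high-frequency path keeps the original spatial resolution.
    high = layers.DepthwiseConv2D(3, padding="same", activation=h_swish)(b_high)

    # Eq. (7): concatenate the low- and high-frequency branches.
    cat = layers.Concatenate()([low_1, low_2, high])
    # Eq. (8): the SE module emphasizes informative channels.
    cat = se_module(cat)
    # Eq. (9): a linear point-wise convolution fuses and compresses the channels.
    return layers.Conv2D(out_ch, 1, activation=None)(cat)
```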
C. Last Stage
As mentioned above, the amount of computation at the feature extraction stage is significantly reduced by the concatenation of $N$ OLCM blocks, where the channels of the feature map are compressed by a PC operator at the end of each block. However, such an architecture complicates the subsequent classification, since the feature map of the last OLCM block, $O^N_{out} = O^n_{out}\big|_{n=N}$, suffers from a channel bottleneck when it is taken as the input feature map of the last stage of OLCMNet. Therefore, a PC operator is used to enrich the channel semantic information of $O^N_{out}$, which yields an expanded feature map $F^l_e$:

$$F^l_e = \sigma\left(\dot{K}^l_e \otimes O^N_{out}\right) \tag{10}$$

Here, $\dot{K}^l_e$ denotes the kernels of the PC at the beginning of the last stage. Then a SE module is used to further filter sensitive information, i.e., $\tilde{F}^l_e = F_{SE}(F^l_e)$, where $\tilde{F}^l_e$ denotes the filtered feature map. To generate channel-wise statistics, GAP is performed on $\tilde{F}^l_e$, which yields a global spatial information descriptor, i.e., $F^l_{gap} = \mathrm{GAP}(\tilde{F}^l_e)$. Finally, instead of using a direct fully connected layer on $F^l_{gap}$ for classification as in backbone networks, two PCs are used to obtain a predicted feature map, $F^l_{out} \in \mathbb{R}^{1 \times 1 \times I}$, which serves as the input of the final softmax projection over $I$ classes:

$$F^l_{out} = \dot{K}^l_2 \otimes \sigma\left(\dot{K}^l_1 \otimes F^l_{gap}\right) \tag{11}$$

$$s_i = \mathrm{softmax}\left(F^l_{out}(c_i)\right) = \frac{e^{F^l_{out}(c_i)}}{\sum_{j=1}^{I} e^{F^l_{out}(c_j)}} \tag{12}$$

where $\dot{K}^l_1$, $\dot{K}^l_2$, $s_i$, and $F^l_{out}(c_i)$ represent the kernels of the two PCs, the predicted score of the $i$th class, and the $i$th channel of $F^l_{out}$ at the last stage, respectively.
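Under the same illustrative assumptions, the last stage of Eqs. (10)–(12) can be sketched as follows; `expand_ch` and `num_classes` are placeholders.

```python
from tensorflow.keras import layers

def last_stage(x, expand_ch, num_classes):
    """Last stage: PC expansion -> SE -> GAP -> two PCs -> softmax (Eqs. (10)-(12))."""
    # Eq. (10): a point-wise convolution enriches the channel semantics.
    f = layers.Conv2D(expand_ch, 1, activation=h_swish)(x)
    # The additional SE module filters sensitive information before pooling.
    f = se_module(f)
    # GAP yields the global spatial descriptor F^l_gap, reshaped to 1x1 for the PCs.
    f = layers.GlobalAveragePooling2D()(f)
    f = layers.Reshape((1, 1, expand_ch))(f)
    # Eq. (11): two point-wise convolutions replace a direct fully connected layer.
    f = layers.Conv2D(expand_ch, 1, activation=h_swish)(f)
    f = layers.Conv2D(num_classes, 1, activation=None)(f)
    f = layers.Flatten()(f)
    # Eq. (12): softmax projection over the I classes.
    return layers.Softmax()(f)
```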
IV. EXPERIMENT AND RESULT ANALYSIS
For demonstrating the effectiveness of OLCMNet, we con-
duct the experiments on StateFarm Dataset [18] and LDDB
Dataset, comparing the results with those on ResNet-50
[22], DenseNet-40 [23], InceptionNet-V4 [26], MSCNN [27],
ShuffleNet-V2 [21], and MobileNet-V3 [20]. We also report
various ablation studies on OLCMNet to shed light on the
effects of various design decisions.
A. Data Description
The StateFarm Dataset covers ten driving behaviors,
i.e., safe driving (C0), texting-right (C1), talking on the
phone-right (C2), texting-left (C3), talking on the phone-left
(C4), operating the radio (C5), drinking (C6), reaching behind
(C7), hair and makeup (C8), and talking to the passenger (C9).
This dataset provides 22424 annotated images for training.
However, its 79726 images used for testing are unlabeled.
To evaluate these networks’ performance, we labeled ground
truth of approximately 1000 images per class from the
79726 images. For the LDDB Dataset, 70%, 10%, and 20% of the data are used as the training, validation, and testing sets, respectively.
B. Model Training
We train all the networks in a server where hardware
and software configurations are GPU/2×Nvidia Tesla
P100@16GB, Memory/8×ECC Registered DDR4@32GB,
CPU/2×Intel Xeon Gold6126@2.60GHz, Operating
System/Ubuntu18.04, Programming Language/Python3.7, and
Deep Learning Framework/ TensorFlow1.13.1+Keras2.3.1.
The cross-entropy loss function of these networks is described as:

$$E(L, S) = -\sum_{i=1}^{N_c} l_i \log(s_i) \tag{13}$$

where $l_i$ and $s_i$ denote the likelihood and the predicted score of the $i$th class, respectively. According to the structures presented in [20]–[23], [26], [27], we conduct a number of trials on the StateFarm and LDDB Datasets and give the predefined structures of OLCMNet for the two datasets (Table III). To avoid the influence of different image resolutions on the performance of all the aforementioned networks, we set the input size of OLCMNet (and of the other compared networks) to 128 × 256 × 3.

TABLE III
THE PREDEFINED STRUCTURES OF OLCMNET

The conv2d denotes a vanilla convolution with a 3 × 3 kernel in the head stage, while the OLCM denotes
the OLCM blocks in the feature extraction stage where the
kernel size of each OLCM block is 3 ×3. The expanded size
and output size represent each branch’s and output feature
maps’ channel numbers in the OLCM block, respectively.
We train both OLCMNets using synchronized stochastic gradient descent with 0.9 momentum based on the predefined structures. Batch normalization [28] is used in each convolutional layer to improve the stability of the OLCMNets. Data augmentation with a 0.5 probability setting (small-angle rotation, random cropping, horizontal flipping, and random erasing [29]) is applied to further increase sample diversity.
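As an illustration of this augmentation policy (each transform applied with probability 0.5; the rotation range, crop margins, and erasing sizes below are our assumptions rather than reported values), a simple per-image pipeline could look like:

```python
import random
import numpy as np
import cv2

def augment(img, p=0.5):
    """Small-angle rotation, random crop, horizontal flip, and random erasing [29]."""
    h, w = img.shape[:2]
    if random.random() < p:                   # small-angle rotation (assumed +/-10 degrees)
        angle = random.uniform(-10, 10)
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        img = cv2.warpAffine(img, m, (w, h))
    if random.random() < p:                   # random crop, then resize back
        dy, dx = random.randint(0, h // 10), random.randint(0, w // 10)
        img = cv2.resize(img[dy:h - dy, dx:w - dx], (w, h))
    if random.random() < p:                   # horizontal flip
        img = cv2.flip(img, 1)
    if random.random() < p:                   # random erasing with noise
        eh, ew = random.randint(h // 10, h // 4), random.randint(w // 10, w // 4)
        y, x = random.randint(0, h - eh), random.randint(0, w - ew)
        img[y:y + eh, x:x + ew] = np.random.randint(0, 256, (eh, ew, img.shape[2]))
    return img
```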
Label smoothing [30] is employed to prevent overfitting in the classification task:

$$l_i = \begin{cases} 1 - \dfrac{N_c - 1}{N_c}\,\varepsilon, & \text{if } i = t \\[4pt] \dfrac{\varepsilon}{N_c}, & \text{if } i \neq t \end{cases} \tag{14}$$

where $\varepsilon$ is a small constant that encourages the model to be less confident on the training dataset and $t$ is the ground truth of the target class. In this study, $\varepsilon$ is set to 0.1.
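A small NumPy sketch of the smoothed targets of Eq. (14) and the cross-entropy of Eq. (13), for clarity (the class index and score vector are placeholders):

```python
import numpy as np

def smoothed_targets(t, num_classes, eps=0.1):
    """Eq. (14): 1 - (Nc - 1) / Nc * eps for the true class t, eps / Nc elsewhere."""
    l = np.full(num_classes, eps / num_classes)
    l[t] = 1.0 - (num_classes - 1) / num_classes * eps
    return l

def cross_entropy(l, s):
    """Eq. (13): E(L, S) = -sum_i l_i * log(s_i), with s the softmax scores."""
    return -np.sum(l * np.log(s + 1e-12))

# Example: six LDDB classes with ground-truth class 2.
# loss = cross_entropy(smoothed_targets(2, 6), softmax_scores)
```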
We conduct 60 trials to tune the networks' hyperparameters manually. For OLCMNet, we choose stochastic gradient descent (SGD) as the optimizer. SGD has proved to be an effective optimization method, central to many machine learning successes [31]. However, the SGD optimizer requires tuning each hyperparameter several times, trading a lower convergence rate for faster iterations. In contrast, adaptive moment estimation (Adam) computes adaptive learning rates for each parameter and only requires first-order gradients with little memory overhead [32]. To improve efficiency, we apply SGD to optimize OLCMNet and use the optimized hyperparameters of OLCMNet as a benchmark; we then employ Adam to optimize the other networks. Table IV summarizes the selected hyperparameters, including the optimizer
(OP), learning rate (LR), learning rate decay (LRD), epochs
(EP), batch size (BZ), dropout rate (DR), and weight decay
(WD). Based on such optimized hyperparameters, we train all
the networks from scratch and save the models with the highest
accuracy on the validation set during their training process.
Fig. 3 (a) and Fig. 3 (c) demonstrate that all the networks
are trained well on both StateFarm and LDDB datasets. The
InceptionNet’s validation accuracy curves (Fig. 3 (b) and
Fig. 3 (d)) present large fluctuations, because such a network was developed with a focus on ILSVRC 2012 and may thus be over-fitted by design to that specific task.
Fig. 3. The training and validating profiles on both StateFarm and LDDB datasets, where (a) and (b) present the training and validating accuracy on the
StateFarm Dataset, respectively; (c) and (d) show the training and validating accuracy on the LDDB Dataset, respectively.
TABLE IV
THE SELECTED HYPERPARAMETERS OF DIFFERENT NETWORKS
It is worth mentioning that the hyperparameter selection
in Table IV is effective and acceptable. Table V presents
each NN’s highest accuracy during its training and validating
process shown in Fig. 3. We can find that OLCMNet’s training
and validation accuracy on the two datasets are not always
the highest as expected, indicating a relatively fair compar-
ison in this study. For example, the ResNet-50 achieves the
highest training accuracy on the data-augmented StateFarm
Dataset (99.58%) and the highest validation accuracy on the
LDDB Dataset (98.06%). Besides, it is also relatively fair
that we tune OLCMNet on the LDDB Datasets, then use
such net’s hyperparameters as a benchmark to train other
networks independently on the LDDB and StateFarm Datasets.
One illustration is that the training accuracy of each network
on the auxiliary dataset (StateFarm Dataset) is not all lower
than that of each network on the LDDB Dataset. Moreover,
the relative fairness of such selection can be further con-
firmed by an additional experiment in which we regard the
VGG19 and ResNet34 in [27] as the benchmark and use the
hyperparameter selection in Table IV to conduct comparative
experiments on the StateFarm Dataset. Table VI shows that
the performance of the two networks adopting our parameter
configuration is still better than that of the networks using
parameter configuration in [27]. In fact, other researchers
[33] have used similar hyperparameter multiplexing methods
to conduct comparative experiments more efficiently. The
independent training duration in Table V is roughly equal
TAB LE V
THE TRAINING AND VALIDATION DETAILS OF DIFFERENT NETWORKS
to the time cost for tuning one hyperparameter. It will take
extended time costs to traverse all the hyperparameters of
other networks on both datasets, considering the existing
60 trials for tuning OLCMNet and about 50 trials for tuning
other networks’ learning rate. Therefore, the hyperparameter
selection in this study is a practical compromise between time
cost and comparative fairness.
C. Model Testing
For investigating the performance of different methods
when the hardware’s computing resources are gradually
decreasing, all the well-trained networks are ported,
without any speed-up operations, to a computer and a TX2 device. The computer configurations are CPU/Intel i7-8700@3.20GHz, GPU/Nvidia GTX1080Ti@11GB
and Memory/LPDDR4@32GB. The TX2 configurations
are CPU/Dual-core Denver2(64-bit) and quad-core ARM
A57 Complex, GPU/Nvidia PascalTM architecture with
256 CUDA cores (1.3 TFLOPS), Memory/LPDDR4
(128-bit)@4GB and Storage/eMMC5.1@32GB. Note that the
TX2 is only applied to measure inference latency, while the
computer is employed to verify more performance.
TABLE VI
THE PERFORMANCE OF DIFFERENT NNS WHEN USING OUR HYPERPARAMETER CONFIGURATION

TABLE VII
THE FLOATING POINT PERFORMANCE ON TWO TARGET PLATFORMS

We use 10128 images (StateFarm) and 40304 images (LDDB) for the performance testing carried out on the computer. For the testing samples used on the TX2, we randomly select 400 images from the StateFarm Dataset and a real video with 400 frames from the LDDB Dataset. Besides the accuracy, we also
use F1 score to measure different networks’ performance,
where the definition of F1 score follows [34]. The inference
latency is measured using a single large core with a batch size
of one. Specifically, we count a latency per batch (1 frame
image), then obtain the mean and standard deviation of
these latencies when the networks run through all the testing
batches. The multi-core inference time is not reported in
this study since such a setup is not very practical for our
applications. The definition of FLOPs follows [21], i.e., the
number of multiply-adds. The parameters are counted directly by the model.summary function built into Keras.
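For reference, the latency measurement described above (batch size of one, mean and standard deviation over all test batches) amounts to something like the following sketch; `model` and `test_images` are placeholders.

```python
import time
import numpy as np

def measure_latency(model, test_images):
    """Run single-image batches and report mean/std inference latency in milliseconds."""
    latencies = []
    for img in test_images:
        batch = np.expand_dims(img, axis=0)          # batch size of one
        start = time.perf_counter()
        model.predict(batch)                         # one forward pass
        latencies.append((time.perf_counter() - start) * 1000.0)
    return float(np.mean(latencies)), float(np.std(latencies))

# Hypothetical usage: mean_ms, std_ms = measure_latency(olcmnet, statefarm_400)
```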
Table VII shows that the accuracy of networks trained on
the LDDB Dataset is higher than that on StateFarm Dataset,
demonstrating that increasing sample diversity can improve
the networks’ performance in practical applications. On the
two datasets StateFarm and LDDB, all networks’ accuracy
is roughly the same as their F1 scores, indicating that the
samples used for training and testing in the two datasets are
balanced. The accuracy, F1 score, and inference speed of
OLCMNet surpass those of the other networks due to its
improved structure. More details on the effects of various
structure improvements are introduced in the subsequent abla-
tion experiments. Although the OLCMNet’s parameters and
FLOPs are not the lowest, they are still acceptable because they rank near the front in the comparative experiments. Besides,
there is no strong positive correlation between inference
speed, parameters, and FLOPs. For instance, the DenseNet40’s
parameter amount [23] is the lowest, 0.26M, due to its
particular feature map multiplexing structure, but its FLOPs
are as high as 5.64B. Benefiting from neural architecture search (NAS) technology, MobileNet-V3 (Small) hits the
lowest FLOPs, 0.09B. However, the parameters and latency
of MobileNet-V3 [20] are not the lowest compared with
other networks. The research in [35] also reports a similar
phenomenon. In fact, FLOPs, an indirect metric of computational complexity, is only an approximation of, and usually not equivalent to, the direct metric that many applications [21], including ours, care about, namely inference latency.
Note that only MSCNN, among all the networks used for
comparison, is developed focusing on StateFarm’s Dataset.
Our trained MSCNN (MSCNN-I) accuracy is close to the
accuracy reported in [27] (MSCNN-II), which once again
indicates that our training is effective for networks initially designed for other tasks.
For the accuracy results shown in Table VII, Fig. 4 gives
more details in the form of multiple confusion matrices.
We find that except for InceptionNet-V4, the other net-
works’ accuracy for C9 (talking-to-passenger) in the StateFarm
Dataset is lower than 70%. This is because small conversations
(hands on the steering wheel with a small head deflection
angle) are easily considered safe-driving (C0) during driving.
For the remaining eight classes in the StateFarm Dataset,
the OLCMNet’s accuracy for each class is greater than 80%,
performing better than other networks. On the LDDB Dataset,
the recognition of C2 (smoking) by all the networks is less
than 90%. It is difficult to distinguish between smoking and
drinking due to their similarity in posture. Besides, the useful features of small cigarettes become scarcer after down-sampling, which is another reason for the low accuracy on C2.
Leveraging the class activation mapping (CAM) technique
[36], Fig. 5 gives more details to explain the results presented
in Fig. 4. We take the C4 in StateFarm Dataset and C3 in
LDDB Dataset as examples. We find that the OLCMNets can
successfully localize the discriminative regions for distraction
classification as the actions that the drivers are performing
rather than the drivers themselves. Besides, OLCMNet’s pre-
diction score of the same image is higher than that of other
networks, i.e., 0.917 (C4) and 0.902 (C3), which is consistent
with the results presented in Fig. 4, i.e., the highest accuracy
of 97.62% (C4) and 98.04% (C3). We also observe that
the highlighted regions vary across different networks. For
instance, InceptionNet-V4 and MobileNet-v3 (small) activate
their interest area incorrectly and hit lower prediction scores,
0.235 and 0.218, respectively (Fig. 5 (a)). The prediction score
of MobileNet-v3 (small) is lower than those of other networks
because of its unreasonable interest area (Fig. 5 (b)). Although
the prediction scores of InceptionNet-V4 and MobileNet-v3
(large) in Fig. 5 (b) are relatively high, their expanded areas
of interest will also bring in more interference and may cause classification errors. The InceptionNet-V4 in Fig. 5 (a) happens to be an example.

Fig. 4. The confusion matrices of different networks, where (a) and (b) are the confusion matrices for the StateFarm and LDDB datasets.

Fig. 5. Examples of the CAMs generated from the predicted classes, where (a) and (b) are the CAMs when applying different networks to classify C4 in the StateFarm Dataset and C3 in the LDDB Dataset.
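A minimal sketch of the CAM computation of [36] for a Keras-style classifier is given below; it assumes the model ends with global average pooling followed by a single weight layer, and the conv layer name is a hypothetical placeholder.

```python
import numpy as np
import tensorflow as tf

def class_activation_map(model, image, conv_layer_name, class_idx):
    """CAM [36]: weight the last conv feature maps by the classifier weights after GAP."""
    conv_layer = model.get_layer(conv_layer_name)
    # Sub-model returning both the conv feature maps and the class scores.
    cam_model = tf.keras.Model(model.inputs, [conv_layer.output, model.output])
    feats, _ = cam_model.predict(np.expand_dims(image, axis=0))
    feats = feats[0]                                    # (H, W, K)
    # Weights of the classification layer that follows global average pooling.
    class_weights = model.layers[-1].get_weights()[0]   # (K, num_classes), assumed
    cam = feats @ class_weights[:, class_idx]           # (H, W) activation map
    cam = np.maximum(cam, 0.0)
    return cam / (cam.max() + 1e-12)                    # normalized to [0, 1]
```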
To show the classification performance of different networks
more clearly, we count each confusion matrix’s diagonal
accuracy in Fig. 4 and then draw the accuracy distribution’s
boxplots for StateFarm and LDDB datasets. Fig. 6 shows that
the minimum value and Q1 (25th percentile) of OLCMNet’s
boxplots are the highest. Meanwhile, the boxes of OLCMNet are smaller than those of most other networks. Therefore, the accuracy of the OLCMNet is better than that of the other networks overall.

TABLE VIII
EFFECT OF VARIANTS ON OLCMNET

Fig. 6. The accuracy distribution's boxplots of different networks, where (a) and (b) are the boxplots for the StateFarm and LDDB datasets.
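The per-class accuracies summarized in Fig. 6 are simply the diagonals of the row-normalized confusion matrices; a short sketch (matplotlib assumed available) is:

```python
import numpy as np
import matplotlib.pyplot as plt

def per_class_accuracy(confusion):
    """Diagonal of the row-normalized confusion matrix = per-class accuracy."""
    confusion = np.asarray(confusion, dtype=float)
    return np.diag(confusion) / confusion.sum(axis=1)

def accuracy_boxplots(per_net_confusions, names):
    """One box per network, drawn from its per-class accuracy distribution."""
    data = [per_class_accuracy(c) for c in per_net_confusions]
    plt.boxplot(data, labels=names)
    plt.ylabel("Per-class accuracy")
    plt.show()
```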
D. Ablation Study
1) Impact of the Low-Frequency Branch: The OLCM block
includes a high-frequency branch and two low-frequency
branches. In the octave-like operation, we keep the orig-
inal feature map unchanged in the high-frequency branch
as the input, followed by a DC operation and no upsam-
pling. On the other hand, we leverage an average pool-
ing (AP) to obtain a low-frequency feature map, followed by
two DC operations and upsampling. According to Table III,
we describe the two low-frequency branches’ operations as
AP+3×3DC+upsampling and AP+5×5DC+upsampling
and the high-frequency operation as 3 × 3 DC. To investigate the low-frequency branch's function under an approximately equal receptive field, we use a 5 × 5 DC and a 7 × 7 DC (DC@5+7) to replace these two low-frequency operations
and keep high-frequency operation unchanged. Table VIII
shows that both accuracy and speed are degraded when
we only use DC operation, indicating the effectiveness of
the octave-like low-frequency branch. We also observe the
influence of the low-frequency branch’s number. Table VIII
shows that the accuracy is improved when the low-frequency branches increase to 3 (Low-frequency branches@3), but the inference speed slows down. On the other hand, decreasing the branches (Low-frequency branches@1) results in the opposite effect. Therefore, it is necessary to determine the number of low-frequency branches according to both the accuracy requirements and the hardware resources of a specific task.

Fig. 7. Impact of individual components in the development of OLCMNet, where (a) and (b) are the experiments conducted on the StateFarm and LDDB datasets, respectively.
2) Impact of the SE Module in OLCM Block: Table VIII
shows that the accuracy significantly drops when removing
the SE module (OLCM block SE@0), but the speed is not
improved much. On the other hand, when we add a SE module
in the OLCM block (OLCM block SE@2), the accuracy is not
improved significantly, but the speed decreases. These results
demonstrate that the SE module’s information fusion helps
OLCMNet improve accuracy. However, more SE modules do not necessarily mean better performance.
3) Impact of the SE Module and PC Operator in the Last
Stage: Table VIII presents that removing the SE module (Last
stage SE@0) reduces the number of parameters but has almost no effect on the FLOPs and latency. More seriously, it causes a
decline in OLCMNet’s accuracy. Therefore, the SE module in
the last stage is helpful to improve the accuracy of OLCMNet.
Besides, we investigate the function of the PC operator after
the bottleneck. When removing such an operator (Last stage
PC@0), the OLCMNet's latency, FLOPs, and parameter count are all reduced, but its accuracy on both datasets decreases.
Fig. 7 further illustrates the results of Table VIII, which helps us investigate the OLCMNet's structure more intuitively.
V. CONCLUSION
This study aims to obtain better accuracy under a lim-
ited computational budget, given by a TX2 target platform
and application scenarios for detecting driver distraction. To
achieve such a goal, an OLCMNet is proposed for detecting
driver distraction. We have described our efforts to harness the
idea of octave convolution and advances in network design
to deliver the lightweight OLCMNet. We have also shown
how to adapt low-frequency branches and apply squeeze and
excite in a quantization friendly. Compared with the existing
backbone and lightweight networks, the TX2 target platform
experiments indicate that OLCMNet hits acceptable trade-
offs, i.e., 89.53% accuracy for StateFarm Dataset and 95.98%
accuracy for the LDDB Dataset when the latency is 32.8±4.6 ms. In the
future, we will use a vehicle-grade hardware platform instead
of the industrial-grade TX2. Meanwhile, the OLCMNet will be
further improved and ported to such a vehicle-grade platform,
along with other algorithms such as obstacle recognition and
pupil tracking.
REFERENCES
[1] R. Tian, L. Li, M. Chen, Y. Chen, and G. J. Witt, “Studying the effects of
driver distraction and traffic density on the probability of crash and near-
crash events in naturalistic driving environment,” IEEE Trans. Intell.
Transp. Syst., vol. 14, no. 3, pp. 1547–1555, Sep. 2013.
[2] N. Li and C. Busso, “Predicting perceived visual and cognitive distrac-
tions of drivers with multimodal features,” IEEE Trans. Intell. Transp.
Syst., vol. 16, no. 1, pp. 51–65, Feb. 2015.
[3] T. Liu, Y. Yang, G.-B. Huang, Y. K. Yeo, and Z. Lin, “Driver distraction
detection using semi-supervised machine learning, IEEE Trans. Intell.
Transp. Syst., vol. 17, no. 4, pp. 1108–1120, Apr. 2016.
[4] Y. Liao, S. E. Li, W. Wang, Y. Wang, G. Li, and B. Cheng, “Detection
of driver cognitive distraction: A comparison study of stop-controlled
intersection and speed-limited highway, IEEE Trans. Intell. Transp.
Syst., vol. 17, no. 6, pp. 1628–1637, Jun. 2016.
[5] Z. Li, S. Bao, I. V. Kolmanovsky, and X. Yin, “Visual-manual distraction
detection using driving performance indicators with naturalistic driving
data,” IEEE Trans. Intell. Transp. Syst., vol. 19, no. 8, pp. 2528–2535,
Aug. 2018.
[6] J. Chen, Z. Wu, and J. Zhang, “Driving safety risk prediction using
cost-sensitive with nonnegativity-constrained autoencoders based on
imbalanced naturalistic driving data, IEEE Trans. Intell. Transp. Syst.,
vol. 20, no. 12, pp. 4450–4465, Dec. 2019.
[7] B. Shi, L. Xu, and W. Meng, “Applying a WNN-HMM based driver
model in human driver simulation: Method and test,” IEEE Trans. Intell.
Transp. Syst., vol. 19, no. 11, pp. 3431–3438, Nov. 2018.
[8] K. T. Chui, K. F. Tsang, H. R. Chi, B. W. K. Ling, and C. K. Wu,
“An accurate ECG-based transportation safety drowsiness detection
scheme, IEEE Trans. Ind. Informat., vol. 12, no. 4, pp. 1438–1452,
Aug. 2016.
[9] S. Wang, Y. Zhang, C. Wu, F. Darvas, and W. A. Chaovalitwongse,
“Online prediction of driver distraction based on brain activity patterns,”
IEEE Trans. Intell. Transp. Syst., vol. 16, no. 1, pp. 136–150, Feb. 2015.
[10] K. Seshadri, F. Juefei-Xu, D. K. Pal, M. Savvides, and C. P. Thor,
“Driver cell phone usage detection on strategic highway research pro-
gram (SHRP2) face view videos, in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit. Workshops (CVPRW), Jun. 2015, pp. 35–43.
[11] T. H. N. Le, Y. Zheng, C. Zhu, K. Luu, and M. Savvides, “Multiple
scale faster-RCNN approach to driver’s cell-phone usage and hands on
steering wheel detection,” in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit. Workshops (CVPRW), Jun. 2016, pp. 46–53.
[12] G. Borghi, M. Venturelli, R. Vezzani, and R. Cucchiara, “POSEidon:
Face-from-depth for driver pose estimation,” in Proc. IEEE Conf.
Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4661–4670.
[13] B. Baheti, S. Gajre, and S. Talbar, “Detection of distracted driver using
convolutional neural network,” in Proc. IEEE/CVF Conf. Comput. Vis.
Pattern Recognit. Workshops (CVPRW), Jun. 2018, pp. 1032–1038.
[14] S. Masood, A. Rai, A. Aggarwal, M. N. Doja, and M. Ahmad, “Detect-
ing distraction of drivers using convolutional neural network,” Pattern
Recognit. Lett., vol. 139, pp. 79–85, Nov. 2020.
[15] Y. Xing et al., “Identification and analysis of driver postures for in-
vehicle driving activities and secondary tasks recognition,” IEEE Trans.
Comput. Social Syst., vol. 5, no. 1, pp. 95–108, Mar. 2018.
[16] Y. Xing, C. Lv, H. Wang, D. Cao, E. Velenis, and F.-Y. Wang, “Driver
activity recognition for intelligent vehicles: A deep learning approach,”
IEEE Trans. Veh. Technol., vol. 68, no. 6, pp. 5379–5390, Jun. 2019.
[17] D.-D. Chen, W. Wang, W. Gao, and Z.-H. Zhou, “Tri-net for semi-
supervised deep learning,” in Proc. 27th Int. Joint Conf. Artif. Intell.,
Jul. 2018, pp. 2014–2020.
[18] Y. Abouelnaga, H. M. Eraqi, and M. N. Moustafa, “Real-time distracted
driver posture classification,” 2017, arXiv:1706.09498.[Online]. Avail-
able: http://arxiv.org/abs/1706.09498
[19] Y. Chen et al., “Drop an octave: Reducing spatial redundancy in convo-
lutional neural networks with octave convolution,” in Proc. IEEE/CVF
Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3435–3444.
[20] A. Howard et al., “Searching for MobileNetV3,” in Proc. IEEE/CVF
Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1314–1324.
[21] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “ShuffleNet V2: Practical
guidelines for efficient CNN architecture design,” in Proc. Eur. Conf.
Comput. Vis. (ECCV), 2018, pp. 116–131.
[22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
(CVPR), Jun. 2016, pp. 770–778.
[23] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely
connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
[24] C. H. Zhao, B. L. Zhang, J. He, and J. Lian, “Recognition of driving
postures by contourlet transform and random forests,” IET Intell. Transp.
Syst., vol. 6, no. 2, pp. 161–168, 2012.
[25] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in
Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018,
pp. 7132–7141.
[26] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4,
inception-ResNet and the impact of residual connections on learning,”
in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 4278–4284.
[27] Y. Hu, M. Lu, and X. Lu, “Driving behaviour recognition from still
images by using multi-stream fusion CNN,” Mach. Vis. Appl., vol. 30,
no. 5, pp. 851–865, Jul. 2019.
[28] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating
deep network training by reducing internal covariate shift, 2015,
arXiv:1502.03167. [Online]. Available: http://arxiv.org/abs/1502.03167
[29] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random eras-
ing data augmentation,” 2017, arXiv:1708.04896. [Online]. Available:
http://arxiv.org/abs/1708.04896
[30] R. Müller, S. Kornblith, and G. E. Hinton, “When does label smoothing
help?” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 4696–4705.
[31] G. E. Hinton, “Reducing the dimensionality of data with neural net-
works,” Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006.
[32] D. P. Kingma and J. Ba, “Adam: A method for stochastic opti-
mization,” 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.
org/abs/1412.6980
[33] B. Chang, L. Meng, E. Haber, L. Ruthotto, D. Begert, and E. Holtham,
“Reversible architectures for arbitrarily deep residual neural networks,”
in Proc. AAAI Conf. Artif. Intell., vol. 32, no. 1, 2018, pp. 2811–2818.
[34] D. M. W. Powers, “Evaluation: From precision, recall and F-
measure to ROC, informedness, markedness and correlation,” 2020,
arXiv:2010.16061. [Online]. Available: http://arxiv.org/abs/2010.16061
[35] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár,
“Designing network design spaces,” in Proc. IEEE/CVF Conf. Comput.
Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10428–10436.
[36] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning
deep features for discriminative localization,” in Proc. IEEE Conf.
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2921–2929.
Penghua Li was born in 1984. He received the
B.S. degree in electronic information science and
technology and the Ph.D. degree in control theory
and control engineering from Chongqing University,
China, in 2008 and 2012, respectively. He was a
Senior Visiting Scholar with the Vienna University
of Technology. He is currently a Professor with the
Chongqing University of Posts and Telecommuni-
cations (CQUPT). He also serves as the Deputy
Director of the Department of Measurement and
Control, Automation College, CQUPT; the Director
of the Chongqing Artificial Intelligence Society (CQAIS); and the Standing
Committee Member of the Intelligent Transportation Professional Committee,
CQAIS. His research direction is neural network theory and its application
research, such as image recognition, speech recognition, multi-round dialogue,
and lithium battery health management. He won the First Prize of Chongqing
Science and Technology Progress Award twice in 2018 and 2019. He
also received the title of the Young Scientific and Technological Talent of
Chongqing. He chaired the 27th and 30th China Conference on Control and
Decision-Making Neural Networks.
Yifeng Yang was born in 1996. He received the
B.E. degree in control science and engineering from
the Chongqing University of Posts and Telecommu-
nications, Chongqing, China, in 2018, where he is
currently pursuing the master’s degree in control the-
ory and control engineering. His research interests
include deep learning and image classification.
Radu Grosu (Member, IEEE) received the Ph.D.
degree in computer science from the Technical
University of Munich, Munich, Germany, in 1994.
He is currently a Professor and the Head of
the Cyber-Physical Group, Faculty of Informatics,
Vienna University of Technology. Before receiving
his appointment at the Vienna University of Tech-
nology, he was an Associate Professor with the
Computer Science Department, State University of
New York, Stony Brook, where he co-directed the
Concurrent-Systems Laboratory and co-founded the
Systems-Biology Laboratory. He was a Research Associate with the Computer
Science Department, University of Pennsylvania. He is a Research Professor
with the Computer Science Department, State University of New York. His
research interests include modeling, analysis and control of cyber-physical,
and biological systems. His application focus include green operating systems,
mobile ad-hoc networks, automotive systems, the Mars rover, cardiac-cell
networks, and genetic regulatory networks. He is a member of the International
Federation of Information Processing WG 2.2. He was the recipient of the
National Science Foundation Career Award, the State University of New York
Research Foundation Promising Inventor Award, the ACM Service Award.
Guodong Wang received the Ph.D. degree in
computer science from the Vienna University of
Technology, Vienna, Austria, in January 2019.
He is currently the Managing Director of the
Sino-Austria Research Institute for Intelligent Indus-
tries (SINOAUS), Nanjing, China. Before joining
SINOAUS, he was a Machine Learning Researcher
with the Institute of Computer Engineering, Vienna
University of Technology, dealing with theoretical
research of machine learning and its industrial appli-
cations. He is the Vice Chief Engineer with the
Shanghai Institute of Computing Technology, Shanghai, China. He is a Guest
Professor with Hangzhou Dianzi University, Hangzhou, China. His research
interests include learning representation, deep neural networks, automated
machine learning, cyber-physical systems, and the application of machine
learning techniques on industrial data analytics. He is a member of ACM
SIGBED China.
Rui Li was born in 1975. He received the B.S.
degree from the Chongqing University of Technol-
ogy, China, in 1999, and the M.S. and Ph.D. degrees
from Chongqing University, China, in 2004 and
2009, respectively. He is currently a Professor with
the College of Automation, Chongqing University
of Posts and Telecommunications. He also serves as
the Head of the Chongqing University Innovation
Research Group and the Director of the Labora-
tory Instrument Subcommittee of China Instrument
and Control Society. His research interests include
intelligent sensing, intelligent robots, intelligent electromechanical structures,
and intelligent manufacturing. He received titles, such as the Chongqing
Outstanding Youth, the Chongqing Academic Technology Leader, the Bayu
Distinguished Professor, and the Chongqing University Outstanding Talent.
He won the First Prize of Chongqing Science and Technology Progress
Award in 2017 and the Third Prize of China Machinery Industry Science
and Technology Progress Award in 2020.
Yuehong Wu was born in 1964. He is currently
a Senior Engineer and the Deputy Dean of the
Chongqing Lilong Automobile Intelligent Technol-
ogy Research Institute. He is the Deputy Chairman
of the Automobile Instrument Standard Committee
of China Automobile Association. He is a mem-
ber of the Expert Committee of China Automobile
Association. Since 1996, he has been serving as the
Head of the Research and Development Department,
Chongqing YAZAKI Instrument Company Ltd., and
has been engaging in developing automotive elec-
tronic products for more than 30 years. He presided over and participated
in research and development for dozens of automobile instrument products,
including entire digital automobile instrument, virtual automobile instru-
ment, and multi-function display, which are widely used in many vehicles,
such as the Toyota Prado/Coaster, Roewe RX5, MG6, Cadillac XT5/6, Haval
H6/H9/M6/C30, and Geely Vision/Boyue/Xingyue.
Zeng Huang was born in 1980. He received the B.S. degree in mechanical design, manufacturing, and automation from Xi'an Technological University, Xi'an, China, in 2002, and the M.S. degree in industrial engineering from Chongqing University, Chongqing, China, in 2017. He is currently working as a Science and Technology Project Management Manager at Chongqing Chang'an Automobile Company Ltd. Since 2017, he has been leading and participating in the research and development of driving assistance and human-vehicle interaction systems for automobiles produced by Chang'an Automobile Company Ltd., such as the CX70, CX30, CS55, and Yuexiang-V5. His research areas include vehicle environment perception and human-vehicle interaction systems.
... Yan et al. [42] focused on locating the driver's hand by extracting prominent information, with the goal of predicting driving posture via trainable filters and local neighborhood pooling operations. Meanwhile, Li et al. [43] designed a lightweight network, termed OLCMNet, to detect driver distractions. They accomplished this by extending feature maps into two separate branches via point-wise convolution, effectively reducing network size and enhancing real-time performance. ...
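For readers who want a concrete picture of that two-branch design, the following is a minimal PyTorch sketch written only from the description above: a point-wise (1x1) convolution expands the input, the channels are split into a reduced-resolution branch and a full-resolution branch, each processed by a depth-wise convolution, and a second point-wise convolution fuses the concatenation. The class name TwoBranchBlock, the 50/50 channel split, the 3x3 kernels, and the 2x downsampling are illustrative assumptions, not the authors' exact OLCMNet implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchBlock(nn.Module):
    """Illustrative two-branch block: a point-wise conv expands the input,
    one branch works at reduced resolution (octave-convolution style),
    the other keeps the original resolution (details are assumptions)."""
    def __init__(self, in_ch, exp_ch):
        super().__init__()
        self.expand = nn.Conv2d(in_ch, exp_ch, kernel_size=1)            # point-wise expansion
        half = exp_ch // 2
        self.dw_low = nn.Conv2d(half, half, 3, padding=1, groups=half)   # depth-wise, reduced-resolution branch
        self.dw_high = nn.Conv2d(exp_ch - half, exp_ch - half, 3, padding=1,
                                 groups=exp_ch - half)                   # depth-wise, full-resolution branch
        self.project = nn.Conv2d(exp_ch, in_ch, kernel_size=1)           # point-wise fusion

    def forward(self, x):
        f = F.relu(self.expand(x))
        half = f.shape[1] // 2
        low, high = torch.split(f, [half, f.shape[1] - half], dim=1)
        low = F.avg_pool2d(low, 2)                                       # downsample the low branch
        low = F.relu(self.dw_low(low))
        low = F.interpolate(low, size=high.shape[-2:], mode="nearest")   # upsample back to match
        high = F.relu(self.dw_high(high))
        return self.project(torch.cat([low, high], dim=1))

# quick shape check
y = TwoBranchBlock(16, 32)(torch.randn(1, 16, 64, 64))
print(y.shape)  # torch.Size([1, 16, 64, 64])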
Article
Full-text available
Risky driving is a major factor in traffic incidents, necessitating constant monitoring and prevention through Intelligent Transportation Systems (ITS). Despite recent progress, a lack of suitable data for detecting risky driving in traffic surveillance settings remains a significant challenge. To address this issue, Bayonet-Drivers, a pioneering benchmark for risky driving detection, is proposed. The unique challenge posed by Bayonet-Drivers arises from the nature of the original data, which are obtained from intelligent monitoring and recording systems rather than in-vehicle cameras. Bayonet-Drivers encompasses a broad spectrum of challenging scenarios, thereby enhancing the resilience and generalizability of algorithms for detecting risky driving. Further, to address the scarcity of labeled data without compromising detection accuracy, a novel semi-supervised network architecture, named DGMB-Net, is proposed. Within DGMB-Net, an enhanced semi-supervised method founded on a teacher-student model is introduced, aiming to bypass the time-consuming and labor-intensive tasks associated with data labeling. Additionally, DGMB-Net incorporates an Adaptive Perceptual Learning (APL) module and a Hierarchical Feature Pyramid Network (HFPN) to amplify spatial perception capabilities and amalgamate features at varying scales and levels, thus boosting detection precision. Extensive experiments on widely utilized datasets, including the State Farm dataset and Bayonet-Drivers, demonstrated the remarkable performance of the proposed DGMB-Net.
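As an illustration of the teacher-student idea mentioned above (not DGMB-Net's actual training procedure), a minimal pseudo-labeling step in PyTorch could look like the following; the confidence threshold, the cross-entropy loss, and the toy models in the usage lines are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def pseudo_label_step(teacher, student, optimizer, unlabeled_batch, threshold=0.9):
    """Generic teacher-student pseudo-labeling step (illustrative only):
    the teacher labels unlabeled images and the student trains on the
    confident ones. Threshold and loss choice are assumptions."""
    teacher.eval()
    with torch.no_grad():
        probs = F.softmax(teacher(unlabeled_batch), dim=1)
        conf, pseudo = probs.max(dim=1)
        keep = conf >= threshold                  # keep only confident pseudo-labels
    if keep.sum() == 0:
        return 0.0
    student.train()
    logits = student(unlabeled_batch[keep])
    loss = F.cross_entropy(logits, pseudo[keep])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# toy usage with placeholder linear models
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.SGD(student.parameters(), lr=0.01)
print(pseudo_label_step(teacher, student, opt, torch.randn(8, 3, 32, 32), threshold=0.1))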
... Various solutions based on deep learning lead to high levels of accuracy but often demand substantial computational resources and may not always meet real-time detection requirements [4], [5]. For example, lightweight convolutional neural networks might trade off accuracy in certain scenarios [6]. Temporal-spatial double-line deep learning networks with causal AND-OR graphs showed promising results in continuous recognition [7]. ...
Conference Paper
Full-text available
Every year, 2.5 million car crashes involve distracted drivers globally. A crash can occur only a few seconds after the driver becomes distracted. Distracted driving thus poses a critical threat to road safety, calling for innovative approaches to its detection and mitigation. This paper introduces a novel system to monitor in-car conversations and identify potential distractions from escalating arguments. The system analyzes Mel spectrograms generated from real-time audio signals containing in-car discussions by combining continuous voice recording and deep learning techniques. First, a denoiser employs a convolutional autoencoder to reduce car engine noise within the spectrograms. Then, a classifier uses convolutional and recurrent neural networks to determine whether the audio corresponds to a calm conversation or a quarrel based on the denoised spectrogram. The experimental results showed that the system achieved a 91.8% classification accuracy. This system addresses a previously unexplored dimension of cognitive distraction, offering valuable insights into strategies for reducing the risk of road accidents. Ongoing research is focused on accounting for other environmental noises, such as radio speakers, music, wind from open windows, and engine sounds from surrounding vehicles, which may influence classification accuracy. The system is also being extended to consider more than two occupants in the car.
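To make the audio front end concrete, here is a minimal sketch of how a log-Mel spectrogram can be computed with torchaudio before any denoising or classification; the sample rate, FFT size, hop length, and number of Mel bands are assumptions rather than the paper's settings.

import torch
import torchaudio

# Illustrative front end only: convert a mono waveform into a log-Mel
# spectrogram of the kind such a denoiser/classifier could consume.
sample_rate = 16000
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate,
                                            n_fft=1024, hop_length=256, n_mels=64)
waveform = torch.randn(1, sample_rate * 5)     # 5 s of placeholder audio
log_mel = torch.log(mel(waveform) + 1e-6)      # (1, 64, time) log-Mel spectrogram
print(log_mel.shape)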
Article
Full-text available
Distracted behavior detection is an important task in computer-assisted driving. Although deep learning has made significant progress in this area, it is still difficult to meet the requirements of the real-time analysis and processing of massive data by relying solely on local computing power. To overcome these problems, this paper proposes a driving distraction detection method based on cloud–fog computing architecture, which introduces scalable modules and a model-driven optimization based on greedy pruning. Specifically, the proposed method makes full use of cloud–fog computing to process complex driving scene data, solves the problem of local computing resource limitations, and achieves the goal of detecting distracted driving behavior in real time. In terms of feature extraction, scalable modules are used to adapt to different levels of feature extraction to effectively capture the diversity of driving behaviors. Additionally, in order to improve the performance of the model, a model-driven optimization method based on greedy pruning is introduced to optimize the model structure to obtain a lighter and more efficient model. Through verification experiments on multiple driving scene datasets such as LDDB and StateFarm, the effectiveness of the proposed driving distraction detection method is demonstrated.
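As a rough illustration of greedy pruning (not the paper's model-driven procedure), the sketch below zeroes out the convolution filters with the smallest L1 norms; the pruning ratio and the norm-based criterion are assumptions.

import torch
import torch.nn as nn

def greedy_filter_prune(conv: nn.Conv2d, ratio: float = 0.3):
    """Illustrative greedy pruning: zero out the conv filters with the
    smallest L1 norms. Ratio and criterion are assumptions only."""
    with torch.no_grad():
        norms = conv.weight.abs().sum(dim=(1, 2, 3))   # one L1 norm per output filter
        n_prune = int(ratio * norms.numel())
        idx = torch.argsort(norms)[:n_prune]           # greedily pick the weakest filters
        conv.weight[idx] = 0.0
        if conv.bias is not None:
            conv.bias[idx] = 0.0
    return idx

pruned = greedy_filter_prune(nn.Conv2d(16, 32, 3), ratio=0.25)
print(len(pruned))  # 8 filters zeroed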
Article
Full-text available
Accurate recognition of driver distraction is significant for the design of human-machine cooperative driving systems. Existing studies mainly focus on classifying varied distracted driving behaviors, which depend heavily on the scale and quality of datasets and only detect discrete distraction categories. Therefore, most data-driven approaches have limited capability of recognizing unseen driving activities and cannot provide a reasonable solution for downstream applications. To address these challenges, this paper develops a vision Transformer-enabled weakly supervised contrastive (W-SupCon) learning framework, in which distracted behaviors are quantified by calculating their distances from the normal driving representation set. A Gaussian mixture model (GMM) is employed for representation clustering, which centralizes the distribution of the normal driving representation set to better identify distracted behaviors. A novel driver behavior dataset and three existing ones are employed for the evaluation. Experimental results demonstrate that the proposed approach achieves more accurate and robust performance than existing methods in the recognition of unknown driver activities. Furthermore, the rationality of distraction levels for different driving behaviors is evaluated through driver skeleton poses. The constructed dataset and demo videos are available at https://yanghh.io/Driver-Distraction-Quantification .
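A minimal sketch of the GMM-based idea, under the assumption that distraction is scored by how unlikely an embedding is under a mixture fitted to normal-driving representations; the embedding dimension, number of components, and random placeholder data are illustrative.

import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a GMM on embeddings of normal driving and score new embeddings by
# negative log-likelihood, so larger scores suggest behaviour further
# from the normal-driving set (illustrative scoring only).
rng = np.random.default_rng(0)
normal_embeddings = rng.normal(size=(500, 128))        # placeholder for encoder outputs
gmm = GaussianMixture(n_components=4, covariance_type="diag").fit(normal_embeddings)

new_embeddings = rng.normal(loc=0.5, size=(10, 128))
distraction_score = -gmm.score_samples(new_embeddings)  # higher = less like normal driving
print(distraction_score.round(2))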
Article
Driver distraction causes a significant number of traffic accidents every year, resulting in economic losses and casualties. Currently, the level of automation in commercial vehicles is far from completely unmanned, and drivers still play an important role in operating and controlling the vehicle. Therefore, driver distraction behavior detection is crucial for road safety. At present, driver distraction detection primarily relies on traditional convolutional neural networks (CNN) and supervised learning methods. However, there are still challenges such as the high cost of labeled datasets, limited ability to capture high-level semantic information, and weak generalization performance. To solve these problems, this paper proposes a new self-supervised learning method based on masked image modeling for driver distraction behavior detection. Firstly, a self-supervised learning framework for masked image modeling (MIM) is introduced to address the heavy labor and material costs of dataset labeling. Secondly, the Swin Transformer is employed as an encoder. Performance is enhanced by reconfiguring the Swin Transformer block and adjusting the distribution of window multi-head self-attention (W-MSA) and shifted window multi-head self-attention (SW-MSA) detection heads across all stages, which makes the model more lightweight. Finally, various data augmentation strategies are used along with the best random masking strategy to strengthen the model's recognition and generalization ability. Test results on a large-scale driver distraction behavior dataset show that the proposed self-supervised learning method achieves an accuracy of 99.60%, approximating the excellent performance of advanced supervised learning methods. Our code is publicly available at github.com/Rocky1salady-killer/SL-DDBD.
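To illustrate random masking for masked image modeling in general terms (not the paper's best strategy), the following sketch zeroes out a random subset of non-overlapping patches; the patch size and mask ratio are assumptions.

import torch

def random_patch_mask(images, patch=16, mask_ratio=0.6):
    """Illustrative random masking: zero out a random subset of
    non-overlapping patches in each image."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    n_patches = gh * gw
    n_mask = int(mask_ratio * n_patches)
    masked = images.clone()
    for i in range(b):
        idx = torch.randperm(n_patches)[:n_mask]
        for j in idx.tolist():
            r, col = divmod(j, gw)
            masked[i, :, r * patch:(r + 1) * patch, col * patch:(col + 1) * patch] = 0.0
    return masked

out = random_patch_mask(torch.randn(2, 3, 224, 224))
print(out.shape)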
Article
Full-text available
Abnormal driving behaviour is a leading cause of serious traffic accidents threatening human life and public property globally. In this paper, we investigate the use of a deep learning approach to automatically recognize driving behaviour (such as normal driving, driving with hands off the wheel, calling, playing with a mobile phone, smoking, and talking with passengers) in a single image. The task of driving behaviour recognition can be regarded as a multi-class classification problem, and we address this problem from two aspects in our study: (1) employing a multi-stream CNN to extract multi-scale features by filtering images with receptive fields of different kernel sizes, and (2) investigating different fusion strategies to combine the multi-scale information and generate the final decision for driving behaviour recognition. The effectiveness of our proposed method is validated by extensive experiments carried out on our self-created simulated driving behaviour dataset, as well as a real driving behaviour dataset, and the experimental results demonstrate that the proposed multi-stream CNN-based method achieves significant performance improvements compared with the state of the art.
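A minimal sketch of the multi-stream idea with late fusion by concatenation; the number of streams, kernel sizes, stream depth, and the six-class output are illustrative assumptions rather than the paper's architecture.

import torch
import torch.nn as nn

class MultiStreamCNN(nn.Module):
    """Illustrative multi-stream feature extractor: parallel conv streams
    with different kernel sizes, fused by concatenation before the
    classifier (all details are assumptions)."""
    def __init__(self, n_classes=6):
        super().__init__()
        self.streams = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 16, k, padding=k // 2), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1))
            for k in (3, 5, 7)                     # one stream per receptive-field size
        ])
        self.fc = nn.Linear(16 * 3, n_classes)

    def forward(self, x):
        feats = [s(x).flatten(1) for s in self.streams]
        return self.fc(torch.cat(feats, dim=1))   # late fusion by concatenation

logits = MultiStreamCNN()(torch.randn(4, 3, 128, 128))
print(logits.shape)  # torch.Size([4, 6])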
Article
Intelligent vehicles and advanced driver assistance systems (ADAS) need to have proper awareness of the traffic context, as well as the driver status, since ADAS share vehicle control authority with the human driver. This paper provides an overview of ego-vehicle driver intention inference (DII), focusing mainly on lane change intention on highways. First, the human intention mechanism is discussed to give an overall understanding of driver intention. Next, the ego-vehicle driver intention is classified into different categories based on various criteria. A complete DII system can be separated into different modules, which consist of the traffic context awareness, driver state monitoring, and vehicle dynamics measurement modules. The relationship between these modules and their corresponding impacts on DII are analyzed. Then, the lane change intention inference system is reviewed from the perspective of input signals, algorithms, and evaluation. Finally, future concerns and emerging trends in this area are highlighted.
Article
A large number of studies have shown that most vehicle collisions are caused by drivers' abnormal operations. To ensure the safety of all people on the road network as much as possible, it is crucial to be able to predict the drivers' driving safety risks in real time. In this paper, we propose a novel cost-sensitive L1/L2-nonnegativity-constrained deep autoencoder network for driving safety risk prediction. Unfortunately, with existing research methods, the size of the sliding time window is too large, the feature extraction is relatively subjective, and class imbalances occur, which leads to low identification accuracy, long prediction times, and poor applicability. We first propose using a three-layer L1/L2-nonnegativity-constrained autoencoder to adaptively search the optimal size of the sliding window and then construct a deep L1/L2-nonnegativity-constrained autoencoder network to automatically extract the hidden features of the driving behaviors. Finally, we build a new L1/L2-nonnegativity-constrained focal loss classifier to predict the driving behaviors under different safety risk levels. The results from the public 100-Car naturalistic driving study dataset indicate that our method can effectively find the optimal window size, reduce the data volume and reconstruction error, and extract more distinctive features. Furthermore, this method effectively curbs the class imbalance, improves the driving safety risk prediction performance, reduces overfitting, shortens the prediction time, and improves the timeliness.
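One simple way to read the L1/L2 nonnegativity constraint is as a penalty on the negative part of the weights added to the reconstruction loss; the sketch below implements that reading with a toy autoencoder, and the penalty coefficients and layer sizes are assumptions, not the paper's formulation.

import torch
import torch.nn as nn

def nonneg_penalty(model, l1=1e-4, l2=1e-4):
    """Illustrative L1/L2 nonnegativity penalty: punish the negative part
    of the weights so training pushes them towards nonnegative values."""
    p1 = p2 = 0.0
    for w in model.parameters():
        neg = torch.relu(-w)              # only negative entries contribute
        p1 = p1 + neg.sum()
        p2 = p2 + (neg ** 2).sum()
    return l1 * p1 + l2 * p2

ae = nn.Sequential(nn.Linear(30, 8), nn.ReLU(), nn.Linear(8, 30))  # toy autoencoder
x = torch.randn(64, 30)
loss = nn.functional.mse_loss(ae(x), x) + nonneg_penalty(ae)       # reconstruction + penalty
loss.backward()
print(float(loss))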
Chapter
Currently, neural network architecture design is mostly guided by the indirect metric of computation complexity, i.e., FLOPs. However, the direct metric, e.g., speed, also depends on other factors such as memory access cost and platform characteristics. Thus, this work proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs. Based on a series of controlled experiments, this work derives several practical guidelines for efficient network design. Accordingly, a new architecture is presented, called ShuffleNet V2. Comprehensive ablation experiments verify that our model is the state of the art in terms of the speed-accuracy tradeoff.
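In the same spirit, a direct speed measurement on the machine at hand can be as simple as timing forward passes; the model choice (torchvision's ShuffleNet V2), input size, and repeat counts below are arbitrary assumptions.

import time
import torch
import torchvision.models as models

# Measure wall-clock latency on this platform instead of relying on FLOPs.
model = models.shufflenet_v2_x1_0(weights=None).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    for _ in range(5):                      # warm-up runs
        model(x)
    t0 = time.perf_counter()
    for _ in range(50):
        model(x)
    t1 = time.perf_counter()

print(f"mean latency: {(t1 - t0) / 50 * 1000:.1f} ms")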
Conference Paper
Deep neural networks have witnessed great success in various real applications, but they require a large amount of labeled data for training. In this paper, we propose tri-net, a deep neural network that is able to use massive unlabeled data to help learning with limited labeled data. We consider model initialization, diversity augmentation, and pseudo-label editing simultaneously. In our work, we utilize output smearing to initialize modules, use fine-tuning on labeled data to augment diversity, and eliminate unstable pseudo-labels to alleviate the influence of suspicious pseudo-labeled data. Experiments show that our method achieves the best performance in comparison with state-of-the-art semi-supervised deep learning methods. In particular, it achieves an 8.30% error rate on CIFAR-10 using only 4000 labeled examples.
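As a loose illustration of discarding unstable pseudo-labels (not tri-net's actual three-module procedure), the sketch below keeps a pseudo-label only when two modules agree and are both confident; the threshold and the two-module setup are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def agreed_pseudo_labels(m1, m2, x, threshold=0.9):
    """Illustrative agreement filter in the spirit of tri-training: two
    modules label data for a third only when they agree and are both
    confident (details are assumptions)."""
    with torch.no_grad():
        p1 = F.softmax(m1(x), dim=1)
        p2 = F.softmax(m2(x), dim=1)
        c1, y1 = p1.max(dim=1)
        c2, y2 = p2.max(dim=1)
        keep = (y1 == y2) & (c1 >= threshold) & (c2 >= threshold)
    return x[keep], y1[keep]

# toy usage with placeholder linear models
m1 = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
m2 = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
xs, ys = agreed_pseudo_labels(m1, m2, torch.randn(16, 3, 32, 32), threshold=0.05)
print(xs.shape, ys.shape)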