Driver Distraction Detection Using Octave-Like
Convolutional Neural Network
Penghua Li, Yifeng Yang, Radu Grosu, Member, IEEE, Guodong Wang, Rui Li, Yuehong Wu, and Zeng Huang
Abstract—This study proposes a lightweight convolutional neural network with an octave-like convolution mixed block, called OLCMNet, for detecting driver distraction under a limited computational budget. The OLCM block uses point-wise convolution (PC) to expand feature maps into two sets of branches. In the low-frequency branches, we perform average pooling, depth-wise convolution (DC), and upsampling to obtain a low-resolution low-frequency feature map, reducing spatial redundancy and connection density. In the high-frequency branches, the expanded feature map with the original resolution is fed to the DC operator, gaining an apposite receptive field to capture fine details. The feature concatenation of the low-frequency and high-frequency branches is encoded sequentially by a squeeze-and-excitation (SE) module and a PC operator, realizing global feature information fusion. Introducing another SE module at the last stage, the OLCMNet facilitates further sensitive information exchange between layers. In addition, with an augmented reality head-up display (ARHUD) platform, we create a Lilong Distracted Driving Behavior (LDDB) Dataset through a series of on-road experiments. This dataset contains 14808 videos collected from an infrared camera, covering six driving behaviors of 2468 participants. We manually annotate these videos at five frames per second, obtaining a total of 267378 images. Compared with the existing methods, the embedded hardware platform experiments indicate that OLCMNet hits acceptable trade-offs, namely, 89.53% accuracy on the StateFarm Dataset and 95.98% accuracy on the LDDB Dataset when the latency is 32.8±4.6 ms.
Index Terms—Naturalistic driving, octave-like convolution, lightweight neural network.
I. INTRODUCTION
DRIVER distraction is a significant problem that affects driving safety: 80% of crashes and 65% of near-crashes involve driver distraction [1]. According to research by the National Highway Traffic Safety Administration (NHTSA), distraction falls into four categories,
Manuscript received December 3, 2020; revised April 20, 2021; accepted
May 28, 2021. This work was supported in part by the Ministry of Education
China Mobile Research Fund under Grant MCM20180404, in part by the
National State Key Laboratory Foundation under Grant 6142006200405, and
in part by the Project of Innovation Research Group of Universities in
Chongqing under Grant 202006. The Associate Editor for this article was
S. A. Birrell. (Corresponding author: Penghua Li.)
Penghua Li, Yifeng Yang, and Rui Li are with the College of Automa-
tion, Chongqing University of Posts and Telecommunications, Chongqing
400065, China (e-mail: liph@cqupt.edu.cn; s180301020@stu.cqupt.edu.cn;
lirui@cqupt.edu.cn).
Radu Grosu and Guodong Wang are with the Institute of Computer
Engineering, Vienna University of Technology, 1040 Vienna, Austria (e-mail:
radu.grosu@tuwien.ac.at; guodong.wang@tuwien.ac.at).
Yuehong Wu is with the Chongqing Lilong Automobile Research Institute,
Chongqing 401123, China (e-mail: wuyuehong@li-long.com.cn).
Zeng Huang is with Chongqing Chang’an Automobile Company Ltd.,
Chongqing 400023, China (e-mail: huangzeng@changan.com.cn).
Digital Object Identifier 10.1109/TITS.2021.3086411
i.e., visual distraction, auditory distraction, biomechanical dis-
traction, and cognitive distraction [2].
Over the past two decades, based on but not limited to the
above distraction categories, many naturalistic driving studies
(NDSs) [3]–[6] and simulated driving studies (SDSs) [7]–[9]
have further established the correlation between driver dis-
traction and degraded driving performance [3]. The SDSs
use simulated vehicle data to establish a human-like driver
model [7], leveraging electrocardiogram (ECG) [8] or elec-
troencephalogram (EEG) [9] to understand the driver behav-
iors. Although there is a correlation between simulated and
naturalistic driving behaviors, the difference between these two
types of driving behaviors is not negligible. Furthermore, indi-
rect physiological measurements inevitably introduce detection
errors [5].
In contrast, the NDSs, using the continuous recording of
driving information under real-world driving conditions, pro-
vide opportunities to assess driving risks [6]. Inspired by recent
developments of the convolutional neural network (CNN),
most NDSs attempt to employ video data to capture distracted
driver information. The work in [10] creates a dataset com-
posed of face-view videos in the Strategic Highway Research
Program (SHRP2) and utilizes the supervised descent-based
facial landmark tracking algorithm to detect driver cell-
phone usage with 93.9% accuracy. The later work [11]
applies a multiple scale faster-RCNN to SHRP2 videos for
cellphone usage detection and the intelligent vehicles and
applications (VIVA) challenge database for steering wheel
detection. The experiments present 94.6% and 93% accuracy
for VIVA and SHRP2 datasets, respectively. Borghi et al. [12]
present a regressive neural network called POSEidon for head
localization and pose estimation. The experiments carried on
Biwi Kinect Head Pose, ICT-3DHP, and Pandora datasets
demonstrate POSEidon is fast enough to process 30 frames
per second in a real-time fashion. Recent research reports
that a modified VGG-16 [13] classifies ten distraction behav-
iors such as talking to a passenger, drinking, etc., achieving
95.54% accuracy with parameters reduced from 140M in
the original VGG-16 to 15M only. A similar study using
VGG-19, which has a larger size than VGG-16, is reported in [14],
which shows 99% average accuracy for detection tasks in [13].
Xing et al. [15] leverage a deep feed-forward NN to detect
seven driving behaviors such as normal driving, cellphone
answering, etc., hitting an average of more than 80% accuracy.
Then they improve their work using CNN [16]. The AlexNet,
GoogLeNet, and ResNet50 are pre-trained for such seven
driving behaviors, achieving 81.6%, 78.6%, and 74.9% accu-
racy. Based on these pre-trained models, the binary detection
achieves 91.4% accuracy.
Although good results have been reported for the approaches
mentioned above, their application for driver distraction detec-
tion needs further validation with the following concerns.
Sample diversity is of paramount importance for the
generalization of neural networks [17]. For assessing
the performance of proposed methods, most distraction
studies only use samples that include few drivers, e.g.,
SHRP2 Database (41 drivers) [10], [11], VIVA Hand
Database (50 drivers), Pandora Database (22 drivers)
[11], Biwi Kinect Head Pose Database (20 drivers) [12],
StateFarm’s Dataset (81 drivers) [18] and self-collected
dataset (5 drivers) [15], [16]. The scarce sample diversity
makes published results less practical in a real-world
application.
Many approaches apply backbone networks with ample
size to detect distraction, e.g., original VGG-16 (140M),
improved VGG-16 (15M) [13], VGG-19 (143.68M) [14],
AlexNet (62.38M) and ResNet50 (19.35M) [16]. How-
ever, these networks need to transmit their data back to a computer, or even a server, to evaluate driving distraction, making such methods challenging to apply on an in-vehicle device with limited computational capability.
Recent efforts have been spent on improving the effi-
ciency of backbone CNNs, e.g., reducing inherent redun-
dancy of dense model parameters [19] or the channel
dimension of feature maps [20]. However, few related
studies directly use these lightweight networks to detect
driver distraction up to now. Besides, the designed light-
weight CNNs, e.g., MobileNet [20] and ShuffleNet [21],
share the commonplaces of using a single convolution
kernel for each layer, which leads to bottlenecks in fea-
ture expression, with the consequence of lacking higher
accuracy for in-vehicle applications.
This study aims to tackle these concerns above. Our
contributions are summarized as follows. (1) A distraction
dataset composed of 14808 videos focusing on 2468 par-
ticipants’ six driving behaviors is built to guarantee sam-
ple diversity. (2) A lightweight CNN with an octave-like
convolution mixed (OLCM) block, called OLCMNet,1 is
specially designed for detecting driver distraction under a
limited computational budget. (3) We apply the SOTA light-
weight networks, e.g., MobileNet [20] and ShuffleNet [21],
to detect driver distraction. We also conduct a systematic
comparison among those lightweight networks, the backbone
networks (ResNet [22], DenseNet [23], etc.) and the proposed
OLCMNet. Compared with the existing methods reform-
ing vanilla convolution, our method pays more attention to
improving the network’s topology. Due to its unique topology
modification, the proposed OLCMNet can facilitate sensitive
information exchange more flexibly and learn the image’s
multi-scale representation more lightly.
1For the OLCMNet’s source codes and the labeled StateFarm Dataset, please
refer to https://github.com/Lipenghua-CQ/OLCMNet.
Fig. 1. The ARHUD development platform, where (a) presents the cameras
and HUD projection devices in driver’s cabin, (b) presents the embedded
hardware processing image data and information fusion.
II. DATA COLLECTION
The distraction videos are collected in a series of on-road
experiments conducted on a platform established for develop-
ing an augmented reality head-up display (ARHUD) system.
Based on these video data, we create a distraction dataset
called Lilong Distracted Driving Behaviours (LDDB) Dataset
(Table I).
A. Data Collection Platform
Our first-generation ARHUD development platform (Fig. 1)
aims to verify individual functions in the designed plan, not
prototype production. The ARHUD platform includes four
cameras and a HUD device mounted in the driving cabin
(Fig. 1 (a)) and the embedded hardware mounted in the trunk
(Fig. 1 (b)).
Cameras C-I and C-II, C-III, and C-IV collect image data of the driver's eyes, the distracting behavior, and the road environment in front of the vehicle, respectively. The TX2-I
is used to run a pupil tracking algorithm and OLCMNet,
while the TX2-II performs an obstacle recognition algorithm that fuses vision and radar data. The IMAX8 receives the results
from both TX2s, then calls virtual-real registration and other
information fusion algorithms to drive the HUD projection.
Note that camera C-III is an infrared camera with a resolution of 1280 × 720 pixels, a frame rate of 30 Hz, and a field of view (FOV) of 100 degrees. It has two 850 nm infrared LEDs for light compensation and supports automatic switching of IR-CUT filters.
B. Data Acquisition Details
The LDDB Dataset is collected on the non-public roads in
Yuzui Industrial Park, Yubei District, Chongqing City, China.
The non-public roads include some predefined routes located
in different areas, e.g., urban testing roads, rural testing roads,
and factory testing roads, approximately 10 kilometers in total
length. We conduct the data collection from 9 to 11:30 am
and from 2:00 to 5:00 pm every day, over six months from
May to November 2019 except for holidays and days when
exceptional circumstances make collection impossible. The
environmental conditions (time of day, weather, mainly sunny
and rainy) also vary during the data collection. The people
participating in the collection comprise 2468 volunteers, whose ages range from 20 to 60 years (32.20±10.44), with a male/female ratio of 6:4.
TABLE I
THE DETAILS OF THE LDDB DATASET

Fig. 2. The design of OLCMNet, where (a)–(e) are the architectures of the OLCMNet, OLCM block, SE module, DC kernel, and PC kernel, respectively.

The LDDB Dataset covers six driving behaviors, i.e., safe driving, calling, smoking, drinking, reaching behind, and texting. The videos of all behaviors except two (reaching behind and texting, collected when the vehicle is static) are acquired when the participants drive slowly. The automobile's speed
depends on the driver’s driving proficiency but is not allowed
to exceed 40km/h. To ensure safety during the collection
process, an experimenter sitting in the co-driving position can
step on the secondary brake when a dangerous situation occurs.
For data authenticity, we ask the participants to perform six
driving behaviors according to their habits, causing different
duration shown in Table I. In each vehicle trial, a participant
completes six behaviors under the guidance of an experimenter
sitting in the co-driver position, driving a certain distance
over a total period of about 15 to 20 minutes. Note that we do not record the exact total duration, since it is an empirical figure and much of it is unimportant, e.g., the preparation time spent briefing a participant and the switch time while waiting for the next participant. Another
experimenter supervises the data collection, sitting on the back
seat’s right side in the vehicle. A handheld trigger connected
to TX2-I allows this experimenter to annotate video streams
from camera C-III when the video recording finishes for one
of six behaviors.
The video recording codes are written in Python-based
OpenCV and loaded into TX2-I. We follow [15], [18], [24] to
save the storage costs, compressing the data’s resolution from
1280 ×720 ×3 to 640 ×360 ×3 at 20 frames per second. This
video data is manually labeled frame by frame with ground truth for the six driving behaviors. To reduce the similarity of images captured from these videos, the actual image capture rate is reduced from 20 frames per second to 5 frames per second. More than 44000 images per class, selected from the 14808 videos, constitute the full dataset.
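As a rough illustration of this capture pipeline, the following sketch (written with Python/OpenCV, which the recording scripts also use; the file paths and the choice of keeping every fourth frame of a 20 fps clip are our assumptions) shows how 5 frames per second could be sampled from the compressed recordings.

```python
import cv2

def extract_frames(video_path, out_dir, src_fps=20, target_fps=5):
    """Keep every (src_fps // target_fps)-th frame of a recorded clip."""
    step = src_fps // target_fps              # 20 fps -> 5 fps gives step = 4
    cap = cv2.VideoCapture(video_path)
    kept = idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:                            # end of the video stream
            break
        if idx % step == 0:
            frame = cv2.resize(frame, (640, 360))   # storage resolution used above
            cv2.imwrite(f"{out_dir}/frame_{kept:06d}.jpg", frame)
            kept += 1
        idx += 1
    cap.release()
    return kept

# Hypothetical usage: n = extract_frames("c1_calling_0001.avi", "./frames/calling")
```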
III. METHODOLOGIES
The designed OLCMNet (Fig. 2(a)) consists of the
head, feature extraction, and last stages. In the OLCMNet,
we develop an OLCM building block (Fig. 2(b)) to reduce
spatial redundancy and connection density. This block focuses
on topology modification rather than vanilla convolution oper-
ator improvement like octave convolution [19]. Compared
with the existing lightweight efforts, the proposed OLCMNet
demonstrates its novelty by the following three aspects.
The OLCM block uses point-wise convolution (PC)
(Fig. 2(d)) to expand feature maps into two sets
of branches. In the low-frequency branches, we use
the average pooling (AP) to obtain a low-resolution
low-frequency feature map to reduce spatial redundancy
and improve subsequent operations’ computational effi-
ciency. Then we perform depth-wise convolution (DC)
operation (Fig. 2(c)) and upsampling for the follow-
ing feature extraction and information fusion. On the
other hand, we keep the original feature map in the
high-frequency branch as the DC operation’s inputs.
The OLCM block introduces the squeeze-and-excitation
(SE) module [25] (Fig. 2(e)) and PC operation for global
information fusion instead of performing information
exchange in the high and low-frequency groups like
octave convolution [19]. Specifically, we adopt global
average pooling (GAP) to obtain global information
from the embedded concatenation for each branch, then
create a bottleneck with two sets of PCs to emphasize
informative features and suppress less useful ones. The
PC operation after the SE module serves to squeeze and fuse information (a minimal code sketch of such an SE module is given after this overview).
The OLCMNet adds a SE module into its last stage,
which is different from MobileNetV3 [20], which only synthesizes global spatial features during its last stage.
Such modification facilitates sensitive information
exchange between layers further and provides a higher
classification accuracy.
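To make the role of the SE module concrete, a minimal sketch in the tf.keras functional style (consistent with the TensorFlow/Keras toolchain used later for training) is given below; the reduction ratio of 4 is our illustrative assumption, not a value reported in this paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_module(x, reduction=4):
    """Squeeze-and-excitation [25]: GAP -> bottleneck -> channel re-weighting."""
    channels = x.shape[-1]
    # Squeeze: global average pooling collapses the spatial dimensions.
    s = layers.GlobalAveragePooling2D()(x)
    # Excitation: a two-layer bottleneck produces per-channel gates in (0, 1).
    s = layers.Dense(channels // reduction, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)
    # Reshape to (1, 1, C) and rescale the input feature map channel-wise.
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])
```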
TABLE II
RELATED MATHEMATICAL SYMBOLS AND THEIR REPRESENTATIONS
The design of OLCMNet is as follows. The related symbols
and their representation are given in Table II.
A. Head Stage
Let $U$ be the input image. After downsampling the spatial resolution and expanding the channels, the feature map at the head stage, $F_h$, can be obtained by:

$$F_h = \sigma\left(K_h \otimes U\right) \tag{1}$$

where $\sigma$, $K_h$, and $\otimes$ represent the h-swish activation function, the vanilla convolution kernels, and the convolution operation, respectively. Note that the h-swish function is given as $x \cdot \mathrm{ReLU6}(x+3)/6$ [20].
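For reference, a one-line sketch of this activation (assuming the standard MobileNetV3 form x · ReLU6(x + 3)/6) is:

```python
import tensorflow as tf

def h_swish(x):
    """h-swish(x) = x * ReLU6(x + 3) / 6, the activation used in Eq. (1)."""
    return x * tf.nn.relu6(x + 3.0) / 6.0
```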
B. Feature Extraction Stage
Let $O^n_{in}$ and $O^n_{out}$ be the input and output feature maps of the $n$th OLCM block at the feature extraction stage, respectively. Obviously, $O^1_{in} = F_h$ and $O^{n+1}_{in} = O^n_{out}$. In a specific OLCM operation, $O^n_{in}$ is split into $M$ branches by $M$ PC operators, which yields an expanded input feature map $O^n_m$ with $m = 1, 2$. The calculation of $O^n_m$ is described as:

$$O^n_m = \sigma\left(\dot{K}^n_m \otimes O^n_{in}\right) \tag{2}$$

where $\dot{K}^n_m$ represents the kernels of the PC in the $m$th branch of the $n$th OLCM block.

$O^n_m$ can then be learned in a low-frequency and a high-frequency manner in the following operations. For the low-frequency learning, an average pooling operation is used to down-sample $O^n_m$, giving a low-frequency input feature map $O^n_{low}$:

$$O^n_{low} = \mathrm{pool}\left(O^n_m, Z_A, S_A\right) \tag{3}$$

Here, $Z_A$ and $S_A$ denote the kernel size and the stride size. Note that both $Z_A$ and $S_A$ are set to 2 in this study. Then, a DC operator is applied to $O^n_{low}$ to extract the $p$th low-frequency output feature map $O^n_{low_p}$:

$$O^n_{low_p} = \sigma\left(\hat{K}^n_{low_p} \otimes O^n_{low}\right) \tag{4}$$

where $\hat{K}^n_{low_p}$ denotes the kernels in the $p$th low-frequency path of the $n$th OLCM block. Note that $p = \{1, 2, \ldots, p_{max}\}$ and $p_{max}$ is set to 2 in this study. To enable subsequent information fusion of feature maps with different spatial resolutions, $O^n_{low_p}$ is up-sampled, which yields a feature map with the original resolution, $\bar{O}^n_{low_p}$:

$$\bar{O}^n_{low_p} = \mathrm{upsample}\left(O^n_{low_p}, \lambda\right) \tag{5}$$

where $\lambda$ is the up-sampling factor of the nearest-neighbor interpolation and is set to 2 in this study.

For the high-frequency learning, $O^n_m$ is taken directly as the input feature tensor. Keeping the spatial resolution of this tensor unchanged, a high-frequency output feature map of branch $m$ in the $n$th block, $O^n_{high_q}$, can be obtained by a DC operator:

$$O^n_{high_q} = \sigma\left(\hat{K}^n_{high_q} \otimes O^n_m\right) \tag{6}$$

Here, $\hat{K}^n_{high_q}$ denotes the kernels of the DC in the $q$th high-frequency path of the $n$th OLCM block. Note that $q = \{1, 2, \ldots, q_{max}\}$ and $q_{max}$ is set to 1 in this study.

After learning the different frequency information, all branches are concatenated to form a fusion feature map $O^n_{concat}$:

$$O^n_{concat} = \mathrm{Concat}\left(\bar{O}^n_{low_p}, O^n_{high_q}\right) \tag{7}$$

Then a SE module is adopted to learn the more important feature channels, which helps to selectively emphasize informative features and suppress less useful ones. The SE operation is applied to $O^n_{concat}$ to obtain the filtered feature map $\tilde{O}^n_{concat}$:

$$\tilde{O}^n_{concat} = F_{SE}\left(O^n_{concat}\right) \tag{8}$$

where $F_{SE}(\cdot)$ is the SE operation, whose details are given in [25]. Ultimately, a PC with a linear activation function is adopted to fuse the multi-scale information between channels and to compress the channels, so that the final output of the $n$th OLCM block is obtained by:

$$O^n_{out} = \dot{K}^n_{out} \otimes \tilde{O}^n_{concat} \tag{9}$$

where $\dot{K}^n_{out}$ denotes the kernels of the PC at the end of the $n$th OLCM block.
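For concreteness, a minimal tf.keras sketch of one OLCM block following Eqs. (2)–(9), under one plausible reading of the branch layout (two low-frequency paths with 3 × 3 and 5 × 5 DCs, one 3 × 3 high-frequency path, and a pooling/up-sampling factor of 2), is given below. It reuses the `h_swish` and `se_module` sketches above; the channel arguments are illustrative placeholders rather than the values of Table III, and even spatial dimensions are assumed so that pooling and up-sampling match.

```python
from tensorflow.keras import layers

def olcm_block(x, expand_ch, out_ch):
    """One OLCM block: PC expansion -> low/high-frequency DC branches -> SE -> PC fusion."""
    # Eq. (2): point-wise convolutions expand the input into two branch groups.
    b_low = layers.Conv2D(expand_ch, 1, activation=h_swish)(x)
    b_high = layers.Conv2D(expand_ch, 1, activation=h_swish)(x)

    # Eqs. (3)-(5): low-frequency paths = average pooling -> DC (3x3, 5x5) -> up-sampling.
    low = layers.AveragePooling2D(pool_size=2, strides=2)(b_low)
    low_1 = layers.DepthwiseConv2D(3, padding="same", activation=h_swish)(low)
    low_2 = layers.DepthwiseConv2D(5, padding="same", activation=h_swish)(low)
    low_1 = layers.UpSampling2D(size=2)(low_1)
    low_2 = layers.UpSampling2D(size=2)(low_2)

    # Eq. (6): the high-frequency path keeps the original spatial resolution.
    high = layers.DepthwiseConv2D(3, padding="same", activation=h_swish)(b_high)

    # Eq. (7): concatenate the low- and high-frequency branches.
    cat = layers.Concatenate()([low_1, low_2, high])
    # Eq. (8): the SE module emphasizes informative channels.
    cat = se_module(cat)
    # Eq. (9): a linear point-wise convolution fuses and compresses the channels.
    return layers.Conv2D(out_ch, 1, activation=None)(cat)
```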
C. Last Stage
As mentioned above, the amount of computation at the feature extraction stage is significantly reduced by the concatenation of $N$ OLCM blocks, where the channels of the feature map are compressed by a PC operator at the end of each block. However, such an architecture complicates the subsequent classification, since the feature map of the last OLCM block, $O^N_{out} = O^n_{out}\big|_{n=N}$, suffers from a channel bottleneck when it is taken as the input feature map of the last stage of OLCMNet. Therefore, a PC operator is used to enrich the channel semantic information of $O^N_{out}$, which yields an expanded feature map $F^l_e$:

$$F^l_e = \sigma\left(\dot{K}^l_e \otimes O^N_{out}\right) \tag{10}$$

Here, $\dot{K}^l_e$ denotes the kernels of the PC at the beginning of the last stage. Then a SE module is used to further filter sensitive information, i.e., $\tilde{F}^l_e = F_{SE}(F^l_e)$, where $\tilde{F}^l_e$ denotes the filtered feature map. To generate channel-wise statistics, GAP is performed on $\tilde{F}^l_e$, which yields a global spatial information descriptor, i.e., $F^l_{gap} = \mathrm{GAP}(\tilde{F}^l_e)$. Finally, instead of using a direct fully connected layer on $F^l_{gap}$ for classification as in backbone networks, two PCs are used to obtain a predicted feature map, $F^l_{out} \in \mathbb{R}^{1 \times 1 \times I}$, which serves as the input of the final softmax projection over $I$ classes:

$$F^l_{out} = \dot{K}^l_2 \otimes \sigma\left(\dot{K}^l_1 \otimes F^l_{gap}\right) \tag{11}$$

$$s_i = \mathrm{softmax}\left(F^l_{out}(c_i)\right) = \frac{e^{F^l_{out}(c_i)}}{\sum_{j=1}^{I} e^{F^l_{out}(c_j)}} \tag{12}$$

where $\dot{K}^l_1$, $\dot{K}^l_2$, $s_i$, and $F^l_{out}(c_i)$ represent the kernels of the two PCs, the predicted score of the $i$th class, and the $i$th channel of $F^l_{out}$ at the last stage, respectively.
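Under the same illustrative assumptions, the last stage of Eqs. (10)–(12) can be sketched as follows; `expand_ch` and `num_classes` are placeholders.

```python
from tensorflow.keras import layers

def last_stage(x, expand_ch, num_classes):
    """Last stage: PC expansion -> SE -> GAP -> two PCs -> softmax (Eqs. (10)-(12))."""
    # Eq. (10): a point-wise convolution enriches the channel semantics.
    f = layers.Conv2D(expand_ch, 1, activation=h_swish)(x)
    # The additional SE module filters sensitive information before pooling.
    f = se_module(f)
    # GAP yields the global spatial descriptor F^l_gap, reshaped to 1x1 for the PCs.
    f = layers.GlobalAveragePooling2D()(f)
    f = layers.Reshape((1, 1, expand_ch))(f)
    # Eq. (11): two point-wise convolutions replace a direct fully connected layer.
    f = layers.Conv2D(expand_ch, 1, activation=h_swish)(f)
    f = layers.Conv2D(num_classes, 1, activation=None)(f)
    f = layers.Flatten()(f)
    # Eq. (12): softmax projection over the I classes.
    return layers.Softmax()(f)
```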
IV. EXPERIMENT AND RESULT ANALYSIS
For demonstrating the effectiveness of OLCMNet, we con-
duct the experiments on StateFarm Dataset [18] and LDDB
Dataset, comparing the results with those on ResNet-50
[22], DenseNet-40 [23], InceptionNet-V4 [26], MSCNN [27],
ShuffleNet-V2 [21], and MobileNet-V3 [20]. We also report
various ablation studies on OLCMNet to shed light on the
effects of various design decisions.
A. Data Description
The StateFarm Dataset covers ten driving behaviors,
i.e., safe driving (C0), texting-right (C1), talking on the
phone-right (C2), texting-left (C3), talking on the phone-left
(C4), operating the radio (C5), drinking (C6), reaching behind
(C7), hair and makeup (C8), and talking to the passenger (C9).
This dataset provides 22424 annotated images for training.
However, its 79726 images used for testing are unlabeled.
To evaluate these networks’ performance, we labeled ground
truth of approximately 1000 images per class from the
79726 images. For the LDDB Dataset, 70%, 10%, and 20% of the data are used as the training, validation, and testing sets, respectively.
B. Model Training
We train all the networks in a server where hardware
and software configurations are GPU/2×Nvidia Tesla
P100@16GB, Memory/8×ECC Registered DDR4@32GB,
CPU/2×Intel Xeon Gold6126@2.60GHz, Operating
System/Ubuntu18.04, Programming Language/Python3.7, and
Deep Learning Framework/ TensorFlow1.13.1+Keras2.3.1.
The cross-entropy loss function of these networks is described as:

$$E(L, S) = -\sum_{i=1}^{N_c} l_i \log(s_i) \tag{13}$$

where $l_i$ and $s_i$ denote the likelihood and the predicted score of the $i$th class, respectively. According to the structures presented in [20]–[23], [26], [27], we conduct a number of trials on the StateFarm and LDDB Datasets and give the predefined structures of OLCMNet for the two datasets (Table III). To avoid the influence of different image resolutions on the performance of all the aforementioned networks, we set the input size of OLCMNet (and of the other compared networks) to 128 × 256 × 3.

TABLE III
THE PREDEFINED STRUCTURES OF OLCMNET

The conv2d denotes a vanilla convolution with a 3 × 3 kernel in the head stage, while the OLCM denotes
the OLCM blocks in the feature extraction stage where the
kernel size of each OLCM block is 3 ×3. The expanded size
and output size represent each branch’s and output feature
maps’ channel numbers in the OLCM block, respectively.
We train both OLCMNets using synchronized stochastic gradient descent with 0.9 momentum based on the predefined structures. Batch normalization [28] is used in each convolutional layer to improve the stability of the OLCMNets. Data augmentation with a 0.5 probability setting (small-angle rotation, random cropping, horizontal flipping, and random erasing [29]) is applied to further increase sample diversity.
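As an illustration of this augmentation policy (each transform applied with probability 0.5; the rotation range, crop margins, and erasing sizes below are our assumptions rather than reported values), a simple per-image pipeline could look like:

```python
import random
import numpy as np
import cv2

def augment(img, p=0.5):
    """Small-angle rotation, random crop, horizontal flip, and random erasing [29]."""
    h, w = img.shape[:2]
    if random.random() < p:                   # small-angle rotation (assumed +/-10 degrees)
        angle = random.uniform(-10, 10)
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        img = cv2.warpAffine(img, m, (w, h))
    if random.random() < p:                   # random crop, then resize back
        dy, dx = random.randint(0, h // 10), random.randint(0, w // 10)
        img = cv2.resize(img[dy:h - dy, dx:w - dx], (w, h))
    if random.random() < p:                   # horizontal flip
        img = cv2.flip(img, 1)
    if random.random() < p:                   # random erasing with noise
        eh, ew = random.randint(h // 10, h // 4), random.randint(w // 10, w // 4)
        y, x = random.randint(0, h - eh), random.randint(0, w - ew)
        img[y:y + eh, x:x + ew] = np.random.randint(0, 256, (eh, ew, img.shape[2]))
    return img
```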
Label smoothing [30] is employed to prevent overfitting in the classification task:

$$l_i = \begin{cases} 1 - \dfrac{N_c - 1}{N_c}\,\varepsilon, & \text{if } i = t \\[4pt] \dfrac{\varepsilon}{N_c}, & \text{if } i \neq t \end{cases} \tag{14}$$

where $\varepsilon$ is a small constant that encourages the model to be less confident on the training dataset and $t$ is the ground truth of the target class. In this study, $\varepsilon$ is set to 0.1.
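A small NumPy sketch of the smoothed targets of Eq. (14) and the cross-entropy of Eq. (13), for clarity (the class index and score vector are placeholders):

```python
import numpy as np

def smoothed_targets(t, num_classes, eps=0.1):
    """Eq. (14): 1 - (Nc - 1) / Nc * eps for the true class t, eps / Nc elsewhere."""
    l = np.full(num_classes, eps / num_classes)
    l[t] = 1.0 - (num_classes - 1) / num_classes * eps
    return l

def cross_entropy(l, s):
    """Eq. (13): E(L, S) = -sum_i l_i * log(s_i), with s the softmax scores."""
    return -np.sum(l * np.log(s + 1e-12))

# Example: six LDDB classes with ground-truth class 2.
# loss = cross_entropy(smoothed_targets(2, 6), softmax_scores)
```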
We conduct 60 trials to tune the networks' hyperparameters manually. For OLCMNet, we choose stochastic gradient descent (SGD) as the optimizer. SGD has proved to be an effective optimization method, central to many machine learning successes [31]. However, the SGD optimizer requires tuning each hyperparameter several times, trading a lower convergence rate for faster iterations. In contrast, adaptive moment estimation (Adam) computes adaptive learning rates for each parameter and only requires first-order gradients with little memory overhead [32]. To improve efficiency, we apply SGD to optimize OLCMNet and use the optimized hyperparameters of OLCMNet as a benchmark; we then employ Adam to optimize the other networks. Table IV summarizes the selected hyperparameters, including the optimizer
(OP), learning rate (LR), learning rate decay (LRD), epochs
(EP), batch size (BZ), dropout rate (DR), and weight decay
(WD). Based on such optimized hyperparameters, we train all
the networks from scratch and save the models with the highest
accuracy on the validation set during their training process.
Fig. 3 (a) and Fig. 3 (c) demonstrate that all the networks
are trained well on both StateFarm and LDDB datasets. The
InceptionNet’s validation accuracy curves (Fig. 3 (b) and
Fig. 3 (d)) present large fluctuations, because such a network was developed with a focus on ILSVRC 2012 and may thus be over-fitted by design to that specific task.
Fig. 3. The training and validating profiles on both StateFarm and LDDB datasets, where (a) and (b) present the training and validating accuracy on the
StateFarm Dataset, respectively; (c) and (d) show the training and validating accuracy on the LDDB Dataset, respectively.
TABLE IV
THE SELECTED HYPERPARAMETERS OF DIFFERENT NETWORKS
It is worth mentioning that the hyperparameter selection
in Table IV is effective and acceptable. Table V presents
each NN’s highest accuracy during its training and validating
process shown in Fig. 3. We can find that OLCMNet’s training
and validation accuracy on the two datasets are not always
the highest as expected, indicating a relatively fair compar-
ison in this study. For example, the ResNet-50 achieves the
highest training accuracy on the data-augmented StateFarm
Dataset (99.58%) and the highest validation accuracy on the
LDDB Dataset (98.06%). Besides, it is also relatively fair
that we tune OLCMNet on the LDDB Datasets, then use
such net’s hyperparameters as a benchmark to train other
networks independently on the LDDB and StateFarm Datasets.
One illustration is that the training accuracy of each network
on the auxiliary dataset (StateFarm Dataset) is not all lower
than that of each network on the LDDB Dataset. Moreover,
the relative fairness of such selection can be further con-
firmed by an additional experiment in which we regard the
VGG19 and ResNet34 in [27] as the benchmark and use the
hyperparameter selection in Table IV to conduct comparative
experiments on the StateFarm Dataset. Table VI shows that
the performance of the two networks adopting our parameter
configuration is still better than that of the networks using
parameter configuration in [27]. In fact, other researchers
[33] have used similar hyperparameter multiplexing methods
to conduct comparative experiments more efficiently. The
independent training duration in Table V is roughly equal
TAB LE V
THE TRAINING AND VALIDATION DETAILS OF DIFFERENT NETWORKS
to the time cost for tuning one hyperparameter. It will take
extended time costs to traverse all the hyperparameters of
other networks on both datasets, considering the existing
60 trials for tuning OLCMNet and about 50 trials for tuning
other networks’ learning rate. Therefore, the hyperparameter
selection in this study is a practical compromise between time
cost and comparative fairness.
C. Model Testing
For investigating the performance of different methods
when the hardware’s computing resources are gradually
decreasing, all the well-trained networks are ported,
without any speed-up operations, to a computer and a TX2 device. The computer configurations are CPU/Intel i7-8700@3.20GHz, GPU/Nvidia GTX1080Ti@11GB
and Memory/LPDDR4@32GB. The TX2 configurations
are CPU/Dual-core Denver2(64-bit) and quad-core ARM
A57 Complex, GPU/Nvidia PascalTM architecture with
256 CUDA cores (1.3 TFLOPS), Memory/LPDDR4
(128-bit)@4GB and Storage/eMMC5.1@32GB. Note that the
TX2 is only applied to measure inference latency, while the
computer is employed to verify more performance.
TABLE VI
THE PERFORMANCE OF DIFFERENT NNS WHEN USING OUR HYPERPARAMETER CONFIGURATION

TABLE VII
THE FLOATING POINT PERFORMANCE ON TWO TARGET PLATFORMS

We use 10128 images (StateFarm) and 40304 images (LDDB) for the performance testing carried out on the computer. For the testing samples used on the TX2, we randomly select 400 images from the StateFarm Dataset and a real video with 400 frames from the LDDB Dataset. Besides the accuracy, we also
use F1 score to measure different networks’ performance,
where the definition of F1 score follows [34]. The inference
latency is measured using a single large core with a batch size
of one. Specifically, we count a latency per batch (1 frame
image), then obtain the mean and standard deviation of
these latencies when the networks run through all the testing
batches. The multi-core inference time is not reported in
this study since such a setup is not very practical for our
applications. The definition of FLOPs follows [21], i.e., the
number of multiply-adds. The parameters are counted directly by the model.summary function built into Keras.
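For reference, the latency measurement described above (batch size of one, mean and standard deviation over all test batches) amounts to something like the following sketch; `model` and `test_images` are placeholders.

```python
import time
import numpy as np

def measure_latency(model, test_images):
    """Run single-image batches and report mean/std inference latency in milliseconds."""
    latencies = []
    for img in test_images:
        batch = np.expand_dims(img, axis=0)          # batch size of one
        start = time.perf_counter()
        model.predict(batch)                         # one forward pass
        latencies.append((time.perf_counter() - start) * 1000.0)
    return float(np.mean(latencies)), float(np.std(latencies))

# Hypothetical usage: mean_ms, std_ms = measure_latency(olcmnet, statefarm_400)
```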
Table VII shows that the accuracy of networks trained on
the LDDB Dataset is higher than that on StateFarm Dataset,
demonstrating that increasing sample diversity can improve
the networks’ performance in practical applications. On the
two datasets StateFarm and LDDB, all networks’ accuracy
is roughly the same as their F1 scores, indicating that the
samples used for training and testing in the two datasets are
balanced. The accuracy, F1 score, and inference speed of
OLCMNet surpass those of the other networks due to its
improved structure. More details on the effects of various
structure improvements are introduced in the subsequent abla-
tion experiments. Although the OLCMNet’s parameters and
FLOPs are not the lowest, they are still acceptable because they rank near the front in the comparative experiments. Besides,
there is no strong positive correlation between inference
speed, parameters, and FLOPs. For instance, the DenseNet40’s
parameter amount [23] is the lowest, 0.26M, due to its
particular feature map multiplexing structure, but its FLOPs
are as high as 5.64B. Benefiting from neural architecture search (NAS) technology, MobileNet-V3 (Small) hits the
lowest FLOPs, 0.09B. However, the parameters and latency
of MobileNet-V3 [20] are not the lowest compared with
other networks. The research in [35] also reports a similar
phenomenon. In fact, FLOPs, an indirect metric of computational complexity, is only an approximation of, and usually not equivalent to, the direct metric that many applications [21], including ours, care about, namely inference latency.
Note that only MSCNN, among all the networks used for
comparison, is developed focusing on StateFarm’s Dataset.
Our trained MSCNN (MSCNN-I) accuracy is close to the
accuracy reported in [27] (MSCNN-II), which once again
indicates that our training is effective for networks initially designed for other tasks.
For the accuracy results shown in Table VII, Fig. 4 gives
more details in the form of multiple confusion matrices.
We find that except for InceptionNet-V4, the other net-
works’ accuracy for C9 (talking-to-passenger) in the StateFarm
Dataset is lower than 70%. This is because small conversations
(hands on the steering wheel with a small head deflection
angle) are easily considered safe-driving (C0) during driving.
For the remaining eight classes in the StateFarm Dataset,
the OLCMNet’s accuracy for each class is greater than 80%,
performing better than other networks. On the LDDB Dataset,
the recognition of C2 (smoking) by all the networks is less
than 90%. It is difficult to distinguish between smoking and
drinking due to their similarity in posture. Besides, the useful features of small cigarettes become scarcer after down-sampling, which is another reason for the low accuracy on C2.
Leveraging the class activation mapping (CAM) technique
[36], Fig. 5 gives more details to explain the results presented
in Fig. 4. We take the C4 in StateFarm Dataset and C3 in
LDDB Dataset as examples. We find that the OLCMNets can
successfully localize the discriminative regions for distraction
classification as the actions that the drivers are performing
rather than the drivers themselves. Besides, OLCMNet’s pre-
diction score of the same image is higher than that of other
networks, i.e., 0.917 (C4) and 0.902 (C3), which is consistent
with the results presented in Fig. 4, i.e., the highest accuracy
of 97.62% (C4) and 98.04% (C3). We also observe that
the highlighted regions vary across different networks. For
instance, InceptionNet-V4 and MobileNet-v3 (small) activate
their interest area incorrectly and hit lower prediction scores,
0.235 and 0.218, respectively (Fig. 5 (a)). The prediction score
of MobileNet-v3 (small) is lower than those of other networks
because of its unreasonable interest area (Fig. 5 (b)). Although
the prediction scores of InceptionNet-V4 and MobileNet-v3
(large) in Fig. 5 (b) are relatively high, their expanded areas
of interest will also bring in more interference and may cause classification errors. The InceptionNet-V4 in Fig. 5 (a) happens to be an example.

Fig. 4. The confusion matrices of different networks, where (a) and (b) are the confusion matrices for the StateFarm and LDDB datasets.

Fig. 5. Examples of the CAMs generated from the predicted classes, where (a) and (b) are the CAMs when applying different networks to classify C4 in the StateFarm Dataset and C3 in the LDDB Dataset.
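A minimal sketch of the CAM computation of [36] for a Keras-style classifier is given below; it assumes the model ends with global average pooling followed by a single weight layer, and the conv layer name is a hypothetical placeholder.

```python
import numpy as np
import tensorflow as tf

def class_activation_map(model, image, conv_layer_name, class_idx):
    """CAM [36]: weight the last conv feature maps by the classifier weights after GAP."""
    conv_layer = model.get_layer(conv_layer_name)
    # Sub-model returning both the conv feature maps and the class scores.
    cam_model = tf.keras.Model(model.inputs, [conv_layer.output, model.output])
    feats, _ = cam_model.predict(np.expand_dims(image, axis=0))
    feats = feats[0]                                    # (H, W, K)
    # Weights of the classification layer that follows global average pooling.
    class_weights = model.layers[-1].get_weights()[0]   # (K, num_classes), assumed
    cam = feats @ class_weights[:, class_idx]           # (H, W) activation map
    cam = np.maximum(cam, 0.0)
    return cam / (cam.max() + 1e-12)                    # normalized to [0, 1]
```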
To show the classification performance of different networks
more clearly, we count each confusion matrix’s diagonal
accuracy in Fig. 4 and then draw the accuracy distribution’s
boxplots for StateFarm and LDDB datasets. Fig. 6 shows that
the minimum value and Q1 (25th percentile) of OLCMNet’s
boxplots are the highest. Meanwhile, the boxes of OLCMNet are smaller than those of most other networks. Therefore, the accuracy of the OLCMNet is better than that of the other networks overall.

TABLE VIII
EFFECT OF VARIANTS ON OLCMNET

Fig. 6. The accuracy distribution's boxplots of different networks, where (a) and (b) are the boxplots for the StateFarm and LDDB datasets.
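The per-class accuracies summarized in Fig. 6 are simply the diagonals of the row-normalized confusion matrices; a short sketch (matplotlib assumed available) is:

```python
import numpy as np
import matplotlib.pyplot as plt

def per_class_accuracy(confusion):
    """Diagonal of the row-normalized confusion matrix = per-class accuracy."""
    confusion = np.asarray(confusion, dtype=float)
    return np.diag(confusion) / confusion.sum(axis=1)

def accuracy_boxplots(per_net_confusions, names):
    """One box per network, drawn from its per-class accuracy distribution."""
    data = [per_class_accuracy(c) for c in per_net_confusions]
    plt.boxplot(data, labels=names)
    plt.ylabel("Per-class accuracy")
    plt.show()
```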
D. Ablation Study
1) Impact of the Low-Frequency Branch: The OLCM block
includes a high-frequency branch and two low-frequency
branches. In the octave-like operation, we keep the orig-
inal feature map unchanged in the high-frequency branch
as the input, followed by a DC operation and no upsam-
pling. On the other hand, we leverage an average pool-
ing (AP) to obtain a low-frequency feature map, followed by
two DC operations and upsampling. According to Table III,
we describe the two low-frequency branches’ operations as
AP+3×3DC+upsampling and AP+5×5DC+upsampling
and the high-frequency operation as 3 × 3 DC. To investigate the low-frequency branch's function under an approximately equal receptive field, we use a 5 × 5 DC and a 7 × 7 DC (DC@5+7) to replace these two low-frequency operations
and keep high-frequency operation unchanged. Table VIII
shows that both accuracy and speed are degraded when
we only use DC operation, indicating the effectiveness of
the octave-like low-frequency branch. We also observe the
influence of the low-frequency branch’s number. Table VIII
shows that the accuracy is improved when the low-frequency branches increase to 3 (Low-frequency branches@3), but the inference speed slows down. On the other hand, decreasing the branches (Low-frequency branches@1) results in the opposite effect. Therefore, it is necessary to determine the number of low-frequency branches according to both the accuracy requirements and the hardware resources of a specific task.

Fig. 7. Impact of individual components in the development of OLCMNet, where (a) and (b) are the experiments conducted on the StateFarm and LDDB datasets, respectively.
2) Impact of the SE Module in OLCM Block: Table VIII
shows that the accuracy significantly drops when removing
the SE module (OLCM block SE@0), but the speed is not
improved much. On the other hand, when we add a SE module
in the OLCM block (OLCM block SE@2), the accuracy is not
improved significantly, but the speed decreases. These results
demonstrate that the SE module’s information fusion helps
OLCMNet improve accuracy. However, more SE modules do not necessarily mean better performance.
3) Impact of the SE Module and PC Operator in the Last
Stage: Table VIII presents that removing the SE module (Last
stage SE@0) reduces the number of parameters but has almost no effect on the FLOPs and latency. More seriously, it causes a
decline in OLCMNet’s accuracy. Therefore, the SE module in
the last stage is helpful to improve the accuracy of OLCMNet.
Besides, we investigate the function of the PC operator after
the bottleneck. When removing such an operator (Last stage
PC@0), the OLCMNet's latency, FLOPs, and parameter count are all reduced, but its accuracy on both datasets decreases.
Fig. 7 further illustrates the results of Table VIII, which helps us investigate the OLCMNet's structure more intuitively.
V. CONCLUSION
This study aims to obtain better accuracy under a lim-
ited computational budget, given by a TX2 target platform
and application scenarios for detecting driver distraction. To
achieve such a goal, an OLCMNet is proposed for detecting
driver distraction. We have described our efforts to harness the
idea of octave convolution and advances in network design
to deliver the lightweight OLCMNet. We have also shown
how to adapt low-frequency branches and apply squeeze and
excite in a quantization friendly. Compared with the existing
backbone and lightweight networks, the TX2 target platform
experiments indicate that OLCMNet hits acceptable trade-
offs, i.e., 89.53% accuracy for StateFarm Dataset and 95.98%
accuracy for the LDDB Dataset when the latency is 32.8±4.6 ms. In the
future, we will use a vehicle-grade hardware platform instead
of the industrial-grade TX2. Meanwhile, the OLCMNet will be
further improved and ported to such a vehicle-grade platform,
along with other algorithms such as obstacle recognition and
pupil tracking.
REFERENCES
[1] R. Tian, L. Li, M. Chen, Y. Chen, and G. J. Witt, “Studying the effects of
driver distraction and traffic density on the probability of crash and near-
crash events in naturalistic driving environment,” IEEE Trans. Intell.
Transp. Syst., vol. 14, no. 3, pp. 1547–1555, Sep. 2013.
[2] N. Li and C. Busso, “Predicting perceived visual and cognitive distrac-
tions of drivers with multimodal features,” IEEE Trans. Intell. Transp.
Syst., vol. 16, no. 1, pp. 51–65, Feb. 2015.
[3] T. Liu, Y. Yang, G.-B. Huang, Y. K. Yeo, and Z. Lin, “Driver distraction
detection using semi-supervised machine learning, IEEE Trans. Intell.
Transp. Syst., vol. 17, no. 4, pp. 1108–1120, Apr. 2016.
[4] Y. Liao, S. E. Li, W. Wang, Y. Wang, G. Li, and B. Cheng, “Detection
of driver cognitive distraction: A comparison study of stop-controlled
intersection and speed-limited highway, IEEE Trans. Intell. Transp.
Syst., vol. 17, no. 6, pp. 1628–1637, Jun. 2016.
[5] Z. Li, S. Bao, I. V. Kolmanovsky, and X. Yin, “Visual-manual distraction
detection using driving performance indicators with naturalistic driving
data,” IEEE Trans. Intell. Transp. Syst., vol. 19, no. 8, pp. 2528–2535,
Aug. 2018.
[6] J. Chen, Z. Wu, and J. Zhang, “Driving safety risk prediction using
cost-sensitive with nonnegativity-constrained autoencoders based on
imbalanced naturalistic driving data, IEEE Trans. Intell. Transp. Syst.,
vol. 20, no. 12, pp. 4450–4465, Dec. 2019.
[7] B. Shi, L. Xu, and W. Meng, “Applying a WNN-HMM based driver
model in human driver simulation: Method and test,” IEEE Trans. Intell.
Transp. Syst., vol. 19, no. 11, pp. 3431–3438, Nov. 2018.
[8] K. T. Chui, K. F. Tsang, H. R. Chi, B. W. K. Ling, and C. K. Wu,
“An accurate ECG-based transportation safety drowsiness detection
scheme, IEEE Trans. Ind. Informat., vol. 12, no. 4, pp. 1438–1452,
Aug. 2016.
[9] S. Wang, Y. Zhang, C. Wu, F. Darvas, and W. A. Chaovalitwongse,
“Online prediction of driver distraction based on brain activity patterns,”
IEEE Trans. Intell. Transp. Syst., vol. 16, no. 1, pp. 136–150, Feb. 2015.
[10] K. Seshadri, F. Juefei-Xu, D. K. Pal, M. Savvides, and C. P. Thor,
“Driver cell phone usage detection on strategic highway research pro-
gram (SHRP2) face view videos, in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit. Workshops (CVPRW), Jun. 2015, pp. 35–43.
[11] T. H. N. Le, Y. Zheng, C. Zhu, K. Luu, and M. Savvides, “Multiple
scale faster-RCNN approach to driver’s cell-phone usage and hands on
steering wheel detection,” in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit. Workshops (CVPRW), Jun. 2016, pp. 46–53.
[12] G. Borghi, M. Venturelli, R. Vezzani, and R. Cucchiara, “POSEidon:
Face-from-depth for driver pose estimation,” in Proc. IEEE Conf.
Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4661–4670.
[13] B. Baheti, S. Gajre, and S. Talbar, “Detection of distracted driver using
convolutional neural network,” in Proc. IEEE/CVF Conf. Comput. Vis.
Pattern Recognit. Workshops (CVPRW), Jun. 2018, pp. 1032–1038.
[14] S. Masood, A. Rai, A. Aggarwal, M. N. Doja, and M. Ahmad, “Detect-
ing distraction of drivers using convolutional neural network,” Pattern
Recognit. Lett., vol. 139, pp. 79–85, Nov. 2020.
[15] Y. Xing et al., “Identification and analysis of driver postures for in-
vehicle driving activities and secondary tasks recognition,” IEEE Trans.
Comput. Social Syst., vol. 5, no. 1, pp. 95–108, Mar. 2018.
[16] Y. Xing, C. Lv, H. Wang, D. Cao, E. Velenis, and F.-Y. Wang, “Driver
activity recognition for intelligent vehicles: A deep learning approach,”
IEEE Trans. Veh. Technol., vol. 68, no. 6, pp. 5379–5390, Jun. 2019.
[17] D.-D. Chen, W. Wang, W. Gao, and Z.-H. Zhou, “Tri-net for semi-
supervised deep learning,” in Proc. 27th Int. Joint Conf. Artif. Intell.,
Jul. 2018, pp. 2014–2020.
[18] Y. Abouelnaga, H. M. Eraqi, and M. N. Moustafa, “Real-time distracted
driver posture classification,” 2017, arXiv:1706.09498.[Online]. Avail-
able: http://arxiv.org/abs/1706.09498
[19] Y. Chen et al., “Drop an octave: Reducing spatial redundancy in convo-
lutional neural networks with octave convolution,” in Proc. IEEE/CVF
Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3435–3444.
[20] A. Howard et al., “Searching for MobileNetV3,” in Proc. IEEE/CVF
Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1314–1324.
[21] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “ShuffleNet V2: Practical
guidelines for efficient CNN architecture design,” in Proc. Eur. Conf.
Comput. Vis. (ECCV), 2018, pp. 116–131.
[22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
(CVPR), Jun. 2016, pp. 770–778.
[23] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely
connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
[24] C. H. Zhao, B. L. Zhang, J. He, and J. Lian, “Recognition of driving
postures by contourlet transform and random forests,” IET Intell. Transp.
Syst., vol. 6, no. 2, pp. 161–168, 2012.
[25] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in
Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018,
pp. 7132–7141.
[26] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4,
inception-ResNet and the impact of residual connections on learning,”
in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 4278–4284.
[27] Y. Hu, M. Lu, and X. Lu, “Driving behaviour recognition from still
images by using multi-stream fusion CNN,” Mach. Vis. Appl., vol. 30,
no. 5, pp. 851–865, Jul. 2019.
[28] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating
deep network training by reducing internal covariate shift, 2015,
arXiv:1502.03167. [Online]. Available: http://arxiv.org/abs/1502.03167
[29] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random eras-
ing data augmentation,” 2017, arXiv:1708.04896. [Online]. Available:
http://arxiv.org/abs/1708.04896
[30] R. Müller, S. Kornblith, and G. E. Hinton, “When does label smoothing
help?” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 4696–4705.
[31] G. E. Hinton, “Reducing the dimensionality of data with neural net-
works,” Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006.
[32] D. P. Kingma and J. Ba, “Adam: A method for stochastic opti-
mization,” 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.
org/abs/1412.6980
[33] B. Chang, L. Meng, E. Haber, L. Ruthotto, D. Begert, and E. Holtham,
“Reversible architectures for arbitrarily deep residual neural networks,”
in Proc. AAAI Conf. Artif. Intell., vol. 32, no. 1, 2018, pp. 2811–2818.
[34] D. M. W. Powers, “Evaluation: From precision, recall and F-
measure to ROC, informedness, markedness and correlation,” 2020,
arXiv:2010.16061. [Online]. Available: http://arxiv.org/abs/2010.16061
[35] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár,
“Designing network design spaces,” in Proc. IEEE/CVF Conf. Comput.
Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10428–10436.
[36] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning
deep features for discriminative localization,” in Proc. IEEE Conf.
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2921–2929.
Penghua Li was born in 1984. He received the
B.S. degree in electronic information science and
technology and the Ph.D. degree in control theory
and control engineering from Chongqing University,
China, in 2008 and 2012, respectively. He was a
Senior Visiting Scholar with the Vienna University
of Technology. He is currently a Professor with the
Chongqing University of Posts and Telecommuni-
cations (CQUPT). He also serves as the Deputy
Director of the Department of Measurement and
Control, Automation College, CQUPT; the Director
of the Chongqing Artificial Intelligence Society (CQAIS); and the Standing
Committee Member of the Intelligent Transportation Professional Committee,
CQAIS. His research direction is neural network theory and its application
research, such as image recognition, speech recognition, multi-round dialogue,
and lithium battery health management. He won the First Prize of Chongqing
Science and Technology Progress Award twice in 2018 and 2019. He
also received the title of the Young Scientific and Technological Talent of
Chongqing. He chaired the 27th and 30th China Conference on Control and
Decision-Making Neural Networks.
Yifeng Yang was born in 1996. He received the
B.E. degree in control science and engineering from
the Chongqing University of Posts and Telecommu-
nications, Chongqing, China, in 2018, where he is
currently pursuing the master’s degree in control the-
ory and control engineering. His research interests
include deep learning and image classification.
Radu Grosu (Member, IEEE) received the Ph.D.
degree in computer science from the Technical
University of Munich, Munich, Germany, in 1994.
He is currently a Professor and the Head of
the Cyber-Physical Group, Faculty of Informatics,
Vienna University of Technology. Before receiving
his appointment at the Vienna University of Tech-
nology, he was an Associate Professor with the
Computer Science Department, State University of
New York, Stony Brook, where he co-directed the
Concurrent-Systems Laboratory and co-founded the
Systems-Biology Laboratory. He was a Research Associate with the Computer
Science Department, University of Pennsylvania. He is a Research Professor
with the Computer Science Department, State University of New York. His
research interests include modeling, analysis and control of cyber-physical,
and biological systems. His application focus include green operating systems,
mobile ad-hoc networks, automotive systems, the Mars rover, cardiac-cell
networks, and genetic regulatory networks. He is a member of the International
Federation of Information Processing WG 2.2. He was the recipient of the
National Science Foundation Career Award, the State University of New York
Research Foundation Promising Inventor Award, the ACM Service Award.
Guodong Wang received the Ph.D. degree in
computer science from the Vienna University of
Technology, Vienna, Austria, in January 2019.
He is currently the Managing Director of the
Sino-Austria Research Institute for Intelligent Indus-
tries (SINOAUS), Nanjing, China. Before joining
SINOAUS, he was a Machine Learning Researcher
with the Institute of Computer Engineering, Vienna
University of Technology, dealing with theoretical
research of machine learning and its industrial appli-
cations. He is the Vice Chief Engineer with the
Shanghai Institute of Computing Technology, Shanghai, China. He is a Guest
Professor with Hangzhou Dianzi University, Hangzhou, China. His research
interests include learning representation, deep neural networks, automated
machine learning, cyber-physical systems, and the application of machine
learning techniques on industrial data analytics. He is a member of ACM
SIGBED China.
Rui Li was born in 1975. He received the B.S.
degree from the Chongqing University of Technol-
ogy, China, in 1999, and the M.S. and Ph.D. degrees
from Chongqing University, China, in 2004 and
2009, respectively. He is currently a Professor with
the College of Automation, Chongqing University
of Posts and Telecommunications. He also serves as
the Head of the Chongqing University Innovation
Research Group and the Director of the Labora-
tory Instrument Subcommittee of China Instrument
and Control Society. His research interests include
intelligent sensing, intelligent robots, intelligent electromechanical structures,
and intelligent manufacturing. He received titles, such as the Chongqing
Outstanding Youth, the Chongqing Academic Technology Leader, the Bayu
Distinguished Professor, and the Chongqing University Outstanding Talent.
He won the First Prize of Chongqing Science and Technology Progress
Award in 2017 and the Third Prize of China Machinery Industry Science
and Technology Progress Award in 2020.
Yuehong Wu was born in 1964. He is currently
a Senior Engineer and the Deputy Dean of the
Chongqing Lilong Automobile Intelligent Technol-
ogy Research Institute. He is the Deputy Chairman
of the Automobile Instrument Standard Committee
of China Automobile Association. He is a mem-
ber of the Expert Committee of China Automobile
Association. Since 1996, he has been serving as the
Head of the Research and Development Department,
Chongqing YAZAKI Instrument Company Ltd., and
has been engaging in developing automotive elec-
tronic products for more than 30 years. He presided over and participated
in research and development for dozens of automobile instrument products,
including entire digital automobile instrument, virtual automobile instru-
ment, and multi-function display, which are widely used in many vehicles,
such as the Toyota Prado/Coaster, Roewe RX5, MG6, Cadillac XT5/6, Haval
H6/H9/M6/C30, and Geely Vision/Boyue/Xingyue.
Zeng Huang was born in 1980. He received the B.S. degree in mechanical design, manufacturing, and automation from Xi'an Technological University, Xi'an, China, in 2002, and the M.S. degree in industrial engineering from Chongqing University, Chongqing, China, in 2017. He is currently working as a Science and Technology Project Management Manager at Chongqing Chang'an Automobile Company Ltd. Since 2017, he has been leading and participating in the research and development of driving assistance and human-vehicle interaction systems for automobiles produced by Chang'an Automobile Company Ltd., such as the CX70, CX30, CS55, and Yuexiang-V5. His research areas include vehicle environment perception and human-vehicle interaction systems.
... Yan et al. [42] focused on locating the driver's hand by extracting prominent information, with the goal of predicting driving posture via trainable filters and local neighborhood pooling operations. Meanwhile, Li et al. [43] designed a lightweight network, termed OLCMNet, to detect driver distractions. They accomplished this by extending feature maps into two separate branches via point-wise convolution, effectively reducing network size and enhancing real-time performance. ...
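For readers who want a concrete picture of that two-branch design, the following is a minimal PyTorch sketch written only from the description above: a point-wise (1x1) convolution expands the input, the channels are split into a reduced-resolution branch and a full-resolution branch, each processed by a depth-wise convolution, and a second point-wise convolution fuses the concatenation. The class name TwoBranchBlock, the 50/50 channel split, the 3x3 kernels, and the 2x downsampling are illustrative assumptions, not the authors' exact OLCMNet implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchBlock(nn.Module):
    """Illustrative two-branch block: a point-wise conv expands the input,
    one branch works at reduced resolution (octave-convolution style),
    the other keeps the original resolution (details are assumptions)."""
    def __init__(self, in_ch, exp_ch):
        super().__init__()
        self.expand = nn.Conv2d(in_ch, exp_ch, kernel_size=1)            # point-wise expansion
        half = exp_ch // 2
        self.dw_low = nn.Conv2d(half, half, 3, padding=1, groups=half)   # depth-wise, reduced-resolution branch
        self.dw_high = nn.Conv2d(exp_ch - half, exp_ch - half, 3, padding=1,
                                 groups=exp_ch - half)                   # depth-wise, full-resolution branch
        self.project = nn.Conv2d(exp_ch, in_ch, kernel_size=1)           # point-wise fusion

    def forward(self, x):
        f = F.relu(self.expand(x))
        half = f.shape[1] // 2
        low, high = torch.split(f, [half, f.shape[1] - half], dim=1)
        low = F.avg_pool2d(low, 2)                                       # downsample the low branch
        low = F.relu(self.dw_low(low))
        low = F.interpolate(low, size=high.shape[-2:], mode="nearest")   # upsample back to match
        high = F.relu(self.dw_high(high))
        return self.project(torch.cat([low, high], dim=1))

# quick shape check
y = TwoBranchBlock(16, 32)(torch.randn(1, 16, 64, 64))
print(y.shape)  # torch.Size([1, 16, 64, 64])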
Article
Full-text available
Risky driving is a major factor in traffic incidents, necessitating constant monitoring and prevention through Intelligent Transportation Systems (ITS). Despite recent progress, a lack of suitable data for detecting risky driving in traffic surveillance settings remains a significant challenge. To address this issue, Bayonet-Drivers, a pioneering benchmark for risky driving detection, is proposed. The unique challenge posed by Bayonet-Drivers arises from the nature of the original data, which are obtained from intelligent monitoring and recording systems rather than in-vehicle cameras. Bayonet-Drivers encompasses a broad spectrum of challenging scenarios, thereby enhancing the resilience and generalizability of algorithms for detecting risky driving. Further, to address the scarcity of labeled data without compromising detection accuracy, a novel semi-supervised network architecture, named DGMB-Net, is proposed. Within DGMB-Net, an enhanced semi-supervised method founded on a teacher-student model is introduced, aiming to bypass the time-consuming and labor-intensive tasks associated with data labeling. Additionally, DGMB-Net incorporates an Adaptive Perceptual Learning (APL) module and a Hierarchical Feature Pyramid Network (HFPN) to amplify spatial perception capabilities and amalgamate features at varying scales and levels, thus boosting detection precision. Extensive experiments on widely utilized datasets, including the State Farm dataset and Bayonet-Drivers, demonstrated the remarkable performance of the proposed DGMB-Net.
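As an illustration of the teacher-student idea mentioned above (not DGMB-Net's actual training procedure), a minimal pseudo-labeling step in PyTorch could look like the following; the confidence threshold, the cross-entropy loss, and the toy models in the usage lines are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def pseudo_label_step(teacher, student, optimizer, unlabeled_batch, threshold=0.9):
    """Generic teacher-student pseudo-labeling step (illustrative only):
    the teacher labels unlabeled images and the student trains on the
    confident ones. Threshold and loss choice are assumptions."""
    teacher.eval()
    with torch.no_grad():
        probs = F.softmax(teacher(unlabeled_batch), dim=1)
        conf, pseudo = probs.max(dim=1)
        keep = conf >= threshold                  # keep only confident pseudo-labels
    if keep.sum() == 0:
        return 0.0
    student.train()
    logits = student(unlabeled_batch[keep])
    loss = F.cross_entropy(logits, pseudo[keep])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# toy usage with placeholder linear models
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.SGD(student.parameters(), lr=0.01)
print(pseudo_label_step(teacher, student, opt, torch.randn(8, 3, 32, 32), threshold=0.1))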
... Various solutions based on deep learning lead to high levels of accuracy but often demand substantial computational resources and may not always meet real-time detection requirements [4], [5]. For example, lightweight convolutional neural networks might trade off accuracy in certain scenarios [6]. Temporal-spatial double-line deep learning networks with causal AND-OR graphs showed promising results in continuous recognition [7]. ...
Conference Paper
Full-text available
Every year, 2.5 million car crashes involve distracted drivers globally. A crash can occur only a few seconds after the driver becomes distracted. Distracted driving thus poses a critical threat to road safety, calling for innovative approaches to its detection and mitigation. This paper introduces a novel system to monitor in-car conversations and identify potential distractions from escalating arguments. The system analyzes Mel spectrograms generated from real-time audio signals containing in-car discussions by combining continuous voice recording and deep learning techniques. First, a denoiser employs a convolutional autoencoder to reduce car engine noise within the spectrograms. Then, a classifier uses convolutional and recurrent neural networks to determine whether the audio corresponds to a calm conversation or a quarrel based on the denoised spectrogram. The experimental results showed that the system achieved a 91.8% classification accuracy. This system addresses a previously unexplored dimension of cognitive distraction, offering valuable insights into strategies for reducing the risk of road accidents. Ongoing research is focused on accounting for other environmental noises, such as radio speakers, music, wind from open windows, and engine sounds from surrounding vehicles, which may influence classification accuracy. The system is also being extended to consider more than two occupants in the car.
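To make the audio front end concrete, here is a minimal sketch of how a log-Mel spectrogram can be computed with torchaudio before any denoising or classification; the sample rate, FFT size, hop length, and number of Mel bands are assumptions rather than the paper's settings.

import torch
import torchaudio

# Illustrative front end only: convert a mono waveform into a log-Mel
# spectrogram of the kind such a denoiser/classifier could consume.
sample_rate = 16000
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate,
                                            n_fft=1024, hop_length=256, n_mels=64)
waveform = torch.randn(1, sample_rate * 5)     # 5 s of placeholder audio
log_mel = torch.log(mel(waveform) + 1e-6)      # (1, 64, time) log-Mel spectrogram
print(log_mel.shape)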
Article
Full-text available
Distracted behavior detection is an important task in computer-assisted driving. Although deep learning has made significant progress in this area, it is still difficult to meet the requirements of the real-time analysis and processing of massive data by relying solely on local computing power. To overcome these problems, this paper proposes a driving distraction detection method based on cloud–fog computing architecture, which introduces scalable modules and a model-driven optimization based on greedy pruning. Specifically, the proposed method makes full use of cloud–fog computing to process complex driving scene data, solves the problem of local computing resource limitations, and achieves the goal of detecting distracted driving behavior in real time. In terms of feature extraction, scalable modules are used to adapt to different levels of feature extraction to effectively capture the diversity of driving behaviors. Additionally, in order to improve the performance of the model, a model-driven optimization method based on greedy pruning is introduced to optimize the model structure to obtain a lighter and more efficient model. Through verification experiments on multiple driving scene datasets such as LDDB and StateFarm, the effectiveness of the proposed driving distraction detection method is demonstrated.
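As a rough illustration of greedy pruning (not the paper's model-driven procedure), the sketch below zeroes out the convolution filters with the smallest L1 norms; the pruning ratio and the norm-based criterion are assumptions.

import torch
import torch.nn as nn

def greedy_filter_prune(conv: nn.Conv2d, ratio: float = 0.3):
    """Illustrative greedy pruning: zero out the conv filters with the
    smallest L1 norms. Ratio and criterion are assumptions only."""
    with torch.no_grad():
        norms = conv.weight.abs().sum(dim=(1, 2, 3))   # one L1 norm per output filter
        n_prune = int(ratio * norms.numel())
        idx = torch.argsort(norms)[:n_prune]           # greedily pick the weakest filters
        conv.weight[idx] = 0.0
        if conv.bias is not None:
            conv.bias[idx] = 0.0
    return idx

pruned = greedy_filter_prune(nn.Conv2d(16, 32, 3), ratio=0.25)
print(len(pruned))  # 8 filters zeroed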
Article
Full-text available
Accurate recognition of driver distraction is significant for the design of human-machine cooperative driving systems. Existing studies mainly focus on classifying varied distracted driving behaviors, which depend heavily on the scale and quality of datasets and only detect discrete distraction categories. Therefore, most data-driven approaches have limited capability of recognizing unseen driving activities and cannot provide a reasonable solution for downstream applications. To address these challenges, this paper develops a vision Transformer-enabled weakly supervised contrastive (W-SupCon) learning framework, in which distracted behaviors are quantified by calculating their distances from the normal driving representation set. A Gaussian mixture model (GMM) is employed for representation clustering, which centralizes the distribution of the normal driving representation set to better identify distracted behaviors. A novel driver behavior dataset and three existing ones are employed for the evaluation. Experimental results demonstrate that the proposed approach achieves more accurate and robust performance than existing methods in the recognition of unknown driver activities. Furthermore, the rationality of distraction levels for different driving behaviors is evaluated through driver skeleton poses. The constructed dataset and demo videos are available at https://yanghh.io/Driver-Distraction-Quantification .
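A minimal sketch of the GMM-based idea, under the assumption that distraction is scored by how unlikely an embedding is under a mixture fitted to normal-driving representations; the embedding dimension, number of components, and random placeholder data are illustrative.

import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a GMM on embeddings of normal driving and score new embeddings by
# negative log-likelihood, so larger scores suggest behaviour further
# from the normal-driving set (illustrative scoring only).
rng = np.random.default_rng(0)
normal_embeddings = rng.normal(size=(500, 128))        # placeholder for encoder outputs
gmm = GaussianMixture(n_components=4, covariance_type="diag").fit(normal_embeddings)

new_embeddings = rng.normal(loc=0.5, size=(10, 128))
distraction_score = -gmm.score_samples(new_embeddings)  # higher = less like normal driving
print(distraction_score.round(2))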
Article
Driver distraction causes a significant number of traffic accidents every year, resulting in economic losses and casualties. Currently, the level of automation in commercial vehicles is far from completely unmanned, and drivers still play an important role in operating and controlling the vehicle. Therefore, driver distraction behavior detection is crucial for road safety. At present, driver distraction detection primarily relies on traditional convolutional neural networks (CNN) and supervised learning methods. However, there are still challenges such as the high cost of labeled datasets, limited ability to capture high-level semantic information, and weak generalization performance. To solve these problems, this paper proposes a new self-supervised learning method based on masked image modeling for driver distraction behavior detection. Firstly, a self-supervised learning framework for masked image modeling (MIM) is introduced to address the heavy labor and material costs of dataset labeling. Secondly, the Swin Transformer is employed as an encoder. Performance is enhanced by reconfiguring the Swin Transformer block and adjusting the distribution of window multi-head self-attention (W-MSA) and shifted window multi-head self-attention (SW-MSA) detection heads across all stages, which makes the model more lightweight. Finally, various data augmentation strategies are used along with the best random masking strategy to strengthen the model's recognition and generalization ability. Test results on a large-scale driver distraction behavior dataset show that the proposed self-supervised learning method achieves an accuracy of 99.60%, approximating the excellent performance of advanced supervised learning methods. Our code is publicly available at github.com/Rocky1salady-killer/SL-DDBD.
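To illustrate random masking for masked image modeling in general terms (not the paper's best strategy), the following sketch zeroes out a random subset of non-overlapping patches; the patch size and mask ratio are assumptions.

import torch

def random_patch_mask(images, patch=16, mask_ratio=0.6):
    """Illustrative random masking: zero out a random subset of
    non-overlapping patches in each image."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    n_patches = gh * gw
    n_mask = int(mask_ratio * n_patches)
    masked = images.clone()
    for i in range(b):
        idx = torch.randperm(n_patches)[:n_mask]
        for j in idx.tolist():
            r, col = divmod(j, gw)
            masked[i, :, r * patch:(r + 1) * patch, col * patch:(col + 1) * patch] = 0.0
    return masked

out = random_patch_mask(torch.randn(2, 3, 224, 224))
print(out.shape)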
Article
Full-text available
Abnormal driving behaviour is a leading cause of serious traffic accidents threatening human life and public property globally. In this paper, we investigate the use of a deep learning approach to automatically recognize driving behaviour (such as normal driving, driving with hands off the wheel, calling, playing with a mobile phone, smoking, and talking with passengers) in a single image. The task of driving behaviour recognition can be regarded as a multi-class classification problem, and we address this problem from two aspects in our study: (1) employing a multi-stream CNN to extract multi-scale features by filtering images with receptive fields of different kernel sizes, and (2) investigating different fusion strategies to combine the multi-scale information and generate the final decision for driving behaviour recognition. The effectiveness of our proposed method is validated by extensive experiments carried out on our self-created simulated driving behaviour dataset, as well as a real driving behaviour dataset, and the experimental results demonstrate that the proposed multi-stream CNN-based method achieves significant performance improvements compared with the state of the art.
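A minimal sketch of the multi-stream idea with late fusion by concatenation; the number of streams, kernel sizes, stream depth, and the six-class output are illustrative assumptions rather than the paper's architecture.

import torch
import torch.nn as nn

class MultiStreamCNN(nn.Module):
    """Illustrative multi-stream feature extractor: parallel conv streams
    with different kernel sizes, fused by concatenation before the
    classifier (all details are assumptions)."""
    def __init__(self, n_classes=6):
        super().__init__()
        self.streams = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 16, k, padding=k // 2), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1))
            for k in (3, 5, 7)                     # one stream per receptive-field size
        ])
        self.fc = nn.Linear(16 * 3, n_classes)

    def forward(self, x):
        feats = [s(x).flatten(1) for s in self.streams]
        return self.fc(torch.cat(feats, dim=1))   # late fusion by concatenation

logits = MultiStreamCNN()(torch.randn(4, 3, 128, 128))
print(logits.shape)  # torch.Size([4, 6])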
Article
Intelligent vehicles and advanced driver assistance systems (ADAS) need to have proper awareness of the traffic context, as well as the driver status, since ADAS share vehicle control authority with the human driver. This paper provides an overview of ego-vehicle driver intention inference (DII), focusing mainly on lane change intention on highways. First, the human intention mechanism is discussed to give an overall understanding of driver intention. Next, the ego-vehicle driver intention is classified into different categories based on various criteria. A complete DII system can be separated into different modules, which consist of the traffic context awareness, driver state monitoring, and vehicle dynamics measurement modules. The relationship between these modules and their corresponding impacts on DII are analyzed. Then, the lane change intention inference system is reviewed from the perspective of input signals, algorithms, and evaluation. Finally, future concerns and emerging trends in this area are highlighted.
Article
A large number of studies have shown that most vehicle collisions are caused by drivers' abnormal operations. To ensure the safety of all people on the road network as much as possible, it is crucial to be able to predict the drivers' driving safety risks in real time. In this paper, we propose a novel cost-sensitive L1/L2-nonnegativity-constrained deep autoencoder network for driving safety risk prediction. Unfortunately, with existing research methods, the size of the sliding time window is too large, the feature extraction is relatively subjective, and class imbalances occur, which leads to low identification accuracy, long prediction times, and poor applicability. We first propose using a three-layer L1/L2-nonnegativity-constrained autoencoder to adaptively search the optimal size of the sliding window and then construct a deep L1/L2-nonnegativity-constrained autoencoder network to automatically extract the hidden features of the driving behaviors. Finally, we build a new L1/L2-nonnegativity-constrained focal loss classifier to predict the driving behaviors under different safety risk levels. The results from the public 100-Car naturalistic driving study dataset indicate that our method can effectively find the optimal window size, reduce the data volume and reconstruction error, and extract more distinctive features. Furthermore, this method effectively curbs the class imbalance, improves the driving safety risk prediction performance, reduces overfitting, shortens the prediction time, and improves the timeliness.
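One simple way to read the L1/L2 nonnegativity constraint is as a penalty on the negative part of the weights added to the reconstruction loss; the sketch below implements that reading with a toy autoencoder, and the penalty coefficients and layer sizes are assumptions, not the paper's formulation.

import torch
import torch.nn as nn

def nonneg_penalty(model, l1=1e-4, l2=1e-4):
    """Illustrative L1/L2 nonnegativity penalty: punish the negative part
    of the weights so training pushes them towards nonnegative values."""
    p1 = p2 = 0.0
    for w in model.parameters():
        neg = torch.relu(-w)              # only negative entries contribute
        p1 = p1 + neg.sum()
        p2 = p2 + (neg ** 2).sum()
    return l1 * p1 + l2 * p2

ae = nn.Sequential(nn.Linear(30, 8), nn.ReLU(), nn.Linear(8, 30))  # toy autoencoder
x = torch.randn(64, 30)
loss = nn.functional.mse_loss(ae(x), x) + nonneg_penalty(ae)       # reconstruction + penalty
loss.backward()
print(float(loss))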
Chapter
Currently, neural network architecture design is mostly guided by the indirect metric of computation complexity, i.e., FLOPs. However, the direct metric, e.g., speed, also depends on other factors such as memory access cost and platform characteristics. Thus, this work proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs. Based on a series of controlled experiments, this work derives several practical guidelines for efficient network design. Accordingly, a new architecture is presented, called ShuffleNet V2. Comprehensive ablation experiments verify that our model is the state of the art in terms of the speed-accuracy tradeoff.
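In the same spirit, a direct speed measurement on the machine at hand can be as simple as timing forward passes; the model choice (torchvision's ShuffleNet V2), input size, and repeat counts below are arbitrary assumptions.

import time
import torch
import torchvision.models as models

# Measure wall-clock latency on this platform instead of relying on FLOPs.
model = models.shufflenet_v2_x1_0(weights=None).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    for _ in range(5):                      # warm-up runs
        model(x)
    t0 = time.perf_counter()
    for _ in range(50):
        model(x)
    t1 = time.perf_counter()

print(f"mean latency: {(t1 - t0) / 50 * 1000:.1f} ms")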
Conference Paper
Deep neural networks have witnessed great success in various real applications, but they require a large amount of labeled data for training. In this paper, we propose tri-net, a deep neural network that is able to use massive unlabeled data to help learning with limited labeled data. We consider model initialization, diversity augmentation, and pseudo-label editing simultaneously. In our work, we utilize output smearing to initialize modules, use fine-tuning on labeled data to augment diversity, and eliminate unstable pseudo-labels to alleviate the influence of suspicious pseudo-labeled data. Experiments show that our method achieves the best performance in comparison with state-of-the-art semi-supervised deep learning methods. In particular, it achieves an 8.30% error rate on CIFAR-10 using only 4000 labeled examples.
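As a loose illustration of discarding unstable pseudo-labels (not tri-net's actual three-module procedure), the sketch below keeps a pseudo-label only when two modules agree and are both confident; the threshold and the two-module setup are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def agreed_pseudo_labels(m1, m2, x, threshold=0.9):
    """Illustrative agreement filter in the spirit of tri-training: two
    modules label data for a third only when they agree and are both
    confident (details are assumptions)."""
    with torch.no_grad():
        p1 = F.softmax(m1(x), dim=1)
        p2 = F.softmax(m2(x), dim=1)
        c1, y1 = p1.max(dim=1)
        c2, y2 = p2.max(dim=1)
        keep = (y1 == y2) & (c1 >= threshold) & (c2 >= threshold)
    return x[keep], y1[keep]

# toy usage with placeholder linear models
m1 = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
m2 = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
xs, ys = agreed_pseudo_labels(m1, m2, torch.randn(16, 3, 32, 32), threshold=0.05)
print(xs.shape, ys.shape)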