Article title: Investigating the Influential Factors for Practical Application of Multiclass Vehicle Detection for Images from Unmanned Aerial Vehicle Using Deep Learning Models

Journal title: Transportation Research Record

Paper history: Submitted 1st August 2019
Revised 10th March 2020
Accepted 2nd May 2020
Published online 16th October 2020
Published 1st December 2020

Funding: Ministry of Science and ICT, Republic of Korea (NRF-2019R1H1A1080045)

DOI information: https://doi.org/10.1177/0361198120954187

---------------------------
Investigating the Influential Factors for Practical Application of Multiclass Vehicle Detection for Images from Unmanned Aerial Vehicle Using Deep Learning Models

Seung Woo Ham
Department of Civil and Environmental Engineering
Seoul National University, Gwanak-gu, Seoul, Republic of Korea, 08826
Email: seungwoo.ham@snu.ac.kr

Ho-Chul Park
Department of Transportation Engineering
Myongji University, Yongin, Kyunggi, Republic of Korea, 17058
Email: hcpark@mju.ac.kr

Eui-Jin Kim
Department of Civil and Environmental Engineering
Seoul National University, Gwanak-gu, Seoul, Republic of Korea, 08826
Email: kyjcwal@snu.ac.kr

Seung-Young Kho
Department of Civil and Environmental Engineering
Seoul National University, Gwanak-gu, Seoul, Republic of Korea, 08826
Email: sykho@snu.ac.kr

Dong-Kyu Kim, Corresponding Author
Department of Civil and Environmental Engineering and Institute of Construction and Environmental Engineering
Seoul National University, Gwanak-gu, Seoul, Republic of Korea, 08826
Email: dongkyukim@snu.ac.kr

Word Count: 6,436 words + 4 tables = 7,436 words

Call for Papers: Collection and Application of Quality Traffic Data (ABJ35)
ABSTRACT
Traffic density, which is a critical measure in traffic operations, should be collected precisely at various locations and times to reflect site-specific spatiotemporal characteristics. For detailed analysis, heavy vehicles have to be separated from ordinary vehicles, since they have a significant effect on traffic flow as well as traffic safety. With unmanned aerial vehicles (UAVs), we can easily acquire video for vehicle detection by collecting images from above the traffic without any disturbances. Despite previous studies on vehicle detection, there is still a lack of research on real-world applications in estimating traffic density. In this study, we investigate the effects of influential factors, namely the size of objects, the number of samples, and the combination of datasets, on detecting multi-class vehicles in various UAV images using deep learning models. We compare three detection models, Faster Region-based Convolutional Neural Networks (Faster R-CNN), Region-based Fully Convolutional Network (R-FCN), and Single-Shot Detector (SSD), to suggest guidelines for model selection. The results provided several findings: 1) vehicle detection from UAV images showed sufficient performance with a small number of samples and small objects; 2) deep learning-based multi-class vehicle detectors can have advantages compared with single-class detectors; 3) among all the models, SSD showed the best performance because of its algorithmic structure; and 4) simply combining datasets from different environments cannot guarantee performance improvement. Based on our findings, we provide practical guidelines for estimating multi-class traffic density using UAVs.

Keywords: Vehicle Detection, Deep Learning, Unmanned Aerial Vehicle, Reproducibility in Practice, Single-Shot Detector
INTRODUCTION
Traffic phenomena in congestion, such as traffic oscillation, traffic breakdown, and capacity drop, result not only in traveler delays that reduce system-wide efficiency but also in increased crash potential. Many studies have unveiled the mechanisms that trigger those phenomena and developed measures to capture them (1, 2). Their efforts demonstrated that traffic density (i.e., the inverse of average vehicle spacing) is the most critical measure. To diagnose and analyze congestion phenomena, traffic density should be collected precisely at various locations and times to reflect site-specific characteristics. Meanwhile, in traffic flow analysis, heavy vehicles have a significant effect on traffic congestion as well as traffic safety because of their physical characteristics, e.g., heavy weight, large size, and maneuvering limitations (3-5). For precise analysis, therefore, heavy vehicles should be considered, but it is costly to obtain the density of heavy vehicles separately from that of ordinary vehicles.

Based on vehicle detector systems, aerial images, or surveillance cameras, many studies have attempted to obtain vehicle densities (6-8). Although they showed the possibility of high performance for collecting traffic density, improvements are still needed for real-world congestion management (9). This is because the approaches were cost-ineffective: installing surveillance cameras at many points or collecting high-resolution aerial images at different times is expensive. Also, most studies focused on detecting only ordinary vehicles. Furthermore, in the case of a surveillance camera, the images cannot be taken vertically from above the traffic as with a UAV, which causes vehicles to overlap in images of congested traffic.

Recently, unmanned aerial vehicles (UAVs) have been proposed to mitigate this inefficiency owing to their mobility, cost-effectiveness, wide field of view, and ability to hover (stationary flight) (6, 7, 10). UAVs can easily obtain high-resolution images from above traffic at low altitude, and only simple camera calibration is required to acquire a clear image and correct the geometric distortion (10, 11). The major drawback of images from a UAV is that vehicle features are represented by a small number of pixels. This can be further exacerbated in congested traffic, where shadows partially occlude vehicles or adjacent vehicles are detected as one; this significantly reduces the accuracy of the collected traffic density (12). Conventional approaches for detecting vehicles, such as background subtraction, blob analysis, and optical flow, are vulnerable to these difficulties because they cannot robustly detect the exact bounding boxes surrounding the vehicles (13, 14).

With the development of computer vision and deep learning, supervised learning-based vehicle detection methods have been proposed to collect accurate traffic density even in congestion. Until recently, combining feature representation and learning algorithms was the main approach for detecting vehicles in UAV images (8, 11, 15). Because those studies used generic features for object detection instead of features customized for vehicles in UAV images (8, 16), efficiency and accuracy can be further enhanced.

As the convolutional neural network (CNN) (17) achieved great success in image classification, many researchers have recently focused on vehicle detection using CNNs. These deep learning structures automatically create features from the images (18), and those features showed better performance for vehicle detection in UAV images than generic features (19). In particular, combining a CNN with bounding box regression (20), called region-based CNN (R-CNN) (21), allows the location of vehicles to be precisely specified by a bounding box, which drastically improves the performance of CNN-based object detection. Faster R-CNN (22), the enhanced version of R-CNN, performed real-time vehicle detection with high accuracy (19). In addition, a variety of advanced methodologies, such as the Region-based Fully Convolutional Network (R-FCN) and the Single-Shot Detector (SSD), have been applied to measure vehicle density accurately.

Despite many methodological studies on vehicle detection, there is still a lack of experimental research for real-world applications. For example, multi-class vehicle detection, which distinguishes vehicles from heavy vehicles, is essential for analyzing congestion because of their different impacts on traffic conditions (23, 24). Regarding performance, validation in various environments is required to evaluate the robustness of a detector, because detection performance can vary greatly depending on the characteristics of the image, e.g., image resolution, lighting conditions, and geometric features of the road. However, detailed analysis in various environments has not yet appeared in previous studies (19, 25, 26).

In this paper, we investigate the effects of influential factors, i.e., the size of a vehicle in the image (small objects), the number of samples, and the combination of datasets, on detecting multi-class vehicles using deep learning models. In addition, we compare three models, i.e., Faster R-CNN, R-FCN, and SSD, which are modern deep learning object detection models (27, 28). Based on the results, we provide practical guidelines for multi-class vehicle detection.

The remainder of this paper is organized as follows. First, we present a literature review of vehicle detection. In the next section, we discuss the deep learning methodologies of this study. Then, we describe the datasets and measures of effectiveness used in the study. We then show the model estimation results and discuss our findings. Lastly, we conclude the study and provide guidelines for multi-class vehicle detection.
LITERATURE REVIEW
Vehicle detection using aerial and UAV images is becoming popular because of its maneuverability and promising results. Among the vast literature, we have categorized important recent studies by the methodologies they develop: edge and blob detection, machine learning, and deep learning.

Because unsupervised methods such as edge detection and blob detection require relatively little computational power compared with other methods, they have been widely used for real-time detection. Azevedo et al. and Khan et al. used a background subtraction approach and blob analysis to target vehicles in uncongested, free-flow images (10, 29). Ke et al. detected vehicles in UAV traffic video using Shi-Tomasi features (7). These previous works showed that unsupervised methods can be applied in real time without training and perform appropriately in free flow and at urban intersections. However, the detection was not validated in congested situations, where the features used in those methods are reported to be susceptible to image conditions.

To improve the robustness of detection performance in complex environments, machine learning methods are widely used. Elmikaty and Stathaki trained support vector machines (SVMs) with hand-crafted features such as gradient, color, and texture (30). Gleason et al. used the histogram of gradients (HoG) and the histogram of Gabor coefficients as features and tested various detectors, such as k-nearest neighbors (k-NN), random forests, and SVM (31). However, these detection models were evaluated in a single environment, so their performance can drop in other situations, which limits practical use. Moreover, the fact that hand-crafted features are designed for general objects, not for vehicles in UAV images, also limits performance.

Deep learning methods have a significant advantage over other methods in that they automatically learn features from the image (17). Xu et al. used Faster R-CNN with a VGG16 network to train a vehicle detector on UAV images (19). The authors also showed that the Faster R-CNN method is robust to image orientation, compared with a Viola-Jones object detection scheme and a linear SVM with HoG features. In that study, however, it is difficult to know how good the detection performance actually is, because no detailed information about the evaluation metric is given.

While the mainstream of vehicle detection has focused on binary (vehicle versus background) detection, some research has addressed multi-class vehicle detection. Tang et al. trained a CNN and a cascade of boosted classifiers on UAV images to detect two vehicle classes, with images gathered from different roads in the daytime. The result showed that the deep learning method works better than conventional machine learning techniques (25). Liu and Mattyus applied a binary detector using a soft-cascade structure with integral channel features and classified the results into multiple classes with an aggregated classifier (8). Li et al. trained an R-CNN network on high-resolution aerial images to detect vehicles in multiple classes. After detection by a binary vehicle detector, each vehicle was classified into four classes, and the station wagon showed the highest detection performance with 2,302 training samples (32). However, these studies have the limitation of adding one more stage for classification after detecting vehicles, which induces not only a longer detection time but also a performance drop, because errors occur independently in each stage.

Multi-class object detection based solely on deep learning can set the number of classes from the beginning of the training stage. Although previous attempts have developed and evaluated multi-class detection models for generic objects (33), no research has provided a reproducible guideline for detecting vehicles and heavy vehicles in UAV images. In other words, each machine learning model for object detection has different strengths and weaknesses depending on the type of object (19, 30, 31), so a model that is suitable for vehicle detection in UAV images needs to be selected for practical use. Several studies have attempted to apply vehicle detection under specific environmental conditions of the training images, but detailed performance analysis across conditions such as lighting, the ratio of heavy vehicles, and image resolution is lacking (34). Therefore, the type of model architecture that performs well for vehicle detection in UAV images, and the impact of influential factors on the performance of the methods, should be investigated.

Since deep learning methods have strengths in image recognition for traffic images, as well as in various other fields (17), we focus on investigating an effective deep learning architecture for vehicle detection in UAV images, as well as the influential factors affecting detection performance. Although deep learning methods need greater computational power than traditional methods, they can still be used for real-time detection (22). Notably, a one-stage deep learning algorithm such as SSD has a faster running time than other algorithms because of its lower complexity (27).
DEEP LEARNING METHODOLOGIES
In this paper, we used three state-of-the-art object detectors: Faster R-CNN, SSD, and R-FCN. We adjusted the hyperparameters of each detector to obtain the best performance in vehicle detection. All three methods share common hyperparameters. The default sliding window (anchor box) was specified as 128×64 pixels, the scale of the sliding window was varied from 0.25 to 4.00, and the aspect ratio was varied from 1.0 to 4.0. The following paragraphs introduce the core idea of each detector.
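To make these shared hyperparameters concrete, the following sketch (our illustration, not the authors' code) enumerates anchor boxes derived from the 128×64 base window; the discrete scale and ratio values chosen within the stated ranges are assumptions.

```python
import itertools

# Base sliding window (anchor box) of 128x64 pixels, as stated above.
BASE_W, BASE_H = 128, 64

# Discrete values are assumptions; the paper only gives the ranges 0.25-4.00 and 1.0-4.0.
SCALES = [0.25, 0.5, 2.0, 4.0]   # four scales
RATIOS = [1.0, 2.0, 4.0]         # three height:width ratios

def generate_anchors(cx, cy):
    """Return (x_min, y_min, x_max, y_max) anchors centered on (cx, cy)."""
    anchors = []
    for scale, ratio in itertools.product(SCALES, RATIOS):
        w = BASE_W * scale
        h = BASE_H * scale * ratio
        anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

if __name__ == "__main__":
    boxes = generate_anchors(1920, 1080)   # one location near the image center
    print(len(boxes))                      # 12 windows: 4 scales x 3 ratios
    print([round(v, 1) for v in boxes[0]])
```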
Faster R-CNN is the third version of the R-CNN architecture. Figure 1 shows the structure of Faster R-CNN. The R-CNN architecture extracts regions that are likely to contain an object, i.e., region proposals, from the image and classifies whether each proposal contains an object or not. Among the region proposals, the more plausible proposals are selected as regions of interest (RoIs). The early version of R-CNN extracted RoIs using an algorithm called selective search. Each RoI is fed into a CNN, which transforms the RoI into a feature vector. The output feature vectors are then used for classification by an SVM, and the coordinates of the detection are adjusted by bounding box regression.

Faster R-CNN speeds up the region proposal process by extracting RoIs from the output feature map. This stage is called the "region proposal network" (RPN). In the RPN, a sliding window method is used to find the RoIs. Twelve sliding windows (three different ratios and four different scales) are applied at each point. Each RoI is then evaluated using an integrated loss, which contains a classification loss and a bounding box regression loss, as described in Equation 1:

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)    (1)

Here, N_cls is the mini-batch size, and N_reg is the number of sliding windows. The index i identifies an individual sliding window. The classification loss L_cls, which uses the cross-entropy loss, and the bounding box regression loss L_reg, which uses the smooth L1 loss, are calculated for each sliding window. p_i is the predicted probability that sliding window i contains an object, and p_i* is the binary indicator that represents the ground truth of p_i. t_i and t_i* are the adjusted coordinates of the predicted bounding box and the ground-truth bounding box, respectively. The weight between the classification loss and the bounding box regression loss is controlled by λ.
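As a minimal numerical sketch of Equation 1 (ours, not the authors' implementation), the following code evaluates the RPN loss for randomly generated windows, assuming binary cross-entropy for L_cls and the smooth L1 loss for L_reg as described above; the mini-batch size and example values are assumptions.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss used for bounding box regression."""
    absx = np.abs(x)
    return np.where(absx < 1.0, 0.5 * x ** 2, absx - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam=1.0):
    """Equation 1: classification loss + lambda-weighted box regression loss.

    p        : predicted objectness probability per sliding window, shape (N,)
    p_star   : ground-truth label per window (1 = object, 0 = background), shape (N,)
    t, t_star: predicted / ground-truth box offsets, shape (N, 4)
    """
    eps = 1e-7
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    # The regression loss only counts for windows that actually contain an object (p_star = 1).
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = rng.uniform(0.01, 0.99, 256)                 # 256 sampled windows (assumed mini-batch)
    p_star = (rng.uniform(size=256) < 0.3).astype(float)
    t = rng.normal(size=(256, 4))
    t_star = rng.normal(size=(256, 4))
    print(round(rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400), 3))
```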
The Region-based Fully Convolutional Network (R-FCN) is an attempt to solve the detection problem like a classification problem. Classification is translation invariant, which means the result does not change when the image is translated. On the other hand, detection is translation variant, since its output changes if the image is shifted or enlarged. Because the classification problem is easier than the detection problem, there are many advantages to using the properties of classification in detection.

R-FCN introduces a position-sensitive score map that contains the relative location information of the components of an object. The position-sensitive score map learns the arrangement of components within the object at the training stage and uses it at the detection stage. For example, in the case of detecting a human face, the position-sensitive score map learns that the nose will be in the center and the mouth will be at the bottom of the face. Using this knowledge embedded in the position-sensitive score map, the detector searches for a face that has a nose in the center and a mouth at the bottom.

R-FCN splits the length and breadth of the object into k parts each, so R-FCN searches for k × k components of the object. When the detector is trained for C categories, it creates score maps that can categorize a total of (C + 1) categories, including the background. Thus, the total number of channels becomes k²(C + 1). Figure 2(a) depicts the structure of R-FCN.
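The position-sensitive idea can be illustrated with a small NumPy sketch (a simplification we added, not the R-FCN implementation): for a grid size k and C classes, k²(C + 1) score maps are pooled bin by bin over an RoI and then averaged to vote for a class. The channel ordering and RoI coordinates below are assumptions.

```python
import numpy as np

def psroi_pool(score_maps, roi, k, num_classes):
    """Position-sensitive RoI pooling (simplified, integer grid, average pooling).

    score_maps : array of shape (k*k*(num_classes+1), H, W)
    roi        : (x0, y0, x1, y1) in feature-map coordinates
    Returns per-class scores of length num_classes+1 (background included).
    """
    x0, y0, x1, y1 = roi
    xs = np.linspace(x0, x1, k + 1).astype(int)   # bin boundaries along x
    ys = np.linspace(y0, y1, k + 1).astype(int)   # bin boundaries along y
    scores = np.zeros(num_classes + 1)
    for i in range(k):            # vertical bin index
        for j in range(k):        # horizontal bin index
            bin_idx = i * k + j
            for c in range(num_classes + 1):
                ch = bin_idx * (num_classes + 1) + c   # assumed channel layout
                patch = score_maps[ch, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                scores[c] += patch.mean()
    return scores / (k * k)       # vote: average over the k*k bins

if __name__ == "__main__":
    k, C = 3, 2                                  # e.g., vehicle and heavy vehicle
    maps = np.random.default_rng(1).random((k * k * (C + 1), 60, 60))
    print(psroi_pool(maps, roi=(10, 12, 40, 30), k=k, num_classes=C).round(3))
```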
The Single-Shot Detector (SSD), depicted in Figure 2(b), is a one-stage detection framework with a simple structure compared with Faster R-CNN and R-FCN. While those two-stage detection frameworks contain a pre-processing stage such as a region proposal network, a one-stage detection framework performs region proposal and detection simultaneously. SSD uses feature maps from several convolutional layers. As an image goes through the convolutional layers, the output feature maps represent increasingly complex components. At the initial layers, the output feature map represents components with low complexity, and at the final layers, components with high complexity are represented. Also, the same area in each feature map corresponds to a different area of the original image: the later the layer, the larger the area. This enables SSD to target multiple objects with a variety of complexities and sizes in one image. The later feature maps detect large, complex objects, and the initial feature maps detect small, simple objects. The scale of a default bounding box is set as in Equation 2:

s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m]    (2)

Here, m indicates the total number of feature maps that are used, and k indicates the index of the current feature map. s_max and s_min are set to 0.9 and 0.2, respectively. The aspect ratios of the default bounding boxes are prescribed as {1, 2, 3, 1/2, 1/3}.
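Equation 2 can be reproduced with a few lines of Python (our illustration). Deriving the box width and height from the scale and aspect ratio as w = s√a and h = s/√a follows the original SSD formulation and is an assumption about the exact parameterization used here.

```python
import math

S_MIN, S_MAX = 0.2, 0.9
ASPECT_RATIOS = [1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0]

def default_box_scales(m):
    """Equation 2: linearly spaced scales for feature maps k = 1..m."""
    return [S_MIN + (S_MAX - S_MIN) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def default_boxes(m):
    """Return (scale, aspect_ratio, width, height) relative to the input image."""
    boxes = []
    for s in default_box_scales(m):
        for a in ASPECT_RATIOS:
            boxes.append((round(s, 3), a, round(s * math.sqrt(a), 3), round(s / math.sqrt(a), 3)))
    return boxes

if __name__ == "__main__":
    print([round(s, 3) for s in default_box_scales(6)])  # 0.2, 0.34, 0.48, 0.62, 0.76, 0.9
    print(default_boxes(6)[0])
```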
Default bounding boxes are created for each feature map, and each box carries four coordinates and c prediction scores, one per class. The bounding boxes are then evaluated by the same loss function that is used in Faster R-CNN. In the paper in which SSD was introduced, the authors defined the loss function as the weighted sum of a confidence loss and a localization loss; however, these correspond to the classification loss and the bounding box regression loss, respectively. If several bounding boxes indicate the same object, only one bounding box is selected by non-maximum suppression (NMS).
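Non-maximum suppression can be summarized with the short sketch below (a generic greedy NMS we added, not the authors' exact implementation); the overlap threshold of 0.5 is an assumption.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

if __name__ == "__main__":
    boxes = [(100, 100, 200, 160), (105, 98, 205, 162), (300, 300, 360, 420)]
    print(nms(boxes, scores=[0.9, 0.8, 0.95]))   # -> [2, 0]: the duplicate box is suppressed
```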
DESCRIPTION OF THE DATASET AND RESEARCH FRAMEWORK
Figure 3 illustrates the overall framework of the research, including data collection, data labeling, and model evaluation. We compared the three advanced deep learning architectures for vehicle detection from UAV images, with detailed performance analysis according to the number of training images, the resolution of the images, and the composition of the dataset.

Four types of video images taken from four different places were used in the study: Cheonho Bridge (CB), Gyeongbu Expressway (GE), Gyeongin Expressway 2 (GE2), and Seohaean Expressway (SE). The four videos can be characterized by environmental conditions affecting performance, such as lighting conditions, shadow, congestion, and surroundings. A vehicle that is longer than 15 m in the image was classified as a heavy vehicle. Examples of the video images and important information about each dataset are shown in Figure 4 and Table 1. The ground-truth data were obtained by manually labeling bounding boxes around the vehicles in each frame. To reduce the labeling effort, we used the user-friendly image labeler provided by MATLAB.

The videos were taken in the vertical direction. The photography was done with a DJI Inspire 1 Pro equipped with a Zenmuse X5 camera, a quadcopter drone with 4K video and a 3-axis gimbal. The resolution of the video was 3840×2160 (25 fps), and a vehicle roughly consisted of 40×100 image pixels in this video. Although the hovering capability of our UAV with the 3-axis gimbal was enough to minimize UAV instability in all environments, an additional stabilization process may be required in harsh conditions. Details about the stabilization process for UAVs are presented in other work (35).

We constructed the training and evaluation datasets for each place, without mixing places. For example, the model that detects vehicles at Cheonho Bridge is trained on images of Cheonho Bridge and evaluated on images of Cheonho Bridge. The impact of a mixed dataset on detection performance is investigated in a later section.
MEASURE OF EFFECTIVENESS
The measure of detection performance is based on the intersection over union (IoU). The IoU, also known as the Jaccard index, is the overlap area divided by the union area of two boxes, the detection box and the ground-truth box. We set a threshold on the IoU to determine whether a detection is true or false. If the IoU of a detection is larger than the threshold, we accept it as a true positive, and vice versa.

Let us say that we detected four vehicles in one image, as in Figure 5 and Table 2. A good detection has an IoU near 1.00, while a bad detection has an IoU near 0. Our example contains four detections with IoUs ranging from 0.22 to 0.92.

The set of true detections varies with the threshold, because a higher threshold means a stricter measure of evaluation. When the threshold equals 0.5, all detections except Detection 3, whose IoU is 0.22, are recognized as true detections. In this case, three of the four detections are true, so the precision becomes 0.75. The recall is 0.60, as three true detections were made out of five ground-truth vehicles. Precision and recall represent the exactness and sensitivity of the model, respectively. The detailed equations, with true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), are as follows:

Precision = \frac{TP}{TP + FP}    (3)

Recall = \frac{TP}{TP + FN}    (4)

F\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}    (5)
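The worked example of Table 2 can be checked with a few lines of Python (our illustration); the four IoU values and the five ground-truth vehicles come from the example above.

```python
def precision_recall_f(ious, n_ground_truth, threshold):
    """Evaluate one image: a detection is a true positive if its IoU reaches the threshold."""
    tp = sum(1 for v in ious if v >= threshold)
    fp = len(ious) - tp                     # detections that did not overlap well enough
    fn = n_ground_truth - tp                # ground-truth boxes that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

if __name__ == "__main__":
    ious = [0.92, 0.55, 0.22, 0.82]          # the four detections of Figure 5 / Table 2
    for t in (0.50, 0.80, 0.90):
        p, r, f = precision_recall_f(ious, n_ground_truth=5, threshold=t)
        print(t, round(p, 2), round(r, 2), round(f, 2))
    # -> 0.5: 0.75 / 0.60 / 0.67   0.8: 0.50 / 0.40 / 0.44   0.9: 0.25 / 0.20 / 0.22
```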
However, when the threshold increases to 0.9, Detection 1 is the only detection that is regarded as true. The threshold can be determined by the intended application of the detection results. If the detections aim to count vehicles for estimating traffic density over road sections, a low threshold is acceptable. However, if the detections aim to count vehicles for each lane or to calculate the spacing between vehicles, a high threshold should be set.
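As a simple illustration of the section-density use case (our example with assumed counts and section length, not data from the paper), per-class density follows directly from the counted detections and the length of road covered by the frame.

```python
def section_density(counts_per_class, section_length_km, n_lanes=None):
    """Traffic density per class in vehicles per km (optionally per lane)."""
    densities = {cls: n / section_length_km for cls, n in counts_per_class.items()}
    if n_lanes:
        densities = {cls: d / n_lanes for cls, d in densities.items()}
    return densities

if __name__ == "__main__":
    # Assumed example: 38 vehicles and 4 heavy vehicles detected over a 0.5 km, 4-lane section.
    print(section_density({"vehicle": 38, "heavy_vehicle": 4}, section_length_km=0.5, n_lanes=4))
    # -> {'vehicle': 19.0, 'heavy_vehicle': 2.0} vehicles per km per lane
```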
Precision is the accuracy of predicting true positives among all of the detected samples, while recall is the proportion of true positives detected among all the ground truth. The F-score reflects both precision and recall, which are in a trade-off relationship with each other. In the detection example above, if the detector had created a bounding box in every spot with even a small chance of containing a vehicle, it would record high recall, as it detects most of the ground-truth data. At the same time, however, most bounding boxes would not have a vehicle inside, so the precision drops. On the other hand, if only objects that are clearly classified as vehicles are recognized as real vehicles, the number of vehicles detected decreases, and the recall decreases at the same time, while the precision surges toward 1. This relationship can be described by the receiver operating characteristic (ROC) curve, which shows the capability of the model to classify objects. We use the F-score because both precision and recall are important in vehicle detection, where accurate detection without false positives and false negatives is essential.
In the field of computer vision, many types of evaluation methods exist for detection problems. For example, the average precision at 0.5 (AP 0.5) is a measure of effectiveness with the IoU threshold set to 0.5. It has been used in vision-based learning in the transportation field (36), but it is a relatively low standard for practical applications such as lane-level traffic density estimation and crash-likelihood estimation by vehicle spacing (37). Recently, the mean average precision [0.50:0.95] (mAP [0.50:0.95]) and the mean average recall [0.50:0.95] (mAR [0.50:0.95]), which take the mean of the average precision and average recall over IoU thresholds from 0.50 to 0.95 with a step size of 0.05, have been widely used in benchmark studies (33). In this study, we used a modified version of them, the mean F-score [0.50:0.95] (mF-score [0.50:0.95]), which calculates the F-score from mAP [0.50:0.95] and mAR [0.50:0.95]. This method can be used as a balanced evaluation measure, as it includes information about both precision and recall:

mF\text{-}score_{[0.50:0.95]} = \frac{2 \cdot mAP_{[0.50:0.95]} \cdot mAR_{[0.50:0.95]}}{mAP_{[0.50:0.95]} + mAR_{[0.50:0.95]}}, \quad mAP_{[0.50:0.95]} = \frac{1}{10}\sum_{IoU=0.50}^{0.95} AP_{IoU}    (6)

where mAR [0.50:0.95] is defined analogously from the recall at each IoU threshold.
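The computation behind Equation 6 can be sketched as follows (our illustration); the AP and AR values in the example are assumed, not taken from the study.

```python
def mean_over_thresholds(values_by_threshold):
    """Average a per-threshold metric over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [0.50 + 0.05 * i for i in range(10)]
    return sum(values_by_threshold[t] for t in thresholds) / len(thresholds)

def mf_score(ap_by_threshold, ar_by_threshold):
    """Equation 6: F-score formed from mAP[0.50:0.95] and mAR[0.50:0.95]."""
    m_ap = mean_over_thresholds(ap_by_threshold)
    m_ar = mean_over_thresholds(ar_by_threshold)
    return 2 * m_ap * m_ar / (m_ap + m_ar) if m_ap + m_ar else 0.0

if __name__ == "__main__":
    thresholds = [0.50 + 0.05 * i for i in range(10)]
    # Assumed AP/AR values that fall off as the IoU threshold becomes stricter.
    ap = {t: 0.95 - 0.6 * (t - 0.50) for t in thresholds}
    ar = {t: 0.90 - 0.7 * (t - 0.50) for t in thresholds}
    print(round(mf_score(ap, ar), 3))
```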
Figure 6 also shows the need for evaluation with the mF-score [0.50:0.95] rather than AP 0.5. The purpose of this evaluation is to reveal the changes in the performance of the SSD detector for each environment. However, Figure 6(a) shows nearly constant performance regardless of the environment when evaluated with AP 0.5. In contrast, when the mF-score [0.50:0.95] is adopted as the evaluation metric, the performance clearly varies depending on the environment. In addition, Figure 6(b) shows that the number of training samples is highly relevant to performance, as the green and yellow lines show an increase in performance, while the rate of increase diminishes once the performance is above a certain level. Using the mF-score [0.50:0.95], which is a strict evaluation method, provided a lucid comparison of various aspects of the detection performance. Borji et al. also suggested that the F-score is the most appropriate evaluation metric for object detection (38). Models that performed well on the F-score metric were also found to perform well on other evaluation metrics.
RESULTS

What is the difference in accuracy between single- and multi-class detection?
If the performance of multi-class detection (i.e., detection of vehicles with classification into vehicle and heavy vehicle) lagged far behind that of single-class detection (i.e., detection of vehicles without classification), then separating vehicles and heavy vehicles should instead be done by single-class detection followed by an additional classification stage.

Table 3 shows the detection results evaluated by the mF-score [0.50:0.95] for single-class and multi-class detection, respectively. For CB, SSD showed an mF-score [0.50:0.95] of 0.861 as a single-class detector and reached 0.893 for multi-class detection. Similar results for the SSD detector were also observed on the GE2 dataset. Out of all six cases, in only two, GE2 detection using Faster R-CNN with an mF-score [0.50:0.95] of 0.702 and CB detection using R-FCN with an mF-score [0.50:0.95] of 0.803, did the single-class detector perform better than the multi-class detector. Even in these cases, the performance difference between the single-class and multi-class detectors was negligible. This means that a multi-class detector properly extracts different feature representations for each vehicle type, vehicle and heavy vehicle. As a result, there was no significant difference in performance between single-class and multi-class detectors, even though the multi-class detector also classified vehicles and heavy vehicles. Given the importance of distinguishing heavy vehicles, there is no reason to use single-class detection.
Which is best among Faster R-CNN, R-FCN, and SSD?
The number of training samples was varied to confirm its relevance to detection performance. Figure 7 shows that performance increases as the number of training samples increases, as common sense dictates. However, performance had already converged with a small number of training samples, near 3,000, in the case of GE and GE2. This shows that, when vehicle detection is performed on UAV images, the deep learning model can be easily saturated with only a small number of training samples. From a practical point of view, these results suggest that UAV-based vehicle detection can be utilized in various environments without excessive labeling work.

Comparing the three models in Table 4, SSD showed the best performance in all cases. The mF-score [0.50:0.95] of SSD was 0.893 for CB, 0.875 for GE, 0.806 for GE2, and 0.694 for SE. SSD also has advantages over the other models in terms of execution speed: while the other models are two-stage, SSD, as a one-stage model, requires less computational power. The performance of SSD in Table 4 is not only the highest among the three models but also higher than that of advanced object detection models from previous research (39). Bodla et al. recorded 0.647 for bus (heavy vehicle) detection and 0.615 for car (vehicle) detection with the mAP [0.50:0.95] evaluation metric, which is far behind our 0.893 and 0.870. Moreover, the mF-score [0.50:0.95] of this study is a more rigorous metric than mAP [0.50:0.95], as it considers recall and precision together. This high performance of SSD is due to the properly adjusted hyperparameters and target-environment-specific training datasets. Figure 8 illustrates the detection results with SSD for Gyeongin Expressway 2 and Cheonho Bridge, the most crowded environment and a free-flow environment, respectively. The green boxes are vehicles, and the blue boxes indicate heavy vehicles.
How tolerant is SSD to the small object detection issue?
A wider view of the road image enables the analysis of longer road sections. To obtain traffic information from a wider area, the size (number of pixels) of an individual vehicle must be reduced accordingly, although this is regarded as a difficult problem for common deep learning models. To confirm the robustness of SSD in detecting small objects, we reduced the number of horizontal and vertical pixels of the image to 20% of their original values, leaving only 1/25 of the pixels of the entire image. The test for small objects was conducted with CB and GE2, where the average vehicle size was largest. The average object size in each environment was reduced from 159.3×65.6 pixels to 31.9×13.1 pixels and from 144.1×62.2 pixels to 28.8×12.4 pixels, respectively. In computer vision, an object smaller than 32×32 pixels is generally classified as a small object (33), which means our problem is harder than the usual small object detection in the computer vision field.
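The resolution experiment can be mimicked with a short resizing sketch (ours, using OpenCV; the file name is a placeholder): scaling both dimensions to 20% keeps only 1/25 of the pixels, and the ground-truth boxes are scaled by the same factor.

```python
import cv2  # OpenCV; pip install opencv-python

def downscale(image, boxes, factor=0.20):
    """Resize an image to `factor` of its width and height and rescale boxes to match."""
    h, w = image.shape[:2]
    small = cv2.resize(image, (int(w * factor), int(h * factor)), interpolation=cv2.INTER_AREA)
    small_boxes = [tuple(v * factor for v in box) for box in boxes]  # (x0, y0, x1, y1)
    return small, small_boxes

if __name__ == "__main__":
    frame = cv2.imread("uav_frame.png")          # placeholder path; 3840x2160 frames in the paper
    boxes = [(1200.0, 700.0, 1350.0, 765.0)]     # an assumed ground-truth vehicle box
    small, small_boxes = downscale(frame, boxes)
    print(small.shape, small_boxes)              # roughly 1/25 of the original pixels remain
```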
Figure 9(a) shows the recognition performance at various resolutions using SSD; 100% means the original image, and 20% is the image reduced to 1/25 of its pixels. Recognition performance shows no specific tendency of change with resolution. These results can be interpreted in several ways. First, the characteristics of the road image may have had an influence. Comparing Figure 9(b), the original image, and Figure 9(c), the image reduced to 1/25, we can still draw bounding boxes for vehicles in Figure 9(c) with confidence. As can be seen, the visible features of the two road images are not very different, even when the image size is reduced.

Structural characteristics can also be a reason for the high performance in small object detection. SSD passes the image through the neural network and stores the feature map obtained at every stage. The initial feature maps can recognize small objects with simple properties (e.g., a vehicle in a UAV image), and the later feature maps can recognize large objects with complex properties (e.g., human posture recognition) (40). This characteristic was also exploited by Ren et al., who improved Faster R-CNN's small object detection performance with ResNet-50. ResNet-50 has a deeper network than VGG16, but it also creates features through connections that skip some layers, making it suitable for both small and large objects (41). For these reasons, vehicle detection on the road using SSD with a ResNet-50 backbone performs well, even though vehicles are small objects.
Does dataset expansion always enhance performance?
In the usual deep learning framework, acquiring and training on as much data as possible is common practice. However, since the road images collected from UAVs are distinctive and simple, behavior different from the general deep learning context can be expected. The results in Figure 10 show a different tendency from general deep learning. We fixed the test set as the CB dataset. First, we used only the CB dataset as the training set, then expanded the training set by adding the GE dataset and the GE2 dataset; finally, the SE dataset was added. Unlike what we would expect from general deep learning, performance does not increase as the amount of data increases; sometimes it even decreases. The final CB+GE+GE2+SE training set contained more than 15 times as many samples as the initial training set built only from CB, which means that simply increasing the amount of data is not the right way to achieve high performance. We can infer that video from different environments (non-target environments) introduces feature interference that hinders performance.

Therefore, for vehicle detection using UAVs from a practical perspective, it is most useful to secure an adequate number of training samples for various environments and to use them only when the environment is similar to that of the target video (target environment). This result also suggests that the UAV can be a more effective data collection system than a fixed video camera, since the UAV can quickly obtain small amounts of video data at various locations without installing extensive fixed infrastructure.
CONCLUSIONS
In this study, the performance of Faster R-CNN, R-FCN, and SSD, i.e., modern deep learning-based vehicle detection models, was compared and analyzed. The models were evaluated with the strict mF-score [0.50:0.95] instead of the previously used AP 0.5. Each model was adjusted, e.g., anchor size and aspect ratio, to be appropriate for vehicle detection in UAV images. As a result, SSD showed significantly higher performance than Faster R-CNN and R-FCN, a conclusion that held for all environments. This result is noteworthy considering that SSD is a one-stage detector; thus, its speed is also faster than that of the other two. In the case of SSD, the image is processed at multiple scales by the algorithm, and it creates several rectangular anchor boxes with various scales. The shape of these rectangular anchor boxes is similar to the target, which is a vehicle, and SSD also generates anchor boxes of more varied sizes than the other algorithms. For these reasons, it is assumed that the higher performance was achieved by SSD. In particular, SSD did not show much degradation when the training dataset was small; even with about 500 samples in the training set, the detection was robust enough. Moreover, contrary to what a general deep learning framework would suggest, randomly mixing all images did not guarantee high accuracy; rather, overall performance decreased when other environments were added. Therefore, UAV pre-flight planning should focus on obtaining more than 500 samples from a single environment.

In tests using small objects, which are traditionally treated as a weakness of general detectors, SSD showed robust performance, contrary to such concerns. There was no significant change in performance even when objects were smaller than 32×32 pixels, the usual standard for small objects. Even after removing 96% of the pixels of the original image, the performance held up. This can be explained by the characteristics of road traffic images, the use of ResNet-50, and the use of early-stage feature maps. The simple characteristics of road traffic images from UAVs could be interpreted by ResNet-50, which provides features from various network depths, and by SSD's multi-scale feature map structure. Therefore, it is safe to capture a wide stretch of road by flying the UAV higher, as the performance of SSD is not significantly affected by the size of the objects themselves when identifying traffic patterns. However, if the distance is large, the image naturally responds more sensitively to the vibrations of the UAV itself and to air convection, so in reality the accuracy is expected to decrease.

This study is significant in that vehicle detection with deep learning, which has thus far remained at the trial level, has been conducted in various environments and verified to a reproducible level. A further contribution is that we suggest an optimal strategy that can be used for general vehicle detection from UAV images with deep learning. Practical application of emerging technology is difficult because of local contexts such as the objective, data collection, environmental conditions, and so on. To address these difficulties, this study focused on the various aspects of the practical application of deep learning-based vehicle detection: not just model selection, the training process, and performance measures, but also evaluation across environments and image resolutions. This study also separated heavy vehicles from other vehicles, which is significant in terms of traffic flow. SSD, which showed the highest performance in this study, can operate in real time, so as long as a UAV can hover, the traffic state of the site can be analyzed in real time. Using the results of this research, microscale vehicle-level traffic data for a specific point can be obtained by repeated video recording. We hope our research contributes to bridging the gap between practice and emerging front-line technology by providing reproducible methods. Future research could address more detailed contexts not covered here, e.g., night and evening lighting conditions, complex congested intersections, and vehicle tracking.
DATA ACCESSIBILITY STATEMENT
Some or all data, models, or code generated or used during the study are available from the corresponding author by request.

ACKNOWLEDGMENT
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science and ICT (NRF-2019R1H1A1080045).

AUTHOR CONTRIBUTIONS
The authors confirm contribution to the paper as follows: study conception and design: Ham, Park, Kim, Kim, and Kho; data collection: Ham and Kim; analysis and interpretation of results: Ham; draft manuscript preparation: Ham, Park, and Kim. All authors reviewed the results and approved the final version of the manuscript.
REFERENCES
1. Chung, K., J. Rudjanakanoknad, and M. J. Cassidy. Relation between Traffic Density and Capacity Drop at Three Freeway Bottlenecks. Transportation Research Part B: Methodological, 2007. 41: 82–95.
2. Li, L., X. Chen, and L. Zhang. Multimodel Ensemble for Freeway Traffic State Estimations. IEEE Transactions on Intelligent Transportation Systems, 2014. 15: 1323–1336.
3. Hamdar, S. H., L. Qin, and A. Talebpour. Weather and Road Geometry Impact on Longitudinal Driving Behavior: Exploratory Analysis Using an Empirically Supported Acceleration Modeling Framework. Transportation Research Part C: Emerging Technologies, 2016. 67: 193–213.
4. Park, H.-C., Y.-J. Joo, S.-Y. Kho, D.-K. Kim, and B.-J. Park. Injury Severity of Bus–Pedestrian Crashes in South Korea Considering the Effects of Regional and Company Factors. Sustainability, 2019. 11: 3169.
5. Park, H.-C., D.-K. Kim, S.-Y. Kho, and P. Y. Park. Cross-Classified Multilevel Models for Severity of Commercial Motor Vehicle Crashes Considering Heterogeneity among Companies and Regions. Accident Analysis and Prevention, 2017. 106: 305–314.
6. Coifman, B., M. McCord, R. G. Mishalani, M. Iswalt, and Y. Ji. Roadway Traffic Monitoring from an Unmanned Aerial Vehicle. IEE Proceedings - Intelligent Transport Systems, 2006. 153: 11–20.
7. Ke, R., Z. Li, S. Kim, J. Ash, Z. Cui, and Y. Wang. Real-Time Bidirectional Traffic Flow Parameter Estimation from Aerial Videos. IEEE Transactions on Intelligent Transportation Systems, 2016. 18: 890–901.
8. Liu, K., and G. Mattyus. Fast Multiclass Vehicle Detection on Aerial Images. IEEE Geoscience and Remote Sensing Letters, 2015. 12: 1938–1942.
9. Ozkurt, C., and F. Camci. Automatic Traffic Density Estimation and Vehicle Classification for Traffic Surveillance Systems Using Neural Networks. Mathematical and Computational Applications, 2009. 14: 187–196.
10. Khan, M. A., W. Ectors, T. Bellemans, D. Janssens, and G. Wets. UAV-Based Traffic Analysis: A Universal Guiding Framework Based on Literature Survey. Transportation Research Procedia, 2017. 22: 541–550.
11. Kim, E. J., H. C. Park, S. W. Ham, S. Y. Kho, and D. K. Kim. Extracting Vehicle Trajectories Using Unmanned Aerial Vehicles in Congested Traffic Conditions. Journal of Advanced Transportation, 2019. 2019: https://doi.org/10.1155/2019/9060797.
12. Xu, Y., G. Yu, X. Wu, Y. Wang, and Y. Ma. An Enhanced Viola-Jones Vehicle Detection Method from Unmanned Aerial Vehicles Imagery. IEEE Transactions on Intelligent Transportation Systems, 2017. 18: 1845–1856.
13. Barmpounakis, E. N., E. I. Vlahogianni, and J. C. Golias. Unmanned Aerial Aircraft Systems for Transportation Engineering: Current Practice and Future Challenges. International Journal of Transportation Science and Technology, 2017. 5: 111–122.
14. Lyu, S., M. C. Chang, D. Du, L. Wen, H. Qi, Y. Li, Y. Wei, L. Ke, T. Hu, M. Del Coco, P. Carcagni, D. Anisimov, E. Bochinski, F. Galasso, F. Bunyak, G. Han, H. Ye, H. Wang, K. Palaniappan, K. Ozcan, L. Wang, L. Wang, M. Lauer, N. Watcharapinchai, N. Song, N. M. Al-Shakarji, S. Wang, S. Amin, S. Rujikietgumjorn, T. Khanova, T. Sikora, T. Kutschbach, V. Eiselein, W. Tian, X. Xue, X. Yu, Y. Lu, Y. Zheng, Y. Huang, and Y. Zhang. UA-DETRAC 2017: Report of AVSS2017 & IWT4S Challenge on Advanced Traffic Monitoring. Presented at 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance, 2017. https://doi.org/10.1109/AVSS.2017.8078560.
15. Gaszczak, A., T. P. Breckon, and J. Han. Real-Time People and Vehicle Detection from UAV Imagery. Intelligent Robots and Computer Vision XXVIII: Algorithms and Techniques, 2011. 7878: https://doi.org/10.1117/12.876663.
16. Tuermer, S., F. Kurz, P. Reinartz, and U. Stilla. Airborne Vehicle Detection in Dense Urban Areas Using HoG Features and Disparity Maps. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2013. 6: 2327–2337.
17. Krizhevsky, A., I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Presented at Advances in Neural Information Processing Systems, 2012.
18. Wang, R., L. Zhang, K. Xiao, R. Sun, and L. Cui. EasiSee: Real-Time Vehicle Classification and Counting via Low-Cost Collaborative Sensing. IEEE Transactions on Intelligent Transportation Systems, 2014. 15: 414–424.
19. Xu, Y., G. Yu, Y. Wang, X. Wu, and Y. Ma. Car Detection from Low-Altitude UAV Imagery with the Faster R-CNN. Journal of Advanced Transportation, 2017. 2017: https://doi.org/10.1155/2017/2823617.
20. Felzenszwalb, P. F., R. B. Girshick, and D. McAllester. Cascade Object Detection with Deformable Part Models. Presented at 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
21. Girshick, R., J. Donahue, T. Darrell, and J. Malik. Region-Based Convolutional Networks for Accurate Object Detection and Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016. 38: 142–158.
22. Ren, S., K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. 39: 1137–1149.
23. Al-Kaisy, A., J. Bhatt, and H. Rakha. Modeling the Effect of Heavy Vehicles on Sign Occlusion at Multilane Highways. Journal of Transportation Engineering, 2005. 131: 219–228.
24. Van Lint, J. W. C., S. P. Hoogendoorn, and M. Schreuder. Fastlane: New Multiclass First-Order Traffic Flow Model. Transportation Research Record: Journal of the Transportation Research Board, 2008. 2088: 177–187.
25. Tang, T., S. Zhou, Z. Deng, H. Zou, and L. Lei. Vehicle Detection in Aerial Images Based on Region Convolutional Neural Networks and Hard Negative Example Mining. Sensors, 2017. 17: https://doi.org/10.3390/s17020336.
26. Maria, G., E. Baccaglini, D. Brevi, M. Gavelli, and R. Scopigno. A Drone-Based Image Processing System for Car Detection in a Smart Transport Infrastructure. Presented at 18th Mediterranean Electrotechnical Conference: Intelligent and Efficient Technologies and Services for the Citizen, 2016.
27. Liu, W., D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg. SSD: Single Shot MultiBox Detector. Presented at European Conference on Computer Vision, 2016.
28. Dai, J., Y. Li, K. He, and J. Sun. R-FCN: Object Detection via Region-Based Fully Convolutional Networks. In Advances in Neural Information Processing Systems, 2016.
29. Azevedo, C. L., J. L. Cardoso, M. Ben-Akiva, J. P. Costeira, and M. Marques. Automatic Vehicle Trajectory Extraction by Aerial Remote Sensing. Procedia - Social and Behavioral Sciences, 2014. 111: 849–858.
30. Elmikaty, M., and T. Stathaki. Detection of Cars in High-Resolution Aerial Images of Complex Urban Environments. IEEE Transactions on Geoscience and Remote Sensing, 2017. 55: 5913–5924.
31. Gleason, J., A. V. Nefian, X. Bouyssounousse, T. Fong, and G. Bebis. Vehicle Detection from Aerial Imagery. Presented at 2011 IEEE International Conference on Robotics and Automation, 2011.
32. Li, F., S. Li, C. Zhu, X. Lan, and H. Chang. Cost-Effective Class-Imbalance Aware CNN for Vehicle Localization and Categorization in High Resolution Aerial Images. Remote Sensing, 2017. 9: 129.
33. Lin, T. Y., M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. Lecture Notes in Computer Science, 2014. 8693: 740–755.
34. Zhao, X., D. Dawson, W. A. Sarasua, and S. T. Birchfield. Automated Traffic Surveillance System with Aerial Camera Arrays Imagery: Macroscopic Data Collection with Vehicle Tracking. Journal of Computing in Civil Engineering, 2016. 31: https://doi.org/10.1061/(asce)cp.1943-5487.0000646.
35. Khan, M. A., W. Ectors, T. Bellemans, D. Janssens, and G. Wets. Unmanned Aerial Vehicle-Based Traffic Analysis: A Methodological Framework for Automated Multi-Vehicle Trajectory Extraction. Transportation Research Record: Journal of the Transportation Research Board, 2017. 32: 115.
36. Yu, S. L., T. Westfechtel, R. Hamada, K. Ohno, and S. Tadokoro. Vehicle Detection and Localization on Bird's Eye View Elevation Images Using Convolutional Neural Network. Presented at 2017 IEEE International Symposium on Safety, Security and Rescue Robotics (SSRR), 2017.
37. Li, Z., S. Ahn, K. Chung, D. R. Ragland, W. Wang, and J. W. Yu. Surrogate Safety Measure for Evaluating Rear-End Collision Risk Related to Kinematic Waves near Freeway Recurrent Bottlenecks. Accident Analysis & Prevention, 2014. 64: 52–61.
38. Borji, A., M. M. Cheng, H. Jiang, and J. Li. Salient Object Detection: A Benchmark. IEEE Transactions on Image Processing, 2015. 24: 5706–5722.
39. Bodla, N., B. Singh, R. Chellappa, and L. S. Davis. Soft-NMS: Improving Object Detection with One Line of Code. Presented at 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
40. Bau, D., B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network Dissection: Quantifying Interpretability of Deep Visual Representations. Presented at 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017.
41. Ren, Y., C. Zhu, and S. Xiao. Small Object Detection in Optical Remote Sensing Images via Modified Faster R-CNN. Applied Sciences, 2018. 8: https://doi.org/10.3390/app8050813.
TABLE 1 Environment Description of Sample Images

Environment           | Vehicles | Heavy Vehicles | Sum   | Avg. Vehicle Length (px) | Avg. Vehicle Width (px) | Avg. Heavy Vehicle Width (px) | Heavy Vehicle Ratio (%) | Avg. Length (px) | Avg. Width (px) | Images
Cheonho Bridge        | 1042     | 95             | 1137  | 146.08                   | 63.70                   | 86.59                         | 8.36                    | 159.32           | 65.62           | 125
Gyeongbu Expressway   | 4333     | 393            | 4726  | 91.80                    | 40.66                   | 52.78                         | 8.32                    | 99.90            | 41.67           | 98
Gyeongin Expressway 2 | 4447     | 309            | 4756  | 134.14                   | 60.30                   | 89.23                         | 6.50                    | 144.06           | 62.18           | 130
Seohaean Expressway   | 1576     | 118            | 1694  | 132.52                   | 57.34                   | 93.29                         | 6.97                    | 141.95           | 59.84           | 118
Total or Average      | 11398    | 915            | 12313 | 126.13                   | 55.50                   | 80.47                         | 7.43                    | 136.24           | 57.36           | 471
TABLE 2 Evaluation of Sample Detection

Detection Number | IoU  | Threshold 0.50 | Threshold 0.80 | Threshold 0.90
1                | 0.92 | True           | True           | True
2                | 0.55 | True           | False          | False
3                | 0.22 | False          | False          | False
4                | 0.82 | True           | True           | False
Precision        |      | 0.75           | 0.50           | 0.25
Recall           |      | 0.60           | 0.40           | 0.20
F-score          |      | 0.67           | 0.44           | 0.22
TABLE 3 mF-score [0.50:0.95] Comparison by Number of Detection Classes

Model        | Dataset                     | Single-Class Detector (Overall) | Multi-Class Detector: Vehicle | Multi-Class Detector: Heavy Vehicle
Faster R-CNN | Cheonho Bridge (CB)         | 0.775                           | 0.782                         | 0.774
Faster R-CNN | Gyeongin Expressway 2 (GE2) | 0.702                           | 0.696                         | 0.689
R-FCN        | Cheonho Bridge (CB)         | 0.803                           | 0.802                         | 0.780
R-FCN        | Gyeongin Expressway 2 (GE2) | 0.616                           | 0.624                         | 0.603
SSD          | Cheonho Bridge (CB)         | 0.861                           | 0.893                         | 0.870
SSD          | Gyeongin Expressway 2 (GE2) | 0.800                           | 0.806                         | 0.759
TABLE 4 Performance Comparison by Detection Algorithm

Environment           | Class         | Faster R-CNN | R-FCN | SSD
Cheonho Bridge        | Vehicle       | 0.782        | 0.802 | 0.893
Cheonho Bridge        | Heavy Vehicle | 0.774        | 0.780 | 0.870
Gyeongbu Expressway   | Vehicle       | 0.563        | 0.423 | 0.875
Gyeongbu Expressway   | Heavy Vehicle | 0.563        | 0.466 | 0.876
Gyeongin Expressway 2 | Vehicle       | 0.696        | 0.624 | 0.806
Gyeongin Expressway 2 | Heavy Vehicle | 0.689        | 0.603 | 0.759
Seohaean Expressway   | Vehicle       | 0.575        | 0.556 | 0.694
Seohaean Expressway   | Heavy Vehicle | 0.589        | 0.472 | 0.614
Figure 1 Structure of Faster R-CNN

Figure 2 (a) Structure of R-FCN; (b) Structure of SSD

Figure 3 Overall framework of the research

Figure 4 Sample image data: (a) Cheonho Bridge (clear morning, no shadow, free flow, and road markings); (b) Gyeongbu Expressway (cloudy morning, faded shadow, moderate traffic, and large spatial scope); (c) Gyeongin Expressway 2 (clear evening, full shadow, heavy traffic, and slight curve); (d) Seohaean Expressway (clear afternoon, partial shadow, moderate traffic, and shadow of road sign)

Figure 5 Sample detection image

Figure 6 Result of the same detection with different evaluation metrics: (a) AP 0.5 (left); (b) mF-score [0.50:0.95] (right)

Figure 7 Performance comparison of detection algorithms

Figure 8 Result of vehicle detection with SSD in the heaviest traffic (Gyeongin Expressway 2, top) and in free-flow conditions (Cheonho Bridge, bottom)

Figure 9 (a) SSD performance by variation of image resolution (top); (b) original UAV image (bottom left); (c) UAV image reduced to 1/25 (bottom right)

Figure 10 Change in performance as training datasets are added
Detection of small targets, more specifically cars, in aerial images of urban scenes, has various applications in several domains, such as surveillance, military, remote sensing, and others. This is a tremendously challenging problem, mainly because of the significant interclass similarity among objects in urban environments, e.g., cars and certain types of non-target objects, such as buildings' roofs and windows. These non-target objects often possess very similar visual appearance to that of cars making it hard to separate the car and the non-car classes. Accordingly, most past works experienced low precision rates at high recall rates. In this paper, a novel framework is introduced that achieves a higher precision rate at a given recall than the state of the art. The proposed framework adopts a sliding-window approach and it consists of four stages, namely, window evaluation, extraction and encoding of features, classification, and post-processing. This paper introduces a new way to derive descriptors that encode the local distributions of gradients, colors, and texture. Image descriptors characterize the aforementioned cues using adaptive cell distributions, wherein the distribution of cells within a detection window is a function of its dominant orientation, and hence, neither the rotation of the patch under examination nor the computation of descriptors at different orientations is required. The performance of the proposed framework has been evaluated on the challenging Vaihingen and Overhead Imagery Research data sets. Results demonstrate the superiority of the proposed framework to the state of the art.