Article title: Investigating the Influential Factors for Practical Application of Multiclass Vehicle Detection for Images from Unmanned Aerial Vehicle Using Deep Learning Models

Journal title: Transportation Research Record

Paper history: Submitted 1st August 2019
Revised 10th March 2020
Accepted 2nd May 2020
Published online 16th October 2020
Published 1st December 2020

Funding: Ministry of Science and ICT, Republic of Korea (NRF-2019R1H1A1080045)

DOI information: https://doi.org/10.1177/0361198120954187

---------------------------
Investigating the Influential Factors for Practical Application of Multiclass Vehicle Detection for Images from Unmanned Aerial Vehicle Using Deep Learning Models

Seung Woo Ham
Department of Civil and Environmental Engineering
Seoul National University, Gwanak-gu, Seoul, Republic of Korea, 08826
Email: seungwoo.ham@snu.ac.kr

Ho-Chul Park
Department of Transportation Engineering
Myongji University, Yongin, Kyunggi, Republic of Korea, 17058
Email: hcpark@mju.ac.kr

Eui-Jin Kim
Department of Civil and Environmental Engineering
Seoul National University, Gwanak-gu, Seoul, Republic of Korea, 08826
Email: kyjcwal@snu.ac.kr

Seung-Young Kho
Department of Civil and Environmental Engineering
Seoul National University, Gwanak-gu, Seoul, Republic of Korea, 08826
Email: sykho@snu.ac.kr

Dong-Kyu Kim, Corresponding Author
Department of Civil and Environmental Engineering and Institute of Construction and Environmental Engineering
Seoul National University, Gwanak-gu, Seoul, Republic of Korea, 08826
Email: dongkyukim@snu.ac.kr

Word Count: 6,436 words + 4 tables = 7,436 words

Call for Papers: Collection and Application of Quality Traffic Data (ABJ35)
ABSTRACT
Traffic density, which is a critical measure in traffic operations, should be collected precisely at various locations and times to reflect site-specific spatiotemporal characteristics. For detailed analysis, heavy vehicles have to be separated from ordinary vehicles, since they have a significant effect on traffic flow as well as traffic safety. With unmanned aerial vehicles (UAVs), we can easily acquire video for vehicle detection by collecting images from above the traffic without any disturbances. Despite previous studies on vehicle detection, there is still a lack of research on real-world applications in estimating traffic density. In this study, we investigate the effects of influential factors, namely the size of objects, the number of samples, and the combination of datasets, on detecting multi-class vehicles in various UAV images using deep learning models. We compare three detection models, Faster Region-based Convolutional Neural Networks (Faster R-CNN), Region-based Fully Convolutional Network (R-FCN), and Single-Shot Detector (SSD), to suggest guidelines for model selection. The results provided several findings: 1) vehicle detection from UAV images showed sufficient performance with a small number of samples and small objects; 2) deep learning-based multi-class vehicle detectors can have advantages compared with single-class detectors; 3) among all the models, SSD showed the best performance because of its algorithmic structure; and 4) simply combining datasets from different environments cannot guarantee performance improvement. Based on our findings, we provide practical guidelines for estimating multi-class traffic density using UAVs.

Keywords: Vehicle Detection, Deep Learning, Unmanned Aerial Vehicle, Reproducibility in Practice, Single-Shot Detector
INTRODUCTION
Traffic phenomena in congestion, such as traffic oscillation, traffic breakdown, and capacity drop, result not only in traveler delays that reduce system-wide efficiency but also in increased crash potential. Many studies have unveiled the mechanisms that trigger those phenomena and developed measures to capture them (1, 2). Their efforts demonstrated that traffic density (i.e., the inverse of average vehicle spacing) is the most critical measure. To diagnose and analyze congestion phenomena, traffic density should be collected precisely at various locations and times to reflect site-specific characteristics. Meanwhile, in traffic flow analysis, heavy vehicles have a significant effect on traffic congestion as well as traffic safety because of their physical characteristics, e.g., heavy weight, large size, and maneuvering limitations (3-5). For precise analysis, therefore, heavy vehicles should be considered, but it is costly to obtain the density of heavy vehicles separately from that of ordinary vehicles.

Based on vehicle detector systems, aerial images, or surveillance cameras, many studies have attempted to obtain vehicle densities (6-8). Although they showed the possibility of high performance for collecting traffic density, improvements are still needed for real-world congestion management (9). This is because the approaches were cost-ineffective: installing surveillance cameras at many points or collecting high-resolution aerial images at different times is expensive. Also, most studies focused on detecting only ordinary vehicles. Furthermore, in the case of a surveillance camera, the images cannot be taken vertically from above the traffic as with a UAV, which causes vehicles to overlap in images of congested traffic.

Recently, unmanned aerial vehicles (UAVs) have been proposed to mitigate this inefficiency owing to their mobility, cost-effectiveness, wide field of view, and ability to hover (stationary flight) (6, 7, 10). UAVs can easily obtain high-resolution images from above traffic at low altitude, and only simple camera calibration is required to acquire a clear image and correct the geometric distortion (10, 11). The major drawback of images from a UAV is that vehicle features are represented by a small number of pixels. This can be further exacerbated in congested traffic, where shadows partially occlude vehicles or adjacent vehicles are detected as one; this significantly reduces the accuracy of the collected traffic density (12). Conventional approaches for detecting vehicles, such as background subtraction, blob analysis, and optical flow, are vulnerable to these difficulties because they cannot robustly detect the exact bounding boxes surrounding the vehicles (13, 14).

With the development of computer vision and deep learning, supervised learning-based vehicle detection methods have been proposed to collect accurate traffic density even in congestion. Until recently, combining feature representation and learning algorithms was the main approach for detecting vehicles in UAV images (8, 11, 15). Because those studies used generic features for object detection instead of features customized for vehicles in UAV images (8, 16), efficiency and accuracy can be further enhanced.

As the convolutional neural network (CNN) (17) achieved great success in image classification, many researchers have recently focused on vehicle detection using CNNs. These deep learning structures automatically create features from the images (18), and those features showed better performance for vehicle detection in UAV images than generic features (19). In particular, combining a CNN with bounding box regression (20), called region-based CNN (R-CNN) (21), allows the location of vehicles to be precisely specified by a bounding box, which drastically improves the performance of CNN-based object detection. Faster R-CNN (22), the enhanced version of R-CNN, performed real-time vehicle detection with high accuracy (19). In addition, a variety of advanced methodologies, such as the Region-based Fully Convolutional Network (R-FCN) and the Single-Shot Detector (SSD), have been applied to measure vehicle density accurately.

Despite many methodological studies on vehicle detection, there is still a lack of experimental research for real-world applications. For example, multi-class vehicle detection, which distinguishes vehicles from heavy vehicles, is essential for analyzing congestion because of their different impacts on traffic conditions (23, 24). Regarding performance, validation in various environments is required to evaluate the robustness of a detector, because detection performance can vary greatly depending on the characteristics of the image, e.g., image resolution, lighting conditions, and geometric features of the road. However, detailed analysis in various environments has not yet appeared in previous studies (19, 25, 26).

In this paper, we investigate the effects of influential factors, i.e., the size of a vehicle in the image (small objects), the number of samples, and the combination of datasets, on detecting multi-class vehicles using deep learning models. In addition, we compare three models, i.e., Faster R-CNN, R-FCN, and SSD, which are modern deep learning object detection models (27, 28). Based on the results, we provide practical guidelines for multi-class vehicle detection.

The remainder of this paper is organized as follows. First, we present a literature review of vehicle detection. In the next section, we discuss the deep learning methodologies of this study. Then, we describe the datasets and measures of effectiveness used in the study. We then show the model estimation results and discuss our findings. Lastly, we conclude the study and provide guidelines for multi-class vehicle detection.
LITERATURE REVIEW
Vehicle detection using aerial and UAV images is becoming popular because of its maneuverability and promising results. Among the vast literature, we have categorized important recent studies by the methodologies they develop: edge and blob detection, machine learning, and deep learning.

Because unsupervised methods such as edge detection and blob detection require relatively little computational power compared with other methods, they have been widely used for real-time detection. Azevedo et al. and Khan et al. used a background subtraction approach and blob analysis to target vehicles in uncongested, free-flow images (10, 29). Ke et al. detected vehicles in UAV traffic video using Shi-Tomasi features (7). These previous works showed that unsupervised methods can be applied in real time without training and perform appropriately in free flow and at urban intersections. However, the detection was not validated in congested situations, where the features used in those methods are reported to be susceptible to image conditions.

To improve the robustness of detection performance in complex environments, machine learning methods are widely used. Elmikaty and Stathaki trained support vector machines (SVMs) with hand-crafted features such as gradient, color, and texture (30). Gleason et al. used the histogram of gradients (HoG) and the histogram of Gabor coefficients as features and tested various detectors, such as k-nearest neighbors (k-NN), random forests, and SVM (31). However, these detection models were evaluated in a single environment, so their performance can drop in other situations, which limits practical use. Moreover, the fact that hand-crafted features are designed for general objects, not for vehicles in UAV images, also limits performance.

Deep learning methods have a significant advantage over other methods in that they automatically learn features from the image (17). Xu et al. used Faster R-CNN with a VGG16 network to train a vehicle detector on UAV images (19). The authors also showed that the Faster R-CNN method is robust to image orientation, compared with a Viola-Jones object detection scheme and a linear SVM with HoG features. In that study, however, it is difficult to know how good the detection performance actually is, because no detailed information about the evaluation metric is given.

While the mainstream of vehicle detection has focused on binary (vehicle versus background) detection, some research has addressed multi-class vehicle detection. Tang et al. trained a CNN and a cascade of boosted classifiers on UAV images to detect two vehicle classes, with images gathered from different roads in the daytime. The result showed that the deep learning method works better than conventional machine learning techniques (25). Liu and Mattyus applied a binary detector using a soft-cascade structure with integral channel features and classified the results into multiple classes with an aggregated classifier (8). Li et al. trained an R-CNN network on high-resolution aerial images to detect vehicles in multiple classes. After detection by a binary vehicle detector, each vehicle was classified into four classes, and the station wagon showed the highest detection performance with 2,302 training samples (32). However, these studies have the limitation of adding one more stage for classification after detecting vehicles, which induces not only a longer detection time but also a performance drop, because errors occur independently in each stage.

Multi-class object detection based solely on deep learning can set the number of classes from the beginning of the training stage. Although previous attempts have developed and evaluated multi-class detection models for generic objects (33), no research has provided a reproducible guideline for detecting vehicles and heavy vehicles in UAV images. In other words, each machine learning model for object detection has different strengths and weaknesses depending on the type of object (19, 30, 31), so a model that is suitable for vehicle detection in UAV images needs to be selected for practical use. Several studies have attempted to apply vehicle detection under specific environmental conditions of the training images, but detailed performance analysis across conditions such as lighting, the ratio of heavy vehicles, and image resolution is lacking (34). Therefore, the type of model architecture that performs well for vehicle detection in UAV images, and the impact of influential factors on the performance of the methods, should be investigated.

Since deep learning methods have strengths in image recognition for traffic images, as well as in various other fields (17), we focus on investigating an effective deep learning architecture for vehicle detection in UAV images, as well as the influential factors affecting detection performance. Although deep learning methods need greater computational power than traditional methods, they can still be used for real-time detection (22). Notably, a one-stage deep learning algorithm such as SSD has a faster running time than other algorithms because of its lower complexity (27).
DEEP LEARNING METHODOLOGIES
In this paper, we used three state-of-the-art object detectors: Faster R-CNN, SSD, and R-FCN. We adjusted the hyperparameters of each detector to obtain the best performance in vehicle detection. All three methods share common hyperparameters. The default sliding window (anchor box) was specified as 128×64 pixels, the scale of the sliding window was varied from 0.25 to 4.00, and the aspect ratio was varied from 1.0 to 4.0. The following paragraphs introduce the core idea of each detector.
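To make these shared hyperparameters concrete, the following sketch (our illustration, not the authors' code) enumerates anchor boxes derived from the 128×64 base window; the discrete scale and ratio values chosen within the stated ranges are assumptions.

```python
import itertools

# Base sliding window (anchor box) of 128x64 pixels, as stated above.
BASE_W, BASE_H = 128, 64

# Discrete values are assumptions; the paper only gives the ranges 0.25-4.00 and 1.0-4.0.
SCALES = [0.25, 0.5, 2.0, 4.0]   # four scales
RATIOS = [1.0, 2.0, 4.0]         # three height:width ratios

def generate_anchors(cx, cy):
    """Return (x_min, y_min, x_max, y_max) anchors centered on (cx, cy)."""
    anchors = []
    for scale, ratio in itertools.product(SCALES, RATIOS):
        w = BASE_W * scale
        h = BASE_H * scale * ratio
        anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

if __name__ == "__main__":
    boxes = generate_anchors(1920, 1080)   # one location near the image center
    print(len(boxes))                      # 12 windows: 4 scales x 3 ratios
    print([round(v, 1) for v in boxes[0]])
```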
Faster R-CNN is the third version of the R-CNN architecture. Figure 1 shows the structure of Faster R-CNN. The R-CNN architecture extracts regions that are likely to contain an object, i.e., region proposals, from the image and classifies whether each proposal contains an object or not. Among the region proposals, the more plausible proposals are selected as regions of interest (RoIs). The early version of R-CNN extracted RoIs using an algorithm called selective search. Each RoI is fed into a CNN, which transforms the RoI into a feature vector. The output feature vectors are then used for classification by an SVM, and the coordinates of the detection are adjusted by bounding box regression.

Faster R-CNN speeds up the region proposal process by extracting RoIs from the output feature map. This stage is called the "region proposal network" (RPN). In the RPN, a sliding window method is used to find the RoIs. Twelve sliding windows (three different ratios and four different scales) are applied at each point. Each RoI is then evaluated using an integrated loss, which contains a classification loss and a bounding box regression loss, as described in Equation 1:

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)    (1)

Here, N_cls is the mini-batch size, and N_reg is the number of sliding windows. The index i identifies an individual sliding window. The classification loss L_cls, which uses the cross-entropy loss, and the bounding box regression loss L_reg, which uses the smooth L1 loss, are calculated for each sliding window. p_i is the predicted probability that sliding window i contains an object, and p_i* is the binary indicator that represents the ground truth of p_i. t_i and t_i* are the adjusted coordinates of the predicted bounding box and the ground-truth bounding box, respectively. The weight between the classification loss and the bounding box regression loss is controlled by λ.
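As a minimal numerical sketch of Equation 1 (ours, not the authors' implementation), the following code evaluates the RPN loss for randomly generated windows, assuming binary cross-entropy for L_cls and the smooth L1 loss for L_reg as described above; the mini-batch size and example values are assumptions.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss used for bounding box regression."""
    absx = np.abs(x)
    return np.where(absx < 1.0, 0.5 * x ** 2, absx - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam=1.0):
    """Equation 1: classification loss + lambda-weighted box regression loss.

    p        : predicted objectness probability per sliding window, shape (N,)
    p_star   : ground-truth label per window (1 = object, 0 = background), shape (N,)
    t, t_star: predicted / ground-truth box offsets, shape (N, 4)
    """
    eps = 1e-7
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    # The regression loss only counts for windows that actually contain an object (p_star = 1).
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = rng.uniform(0.01, 0.99, 256)                 # 256 sampled windows (assumed mini-batch)
    p_star = (rng.uniform(size=256) < 0.3).astype(float)
    t = rng.normal(size=(256, 4))
    t_star = rng.normal(size=(256, 4))
    print(round(rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400), 3))
```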
The Region-based Fully Convolutional Network (R-FCN) is an attempt to solve the detection problem like a classification problem. Classification is translation invariant, which means the result does not change when the image is translated. On the other hand, detection is translation variant, since its output changes if the image is shifted or enlarged. Because the classification problem is easier than the detection problem, there are many advantages to using the properties of classification in detection.

R-FCN introduces a position-sensitive score map that contains the relative location information of the components of an object. The position-sensitive score map learns the arrangement of components within the object at the training stage and uses it at the detection stage. For example, in the case of detecting a human face, the position-sensitive score map learns that the nose will be in the center and the mouth will be at the bottom of the face. Using this knowledge embedded in the position-sensitive score map, the detector searches for a face that has a nose in the center and a mouth at the bottom.

R-FCN splits the length and breadth of the object into k parts each, so R-FCN searches for k × k components of the object. When the detector is trained for C categories, it creates score maps that can categorize a total of (C + 1) categories, including the background. Thus, the total number of channels becomes k²(C + 1). Figure 2(a) depicts the structure of R-FCN.
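The position-sensitive idea can be illustrated with a small NumPy sketch (a simplification we added, not the R-FCN implementation): for a grid size k and C classes, k²(C + 1) score maps are pooled bin by bin over an RoI and then averaged to vote for a class. The channel ordering and RoI coordinates below are assumptions.

```python
import numpy as np

def psroi_pool(score_maps, roi, k, num_classes):
    """Position-sensitive RoI pooling (simplified, integer grid, average pooling).

    score_maps : array of shape (k*k*(num_classes+1), H, W)
    roi        : (x0, y0, x1, y1) in feature-map coordinates
    Returns per-class scores of length num_classes+1 (background included).
    """
    x0, y0, x1, y1 = roi
    xs = np.linspace(x0, x1, k + 1).astype(int)   # bin boundaries along x
    ys = np.linspace(y0, y1, k + 1).astype(int)   # bin boundaries along y
    scores = np.zeros(num_classes + 1)
    for i in range(k):            # vertical bin index
        for j in range(k):        # horizontal bin index
            bin_idx = i * k + j
            for c in range(num_classes + 1):
                ch = bin_idx * (num_classes + 1) + c   # assumed channel layout
                patch = score_maps[ch, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                scores[c] += patch.mean()
    return scores / (k * k)       # vote: average over the k*k bins

if __name__ == "__main__":
    k, C = 3, 2                                  # e.g., vehicle and heavy vehicle
    maps = np.random.default_rng(1).random((k * k * (C + 1), 60, 60))
    print(psroi_pool(maps, roi=(10, 12, 40, 30), k=k, num_classes=C).round(3))
```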
The Single-Shot Detector (SSD), depicted in Figure 2(b), is a one-stage detection framework with a simple structure compared with Faster R-CNN and R-FCN. While those two-stage detection frameworks contain a pre-processing stage such as a region proposal network, a one-stage detection framework performs region proposal and detection simultaneously. SSD uses feature maps from several convolutional layers. As an image goes through the convolutional layers, the output feature maps represent increasingly complex components. At the initial layers, the output feature map represents components with low complexity, and at the final layers, components with high complexity are represented. Also, the same area in each feature map corresponds to a different area of the original image: the later the layer, the larger the area. This enables SSD to target multiple objects with a variety of complexities and sizes in one image. The later feature maps detect large, complex objects, and the initial feature maps detect small, simple objects. The scale of a default bounding box is set as in Equation 2:

s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m]    (2)

Here, m indicates the total number of feature maps that are used, and k indicates the index of the current feature map. s_max and s_min are set to 0.9 and 0.2, respectively. The aspect ratios of the default bounding boxes are prescribed as {1, 2, 3, 1/2, 1/3}.
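Equation 2 can be reproduced with a few lines of Python (our illustration). Deriving the box width and height from the scale and aspect ratio as w = s√a and h = s/√a follows the original SSD formulation and is an assumption about the exact parameterization used here.

```python
import math

S_MIN, S_MAX = 0.2, 0.9
ASPECT_RATIOS = [1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0]

def default_box_scales(m):
    """Equation 2: linearly spaced scales for feature maps k = 1..m."""
    return [S_MIN + (S_MAX - S_MIN) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def default_boxes(m):
    """Return (scale, aspect_ratio, width, height) relative to the input image."""
    boxes = []
    for s in default_box_scales(m):
        for a in ASPECT_RATIOS:
            boxes.append((round(s, 3), a, round(s * math.sqrt(a), 3), round(s / math.sqrt(a), 3)))
    return boxes

if __name__ == "__main__":
    print([round(s, 3) for s in default_box_scales(6)])  # 0.2, 0.34, 0.48, 0.62, 0.76, 0.9
    print(default_boxes(6)[0])
```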
Default bounding boxes are created for each feature map, and each box carries four coordinates and c prediction scores, one per class. The bounding boxes are then evaluated by the same loss function that is used in Faster R-CNN. In the paper in which SSD was introduced, the authors defined the loss function as the weighted sum of a confidence loss and a localization loss; however, these correspond to the classification loss and the bounding box regression loss, respectively. If several bounding boxes indicate the same object, only one bounding box is selected by non-maximum suppression (NMS).
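Non-maximum suppression can be summarized with the short sketch below (a generic greedy NMS we added, not the authors' exact implementation); the overlap threshold of 0.5 is an assumption.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

if __name__ == "__main__":
    boxes = [(100, 100, 200, 160), (105, 98, 205, 162), (300, 300, 360, 420)]
    print(nms(boxes, scores=[0.9, 0.8, 0.95]))   # -> [2, 0]: the duplicate box is suppressed
```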
DESCRIPTION OF THE DATASET AND RESEARCH FRAMEWORK
Figure 3 illustrates the overall framework of the research, including data collection, data labeling, and model evaluation. We compared the three advanced deep learning architectures for vehicle detection from UAV images, with detailed performance analysis according to the number of training images, the resolution of the images, and the composition of the dataset.

Four types of video images taken from four different places were used in the study: Cheonho Bridge (CB), Gyeongbu Expressway (GE), Gyeongin Expressway 2 (GE2), and Seohaean Expressway (SE). The four videos can be characterized by environmental conditions affecting performance, such as lighting conditions, shadow, congestion, and surroundings. A vehicle that is longer than 15 m in the image was classified as a heavy vehicle. Examples of the video images and important information about each dataset are shown in Figure 4 and Table 1. The ground-truth data were obtained by manually labeling bounding boxes around the vehicles in each frame. To reduce the labeling effort, we used the user-friendly image labeler provided by MATLAB.

The videos were taken in the vertical direction. The photography was done with a DJI Inspire 1 Pro equipped with a Zenmuse X5 camera, a quadcopter drone with 4K video and a 3-axis gimbal. The resolution of the video was 3840×2160 (25 fps), and a vehicle roughly consisted of 40×100 image pixels in this video. Although the hovering capability of our UAV with the 3-axis gimbal was enough to minimize UAV instability in all environments, an additional stabilization process may be required in harsh conditions. Details about the stabilization process for UAVs are presented in other work (35).

We constructed the training and evaluation datasets for each place, without mixing places. For example, the model that detects vehicles at Cheonho Bridge is trained on images of Cheonho Bridge and evaluated on images of Cheonho Bridge. The impact of a mixed dataset on detection performance is investigated in a later section.
MEASURE OF EFFECTIVENESS
The measure of detection performance is based on the intersection over union (IoU). The IoU, also known as the Jaccard index, is the overlap area divided by the union area of two boxes, the detection box and the ground-truth box. We set a threshold on the IoU to determine whether a detection is true or false. If the IoU of a detection is larger than the threshold, we accept it as a true positive, and vice versa.

Let us say that we detected four vehicles in one image, as in Figure 5 and Table 2. A good detection has an IoU near 1.00, while a bad detection has an IoU near 0. Our example contains four detections with IoUs ranging from 0.22 to 0.92.

The set of true detections varies with the threshold, because a higher threshold means a stricter measure of evaluation. When the threshold equals 0.5, all detections except Detection 3, whose IoU is 0.22, are recognized as true detections. In this case, three of the four detections are true, so the precision becomes 0.75. The recall is 0.60, as three true detections were made out of five ground-truth vehicles. Precision and recall represent the exactness and sensitivity of the model, respectively. The detailed equations, with true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), are as follows:

Precision = \frac{TP}{TP + FP}    (3)

Recall = \frac{TP}{TP + FN}    (4)

F\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}    (5)
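The worked example of Table 2 can be checked with a few lines of Python (our illustration); the four IoU values and the five ground-truth vehicles come from the example above.

```python
def precision_recall_f(ious, n_ground_truth, threshold):
    """Evaluate one image: a detection is a true positive if its IoU reaches the threshold."""
    tp = sum(1 for v in ious if v >= threshold)
    fp = len(ious) - tp                     # detections that did not overlap well enough
    fn = n_ground_truth - tp                # ground-truth boxes that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

if __name__ == "__main__":
    ious = [0.92, 0.55, 0.22, 0.82]          # the four detections of Figure 5 / Table 2
    for t in (0.50, 0.80, 0.90):
        p, r, f = precision_recall_f(ious, n_ground_truth=5, threshold=t)
        print(t, round(p, 2), round(r, 2), round(f, 2))
    # -> 0.5: 0.75 / 0.60 / 0.67   0.8: 0.50 / 0.40 / 0.44   0.9: 0.25 / 0.20 / 0.22
```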
However, when the threshold increases to 0.9, Detection 1 is the only detection that is regarded as true. The threshold can be determined by the intended application of the detection results. If the detections aim to count vehicles for estimating traffic density over road sections, a low threshold is acceptable. However, if the detections aim to count vehicles for each lane or to calculate the spacing between vehicles, a high threshold should be set.
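As a simple illustration of the section-density use case (our example with assumed counts and section length, not data from the paper), per-class density follows directly from the counted detections and the length of road covered by the frame.

```python
def section_density(counts_per_class, section_length_km, n_lanes=None):
    """Traffic density per class in vehicles per km (optionally per lane)."""
    densities = {cls: n / section_length_km for cls, n in counts_per_class.items()}
    if n_lanes:
        densities = {cls: d / n_lanes for cls, d in densities.items()}
    return densities

if __name__ == "__main__":
    # Assumed example: 38 vehicles and 4 heavy vehicles detected over a 0.5 km, 4-lane section.
    print(section_density({"vehicle": 38, "heavy_vehicle": 4}, section_length_km=0.5, n_lanes=4))
    # -> {'vehicle': 19.0, 'heavy_vehicle': 2.0} vehicles per km per lane
```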
Precision is the accuracy of predicting true positives among all of the detected samples, while recall is the proportion of true positives detected among all the ground truth. The F-score reflects both precision and recall, which are in a trade-off relationship with each other. In the detection example above, if the detector had created a bounding box in every spot with even a small chance of containing a vehicle, it would record high recall, as it detects most of the ground-truth data. At the same time, however, most bounding boxes would not have a vehicle inside, so the precision drops. On the other hand, if only objects that are clearly classified as vehicles are recognized as real vehicles, the number of vehicles detected decreases, and the recall decreases at the same time, while the precision surges toward 1. This relationship can be described by the receiver operating characteristic (ROC) curve, which shows the capability of the model to classify objects. We use the F-score because both precision and recall are important in vehicle detection, where accurate detection without false positives and false negatives is essential.
In the field of computer vision, many types of evaluation methods exist for detection problems. For example, the average precision at 0.5 (AP 0.5) is a measure of effectiveness with the IoU threshold set to 0.5. It has been used in vision-based learning in the transportation field (36), but it is a relatively low standard for practical applications such as lane-level traffic density estimation and crash-likelihood estimation by vehicle spacing (37). Recently, the mean average precision [0.50:0.95] (mAP [0.50:0.95]) and the mean average recall [0.50:0.95] (mAR [0.50:0.95]), which take the mean of the average precision and average recall over IoU thresholds from 0.50 to 0.95 with a step size of 0.05, have been widely used in benchmark studies (33). In this study, we used a modified version of them, the mean F-score [0.50:0.95] (mF-score [0.50:0.95]), which calculates the F-score from mAP [0.50:0.95] and mAR [0.50:0.95]. This method can be used as a balanced evaluation measure, as it includes information about both precision and recall:

mF\text{-}score_{[0.50:0.95]} = \frac{2 \cdot mAP_{[0.50:0.95]} \cdot mAR_{[0.50:0.95]}}{mAP_{[0.50:0.95]} + mAR_{[0.50:0.95]}}, \quad mAP_{[0.50:0.95]} = \frac{1}{10}\sum_{IoU=0.50}^{0.95} AP_{IoU}    (6)

where mAR [0.50:0.95] is defined analogously from the recall at each IoU threshold.
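The computation behind Equation 6 can be sketched as follows (our illustration); the AP and AR values in the example are assumed, not taken from the study.

```python
def mean_over_thresholds(values_by_threshold):
    """Average a per-threshold metric over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [0.50 + 0.05 * i for i in range(10)]
    return sum(values_by_threshold[t] for t in thresholds) / len(thresholds)

def mf_score(ap_by_threshold, ar_by_threshold):
    """Equation 6: F-score formed from mAP[0.50:0.95] and mAR[0.50:0.95]."""
    m_ap = mean_over_thresholds(ap_by_threshold)
    m_ar = mean_over_thresholds(ar_by_threshold)
    return 2 * m_ap * m_ar / (m_ap + m_ar) if m_ap + m_ar else 0.0

if __name__ == "__main__":
    thresholds = [0.50 + 0.05 * i for i in range(10)]
    # Assumed AP/AR values that fall off as the IoU threshold becomes stricter.
    ap = {t: 0.95 - 0.6 * (t - 0.50) for t in thresholds}
    ar = {t: 0.90 - 0.7 * (t - 0.50) for t in thresholds}
    print(round(mf_score(ap, ar), 3))
```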
Figure 6 also shows the need for evaluation with the mF-score [0.50:0.95] rather than AP 0.5. The purpose of this evaluation is to reveal the changes in the performance of the SSD detector for each environment. However, Figure 6(a) shows nearly constant performance regardless of the environment when evaluated with AP 0.5. In contrast, when the mF-score [0.50:0.95] is adopted as the evaluation metric, the performance clearly varies depending on the environment. In addition, Figure 6(b) shows that the number of training samples is highly relevant to performance, as the green and yellow lines show an increase in performance, while the rate of increase diminishes once the performance is above a certain level. Using the mF-score [0.50:0.95], which is a strict evaluation method, provided a lucid comparison of various aspects of the detection performance. Borji et al. also suggested that the F-score is the most appropriate evaluation metric for object detection (38). Models that performed well on the F-score metric were also found to perform well on other evaluation metrics.
RESULTS

What is the difference in accuracy between single- and multi-class detection?
If the performance of multi-class detection (i.e., detection of vehicles with classification into vehicle and heavy vehicle) lagged far behind that of single-class detection (i.e., detection of vehicles without classification), then separating vehicles and heavy vehicles should instead be done by single-class detection followed by an additional classification stage.

Table 3 shows the detection results evaluated by the mF-score [0.50:0.95] for single-class and multi-class detection, respectively. For CB, SSD showed an mF-score [0.50:0.95] of 0.861 as a single-class detector and reached 0.893 for multi-class detection. Similar results for the SSD detector were also observed on the GE2 dataset. Out of all six cases, in only two, GE2 detection using Faster R-CNN with an mF-score [0.50:0.95] of 0.702 and CB detection using R-FCN with an mF-score [0.50:0.95] of 0.803, did the single-class detector perform better than the multi-class detector. Even in these cases, the performance difference between the single-class and multi-class detectors was negligible. This means that a multi-class detector properly extracts different feature representations for each vehicle type, vehicle and heavy vehicle. As a result, there was no significant difference in performance between single-class and multi-class detectors, even though the multi-class detector also classified vehicles and heavy vehicles. Given the importance of distinguishing heavy vehicles, there is no reason to use single-class detection.
Which is best among Faster R-CNN, R-FCN, and SSD?
The number of training samples was varied to confirm its relevance to detection performance. Figure 7 shows that performance increases as the number of training samples increases, as common sense dictates. However, performance had already converged with a small number of training samples, near 3,000, in the case of GE and GE2. This shows that, when vehicle detection is performed on UAV images, the deep learning model can be easily saturated with only a small number of training samples. From a practical point of view, these results suggest that UAV-based vehicle detection can be utilized in various environments without excessive labeling work.

Comparing the three models in Table 4, SSD showed the best performance in all cases. The mF-score [0.50:0.95] of SSD was 0.893 for CB, 0.875 for GE, 0.806 for GE2, and 0.694 for SE. SSD also has advantages over the other models in terms of execution speed: while the other models are two-stage, SSD, as a one-stage model, requires less computational power. The performance of SSD in Table 4 is not only the highest among the three models but also higher than that of advanced object detection models from previous research (39). Bodla et al. recorded 0.647 for bus (heavy vehicle) detection and 0.615 for car (vehicle) detection with the mAP [0.50:0.95] evaluation metric, which is far behind our 0.893 and 0.870. Moreover, the mF-score [0.50:0.95] of this study is a more rigorous metric than mAP [0.50:0.95], as it considers recall and precision together. This high performance of SSD is due to the properly adjusted hyperparameters and target-environment-specific training datasets. Figure 8 illustrates the detection results with SSD for Gyeongin Expressway 2 and Cheonho Bridge, the most crowded environment and a free-flow environment, respectively. The green boxes are vehicles, and the blue boxes indicate heavy vehicles.
How tolerant is SSD to the small object detection issue?
A wider view of the road image enables the analysis of longer road sections. To obtain traffic information from a wider area, the size (number of pixels) of an individual vehicle must be reduced accordingly, although this is regarded as a difficult problem for common deep learning models. To confirm the robustness of SSD in detecting small objects, we reduced the number of horizontal and vertical pixels of the image to 20% of their original values, leaving only 1/25 of the pixels of the entire image. The test for small objects was conducted with CB and GE2, where the average vehicle size was largest. The average object size in each environment was reduced from 159.3×65.6 pixels to 31.9×13.1 pixels and from 144.1×62.2 pixels to 28.8×12.4 pixels, respectively. In computer vision, an object smaller than 32×32 pixels is generally classified as a small object (33), which means our problem is harder than the usual small object detection in the computer vision field.
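The resolution experiment can be mimicked with a short resizing sketch (ours, using OpenCV; the file name is a placeholder): scaling both dimensions to 20% keeps only 1/25 of the pixels, and the ground-truth boxes are scaled by the same factor.

```python
import cv2  # OpenCV; pip install opencv-python

def downscale(image, boxes, factor=0.20):
    """Resize an image to `factor` of its width and height and rescale boxes to match."""
    h, w = image.shape[:2]
    small = cv2.resize(image, (int(w * factor), int(h * factor)), interpolation=cv2.INTER_AREA)
    small_boxes = [tuple(v * factor for v in box) for box in boxes]  # (x0, y0, x1, y1)
    return small, small_boxes

if __name__ == "__main__":
    frame = cv2.imread("uav_frame.png")          # placeholder path; 3840x2160 frames in the paper
    boxes = [(1200.0, 700.0, 1350.0, 765.0)]     # an assumed ground-truth vehicle box
    small, small_boxes = downscale(frame, boxes)
    print(small.shape, small_boxes)              # roughly 1/25 of the original pixels remain
```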
Figure 9(a) shows the recognition performance at various resolutions using SSD; 100% means the original image, and 20% is the image reduced to 1/25 of its pixels. Recognition performance shows no specific tendency of change with resolution. These results can be interpreted in several ways. First, the characteristics of the road image may have had an influence. Comparing Figure 9(b), the original image, and Figure 9(c), the image reduced to 1/25, we can still draw bounding boxes for vehicles in Figure 9(c) with confidence. As can be seen, the visible features of the two road images are not very different, even when the image size is reduced.

Structural characteristics can also be a reason for the high performance in small object detection. SSD passes the image through the neural network and stores the feature map obtained at every stage. The initial feature maps can recognize small objects with simple properties (e.g., a vehicle in a UAV image), and the later feature maps can recognize large objects with complex properties (e.g., human posture recognition) (40). This characteristic was also exploited by Ren et al., who improved Faster R-CNN's small object detection performance with ResNet-50. ResNet-50 has a deeper network than VGG16, but it also creates features through connections that skip some layers, making it suitable for both small and large objects (41). For these reasons, vehicle detection on the road using SSD with a ResNet-50 backbone performs well, even though vehicles are small objects.
Does dataset expansion always enhance performance?
In the usual deep learning framework, acquiring and training on as much data as possible is common practice. However, since the road images collected from UAVs are distinctive and simple, behavior different from the general deep learning context can be expected. The results in Figure 10 show a different tendency from general deep learning. We fixed the test set as the CB dataset. First, we used only the CB dataset as the training set, then expanded the training set by adding the GE dataset and the GE2 dataset; finally, the SE dataset was added. Unlike what we would expect from general deep learning, performance does not increase as the amount of data increases; sometimes it even decreases. The final CB+GE+GE2+SE training set contained more than 15 times as many samples as the initial training set built only from CB, which means that simply increasing the amount of data is not the right way to achieve high performance. We can infer that video from different environments (non-target environments) introduces feature interference that hinders performance.

Therefore, for vehicle detection using UAVs from a practical perspective, it is most useful to secure an adequate number of training samples for various environments and to use them only when the environment is similar to that of the target video (target environment). This result also suggests that the UAV can be a more effective data collection system than a fixed video camera, since the UAV can quickly obtain small amounts of video data at various locations without installing extensive fixed infrastructure.
CONCLUSIONS
In this study, the performance of Faster R-CNN, R-FCN, and SSD, i.e., modern deep learning-based vehicle detection models, was compared and analyzed. The models were evaluated with the strict mF-score [0.50:0.95] instead of the previously used AP 0.5. Each model was adjusted, e.g., anchor size and aspect ratio, to be appropriate for vehicle detection in UAV images. As a result, SSD showed significantly higher performance than Faster R-CNN and R-FCN, a conclusion that held for all environments. This result is noteworthy considering that SSD is a one-stage detector; thus, its speed is also faster than that of the other two. In the case of SSD, the image is processed at multiple scales by the algorithm, and it creates several rectangular anchor boxes with various scales. The shape of these rectangular anchor boxes is similar to the target, which is a vehicle, and SSD also generates anchor boxes of more varied sizes than the other algorithms. For these reasons, it is assumed that the higher performance was achieved by SSD. In particular, SSD did not show much degradation when the training dataset was small; even with about 500 samples in the training set, the detection was robust enough. Moreover, contrary to what a general deep learning framework would suggest, randomly mixing all images did not guarantee high accuracy; rather, overall performance decreased when other environments were added. Therefore, UAV pre-flight planning should focus on obtaining more than 500 samples from a single environment.

In tests using small objects, which are traditionally treated as a weakness of general detectors, SSD showed robust performance, contrary to such concerns. There was no significant change in performance even when objects were smaller than 32×32 pixels, the usual standard for small objects. Even after removing 96% of the pixels of the original image, the performance held up. This can be explained by the characteristics of road traffic images, the use of ResNet-50, and the use of early-stage feature maps. The simple characteristics of road traffic images from UAVs could be interpreted by ResNet-50, which provides features from various network depths, and by SSD's multi-scale feature map structure. Therefore, it is safe to capture a wide stretch of road by flying the UAV higher, as the performance of SSD is not significantly affected by the size of the objects themselves when identifying traffic patterns. However, if the distance is large, the image naturally responds more sensitively to the vibrations of the UAV itself and to air convection, so in reality the accuracy is expected to decrease.

This study is significant in that vehicle detection with deep learning, which has thus far remained at the trial level, has been conducted in various environments and verified to a reproducible level. A further contribution is that we suggest an optimal strategy that can be used for general vehicle detection from UAV images with deep learning. Practical application of emerging technology is difficult because of local contexts such as the objective, data collection, environmental conditions, and so on. To address these difficulties, this study focused on the various aspects of the practical application of deep learning-based vehicle detection: not just model selection, the training process, and performance measures, but also evaluation across environments and image resolutions. This study also separated heavy vehicles from other vehicles, which is significant in terms of traffic flow. SSD, which showed the highest performance in this study, can operate in real time, so as long as a UAV can hover, the traffic state of the site can be analyzed in real time. Using the results of this research, microscale vehicle-level traffic data for a specific point can be obtained by repeated video recording. We hope our research contributes to bridging the gap between practice and emerging front-line technology by providing reproducible methods. Future research could address more detailed contexts not covered here, e.g., night and evening lighting conditions, complex congested intersections, and vehicle tracking.
DATA ACCESSIBILITY STATEMENT
Some or all data, models, or code generated or used during the study are available from the corresponding author by request.

ACKNOWLEDGMENT
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science and ICT (NRF-2019R1H1A1080045).

AUTHOR CONTRIBUTIONS
The authors confirm contribution to the paper as follows: study conception and design: Ham, Park, Kim, Kim, and Kho; data collection: Ham and Kim; analysis and interpretation of results: Ham; draft manuscript preparation: Ham, Park, and Kim. All authors reviewed the results and approved the final version of the manuscript.
REFERENCES
1. Chung, K., J. Rudjanakanoknad, and M. J. Cassidy. Relation between Traffic Density and Capacity Drop at Three Freeway Bottlenecks. Transportation Research Part B: Methodological, 2007. 41: 82–95.
2. Li, L., X. Chen, and L. Zhang. Multimodel Ensemble for Freeway Traffic State Estimations. IEEE Transactions on Intelligent Transportation Systems, 2014. 15: 1323–1336.
3. Hamdar, S. H., L. Qin, and A. Talebpour. Weather and Road Geometry Impact on Longitudinal Driving Behavior: Exploratory Analysis Using an Empirically Supported Acceleration Modeling Framework. Transportation Research Part C: Emerging Technologies, 2016. 67: 193–213.
4. Park, H.-C., Y.-J. Joo, S.-Y. Kho, D.-K. Kim, and B.-J. Park. Injury Severity of Bus–Pedestrian Crashes in South Korea Considering the Effects of Regional and Company Factors. Sustainability, 2019. 11: 3169.
5. Park, H.-C., D.-K. Kim, S.-Y. Kho, and P. Y. Park. Cross-Classified Multilevel Models for Severity of Commercial Motor Vehicle Crashes Considering Heterogeneity among Companies and Regions. Accident Analysis and Prevention, 2017. 106: 305–314.
6. Coifman, B., M. McCord, R. G. Mishalani, M. Iswalt, and Y. Ji. Roadway Traffic Monitoring from an Unmanned Aerial Vehicle. IEE Proceedings - Intelligent Transport Systems, 2006. 153: 11–20.
7. Ke, R., Z. Li, S. Kim, J. Ash, Z. Cui, and Y. Wang. Real-Time Bidirectional Traffic Flow Parameter Estimation from Aerial Videos. IEEE Transactions on Intelligent Transportation Systems, 2016. 18: 890–901.
8. Liu, K., and G. Mattyus. Fast Multiclass Vehicle Detection on Aerial Images. IEEE Geoscience and Remote Sensing Letters, 2015. 12: 1938–1942.
9. Ozkurt, C., and F. Camci. Automatic Traffic Density Estimation and Vehicle Classification for Traffic Surveillance Systems Using Neural Networks. Mathematical and Computational Applications, 2009. 14: 187–196.
10. Khan, M. A., W. Ectors, T. Bellemans, D. Janssens, and G. Wets. UAV-Based Traffic Analysis: A Universal Guiding Framework Based on Literature Survey. Transportation Research Procedia, 2017. 22: 541–550.
11. Kim, E. J., H. C. Park, S. W. Ham, S. Y. Kho, and D. K. Kim. Extracting Vehicle Trajectories Using Unmanned Aerial Vehicles in Congested Traffic Conditions. Journal of Advanced Transportation, 2019. 2019: https://doi.org/10.1155/2019/9060797.
12. Xu, Y., G. Yu, X. Wu, Y. Wang, and Y. Ma. An Enhanced Viola-Jones Vehicle Detection Method from Unmanned Aerial Vehicles Imagery. IEEE Transactions on Intelligent Transportation Systems, 2017. 18: 1845–1856.
13. Barmpounakis, E. N., E. I. Vlahogianni, and J. C. Golias. Unmanned Aerial Aircraft Systems for Transportation Engineering: Current Practice and Future Challenges. International Journal of Transportation Science and Technology, 2017. 5: 111–122.
14. Lyu, S., M. C. Chang, D. Du, L. Wen, H. Qi, Y. Li, Y. Wei, L. Ke, T. Hu, M. Del Coco, P. Carcagni, D. Anisimov, E. Bochinski, F. Galasso, F. Bunyak, G. Han, H. Ye, H. Wang, K. Palaniappan, K. Ozcan, L. Wang, L. Wang, M. Lauer, N. Watcharapinchai, N. Song, N. M. Al-Shakarji, S. Wang, S. Amin, S. Rujikietgumjorn, T. Khanova, T. Sikora, T. Kutschbach, V. Eiselein, W. Tian, X. Xue, X. Yu, Y. Lu, Y. Zheng, Y. Huang, and Y. Zhang. UA-DETRAC 2017: Report of AVSS2017 & IWT4S Challenge on Advanced Traffic Monitoring. Presented at 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance, 2017. https://doi.org/10.1109/AVSS.2017.8078560.
15. Gaszczak, A., T. P. Breckon, and J. Han. Real-Time People and Vehicle Detection from UAV Imagery. Intelligent Robots and Computer Vision XXVIII: Algorithms and Techniques, 2011. 7878: https://doi.org/10.1117/12.876663.
16. Tuermer, S., F. Kurz, P. Reinartz, and U. Stilla. Airborne Vehicle Detection in Dense Urban Areas Using HoG Features and Disparity Maps. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2013. 6: 2327–2337.
17. Krizhevsky, A., I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Presented at Advances in Neural Information Processing Systems, 2012.
18. Wang, R., L. Zhang, K. Xiao, R. Sun, and L. Cui. EasiSee: Real-Time Vehicle Classification and Counting via Low-Cost Collaborative Sensing. IEEE Transactions on Intelligent Transportation Systems, 2014. 15: 414–424.
19. Xu, Y., G. Yu, Y. Wang, X. Wu, and Y. Ma. Car Detection from Low-Altitude UAV Imagery with the Faster R-CNN. Journal of Advanced Transportation, 2017. 2017: https://doi.org/10.1155/2017/2823617.
20. Felzenszwalb, P. F., R. B. Girshick, and D. McAllester. Cascade Object Detection with Deformable Part Models. Presented at 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
21. Girshick, R., J. Donahue, T. Darrell, and J. Malik. Region-Based Convolutional Networks for Accurate Object Detection and Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016. 38: 142–158.
22. Ren, S., K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. 39: 1137–1149.
23. Al-Kaisy, A., J. Bhatt, and H. Rakha. Modeling the Effect of Heavy Vehicles on Sign Occlusion at Multilane Highways. Journal of Transportation Engineering, 2005. 131: 219–228.
24. Van Lint, J. W. C., S. P. Hoogendoorn, and M. Schreuder. Fastlane: New Multiclass First-Order Traffic Flow Model. Transportation Research Record: Journal of the Transportation Research Board, 2008. 2088: 177–187.
25. Tang, T., S. Zhou, Z. Deng, H. Zou, and L. Lei. Vehicle Detection in Aerial Images Based on Region Convolutional Neural Networks and Hard Negative Example Mining. Sensors, 2017. 17: https://doi.org/10.3390/s17020336.
26. Maria, G., E. Baccaglini, D. Brevi, M. Gavelli, and R. Scopigno. A Drone-Based Image Processing System for Car Detection in a Smart Transport Infrastructure. Presented at 18th Mediterranean Electrotechnical Conference: Intelligent and Efficient Technologies and Services for the Citizen, 2016.
27. Liu, W., D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg. SSD: Single Shot MultiBox Detector. Presented at European Conference on Computer Vision, 2016.
28. Dai, J., Y. Li, K. He, and J. Sun. R-FCN: Object Detection via Region-Based Fully Convolutional Networks. In Advances in Neural Information Processing Systems, 2016.
29. Azevedo, C. L., J. L. Cardoso, M. Ben-Akiva, J. P. Costeira, and M. Marques. Automatic Vehicle Trajectory Extraction by Aerial Remote Sensing. Procedia - Social and Behavioral Sciences, 2014. 111: 849–858.
30. Elmikaty, M., and T. Stathaki. Detection of Cars in High-Resolution Aerial Images of Complex Urban Environments. IEEE Transactions on Geoscience and Remote Sensing, 2017. 55: 5913–5924.
31. Gleason, J., A. V. Nefian, X. Bouyssounousse, T. Fong, and G. Bebis. Vehicle Detection from Aerial Imagery. Presented at 2011 IEEE International Conference on Robotics and Automation, 2011.
32. Li, F., S. Li, C. Zhu, X. Lan, and H. Chang. Cost-Effective Class-Imbalance Aware CNN for Vehicle Localization and Categorization in High Resolution Aerial Images. Remote Sensing, 2017. 9: 129.
33. Lin, T. Y., M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. Lecture Notes in Computer Science, 2014. 8693: 740–755.
34. Zhao, X., D. Dawson, W. A. Sarasua, and S. T. Birchfield. Automated Traffic Surveillance System with Aerial Camera Arrays Imagery: Macroscopic Data Collection with Vehicle Tracking. Journal of Computing in Civil Engineering, 2016. 31: https://doi.org/10.1061/(asce)cp.1943-5487.0000646.
35. Khan, M. A., W. Ectors, T. Bellemans, D. Janssens, and G. Wets. Unmanned Aerial Vehicle-Based Traffic Analysis: A Methodological Framework for Automated Multi-Vehicle Trajectory Extraction. Transportation Research Record: Journal of the Transportation Research Board, 2017. 32: 115.
36. Yu, S. L., T. Westfechtel, R. Hamada, K. Ohno, and S. Tadokoro. Vehicle Detection and Localization on Bird's Eye View Elevation Images Using Convolutional Neural Network. Presented at 2017 IEEE International Symposium on Safety, Security and Rescue Robotics (SSRR), 2017.
37. Li, Z., S. Ahn, K. Chung, D. R. Ragland, W. Wang, and J. W. Yu. Surrogate Safety Measure for Evaluating Rear-End Collision Risk Related to Kinematic Waves near Freeway Recurrent Bottlenecks. Accident Analysis & Prevention, 2014. 64: 52–61.
38. Borji, A., M. M. Cheng, H. Jiang, and J. Li. Salient Object Detection: A Benchmark. IEEE Transactions on Image Processing, 2015. 24: 5706–5722.
39. Bodla, N., B. Singh, R. Chellappa, and L. S. Davis. Soft-NMS: Improving Object Detection with One Line of Code. Presented at 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
40. Bau, D., B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network Dissection: Quantifying Interpretability of Deep Visual Representations. Presented at 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017.
41. Ren, Y., C. Zhu, and S. Xiao. Small Object Detection in Optical Remote Sensing Images via Modified Faster R-CNN. Applied Sciences, 2018. 8: https://doi.org/10.3390/app8050813.
TABLE 1 Environment Description of Sample Images

Environment           | Vehicles | Heavy Vehicles | Sum   | Avg. Vehicle Length (px) | Avg. Vehicle Width (px) | Avg. Heavy Vehicle Width (px) | Heavy Vehicle Ratio (%) | Avg. Length (px) | Avg. Width (px) | Images
Cheonho Bridge        | 1042     | 95             | 1137  | 146.08                   | 63.70                   | 86.59                         | 8.36                    | 159.32           | 65.62           | 125
Gyeongbu Expressway   | 4333     | 393            | 4726  | 91.80                    | 40.66                   | 52.78                         | 8.32                    | 99.90            | 41.67           | 98
Gyeongin Expressway 2 | 4447     | 309            | 4756  | 134.14                   | 60.30                   | 89.23                         | 6.50                    | 144.06           | 62.18           | 130
Seohaean Expressway   | 1576     | 118            | 1694  | 132.52                   | 57.34                   | 93.29                         | 6.97                    | 141.95           | 59.84           | 118
Total or Average      | 11398    | 915            | 12313 | 126.13                   | 55.50                   | 80.47                         | 7.43                    | 136.24           | 57.36           | 471
TABLE 2 Evaluation of Sample Detection

Detection Number | IoU  | Threshold 0.50 | Threshold 0.80 | Threshold 0.90
1                | 0.92 | True           | True           | True
2                | 0.55 | True           | False          | False
3                | 0.22 | False          | False          | False
4                | 0.82 | True           | True           | False
Precision        |      | 0.75           | 0.50           | 0.25
Recall           |      | 0.60           | 0.40           | 0.20
F-score          |      | 0.67           | 0.44           | 0.22
TABLE 3 mF-score [0.50:0.95] Comparison by Number of Detection Classes

Model        | Dataset                     | Single-Class Detector (Overall) | Multi-Class Detector: Vehicle | Multi-Class Detector: Heavy Vehicle
Faster R-CNN | Cheonho Bridge (CB)         | 0.775                           | 0.782                         | 0.774
Faster R-CNN | Gyeongin Expressway 2 (GE2) | 0.702                           | 0.696                         | 0.689
R-FCN        | Cheonho Bridge (CB)         | 0.803                           | 0.802                         | 0.780
R-FCN        | Gyeongin Expressway 2 (GE2) | 0.616                           | 0.624                         | 0.603
SSD          | Cheonho Bridge (CB)         | 0.861                           | 0.893                         | 0.870
SSD          | Gyeongin Expressway 2 (GE2) | 0.800                           | 0.806                         | 0.759
TABLE 4 Performance Comparison by Detection Algorithm

Environment           | Class         | Faster R-CNN | R-FCN | SSD
Cheonho Bridge        | Vehicle       | 0.782        | 0.802 | 0.893
Cheonho Bridge        | Heavy Vehicle | 0.774        | 0.780 | 0.870
Gyeongbu Expressway   | Vehicle       | 0.563        | 0.423 | 0.875
Gyeongbu Expressway   | Heavy Vehicle | 0.563        | 0.466 | 0.876
Gyeongin Expressway 2 | Vehicle       | 0.696        | 0.624 | 0.806
Gyeongin Expressway 2 | Heavy Vehicle | 0.689        | 0.603 | 0.759
Seohaean Expressway   | Vehicle       | 0.575        | 0.556 | 0.694
Seohaean Expressway   | Heavy Vehicle | 0.589        | 0.472 | 0.614
Figure 1 Structure of Faster R-CNN

Figure 2 (a) Structure of R-FCN; (b) Structure of SSD

Figure 3 Overall framework of the research

Figure 4 Sample image data: (a) Cheonho Bridge (clear morning, no shadow, free flow, and road markings); (b) Gyeongbu Expressway (cloudy morning, faded shadow, moderate traffic, and large spatial scope); (c) Gyeongin Expressway 2 (clear evening, full shadow, heavy traffic, and slight curve); (d) Seohaean Expressway (clear afternoon, partial shadow, moderate traffic, and shadow of road sign)

Figure 5 Sample detection image

Figure 6 Result of the same detection with different evaluation metrics: (a) AP 0.5 (left); (b) mF-score [0.50:0.95] (right)

Figure 7 Performance comparison of detection algorithms

Figure 8 Result of vehicle detection with SSD in the heaviest traffic (Gyeongin Expressway 2, top) and in free-flow conditions (Cheonho Bridge, bottom)

Figure 9 (a) SSD performance by variation of image resolution (top); (b) original UAV image (bottom left); (c) UAV image reduced to 1/25 (bottom right)

Figure 10 Change in performance as training datasets are added
Detection of small targets, more specifically cars, in aerial images of urban scenes, has various applications in several domains, such as surveillance, military, remote sensing, and others. This is a tremendously challenging problem, mainly because of the significant interclass similarity among objects in urban environments, e.g., cars and certain types of non-target objects, such as buildings' roofs and windows. These non-target objects often possess very similar visual appearance to that of cars making it hard to separate the car and the non-car classes. Accordingly, most past works experienced low precision rates at high recall rates. In this paper, a novel framework is introduced that achieves a higher precision rate at a given recall than the state of the art. The proposed framework adopts a sliding-window approach and it consists of four stages, namely, window evaluation, extraction and encoding of features, classification, and post-processing. This paper introduces a new way to derive descriptors that encode the local distributions of gradients, colors, and texture. Image descriptors characterize the aforementioned cues using adaptive cell distributions, wherein the distribution of cells within a detection window is a function of its dominant orientation, and hence, neither the rotation of the patch under examination nor the computation of descriptors at different orientations is required. The performance of the proposed framework has been evaluated on the challenging Vaihingen and Overhead Imagery Research data sets. Results demonstrate the superiority of the proposed framework to the state of the art.