Vehicle Count System based on Time Interval Image
Capture Method and Deep Learning Mask R-CNN
Eduardo Jr Piedad
Department of Electrical Engineering
University of San Jose-Recoletos
Cebu City, Philippines
eduardojr.piedad@usjr.edu.ph
Fhenyl Kristel Pama
Department of Civil Engineering
University of San Jose-Recoletos
Cebu City, Philippines
fhenylkristel.pama@gmail.com
Tuan-Tang Le
Department of Mechanical Engineering
National Taiwan University of Science
and Technology
Taipei City, Taiwan
d10603809@mail.ntust.edu.tw
Ianny Tabale
Department of Civil Engineering
University of San Jose-Recoletos
Cebu City, Philippines
ianny.tabale@gmail.com
Kimberly Aying
Department of Civil Engineering
University of San Jose-Recoletos
Cebu City, Philippines
kimberlyaying@gmail.com
Abstract— Traffic congestion is an undesirable problem for big cities, especially in developing countries. Better policy planning and decision-making by the authorities comes from well-conducted practical research. In this study, a Vehicle Count System (VCS) using the deep learning method Mask R-CNN is developed to classify and count vehicles passing through a target street. A novel time interval image capture (TIIC) system is employed in the VCS instead of the typical real-time video streaming in order to avoid large data storage costs. To determine the effectiveness of the developed VCS, its output is compared with that of the conventional method of manually recording the passing vehicles. Four vehicle types – cars, motorbikes, trucks and buses – are present in the 1800 real traffic images gathered from an actual field. As an initial stage, the developed tool performs satisfactorily in classifying and counting car-type vehicles, with 97.62% accuracy over a 10-hour test. However, it fails to recognize motorbikes, probably due to their relatively small pixel size compared to the other vehicle types, and the presence of jeeps confuses the VCS. The real image dataset can be used as a basis for further development, and the newly developed TIIC system is a promising replacement for real-time video streaming in future research.
Keywords—vehicle counting system, time interval image
capture, Mask R-CNN, traffic monitoring
I. INTRODUCTION
Traffic congestion is a growing concern in many developing countries, especially in the Philippines. Its capital region, Metro Manila, is considered to have the worst traffic in the world based on the 2015 Global Driver Satisfaction Index (GDSI) conducted by the Waze navigation app [1]. The problem may later grow in other big cities of the country such as Cebu and Davao. Policy-making bodies need sufficient information from various research agencies in the race to address the problem without incurring further cost. In this initial study, a deep learning based traffic monitoring system is developed.
Various studies address the big data problem arising from the integration of smart cameras into traffic-related applications. For example, expert systems deploying a network of smart cameras for traffic monitoring are developed in [2] to handle several subsystems for wide-area traffic control. Instead of a fixed camera, another study [3] uses images from unmanned aerial vehicles (UAVs) for vehicle detection. A review of ad hoc networks [4] surveys the growing body of research on vehicular ad hoc networks (VANETs) addressing traffic safety and other applications such as traffic status monitoring and road traffic management. Another study [5] applied optimization-based and deep learning based methods to understand traffic density from large-scale web camera data. Because of the economic cost generated by large-scale camera data, there is a need for a cheaper yet effective system. This study uses a strategy that limits big data acquisition by developing a time interval-based image capture technique.
In recent vehicle recognition and counting work, deep learning methods tend to be the most widely used due to promising benefits such as fast computation and high accuracy. A TraCount technique was developed in [6] to address the problem of counting overlapping vehicles. Another technique, FCN-rLSTM [7], develops deep spatio-temporal neural networks to count vehicles in low-quality videos captured by city cameras. A similar study uses a fine-grained vehicle classification model developed in [8] that handles complicated transportation scenes. A vehicle counting system (VCS) developed in [9] counts more accurately by distinguishing vehicle types instead of determining only whether an object is a vehicle or not; it classifies each vehicle as a car, taxi or truck using a convolutional neural network with a layer-skipping strategy (CNNLS) framework.
The study in [10] generated a large contextual dataset for classification, detection and counting of cars using a deep network called ResCeption, which offers a new way of counting cars in a single look instead of using conventional image processing techniques. A deep convolutional neural network developed in [11] gives a recent image-based learning technique to measure traffic density. Most of these studies develop deep learning tools that use large-scale datasets from video streaming, and some focus on improving the tool itself.
Fig 1. Flowchart of the developed vehicle counting system
978-1-7281-1895-6/19/$31.00 © 2019 IEEE
A practical implementation in an actual scenario using a more recent deep learning tool, the mask region-based convolutional neural network (Mask R-CNN), is proposed in this study. A pre-trained Mask R-CNN model is used to classify and count vehicles in the traffic images generated by the developed interval-based image capture.
II. VEHICLE COUNTING SYSTEM (VCS) SETUP
The developed VCS setup is shown in Fig. 1. There are two important parts of the VCS – the time interval image capture (TIIC) system and the deep learning Mask R-CNN. The setup is simply a fixed camera with the same angle of depression per image in order to capture the desired traffic parameters. A sample image taken by the VCS is shown in Fig. 2. Later, in the deep learning implementation, this original image is preprocessed so that vehicles are detected only in the desired area of the image.
Fig 2. A sample image taken by the vehicle counting system
Fig 3. The time interval image capture method
Fig 4. Three captured images in a 60-second green status of the target street
Fig 5. Deep learning Mask R-CNN implementation
2019 IEEE Region 10 Conference (TENCON 2019)
A. Time interval image capture (TIIC) system
The time interval image capture (TIIC) method is illustrated in Fig. 3. There are three typical traffic statuses – green, red and yellow. The green status is when the traffic light signals the vehicle to 'Go', while red signals 'Stop'. Since the yellow status usually covers only around a three-second interval, it is ignored in this method. The duration of the go and stop statuses depends on the location; normally, each status takes at least 16 seconds. In the TIIC method, at least three images are captured per status, as shown in Fig. 4. After detecting and counting each vehicle type, the mean number per vehicle type over the three images is taken. A total of 1800 traffic images were collected from the overpass of Osmeña Boulevard, one of the most congested places in Cebu City, Philippines.
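The per-status averaging described above can be sketched as follows. This is a minimal Python sketch: the per-image counts, the function name, and the dictionary layout are illustrative assumptions, not taken from the paper's implementation.

```python
from statistics import mean

# Hypothetical per-image vehicle counts for the three images captured
# during one green status (values are illustrative).
counts_per_image = [
    {"car": 5, "truck": 1},
    {"car": 6, "truck": 1},
    {"car": 4, "truck": 2},
]

def mean_counts(per_image):
    """Mean number of each vehicle type across the captured images."""
    types = sorted(set().union(*per_image))
    return {t: mean(img.get(t, 0) for img in per_image) for t in types}

status_counts = mean_counts(counts_per_image)
```

Averaging over a small burst of images, rather than processing a continuous stream, is what lets the TIIC method avoid the storage cost of real-time video.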
B. Deep Learning Mask R-CNN
The second part of the VCS includes vehicle detection and counting. The typical deep learning implementation is shown in Fig. 5, where the proposed VCS is compared with the conventional classification by manual recording. The images captured by the TIIC system are fed into the deep learning model.
In this study, we use the recent mask region-based convolutional neural network (Mask R-CNN); readers are invited to consult the literature on Mask R-CNN and its variants in [12]–[15]. It is known for accurate object detection and simultaneous object classification while creating an output mask of each object.
Fig 6. Mask R-CNN integrated learning model pre-trained on the COCO dataset
Fig 7. Image processing from the original image (a) to the final output (d)
There are four object classes of concern, corresponding to four vehicle types – cars, motorbikes, trucks and buses. The performance evaluation is done by comparing the classification accuracy of the proposed VCS against the conventional method using equation (1).
Accuracy = (Number of correctly classified vehicles / Total actual vehicles) × 100%   (1)
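As a quick sketch, equation (1) can be computed in Python as follows (the function name is illustrative; the sample values are the hour-1 car counts from Table 1):

```python
def accuracy(classified, actual):
    """Equation (1): correctly classified vehicles over total actual
    vehicles, expressed as a percentage."""
    return classified / actual * 100

# Hour 1, car column of Table 1: 292 detected versus 340 manually counted.
print(round(accuracy(292, 340), 2))  # 85.88
```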
III. IMPLEMENTATION
The learning model based on Mask R-CNN is shown in Fig. 6. There are four essential steps in this process – data preprocessing, the Mask R-CNN architecture, filtering and output post-processing.
Step 1. In the data preprocessing stage, the original image in Fig. 7 (a) is transformed into the preprocessed image in Fig. 7 (c). In order to prevent detecting vehicles outside the desired street, as shown in Fig. 7 (b), an image processing technique subtracts the information outside the desired street, as shown in Fig. 7 (c). Note that the vehicle detection shown in Figs. 7 (b)-(d) is discussed in the next step and appears here only as a reference.
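The subtraction of information outside the target street can be sketched with NumPy as below. This is a simplified assumption: a rectangular region of interest stands in for the paper's street outline, and the function name, ROI coordinates, and test frame are all illustrative.

```python
import numpy as np

def mask_outside_roi(image, roi):
    """Zero out all pixels outside a region of interest.

    `roi` is (top, bottom, left, right). A rectangle is used here for
    simplicity; masking an arbitrary street polygon works the same way,
    with a polygonal mask instead of a slice."""
    top, bottom, left, right = roi
    out = np.zeros_like(image)
    out[top:bottom, left:right] = image[top:bottom, left:right]
    return out

# A white 480x640 test frame stands in for a captured traffic image.
frame = np.full((480, 640, 3), 255, dtype=np.uint8)
masked = mask_outside_roi(frame, (100, 400, 50, 600))
```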
Step 2. The new images serve as the input to the Mask R-CNN architecture. In this study, we employed a pre-trained model that detects 80 different classes. The Mask R-CNN model is fast and easy to train, and it is used here to detect and classify multiple vehicles in an image. Training with ResNet-50-FPN on COCO trainval35k takes 32 hours in our synchronized 8-GPU implementation (0.72 s per 16-image mini-batch), and 44 hours with ResNet-101-FPN; in fact, fast prototyping can be completed in less than one day when training on the train set. Models are trained on all COCO trainval35k images that contain annotated keypoints. To avoid overfitting, as this training set is smaller, training uses image scales randomly sampled from [640, 800] pixels, while inference runs at a single scale of 800 pixels. Training runs for 90,000 iterations, starting from a learning rate of 0.02 and reducing it by a factor of 10 at 60,000 and 80,000 iterations, with bounding-box NMS at a threshold of 0.5. The COCO dataset is available online [16]. The network's output is an unordered sequence of classes, boxes and masks; the output information is collected by grouping entries that share the same object index.
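Collecting the network's parallel output lists by shared object index can be sketched as follows. The class ids, boxes, and scores here are illustrative placeholders, not real model output.

```python
# The network returns parallel lists; entries sharing an index describe
# the same detected object. Values below are illustrative placeholders.
class_ids = [3, 8, 3]  # e.g. COCO-style ids for car, truck, car
boxes = [(10, 20, 50, 60), (70, 15, 130, 80), (140, 22, 180, 58)]
scores = [0.98, 0.91, 0.95]

# Zip on the shared index to form one record per detected object.
detections = [
    {"class_id": c, "box": b, "score": s}
    for c, b, s in zip(class_ids, boxes, scores)
]
```

A per-object mask array would be carried along in exactly the same way, as a fourth parallel list.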
Step 3. Since the pre-trained Mask R-CNN produces a number of undesirable outputs, a filter is created so that only the desired information, such as the bounding boxes and masks of the desired vehicle types – cars, trucks, motorbikes and buses – is retained.
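A minimal sketch of such a filter is shown below, assuming the pre-trained model emits COCO-style class names (where the motorbike class is labelled "motorcycle"); the detection records are illustrative.

```python
# The four vehicle classes of interest, using COCO-style labels
# (an assumption; the paper does not list the exact label strings).
VEHICLE_CLASSES = {"car", "motorcycle", "truck", "bus"}

def filter_vehicles(detections):
    """Retain only detections whose label is one of the four vehicle types."""
    return [d for d in detections if d["label"] in VEHICLE_CLASSES]

raw = [{"label": "car"}, {"label": "person"}, {"label": "bus"}]
kept = filter_vehicles(raw)  # the "person" detection is dropped
```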
Step 4. Finally, the retained information is sorted and a counting process is performed; detections of the same vehicle type are grouped and counted together.
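The grouping-and-counting step can be sketched with Python's `collections.Counter` (the detection records are illustrative):

```python
from collections import Counter

def count_by_type(detections):
    """Count the retained detections per vehicle type."""
    return Counter(d["label"] for d in detections)

kept = [{"label": "car"}, {"label": "car"}, {"label": "truck"}]
print(count_by_type(kept))  # Counter({'car': 2, 'truck': 1})
```

These per-image counts are what the TIIC stage averages over the three images of each traffic-light status.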
IV. RESULTS AND DISCUSSION
Based on the output images in Fig. 8 (a), there is difficulty in detecting motorbikes due to their comparatively small pixel size, which allows them to be easily overlapped by bigger vehicle types. In addition, the numbers of trucks and buses detected by the proposed VCS exceed those of the manual method because of confusion with jeeps, as shown in Fig. 8. In this study, the jeep is categorized as a car type, and based on Fig. 8 (b), two overlapping jeeps look similar to trucks and buses. Fig. 8 (c) shows a successful detection of vehicles in an image, while Fig. 8 (d) shows some misdetections and misclassifications.
Fig 8 (a)-(d). Output images of the proposed VCS with vehicle masks and labels
Table 1. Vehicle Classification Accuracy of Manual and Proposed VCS (% acc on the Total row is the total weighted accuracy)

| Hour  | Car Manual | Car Proposed | Car % acc | Motor Manual | Motor Proposed | Motor % acc | Bus Manual | Bus Proposed | Bus % acc | Truck Manual | Truck Proposed | Truck % acc |
|-------|------------|--------------|-----------|--------------|----------------|-------------|------------|--------------|-----------|--------------|----------------|-------------|
| 1     | 340        | 292          | 85.88     | 250          | 97             | 38.80       | 0          | 16           | 0.00      | 16           | 101            | -15.84      |
| 2     | 464        | 506          | 91.70     | 397          | 157            | 49.20       | 4          | 15           | -26.67    | 20           | 151            | -13.25      |
| 3     | 403        | 384          | 95.29     | 313          | 154            | 49.20       | 1          | 7            | -14.29    | 6            | 121            | -4.96       |
| 4     | 514        | 777          | 66.15     | 588          | 105            | 17.86       | 1          | 23           | -4.35     | 33           | 186            | -17.74      |
| 5     | 341        | 352          | 96.88     | 287          | 138            | 48.08       | 5          | 25           | -20.00    | 4            | 111            | -3.60       |
| 6     | 832        | 744          | 89.42     | 455          | 132            | 29.01       | 6          | 33           | 18.18     | 16           | 124            | -12.90      |
| 7     | 355        | 415          | 85.54     | 236          | 147            | 62.29       | 1          | 20           | -5.00     | 14           | 156            | -8.97       |
| 8     | 489        | 349          | 71.37     | 292          | 117            | 40.07       | 1          | 20           | -5.00     | 31           | 98             | -31.63      |
| 9     | 414        | 711          | 58.23     | 311          | 125            | 40.19       | 0          | 39           | 0.00      | 16           | 232            | -6.90       |
| 10    | 882        | 384          | 43.54     | 413          | 139            | 33.66       | 9          | 22           | -40.91    | 48           | 98             | -48.98      |
| Total | 5034       | 4914         | 97.62     | 3542         | 1311           | 37.01       | 28         | 220          | -12.73    | 204          | 1378           | -14.80      |
Table 1 presents the 10-hour vehicle classification accuracy of the conventional and proposed vehicle counting systems (VCS). It also shows the accuracy of the proposed VCS in comparison with the manual method. Note that a negative accuracy only means that the proposed VCS detects more vehicles than the manual method. Accordingly, the proposed VCS successfully detects cars with a satisfactory total accuracy of 97.62% while performing poorly on the rest of the vehicle types. It can be observed that in the 10th hour it detects cars poorly, with only 43.54% accuracy, while the truck and bus counts increase significantly. This means that the proposed VCS tends to confuse cars with trucks or buses.
V. CONCLUSION
A vehicle counting system (VCS) based on a time interval image capture (TIIC) system and deep learning Mask R-CNN is successfully developed and implemented in an actual field scenario. The developed VCS failed to count the numbers of motorbikes, trucks and buses but performed sufficiently for cars. The presence of jeeps, which are categorized as cars, confused the VCS into detecting them as either trucks or buses. The VCS tends to miss most motorbikes, probably due to their small pixel size relative to the other vehicle types. Future research can implement the novel TIIC system to minimize data storage cost.
REFERENCES
[1] B. Bongat, “How Much Money Are You Losing Because of Traffic?,”
Yahoo News, 2015. [Online]. Available:
https://sg.news.yahoo.com/much-money-losing-because-traffic-
220012598.html. [Accessed: 15-Feb-2019].
[2] L. Calderoni, D. Maio, and S. Rovis, "Deploying a network of smart cameras for traffic monitoring on a 'city kernel'," Expert Systems with Applications, vol. 41, pp. 502–507, 2014.
[3] G. V. Konoplich, E. O. Putin, and A. A. Filchenkov, "Application of deep learning to the problem of vehicle detection in UAV images," in 2016 XIX IEEE Int. Conf. Soft Comput. Meas., pp. 4–6, 2016.
[4] T. Darwish and K. A. Bakar, "Traffic density estimation in vehicular ad hoc networks: A review," Ad Hoc Networks, vol. 24, pp. 337–351, 2015.
[5] S. Zhang, G. Wu, and P. Costeira, "Understanding Traffic Density from Large-Scale Web Camera Data," arXiv Prepr., 2017.
[6] S. Surya and V. Babu, "TraCount: A Deep Convolutional Neural Network for Highly Overlapping Vehicle Counting," 2016.
[7] S. Zhang, G. Wu, J. Costeira, and J. Moura, "FCN-rLSTM: Deep Spatio-Temporal Neural Networks for Vehicle Counting in City Cameras," in 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
[8] S. Yu, Y. Wu, W. Li, Z. Song, and W. Zeng, “A model for fine-grained
vehicle classification based on deep learning,” Neurocomputing, vol.
257, pp. 97–103, 2017.
[9] S. Awang and N. M. A. N. Azmi, “Vehicle Counting System Based on
Vehicle Type Classification Using Deep Learning Method,” in IT
Convergence and Security 2017, 2018, pp. 52–59.
[10] T. N. Mundhenk, G. Konjevod, W. A. Sakla, and K. Boakye, “A Large
Contextual Dataset for Classification, Detection and Counting of Cars
with Deep Learning,” in Computer Vision -- ECCV 2016, 2016, pp.
785–800.
[11] J. Chung and K. Sohn, “Image-Based Learning to Measure Traffic
Density Using a Deep Convolutional Neural Network,” IEEE Trans.
Intell. Transp. Syst., vol. 19, no. 5, pp. 1670–1675, 2018.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
hierarchies for accurate object detection and semantic segmentation,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2014, pp. 580–587.
[13] R. Girshick, “Fast R-CNN,” in The IEEE International Conference on
Computer Vision (ICCV), 2015.
[14] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-
time object detection with region proposal networks,” in Advances in
neural information processing systems, 2015, pp. 91–99.
[15] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in
Computer Vision (ICCV), 2017 IEEE International Conference on,
2017, pp. 2980–2988.
[16] COCO Consortium, "COCO dataset." [Online]. Available: http://cocodataset.org/#home. [Accessed: 03-Mar-2019].