Comparative evaluation of human detection and
tracking approaches for online tracking applications
Hong-Quan Nguyen∗‡, Thuy-Binh Nguyen∗†, Tuan-Anh Nguyen∗, Thi-Lan Le∗, Thanh-Hai Vu§, Alexis Noe¶
∗Computer Vision Department, MICA International Research Institute,
Hanoi University of Science and Technology, Vietnam
Email: Thi-Lan.Le@mica.edu.vn
†Faculty of Electrical and Electronics Engineering, University of Transport and Communications, Hanoi, Vietnam
‡Faculty of Information Technology, Viet-Hung Industrial University, Hanoi, Vietnam
§Mathematics Science Research Division, Viettel Group
¶School of Engineering in Physics, Applied Physics, Electronics & Materials Science, Grenoble Institute of Technology, France.
Abstract—Object detection and tracking in videos is an important problem in computer vision thanks to its wide applications in various video analysis scenarios. As a result, it has attracted huge interest from the scientific community. The majority of recent works follow the tracking-by-detection approach, which relies on a people detector to start, update, reinitialize, guide, and terminate the trackers. Recent years have witnessed a significant advance in person detection and tracking performance. However, person detection and tracking are usually treated separately in recent works. The contributions of this paper are twofold. First, a comparative evaluation of the coupling of person detection and tracking methods for online tracking applications is conducted on two video datasets: MOT17, a benchmark dataset provided in the MOT Challenge [1], and our own dataset captured in a video surveillance context. For this, we investigate a popular online tracking method (DeepSORT) coupled with two state-of-the-art people detection methods, You Only Look Once (YOLO) and Mask R-CNN. Second, a deep analysis of the behavior of the person detection and tracking methods, in terms of both detection and tracking performance and resource requirements for practical applications, is given. The implementation of the framework and the dataset used in this paper will be made publicly available.
I. INTRODUCTION
The presence of cameras in our surroundings grows every day, allowing visual surveillance systems to be used in a wide range of domains such as security, health-care, etc. In these systems, pedestrian detection and tracking, which aims to estimate the state of a person (e.g., person location, identity) over time, is an initial and crucial step. In the last decade, pedestrian detection and tracking has received a lot of attention as a research topic, which has resulted in a broad range of available techniques [2]. The majority of recent works follow the tracking-by-detection approach, which relies on a people detector to start, update, reinitialize, guide, and terminate the trackers. Recent years have witnessed a significant advance in person detection and tracking performance. However, person detection and tracking are usually treated separately in recent works. Some previous works attempt to evaluate object detection [3] and object tracking [1], [4], [5]. The MOT Challenge has designed a common platform containing videos with object detection and tracking ground truth as well as common evaluation metrics for object tracking evaluation. However, as this platform allows both batch mode (in which video frames from future time steps are also utilized to solve the data association problem) and online mode, a suggestion on the choice of these methods for online tracking applications is unavailable.
The contributions of this paper are twofold. First, a comparative evaluation of the coupling of person detection and tracking methods for online tracking applications is conducted on two datasets: MOT17, a benchmark dataset provided in the MOT Challenge [1], and our own dataset captured in a video surveillance context. For this, we investigate a popular online tracking method (DeepSORT) coupled with two state-of-the-art people detection methods, You Only Look Once (YOLO) and Mask R-CNN. Second, a deep analysis of the behavior of the person detection and tracking methods, in terms of both detection and tracking performance and resource requirements for practical applications, is given.
The rest of the paper is organized as follows. Section 2 discusses previous works related to human detection and tracking. Next, our framework is presented in Section 3. Then, an exhaustive evaluation of the performance of both human detection and tracking is given in Section 4. Conclusions and future work are presented in the last section.
II. RELATED WORK
In this section, we briefly discuss some prominent studies involving human detection and tracking. To detect objects in videos, numerous previous works tried to build background models and determine the objects through background subtraction algorithms [6]–[8]. Even though some models for dynamic backgrounds have been proposed, background subtraction-based methods remain relatively sensitive to lighting variations. Another approach is to model the person appearance using appearance cues such as color and texture, and to employ the scanning-window technique to determine the presence of a person in a given window [9]–[12]. In recent years, various deep learning networks have been proposed for object detection in general and human detection in particular, such as YOLO [13], SSD [14], and Mask R-CNN [15].
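As a loose illustration of the background-subtraction idea, the following is a minimal running-average sketch in NumPy; the threshold and learning rate are made-up values, and this is not one of the specific algorithms of [6]–[8]:

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Running-average background model: slowly adapt to scene changes."""
    return (1 - alpha) * bg + alpha * frame

def foreground_mask(bg, frame, threshold=25):
    """Pixels deviating strongly from the background model are foreground."""
    return np.abs(frame - bg) > threshold

# Toy example: an 8x8 grey scene into which a bright 2x2 "object" enters.
bg = np.full((8, 8), 100.0)
frame = bg.copy()
frame[2:4, 2:4] = 200.0            # the moving object
mask = foreground_mask(bg, frame)
print(mask.sum())                   # 4 foreground pixels
bg = update_background(bg, frame)   # background slowly absorbs the change
```

The sensitivity to lighting mentioned above is visible here: a global brightness shift of more than `threshold` would flag the whole frame as foreground.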
Concerning object tracking, existing methods are classified into two main approaches: single object and multiple object tracking algorithms. In comparison, multiple object tracking has to cope with more challenges than single object tracking because of the sudden appearance and disappearance of objects. Numerous trackers belong to the former approach, such as on-line boosting trackers [16]–[18], the tracking-learning-detection (TLD) tracker [19], and the Kernelized Correlation Filter (KCF) [20]. However, most of these trackers mainly focus on the local position of bounding boxes and motion information. Recently, several novel algorithms have been introduced that integrate appearance features extracted on each bounding box to improve tracking performance. In order to overcome difficulties in multiple object tracking, Wojke et al. [21] proposed a novel framework that incorporates an object detector, a Kalman filter as the base tracker, and a data association method to take advantage of both detection and tracking tasks.
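The tracking-by-detection loop underlying such frameworks can be sketched as follows. This is a deliberately simplified greedy IoU association; DeepSORT additionally uses a Kalman filter for motion prediction, appearance features, and Hungarian matching:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def associate(tracks, detections, iou_min=0.5):
    """Greedily match each track to the best-overlapping unused detection.

    Returns (matches: track_id -> detection index, unmatched detection
    indices).  Unmatched detections would start new tracks; track
    termination is omitted for brevity.
    """
    matches, unmatched = {}, list(range(len(detections)))
    for tid, box in tracks.items():
        best, best_iou = None, iou_min
        for j in unmatched:
            s = iou(box, detections[j])
            if s > best_iou:
                best, best_iou = j, s
        if best is not None:
            matches[tid] = best
            unmatched.remove(best)
    return matches, unmatched

tracks = {1: (0, 0, 10, 10)}                    # one existing track
dets = [(1, 1, 11, 11), (50, 50, 60, 60)]       # one overlap, one new object
m, u = associate(tracks, dets)
print(m, u)                                      # {1: 0} [1]
```

Greedy matching can make globally suboptimal assignments when boxes overlap heavily, which is one reason SORT/DeepSORT use the Hungarian algorithm instead.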
Above, we have introduced several research works related to human detection and tracking; however, few works provide a comprehensive evaluation of the coupling of human detection and tracking. This motivates our study to conduct extensive experiments to assess the effectiveness of both the human detection and tracking components in a realistic camera system.
III. FRAMEWORK FOR PERSON DETECTION AND TRACKING
The purpose of our work is to provide a comprehensive evaluation of the performance of a human detection and tracking system in online tracking applications. Figure 1 shows a common framework for human detection and tracking in videos. Among the different methods proposed for person detection, we select two state-of-the-art object detection methods, You Only Look Once (YOLO) [13] and Mask R-CNN [15]. We couple these detection methods with one popular online tracking method, DeepSORT [21]. In the following sections, we briefly describe the person detection and tracking methods.
A. Pedestrian detection methods
1) You Only Look Once (YOLO): YOLO [13] is a single-shot detector, in the same family as SSD [14], in which DarkNet [22] is employed as the backbone for feature extraction. Up to now, the YOLO network has had three versions, namely YOLOv1, YOLOv2, and YOLOv3. In comparison with the earlier versions, YOLOv3 achieves higher computational speed. Furthermore, thanks to a more sophisticated structure with pyramid features, YOLOv3 is able to detect small objects. Because of these advantages, we utilize YOLOv3 and one of its variants, YOLOv3-tiny, in our study.
2) Mask R-CNN: Mask R-CNN is developed from Faster R-CNN [15]; a difference between the two networks is that Mask R-CNN simultaneously generates a bounding box and a corresponding mask for each detected object. It is worth noting that the major contribution in the structure of Faster R-CNN is the incorporation of an object proposal generator into the detection network. In this way, convolutional features are shared not only between the object proposals but also between the object proposal and detection networks, leading to a significant computation cost reduction and a mean Average Precision (mAP) gain.
B. Pedestrian tracking methods
DeepSORT is an improved version of Simple Online and Realtime Tracking (SORT) [21], which is based on the Kalman filter [23]. The advantage of SORT is that it achieves not only high speed but also high performance. However, a drawback of this algorithm is that it generates numerous identity switch (IDSW) errors when an occlusion appears or objects cross each other. DeepSORT tackles this problem by adding a deep network, trained on a large dataset, that extracts appearance features for person representation. The obtained results indicate that DeepSORT significantly reduces ID switch errors while maintaining real-time response in a realistic system. Different from SORT, which uses the IoU ratios between detected boxes as elements of the cost matrix in data association, DeepSORT employs the following measurement:

c_{i,j} = λ · d^{(1)}(i, j) + (1 − λ) · d^{(2)}(i, j),    (1)

where c_{i,j} is the similarity between the i-th track and the j-th bounding box detection; d^{(1)}(i, j) and d^{(2)}(i, j) are two metrics calculated based on motion and appearance information, respectively. While d^{(1)}(i, j) is calculated based on the Mahalanobis distance, d^{(2)}(i, j) is the smallest cosine distance between the i-th track and the j-th bounding box detection in the appearance space; the hyperparameter λ controls this association.
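Eq. (1) can be illustrated with a toy NumPy sketch. The distance matrices below are made-up values, not actual Mahalanobis or cosine distances, and the per-row argmin stands in for DeepSORT's real matching cascade:

```python
import numpy as np

lam = 0.5                       # hyperparameter λ in Eq. (1), toy value
d1 = np.array([[0.2, 0.9],      # motion (Mahalanobis-like) distances;
               [0.8, 0.1]])     # rows: tracks i, columns: detections j
d2 = np.array([[0.1, 0.8],      # appearance (cosine-like) distances
               [0.9, 0.2]])

c = lam * d1 + (1 - lam) * d2   # combined cost matrix c_{i,j}
assignment = c.argmin(axis=1)   # cheapest detection for each track
print(c)
print(assignment)               # [0 1]
```

With λ = 0.5 the two cues are weighted equally; λ → 1 recovers a purely motion-based association closer to SORT.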
IV. EXPERIMENTAL RESULTS
A. Datasets
The Multiple Object Tracking (MOT) Challenge datasets are built to provide researchers with a common benchmark for proving the effectiveness of their own tracking methods. In our work, we use the MOT17 Challenge dataset, which has 14 videos with different characteristics in terms of frame rate, pedestrian density, illumination condition, and point of view. Half of them are used for training and the rest for testing. As the MOT17 testing set aims at evaluating the tracking method while fixing Mask R-CNN as the person detection method, to evaluate the coupling of different detection methods with the tracking method, in this paper we use the 7 videos from the MOT17 training set.
In addition, we have captured our own dataset, COMVIS MICA, containing three video sequences captured by two static cameras in two environments (indoor and outdoor), named indoor, outdoor easy, and outdoor hard. These videos are annotated using the LabelImg tool.
Fig. 1. Framework for evaluating the human detection and tracking phases in a fully-automatic system. Green, red, and blue bounding boxes indicate the results obtained when applying YOLOv3-tiny, YOLOv3, and Mask R-CNN, respectively.
B. Evaluation measures
Evaluating the performance of a human detector:
We employ Precision (Prcn) and Recall (Rcll) to evaluate the detection performance. These two metrics are defined as follows:

Prcn = TP / (TP + FP),    Rcll = TP / (TP + FN),    (2)

where TP, FP, and FN are the numbers of True Positives, False Positives, and False Negatives, respectively. A detected box is determined to be a TP if it has IoU ≥ 0.5, where IoU is the ratio of Intersection over Union between the detected bounding box and its corresponding ground truth.
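A minimal sketch of this detector evaluation, with greedy one-to-one matching at IoU ≥ 0.5 followed by Eq. (2); the boxes are toy values, and a full benchmark would use optimal rather than greedy matching:

```python
def iou(a, b):
    """Intersection over Union of boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def precision_recall(detections, ground_truth, iou_thr=0.5):
    """Each ground-truth box may absorb at most one detection as a TP."""
    unmatched_gt = list(ground_truth)
    tp = 0
    for d in detections:
        hit = next((g for g in unmatched_gt if iou(d, g) >= iou_thr), None)
        if hit is not None:
            tp += 1
            unmatched_gt.remove(hit)
    fp = len(detections) - tp        # detections with no ground truth
    fn = len(unmatched_gt)           # ground truths with no detection
    return tp / (tp + fp), tp / (tp + fn)

gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(1, 1, 11, 11), (100, 100, 110, 110)]   # one hit, one false alarm
prcn, rcll = precision_recall(dets, gt)
print(prcn, rcll)   # 0.5 0.5
```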
Evaluating the performance of a tracker:
Several metrics have been proposed to evaluate object tracking methods; in this paper, we employ the metrics used in [1].
• IDP (ID Precision) and IDR (ID Recall): The meaning of these two metrics is the same as that of Precision (Prcn) and Recall (Rcll) for evaluating a detector, but they are outcomes of tracking. They are calculated based on the values of ID True/False Positive/Negative as follows:

IDP = IDTP / (IDTP + IDFP),    IDR = IDTP / (IDTP + IDFN),    (3)

where IDTP is the number of detection true positives that are also correctly labeled in tracking, and IDFP/IDFN are the detection FPs/FNs plus the objects correctly detected but incorrectly labeled in tracking.
• IDF1: This metric is formulated from IDP and IDR as in Eq. (4). The higher the IDF1, the better the tracker.

IDF1 = 2 × IDP × IDR / (IDP + IDR)    (4)

• ID switch (IDs): The number of identity switches over all tracklets, i.e., the number of times a tracked object changes its assigned label (e.g., when several individuals are assigned the same label).
• Fragment (FM): The total number of switches from tracked to not tracked, i.e., the number of times a ground-truth trajectory is interrupted.
• MOTA (Multiple Object Tracking Accuracy): This is the most important metric for object tracking evaluation. MOTA is defined as:

MOTA = 1 − Σ_t (FN_t + FP_t + IDs_t) / Σ_t GT_t,    (5)

where t is the frame index and GT_t is the number of ground-truth objects in frame t. It is worth noting that MOTA becomes negative if the tracking process produces more errors than there are observed objects.
• MOTP (Multiple Object Tracking Precision): MOTP is defined as the average distance between all true positives and their corresponding ground-truth targets:

MOTP = Σ_{t,i} d_{t,i} / Σ_t c_t,    (6)

where c_t denotes the number of matches found in frame t and d_{t,i} is the distance between true positive i and its corresponding ground truth in frame t. This metric indicates the ability of the tracker to estimate precise object positions.
• Track quality measures: Trajectories recovered by a tracking algorithm can be categorized into three kinds: mostly tracked (MT), partially tracked (PT), and mostly lost (ML). A target is mostly tracked if it is tracked for at least 80% of the total length of its ground-truth trajectory. If a track is covered for less than 20%, it is called mostly lost. The remaining cases are classified as partially tracked.
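The metrics above reduce to simple arithmetic once the per-frame tallies are available; the sketch below uses made-up counts, and the hard part, the matching that produces these tallies, is omitted:

```python
def idf1(idtp, idfp, idfn):
    """IDF1 from the identity tallies, Eqs. (3) and (4)."""
    idp = idtp / (idtp + idfp)
    idr = idtp / (idtp + idfn)
    return 2 * idp * idr / (idp + idr)

def mota(fn_per_frame, fp_per_frame, ids_per_frame, gt_per_frame):
    """MOTA from per-frame error counts, Eq. (5)."""
    errors = sum(fn_per_frame) + sum(fp_per_frame) + sum(ids_per_frame)
    return 1 - errors / sum(gt_per_frame)

def track_quality(covered, total):
    """MT/PT/ML classification from the covered fraction of a trajectory."""
    ratio = covered / total
    return "MT" if ratio >= 0.8 else "ML" if ratio < 0.2 else "PT"

print(round(idf1(80, 20, 20), 2))               # 0.8
print(mota([5, 5], [2, 3], [1, 0], [20, 20]))   # 16 errors / 40 GT -> 0.6
print(track_quality(90, 100))                    # MT
```

Note how MOTA can go negative: with more than 40 errors on the same 40 ground-truth objects, the returned value drops below zero.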
C. Experimental results and discussion
In this section, we present several experimental results on the MOT17 and COMVIS MICA datasets.
In order to observe the behavior of person detection and tracking, we classify the 7 video sequences of MOT17 into two main groups: (1) static cameras (02, 04, 09) and (2) moving cameras (05, 10, 11, 13). All experiments are conducted on a Supermicro workstation with an Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10 GHz (6 cores, 12 threads), 12 GB RAM, and a GTX 1080 GPU. Our framework is based on Keras with a TensorFlow backend, Ubuntu 18, and Python 3. The parameters in our experiments are as follows: the size of input images is 1920×1080, detect_freq = 2, down_sample_ratio = 1, and IoU_threshold = 0.5.
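To clarify the detect_freq parameter: the detector runs only on every detect_freq-th frame, and the tracker propagates boxes in between. A schematic sketch with hypothetical placeholder functions (detect and predict stand in for the real detector and tracker, which are not shown):

```python
detect_freq = 2   # run the (expensive) detector on every 2nd frame only

def detect(frame_idx):
    """Placeholder for YOLO / Mask R-CNN inference (hypothetical)."""
    return [("det", frame_idx)]

def predict(tracks):
    """Placeholder for the tracker's motion prediction (hypothetical)."""
    return [("pred",) + t[1:] for t in tracks]

log, tracks = [], []
for frame_idx in range(4):
    if frame_idx % detect_freq == 0:
        tracks = detect(frame_idx)   # refresh tracks from fresh detections
        log.append("detect")
    else:
        tracks = predict(tracks)     # coast on the motion model in between
        log.append("predict")
print(log)   # ['detect', 'predict', 'detect', 'predict']
```

This is how a slow detector such as Mask R-CNN can still feed an online tracker: the per-frame cost is amortized over detect_freq frames.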
1) Overall evaluation of person detection and tracking: In this study, we conduct experiments on the two datasets with three detectors (YOLOv3-tiny, YOLOv3, and Mask R-CNN) and one tracker (DeepSORT). The obtained results are shown in Tables I–III. Concerning person detection performance, among the three chosen methods, Mask R-CNN outperforms both YOLOv3-tiny and YOLOv3 on both datasets in terms of the Recall metric. The average Recall obtained by YOLOv3-tiny, YOLOv3, and Mask R-CNN is 16.50%, 41.3%, and 49.5% on MOT17, and 81.3%, 94.7%, and 98.1% on COMVIS MICA, respectively. However, the Precision obtained when using YOLOv3 and Mask R-CNN is slightly lower than that of YOLOv3-tiny. This comes from the fact that YOLOv3 and Mask R-CNN can detect even small-size objects. In this case, these methods may help to reduce the number of missed detections. However, some of the detected objects are not humans; that is why these methods produce more false alarms than YOLOv3-tiny.
Among the 7 videos of MOT17, the best results are obtained for MOT17-09 and MOT17-05. This is explained by the characteristics of these videos. The MOT17-09 video is captured in a central hall (indoor) with a close view containing 26 pedestrians, while the other videos are captured outdoors with large views (e.g., a large square in MOT17-02 and a crowded scene in MOT17-04).
It is also interesting to see that when working with a
challenging dataset like MOT17, the performance of the three
detection methods varies a lot. However, with a less challeng-
ing dataset such as COMVIS MICA, the difference between
the performance of these methods is not so significant.
We may also observe the influence of person detection quality on the person tracking method. Two metrics serve as the most important keys for evaluating tracking results, namely MOTA and MOTP. While MOTA evaluates the overall performance of a tracker, MOTP relates to the position dissimilarity between all true positives and their corresponding ground-truth targets. This means that a higher MOTA and a lower MOTP indicate a better tracker. Looking at the overall results, the coupling of Mask R-CNN with DeepSORT obtains the best results in terms of MOTA and MOTP. The MOTA obtained by YOLOv3 + DeepSORT and Mask R-CNN + DeepSORT exceeds that of the coupling of YOLOv3-tiny and DeepSORT by 13.7% and 16.4% on MOT17, and by 12.2% and 14.6% on COMVIS MICA, respectively.
2) Analysis of memory requirements and processing rate: This section evaluates the memory requirements and processing rate of the person detection and tracking methods. The results are shown in Table IV. We evaluate the three couplings of person detection and tracking in two cases: with and without a GPU. The results show that, among the three couplings, two (YOLOv3-tiny + DeepSORT and YOLOv3 + DeepSORT) can work without a GPU. Without a GPU, YOLOv3-tiny + DeepSORT requires half the memory and runs twice as fast as YOLOv3 + DeepSORT. When using a GPU, however, the requirements of these two couplings are quite similar. The processing rate of Mask R-CNN + DeepSORT, which achieves very good results in terms of person detection and tracking quality, is 2.5 Hz, while that of YOLOv3-tiny + DeepSORT and YOLOv3 + DeepSORT is 11.9 Hz and 11.1 Hz, respectively.
From the experimental results, three recommendations on the choice of person detection and tracking can be provided. First, the coupling of YOLOv3-tiny and DeepSORT is suggested for applications that cannot rely on a GPU workstation but require real-time processing, especially when the captured scene is not too complex (e.g., surveillance applications in an office). If the complexity of the scene increases, YOLOv3 and DeepSORT can be employed instead. Second, when a GPU is available, YOLOv3 and DeepSORT is still a good choice because of the trade-off between detection and tracking quality and processing time. Finally, in applications where the scene is relatively complex and detection and tracking are not required for every incoming frame, Mask R-CNN is recommended.
Figure 2 shows example results of human detection and tracking on the COMVIS MICA dataset when applying Mask R-CNN for detection and DeepSORT for tracking. Fig. 2a) shows a correct result in a simple context, while Fig. 2b) shows a fragment error when an occlusion appears.
V. CONCLUSION
In this paper, we have performed several experiments on the MOT17 Challenge and COMVIS MICA datasets to provide an exhaustive evaluation of the performance of the human detection and tracking components in a visual surveillance camera network. The experimental results allow us to provide suggestions for the choice of person detection and tracking methods in online tracking applications. However, due to time limitations, only one tracking method (DeepSORT) has been evaluated. In the future, we will perform evaluations with other person tracking methods.
ACKNOWLEDGMENT
This research is funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.01-2017.315.
REFERENCES
[1] “The multiple object tracking benchmark,” https://motchallenge.net.
[2] M. Paul, S. M. E. Haque, and S. Chakraborty, “Human detection in
surveillance videos and its applications - a review,” EURASIP Journal
on Advances in Signal Processing, vol. 2013, no. 1, p. 176, Nov 2013.
[Online]. Available: https://doi.org/10.1186/1687-6180-2013-176
[3] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection:
An evaluation of the state of the art,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, April
2012.
[4] E. Moussy, A. A. Mekonnen, G. Marion, and F. Lerasle, “A comparative
view on exemplar tracking-by-detection approaches,” in 2015 12th
IEEE International Conference on Advanced Video and Signal Based
Surveillance (AVSS), Aug 2015, pp. 1–6.
TABLE I
PERFORMANCE ON SEVERAL VIDEOS OF THE MOT17 AND COMVIS MICA DATASETS WHEN EMPLOYING YOLOV3-TINY AS A DETECTOR AND DeepSORT AS A TRACKER. THE TWO BEST RESULTS FOR EACH CASE OF MOT17 ARE IN BOLD.
Videos For evaluating a detector (1) For evaluating a tracker (2)
FP↓FN↓Rcll(%)↑Prcn(%)↑GT MT↑PT↑ML↓IDF1(%)↑IDP(%)↑IDR(%)↑IDs↓FM↓MOTA(%)↑MOTP↓
Static camera
MOT17-02 1105 15994 13.90 70.10 62 3 10 49 12.10 36.30 7.20 83 151 7.50 0.32
MOT17-04 1616 43331 8.90 72.30 83 0 13 70 9.40 42.90 5.30 94 366 5.30 0.30
MOT17-09 415 3165 40.60 83.90 26 0 21 5 30.10 46.20 22.30 102 192 30.90 0.30
Moving camera
MOT17-05 620 3969 42.60 82.60 133 7 71 55 40.70 59.80 30.80 151 277 31.50 0.31
MOT17-10 1148 10623 17.30 65.90 57 3 14 40 16.30 39.20 10.30 144 271 7.20 0.32
MOT17-11 551 5679 39.80 87.20 75 5 25 45 26.70 42.50 19.40 103 249 32.90 0.28
MOT17-13 329 10961 5.80 67.40 110 1 12 97 8.00 49.90 4.30 89 143 2.30 0.32
OVERALL for MOT17 - - 16.50 76.30 - - - - 15.80 44.50 9.60 - - 10.70 0.31
indoor 60 220 80.9 93.9 7 3 4 0 84.0 90.8 78.2 7 30 75.0 0.248
outdoor easy 57 269 89.5 97.6 7 6 1 0 66.0 68.9 63.3 14 35 86.7 0.226
outdoor hard 405 1428 78.2 92.7 20 13 7 0 71.4 78.0 65.8 49 115 71.3 0.300
OVERALL for COMVIS MICA - - 81.3 94.1 - - - - 71.4 77.0 66.6 - - 75.6 0.274
TABLE II
PERFORMANCE ON VIDEOS OF MOT17 AND COMVIS MICA WHEN EMPLOYING YOLOV3 AS A DETECTOR AND DeepSORT AS A TRACKER. THE TWO BEST RESULTS FOR EACH CASE ARE IN BOLD.
Videos For evaluating a detector (1) For evaluating a tracker (2)
FP↓FN↓Rcll(%)↑Prcn(%)↑GT MT↑PT↑ML↓IDF1(%)↑IDP(%)↑IDR(%)↑IDs↓FM↓MOTA(%)↑MOTP↓
Static camera
MOT17-02 2936 12735 31.50 66.60 62 7 23 32 29.50 45.90 21.70 138 254 14.90 0.28
MOT17-04 5463 29825 37.30 76.40 83 8 41 34 34.60 52.80 25.80 257 608 25.30 0.26
MOT17-09 864 2077 61.00 79.00 26 5 17 4 44.40 50.90 39.30 79 106 43.30 0.26
Moving camera
MOT17-05 1660 2613 62.20 72.20 133 29 79 25 46.80 50.50 43.60 181 240 35.60 0.29
MOT17-10 2808 6953 45.80 67.70 57 7 30 20 33.20 41.10 27.80 300 503 21.60 0.29
MOT17-11 1694 3856 59.10 76.70 75 16 24 35 46.40 53.30 41.10 63 90 40.50 0.22
MOT17-13 2124 7830 32.70 64.20 110 7 54 49 29.50 43.70 22.30 459 674 10.60 0.32
OVERALL for MOT17 - - 41.30 72.60 - - - - 35.70 49.10 28.00 - - 24.40 0.27
indoor 86 53 95.4 92.7 7 7 0 0 86.7 85.4 87.9 4 14 87.6 0.260
outdoor easy 61 66 97.4 97.6 7 7 0 0 74.8 74.9 74.7 5 20 94.8 0.202
outdoor hard 518 430 93.4 92.2 20 19 1 0 76.6 76.1 77.1 30 65 85.1 0.277
OVERALL for COMVIS MICA - - 94.7 93.6 - - - - 77.3 76.9 77.8 - - 87.8 0.256
TABLE III
PERFORMANCE ON SEVERAL VIDEOS OF MOT17 AND COMVIS MICA WHEN USING MASK R-CNN FOR DETECTION AND DeepSORT FOR TRACKING. THE TWO BEST RESULTS FOR EACH CASE ARE IN BOLD.
Videos For evaluating a detector (1) For evaluating a tracker (2)
FP↓FN↓Rcll(%)↑Prcn(%)↑GT MT↑PT↑ML↓IDF1(%)↑IDP(%)↑IDR(%)↑IDs↓FM↓MOTA(%)↑MOTP↓
Static camera
MOT17-02 4206 11140 40.00 63.90 62 8 29 25 33.00 42.90 26.90 231 309 16.20 0.27
MOT17-04 4228 25709 45.90 83.80 83 10 44 29 43.80 61.90 33.90 271 730 36.50 0.22
MOT17-09 1827 1574 70.40 67.20 26 10 13 3 42.60 41.60 43.60 63 91 34.90 0.22
Moving camera
MOT17-05 2227 2420 65.00 66.90 133 38 74 21 47.30 48.00 46.70 189 245 30.10 0.28
MOT17-10 4705 5822 54.70 59.90 57 12 35 10 40.10 42.00 38.40 363 553 15.20 0.28
MOT17-11 3124 3345 64.60 66.10 75 21 26 28 51.90 52.50 51.30 50 102 30.90 0.21
MOT17-13 3234 6668 42.70 60.60 110 14 63 33 34.20 41.30 29.10 484 741 10.80 0.31
OVERALL for MOT17 - - 49.50 70.30 - - - - 41.60 50.30 35.50 - - 27.10 0.25
indoor 85 21 98.2 93.0 7 7 0 0 92.6 90.2 95.2 2 8 90.6 0.216
outdoor easy 121 28 98.9 95.4 7 7 0 0 97.2 95.4 98.8 1 9 94.1 0.166
outdoor hard 574 148 97.7 91.8 20 20 0 0 78.7 76.3 81.2 22 36 88.6 0.257
OVERALL for COMVIS MICA - - 98.1 92.8 - - - - 84.8 82.5 87.2 - - 90.2 0.229
TABLE IV
EVALUATION OF MEMORY REQUIREMENTS AND PROCESSING RATE FOR THE THREE COUPLINGS OF HUMAN DETECTION AND TRACKING METHODS.
Methods | RAM w/o GPU (MB) | RAM w/ GPU (MB) | GPU memory (MB) | FPS w/o GPU (Hz) | FPS w/ GPU (Hz)
YOLOv3-tiny + DeepSORT 700 3400 2489 7.00 14.9
YOLOv3 + DeepSORT 1300 5400 2897 2.0 11.1
Mask R-CNN + DeepSORT - 4178 4300 - 2.5
Fig. 2. Example results of human detection and tracking on the COMVIS MICA dataset in two cases: (a) correct tracking; (b) a fragment occurs when pedestrians pass each other. The detected boxes and their corresponding ground truth are marked with orange and blue bounding boxes, respectively.
[5] A. A. Mekonnen and F. Lerasle, “Comparative evaluations of selected
tracking-by-detection approaches,” IEEE Transactions on Circuits and
Systems for Video Technology, vol. 29, no. 4, pp. 996–1010, April 2019.
[6] J. Zhou and J. Hoang, “Real time robust human detection and tracking
system,” in 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’05)-Workshops. IEEE, 2005,
pp. 149–149.
[7] A. A. Malik, A. Khalil, and H. U. Khan, “Object detection and track-
ing using background subtraction and connected component labeling,”
International Journal of Computer Applications, vol. 75, no. 13, 2013.
[8] S. Haifeng and X. Chao, “Moving object detection based on background
subtraction of block updates,” in 2013 6th International Conference on
Intelligent Networks and Intelligent Systems (ICINIS). IEEE, 2013, pp.
51–54.
[9] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in International Conference on Computer Vision & Pattern
Recognition (CVPR’05), vol. 1. IEEE Computer Society, 2005, pp.
886–893.
[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan,
“Object detection with discriminatively trained part-based models,”
IEEE transactions on pattern analysis and machine intelligence, vol. 32,
no. 9, pp. 1627–1645, 2009.
[11] X. Wang, G. Hua, and T. X. Han, “Detection by detections: Non-
parametric detector adaptation for a video,” in 2012 IEEE Conference on
Computer Vision and Pattern Recognition. IEEE, 2012, pp. 350–357.
[12] V. Gajjar, A. Gurnani, and Y. Khandhediya, “Human detection and
tracking for video surveillance: A cognitive science approach,” in
Proceedings of the IEEE International Conference on Computer Vision,
2017, pp. 2805–2809.
[13] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,”
arXiv preprint arXiv:1804.02767, 2018.
[14] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C.
Berg, “Ssd: Single shot multibox detector,” in European conference on
computer vision. Springer, 2016, pp. 21–37.
[15] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
object detection with region proposal networks,” in Advances in neural
information processing systems, 2015, pp. 91–99.
[16] H. Grabner and H. Bischof, “On-line boosting and vision,” in 2006
IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR’06), vol. 1. IEEE, 2006, pp. 260–267.
[17] H. Grabner, C. Leistner, and H. Bischof, “Semi-supervised on-line
boosting for robust tracking,” in European conference on computer
vision. Springer, 2008, pp. 234–247.
[18] B. Babenko, M.-H. Yang, and S. Belongie, “Visual tracking with online
multiple instance learning,” in 2009 IEEE Conference on Computer
Vision and Pattern Recognition. IEEE, 2009, pp. 983–990.
[19] Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,”
IEEE transactions on pattern analysis and machine intelligence, vol. 34,
no. 7, pp. 1409–1422, 2011.
[20] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed track-
ing with kernelized correlation filters,” IEEE transactions on pattern
analysis and machine intelligence, vol. 37, no. 3, pp. 583–596, 2014.
[21] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime
tracking with a deep association metric,” in 2017 IEEE International
Conference on Image Processing (ICIP), Sep. 2017, pp. 3645–3649.
[22] J. Redmon, “Darknet: Open source neural networks in c,”
http://pjreddie.com/darknet/, 2013–2016.
[23] R. E. Kalman, “A new approach to linear filtering and prediction
problems,” Journal of basic Engineering, vol. 82, no. 1, pp. 35–45,
1960.