Comparative evaluation of human detection and
tracking approaches for online tracking applications
Hong-Quan Nguyen∗‡, Thuy-Binh Nguyen∗†, Tuan-Anh Nguyen, Thi-Lan Le, Thanh-Hai Vu§, Alexis Noe
Computer Vision Department, MICA International Research Institute,
Hanoi University of Science and Technology, Vietnam
Email: Thi-Lan.Le@mica.edu.vn
Faculty of Electrical and Electronics Engineering, University of Transport and Communications, Hanoi, Vietnam
Faculty of Information Technology, Viet-Hung Industrial University, Hanoi, Vietnam
§Mathematics Science Research Division, Viettel Group
School of Engineering in Physics, Applied Physics, Electronics & Materials Science, Grenoble Institute of Technology, France.
Abstract—Object detection and tracking in videos is an important problem in computer vision thanks to its wide applications in various video analysis scenarios. As a result, it has attracted huge interest from the scientific community. The majority of recent works follow the tracking-by-detection approach, which relies on a people detector to start, update, reinitialize, guide and terminate the trackers. Recent years have witnessed a significant advance in person detection and tracking performance. However, person detection and tracking are usually treated separately in recent works. The contributions of this paper are twofold. First, a comparative evaluation of the coupling of person detection and tracking methods for online tracking applications is conducted on two video datasets: MOT17, a benchmark dataset provided in the MOT Challenge [1], and our own dataset captured in a video surveillance context. For this, we investigate a popular online tracking method (DeepSORT) coupled with two state-of-the-art people detection methods, You Only Look Once (YOLO) and Mask R-CNN. Second, a deep analysis of the behavior of the person detection and tracking methods, in terms of both detection and tracking performance and resource requirements for practical applications, is given. The implementation of the framework and the dataset used in this paper will be made publicly available.
I. INTRODUCTION
The presence of cameras in our surroundings grows every day, allowing visual surveillance systems to be used in a wide range of domains such as security, health-care, etc. In these systems, pedestrian detection and tracking, which aims to estimate the state of each person (e.g., location, identity) over time, is an initial and crucial step. In the last decade, pedestrian detection and tracking has received a lot of attention as a research topic, which has resulted in a broad range of available techniques [2]. The majority of recent works follow the tracking-by-detection approach, which relies on a people detector to start, update, reinitialize, guide and terminate the trackers. Recent years have witnessed a significant advance in person detection and tracking performance. However, person detection and tracking are usually treated separately in recent works. Some previous works attempt to evaluate object detection [3] and object tracking [1], [4], [5]. The MOT Challenge has designed a common platform containing videos with object detection and tracking ground-truth as well as common evaluation metrics for object tracking evaluation. However, as this platform allows both batch mode (in which video frames from future time steps are also utilized to solve the data association problem) and online mode, a suggestion on the choice of these methods for online tracking applications is unavailable.
The contributions of this paper are twofold. First, a comparative evaluation of the coupling of person detection and tracking methods for online tracking applications is conducted on two datasets: MOT17, a benchmark dataset provided in the MOT Challenge [1], and our own dataset captured in a video surveillance context. For this, we investigate a popular online tracking method (DeepSORT) coupled with two state-of-the-art people detection methods, You Only Look Once (YOLO) and Mask R-CNN. Second, a deep analysis of the behavior of the person detection and tracking methods, in terms of both detection and tracking performance and resource requirements for practical applications, is given.
The rest of the paper is organized as follows. Section 2 discusses previous works related to human detection and tracking. Next, our framework is presented in Section 3. Then, an exhaustive evaluation of the performance of both human detection and tracking is given in Section 4. Conclusions and future work are presented in the last section.
II. RELATED WORK
In this section, we briefly discuss some prominent studies on human detection and tracking. To detect objects in videos, numerous previous works tried to build background models and determine the objects through background subtraction algorithms [6]–[8]. Even though some models for dynamic backgrounds have been proposed, background subtraction-based methods remain relatively sensitive to lighting variation. Another approach is to model person appearance using appearance cues such as color and texture and to employ a scanning window technique to determine the presence of a person in a given window [9]–[12]. In recent years, various deep learning networks have been proposed for object detection in general and human detection in particular, such as YOLO [13], SSD [14], and Mask R-CNN [15].
Concerning object tracking, existing methods are classified into two main approaches: single object and multiple object tracking algorithms. In comparison, multiple object tracking has to cope with more challenges than single object tracking because of the sudden appearance and disappearance of objects. Numerous trackers belong to the former approach, such as on-line boosting trackers [16]–[18], the tracking-learning-detection (TLD) tracker [19], and the Kernelized Correlation Filter (KCF) [20]. However, most of these trackers mainly focus on the local position of bounding boxes and motion information. Recently, several novel algorithms have been introduced that integrate appearance features extracted from each bounding box to improve tracking performance. In order to overcome the difficulties of multiple object tracking, Wojke et al. [21] proposed a framework that incorporates an object detector, a Kalman filter as the base tracker, and a data association method to take advantage of both the detection and tracking tasks.
As discussed above, several studies related to human detection and tracking have been introduced; however, few works provide a comprehensive evaluation of the coupling of human detection and tracking. This motivates our study, in which we conduct extensive experiments to assess the effectiveness of both the human detection and tracking components in a realistic camera system.
III. FRAMEWORK FOR PERSON DETECTION AND TRACKING
The purpose of our work is to provide a comprehensive evaluation of the performance of a human detection and tracking system in online tracking applications. Figure 1 shows a common framework for human detection and tracking in videos. Among the different methods proposed for person detection, we select two state-of-the-art object detection methods, You Only Look Once (YOLO) [13] and Mask R-CNN [15]. We couple these detection methods with a popular online tracking method, DeepSORT [21]. In the following sections, we briefly describe the person detection and tracking methods.
A. Pedestrian detection methods
1) You Only Look Once (YOLO): YOLO [13] is a single-shot detector, similar in spirit to SSD [14], in which DarkNet [22] is employed as the backbone for feature extraction. Up to now, the YOLO network has three versions, namely YOLOv1, YOLOv2, and YOLOv3. Compared with the earlier versions, YOLOv3 is reported to offer higher computational speed. Furthermore, thanks to its more elaborate structure with pyramid features, YOLOv3 is able to detect small objects. Because of these advantages, we utilize YOLOv3 and its lightweight variant, YOLOv3-tiny, in our study.
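As an illustration (not the authors' implementation), the sketch below shows how YOLOv3 person detection could be run with OpenCV's DNN module; the configuration/weight file names, the 416×416 input size, and the thresholds are assumptions.

```python
import cv2
import numpy as np

# Assumed local files; the official Darknet release provides yolov3.cfg / yolov3.weights.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
layer_names = net.getUnconnectedOutLayersNames()

def detect_persons(frame, conf_thresh=0.5, nms_thresh=0.4):
    """Return [x, y, w, h] boxes for the COCO 'person' class (class id 0)."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes, scores = [], []
    for output in net.forward(layer_names):
        for det in output:
            class_scores = det[5:]
            if np.argmax(class_scores) != 0:      # keep only the person class
                continue
            conf = float(class_scores[0])
            if conf < conf_thresh:
                continue
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            scores.append(conf)
    keep = cv2.dnn.NMSBoxes(boxes, scores, conf_thresh, nms_thresh)
    return [boxes[i] for i in np.array(keep).flatten()]
```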
2) Mask R-CNN: Mask R-CNN is developed from Faster R-CNN [15]; a difference between the two networks is that Mask R-CNN simultaneously generates a bounding box and a corresponding mask for each detected object. It is worth noting that the major contribution of the Faster R-CNN architecture is to incorporate an object proposal generator into the detection network. In this way, convolutional features are shared not only between the object proposals but also between the proposal and detection networks, leading to a large reduction in computation cost and a gain in mean Average Precision (mAP).
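For comparison, a hedged person-detection sketch with an off-the-shelf Mask R-CNN (here torchvision's pre-trained model, which is not necessarily the network used in our experiments; it requires torchvision >= 0.13) might look as follows; the score threshold is an assumption.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Pre-trained Mask R-CNN from torchvision, used here only as a stand-in detector.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_persons(frame_rgb, score_thresh=0.7):
    """Return (boxes, masks) for the COCO 'person' class (label 1 in torchvision)."""
    out = model([to_tensor(frame_rgb)])[0]
    keep = (out["labels"] == 1) & (out["scores"] >= score_thresh)
    return out["boxes"][keep].numpy(), out["masks"][keep].numpy()
```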
B. Pedestrian tracking methods
DeepSORT is an improved version of Simple Online and Realtime Tracking (SORT) [21], which is based on the Kalman filter [23]. The advantage of SORT is that it provides not only high speed but also high performance. However, a drawback of this algorithm is that it generates numerous identity switch (IDSW) errors when an occlusion appears or objects cross each other. DeepSORT tackles this problem by adding a deep network trained on a large dataset to extract appearance features for person representation. The obtained results indicate that DeepSORT significantly reduces ID switch errors while maintaining real-time response in a realistic system. Different from SORT, which uses the IoU ratios between detected boxes as elements of the cost matrix in data association, DeepSORT employs the following measurement metric:
c_{i,j} = \lambda\, d^{(1)}(i, j) + (1 - \lambda)\, d^{(2)}(i, j)    (1)
where c_{i,j} is the similarity between the i-th track and the j-th bounding box detection; d^{(1)}(i, j) and d^{(2)}(i, j) are the two metrics calculated based on motion and appearance information, respectively. While d^{(1)}(i, j) is calculated based on the Mahalanobis distance, d^{(2)}(i, j) is the smallest cosine distance between the i-th track and the j-th bounding box detection in the appearance space; the hyperparameter \lambda controls this association.
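To make Eq. (1) concrete, the following minimal sketch computes the combined cost matrix from pre-computed motion and appearance distance matrices; the value of lambda and the array contents are purely illustrative.

```python
import numpy as np

def combined_cost(d_motion, d_appearance, lam=0.5):
    """
    Eq. (1): c[i, j] = lam * d1(i, j) + (1 - lam) * d2(i, j)
    d_motion:     (num_tracks, num_detections) Mahalanobis distances
    d_appearance: (num_tracks, num_detections) smallest cosine distances
    """
    return lam * d_motion + (1.0 - lam) * d_appearance

# Illustrative usage with 3 tracks and 2 detections.
d1 = np.array([[0.2, 5.1], [3.3, 0.4], [7.0, 6.2]])
d2 = np.array([[0.1, 0.9], [0.8, 0.2], [0.7, 0.6]])
cost = combined_cost(d1, d2, lam=0.5)  # fed to the assignment (data association) step
```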
IV. EXPERIMENTAL RESULTS
A. Datasets
The Multiple Object Tracking (MOT) Challenge datasets are built to allow researchers to demonstrate the effectiveness of their own tracking methods. In our work, we use the MOT17 Challenge dataset, which has 14 videos with different characteristics in terms of frame rate, pedestrian density, illumination condition, and point of view. Half of them are used for training and the rest for testing. Since the MOT17 testing set aims at evaluating the tracking method while fixing Mask R-CNN as the person detection method, in this paper we use the 7 videos of the MOT17 training set in order to evaluate the coupling of different detection methods with the tracking method.
In addition, we have captured our own dataset, COMVIS MICA, containing three video sequences captured by two static cameras in two environments (indoor and outdoor), named indoor, outdoor easy, and outdoor hard. These videos are annotated using the LabelImg tool.
Fig. 1. Framework for evaluating the human detection and tracking phases in a fully-automatic system. Green, red, and blue bounding boxes indicate the results obtained when applying YOLOv3-tiny, YOLOv3, and Mask R-CNN, respectively.
B. Evaluation measures
Evaluating the performance of a human detector
We employ Precision (Prcn) and Recall (Rcll) to evaluate
the detection performance. These two metrics are defined as
follows:
Prcn = \frac{TP}{TP + FP}, \qquad Rcll = \frac{TP}{TP + FN},    (2)
where TP, FP, and FN are the numbers of True Positives, False Positives, and False Negatives, respectively. A detected box is counted as a TP if its IoU is at least 0.5, where IoU is the ratio of Intersection over Union between the detected bounding box and its corresponding ground truth.
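A minimal sketch of this detector-evaluation rule (greedy one-to-one matching of detections to ground truth at IoU >= 0.5) is given below; the [x1, y1, x2, y2] box format is an assumption.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def precision_recall(detections, ground_truth, iou_thresh=0.5):
    """Greedy one-to-one matching; returns (precision, recall)."""
    matched_gt, tp = set(), 0
    for det in detections:
        best_j, best_iou = None, iou_thresh
        for j, gt in enumerate(ground_truth):
            if j in matched_gt:
                continue
            o = iou(det, gt)
            if o >= best_iou:
                best_j, best_iou = j, o
        if best_j is not None:
            matched_gt.add(best_j)
            tp += 1
    fp = len(detections) - tp
    fn = len(ground_truth) - tp
    prcn = tp / float(tp + fp) if tp + fp else 0.0
    rcll = tp / float(tp + fn) if tp + fn else 0.0
    return prcn, rcll
```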
Evaluating the performance of a tracker
Several metrics have been proposed to evaluate object tracking methods; in this paper, we employ the metrics used in [1].
IDP (ID Precision) and IDR (ID Recall): These two metrics parallel Precision (Prcn) and Recall (Rcll) used to evaluate a detector, but they are computed on the tracking output. They are calculated from the ID True/False Positive/Negative counts as in Eq. (3):

IDP = \frac{IDTP}{IDTP + IDFP}, \qquad IDR = \frac{IDTP}{IDTP + IDFN},    (3)

where IDTP is the sum of detection TPs and the number of correctly labeled objects in tracking; IDFP/IDFN are the sums of detection FPs/FNs and the number of objects correctly predicted as positive by the detector but incorrectly labeled by the tracker.
IDF1: This metric is formulated from IDP and IDR as in Eq. (4). The higher the IDF1, the better the tracker.

IDF1 = \frac{2 \times IDP \times IDR}{IDP + IDR}    (4)
ID switch (IDs): The number of identity switches over all tracklets, i.e., the number of times the identity assigned to a tracked target changes.
Fragmentation (FM): The total number of times a trajectory switches from tracked to not tracked.
MOTA (Multiple Object Tracking Accuracy): This is the most important metric for object tracking evaluation (a small computation sketch for MOTA and IDF1 is given after this list). MOTA is defined as:

MOTA = 1 - \frac{\sum_t (\mathrm{IDFN}_t + \mathrm{IDFP}_t + \mathrm{IDs}_t)}{\sum_t \mathrm{GT}_t},    (5)

where t is the frame index and GT_t is the number of ground-truth objects observed in frame t. It is worth noting that MOTA becomes negative when the number of tracking errors exceeds the number of observed objects.
MOTP (Multiple Object Tracking Precision): MOTP is defined as the average distance between all true positives and their corresponding ground-truth targets:

MOTP = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t},    (6)

where c_t denotes the number of matches found in frame t and d_{t,i} is the distance between the i-th true positive in frame t and its corresponding ground truth. This metric indicates the ability of the tracker to estimate precise object positions.
Track quality measures: Trajectories recovered by a tracking algorithm can be categorized into three kinds: mostly tracked (MT), partially tracked (PT), and mostly lost (ML). A target is mostly tracked if it is tracked for at least 80% of the total length of its ground-truth trajectory; if it is covered for less than 20%, it is mostly lost; the remaining cases are partially tracked.
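The sketch below (a simplification that assumes the per-frame error counts and identity counts have already been accumulated by a matching procedure) shows how the headline MOTA and IDF1 values reduce to simple ratios over those counts, following Eqs. (3)-(5).

```python
def mota(fn_per_frame, fp_per_frame, idsw_per_frame, gt_per_frame):
    """Eq. (5): 1 - (misses + false positives + identity switches) / ground-truth objects."""
    errors = sum(fn_per_frame) + sum(fp_per_frame) + sum(idsw_per_frame)
    return 1.0 - errors / float(sum(gt_per_frame))

def idf1(idtp, idfp, idfn):
    """Eqs. (3)-(4): identity precision/recall and their harmonic mean."""
    idp = idtp / float(idtp + idfp) if idtp + idfp else 0.0
    idr = idtp / float(idtp + idfn) if idtp + idfn else 0.0
    return 2 * idp * idr / (idp + idr) if idp + idr else 0.0
```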
C. Experimental results and Discussions
In this section, we present experimental results on the MOT17 and COMVIS MICA datasets. In order to observe the behavior of person detection and tracking, we divide the 7 video sequences of MOT17 into two groups: (1) static cameras (02, 04, 09) and (2) moving cameras (05, 10, 11, 13). All experiments are conducted on a Supermicro workstation with an Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10 GHz (6 cores, 12 threads), 12 GB of RAM, and a GTX 1080 GPU. Our framework is based on Keras with a TensorFlow backend, running on Ubuntu 18 with Python 3. The parameters used in our experiments are as follows: input image size 1920×1080, detect_freq = 2, down_sample_ratio = 1, IoU threshold = 0.5.
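As an illustration of how these parameters could drive the online pipeline (our reading: detect_freq = 2 means the detector runs on every second frame, with the tracker predicting in between), a hedged sketch of the per-frame loop is given below; the detector and tracker objects and their method names are placeholders, not an actual API.

```python
# Hypothetical parameter set mirroring the values listed above.
PARAMS = {
    "input_size": (1920, 1080),
    "detect_freq": 2,        # run the detector on every 2nd frame (assumed meaning)
    "down_sample_ratio": 1,  # no spatial down-sampling
    "iou_threshold": 0.5,
}

def run_pipeline(frames, detector, tracker, params=PARAMS):
    """Online loop: detections periodically refresh the tracker, which outputs every frame."""
    trajectories = []
    for idx, frame in enumerate(frames):
        if idx % params["detect_freq"] == 0:
            detections = detector.detect(frame)   # person bounding boxes + scores
            tracker.update(detections, frame)     # data association + track management
        else:
            tracker.predict()                     # motion-only update between detections
        trajectories.append(tracker.current_tracks())
    return trajectories
```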
1) Overall evaluation of person detection and tracking:
In this study, we conduct experiments on the two datasets with three detectors (YOLOv3-tiny, YOLOv3, and Mask R-CNN) and one tracker (DeepSORT). The obtained results are shown in Tables I-III. Concerning person detection performance, among the three chosen methods Mask R-CNN outperforms both YOLOv3-tiny and YOLOv3 on both datasets in terms of Recall. The average Recall obtained by YOLOv3-tiny, YOLOv3, and Mask R-CNN is 16.5%, 41.3%, and 49.5% on MOT17 and 81.3%, 94.7%, and 98.1% on COMVIS MICA, respectively. However, the Precision obtained with YOLOv3 and Mask R-CNN is slightly lower than that of YOLOv3-tiny. This comes from the fact that YOLOv3 and Mask R-CNN can detect even small objects. These methods may therefore reduce missed detections, but some of the detected objects are not humans, which is why they produce more false alarms than YOLOv3-tiny.
Among the 7 videos of MOT17, the best results are obtained for MOT17-09 and MOT17-05. This is explained by the characteristics of these videos: MOT17-09 is captured in a central hall (indoor) with a close view containing 26 pedestrians, while the other videos are captured outdoors with large views (e.g., a large square in MOT17-02 and a crowded scene in MOT17-04).
It is also interesting to see that when working with a
challenging dataset like MOT17, the performance of the three
detection methods varies a lot. However, with a less challeng-
ing dataset such as COMVIS MICA, the difference between
the performance of these methods is not so significant.
We can also observe the influence of person detection quality on the person tracking method. Two metrics serve as the most important indicators for evaluating tracking results: MOTA and MOTP. While MOTA evaluates the overall performance of a tracker, MOTP measures the position dissimilarity between all true positives and their corresponding ground-truth targets. This means that a higher MOTA and a lower MOTP indicate a better tracker. Overall, the coupling of Mask R-CNN with DeepSORT obtains the best results in terms of MOTA and MOTP. The MOTA of YOLOv3 + DeepSORT and Mask R-CNN + DeepSORT exceeds that of YOLOv3-tiny + DeepSORT by 13.7% and 16.4% on MOT17 and by 12.2% and 14.6% on COMVIS MICA, respectively.
2) Analysis on memory requirement and processing rate:
This section evaluates the memory requirements and processing rates of the person detection and tracking methods. The results are shown in Table IV. We evaluate the three couplings of person detection and tracking in two cases: with and without a GPU. The results show that, among the three couplings, two (YOLOv3-tiny + DeepSORT and YOLOv3 + DeepSORT) can work without a GPU. Without a GPU, YOLOv3-tiny + DeepSORT requires half the memory and runs roughly twice as fast as YOLOv3 + DeepSORT. With a GPU, however, the requirements of these two couplings are quite similar. The processing rate of Mask R-CNN + DeepSORT, which achieves very good person detection and tracking quality, is 2.5 Hz, while those of YOLOv3-tiny + DeepSORT and YOLOv3 + DeepSORT are 11.9 Hz and 11.1 Hz, respectively.
From the experimental results, three recommendations on the choice of person detection and tracking methods can be made. First, the coupling of YOLOv3-tiny and DeepSORT is suggested for applications that cannot rely on a GPU workstation but require real-time processing, especially when the captured scene is not too complex (e.g., a surveillance application in an office). If the complexity of the scene increases, YOLOv3 and DeepSORT can be employed instead. Second, when a GPU is available, YOLOv3 and DeepSORT remain a good choice because of the trade-off between detection and tracking quality and processing time. Finally, in applications where the scene is relatively complex and detection and tracking are not required for every incoming frame, Mask R-CNN is recommended.
Figure 2 shows an example of the results obtained for human detection and tracking on the COMVIS MICA dataset when applying Mask R-CNN for detection and DeepSORT for tracking. Fig. 2a shows a correct result in a simple context, while Fig. 2b shows a fragmentation error when an occlusion appears.
V. CONCLUSION
In this paper, we have performed several experiments on the MOT17 Challenge and COMVIS MICA datasets to provide an exhaustive evaluation of the performance of the human detection and tracking components in a visual surveillance camera network. The experimental results allow us to provide suggestions for the choice of person detection and tracking methods in online tracking applications. However, due to time limitations, only one tracking method (DeepSORT) has been evaluated. In the future, we will perform evaluations with other person tracking methods.
ACKNOWLEDGMENT
This research is funded by Vietnam National Foundation for
Science and Technology Development (NAFOSTED) under
grant number 102.01-2017.315
REFERENCES
[1] “The multiple object tracking benchmark,” https://motchallenge.net.
[2] M. Paul, S. M. E. Haque, and S. Chakraborty, “Human detection in
surveillance videos and its applications - a review,” EURASIP Journal
on Advances in Signal Processing, vol. 2013, no. 1, p. 176, Nov 2013.
[Online]. Available: https://doi.org/10.1186/1687-6180-2013-176
[3] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection:
An evaluation of the state of the art,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, April
2012.
[4] E. Moussy, A. A. Mekonnen, G. Marion, and F. Lerasle, “A comparative
view on exemplar tracking-by-detection approaches,” in 2015 12th
IEEE International Conference on Advanced Video and Signal Based
Surveillance (AVSS), Aug 2015, pp. 1–6.
TABLE I
PERFORMANCE ON SEVERAL VIDEOS OF THE MOT17 AND COMVIS MICA DATASETS WHEN EMPLOYING YOLOV3-TINY AS A DETECTOR AND DeepSORT AS A TRACKER. THE TWO BEST RESULTS FOR EACH CASE OF MOT17 ARE IN BOLD.
Videos  FP  FN  Rcll(%)  Prcn(%)  GT  MT  PT  ML  IDF1(%)  IDP(%)  IDR(%)  IDs  FM  MOTA(%)  MOTP
(Columns FP-Prcn(%) evaluate the detector (1); columns GT-MOTP evaluate the tracker (2).)
Static camera
MOT17-02 1105 15994 13.90 70.10 62 3 10 49 12.10 36.30 7.20 83 151 7.50 0.32
MOT17-04 1616 43331 8.90 72.30 83 0 13 70 9.40 42.90 5.30 94 366 5.30 0.30
MOT17-09 415 3165 40.60 83.90 26 0 21 5 30.10 46.20 22.30 102 192 30.90 0.30
Moving camera
MOT17-05 620 3969 42.60 82.60 133 7 71 55 40.70 59.80 30.80 151 277 31.50 0.31
MOT17-10 1148 10623 17.30 65.90 57 3 14 40 16.30 39.20 10.30 144 271 7.20 0.32
MOT17-11 551 5679 39.80 87.20 75 5 25 45 26.70 42.50 19.40 103 249 32.90 0.28
MOT17-13 329 10961 5.80 67.40 110 1 12 97 8.00 49.90 4.30 89 143 2.30 0.32
OVERALL for MOT17 - - 16.50 76.30 - - - - 15.80 44.50 9.60 - - 10.70 0.31
indoor 60 220 80.9 93.9 7 3 4 0 84.0 90.8 78.2 7 30 75.0 0.248
outdoor easy 57 269 89.5 97.6 7 6 1 0 66.0 68.9 63.3 14 35 86.7 0.226
outdoor hard 405 1428 78.2 92.7 20 13 7 0 71.4 78.0 65.8 49 115 71.3 0.300
OVERALL for COMVIS MICA - - 81.3 94.1 - - - - 71.4 77.0 66.6 - - 75.6 0.274
TABLE II
PERFORMANCE ON VIDEOS OF MOT17 AND COMVIS MICA WHEN EMPLOYING YOLOV3 AS A DETECTOR AND DeepSORT AS A TRACKER. THE TWO BEST RESULTS FOR EACH CASE ARE IN BOLD.
Videos  FP  FN  Rcll(%)  Prcn(%)  GT  MT  PT  ML  IDF1(%)  IDP(%)  IDR(%)  IDs  FM  MOTA(%)  MOTP
(Columns FP-Prcn(%) evaluate the detector (1); columns GT-MOTP evaluate the tracker (2).)
Static camera
MOT17-02 2936 12735 31.50 66.60 62 7 23 32 29.50 45.90 21.70 138 254 14.90 0.28
MOT17-04 5463 29825 37.30 76.40 83 8 41 34 34.60 52.80 25.80 257 608 25.30 0.26
MOT17-09 864 2077 61.00 79.00 26 5 17 4 44.40 50.90 39.30 79 106 43.30 0.26
Moving camera
MOT17-05 1660 2613 62.20 72.20 133 29 79 25 46.80 50.50 43.60 181 240 35.60 0.29
MOT17-10 2808 6953 45.80 67.70 57 7 30 20 33.20 41.10 27.80 300 503 21.60 0.29
MOT17-11 1694 3856 59.10 76.70 75 16 24 35 46.40 53.30 41.10 63 90 40.50 0.22
MOT17-13 2124 7830 32.70 64.20 110 7 54 49 29.50 43.70 22.30 459 674 10.60 0.32
OVERALL for MOT17 - - 41.30 72.60 - - - - 35.70 49.10 28.00 - - 24.40 0.27
indoor 86 53 95.4 92.7 7 7 0 0 86.7 85.4 87.9 4 14 87.6 0.260
outdoor easy 61 66 97.4 97.6 7 7 0 0 74.8 74.9 74.7 5 20 94.8 0.202
outdoor hard 518 430 93.4 92.2 20 19 1 0 76.6 76.1 77.1 30 65 85.1 0.277
OVERALL for COMVIS MICA - - 94.7 93.6 - - - - 77.3 76.9 77.8 - - 87.8 0.256
TABLE III
PERFORMANCE ON SEVERAL VIDEOS OF MOT17 AND COMVIS MICA WHEN USING MASK R-CNN FOR DETECTION AND DeepSORT FOR TRACKING. THE TWO BEST RESULTS FOR EACH CASE ARE IN BOLD.
Videos  FP  FN  Rcll(%)  Prcn(%)  GT  MT  PT  ML  IDF1(%)  IDP(%)  IDR(%)  IDs  FM  MOTA(%)  MOTP
(Columns FP-Prcn(%) evaluate the detector (1); columns GT-MOTP evaluate the tracker (2).)
Static cameras
MOT17-02 4206 11140 40.00 63.90 62 8 29 25 33.00 42.90 26.90 231 309 16.20 0.27
MOT17-04 4228 25709 45.90 83.80 83 10 44 29 43.80 61.90 33.90 271 730 36.50 0.22
MOT17-09 1827 1574 70.40 67.20 26 10 13 3 42.60 41.60 43.60 63 91 34.90 0.22
Moving camera
MOT17-05 2227 2420 65.00 66.90 133 38 74 21 47.30 48.00 46.70 189 245 30.10 0.28
MOT17-10 4705 5822 54.70 59.90 57 12 35 10 40.10 42.00 38.40 363 553 15.20 0.28
MOT17-11 3124 3345 64.60 66.10 75 21 26 28 51.90 52.50 51.30 50 102 30.90 0.21
MOT17-13 3234 6668 42.70 60.60 110 14 63 33 34.20 41.30 29.10 484 741 10.80 0.31
OVERALL for MOT17 - - 49.50 70.30 - - - - 41.60 50.30 35.50 - - 27.10 0.25
indoor 85 21 98.2 93.0 7 7 0 0 92.6 90.2 95.2 2 8 90.6 0.216
outdoor easy 121 28 98.9 95.4 7 7 0 0 97.2 95.4 98.8 1 9 94.1 0.166
outdoor hard 574 148 97.7 91.8 20 20 0 0 78.7 76.3 81.2 22 36 88.6 0.257
OVERALL for COMVIS MICA - - 98.1 92.8 - - - - 84.8 82.5 87.2 - - 90.2 0.229
TABLE IV
EVALUATION OF MEMORY REQUIREMENTS AND PROCESSING RATES FOR THE THREE COUPLINGS OF HUMAN DETECTION AND TRACKING METHODS.
Methods  RAM without GPU (MB)  RAM with GPU (MB)  GPU memory (MB)  FPS without GPU (Hz)  FPS with GPU (Hz)
YOLOv3-tiny + DeepSORT 700 3400 2489 7.00 14.9
YOLOv3 + DeepSORT 1300 5400 2897 2.0 11.1
Mask R-CNN + DeepSORT - 4178 4300 - 2.5
Fig. 2. An example of the obtained results for human detection and tracking on the COMVIS MICA dataset in two cases: (a) correct tracking; (b) a fragmentation occurs when pedestrians pass each other. The detected boxes and their corresponding ground truth are marked with orange and blue bounding boxes, respectively.
[5] A. A. Mekonnen and F. Lerasle, “Comparative evaluations of selected
tracking-by-detection approaches,” IEEE Transactions on Circuits and
Systems for Video Technology, vol. 29, no. 4, pp. 996–1010, April 2019.
[6] J. Zhou and J. Hoang, “Real time robust human detection and tracking
system,” in 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’05)-Workshops. IEEE, 2005,
pp. 149–149.
[7] A. A. Malik, A. Khalil, and H. U. Khan, “Object detection and track-
ing using background subtraction and connected component labeling,”
International Journal of Computer Applications, vol. 75, no. 13, 2013.
[8] S. Haifeng and X. Chao, “Moving object detection based on background
subtraction of block updates,” in 2013 6th International Conference on
Intelligent Networks and Intelligent Systems (ICINIS). IEEE, 2013, pp.
51–54.
[9] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in international Conference on computer vision & Pattern
Recognition (CVPR’05), vol. 1. IEEE Computer Society, 2005, pp.
886–893.
[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan,
“Object detection with discriminatively trained part-based models,”
IEEE transactions on pattern analysis and machine intelligence, vol. 32,
no. 9, pp. 1627–1645, 2009.
[11] X. Wang, G. Hua, and T. X. Han, “Detection by detections: Non-
parametric detector adaptation for a video,” in 2012 IEEE Conference on
Computer Vision and Pattern Recognition. IEEE, 2012, pp. 350–357.
[12] V. Gajjar, A. Gurnani, and Y. Khandhediya, “Human detection and
tracking for video surveillance: A cognitive science approach,” in
Proceedings of the IEEE International Conference on Computer Vision,
2017, pp. 2805–2809.
[13] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,”
arXiv preprint arXiv:1804.02767, 2018.
[14] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C.
Berg, “Ssd: Single shot multibox detector,” in European conference on
computer vision. Springer, 2016, pp. 21–37.
[15] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
object detection with region proposal networks,” in Advances in neural
information processing systems, 2015, pp. 91–99.
[16] H. Grabner and H. Bischof, “On-line boosting and vision,” in 2006
IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR’06), vol. 1. Ieee, 2006, pp. 260–267.
[17] H. Grabner, C. Leistner, and H. Bischof, “Semi-supervised on-line
boosting for robust tracking,” in European conference on computer
vision. Springer, 2008, pp. 234–247.
[18] B. Babenko, M.-H. Yang, and S. Belongie, “Visual tracking with online
multiple instance learning,” in 2009 IEEE Conference on Computer
Vision and Pattern Recognition. IEEE, 2009, pp. 983–990.
[19] Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,”
IEEE transactions on pattern analysis and machine intelligence, vol. 34,
no. 7, pp. 1409–1422, 2011.
[20] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed track-
ing with kernelized correlation filters,” IEEE transactions on pattern
analysis and machine intelligence, vol. 37, no. 3, pp. 583–596, 2014.
[21] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime
tracking with a deep association metric,” in 2017 IEEE International
Conference on Image Processing (ICIP), Sep. 2017, pp. 3645–3649.
[22] J. Redmon, “Darknet: Open source neural networks in c,”
http://pjreddie.com/darknet/, 2013–2016.
[23] R. E. Kalman, “A new approach to linear filtering and prediction
problems,” Journal of basic Engineering, vol. 82, no. 1, pp. 35–45,
1960.