Comparative evaluation of human detection and
tracking approaches for online tracking applications
Hong-Quan Nguyen∗‡, Thuy-Binh Nguyen∗†, Tuan-Anh Nguyen, Thi-Lan Le, Thanh-Hai Vu§, Alexis Noe
Computer Vision Department, MICA International Research Institute,
Hanoi University of Science and Technology, Vietnam
Email: Thi-Lan.Le@mica.edu.vn
Faculty of Electrical and Electronics Engineering, University of Transport and Communications, Hanoi, Vietnam
Faculty of Information Technology, Viet-Hung Industrial University, Hanoi, Vietnam
§Mathematics Science Research Division, Viettel Group
School of Engineering in Physics, Applied Physics, Electronics & Materials Science, Grenoble Institute of Technology, France.
Abstract—Object detection and tracking in videos is an important problem in computer vision thanks to its wide applications in various video analysis scenarios. As a result, it has attracted huge interest from the scientific community. The majority of recent works follow the tracking-by-detection approach, which relies on a people detector to start, update, reinitialize, guide and terminate the trackers. Recent years have witnessed a significant advance in person detection and tracking performance. However, person detection and tracking are usually treated separately in recent works. The contributions of this paper are twofold. First, a comparative evaluation of the coupling of person detection and tracking methods for online tracking applications is conducted on two video datasets: MOT17, a benchmark dataset provided in the MOT Challenge [1], and our own dataset captured in a video surveillance context. For this, we investigate a popular online tracking method (DeepSORT) coupled with two state-of-the-art people detection methods, You Only Look Once (YOLO) and Mask R-CNN. Second, a deep analysis of the behavior of the person detection and tracking methods, in terms of both detection and tracking performance and resource requirements for practical applications, is given. The implementation of the framework and the dataset used in this paper will be made publicly available.
I. INTRODUCTION
The presence of cameras in our surroundings grows every day, allowing visual surveillance systems to be used in a wide range of domains such as security, health-care, etc. In these systems, pedestrian detection and tracking, which aims to estimate the state of each person (e.g., location, identity) over time, is an initial and crucial step. In the last decade, pedestrian detection and tracking has received a lot of attention as a research topic, which has resulted in a broad range of available techniques [2]. The majority of recent works follow the tracking-by-detection approach, which relies on a people detector to start, update, reinitialize, guide and terminate the trackers. Recent years have witnessed a significant advance in person detection and tracking performance. However, person detection and tracking are usually treated separately in recent works. Some previous works attempt to evaluate object detection [3] and object tracking [1], [4], [5]. The MOT Challenge has designed a common platform containing videos with object detection and tracking ground-truth as well as common evaluation metrics for object tracking evaluation. However, as this platform allows both batch mode (in which video frames from future time steps are also utilized to solve the data association problem) and online mode, a suggestion on the choice of these methods for online tracking applications is unavailable.
The contributions of this paper are twofold. First, a comparative evaluation of the coupling of person detection and tracking methods for online tracking applications is conducted on two datasets: MOT17, a benchmark dataset provided in the MOT Challenge [1], and our own dataset captured in a video surveillance context. For this, we investigate a popular online tracking method (DeepSORT) coupled with two state-of-the-art people detection methods, You Only Look Once (YOLO) and Mask R-CNN. Second, a deep analysis of the behavior of the person detection and tracking methods, in terms of both detection and tracking performance and resource requirements for practical applications, is given.
The rest of the paper is organized as follows. Section 2 discusses previous works related to human detection and tracking. Next, our framework is presented in Section 3. Then, an exhaustive evaluation of the performance of both human detection and tracking is given in Section 4. Conclusions and future work are presented in the last section.
II. RELATED WORK
In this section, we briefly discuss some prominent studies on human detection and tracking. To detect objects in videos, numerous previous works tried to build background models and determine the objects through background subtraction algorithms [6]–[8]. Even though some models for dynamic backgrounds have been proposed, background subtraction-based methods remain relatively sensitive to lighting variation. Another approach is to model person appearance using appearance cues such as color and texture and to employ a scanning window technique to determine the presence of a person in a given window [9]–[12]. In recent years, various deep learning networks have been proposed for object detection in general and human detection in particular, such as YOLO [13], SSD [14], and Mask R-CNN [15].
Concerning object tracking, existing methods are classified into two main approaches: single object and multiple object tracking algorithms. In comparison, multiple object tracking has to cope with more challenges than single object tracking because of the sudden appearance and disappearance of objects. Numerous trackers belong to the former approach, such as on-line boosting trackers [16]–[18], the tracking-learning-detection (TLD) tracker [19], and the Kernelized Correlation Filter (KCF) [20]. However, most of these trackers mainly focus on the local position of bounding boxes and motion information. Recently, several novel algorithms have been introduced that integrate appearance features extracted from each bounding box to improve tracking performance. In order to overcome the difficulties of multiple object tracking, Wojke et al. [21] proposed a framework that incorporates an object detector, a Kalman filter as the base tracker, and a data association method to take advantage of both the detection and tracking tasks.
As discussed above, several studies related to human detection and tracking have been introduced; however, few works provide a comprehensive evaluation of the coupling of human detection and tracking. This motivates our study, in which we conduct extensive experiments to assess the effectiveness of both the human detection and tracking components in a realistic camera system.
III. FRAMEWORK FOR PERSON DETECTION AND TRACKING
The purpose of our work is to provide a comprehensive evaluation of the performance of a human detection and tracking system in online tracking applications. Figure 1 shows a common framework for human detection and tracking in videos. Among the different methods proposed for person detection, we select two state-of-the-art object detection methods, You Only Look Once (YOLO) [13] and Mask R-CNN [15]. We couple these detection methods with a popular online tracking method, DeepSORT [21]. In the following sections, we briefly describe the person detection and tracking methods.
A. Pedestrian detection methods
1) You Only Look Once (YOLO): YOLO [13] is a single-shot detector, similar in spirit to SSD [14], in which DarkNet [22] is employed as the backbone for feature extraction. Up to now, the YOLO network has three versions, namely YOLOv1, YOLOv2, and YOLOv3. Compared with the earlier versions, YOLOv3 is reported to offer higher computational speed. Furthermore, thanks to its more elaborate structure with pyramid features, YOLOv3 is able to detect small objects. Because of these advantages, we utilize YOLOv3 and its lightweight variant, YOLOv3-tiny, in our study.
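As an illustration (not the authors' implementation), the sketch below shows how YOLOv3 person detection could be run with OpenCV's DNN module; the configuration/weight file names, the 416×416 input size, and the thresholds are assumptions.

```python
import cv2
import numpy as np

# Assumed local files; the official Darknet release provides yolov3.cfg / yolov3.weights.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
layer_names = net.getUnconnectedOutLayersNames()

def detect_persons(frame, conf_thresh=0.5, nms_thresh=0.4):
    """Return [x, y, w, h] boxes for the COCO 'person' class (class id 0)."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes, scores = [], []
    for output in net.forward(layer_names):
        for det in output:
            class_scores = det[5:]
            if np.argmax(class_scores) != 0:      # keep only the person class
                continue
            conf = float(class_scores[0])
            if conf < conf_thresh:
                continue
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            scores.append(conf)
    keep = cv2.dnn.NMSBoxes(boxes, scores, conf_thresh, nms_thresh)
    return [boxes[i] for i in np.array(keep).flatten()]
```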
2) Mask R-CNN: Mask R-CNN is developed from Faster R-CNN [15]; a difference between the two networks is that Mask R-CNN simultaneously generates a bounding box and a corresponding mask for each detected object. It is worth noting that the major contribution of the Faster R-CNN architecture is to incorporate an object proposal generator into the detection network. In this way, convolutional features are shared not only between the object proposals but also between the proposal and detection networks, leading to a large reduction in computation cost and a gain in mean Average Precision (mAP).
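For comparison, a hedged person-detection sketch with an off-the-shelf Mask R-CNN (here torchvision's pre-trained model, which is not necessarily the network used in our experiments; it requires torchvision >= 0.13) might look as follows; the score threshold is an assumption.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Pre-trained Mask R-CNN from torchvision, used here only as a stand-in detector.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_persons(frame_rgb, score_thresh=0.7):
    """Return (boxes, masks) for the COCO 'person' class (label 1 in torchvision)."""
    out = model([to_tensor(frame_rgb)])[0]
    keep = (out["labels"] == 1) & (out["scores"] >= score_thresh)
    return out["boxes"][keep].numpy(), out["masks"][keep].numpy()
```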
B. Pedestrian tracking methods
DeepSORT is an improved version of Simple Online and Realtime Tracking (SORT) [21], which is based on the Kalman filter [23]. The advantage of SORT is that it provides not only high speed but also high performance. However, a drawback of this algorithm is that it generates numerous identity switch (IDSW) errors when an occlusion appears or objects cross each other. DeepSORT tackles this problem by adding a deep network trained on a large dataset to extract appearance features for person representation. The obtained results indicate that DeepSORT significantly reduces ID switch errors while maintaining real-time response in a realistic system. Different from SORT, which uses the IoU ratios between detected boxes as elements of the cost matrix in data association, DeepSORT employs the following measurement metric:
c_{i,j} = \lambda\, d^{(1)}(i, j) + (1 - \lambda)\, d^{(2)}(i, j)    (1)
where c_{i,j} is the similarity between the i-th track and the j-th bounding box detection; d^{(1)}(i, j) and d^{(2)}(i, j) are the two metrics calculated based on motion and appearance information, respectively. While d^{(1)}(i, j) is calculated based on the Mahalanobis distance, d^{(2)}(i, j) is the smallest cosine distance between the i-th track and the j-th bounding box detection in the appearance space; the hyperparameter \lambda controls this association.
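To make Eq. (1) concrete, the following minimal sketch computes the combined cost matrix from pre-computed motion and appearance distance matrices; the value of lambda and the array contents are purely illustrative.

```python
import numpy as np

def combined_cost(d_motion, d_appearance, lam=0.5):
    """
    Eq. (1): c[i, j] = lam * d1(i, j) + (1 - lam) * d2(i, j)
    d_motion:     (num_tracks, num_detections) Mahalanobis distances
    d_appearance: (num_tracks, num_detections) smallest cosine distances
    """
    return lam * d_motion + (1.0 - lam) * d_appearance

# Illustrative usage with 3 tracks and 2 detections.
d1 = np.array([[0.2, 5.1], [3.3, 0.4], [7.0, 6.2]])
d2 = np.array([[0.1, 0.9], [0.8, 0.2], [0.7, 0.6]])
cost = combined_cost(d1, d2, lam=0.5)  # fed to the assignment (data association) step
```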
IV. EXPERIMENTAL RESULTS
A. Datasets
The Multiple Object Tracking (MOT) Challenge datasets are built to allow researchers to demonstrate the effectiveness of their own tracking methods. In our work, we use the MOT17 Challenge dataset, which has 14 videos with different characteristics in terms of frame rate, pedestrian density, illumination condition, and point of view. Half of them are used for training and the rest for testing. Since the MOT17 testing set aims at evaluating the tracking method while fixing Mask R-CNN as the person detection method, in this paper we use the 7 videos of the MOT17 training set in order to evaluate the coupling of different detection methods with the tracking method.
In addition, we have captured our own dataset, COMVIS MICA, containing three video sequences captured by two static cameras in two environments (indoor and outdoor), named indoor, outdoor easy, and outdoor hard. These videos are annotated using the LabelImg tool.
Fig. 1. Framework for evaluating the human detection and tracking phases in a fully-automatic system. Green, red, and blue bounding boxes indicate the results obtained when applying YOLOv3-tiny, YOLOv3, and Mask R-CNN, respectively.
B. Evaluation measures
Evaluating the performance of a human detector
We employ Precision (Prcn) and Recall (Rcll) to evaluate
the detection performance. These two metrics are defined as
follows:
Prcn = \frac{TP}{TP + FP}, \qquad Rcll = \frac{TP}{TP + FN},    (2)
where TP, FP, and FN are the numbers of True Positives, False Positives, and False Negatives, respectively. A detected box is counted as a TP if its IoU is at least 0.5, where IoU is the ratio of Intersection over Union between the detected bounding box and its corresponding ground truth.
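A minimal sketch of this detector-evaluation rule (greedy one-to-one matching of detections to ground truth at IoU >= 0.5) is given below; the [x1, y1, x2, y2] box format is an assumption.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def precision_recall(detections, ground_truth, iou_thresh=0.5):
    """Greedy one-to-one matching; returns (precision, recall)."""
    matched_gt, tp = set(), 0
    for det in detections:
        best_j, best_iou = None, iou_thresh
        for j, gt in enumerate(ground_truth):
            if j in matched_gt:
                continue
            o = iou(det, gt)
            if o >= best_iou:
                best_j, best_iou = j, o
        if best_j is not None:
            matched_gt.add(best_j)
            tp += 1
    fp = len(detections) - tp
    fn = len(ground_truth) - tp
    prcn = tp / float(tp + fp) if tp + fp else 0.0
    rcll = tp / float(tp + fn) if tp + fn else 0.0
    return prcn, rcll
```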
Evaluating the performance of a tracker
Several metrics have been proposed to evaluate object tracking methods; in this paper, we employ the metrics used in [1].
IDP (ID Precision) and IDR (ID Recall): These two metrics parallel Precision (Prcn) and Recall (Rcll) used to evaluate a detector, but they are computed on the tracking output. They are calculated from the ID True/False Positive/Negative counts as in Eq. (3):

IDP = \frac{IDTP}{IDTP + IDFP}, \qquad IDR = \frac{IDTP}{IDTP + IDFN},    (3)

where IDTP is the sum of detection TPs and the number of correctly labeled objects in tracking; IDFP/IDFN are the sums of detection FPs/FNs and the number of objects correctly predicted as positive by the detector but incorrectly labeled by the tracker.
IDF1: This metric is formulated from IDP and IDR as in Eq. (4). The higher the IDF1, the better the tracker.

IDF1 = \frac{2 \times IDP \times IDR}{IDP + IDR}    (4)
ID switch (IDs): The number of identity switches over all tracklets, i.e., the number of times the identity assigned to a tracked target changes.
Fragmentation (FM): The total number of times a trajectory switches from tracked to not tracked.
MOTA (Multiple Object Tracking Accuracy): This is the most important metric for object tracking evaluation (a small computation sketch for MOTA and IDF1 is given after this list). MOTA is defined as:

MOTA = 1 - \frac{\sum_t (\mathrm{IDFN}_t + \mathrm{IDFP}_t + \mathrm{IDs}_t)}{\sum_t \mathrm{GT}_t},    (5)

where t is the frame index and GT_t is the number of ground-truth objects observed in frame t. It is worth noting that MOTA becomes negative when the number of tracking errors exceeds the number of observed objects.
MOTP (Multiple Object Tracking Precision): MOTP is defined as the average distance between all true positives and their corresponding ground-truth targets:

MOTP = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t},    (6)

where c_t denotes the number of matches found in frame t and d_{t,i} is the distance between the i-th true positive in frame t and its corresponding ground truth. This metric indicates the ability of the tracker to estimate precise object positions.
Track quality measures: Trajectories recovered by a tracking algorithm can be categorized into three kinds: mostly tracked (MT), partially tracked (PT), and mostly lost (ML). A target is mostly tracked if it is tracked for at least 80% of the total length of its ground-truth trajectory; if it is covered for less than 20%, it is mostly lost; the remaining cases are partially tracked.
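The sketch below (a simplification that assumes the per-frame error counts and identity counts have already been accumulated by a matching procedure) shows how the headline MOTA and IDF1 values reduce to simple ratios over those counts, following Eqs. (3)-(5).

```python
def mota(fn_per_frame, fp_per_frame, idsw_per_frame, gt_per_frame):
    """Eq. (5): 1 - (misses + false positives + identity switches) / ground-truth objects."""
    errors = sum(fn_per_frame) + sum(fp_per_frame) + sum(idsw_per_frame)
    return 1.0 - errors / float(sum(gt_per_frame))

def idf1(idtp, idfp, idfn):
    """Eqs. (3)-(4): identity precision/recall and their harmonic mean."""
    idp = idtp / float(idtp + idfp) if idtp + idfp else 0.0
    idr = idtp / float(idtp + idfn) if idtp + idfn else 0.0
    return 2 * idp * idr / (idp + idr) if idp + idr else 0.0
```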
C. Experimental results and Discussions
In this section, we present experimental results on the MOT17 and COMVIS MICA datasets. In order to observe the behavior of person detection and tracking, we divide the 7 video sequences of MOT17 into two groups: (1) static cameras (02, 04, 09) and (2) moving cameras (05, 10, 11, 13). All experiments are conducted on a Supermicro workstation with an Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10 GHz (6 cores, 12 threads), 12 GB of RAM, and a GTX 1080 GPU. Our framework is based on Keras with a TensorFlow backend, running on Ubuntu 18 with Python 3. The parameters used in our experiments are as follows: input image size 1920×1080, detect_freq = 2, down_sample_ratio = 1, IoU threshold = 0.5.
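As an illustration of how these parameters could drive the online pipeline (our reading: detect_freq = 2 means the detector runs on every second frame, with the tracker predicting in between), a hedged sketch of the per-frame loop is given below; the detector and tracker objects and their method names are placeholders, not an actual API.

```python
# Hypothetical parameter set mirroring the values listed above.
PARAMS = {
    "input_size": (1920, 1080),
    "detect_freq": 2,        # run the detector on every 2nd frame (assumed meaning)
    "down_sample_ratio": 1,  # no spatial down-sampling
    "iou_threshold": 0.5,
}

def run_pipeline(frames, detector, tracker, params=PARAMS):
    """Online loop: detections periodically refresh the tracker, which outputs every frame."""
    trajectories = []
    for idx, frame in enumerate(frames):
        if idx % params["detect_freq"] == 0:
            detections = detector.detect(frame)   # person bounding boxes + scores
            tracker.update(detections, frame)     # data association + track management
        else:
            tracker.predict()                     # motion-only update between detections
        trajectories.append(tracker.current_tracks())
    return trajectories
```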
1) Overall evaluation of person detection and tracking:
In this study, we conduct experiments on the two datasets with three detectors (YOLOv3-tiny, YOLOv3, and Mask R-CNN) and one tracker (DeepSORT). The obtained results are shown in Tables I-III. Concerning person detection performance, among the three chosen methods Mask R-CNN outperforms both YOLOv3-tiny and YOLOv3 on both datasets in terms of Recall. The average Recall obtained by YOLOv3-tiny, YOLOv3, and Mask R-CNN is 16.5%, 41.3%, and 49.5% on MOT17 and 81.3%, 94.7%, and 98.1% on COMVIS MICA, respectively. However, the Precision obtained with YOLOv3 and Mask R-CNN is slightly lower than that of YOLOv3-tiny. This comes from the fact that YOLOv3 and Mask R-CNN can detect even small objects. These methods may therefore reduce missed detections, but some of the detected objects are not humans, which is why they produce more false alarms than YOLOv3-tiny.
Among the 7 videos of MOT17, the best results are obtained for MOT17-09 and MOT17-05. This is explained by the characteristics of these videos: MOT17-09 is captured in a central hall (indoor) with a close view containing 26 pedestrians, while the other videos are captured outdoors with large views (e.g., a large square in MOT17-02 and a crowded scene in MOT17-04).
It is also interesting to see that when working with a
challenging dataset like MOT17, the performance of the three
detection methods varies a lot. However, with a less challeng-
ing dataset such as COMVIS MICA, the difference between
the performance of these methods is not so significant.
We can also observe the influence of person detection quality on the person tracking method. Two metrics serve as the most important indicators for evaluating tracking results: MOTA and MOTP. While MOTA evaluates the overall performance of a tracker, MOTP measures the position dissimilarity between all true positives and their corresponding ground-truth targets. This means that a higher MOTA and a lower MOTP indicate a better tracker. Overall, the coupling of Mask R-CNN with DeepSORT obtains the best results in terms of MOTA and MOTP. The MOTA of YOLOv3 + DeepSORT and Mask R-CNN + DeepSORT exceeds that of YOLOv3-tiny + DeepSORT by 13.7% and 16.4% on MOT17 and by 12.2% and 14.6% on COMVIS MICA, respectively.
2) Analysis on memory requirement and processing rate:
This section evaluates the memory requirements and processing rates of the person detection and tracking methods. The results are shown in Table IV. We evaluate the three couplings of person detection and tracking in two cases: with and without a GPU. The results show that, among the three couplings, two (YOLOv3-tiny + DeepSORT and YOLOv3 + DeepSORT) can work without a GPU. Without a GPU, YOLOv3-tiny + DeepSORT requires half the memory and runs roughly twice as fast as YOLOv3 + DeepSORT. With a GPU, however, the requirements of these two couplings are quite similar. The processing rate of Mask R-CNN + DeepSORT, which achieves very good person detection and tracking quality, is 2.5 Hz, while those of YOLOv3-tiny + DeepSORT and YOLOv3 + DeepSORT are 11.9 Hz and 11.1 Hz, respectively.
From the experimental results, three recommendations on the choice of person detection and tracking methods can be made. First, the coupling of YOLOv3-tiny and DeepSORT is suggested for applications that cannot rely on a GPU workstation but require real-time processing, especially when the captured scene is not too complex (e.g., a surveillance application in an office). If the complexity of the scene increases, YOLOv3 and DeepSORT can be employed instead. Second, when a GPU is available, YOLOv3 and DeepSORT remain a good choice because of the trade-off between detection and tracking quality and processing time. Finally, in applications where the scene is relatively complex and detection and tracking are not required for every incoming frame, Mask R-CNN is recommended.
Figure 2 shows an example of the results obtained for human detection and tracking on the COMVIS MICA dataset when applying Mask R-CNN for detection and DeepSORT for tracking. Fig. 2a shows a correct result in a simple context, while Fig. 2b shows a fragmentation error when an occlusion appears.
V. CONCLUSION
In this paper, we have performed several experiments on the MOT17 Challenge and COMVIS MICA datasets to provide an exhaustive evaluation of the performance of the human detection and tracking components in a visual surveillance camera network. The experimental results allow us to provide suggestions for the choice of person detection and tracking methods in online tracking applications. However, due to time limitations, only one tracking method (DeepSORT) has been evaluated. In the future, we will perform evaluations with other person tracking methods.
ACKNOWLEDGMENT
This research is funded by Vietnam National Foundation for
Science and Technology Development (NAFOSTED) under
grant number 102.01-2017.315
REFERENCES
[1] “The multiple object tracking benchmark,” https://motchallenge.net.
[2] M. Paul, S. M. E. Haque, and S. Chakraborty, “Human detection in
surveillance videos and its applications - a review,” EURASIP Journal
on Advances in Signal Processing, vol. 2013, no. 1, p. 176, Nov 2013.
[Online]. Available: https://doi.org/10.1186/1687-6180-2013-176
[3] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection:
An evaluation of the state of the art,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, April
2012.
[4] E. Moussy, A. A. Mekonnen, G. Marion, and F. Lerasle, “A comparative
view on exemplar tracking-by-detection approaches,” in 2015 12th
IEEE International Conference on Advanced Video and Signal Based
Surveillance (AVSS), Aug 2015, pp. 1–6.
TABLE I
PERFORMANCE ON SEVERAL VIDEOS OF THE MOT17 AND COMVIS MICA DATASETS WHEN EMPLOYING YOLOV3-TINY AS A DETECTOR AND DeepSORT AS A TRACKER. THE TWO BEST RESULTS FOR EACH CASE OF MOT17 ARE IN BOLD.
Videos  FP  FN  Rcll(%)  Prcn(%)  GT  MT  PT  ML  IDF1(%)  IDP(%)  IDR(%)  IDs  FM  MOTA(%)  MOTP
(Columns FP-Prcn(%) evaluate the detector (1); columns GT-MOTP evaluate the tracker (2).)
Static camera
MOT17-02 1105 15994 13.90 70.10 62 3 10 49 12.10 36.30 7.20 83 151 7.50 0.32
MOT17-04 1616 43331 8.90 72.30 83 0 13 70 9.40 42.90 5.30 94 366 5.30 0.30
MOT17-09 415 3165 40.60 83.90 26 0 21 5 30.10 46.20 22.30 102 192 30.90 0.30
Moving camera
MOT17-05 620 3969 42.60 82.60 133 7 71 55 40.70 59.80 30.80 151 277 31.50 0.31
MOT17-10 1148 10623 17.30 65.90 57 3 14 40 16.30 39.20 10.30 144 271 7.20 0.32
MOT17-11 551 5679 39.80 87.20 75 5 25 45 26.70 42.50 19.40 103 249 32.90 0.28
MOT17-13 329 10961 5.80 67.40 110 1 12 97 8.00 49.90 4.30 89 143 2.30 0.32
OVERALL for MOT17 - - 16.50 76.30 - - - - 15.80 44.50 9.60 - - 10.70 0.31
indoor 60 220 80.9 93.9 7 3 4 0 84.0 90.8 78.2 7 30 75.0 0.248
outdoor easy 57 269 89.5 97.6 7 6 1 0 66.0 68.9 63.3 14 35 86.7 0.226
outdoor hard 405 1428 78.2 92.7 20 13 7 0 71.4 78.0 65.8 49 115 71.3 0.300
OVERALL for COMVIS MICA - - 81.3 94.1 - - - - 71.4 77.0 66.6 - - 75.6 0.274
TABLE II
PERFORMANCE ON VIDEOS OF MOT17 AND COMVIS MICA WHEN EMPLOYING YOLOV3 AS A DETECTOR AND DeepSORT AS A TRACKER. THE TWO BEST RESULTS FOR EACH CASE ARE IN BOLD.
Videos  FP  FN  Rcll(%)  Prcn(%)  GT  MT  PT  ML  IDF1(%)  IDP(%)  IDR(%)  IDs  FM  MOTA(%)  MOTP
(Columns FP-Prcn(%) evaluate the detector (1); columns GT-MOTP evaluate the tracker (2).)
Static camera
MOT17-02 2936 12735 31.50 66.60 62 7 23 32 29.50 45.90 21.70 138 254 14.90 0.28
MOT17-04 5463 29825 37.30 76.40 83 8 41 34 34.60 52.80 25.80 257 608 25.30 0.26
MOT17-09 864 2077 61.00 79.00 26 5 17 4 44.40 50.90 39.30 79 106 43.30 0.26
Moving camera
MOT17-05 1660 2613 62.20 72.20 133 29 79 25 46.80 50.50 43.60 181 240 35.60 0.29
MOT17-10 2808 6953 45.80 67.70 57 7 30 20 33.20 41.10 27.80 300 503 21.60 0.29
MOT17-11 1694 3856 59.10 76.70 75 16 24 35 46.40 53.30 41.10 63 90 40.50 0.22
MOT17-13 2124 7830 32.70 64.20 110 7 54 49 29.50 43.70 22.30 459 674 10.60 0.32
OVERALL for MOT17 - - 41.30 72.60 - - - - 35.70 49.10 28.00 - - 24.40 0.27
indoor 86 53 95.4 92.7 7 7 0 0 86.7 85.4 87.9 4 14 87.6 0.260
outdoor easy 61 66 97.4 97.6 7 7 0 0 74.8 74.9 74.7 5 20 94.8 0.202
outdoor hard 518 430 93.4 92.2 20 19 1 0 76.6 76.1 77.1 30 65 85.1 0.277
OVERALL for COMVIS MICA - - 94.7 93.6 - - - - 77.3 76.9 77.8 - - 87.8 0.256
TABLE III
PERFORMANCE ON SEVERAL VIDEOS OF MOT17 AND COMVIS MICA WHEN USING MASK R-CNN FOR DETECTION AND DeepSORT FOR TRACKING. THE TWO BEST RESULTS FOR EACH CASE ARE IN BOLD.
Videos  FP  FN  Rcll(%)  Prcn(%)  GT  MT  PT  ML  IDF1(%)  IDP(%)  IDR(%)  IDs  FM  MOTA(%)  MOTP
(Columns FP-Prcn(%) evaluate the detector (1); columns GT-MOTP evaluate the tracker (2).)
Static cameras
MOT17-02 4206 11140 40.00 63.90 62 8 29 25 33.00 42.90 26.90 231 309 16.20 0.27
MOT17-04 4228 25709 45.90 83.80 83 10 44 29 43.80 61.90 33.90 271 730 36.50 0.22
MOT17-09 1827 1574 70.40 67.20 26 10 13 3 42.60 41.60 43.60 63 91 34.90 0.22
Moving camera
MOT17-05 2227 2420 65.00 66.90 133 38 74 21 47.30 48.00 46.70 189 245 30.10 0.28
MOT17-10 4705 5822 54.70 59.90 57 12 35 10 40.10 42.00 38.40 363 553 15.20 0.28
MOT17-11 3124 3345 64.60 66.10 75 21 26 28 51.90 52.50 51.30 50 102 30.90 0.21
MOT17-13 3234 6668 42.70 60.60 110 14 63 33 34.20 41.30 29.10 484 741 10.80 0.31
OVERALL for MOT17 - - 49.50 70.30 - - - - 41.60 50.30 35.50 - - 27.10 0.25
indoor 85 21 98.2 93.0 7 7 0 0 92.6 90.2 95.2 2 8 90.6 0.216
outdoor easy 121 28 98.9 95.4 7 7 0 0 97.2 95.4 98.8 1 9 94.1 0.166
outdoor hard 574 148 97.7 91.8 20 20 0 0 78.7 76.3 81.2 22 36 88.6 0.257
OVERALL for COMVIS MICA - - 98.1 92.8 - - - - 84.8 82.5 87.2 - - 90.2 0.229
TABLE IV
EVALUATION OF MEMORY REQUIREMENTS AND PROCESSING RATES FOR THE THREE COUPLINGS OF HUMAN DETECTION AND TRACKING METHODS.
Methods  RAM without GPU (MB)  RAM with GPU (MB)  GPU memory (MB)  FPS without GPU (Hz)  FPS with GPU (Hz)
YOLOv3-tiny + DeepSORT 700 3400 2489 7.00 14.9
YOLOv3 + DeepSORT 1300 5400 2897 2.0 11.1
Mask R-CNN + DeepSORT - 4178 4300 - 2.5
Fig. 2. An example of the obtained results for human detection and tracking on the COMVIS MICA dataset in two cases: (a) correct tracking; (b) a fragmentation occurs when pedestrians pass each other. The detected boxes and their corresponding ground truth are marked with orange and blue bounding boxes, respectively.
[5] A. A. Mekonnen and F. Lerasle, “Comparative evaluations of selected
tracking-by-detection approaches,” IEEE Transactions on Circuits and
Systems for Video Technology, vol. 29, no. 4, pp. 996–1010, April 2019.
[6] J. Zhou and J. Hoang, “Real time robust human detection and tracking
system,” in 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’05)-Workshops. IEEE, 2005,
pp. 149–149.
[7] A. A. Malik, A. Khalil, and H. U. Khan, “Object detection and track-
ing using background subtraction and connected component labeling,”
International Journal of Computer Applications, vol. 75, no. 13, 2013.
[8] S. Haifeng and X. Chao, “Moving object detection based on background
subtraction of block updates,” in 2013 6th International Conference on
Intelligent Networks and Intelligent Systems (ICINIS). IEEE, 2013, pp.
51–54.
[9] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in international Conference on computer vision & Pattern
Recognition (CVPR’05), vol. 1. IEEE Computer Society, 2005, pp.
886–893.
[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan,
“Object detection with discriminatively trained part-based models,”
IEEE transactions on pattern analysis and machine intelligence, vol. 32,
no. 9, pp. 1627–1645, 2009.
[11] X. Wang, G. Hua, and T. X. Han, “Detection by detections: Non-
parametric detector adaptation for a video,” in 2012 IEEE Conference on
Computer Vision and Pattern Recognition. IEEE, 2012, pp. 350–357.
[12] V. Gajjar, A. Gurnani, and Y. Khandhediya, “Human detection and
tracking for video surveillance: A cognitive science approach,” in
Proceedings of the IEEE International Conference on Computer Vision,
2017, pp. 2805–2809.
[13] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,”
arXiv preprint arXiv:1804.02767, 2018.
[14] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C.
Berg, “Ssd: Single shot multibox detector,” in European conference on
computer vision. Springer, 2016, pp. 21–37.
[15] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
object detection with region proposal networks,” in Advances in neural
information processing systems, 2015, pp. 91–99.
[16] H. Grabner and H. Bischof, “On-line boosting and vision,” in 2006
IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR’06), vol. 1. Ieee, 2006, pp. 260–267.
[17] H. Grabner, C. Leistner, and H. Bischof, “Semi-supervised on-line
boosting for robust tracking,” in European conference on computer
vision. Springer, 2008, pp. 234–247.
[18] B. Babenko, M.-H. Yang, and S. Belongie, “Visual tracking with online
multiple instance learning,” in 2009 IEEE Conference on Computer
Vision and Pattern Recognition. IEEE, 2009, pp. 983–990.
[19] Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,”
IEEE transactions on pattern analysis and machine intelligence, vol. 34,
no. 7, pp. 1409–1422, 2011.
[20] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed track-
ing with kernelized correlation filters,” IEEE transactions on pattern
analysis and machine intelligence, vol. 37, no. 3, pp. 583–596, 2014.
[21] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime
tracking with a deep association metric,” in 2017 IEEE International
Conference on Image Processing (ICIP), Sep. 2017, pp. 3645–3649.
[22] J. Redmon, “Darknet: Open source neural networks in c,”
http://pjreddie.com/darknet/, 2013–2016.
[23] R. E. Kalman, “A new approach to linear filtering and prediction
problems,” Journal of basic Engineering, vol. 82, no. 1, pp. 35–45,
1960.