This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2020.2995321, IEEE Access
Scale-aware Hierarchical Detection
Network for Pedestrian Detection
XIAOWEI ZHANG, SHUAI CAO, AND CHENGLIZHAO CHEN
Shandong Key Laboratory of Intelligent Information Processing, School of Computer Science and Technology, Qingdao University, Qingdao 266071, China.
Corresponding author: Xiaowei Zhang (e-mail: xiaowei19870119@sina.com) and Chenglizhao Chen (e-mail: cclz123@163.com).
This work was supported in part by the National Natural Science Foundation of China (Grant No.6190070308), and in part by the Natural
Science Foundation of Shandong Province of China (Grant No.ZR2019BF028).
ABSTRACT Spatial scale variation of several or even dozens of times is one of the major bottlenecks for pedestrian detection. Although the Region-based Convolutional Neural Network (R-CNN) family has shown promising results for object detection, it is still limited in detecting pedestrians with large scale variations due to the fixed receptive field sizes on a single convolutional output layer. In contrast to previous methods that simply combine pedestrian predictions on feature maps with different resolutions, we propose a scale-aware hierarchical detection network for pedestrian detection under large scale variations. First, we introduce a cross-scale features aggregation module that accomplishes feature augmentation for pedestrian representation by merging the lateral connection, the top-down path, and the bottom-up path. Specifically, the cross-scale features aggregation module adaptively fuses hierarchical features to enhance the feature pyramid representation with robust semantics and accurate localization. Further, we design a scale-aware hierarchical detection network that effectively integrates multiscale pedestrian detection into a unified framework by adaptively perceiving the augmented feature level suited to each pedestrian scale. Experimentally, the proposed scale-aware hierarchical detection network forms a more robust and discriminative model for pedestrian instances of different scales on the widely used ETH and Caltech benchmarks. In particular, compared with the state-of-the-art method FasterRCNN+ATT [44], the log-average miss rate of pedestrian detection is reduced by 11.98% for medium-scale pedestrians (between 30-80 pixels in height) and 14.12% for whole-scale pedestrians (above 20 pixels in height) on the Caltech benchmark.
INDEX TERMS Scale Variation, Feature Aggregation, Scale-aware Weighting, Hierarchical Detection.
I. INTRODUCTION
PEDESTRIAN detection stands out from traditional object detection tasks in view of its broad application prospects in computer vision, such as video surveillance, autonomous driving, and robotics. Although significant improvements have been made on pedestrian detection [4], [8], [24], [27], [41] over the years, most existing efforts work very well only for large-scale pedestrian instances [17]–[19], [23], [34], [36], [52]. Compared with pedestrian detection at large scales, much less attention has been paid to medium- and small-scale instances, as similarly observed in the literature [14], [55].
For autonomous driving systems, detecting medium- and small-size pedestrians is an important topic because doing so leaves sufficient time to alert the driver. Assuming a vehicle traveling at an urban speed of 15 m/s and a pedestrian 1.8 m tall, a person 80 pixels in height is just 1.5 s away, while a person 30 pixels in height is 4 s away. Take one recent effort, AR-Ped [2], as an example: it has been reported that their detector achieves a 6.45% log-average miss rate for pedestrians taller than 50 pixels on the Caltech Pedestrian Benchmark [20]; however, the error rate increases to 49.31% MR for pedestrians 30-80 pixels in height. Fig. 1(a) shows several failure cases of the state-of-the-art method AR-Ped [2] under large scale appearance variations on the Caltech benchmark. Fig. 1(b) illustrates the scale distribution of pedestrian heights on the Caltech dataset; following [55], we group pedestrians by image size (height in pixels) into three scales: near (80 or more pixels), medium (between 30-80 pixels), and far (between 20-30 pixels). Note that about 81.67% of the pedestrians lie in the medium scale on the Caltech dataset.
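The timing argument above follows a pinhole camera model. A minimal sketch, assuming a focal length of roughly 1000 pixels (the focal length is not stated in the paper; this value is an assumption chosen to be consistent with its 1.5 s / 4 s figures), together with the scale grouping from [55]:

```python
def time_to_contact(height_px, person_height_m=1.8, speed_mps=15.0, focal_px=1000.0):
    """Seconds until the vehicle reaches a pedestrian of the given pixel height.

    Pinhole model: distance = focal_px * person_height_m / height_px.
    focal_px is an assumed camera focal length, not given in the paper.
    """
    distance_m = focal_px * person_height_m / height_px
    return distance_m / speed_mps

def scale_group(height_px):
    """Group a pedestrian by pixel height following [55]."""
    if height_px >= 80:
        return "near"
    if height_px >= 30:
        return "medium"
    if height_px >= 20:
        return "far"
    return "ignored"
```

Under these assumptions, an 80-pixel pedestrian is 22.5 m (1.5 s) away and a 30-pixel pedestrian is 60 m (4 s) away, matching the figures above.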
The degraded performance for pedestrian detection under
[Figure 1 graphic: panel (a) shows detection exemplars; panel (b) is a scale-distribution histogram (height 0-450 pixels vs. fraction 0-0.25, with markers at 30 and 80 pixels); panel (c) shows pedestrian crops at scales 10×20~15×30, 15×30~25×50, 25×50~30×80, 30×80~100×200, and 100×200~256×512.]
FIGURE 1: Visual examples of pedestrians at multiple scales. (a) shows exemplars of pedestrian detection using the state-of-the-art method AR-Ped [2]. (b) shows the scale distribution of pedestrian heights on the Caltech dataset; one can observe that medium-size instances indeed dominate the distribution. (c) shows pedestrian instances with different scales on the Caltech dataset.
large scale variations may be attributed to the following inherent challenges. First, small-size pedestrian instances often convey a smaller amount of information while carrying a greater proportion of noise, with obscure appearance and blurred boundaries; it is in general difficult to distinguish them from background clutter. Second, the visual semantic concepts of an object can emerge at different spatial scales depending on the size of the target object. For a pedestrian instance of interest, visual features are effective only at a proper scale where the optimal response is obtained. This difference is more pronounced in complex scenes containing pedestrian instances of diverse scales.
To address the issue of pedestrian detection under large scale appearance variations, Faster-RCNN [9] exploits a multiscale region proposal network (RPN), which achieves excellent object detection performance. However, its multi-scale detection is generated by sliding a fixed set of filters over a fixed set of convolutional feature maps. This results in an inconsistency between the sizes of objects and filter receptive fields: the scales of objects are variable, yet the sizes of filter receptive fields are fixed. Instead of using a fixed set of receptive fields, most related works [1], [7], [15], [44], [50], [58], [59] that aim to detect multi-scale pedestrians redeploy the receptive fields of convolution based on object sizes at multiple output layers. However, in our view, these methods, which either simply select multiple output layers based on the sizes of receptive fields [11], [4], [5] or use feature fusion on a single output layer [44], [57] to obtain multi-scale receptive fields, lack enhancement of the entire feature hierarchy for multiscale pedestrian detection. This motivates us to construct an aggregated feature representation that enhances semantic information and localization signals for scale-aware pedestrian detection.
Motivated by the above insight and analysis of the feature hierarchy pyramid representation, we propose a scale-aware hierarchical detection network for pedestrian detection under large scale variations. First, we accomplish feature aggregation based on FPN [43] to enhance the semantic information and localization signals in the feature representation, by merging the lateral connection, the top-down path, and the bottom-up path. Furthermore, in view of the feature differences among pedestrians at different scales, the scale-aware hierarchical detection network is designed to adaptively perceive pedestrian instances within certain scale ranges, by probing the feature differences across scales in the augmented pyramid features.
To sum up, our work makes the following contributions:
1) We introduce a cross-scale features aggregation module to enhance the feature pyramid representation by fusing robust semantics and accurate localization for pedestrians with different scales, which accomplishes feature augmentation from the lateral connection, the top-down path, and the bottom-up path.
2) A novel scale perception strategy based on a normalized Gaussian gate function is designed to integrate multiple detection heads into a unified framework by adaptively perceiving the output of the cross-scale features aggregation module for the scale-aware
hierarchical detection network.
3) Experimentally, compared with the state-of-the-art method FasterRCNN+ATT [44], the log-average miss rate of pedestrian detection is reduced by 11.98% for medium-scale pedestrians (between 30-80 pixels in height) and 14.12% for whole-scale pedestrians (above 20 pixels in height) on the Caltech benchmark.
II. RELATED WORK
There has been lasting research activity on pedestrian detection, with a vast literature. Before the emergence of CNNs, hand-crafted features were widely used to obtain good performance for pedestrian detection, including HOG [10], Edgelets [38], ICF [21] and its variants ACF [28], LDCF [17], [22], and SCF [23]. The most popular pedestrian detector is the deformable part model (DPM) [51], which combines a rigid root filter and deformable part filters based on a HOG feature pyramid and latent SVM classifiers for detection.
Deep ConvNets, owing to their stronger feature representation ability, exhibit obvious performance gains on pedestrian detection [2], [13], [18], [28], [29]. CCF [56] absorbs merits from filtered channel features and Convolutional Neural Networks (CNNs), transferring low-level features from pre-trained CNN models to feed a boosting forest model for pedestrian detection. ConvNet [41] uses an unsupervised method based on convolutional sparse coding to pre-train a CNN for pedestrian detection. DeepParts [13] consists of extensive part detectors, each of which is a strong detector that can detect a pedestrian by observing only a part of a proposal. SDP [11] investigates scale-dependent pooling and layer-wise cascaded rejection classifiers within a CNN to detect objects. CompACT-Deep [16] leverages both hand-crafted and CNN features to form complexity-aware cascaded detectors for an optimal trade-off between accuracy and speed. In particular, Faster-RCNN [9] introduced a multiscale region proposal network that shares full-image convolutional features with the detection network, leading to excellent performance for pedestrian detection.
However, spatial scale variation is one of the main challenges for pedestrian detection due to the large variance of instance scales across scenarios. To address the issue, upsampling or dilated operations [5], [11] are employed to alleviate the performance decline caused by the fixed set of filter receptive fields in Faster-RCNN [9]. MS-CNN [5] combines multiple output layers with feature upsampling by deconvolution to produce a strong multi-scale object detector. SA-FastRCNN [4] exploits multiple built-in subnetworks in a divide-and-conquer strategy to adaptively detect pedestrians across scales. RPN+BF [7] reuses the high-resolution convolutional features of the RPN through cascaded boosted forests for multiscale pedestrian detection. ADM [50] executes sequences of coordinate transformations on multi-layer feature maps to deliver accurate pedestrian locations. TridentNet [54] constructs a parallel multi-branch architecture that expands receptive fields via dilated convolution for detecting objects of different scales. However, these methods do not effectively fuse the robust semantic information of targets in the high-level convolutional layers with the precise localization signals of the lower convolutional layers for multiscale pedestrian detection.
To exploit strong semantics for prediction, FPN [43] augments a top-down pathway and lateral connections to propagate high-level semantic information for reasonable classification capability. DSSD [48] adopts deconvolution layers to aggregate context and high-level semantics for enhancing shallow features. M2Det [3] presents a multi-level feature pyramid network to fuse multiscale features for detecting objects of different scales. On the other hand, the many fine details and higher resolution in low-level feature maps benefit localization accuracy. PANet [47] builds a strong indicator to accurately localize instance segmentation via a pathway with clean lateral connections from the low levels to the top ones. DLA [49] augments standard architectures with deeper aggregation across layers to obtain a stronger layer-wise multi-scale representation capability. STDN [29] is equipped with embedded super-resolution scale-transfer layers to exploit the inter-scale consistency across multiple detection scales. Recently, NAS-FPN [53] uses a series of merging cells to fuse features across scales through a combination of top-down and bottom-up connections. Res2Net [46] constructs hierarchical residual-like connections within a single residual block to capture multi-scale features at a granular level.
Inspired by these observations and analyses of feature fusion for multiscale detection, in this paper we explore a scale-aware hierarchical detection network for multi-scale pedestrian detection, aggregating the strong semantic information from high-level features and the accurate localization signals from low-level layers to enhance the pyramidal feature representations.
III. APPROACH OVERVIEW
A high-level overview of our architecture is shown in Fig. 2. Our proposed approach consists of two main components: a cross-scale features aggregation module and a scale-aware hierarchical detection network. The cross-scale features aggregation module is built on the Feature Pyramid Network (FPN) [43] to enhance the representation ability of pyramid features. FPN shows significant improvement as a generic feature extractor for object recognition, propagating semantically strong features along the top-down path to endow pyramid features with reasonable classification capability. Similarly, many fine details and strong responses of local patterns exist in low-level convolutional layers, which benefit high localization accuracy. For this reason, we design a cross-scale features aggregation module to adaptively aggregate features across the pyramid hierarchy and enhance the localization capability.
Further, the scale-aware hierarchical detection network, based on the Fast R-CNN framework [6], combines complementary detection branches on the hierarchical pyramid feature maps from the cross-scale features aggregation module.
[Figure 2 graphic: the original image passes through ResNet stages C1-C5; the cross-level feature aggregation module (identity mapping, upsampling, average pooling) produces H3-H5; region proposals at near, medium, and far scales are routed to three detection branches in the scale-aware hierarchical detection subnetwork, each with an RoI pooling layer, FCs, and a 1024-d RoI feature vector producing 2-d cls-score and 8-d bbox-pred outputs, which are merged into scale-aware outputs.]
FIGURE 2: The architecture of our proposed Scale-aware Hierarchical Detection Network. Our approach uses the cross-scale features aggregation module to enhance semantic robustness and localization accuracy, and the scale-aware hierarchical detection network to adaptively detect pedestrians of specific scales in the image from the augmented feature levels.
The detection heads in the hierarchical detection network, built on ResNet [33] pretrained on ImageNet, all share parameters across proposals and learn scale-aware hierarchical weights by minimizing the error rate for pedestrians with different scales, regardless of their feature levels.
A. CROSS-SCALE FEATURES AGGREGATION MODULE
The Feature Pyramid Network (FPN) [43] shows significant improvement as a generic feature extractor for object recognition, propagating semantically strong features to enhance pyramid features with reasonable classification capability. Following previous evidence on the benefits of the feature approximation strategy [28], we denote the outputs of the last residual blocks as $\{C_1, C_2, C_3, C_4, C_5\}$ for conv1, conv2, conv3, conv4, and conv5 in ResNet, and we are given a list of multi-scale pyramid features $\{P_1, P_2, P_3, P_4, P_5\}$ from FPN [43], where $P_i$ represents the feature at pyramid level $i$. However, the feature fusion in FPN builds only on the lateral connection and the top-down pathway, ignoring the impact of bottom-up path augmentation, which would enhance the feature representation with the accurate localization signals existing in low-level convolutional layers.
Our goal is to find a transformation function $f$ that effectively aggregates multi-scale features and outputs a list of new features: $X_{out} = f(X_{in})$, where $X_{in}$ may be $C_i$, $P_i$, or their union. Different from the feature augmentation generated by FPN, we propose a cross-scale features aggregation module (CFAM) that merges a bottom-up pathway into FPN. Specifically, we use $\{H_1, H_2, H_3, H_4, H_5\}$ to denote the augmented feature pyramid, in which the spatial resolution of the feature maps is gradually upsampled by a factor of 2 from $H_i$ to $H_{i-1}$. As shown in Fig. 3(b), each feature aggregation module takes a convolutional feature map $C_{i-1}$ with higher resolution, an identity-mapping feature map $C_i$, and a coarser feature map $H_{i+1}$ with stronger semantics to generate the augmented feature map $H_i$. Note that we adopt average pooling to downsample the spatially finer feature maps, which directly propagates the strong responses of local patterns from low pyramid levels along the bottom-up augmented pathway for accurate localization.
The key idea of CFAM is to adaptively aggregate multi-scale context information from the feature maps of convolutional layers at adjacent scales to generate more discriminative features. As shown in Fig. 3(b), each aggregation module merges a top-down path, lateral connections, and a bottom-up augmented path by addition. This is an iterated process that builds the augmented feature pyramid down to the finest resolution map $H_3$. At the beginning of the iteration, we adopt a 1×1 convolutional layer on $C_5$ to produce the coarsest but semantically strongest resolution map $H_5$. Then the lower-level feature map $C_{i-1}$ goes through a 2×2 average pooling layer with stride 2 to reduce its spatial size and generate the down-sampled feature map of the bottom-up augmented pathway. The upsampled feature map $H_{i+1}$, the down-sampled feature map, and the identity-mapping feature map $C_i$ are added element-wise to generate the fused feature map. Finally, we append a 1×1 convolution
[Figure 3 graphic: (a) FPN block: C5→P5 via 1×1 conv; P4 and P3 are built from identity mappings of C4 and C3 plus 2× upsampling of the coarser map. (b) CFAM block: identity mapping plus 2× upsampling of the coarser map plus average pooling of the finer map.]
(a) The feature aggregation block from FPN [43] merges the lateral connection and the top-down pathway by addition. (b) Our cross-scale features aggregation module augments features from the lateral connection, the top-down pathway, and the bottom-up pathway.
FIGURE 3: Illustration of the feature aggregation module designs.
on each merged map to generate the final augmented feature map $H_i$ for the following sub-networks; this reduces the aliasing effect of upsampling and downsampling. In the feature aggregation module, the augmented feature maps correspond to $\{C_3, C_4, C_5\}$ with the same spatial sizes, and we set 1024-channel outputs for each level of the augmented feature pyramid $\{H_3, H_4, H_5\}$ fed to the scale-aware hierarchical detection network.
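The aggregation described above can be sketched as follows. A minimal NumPy version, assuming all inputs have already been projected to a common channel dimension (the paper builds this on ResNet feature maps of differing widths), with each 1×1 convolution expressed as a per-pixel channel mix:

```python
import numpy as np

def avgpool2x2(x):
    # 2x2 average pooling with stride 2 on a (C, H, W) map (bottom-up path)
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2x(x):
    # nearest-neighbour 2x upsampling on a (C, H, W) map (top-down path)
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, weight):
    # a 1x1 convolution is a per-pixel channel mix; weight: (C_out, C_in)
    c, h, w = x.shape
    return (weight @ x.reshape(c, -1)).reshape(weight.shape[0], h, w)

def cfam(c2, c3, c4, c5, w3, w4, w5):
    """Cross-scale features aggregation: returns (H3, H4, H5)."""
    # H5: coarsest but semantically strongest map (1x1 conv on C5)
    h5 = conv1x1(c5, w5)
    # H_i = conv1x1( upsample(H_{i+1}) + C_i + avgpool(C_{i-1}) )
    h4 = conv1x1(upsample2x(h5) + c4 + avgpool2x2(c3), w4)
    h3 = conv1x1(upsample2x(h4) + c3 + avgpool2x2(c2), w3)
    return h3, h4, h5
```

Each output level keeps the spatial size of the corresponding $C_i$, matching the correspondence between $\{H_3, H_4, H_5\}$ and $\{C_3, C_4, C_5\}$ stated above.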
B. SCALE-AWARE HIERARCHICAL DETECTION NETWORK
Covering multiple scale ranges is a critical problem for pedestrian detection. Different from the multi-scale mechanism of the RPN [9], we divide the region proposals from the higher convolutional layer $C_4$ into three scales (near, medium, and far); each scale is routed to an augmented feature pyramid level $H_i$ to detect pedestrian instances within certain scale ranges, as shown in Fig. 4. We hypothesize that pedestrian instances with different scales can be better modeled by a hierarchical detection network whose filter receptive fields lie in valid ranges. Specifically, each pedestrian anchor scale needs to effectively match the receptive field size of the RoI pooling through a different spatial pooling structure.
Let $L_m(X_i, Y_i \mid W)$ represent the multi-task loss function for each pedestrian proposal at specific feature level $H_m$, given by:
$$L_m(X_i, Y_i \mid W) = L^m_{cls}(p_i, \hat{p}_i) + \lambda \hat{p}_i L^m_{loc}(b_i, \hat{b}_i). \qquad (1)$$
where $\hat{p}_i$ is 1 if the anchor is labeled positive and 0 otherwise, and $p_i$ is the predicted probability of the anchor being a proposal. $\hat{b}_i = (\hat{b}^x_i, \hat{b}^y_i, \hat{b}^w_i, \hat{b}^h_i)$ represents the ground-truth box associated with a positive anchor, and $b_i = (b^x_i, b^y_i, b^w_i, b^h_i)$ represents the parameterized coordinates of the predicted bounding box. The classification loss $L^m_{cls}$ is the softmax loss over two classes (pedestrian vs. not) at specific feature level $H_m$. For the regression loss, we use $L^m_{loc} = R(b_i - \hat{b}_i)$, where $R$ is the robust loss function (smooth-$L_1$) defined in [6]. The term $\hat{p}_i L^m_{loc}$ means the regression loss is activated only for positive anchors ($\hat{p}_i = 1$) and disabled otherwise ($\hat{p}_i = 0$).
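A scalar sketch of Eq. (1) for a single proposal, assuming two-class logits and the standard smooth-$L_1$ from [6] (the helper names are illustrative, not from the paper):

```python
import numpy as np

def smooth_l1(x):
    # R in the paper: smooth-L1 from Fast R-CNN [6]
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def softmax_ce(logits, label):
    # softmax cross-entropy over two classes (pedestrian vs. not)
    z = logits - logits.max()
    return -(z[label] - np.log(np.exp(z).sum()))

def detection_loss(cls_logits, b, b_hat, p_hat, lam=1.0):
    # Eq. (1): L_cls + lam * p_hat * L_loc;
    # regression is active only for positive anchors (p_hat = 1)
    l_cls = softmax_ce(cls_logits, p_hat)
    l_loc = smooth_l1(np.asarray(b, float) - np.asarray(b_hat, float)).sum()
    return l_cls + lam * p_hat * l_loc
```

For a negative anchor the regression term vanishes and only the classification loss remains, as the text states.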
To adaptively match valid feature levels and anchor scales for multiscale pedestrian detection, SDP [11] adopts a hard isolation strategy based on the pixel height of an object proposal to detect multiscale objects. SA-FastRCNN [4] exploits a soft isolation strategy using a sigmoid gate function defined over the object proposal sizes to generate scale-aware weighting for multi-scale detection subnetworks. In this paper, we design a novel scale perception strategy using a normalized Gaussian gate function for the scale-aware hierarchical detection network (SHDN), as shown in Fig. 4, and the model loss function is defined as:
$$L(W) = \sum_{m=1}^{M} \sum_{i \in U} \omega_m L_m(X_i, Y_i \mid W) \qquad (2)$$
where $M$ is the number of hierarchical feature pyramid levels as mentioned in Section III-A, $U = \{(X_i, Y_i)\}_{i=1}^{N}$ contains the multi-scale training examples of pedestrian instances, and $\omega_m$ is the normalized scale-aware weight of the corresponding hierarchical loss $L_m(X_i, Y_i \mid W)$, initialized by $\omega_m = e^{\hat{\omega}_m} / \sum_{i=1}^{M} e^{\hat{\omega}_i}$ with $\hat{\omega}_m = e^{-(s - \bar{s}_m)^2 / 2\gamma_m^2}$. Here $s = \log_2(h)$ denotes the height scale of a pedestrian, which has already been normalized to a narrow range prior to detection, and $\bar{s}_m$ and $\gamma_m$ are the average height scale and the scaling coefficient for specific feature level $H_m$, respectively. Given a sliding window, a Gaussian function with lower $\gamma_m$ tends to enlarge the gap between the weights of pedestrian instances from different scale ranges. Based on the ResNet structure, the output size of RoI pooling is 7×7, with a stride chosen from $\{8, 16, 32\}$ for the deep network levels $\{C_3, C_4, C_5\}$; the valid receptive fields for the hierarchical feature pyramid $\{H_3, H_4, H_5\}$ are then $\{56, 112, 224\}$ pixels in bounding-box height, respectively. Consequently, we assign the scale-aware parameters $(\bar{s}_m, \gamma_m)$ as $\{(5.8, 1.25), (6.8, 2), (7.8, 1.25)\}$ for the hierarchical feature pyramid $\{H_3, H_4, H_5\}$, respectively. Note that we optimize the multi-task loss function, shunted to the scale-aware hierarchical detection modules by the scale-aware weight parameters $(\bar{s}_m, \gamma_m)$, and all parameters after the RoI pooling layers are shared across all levels of the hierarchical feature pyramid.
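The normalized Gaussian gating above can be sketched directly, using the $(\bar{s}_m, \gamma_m)$ values the paper assigns to $\{H_3, H_4, H_5\}$:

```python
import numpy as np

# (s̄_m, γ_m) for the pyramid levels {H3, H4, H5}, as assigned in the text
PARAMS = [(5.8, 1.25), (6.8, 2.0), (7.8, 1.25)]

def scale_aware_weights(height_px):
    """Normalized scale-aware weights ω_m for a proposal of given pixel height."""
    # s = log2(h): the height scale of the pedestrian proposal
    s = np.log2(height_px)
    # ω̂_m = exp(-(s - s̄_m)^2 / 2γ_m^2), then softmax-normalized:
    # ω_m = exp(ω̂_m) / Σ_i exp(ω̂_i)
    omega_hat = np.array([np.exp(-(s - sm) ** 2 / (2.0 * gm ** 2))
                          for sm, gm in PARAMS])
    e = np.exp(omega_hat)
    return e / e.sum()
```

The valid receptive-field heights $\{56, 112, 224\}$ correspond to $s \approx 5.8, 6.8, 7.8$, so proposals of those heights weight $H_3$, $H_4$, and $H_5$ most strongly.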
[Figure 4 graphic: multiscale pedestrian proposals are divided into far, medium, and near scales and routed to H3, H4, and H5; each branch of the scale-aware hierarchical detection subnetwork applies an RoI pooling layer and FCs to produce a 1024-d RoI feature vector with 2-d cls-score and 8-d bbox-pred outputs, which are merged into scale-aware outputs.]
FIGURE 4: Our proposed scale compensation strategy from the multipath RPN to initial proposals. It uses the hierarchical features of deep convolutional layers to obtain a series of reasonable anchor scales for pedestrian proposals, and each scale focuses on pedestrian instances within certain scale ranges in an image.
To efficiently train the scale-aware hierarchical detection network, sampling is used to compensate for the imbalance between the distributions of positive samples $U^m_+$ and negative samples $U^m_-$. In this paper, we adopt random sampling and bootstrapped sampling to collect the final set of negative samples, such that $|U^m_-| = \zeta |U^m_+|$. We utilize random sampling to select easy negative samples according to a uniform distribution. Because hard negative mining has a large influence on detection accuracy, bootstrapped sampling is exploited to improve detection performance by ranking the negative samples according to their objectness scores. On the other hand, to avoid a heavy asymmetry between the positive samples $U^m_+$ and negative samples $U^m_-$ for each specific detection layer, the cross-entropy terms of positives and negatives are weighted as in Eq. (3), which guarantees that each detection layer has enough positive samples to cover a certain range of scales.
$$L_{cls} = \frac{1}{1+\zeta}\,\frac{1}{|U^m_+|}\sum_{i \in U^m_+} -\log p_{\hat{p}_i}(X_i) \;+\; \frac{\zeta}{1+\zeta}\,\frac{1}{|U^m_-|}\sum_{i \in U^m_-} -\log p_0(X_i) \qquad (3)$$
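A minimal sketch of the weighted cross-entropy in Eq. (3), taking per-sample predicted probabilities for the labeled class (the function and argument names are illustrative):

```python
import numpy as np

def balanced_cls_loss(p_pos, p_neg, zeta):
    """Eq. (3): class-balanced cross-entropy.

    p_pos: predicted pedestrian probabilities for positives (U^m_+)
    p_neg: predicted background probabilities for negatives (U^m_-)
    zeta:  the ratio |U^m_-| / |U^m_+|
    """
    pos_term = -np.log(np.asarray(p_pos)).mean()   # averaged over U^m_+
    neg_term = -np.log(np.asarray(p_neg)).mean()   # averaged over U^m_-
    # positives weighted 1/(1+ζ), negatives ζ/(1+ζ)
    return pos_term / (1.0 + zeta) + zeta * neg_term / (1.0 + zeta)
```

With $\zeta = 3$, as with 32 positives and 96 negatives per mini-batch, the negative term receives three times the aggregate weight of the positive term.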
IV. EXPERIMENTS
A. EXPERIMENT DETAILS
Starting from ResNet [33] pretrained on ImageNet, we fine-tune the convolutional neural network to extract visual features from the observed video frames of the Caltech training dataset. The convolutional layers and max pooling layers of the ResNet network are used as the shared convolutional layers before the Region-of-Interest (RoI) pooling layer to produce feature maps from the entire input image. The last convolutional block in ResNet is 2048-d, and we employ a randomly initialized 1024-d 1×1 convolutional layer to reduce the dimension. We use single-scale training, in which the input image is resized to 600 pixels on the shorter side. The scale-aware feature aggregation network is trained with Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005. As [9], [30] demonstrate that mining from a larger set of candidates (e.g., 2000) has no benefit, we use 300 RoIs for both training and testing in this paper. We fine-tune the scale-aware hierarchical detection network with a learning rate of 0.001 for 20k mini-batches. Each mini-batch consists of 128 randomly sampled object proposals from one randomly selected image, of which 32 are positive object proposals and the remaining 96 are negative. A positive pedestrian label is assigned when IoU ≥ 0.5 between the object proposal and a ground-truth box, and a negative label is assigned to RoIs whose IoU ≤ 0.3 for all ground-truth boxes. The whole scale-aware hierarchical detection network is trained on a single NVIDIA GeForce GTX TITAN X GPU with 12 GB memory.
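The proposal labeling and mini-batch composition described above can be sketched as follows. This is a simplified version: boxes are (x1, y1, x2, y2) tuples, and the bootstrapped hard-negative ranking is omitted in favor of plain random sampling:

```python
import random

def iou(a, b):
    # intersection-over-union of two boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def label_proposals(proposals, gt_boxes, pos_thr=0.5, neg_thr=0.3):
    # positive: IoU >= 0.5 with some ground truth; negative: IoU <= 0.3 with all
    pos, neg = [], []
    for p in proposals:
        best = max((iou(p, g) for g in gt_boxes), default=0.0)
        if best >= pos_thr:
            pos.append(p)
        elif best <= neg_thr:
            neg.append(p)      # proposals in (0.3, 0.5) are ignored
    return pos, neg

def sample_minibatch(pos, neg, n_pos=32, n_neg=96, seed=0):
    # 128-proposal mini-batch: 32 positives and 96 negatives per image
    rng = random.Random(seed)
    return (rng.sample(pos, min(n_pos, len(pos))),
            rng.sample(neg, min(n_neg, len(neg))))
```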
B. ABLATION EXPERIMENTS
1) Evaluating the cross-scale features aggregation module
As mentioned in [7], the Region Proposal Network (RPN) in Faster R-CNN indeed performs well as a stand-alone detector, but the downstream classifier degrades pedestrian detection performance. In this subsection, we investigate the cross-scale features aggregation module in terms of detection quality, evaluated by the log-average miss rate of pedestrian detection under IoU = 0.5 on the Caltech dataset.
First, we evaluate the high-level convolutional layers (ResNet-50 $C_3$ to $C_5$) of ResNet [33] for extracting RoI features to detect pedestrians using a set of anchor scales from the RPN. As shown in Table 1(a)(b)(c), which illustrates the effect of high-level convolutional features in ResNet-50 on detecting pedestrian instances, the higher convolutional layers (e.g., $C_4$, $C_5$) obviously perform better than the lower-level convolutional layer (e.g., $C_3$) for pedestrian instances at the near scale. This can be attributed to higher-level convolutional features carrying more robust semantic information than lower levels.

TABLE 1: Evaluations of pedestrian detection at different feature pyramid levels by log-average miss rate (MR) under IoU = 0.5 on the Caltech dataset.

Detection Network       Proposals  RoI features  lateral?  top-down?  bottom-up?  MR_all   MR_f     MR_m     MR_n
(a) Baseline on Conv.   C4         C3            ×         ×          ×           95.79%   100%     94.25%   79.31%
(b) Baseline on Conv.   C4         C4            ×         ×          ×           63.45%   95.76%   47.92%   4.83%
(c) Baseline on Conv.   C4         C5            ×         ×          ×           81.86%   –        78.42%   4.29%
(d) Baseline on FPN     C4         P3            ✓         ✓          ×           52.52%   76.59%   37.76%   14.18%
(e) Baseline on FPN     C4         P4            ✓         ✓          ×           46.78%   90.38%   38.65%   2.74%
(f) Baseline on FPN     C4         P5            ✓         ✓          ×           78.67%   –        75.96%   3.36%
(g) Based on our CFAM   C4         H3            ✓         ✓          ✓           46.52%   72.83%   36.50%   16.35%
(h) Based on our CFAM   C4         H4            ✓         ✓          ✓           43.69%   86.50%   38.08%   2.12%
(i) Based on our CFAM   C4         H5            ✓         ✓          ✓           84.84%   –        85.82%   2.51%
Further, compared with adopting a single high-level convolutional
layer (e.g., C3, C4, or C5) to detect pedestrians, FPN
(e.g., P3, P4, or P5) fuses the semantically strong features
from higher convolutional layers to enhance the pyramid features
for classification. In particular, P3 achieves 76.59% MR
for pedestrian detection at far scale, and reduces the MR for
medium-scale pedestrian instances by 10.16% compared with C4, as
shown in Table 1(d). However, FPN builds only on
the lateral connection and the top-down pathway for feature
fusion, ignoring the bottom-up pathway, which is
beneficial for accurate localization. In contrast to the improved
performance of P3 at far and medium scales, P4
degrades the pedestrian detection performance, as shown in
Table 1(e), possibly owing to the lack of the accurate localization
signals present in lower convolutional layers. Therefore, we
propose a cross-scale features aggregation module (CFAM)
that fuses semantic information and localization signals by
adding a bottom-up augmented pathway to FPN. As shown
in Table 1(g), H3 achieves the best pedestrian detection
performance at far and medium scales, reaching 72.83%
MR and 36.50% MR, respectively. Note that H4 achieves
43.69% MR for pedestrians across all scales.
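The fusion described above, in the spirit of FPN [43] and the path-aggregation idea of [32], can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: the lateral 1×1 convolutions are omitted by assuming the inputs already share one channel dimension, nearest-neighbor upsampling stands in for the top-down resizing, and plain stride-2 subsampling stands in for the learned bottom-up downsampling convolutions:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def downsample2x(x):
    """Stride-2 subsampling, standing in for a stride-2 conv."""
    return x[::2, ::2]

def cfam_sketch(c3, c4, c5):
    """Top-down (FPN-style) fusion followed by a bottom-up
    augmented pathway, producing H3..H5 from backbone maps C3..C5."""
    # Top-down pathway: propagate strong semantics to finer levels.
    p5 = c5
    p4 = c4 + upsample2x(p5)
    p3 = c3 + upsample2x(p4)
    # Bottom-up augmented pathway: propagate localization signals
    # from the fine, spatially accurate levels back up the pyramid.
    h3 = p3
    h4 = p4 + downsample2x(h3)
    h5 = p5 + downsample2x(h4)
    return h3, h4, h5
```

Each output level H_m keeps the spatial resolution of its input level while mixing in both the top-down semantic signal and the bottom-up localization signal.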
2) The Role of the Scale-aware Hierarchical Detection Network
In this subsection, the contribution of the proposed scale-aware
hierarchical detection network is evaluated by the log-average
miss rate under IoU = 0.5 on the Caltech testing dataset for
pedestrian detection. We conduct comparison experiments
to verify the effectiveness of the proposed method with a
single output layer versus multiple output layers for the detection
heads. As shown in Table 2(a)(b)(c), we compare the single
output layers H3, H4, and H5 from the proposed
cross-scale features aggregation module as detection heads for
pedestrians at different scales. We find that H3 performs
better than the other single output layers in log-average miss rate
for pedestrian detection at far and medium scales. For
near scale, H4 achieves the best single-output-layer detection
performance of 2.12% MR, an improvement of 14.23% MR
over the competitor H3 (16.35% MR).
However, detecting pedestrians from only a single output
layer cannot effectively cover multiscale pedestrians that
exhibit large scale variations, owing to the lack of scale
complementarity among multiple feature layers with different
filter receptive field sizes. To effectively combine multiple
output layers of the feature pyramid for pedestrian detection, we
adopt the scale-aware parameters (s̄_m, γ_m) to initialize the
learned hierarchical weights ω_m that optimize the multi-task loss
function in formula 2. Specifically, we assign the scale-aware
parameters (s̄_m, γ_m) as {(5.8, 1.25), (6.8, 2), (7.8, 1.25)} for the
hierarchical feature pyramid levels {H3, H4, H5}, respectively. In
Table 2(d), combining the layers {H3, H4} attains 42.58% MR
over all scales on the Caltech benchmark, improving by 1.11%
over the single output layer H4, and achieves the best
far-scale detection performance of 65.86% MR. Notably,
combining the layers {H4, H5} does not
improve pedestrian detection performance at medium and
far scales, but achieves a better near-scale performance of 1.25%
MR, as shown in Table 2(e). The reason may be that, in
our proposed hierarchical scale-aware detection
network, each detection branch learns a suitable
pyramid feature level and thus focuses on pedestrian instances
within a certain scale range. Moreover, the log-average miss rate
is reduced to 40.39% for all-scale pedestrian detection,
28.77% at medium scale, and 1.08% at near scale, by
combining the layers {H3, H4, H5}, as shown in Table 2(f). Note
that combining {H3, H4, H5} obtains the best performance
compared with {C3, C4, C5}, {P3, P4, P5}, and {P2, P3, P4,
P5}, as shown in Table 2(g–i). The experiments demonstrate
that the proposed hierarchical scale-aware detection network
is more flexible and is able to exploit the different filter
receptive field sizes of multi-level pyramid features to handle
the large variance in pedestrian instance scales.
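Formula 2 itself lies outside this section, so the snippet below is only one plausible, hypothetical reading of how (s̄_m, γ_m) could seed the hierarchical weights ω_m: a Gaussian gate over the log2 pedestrian height, under which (6.8, 2) centers the H4 branch near 2^6.8 ≈ 111-pixel pedestrians. The function name and the gating form are illustrative assumptions, not the paper's definition:

```python
import numpy as np

# Scale-aware parameters (s_bar_m, gamma_m) for {H3, H4, H5},
# as assigned in the text.
SCALE_PARAMS = [(5.8, 1.25), (6.8, 2.0), (7.8, 1.25)]

def hierarchical_weights(height_px):
    """Hypothetical Gaussian gating over log2 pedestrian height,
    illustrating how (s_bar_m, gamma_m) could seed per-branch
    weights w_m; the paper's formula 2 defines the actual
    learned weights."""
    s = np.log2(height_px)
    w = np.array([np.exp(-((s - s_bar) ** 2) / (2.0 * gamma ** 2))
                  for s_bar, gamma in SCALE_PARAMS])
    return w / w.sum()  # normalize across the three branches
```

Under this gate, a 50-pixel pedestrian (log2 50 ≈ 5.64) is weighted mostly toward the H3 branch, while a 250-pixel one (log2 250 ≈ 7.97) is weighted mostly toward H5, matching the intent that each branch focuses on a certain scale range.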
C. COMPARISON WITH STATE-OF-THE-ARTS
In this section, the performance of the proposed algorithm is
fully evaluated against the state-of-the-art methods on the Caltech [20]
and ETH [18] datasets. Following the evaluation criteria
proposed in [55], the log-average miss rate is used to summarize
detector performance; it is computed by averaging the
miss rate at FPPI values evenly spaced in log-space within
the range 10^-3 to 10^0. The experiments demonstrate
that jointly applying the cross-scale features aggregation module and the
scale-aware hierarchical detection network outperforms the
state-of-the-art pedestrian detection algorithms, especially for
pedestrian instances with small sizes.

TABLE 2: COMPARISONS OF PEDESTRIAN DETECTION RESULTS BY LOG-AVERAGE MISS RATE (MR) UNDER IoU=0.5 ON THE CALTECH DATASET (subscripts: all = all scales, f = far, m = medium, n = near)

Detection Network                   | Proposals | RoI features     | MR_all | MR_f   | MR_m   | MR_n
(a) Based on a single output layer  | C4        | H3               | 46.52% | 72.83% | 36.50% | 16.35%
(b) Based on a single output layer  | C4        | H4               | 43.69% | 86.50% | 38.08% | 2.12%
(c) Based on a single output layer  | C4        | H5               | 84.84% | –      | 85.82% | 2.51%
(d) Based on multiple output layers | C4        | {H3, H4}         | 42.58% | 65.86% | 30.53% | 1.48%
(e) Based on multiple output layers | C4        | {H4, H5}         | 44.37% | 90.56% | 33.66% | 1.25%
(f) Based on multiple output layers | C4        | {H3, H4, H5}     | 40.39% | 70.69% | 28.77% | 1.08%
(g) Based on multiple output layers | C4        | {C3, C4, C5}     | 50.47% | 78.67% | 37.43% | 2.22%
(h) Based on multiple output layers | C4        | {P3, P4, P5}     | 46.69% | 74.42% | 31.16% | 1.79%
(i) Based on multiple output layers | C4        | {P2, P3, P4, P5} | 45.12% | 71.54% | 32.83% | 1.46%
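As a concrete illustration, the metric can be computed as below; the nine reference points and the geometric averaging follow the common protocol of [55], while the interpolation scheme is an assumption of this sketch:

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, lo=1e-3, hi=1.0, n_points=9):
    """Geometric mean of the miss rate sampled at n_points FPPI
    values evenly spaced in log-space between lo and hi, following
    the protocol of [55]. fppi and miss_rate describe the detector's
    miss-rate-vs-FPPI curve (fppi must be increasing)."""
    ref = np.logspace(np.log10(lo), np.log10(hi), n_points)
    # Interpolate the curve at the reference FPPI points (in log-x).
    mr = np.interp(np.log10(ref), np.log10(fppi), miss_rate)
    mr = np.clip(mr, 1e-10, 1.0)  # guard the logarithm below
    return float(np.exp(np.mean(np.log(mr))))
```

For example, a detector whose curve is flat at a 0.4 miss rate over the whole FPPI range scores a log-average MR of 40%.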
1) Comparison with state-of-the-art methods on Caltech
dataset
The Caltech pedestrian dataset consists of approximately
10 hours of 640×480, 30 Hz video taken from a vehicle
driving through regular traffic in an urban environment;
it includes about 250,000 frames with a total of 2,300
unique pedestrians. Following previous relevant publications
[13], [16], [17], [24], we evaluate our method on pedestrians
at different spatial scales on the Caltech testing dataset, and
choose the Caltech training dataset together with the INRIA
training dataset [10] as our training set. Our proposed method
is compared with the state-of-the-art methods on the Caltech
testing dataset, including LDCF [22], ACF+SDt [34], RPN+BF [7],
MS-CNN [5], CompACT-Deep [16], TA-CNN [24], SA-FastRCNN [4],
FasterRCNN+ATT [44], and AR-Ped [2].
To evaluate the effectiveness of our proposed scale-aware
hierarchical detection network, quantitative comparison results
are presented for different scale ranges of pedestrian
instances on the Caltech dataset. Fig. 5 shows the comparison
of log-average miss rates for pedestrians over different scale
ranges. It can be observed that our proposed method significantly
outperforms the other methods and achieves the lowest
log-average miss rate of 28.77% at medium scale on the Caltech
dataset, as shown in Fig. 5(a), which is lower than the
state-of-the-art FasterRCNN+ATT [44] by 11.98%. Showing a
similar trend in Fig. 5(b), our approach achieves a 7.41%
log-average miss rate for pedestrian instances taller than 50
pixels, second only to the state-of-the-art AR-Ped [2].
For pedestrian instances in the far scale range, most methods
exhibit dramatic performance drops, as shown in Fig. 5(c).
FIGURE 5: Quantitative comparisons on the Caltech dataset (miss rate vs. false positives per image; legend gives log-average MR):
(a) Medium scale (80 ≥ height ≥ 30 pixels): TA-CNN 63.62%, LDCF 61.82%, DeepParts 56.42%, RPN+BF 53.93%, CompACT-Deep 53.23%, SA-FastRCNN 51.83%, AR-Ped 49.31%, MS-CNN 49.13%, FasterRCNN+ATT 40.75%, Ours (SHDN) 28.77%.
(b) Reasonable (height ≥ 50 pixels): LDCF 24.80%, TA-CNN 20.86%, DeepParts 11.89%, CompACT-Deep 11.75%, FasterRCNN+ATT 10.33%, MS-CNN 9.95%, SA-FastRCNN 9.68%, RPN+BF 9.58%, Ours (SHDN) 7.41%, AR-Ped 6.45%.
(c) Far scale (30 ≥ height ≥ 20 pixels): ACF+SDt 100.00%, SA-FastRCNN 100.00%, CompACT-Deep 100.00%, DeepParts 100.00%, RPN+BF 100.00%, TA-CNN 100.00%, LDCF 100.00%, MS-CNN 97.23%, FasterRCNN+ATT 90.94%, Ours (SHDN) 70.69%.
(d) Overall (height ≥ 20 pixels): LDCF 71.25%, TA-CNN 71.22%, DeepParts 64.78%, RPN+BF 64.66%, CompACT-Deep 64.44%, SA-FastRCNN 62.59%, MS-CNN 60.95%, AR-Ped 58.83%, FasterRCNN+ATT 54.51%, Ours (SHDN) 40.39%.
Although our proposed method outperforms the available
state-of-the-art competitors, it remains difficult to reliably
identify small pedestrian instances under 30 pixels in height.
In Fig. 5(c), the log-average miss rate is reduced to 70.69%,
an improvement of 20.25% over FasterRCNN+ATT [44].
This mirrors human performance, which is also quite good
at large scales but degrades noticeably at medium and far
scales. Significantly, over the whole scale range, our approach
achieves a log-average miss rate of 40.39% for all pedestrian
instances taller than 20 pixels, better than the current
FasterRCNN+ATT [44] by 14.12%, as shown in Fig. 5(d).
The comparison results over different scale ranges of
pedestrian instances demonstrate that our proposed approach
substantially improves pedestrian detection performance.
Fig. 6 shows the detection results of our proposed scale-
aware hierarchical detection network on Caltech dataset. The
green dotted bounding boxes represent true positive windows,
for which the intersection over union (IoU) between the detected
window and the ground truth (green solid bounding boxes)
exceeds 50%; red dotted bounding boxes denote false positive
windows. As shown in Fig. 6, most of the pedestrian instances
over different scale ranges are detected by our proposed
approach. Moreover, because the network adaptively perceives
the augmented feature level with an appropriate resolution for
pedestrians at a specific scale, the medium-size and small-size
pedestrian instances are also detected by the proposed
scale-aware hierarchical detection network. Note that some red
dotted bounding boxes correspond to true pedestrians that are
simply not annotated in the ground truth, as shown in Fig. 6.
This experiment shows that jointly applying the cross-scale
features aggregation module and the scale-aware hierarchical
detection network outperforms the state-of-the-art algorithms,
especially for pedestrian instances in the medium and small
scale ranges.
2) Comparison with state-of-the-art methods on the ETH dataset
The ETH benchmark dataset consists of 3 testing video
sequences with a resolution of 640×480 and a frame
rate of 13 FPS. Studies [7], [37] report that state-of-the-art
algorithms achieve remarkable detection performance on the
ETH dataset, including ChnFtrs [21], MultiFtr+Motion [35],
JointDeep [37], pAUCBoost [40], ConvNet [41], DBN-Mut [12],
SpatialPooling [39], TA-CNN [24], and RPN+BF [7]. As most
approaches are trained on the INRIA training dataset [10],
our proposed method is also trained on the INRIA training
dataset. As shown in Fig. 7(a), the log-average miss rate of
our proposed approach reaches 44.75% for medium-scale
pedestrians, next to the state-of-the-art SpatialPooling [39] at
43.36%. With a similar trend at near scale, our approach
achieves a 20.49% log-average miss rate, second only to the
best available competitor RPN+BF [7], as shown in Fig. 7(b).
Significantly, in the reasonable setting (pedestrian instances
taller than 50 pixels in height), our approach achieves a 29.45%
log-average miss rate, improving by 0.78% over the
state-of-the-art RPN+BF [7], as shown in Fig. 7(c). Moreover,
in the more challenging setting with large scale variations
(above 20 pixels in height), the log-average miss rate of our
approach is reduced by 3.98% relative to RPN+BF [7] on the
ETH dataset, as shown in Fig. 7(d). These results demonstrate
that our proposed method delivers substantially better detection
performance for multiscale pedestrian instances exhibiting
large scale variations in natural scenes.
The pedestrian detection results of our proposed method on
the ETH dataset are shown in Fig. 8, where the green dotted
boxes denote the detections of our approach. Our proposed
approach adaptively perceives the augmented feature level
through the scale-aware hierarchical detection network to
generate the final detections for pedestrians at a specific scale.
Small-size pedestrian instances are also detected; the red
dotted bounding boxes in Fig. 8 mark true pedestrians that
are not annotated in the ground truth. One can observe that
our method successfully detects most of the pedestrian
instances, especially pedestrians with large scale variations.

FIGURE 6: Detection results of our approach on the Caltech
dataset.
V. CONCLUSION
This study describes an effective approach for detecting
pedestrian instances across different scale ranges. The proposed
cross-scale features aggregation module adaptively fuses
hierarchical features to enhance the feature pyramid representation
by merging the lateral connection, the top-down path, and the
bottom-up path. Moreover, by probing the differences among local
features with different receptive field sizes, the proposed
scale-aware hierarchical detection network effectively integrates
multiscale pedestrian detection into a unified framework
that adaptively perceives the augmented feature level for
specific-scale pedestrian detection. Experimentally,
compared with the state-of-the-art FasterRCNN+ATT [44],
the log-average miss rate of pedestrian detection is reduced
by 11.98% for medium-scale pedestrians (between 30 and 80
pixels in height) and by 14.12% for the whole scale range
(above 20 pixels in height) on the Caltech benchmark.

FIGURE 7: Quantitative comparisons on the ETH dataset (miss rate vs. false positives per image; legend gives log-average MR):
(a) Medium scale (80 ≥ height ≥ 30 pixels): MultiFtr+Motion 67.77%, ConvNet 65.71%, DBN-Mut 59.59%, JointDeep 58.78%, pAUCBoost 55.42%, ChnFtrs 53.55%, RPN+BF 53.38%, TA-CNN 48.69%, Ours (SHDN) 44.75%, SpatialPooling 43.36%.
(b) Near scale (height ≥ 80 pixels): ChnFtrs 48.34%, MultiFtr+Motion 45.44%, JointDeep 40.75%, pAUCBoost 39.72%, ConvNet 39.23%, DBN-Mut 34.73%, SpatialPooling 29.66%, TA-CNN 23.24%, Ours (SHDN) 20.49%, RPN+BF 17.63%.
(c) Reasonable (height ≥ 50 pixels): MultiFtr+Motion 59.99%, ChnFtrs 57.47%, ConvNet 50.27%, pAUCBoost 49.06%, JointDeep 45.32%, DBN-Mut 41.07%, SpatialPooling 37.37%, TA-CNN 34.98%, RPN+BF 30.23%, Ours (SHDN) 29.45%.
(d) Overall (height ≥ 20 pixels): MultiFtr+Motion 70.12%, ChnFtrs 61.86%, ConvNet 57.80%, JointDeep 54.32%, pAUCBoost 53.56%, DBN-Mut 51.28%, SpatialPooling 43.19%, TA-CNN 42.92%, RPN+BF 39.46%, Ours (SHDN) 35.48%.
REFERENCES
[1] Z. Chen, L. Zhang, A. M. Khattak, et al., “Deep feature fusion by
competitive attention for pedestrian detection,” IEEE Access, vol. 7, pp.
21981-21989, 2019.
[2] G. Brazil, X. Liu, “Pedestrian Detection With Autoregressive Network
Phases,” in CVPR, Long Beach, CA, USA, 2019, pp.7231-7240.
[3] Q. Zhao, T. Sheng, Y. Wang, et al., “M2Det: A Single-Shot Object Detector
Based on Multi-Level Feature Pyramid Network,” in AAAI, Honolulu,
Hawaii, USA, 2019.
[4] J. Li, X. Liang, S. Shen, et al., “Scale-aware Fast R-CNN for Pedestrian
Detection,” in CVPR, Honolulu, Hawaii, 2017, pp.985–996.
[5] T. Cai, Q. Fan, R. Feris, et al., “A Unified Multi-scale Deep Convolutional
Neural Network for Fast Object Detection,” in ECCV, Amsterdam, Nether-
lands, 2016, pp.354–370.
[6] R. Girshick, “Fast R-CNN,” in ICCV, Santiago, Chile, 2015, pp.1440–
1448.
[7] L. Zhang, L. Lin, X. Liang, et al., “Is Faster R-CNN Doing Well for
Pedestrian Detection?,” in ECCV, Springer, Cham, 2016, pp. 443–457.
[8] C. Fei, B. Liu, Z. Chen, et al., “Learning Pixel-Level and Instance-
Level Context-Aware Features for Pedestrian Detection in Crowds,” IEEE
Access, vol. 7, pp. 94944–94953, 2019.
[9] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-
time object detection with region proposal networks,” in NIPS, Montreal,
Canada, 2015, pp.1–9.
[10] N. Dalal, B. Triggs, “ Histograms of oriented gradients for human detec-
tion,” in CVPR, San Diego, California, USA, 2005.
[11] F. Yang, W. Choi, Y. Lin, “Exploit All the Layers: Fast and Accurate CNN
Object Detector with Scale Dependent Pooling and Cascaded Rejection
Classifiers,” in CVPR, Las Vegas, USA, 2016, pp.2129–2137.
[12] W. Ouyang, X. Zeng, X. Wang, “ Modeling Mutual Visibility Relationship
with a Deep Model in Pedestrian Detection,” in CVPR, Portland, Oregon,
2013, pp.3222–3229.
[13] Y. Tian, P. Luo, X. Wang, et al., “ Deep learning strong parts for pedestrian
detection,” in ICCV, Santiago, Chile, 2015, pp.1904–1912.
FIGURE 8: Detection results of our approach on ETH
dataset.
[14] D. Hoiem, Y. Chodpathumwan, Q. Dai, “Diagnosing error in object
detectors,” in ECCV, Florence, Italy, 2012, pp.340–353.
[15] J. Yan, X. Zhang, Z. Lei, et al., “Robust Multi-Resolution Pedestrian
Detection in Traffic Scenes,” in CVPR, Portland, Oregon, 2013, pp.3033–
3040.
[16] Z. Cai, M. Saberian, N. Vasconcelos, “Learning complexity-aware cas-
cades for deep pedestrian detection,” in ICCV, Santiago, Chile, 2015,
pp.3361–3369.
[17] S. Zhang, R. Benenson, B. Schiele, “Filtered channel features for pedes-
trian detection,” in CVPR, Boston, Massachusetts, 2015, pp.1751–1760.
[18] P. Dollar, S. Belongie, P. Perona, “The fastest pedestrian detector in the
west,” in BMVC, Aberystwyth, UK, 2010.
[19] P. Dollar, R. Appel, W. Kienzle, “Crosstalk cascades for frame-rate pedes-
trian detection,” in ECCV, Florence, Italy, 2012, pp.645–659.
[20] P. Dollar, C. Wojek, B. Schiele, et al., “Pedestrian Detection: A Bench-
mark,” in CVPR, Miami, Florida, USA, 2009, pp.304–311.
[21] P. Dollar, Z. Tu, P. Perona, et al., “Integral channel features,” in BMVC,
London, 2009.
[22] W. Nam, P. Dollar, J. Han, “Local decorrelation for improved pedestrian
detection,” in NIPS, Montreal, Canada, 2014, pp.424–432.
[23] R. Benenson, M. Omran, J. Hosang, et al., “Ten years of pedestrian
detection, what have we learned?,” in ECCV, Zurich, Switzerland, 2014,
pp.613–627.
[24] Y. Tian, P. Luo, X. Wang, et al., “Pedestrian detection aided by deep learn-
ing semantic tasks,” in CVPR, Boston, Massachusetts, 2015, pp.5079–
5087.
[25] J. Hosang, M. Omran, R. Benenson, et al., “Taking a deeper look at
pedestrians,” in CVPR, Boston, Massachusetts, 2015, pp.4073–4082.
[26] M. Zeiler, R. Fergus, “ Visualizing and understanding convolutional net-
works,” in ECCV, Zurich, Switzerland, 2014, pp.818–833.
[27] B. Yang, J. Yan, Z. Lei, et al., “ Convolutional channel features,” in ICCV,
Santiago, Chile, 2015, pp.82–90.
[28] P. Dollar, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for
object detection,” IEEE Trans. Pattern Analysis and Machine Intelligence,
vol. 36, no. 8, pp. 1532–1545, 2014.
[29] P. Zhou, B. Ni, C. Geng, et al., “ Scale-Transferrable Object Detection,” in
CVPR, Salt Lake City, UT, USA, 2018, pp.528–537.
[30] J. Dai, Y. Li, K. He, et al., “ R-FCN: Object Detection via Region-based
Fully Convolutional Networks,” in NIPS, Barcelona, Spain, 2016, pp.379–
387.
[31] W. Liu, D. Anguelov, D. Erhan, et al., “ SSD: Single Shot MultiBox
Detector,” in ECCV, Amsterdam, Netherlands, 2016, pp.21-37.
[32] S. Liu, L. Qi, H. Qin, et al., “ Path Aggregation Network for Instance
Segmentation,” in CVPR, Salt Lake City, UT, USA, 2018, pp.8759-8768.
[33] K. He, X. Zhang, S. Ren, et al., “ Deep Residual Learning for Image
Recognition,” in CVPR, Las Vegas, NV, USA, 2016, pp.770–778.
[34] D. Park, C. Zitnick, D. Ramanan, et al., “Exploring Weak Stabilization for
Motion Feature Extraction,” in CVPR , Portland, Oregon, 2013, pp.2882–
2889.
[35] S. Walk, N. Majer, K. Schindler, et al., “ New Features and Insights for
Pedestrian Detection,” in CVPR, San Francisco, CA, USA, 2010, pp.1030–
1037.
[36] G. Chen, Y. Ding, J. Xiao, et al., “ Detection Evolution with Multi-order
Contextual Co-occurrence,” in CVPR, Portland, Oregon, 2013,pp.1798–
1805.
[37] W. Ouyang, X. Wang, “ Joint Deep Learning for Pedestrian Detection,” in
ICCV, Sydney, Australia, 2013, pp.2056–2063.
[38] B. Wu and R. Nevatia, “ Detection and tracking of multiple, partially oc-
cluded humans by bayesian combination of edgelet based part detectors,”
International Journal of Computer Vision (IJCV), vol. 75, no. 2, pp. 247–
266, 2007.
[39] S. Paisitkriangkrai, C. Shen, A. van den Hengel, “ Strengthening the
Effectiveness of Pedestrian Detection,” in ECCV, Zurich, Switzerland,
2014, pp.546–561.
[40] S. Paisitkriangkrai, C. Shen, A. van den Hengel, “ Efficient pedestrian
detection by directly optimize the partial area under the ROC curve,” in
ICCV, Sydney, Australia, 2013, pp.1057–1064.
[41] P. Sermanet, K. Kavukcuoglu, S. Chintala, et al., “Pedestrian Detection
with Unsupervised Multi-Stage Feature Learning,” in CVPR, Portland,
Oregon, 2013, pp.3626–3633.
[42] B. Zhou, A. Khosla, A. Lapedriza, et al., “ Learning Deep Features for
Discriminative Localization,” in CVPR, Las Vegas, NV, USA, 2016, pp.
2921-2929.
[43] T. Lin, P. Dollar, R. Girshick, et al., “ Feature Pyramid Networks for Object
Detection,” in CVPR, Honolulu, USA, 2017, pp.2117–2125.
[44] S. Zhang, J. Yang, B. Schiele, et al., “ Occluded Pedestrian Detection
Through Guided Attention in CNNs,” in CVPR, Salt Lake City, UT, USA,
2018, pp.6995–7003.
[45] Z. Hao, Y. Liu, H. Qin, et al., “ Scale-Aware Face Detection,” in CVPR,
Salt Lake City, UT, USA, 2018, pp.6186–6195.
[46] S. Gao, M. Cheng, K. Zhao, et al., “ Res2Net: A New Multi-scale
Backbone Architecture,” arXiv preprint arXiv:1904.01169, 2019.
[47] S. Liu, L. Qi, H. Qin, et al., “ Path Aggregation Network for Instance
Segmentation,”in CVPR, Salt Lake City, UT, USA, 2018, pp.8759–8768
[48] C. Fu, W. Liu, A. Ranga, et al., “DSSD: Deconvolutional Single Shot
Detector,” arXiv preprint arXiv:1701.06659, 2017.
[49] F. Yu, D. Wang, E. Shelhamer, et al., “ Deep Layer Aggregation,” in CVPR,
Salt Lake City, UT, USA, 2018, pp.2403–2412
[50] X. Zhang, L. Cheng, B. Li, et al., “ Too Far to See? Not Really! —
Pedestrian Detection with Scale-aware Localization Policy,” IEEE Trans.
On Image Processing , vol. 27, no. 8, pp. 3703–3715, 2018.
[51] P. Felzenszwalb, R. Girshick, D. McAllester, et al., “ Object detection with
discriminatively trained part based models,” IEEE Trans. Pattern Analysis
and Machine Intelligence , vol. 32, no. 9, pp.1627–1645, 2010.
[52] J. Cao, Y. Pang, X. Li, “Pedestrian Detection Inspired by Appearance
Constancy and Shape Symmetry,” IEEE Trans. On Image Processing, vol.
25, no. 12, pp. 5538–5551, 2016.
[53] G. Ghiasi, T. Lin, Q. Le, “NAS-FPN: Learning Scalable Feature Pyramid
Architecture for Object Detection,” in CVPR, Long Beach, CA, USA,
2019, pp. 7036–7045.
[54] Y. Li, Y. Chen, N. Wang, Z. Zhang, “ Scale-Aware Trident Networks for
Object Detection,” in CVPR, Long Beach, CA, USA, 2019.
[55] P. Dollar, C. Wojek, B. Schiele, et al., “ Pedestrian detection: An evalu-
ation of the state of the art,” IEEE Trans. Pattern Analysis and Machine
Intelligence , vol. 34, no. 4, pp. 743–761, 2012.
[56] B. Yang, J. Yan, Z. Lei, S. Li, “ Convolutional Channel Features,” in ICCV,
Santiago, Chile, 2015.
[57] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick., “ Insideoutside net: De-
tecting objects in context with skip pooling and recurrent neural networks,”
in CVPR, Las Vegas, NV, USA, 2016, pp. 2874-2883.
[58] S. Choudhury, R. Padhy, A. Sangaiah, et al., “ Scale Aware Deep Pedes-
trian Detection,” Transactions on Emerging Telecommunications Tech-
nologies , vol. 30, no. 9, pp. 1–14, 2019.
[59] T. Liu, M. Elmikaty, T. Stathaki, et al., “ SAM-RCNN: Scale-Aware Multi-
Resolution Multi-Channel Pedestrian Detection,” in BMVC, Newcastle,
UK, 2018.
XIAOWEI ZHANG received the Ph.D. degree in
computer science from Beihang University, Bei-
jing, China, in 2018, the M.S. degree in com-
puter science from Shandong Normal University,
Jinan, China, in 2013, and the B.S. degree in
computer science from Shanxi Normal University,
Linfen, China, in 2009. He was a visiting stu-
dent at Bioinformatics Institute (BII), A*STAR,
Singapore from 2016 to 2017. Currently, he is
an assistant professor of Computer Science and
Engineering at Qingdao University. His current research interests include
image/video analysis and understanding, computer vision and machine
learning.
SHUAI CAO received the B.S. degree in computer
science and technology from Liaoning University
of Technology, Jinzhou, China, in 2018. He is
currently pursuing the M.S. degree in Computer
Science and Engineering with Qingdao University
of China. His current research interests include
pedestrian detection and machine learning.
CHENGLIZHAO CHEN received the Ph.D. de-
gree in computer science from Beihang University
in 2017. He is currently an assistant professor with
Qingdao University. His research interests include
computer vision, machine learning, and pattern
recognition.