This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2020.2995321, IEEE Access
Scale-aware Hierarchical Detection
Network for Pedestrian Detection
XIAOWEI ZHANG, SHUAI CAO, AND CHENGLIZHAO CHEN
Shandong Key Laboratory of Intelligent Information Processing, School of Computer Science and Technology, Qingdao University, Qingdao 266071, China.
Corresponding author: Xiaowei Zhang (e-mail: xiaowei19870119@sina.com) and Chenglizhao Chen (e-mail: cclz123@163.com).
This work was supported in part by the National Natural Science Foundation of China (Grant No.6190070308), and in part by the Natural
Science Foundation of Shandong Province of China (Grant No.ZR2019BF028).
ABSTRACT Spatial scale variation of several or even dozens of times is one of the major bottlenecks for pedestrian detection. Although the Region-based Convolutional Neural Network (R-CNN) family has shown promising results for object detection, it is still limited in detecting pedestrians with large scale variations due to the fixed receptive field sizes on a single convolutional output layer. In contrast to previous methods that simply combine pedestrian predictions on feature maps with different resolutions, we propose a scale-aware hierarchical detection network for pedestrian detection under large scale variations. First, we introduce a cross-scale features aggregation module that accomplishes feature augmentation for pedestrian representation by merging the lateral connection, the top-down path, and the bottom-up path. Specifically, the cross-scale features aggregation module adaptively fuses hierarchical features to enhance the feature pyramid representation with robust semantics and accurate localization. Further, we design a scale-aware hierarchical detection network that effectively integrates multiscale pedestrian detection into a unified framework by adaptively perceiving the augmented feature level suited to each pedestrian scale. Experimentally, the proposed scale-aware hierarchical detection network forms a more robust and discriminative model for pedestrian instances of different scales on the widely used ETH and Caltech benchmarks. In particular, compared with the state-of-the-art method FasterRCNN+ATT [44], the log-average miss rate of pedestrian detection is reduced by 11.98% for medium-scale pedestrians (between 30-80 pixels in height) and 14.12% for whole-scale pedestrians (above 20 pixels in height) on the Caltech benchmark.
INDEX TERMS Scale Variation, Feature Aggregation, Scale-aware Weighting, Hierarchical Detection.
I. INTRODUCTION
PEDESTRIAN detection stands out from traditional object detection tasks in view of its broad application prospects in computer vision, such as video surveillance, autonomous driving, and robotics. Although significant improvements have been made on pedestrian detection [4], [8], [24], [27], [41] over the years, most existing efforts work very well only for large-scale pedestrian instances [17]–[19], [23], [34], [36], [52]. Compared with pedestrian detection at large scales, much less attention has been paid to medium- and small-scale instances, as similarly observed in the literature [14], [55].
For autonomous driving systems, detecting medium- and small-size pedestrians is an important topic because doing so leaves sufficient time to alert the driver. Assuming a vehicle traveling at an urban speed of 15 m/s and a pedestrian 1.8 m tall, a person 80 pixels in height is just 1.5 s away, while a person 30 pixels in height is 4 s away. Take one recent effort, AR-Ped [2], as an example: it has been reported that their detector achieves a 6.45% log-average miss rate for pedestrians taller than 50 pixels on the Caltech Pedestrian Benchmark [20]; however, the error rate increases to 49.31% MR for pedestrians 30-80 pixels in height. Fig. 1(a) shows several failure cases of the state-of-the-art method AR-Ped [2] under large scale appearance variations on the Caltech benchmark. Fig. 1(b) illustrates the scale distribution of pedestrian heights on the Caltech dataset; following [55], we group pedestrians by image size (height in pixels) into three scales: near (80 or more pixels), medium (between 30-80 pixels), and far (between 20-30 pixels). Note that about 81.67% of the pedestrians lie in the medium scale on the Caltech dataset.
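The timing argument above follows a pinhole camera model. A minimal sketch, assuming a focal length of roughly 1000 pixels (the focal length is not stated in the paper; this value is an assumption chosen to be consistent with its 1.5 s / 4 s figures), together with the scale grouping from [55]:

```python
def time_to_contact(height_px, person_height_m=1.8, speed_mps=15.0, focal_px=1000.0):
    """Seconds until the vehicle reaches a pedestrian of the given pixel height.

    Pinhole model: distance = focal_px * person_height_m / height_px.
    focal_px is an assumed camera focal length, not given in the paper.
    """
    distance_m = focal_px * person_height_m / height_px
    return distance_m / speed_mps

def scale_group(height_px):
    """Group a pedestrian by pixel height following [55]."""
    if height_px >= 80:
        return "near"
    if height_px >= 30:
        return "medium"
    if height_px >= 20:
        return "far"
    return "ignored"
```

Under these assumptions, an 80-pixel pedestrian is 22.5 m (1.5 s) away and a 30-pixel pedestrian is 60 m (4 s) away, matching the figures above.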
The degraded performance for pedestrian detection under
[Figure 1 graphic: panel (a) shows detection exemplars; panel (b) is a scale-distribution histogram (height 0-450 pixels vs. fraction 0-0.25, with markers at 30 and 80 pixels); panel (c) shows pedestrian crops at scales 10×20~15×30, 15×30~25×50, 25×50~30×80, 30×80~100×200, and 100×200~256×512.]
FIGURE 1: Visual examples of pedestrians at multiple scales. (a) shows exemplars of pedestrian detection using the state-of-the-art method AR-Ped [2]. (b) shows the scale distribution of pedestrian heights on the Caltech dataset; one can observe that medium-size instances indeed dominate the distribution. (c) shows pedestrian instances with different scales on the Caltech dataset.
large scale variations may be attributed to the following inherent challenges. First, small-size pedestrian instances often convey a smaller amount of information while carrying a greater proportion of noise, with obscure appearance and blurred boundaries; it is in general difficult to distinguish them from background clutter. Second, the visual semantic concepts of an object can emerge at different spatial scales depending on the size of the target object. For a pedestrian instance of interest, visual features are effective only at a proper scale where the optimal response is obtained. This difference is more pronounced in complex scenes containing pedestrian instances of diverse scales.
To address the issue of pedestrian detection under large scale appearance variations, Faster-RCNN [9] exploits a multiscale region proposal network (RPN), which achieves excellent object detection performance. However, its multi-scale detection is generated by sliding a fixed set of filters over a fixed set of convolutional feature maps. This results in an inconsistency between the sizes of objects and filter receptive fields: the scales of objects are variable, yet the sizes of filter receptive fields are fixed. Instead of using a fixed set of receptive fields, most related works [1], [7], [15], [44], [50], [58], [59] that aim to detect multi-scale pedestrians redeploy the receptive fields of convolution based on object sizes at multiple output layers. However, in our view, these methods, which either simply select multiple output layers based on the sizes of receptive fields [11], [4], [5] or use feature fusion on a single output layer [44], [57] to obtain multi-scale receptive fields, lack enhancement of the entire feature hierarchy for multiscale pedestrian detection. This motivates us to construct an aggregated feature representation that enhances semantic information and localization signals for scale-aware pedestrian detection.
Motivated by the above insight and analysis of the feature hierarchy pyramid representation, we propose a scale-aware hierarchical detection network for pedestrian detection under large scale variations. First, we accomplish feature aggregation based on FPN [43] to enhance the semantic information and localization signals in the feature representation, by merging the lateral connection, the top-down path, and the bottom-up path. Furthermore, in view of the feature differences among pedestrians at different scales, the scale-aware hierarchical detection network is designed to adaptively perceive pedestrian instances within certain scale ranges, by probing the feature differences across scales in the augmented pyramid features.
To sum up, our work makes the following contributions:
1) We introduce a cross-scale features aggregation module to enhance the feature pyramid representation by fusing robust semantics and accurate localization for pedestrians with different scales, which accomplishes feature augmentation from the lateral connection, the top-down path, and the bottom-up path.
2) A novel scale perception strategy based on a normalized Gaussian gate function is designed to integrate multiple detection heads into a unified framework by adaptively perceiving the output of the cross-scale features aggregation module for the scale-aware
hierarchical detection network.
3) Experimentally, compared with the state-of-the-art method FasterRCNN+ATT [44], the log-average miss rate of pedestrian detection is reduced by 11.98% for medium-scale pedestrians (between 30-80 pixels in height) and 14.12% for whole-scale pedestrians (above 20 pixels in height) on the Caltech benchmark.
II. RELATED WORK
There has been lasting research activity on pedestrian detection, with a vast literature. Before the emergence of CNNs, hand-crafted features were widely used to obtain good performance for pedestrian detection, including HOG [10], Edgelets [38], ICF [21] and its variants ACF [28], LDCF [17], [22], and SCF [23]. The most popular pedestrian detector is the deformable part model (DPM) [51], which combines a rigid root filter and deformable part filters based on a HOG feature pyramid and latent SVM classifiers for detection.
Deep ConvNets, owing to their stronger feature representation ability, exhibit obvious performance gains on pedestrian detection [2], [13], [18], [28], [29]. CCF [56] absorbs merits from filtered channel features and Convolutional Neural Networks (CNNs), transferring low-level features from pre-trained CNN models to feed a boosting forest model for pedestrian detection. ConvNet [41] uses an unsupervised method based on convolutional sparse coding to pre-train a CNN for pedestrian detection. DeepParts [13] consists of extensive part detectors, each of which is a strong detector that can detect a pedestrian by observing only a part of a proposal. SDP [11] investigates scale-dependent pooling and layer-wise cascaded rejection classifiers within a CNN to detect objects. CompACT-Deep [16] leverages both hand-crafted and CNN features to form complexity-aware cascaded detectors for an optimal trade-off between accuracy and speed. In particular, Faster-RCNN [9] introduced a multiscale region proposal network that shares full-image convolutional features with the detection network, leading to excellent performance for pedestrian detection.
However, spatial scale variation is one of the main challenges for pedestrian detection due to the large variance of instance scales across scenarios. To address the issue, upsampling or dilated operations [5], [11] are employed to alleviate the performance decline caused by the fixed set of filter receptive fields in Faster-RCNN [9]. MS-CNN [5] combines multiple output layers with feature upsampling by deconvolution to produce a strong multi-scale object detector. SA-FastRCNN [4] exploits multiple built-in subnetworks in a divide-and-conquer strategy to adaptively detect pedestrians across scales. RPN+BF [7] reuses the high-resolution convolutional features of the RPN through cascaded boosted forests for multiscale pedestrian detection. ADM [50] executes sequences of coordinate transformations on multi-layer feature maps to deliver accurate pedestrian locations. TridentNet [54] constructs a parallel multi-branch architecture that expands receptive fields via dilated convolution for detecting objects of different scales. However, these methods do not effectively fuse the robust semantic information of targets in the high-level convolutional layers with the precise localization signals of the lower convolutional layers for multiscale pedestrian detection.
To exploit strong semantics for prediction, FPN [43] augments a top-down pathway and lateral connections to propagate high-level semantic information for reasonable classification capability. DSSD [48] adopts deconvolution layers to aggregate context and high-level semantics for enhancing shallow features. M2Det [3] presents a multi-level feature pyramid network to fuse multiscale features for detecting objects of different scales. On the other hand, the many fine details and higher resolution in low-level feature maps benefit localization accuracy. PANet [47] builds a strong indicator to accurately localize instance segmentation via a pathway with clean lateral connections from the low levels to the top ones. DLA [49] augments standard architectures with deeper aggregation across layers to obtain a stronger layer-wise multi-scale representation capability. STDN [29] is equipped with embedded super-resolution scale-transfer layers to exploit the inter-scale consistency across multiple detection scales. Recently, NAS-FPN [53] uses a series of merging cells to fuse features across scales through a combination of top-down and bottom-up connections. Res2Net [46] constructs hierarchical residual-like connections within a single residual block to capture multi-scale features at a granular level.
Inspired by these observations and analyses of feature fusion for multiscale detection, in this paper we explore a scale-aware hierarchical detection network for multi-scale pedestrian detection, aggregating the strong semantic information from high-level features and the accurate localization signals from low-level layers to enhance the pyramidal feature representations.
III. APPROACH OVERVIEW
A high-level overview of our architecture is shown in Fig. 2. Our proposed approach consists of two main components: a cross-scale features aggregation module and a scale-aware hierarchical detection network. The cross-scale features aggregation module is built on the Feature Pyramid Network (FPN) [43] to enhance the representation ability of pyramid features. FPN shows significant improvement as a generic feature extractor for object recognition, propagating semantically strong features along the top-down path to endow pyramid features with reasonable classification capability. Similarly, many fine details and strong responses of local patterns exist in low-level convolutional layers, which benefit high localization accuracy. For this reason, we design a cross-scale features aggregation module to adaptively aggregate features across the pyramid hierarchy and enhance the localization capability.
Further, the scale-aware hierarchical detection network, based on the Fast R-CNN framework [6], combines complementary detection branches on the hierarchical pyramid feature maps from the cross-scale features aggregation module.
[Figure 2 graphic: the original image passes through ResNet stages C1-C5; the cross-level feature aggregation module (identity mapping, upsampling, average pooling) produces H3-H5; region proposals at near, medium, and far scales are routed to three detection branches in the scale-aware hierarchical detection subnetwork, each with an RoI pooling layer, FCs, and a 1024-d RoI feature vector producing 2-d cls-score and 8-d bbox-pred outputs, which are merged into scale-aware outputs.]
FIGURE 2: The architecture of our proposed Scale-aware Hierarchical Detection Network. Our approach uses the cross-scale features aggregation module to enhance semantic robustness and localization accuracy, and the scale-aware hierarchical detection network to adaptively detect pedestrians of specific scales in the image from the augmented feature levels.
The detection heads in the hierarchical detection network, built on ResNet [33] pretrained on ImageNet, all share parameters across proposals and learn scale-aware hierarchical weights by minimizing the error rate for pedestrians with different scales, regardless of their feature levels.
A. CROSS-SCALE FEATURES AGGREGATION MODULE
The Feature Pyramid Network (FPN) [43] shows significant improvement as a generic feature extractor for object recognition, propagating semantically strong features to enhance pyramid features with reasonable classification capability. Following previous evidence on the benefits of the feature approximation strategy [28], we denote the outputs of the last residual blocks as $\{C_1, C_2, C_3, C_4, C_5\}$ for conv1, conv2, conv3, conv4, and conv5 in ResNet, and we are given a list of multi-scale pyramid features $\{P_1, P_2, P_3, P_4, P_5\}$ from FPN [43], where $P_i$ represents the feature at pyramid level $i$. However, the feature fusion in FPN builds only on the lateral connection and the top-down pathway, ignoring the impact of bottom-up path augmentation, which would enhance the feature representation with the accurate localization signals existing in low-level convolutional layers.
Our goal is to find a transformation function $f$ that effectively aggregates multi-scale features and outputs a list of new features: $X_{out} = f(X_{in})$, where $X_{in}$ may be $C_i$, $P_i$, or their union. Different from the feature augmentation generated by FPN, we propose a cross-scale features aggregation module (CFAM) that merges a bottom-up pathway into FPN. Specifically, we use $\{H_1, H_2, H_3, H_4, H_5\}$ to denote the augmented feature pyramid, in which the spatial resolution of the feature maps is gradually upsampled by a factor of 2 from $H_i$ to $H_{i-1}$. As shown in Fig. 3(b), each feature aggregation module takes a convolutional feature map $C_{i-1}$ with higher resolution, an identity-mapping feature map $C_i$, and a coarser feature map $H_{i+1}$ with stronger semantics to generate the augmented feature map $H_i$. Note that we adopt average pooling to downsample the spatially finer feature maps, which directly propagates the strong responses of local patterns from low pyramid levels along the bottom-up augmented pathway for accurate localization.
The key idea of CFAM is to adaptively aggregate multi-scale context information from the feature maps of convolutional layers at adjacent scales to generate more discriminative features. As shown in Fig. 3(b), each aggregation module merges a top-down path, lateral connections, and a bottom-up augmented path by addition. This is an iterated process that builds the augmented feature pyramid down to the finest resolution map $H_3$. At the beginning of the iteration, we adopt a 1×1 convolutional layer on $C_5$ to produce the coarsest but semantically strongest resolution map $H_5$. Then the lower-level feature map $C_{i-1}$ goes through a 2×2 average pooling layer with stride 2 to reduce its spatial size and generate the down-sampled feature map of the bottom-up augmented pathway. The upsampled feature map $H_{i+1}$, the down-sampled feature map, and the identity-mapping feature map $C_i$ are added element-wise to generate the fused feature map. Finally, we append a 1×1 convolution
[Figure 3 graphic: (a) FPN block: C5→P5 via 1×1 conv; P4 and P3 are built from identity mappings of C4 and C3 plus 2× upsampling of the coarser map. (b) CFAM block: identity mapping plus 2× upsampling of the coarser map plus average pooling of the finer map.]
(a) The feature aggregation block from FPN [43] merges the lateral connection and the top-down pathway by addition. (b) Our cross-scale features aggregation module augments features from the lateral connection, the top-down pathway, and the bottom-up pathway.
FIGURE 3: Illustration of the feature aggregation module designs.
on each merged map to generate the final augmented feature map $H_i$ for the following sub-networks; this reduces the aliasing effect of upsampling and downsampling. In the feature aggregation module, the augmented feature maps correspond to $\{C_3, C_4, C_5\}$ with the same spatial sizes, and we set 1024-channel outputs for each level of the augmented feature pyramid $\{H_3, H_4, H_5\}$ fed to the scale-aware hierarchical detection network.
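The aggregation described above can be sketched as follows. A minimal NumPy version, assuming all inputs have already been projected to a common channel dimension (the paper builds this on ResNet feature maps of differing widths), with each 1×1 convolution expressed as a per-pixel channel mix:

```python
import numpy as np

def avgpool2x2(x):
    # 2x2 average pooling with stride 2 on a (C, H, W) map (bottom-up path)
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2x(x):
    # nearest-neighbour 2x upsampling on a (C, H, W) map (top-down path)
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, weight):
    # a 1x1 convolution is a per-pixel channel mix; weight: (C_out, C_in)
    c, h, w = x.shape
    return (weight @ x.reshape(c, -1)).reshape(weight.shape[0], h, w)

def cfam(c2, c3, c4, c5, w3, w4, w5):
    """Cross-scale features aggregation: returns (H3, H4, H5)."""
    # H5: coarsest but semantically strongest map (1x1 conv on C5)
    h5 = conv1x1(c5, w5)
    # H_i = conv1x1( upsample(H_{i+1}) + C_i + avgpool(C_{i-1}) )
    h4 = conv1x1(upsample2x(h5) + c4 + avgpool2x2(c3), w4)
    h3 = conv1x1(upsample2x(h4) + c3 + avgpool2x2(c2), w3)
    return h3, h4, h5
```

Each output level keeps the spatial size of the corresponding $C_i$, matching the correspondence between $\{H_3, H_4, H_5\}$ and $\{C_3, C_4, C_5\}$ stated above.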
B. SCALE-AWARE HIERARCHICAL DETECTION NETWORK
Covering multiple scale ranges is a critical problem for pedestrian detection. Different from the multi-scale mechanism of the RPN [9], we divide the region proposals from the higher convolutional layer $C_4$ into three scales (near, medium, and far); each scale is routed to an augmented feature pyramid level $H_i$ to detect pedestrian instances within certain scale ranges, as shown in Fig. 4. We hypothesize that pedestrian instances with different scales can be better modeled by a hierarchical detection network whose filter receptive fields lie in valid ranges. Specifically, each pedestrian anchor scale needs to effectively match the receptive field size of the RoI pooling through a different spatial pooling structure.
Let $L_m(X_i, Y_i \mid W)$ represent the multi-task loss function for each pedestrian proposal at specific feature level $H_m$, given by:
$$L_m(X_i, Y_i \mid W) = L^m_{cls}(p_i, \hat{p}_i) + \lambda \hat{p}_i L^m_{loc}(b_i, \hat{b}_i). \qquad (1)$$
where $\hat{p}_i$ is 1 if the anchor is labeled positive and 0 otherwise, and $p_i$ is the predicted probability of the anchor being a proposal. $\hat{b}_i = (\hat{b}^x_i, \hat{b}^y_i, \hat{b}^w_i, \hat{b}^h_i)$ represents the ground-truth box associated with a positive anchor, and $b_i = (b^x_i, b^y_i, b^w_i, b^h_i)$ represents the parameterized coordinates of the predicted bounding box. The classification loss $L^m_{cls}$ is the softmax loss over two classes (pedestrian vs. not) at specific feature level $H_m$. For the regression loss, we use $L^m_{loc} = R(b_i - \hat{b}_i)$, where $R$ is the robust loss function (smooth-$L_1$) defined in [6]. The term $\hat{p}_i L^m_{loc}$ means the regression loss is activated only for positive anchors ($\hat{p}_i = 1$) and disabled otherwise ($\hat{p}_i = 0$).
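A scalar sketch of Eq. (1) for a single proposal, assuming two-class logits and the standard smooth-$L_1$ from [6] (the helper names are illustrative, not from the paper):

```python
import numpy as np

def smooth_l1(x):
    # R in the paper: smooth-L1 from Fast R-CNN [6]
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def softmax_ce(logits, label):
    # softmax cross-entropy over two classes (pedestrian vs. not)
    z = logits - logits.max()
    return -(z[label] - np.log(np.exp(z).sum()))

def detection_loss(cls_logits, b, b_hat, p_hat, lam=1.0):
    # Eq. (1): L_cls + lam * p_hat * L_loc;
    # regression is active only for positive anchors (p_hat = 1)
    l_cls = softmax_ce(cls_logits, p_hat)
    l_loc = smooth_l1(np.asarray(b, float) - np.asarray(b_hat, float)).sum()
    return l_cls + lam * p_hat * l_loc
```

For a negative anchor the regression term vanishes and only the classification loss remains, as the text states.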
To adaptively match valid feature levels and anchor scales for multiscale pedestrian detection, SDP [11] adopts a hard isolation strategy based on the pixel height of an object proposal to detect multiscale objects. SA-FastRCNN [4] exploits a soft isolation strategy using a sigmoid gate function defined over the object proposal sizes to generate scale-aware weighting for multi-scale detection subnetworks. In this paper, we design a novel scale perception strategy using a normalized Gaussian gate function for the scale-aware hierarchical detection network (SHDN), as shown in Fig. 4, and the model loss function is defined as:
$$L(W) = \sum_{m=1}^{M} \sum_{i \in U} \omega_m L_m(X_i, Y_i \mid W) \qquad (2)$$
where $M$ is the number of hierarchical feature pyramid levels as mentioned in Section III-A, $U = \{(X_i, Y_i)\}_{i=1}^{N}$ contains the multi-scale training examples of pedestrian instances, and $\omega_m$ is the normalized scale-aware weight of the corresponding hierarchical loss $L_m(X_i, Y_i \mid W)$, initialized by $\omega_m = e^{\hat{\omega}_m} / \sum_{i=1}^{M} e^{\hat{\omega}_i}$ with $\hat{\omega}_m = e^{-(s - \bar{s}_m)^2 / 2\gamma_m^2}$. Here $s = \log_2(h)$ denotes the height scale of a pedestrian, which has already been normalized to a narrow range prior to detection, and $\bar{s}_m$ and $\gamma_m$ are the average height scale and the scaling coefficient for specific feature level $H_m$, respectively. Given a sliding window, a Gaussian function with lower $\gamma_m$ tends to enlarge the gap between the weights of pedestrian instances from different scale ranges. Based on the ResNet structure, the output size of RoI pooling is 7×7, with a stride chosen from $\{8, 16, 32\}$ for the deep network levels $\{C_3, C_4, C_5\}$; the valid receptive fields for the hierarchical feature pyramid $\{H_3, H_4, H_5\}$ are then $\{56, 112, 224\}$ pixels in bounding-box height, respectively. Consequently, we assign the scale-aware parameters $(\bar{s}_m, \gamma_m)$ as $\{(5.8, 1.25), (6.8, 2), (7.8, 1.25)\}$ for the hierarchical feature pyramid $\{H_3, H_4, H_5\}$, respectively. Note that we optimize the multi-task loss function, shunted to the scale-aware hierarchical detection modules by the scale-aware weight parameters $(\bar{s}_m, \gamma_m)$, and all parameters after the RoI pooling layers are shared across all levels of the hierarchical feature pyramid.
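The normalized Gaussian gating above can be sketched directly, using the $(\bar{s}_m, \gamma_m)$ values the paper assigns to $\{H_3, H_4, H_5\}$:

```python
import numpy as np

# (s̄_m, γ_m) for the pyramid levels {H3, H4, H5}, as assigned in the text
PARAMS = [(5.8, 1.25), (6.8, 2.0), (7.8, 1.25)]

def scale_aware_weights(height_px):
    """Normalized scale-aware weights ω_m for a proposal of given pixel height."""
    # s = log2(h): the height scale of the pedestrian proposal
    s = np.log2(height_px)
    # ω̂_m = exp(-(s - s̄_m)^2 / 2γ_m^2), then softmax-normalized:
    # ω_m = exp(ω̂_m) / Σ_i exp(ω̂_i)
    omega_hat = np.array([np.exp(-(s - sm) ** 2 / (2.0 * gm ** 2))
                          for sm, gm in PARAMS])
    e = np.exp(omega_hat)
    return e / e.sum()
```

The valid receptive-field heights $\{56, 112, 224\}$ correspond to $s \approx 5.8, 6.8, 7.8$, so proposals of those heights weight $H_3$, $H_4$, and $H_5$ most strongly.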
[Figure 4 graphic: multiscale pedestrian proposals are divided into far, medium, and near scales and routed to H3, H4, and H5; each branch of the scale-aware hierarchical detection subnetwork applies an RoI pooling layer and FCs to produce a 1024-d RoI feature vector with 2-d cls-score and 8-d bbox-pred outputs, which are merged into scale-aware outputs.]
FIGURE 4: Our proposed scale compensation strategy from the multipath RPN to initial proposals. It uses the hierarchical features of deep convolutional layers to obtain a series of reasonable anchor scales for pedestrian proposals, and each scale focuses on pedestrian instances within certain scale ranges in an image.
To efficiently train the scale-aware hierarchical detection network, sampling is used to compensate for the imbalance between the distributions of positive samples $U^m_+$ and negative samples $U^m_-$. In this paper, we adopt random sampling and bootstrapped sampling to collect the final set of negative samples, such that $|U^m_-| = \zeta |U^m_+|$. We utilize random sampling to select easy negative samples according to a uniform distribution. Because hard negative mining has a large influence on detection accuracy, bootstrapped sampling is exploited to improve detection performance by ranking the negative samples according to their objectness scores. On the other hand, to avoid a heavy asymmetry between the positive samples $U^m_+$ and negative samples $U^m_-$ for each specific detection layer, the cross-entropy terms of positives and negatives are weighted as in Eq. (3), which guarantees that each detection layer has enough positive samples to cover a certain range of scales.
$$L_{cls} = \frac{1}{1+\zeta}\,\frac{1}{|U^m_+|}\sum_{i \in U^m_+} -\log p_{\hat{p}_i}(X_i) \;+\; \frac{\zeta}{1+\zeta}\,\frac{1}{|U^m_-|}\sum_{i \in U^m_-} -\log p_0(X_i) \qquad (3)$$
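A minimal sketch of the weighted cross-entropy in Eq. (3), taking per-sample predicted probabilities for the labeled class (the function and argument names are illustrative):

```python
import numpy as np

def balanced_cls_loss(p_pos, p_neg, zeta):
    """Eq. (3): class-balanced cross-entropy.

    p_pos: predicted pedestrian probabilities for positives (U^m_+)
    p_neg: predicted background probabilities for negatives (U^m_-)
    zeta:  the ratio |U^m_-| / |U^m_+|
    """
    pos_term = -np.log(np.asarray(p_pos)).mean()   # averaged over U^m_+
    neg_term = -np.log(np.asarray(p_neg)).mean()   # averaged over U^m_-
    # positives weighted 1/(1+ζ), negatives ζ/(1+ζ)
    return pos_term / (1.0 + zeta) + zeta * neg_term / (1.0 + zeta)
```

With $\zeta = 3$, as with 32 positives and 96 negatives per mini-batch, the negative term receives three times the aggregate weight of the positive term.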
IV. EXPERIMENTS
A. EXPERIMENT DETAILS
Starting from ResNet [33] pretrained on ImageNet, we fine-tune the convolutional neural network to extract visual features from the observed video frames of the Caltech training dataset. The convolutional layers and max pooling layers of the ResNet network are used as the shared convolutional layers before the Region-of-Interest (RoI) pooling layer to produce feature maps from the entire input image. The last convolutional block in ResNet is 2048-d, and we employ a randomly initialized 1024-d 1×1 convolutional layer to reduce the dimension. We use single-scale training, in which the input image is resized to 600 pixels on the shorter side. The scale-aware feature aggregation network is trained with Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005. As [9], [30] demonstrate that mining from a larger set of candidates (e.g., 2000) has no benefit, we use 300 RoIs for both training and testing in this paper. We fine-tune the scale-aware hierarchical detection network with a learning rate of 0.001 for 20k mini-batches. Each mini-batch consists of 128 randomly sampled object proposals from one randomly selected image, of which 32 are positive object proposals and the remaining 96 are negative. A positive pedestrian label is assigned when IoU ≥ 0.5 between the object proposal and a ground-truth box, and a negative label is assigned to RoIs whose IoU ≤ 0.3 for all ground-truth boxes. The whole scale-aware hierarchical detection network is trained on a single NVIDIA GeForce GTX TITAN X GPU with 12 GB memory.
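The proposal labeling and mini-batch composition described above can be sketched as follows. This is a simplified version: boxes are (x1, y1, x2, y2) tuples, and the bootstrapped hard-negative ranking is omitted in favor of plain random sampling:

```python
import random

def iou(a, b):
    # intersection-over-union of two boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def label_proposals(proposals, gt_boxes, pos_thr=0.5, neg_thr=0.3):
    # positive: IoU >= 0.5 with some ground truth; negative: IoU <= 0.3 with all
    pos, neg = [], []
    for p in proposals:
        best = max((iou(p, g) for g in gt_boxes), default=0.0)
        if best >= pos_thr:
            pos.append(p)
        elif best <= neg_thr:
            neg.append(p)      # proposals in (0.3, 0.5) are ignored
    return pos, neg

def sample_minibatch(pos, neg, n_pos=32, n_neg=96, seed=0):
    # 128-proposal mini-batch: 32 positives and 96 negatives per image
    rng = random.Random(seed)
    return (rng.sample(pos, min(n_pos, len(pos))),
            rng.sample(neg, min(n_neg, len(neg))))
```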
B. ABLATION EXPERIMENTS
1) Evaluating the cross-scale features aggregation module
As mentioned in [7], the Region Proposal Network (RPN) in Faster R-CNN indeed performs well as a stand-alone detector, but the downstream classifier degrades pedestrian detection performance. In this subsection, we investigate the cross-scale features aggregation module in terms of detection quality, evaluated by the log-average miss rate of pedestrian detection under IoU = 0.5 on the Caltech dataset.
First, we evaluate the high-level convolutional layers (ResNet-50 $C_3$ to $C_5$) of ResNet [33] for extracting RoI features to detect pedestrians using a set of anchor scales from the RPN. As shown in Table 1(a)(b)(c), which illustrates the effect of high-level convolutional features in ResNet-50 on detecting pedestrian instances, the higher convolutional layers (e.g., $C_4$, $C_5$) obviously perform better than the lower-level convolutional layer (e.g., $C_3$) for pedestrian instances at the near scale. This can be attributed to higher-level convolutional features carrying more robust semantic information than lower levels.

TABLE 1: Evaluations of pedestrian detection at different feature pyramid levels by log-average miss rate (MR) under IoU = 0.5 on the Caltech dataset.

Detection Network       Proposals  RoI features  lateral?  top-down?  bottom-up?  MR_all   MR_f     MR_m     MR_n
(a) Baseline on Conv.   C4         C3            ×         ×          ×           95.79%   100%     94.25%   79.31%
(b) Baseline on Conv.   C4         C4            ×         ×          ×           63.45%   95.76%   47.92%   4.83%
(c) Baseline on Conv.   C4         C5            ×         ×          ×           81.86%   –        78.42%   4.29%
(d) Baseline on FPN     C4         P3            ✓         ✓          ×           52.52%   76.59%   37.76%   14.18%
(e) Baseline on FPN     C4         P4            ✓         ✓          ×           46.78%   90.38%   38.65%   2.74%
(f) Baseline on FPN     C4         P5            ✓         ✓          ×           78.67%   –        75.96%   3.36%
(g) Based on our CFAM   C4         H3            ✓         ✓          ✓           46.52%   72.83%   36.50%   16.35%
(h) Based on our CFAM   C4         H4            ✓         ✓          ✓           43.69%   86.50%   38.08%   2.12%
(i) Based on our CFAM   C4         H5            ✓         ✓          ✓           84.84%   –        85.82%   2.51%
Further, compared with adopting a single high-level convolutional
layer (e.g., C3, C4, or C5) to detect pedestrians, FPN
(e.g., P3, P4, or P5) fuses the semantically strong features
from higher convolutional layers to enhance the pyramid features
for classification. In particular, P3 achieves 76.59% MR
for pedestrian detection at far scale, and reduces the MR for
medium-scale pedestrian instances by 10.16% compared with C4, as
shown in Table 1(d). However, FPN builds only on
the lateral connection and the top-down pathway for feature
fusion, ignoring the bottom-up pathway, which is
beneficial for accurate localization. In contrast to the improved
performance of P3 at far and medium scales, P4
degrades the pedestrian detection performance, as shown in
Table 1(e), possibly owing to the lack of the accurate localization
signals present in lower convolutional layers. Therefore, we
propose a cross-scale features aggregation module (CFAM)
that fuses semantic information and localization signals by
adding a bottom-up augmented pathway to FPN. As shown
in Table 1(g), H3 achieves the best pedestrian detection
performance at far and medium scales, reaching 72.83%
MR and 36.50% MR, respectively. Note that H4 achieves
43.69% MR for pedestrians across all scales.
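The fusion described above, in the spirit of FPN [43] and the path-aggregation idea of [32], can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: the lateral 1×1 convolutions are omitted by assuming the inputs already share one channel dimension, nearest-neighbor upsampling stands in for the top-down resizing, and plain stride-2 subsampling stands in for the learned bottom-up downsampling convolutions:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def downsample2x(x):
    """Stride-2 subsampling, standing in for a stride-2 conv."""
    return x[::2, ::2]

def cfam_sketch(c3, c4, c5):
    """Top-down (FPN-style) fusion followed by a bottom-up
    augmented pathway, producing H3..H5 from backbone maps C3..C5."""
    # Top-down pathway: propagate strong semantics to finer levels.
    p5 = c5
    p4 = c4 + upsample2x(p5)
    p3 = c3 + upsample2x(p4)
    # Bottom-up augmented pathway: propagate localization signals
    # from the fine, spatially accurate levels back up the pyramid.
    h3 = p3
    h4 = p4 + downsample2x(h3)
    h5 = p5 + downsample2x(h4)
    return h3, h4, h5
```

Each output level H_m keeps the spatial resolution of its input level while mixing in both the top-down semantic signal and the bottom-up localization signal.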
2) The Role of the Scale-aware Hierarchical Detection Network
In this subsection, the contribution of the proposed scale-aware
hierarchical detection network is evaluated by the log-average
miss rate under IoU = 0.5 on the Caltech testing dataset for
pedestrian detection. We conduct comparison experiments
to verify the effectiveness of the proposed method with a
single output layer versus multiple output layers for the detection
heads. As shown in Table 2(a)(b)(c), we compare the single
output layers H3, H4, and H5 from the proposed
cross-scale features aggregation module as detection heads for
pedestrians at different scales. We find that H3 performs
better than the other single output layers in log-average miss rate
for pedestrian detection at far and medium scales. For
near scale, H4 achieves the best single-output-layer detection
performance of 2.12% MR, an improvement of 14.23% MR
over the competitor H3 (16.35% MR).
However, detecting pedestrians from only a single output
layer cannot effectively cover multiscale pedestrians that
exhibit large scale variations, owing to the lack of scale
complementarity among multiple feature layers with different
filter receptive field sizes. To effectively combine multiple
output layers of the feature pyramid for pedestrian detection, we
adopt the scale-aware parameters (s̄_m, γ_m) to initialize the
learned hierarchical weights ω_m that optimize the multi-task loss
function in formula 2. Specifically, we assign the scale-aware
parameters (s̄_m, γ_m) as {(5.8, 1.25), (6.8, 2), (7.8, 1.25)} for the
hierarchical feature pyramid levels {H3, H4, H5}, respectively. In
Table 2(d), combining the layers {H3, H4} attains 42.58% MR
over all scales on the Caltech benchmark, improving by 1.11%
over the single output layer H4, and achieves the best
far-scale detection performance of 65.86% MR. Notably,
combining the layers {H4, H5} does not
improve pedestrian detection performance at medium and
far scales, but achieves a better near-scale performance of 1.25%
MR, as shown in Table 2(e). The reason may be that, in
our proposed hierarchical scale-aware detection
network, each detection branch learns a suitable
pyramid feature level and thus focuses on pedestrian instances
within a certain scale range. Moreover, the log-average miss rate
is reduced to 40.39% for all-scale pedestrian detection,
28.77% at medium scale, and 1.08% at near scale, by
combining the layers {H3, H4, H5}, as shown in Table 2(f). Note
that combining {H3, H4, H5} obtains the best performance
compared with {C3, C4, C5}, {P3, P4, P5}, and {P2, P3, P4,
P5}, as shown in Table 2(g–i). The experiments demonstrate
that the proposed hierarchical scale-aware detection network
is more flexible and is able to exploit the different filter
receptive field sizes of multi-level pyramid features to handle
the large variance in pedestrian instance scales.
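Formula 2 itself lies outside this section, so the snippet below is only one plausible, hypothetical reading of how (s̄_m, γ_m) could seed the hierarchical weights ω_m: a Gaussian gate over the log2 pedestrian height, under which (6.8, 2) centers the H4 branch near 2^6.8 ≈ 111-pixel pedestrians. The function name and the gating form are illustrative assumptions, not the paper's definition:

```python
import numpy as np

# Scale-aware parameters (s_bar_m, gamma_m) for {H3, H4, H5},
# as assigned in the text.
SCALE_PARAMS = [(5.8, 1.25), (6.8, 2.0), (7.8, 1.25)]

def hierarchical_weights(height_px):
    """Hypothetical Gaussian gating over log2 pedestrian height,
    illustrating how (s_bar_m, gamma_m) could seed per-branch
    weights w_m; the paper's formula 2 defines the actual
    learned weights."""
    s = np.log2(height_px)
    w = np.array([np.exp(-((s - s_bar) ** 2) / (2.0 * gamma ** 2))
                  for s_bar, gamma in SCALE_PARAMS])
    return w / w.sum()  # normalize across the three branches
```

Under this gate, a 50-pixel pedestrian (log2 50 ≈ 5.64) is weighted mostly toward the H3 branch, while a 250-pixel one (log2 250 ≈ 7.97) is weighted mostly toward H5, matching the intent that each branch focuses on a certain scale range.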
C. COMPARISON WITH STATE-OF-THE-ARTS
In this section, the performance of the proposed algorithm is
fully evaluated against the state-of-the-art methods on the Caltech [20]
and ETH [18] datasets. Following the evaluation criteria
proposed in [55], the log-average miss rate is used to summarize
detector performance; it is computed by averaging the
miss rate at FPPI values evenly spaced in log-space within
the range 10^-3 to 10^0. The experiments demonstrate
that jointly applying the cross-scale features aggregation module and the
scale-aware hierarchical detection network outperforms the
state-of-the-art pedestrian detection algorithms, especially for
pedestrian instances with small sizes.

TABLE 2: COMPARISONS OF PEDESTRIAN DETECTION RESULTS BY LOG-AVERAGE MISS RATE (MR) UNDER IoU=0.5 ON THE CALTECH DATASET (subscripts: all = all scales, f = far, m = medium, n = near)

Detection Network                   | Proposals | RoI features     | MR_all | MR_f   | MR_m   | MR_n
(a) Based on a single output layer  | C4        | H3               | 46.52% | 72.83% | 36.50% | 16.35%
(b) Based on a single output layer  | C4        | H4               | 43.69% | 86.50% | 38.08% | 2.12%
(c) Based on a single output layer  | C4        | H5               | 84.84% | –      | 85.82% | 2.51%
(d) Based on multiple output layers | C4        | {H3, H4}         | 42.58% | 65.86% | 30.53% | 1.48%
(e) Based on multiple output layers | C4        | {H4, H5}         | 44.37% | 90.56% | 33.66% | 1.25%
(f) Based on multiple output layers | C4        | {H3, H4, H5}     | 40.39% | 70.69% | 28.77% | 1.08%
(g) Based on multiple output layers | C4        | {C3, C4, C5}     | 50.47% | 78.67% | 37.43% | 2.22%
(h) Based on multiple output layers | C4        | {P3, P4, P5}     | 46.69% | 74.42% | 31.16% | 1.79%
(i) Based on multiple output layers | C4        | {P2, P3, P4, P5} | 45.12% | 71.54% | 32.83% | 1.46%
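As a concrete illustration, the metric can be computed as below; the nine reference points and the geometric averaging follow the common protocol of [55], while the interpolation scheme is an assumption of this sketch:

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, lo=1e-3, hi=1.0, n_points=9):
    """Geometric mean of the miss rate sampled at n_points FPPI
    values evenly spaced in log-space between lo and hi, following
    the protocol of [55]. fppi and miss_rate describe the detector's
    miss-rate-vs-FPPI curve (fppi must be increasing)."""
    ref = np.logspace(np.log10(lo), np.log10(hi), n_points)
    # Interpolate the curve at the reference FPPI points (in log-x).
    mr = np.interp(np.log10(ref), np.log10(fppi), miss_rate)
    mr = np.clip(mr, 1e-10, 1.0)  # guard the logarithm below
    return float(np.exp(np.mean(np.log(mr))))
```

For example, a detector whose curve is flat at a 0.4 miss rate over the whole FPPI range scores a log-average MR of 40%.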
1) Comparison with state-of-the-art methods on Caltech
dataset
The Caltech pedestrian dataset consists of approximately
10 hours of 640×480, 30 Hz video taken from a vehicle
driving through regular traffic in an urban environment;
it includes about 250,000 frames with a total of 2,300
unique pedestrians. Following previous relevant publications
[13], [16], [17], [24], we evaluate our method on pedestrians
at different spatial scales on the Caltech testing dataset, and
choose the Caltech training dataset together with the INRIA
training dataset [10] as our training set. Our proposed method
is compared with the state-of-the-art methods on the Caltech
testing dataset, including LDCF [22], ACF+SDt [34], RPN+BF [7],
MS-CNN [5], CompACT-Deep [16], TA-CNN [24], SA-FastRCNN [4],
FasterRCNN+ATT [44], and AR-Ped [2].
To evaluate the effectiveness of our proposed scale-aware
hierarchical detection network, quantitative comparison results
are presented for different scale ranges of pedestrian
instances on the Caltech dataset. Fig. 5 shows the comparison
of log-average miss rates for pedestrians over different scale
ranges. It can be observed that our proposed method significantly
outperforms the other methods and achieves the lowest
log-average miss rate of 28.77% at medium scale on the Caltech
dataset, as shown in Fig. 5(a), which is lower than the
state-of-the-art FasterRCNN+ATT [44] by 11.98%. Showing a
similar trend in Fig. 5(b), our approach achieves a 7.41%
log-average miss rate for pedestrian instances taller than 50
pixels, second only to the state-of-the-art AR-Ped [2].
For pedestrian instances in the far scale range, most methods
exhibit dramatic performance drops, as shown in Fig. 5(c).
FIGURE 5: Quantitative comparisons on the Caltech dataset (miss rate vs. false positives per image; legend gives log-average MR):
(a) Medium scale (80 ≥ height ≥ 30 pixels): TA-CNN 63.62%, LDCF 61.82%, DeepParts 56.42%, RPN+BF 53.93%, CompACT-Deep 53.23%, SA-FastRCNN 51.83%, AR-Ped 49.31%, MS-CNN 49.13%, FasterRCNN+ATT 40.75%, Ours (SHDN) 28.77%.
(b) Reasonable (height ≥ 50 pixels): LDCF 24.80%, TA-CNN 20.86%, DeepParts 11.89%, CompACT-Deep 11.75%, FasterRCNN+ATT 10.33%, MS-CNN 9.95%, SA-FastRCNN 9.68%, RPN+BF 9.58%, Ours (SHDN) 7.41%, AR-Ped 6.45%.
(c) Far scale (30 ≥ height ≥ 20 pixels): ACF+SDt 100.00%, SA-FastRCNN 100.00%, CompACT-Deep 100.00%, DeepParts 100.00%, RPN+BF 100.00%, TA-CNN 100.00%, LDCF 100.00%, MS-CNN 97.23%, FasterRCNN+ATT 90.94%, Ours (SHDN) 70.69%.
(d) Overall (height ≥ 20 pixels): LDCF 71.25%, TA-CNN 71.22%, DeepParts 64.78%, RPN+BF 64.66%, CompACT-Deep 64.44%, SA-FastRCNN 62.59%, MS-CNN 60.95%, AR-Ped 58.83%, FasterRCNN+ATT 54.51%, Ours (SHDN) 40.39%.
Although our proposed method outperforms the available
state-of-the-art competitors, it remains difficult to reliably
identify small pedestrian instances under 30 pixels in height.
In Fig. 5(c), the log-average miss rate is reduced to 70.69%,
an improvement of 20.25% over FasterRCNN+ATT [44].
This mirrors human performance, which is also quite good
at large scales but degrades noticeably at medium and far
scales. Significantly, over the whole scale range, our approach
achieves a log-average miss rate of 40.39% for all pedestrian
instances taller than 20 pixels, better than the current
FasterRCNN+ATT [44] by 14.12%, as shown in Fig. 5(d).
The comparison results over different scale ranges of
pedestrian instances demonstrate that our proposed approach
substantially improves pedestrian detection performance.
Fig. 6 shows the detection results of our proposed scale-
aware hierarchical detection network on Caltech dataset. The
green dotted bounding boxes represent true positive windows,
for which the intersection over union (IoU) between the detected
window and the ground truth (green solid bounding boxes)
exceeds 50%; red dotted bounding boxes denote false positive
windows. As shown in Fig. 6, most of the pedestrian instances
over different scale ranges are detected by our proposed
approach. Moreover, because the network adaptively perceives
the augmented feature level with an appropriate resolution for
pedestrians at a specific scale, the medium-size and small-size
pedestrian instances are also detected by the proposed
scale-aware hierarchical detection network. Note that some red
dotted bounding boxes correspond to true pedestrians that are
simply not annotated in the ground truth, as shown in Fig. 6.
This experiment shows that jointly applying the cross-scale
features aggregation module and the scale-aware hierarchical
detection network outperforms the state-of-the-art algorithms,
especially for pedestrian instances in the medium and small
scale ranges.
2) Comparison with state-of-the-art methods on the ETH dataset
The ETH benchmark dataset consists of 3 testing video
sequences with a resolution of 640×480 and a frame
rate of 13 FPS. Studies [7], [37] report that state-of-the-art
algorithms achieve remarkable detection performance on the
ETH dataset, including ChnFtrs [21], MultiFtr+Motion [35],
JointDeep [37], pAUCBoost [40], ConvNet [41], DBN-Mut [12],
SpatialPooling [39], TA-CNN [24], and RPN+BF [7]. As most
approaches are trained on the INRIA training dataset [10],
our proposed method is also trained on the INRIA training
dataset. As shown in Fig. 7(a), the log-average miss rate of
our proposed approach reaches 44.75% for medium-scale
pedestrians, next to the state-of-the-art SpatialPooling [39] at
43.36%. With a similar trend at near scale, our approach
achieves a 20.49% log-average miss rate, second only to the
best available competitor RPN+BF [7], as shown in Fig. 7(b).
Significantly, in the reasonable setting (pedestrian instances
taller than 50 pixels in height), our approach achieves a 29.45%
log-average miss rate, improving by 0.78% over the
state-of-the-art RPN+BF [7], as shown in Fig. 7(c). Moreover,
in the more challenging setting with large scale variations
(above 20 pixels in height), the log-average miss rate of our
approach is reduced by 3.98% relative to RPN+BF [7] on the
ETH dataset, as shown in Fig. 7(d). These results demonstrate
that our proposed method delivers substantially better detection
performance for multiscale pedestrian instances exhibiting
large scale variations in natural scenes.
The pedestrian detection results of our proposed method on
the ETH dataset are shown in Fig. 8, where the green dotted
boxes denote the detections of our approach. Our proposed
approach adaptively perceives the augmented feature level
through the scale-aware hierarchical detection network to
generate the final detections for pedestrians at a specific scale.
Small-size pedestrian instances are also detected; the red
dotted bounding boxes in Fig. 8 mark true pedestrians that
are not annotated in the ground truth. One can observe that
our method successfully detects most of the pedestrian
instances, especially pedestrians with large scale variations.

FIGURE 6: Detection results of our approach on the Caltech
dataset.
V. CONCLUSION
This study describes an effective approach for detecting
pedestrian instances across different scale ranges. The proposed
cross-scale features aggregation module adaptively fuses
hierarchical features to enhance the feature pyramid representation
by merging the lateral connection, the top-down path, and the
bottom-up path. Moreover, by probing the differences among local
features with different receptive field sizes, the proposed
scale-aware hierarchical detection network effectively integrates
multiscale pedestrian detection into a unified framework
that adaptively perceives the augmented feature level for
specific-scale pedestrian detection. Experimentally,
compared with the state-of-the-art FasterRCNN+ATT [44],
the log-average miss rate of pedestrian detection is reduced
by 11.98% for medium-scale pedestrians (between 30 and 80
pixels in height) and by 14.12% for the whole scale range
(above 20 pixels in height) on the Caltech benchmark.

FIGURE 7: Quantitative comparisons on the ETH dataset (miss rate vs. false positives per image; legend gives log-average MR):
(a) Medium scale (80 ≥ height ≥ 30 pixels): MultiFtr+Motion 67.77%, ConvNet 65.71%, DBN-Mut 59.59%, JointDeep 58.78%, pAUCBoost 55.42%, ChnFtrs 53.55%, RPN+BF 53.38%, TA-CNN 48.69%, Ours (SHDN) 44.75%, SpatialPooling 43.36%.
(b) Near scale (height ≥ 80 pixels): ChnFtrs 48.34%, MultiFtr+Motion 45.44%, JointDeep 40.75%, pAUCBoost 39.72%, ConvNet 39.23%, DBN-Mut 34.73%, SpatialPooling 29.66%, TA-CNN 23.24%, Ours (SHDN) 20.49%, RPN+BF 17.63%.
(c) Reasonable (height ≥ 50 pixels): MultiFtr+Motion 59.99%, ChnFtrs 57.47%, ConvNet 50.27%, pAUCBoost 49.06%, JointDeep 45.32%, DBN-Mut 41.07%, SpatialPooling 37.37%, TA-CNN 34.98%, RPN+BF 30.23%, Ours (SHDN) 29.45%.
(d) Overall (height ≥ 20 pixels): MultiFtr+Motion 70.12%, ChnFtrs 61.86%, ConvNet 57.80%, JointDeep 54.32%, pAUCBoost 53.56%, DBN-Mut 51.28%, SpatialPooling 43.19%, TA-CNN 42.92%, RPN+BF 39.46%, Ours (SHDN) 35.48%.
REFERENCES
[1] Z. Chen, L. Zhang, A. M. Khattak, et al., “Deep feature fusion by
competitive attention for pedestrian detection,” IEEE Access, vol. 7, pp.
21981-21989, 2019.
[2] G. Brazil, X. Liu, “Pedestrian Detection With Autoregressive Network
Phases,” in CVPR, Long Beach, CA, USA, 2019, pp.7231-7240.
[3] Q. Zhao, T. Sheng, Y. Wang, et al., “M2Det: A Single-Shot Object Detector
Based on Multi-Level Feature Pyramid Network,” in AAAI, Honolulu,
Hawaii, USA, 2019.
[4] J. Li, X. Liang, S. Shen, et al., “Scale-aware Fast R-CNN for Pedestrian
Detection,” in CVPR, Honolulu, Hawaii, 2017, pp.985–996.
[5] T. Cai, Q. Fan, R. Feris, et al., “A Unified Multi-scale Deep Convolutional
Neural Network for Fast Object Detection,” in ECCV, Amsterdam, Nether-
lands, 2016, pp.354–370.
[6] R. Girshick, “Fast R-CNN,” in ICCV, Santiago, Chile, 2015, pp.1440–
1448.
[7] L. Zhang, L. Lin, X. Liang, et al., “Is Faster R-CNN Doing Well for
Pedestrian Detection?,” in ECCV, Springer, Cham, 2016, pp. 443–457.
[8] C. Fei, B. Liu, Z. Chen, et al., “Learning Pixel-Level and Instance-
Level Context-Aware Features for Pedestrian Detection in Crowds,” IEEE
Access, vol. 7, pp. 94944–94953, 2019.
[9] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-
time object detection with region proposal networks,” in NIPS, Montreal,
Canada, 2015, pp.1–9.
[10] N. Dalal, B. Triggs, “ Histograms of oriented gradients for human detec-
tion,” in CVPR, San Diego, California, USA, 2005.
[11] F. Yang, W. Choi, Y. Lin, “Exploit All the Layers: Fast and Accurate CNN
Object Detector with Scale Dependent Pooling and Cascaded Rejection
Classifiers,” in CVPR, Las Vegas, USA, 2016, pp.2129–2137.
[12] W. Ouyang, X. Zeng, X. Wang, “ Modeling Mutual Visibility Relationship
with a Deep Model in Pedestrian Detection,” in CVPR, Portland, Oregon,
2013, pp.3222–3229.
[13] Y. Tian, P. Luo, X. Wang, et al., “ Deep learning strong parts for pedestrian
detection,” in ICCV, Santiago, Chile, 2015, pp.1904–1912.
FIGURE 8: Detection results of our approach on ETH
dataset.
[14] D. Hoiem, Y. Chodpathumwan, Q. Dai, “Diagnosing error in object
detectors,” in ECCV, Florence, Italy, 2012, pp.340–353.
[15] J. Yan, X. Zhang, Z. Lei, et al., “Robust Multi-Resolution Pedestrian
Detection in Traffic Scenes,” in CVPR, Portland, Oregon, 2013, pp.3033–
3040.
[16] Z. Cai, M. Saberian, N. Vasconcelos, “Learning complexity-aware cas-
cades for deep pedestrian detection,” in ICCV, Santiago, Chile, 2015,
pp.3361–3369.
[17] S. Zhang, R. Benenson, B. Schiele, “Filtered channel features for pedes-
trian detection,” in CVPR, Boston, Massachusetts, 2015, pp.1751–1760.
[18] P. Dollar, S. Belongie, P. Perona, “The fastest pedestrian detector in the
west,” in BMVC, Aberystwyth, UK, 2010.
[19] P. Dollar, R. Appel, W. Kienzle, “Crosstalk cascades for frame-rate pedes-
trian detection,” in ECCV, Florence, Italy, 2012, pp.645–659.
[20] P. Dollar, C. Wojek, B. Schiele, et al., “Pedestrian Detection: A Bench-
mark,” in CVPR, Miami, Florida, USA, 2009, pp.304–311.
[21] P. Dollar, Z. Tu, P. Perona, et al., “Integral channel features,” in BMVC,
London, 2009.
[22] W. Nam, P. Dollar, J. Han, “Local decorrelation for improved pedestrian
detection,” in NIPS, Montreal, Canada, 2014, pp.424–432.
[23] R. Benenson, M. Omran, J. Hosang, et al., “Ten years of pedestrian
detection, what have we learned?,” in ECCV, Zurich, Switzerland, 2014,
pp.613–627.
[24] Y. Tian, P. Luo, X. Wang, et al., “Pedestrian detection aided by deep learn-
ing semantic tasks,” in CVPR, Boston, Massachusetts, 2015, pp.5079–
5087.
[25] J. Hosang, M. Omran, R. Benenson, et al., “Taking a deeper look at
pedestrians,” in CVPR, Boston, Massachusetts, 2015, pp.4073–4082.
[26] M. Zeiler, R. Fergus, “ Visualizing and understanding convolutional net-
works,” in ECCV, Zurich, Switzerland, 2014, pp.818–833.
[27] B. Yang, J. Yan, Z. Lei, et al., “ Convolutional channel features,” in ICCV,
Santiago, Chile, 2015, pp.82–90.
[28] P. Dollar, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for
object detection,” IEEE Trans. Pattern Analysis and Machine Intelligence,
vol. 36, no. 8, pp. 1532–1545, 2014.
[29] P. Zhou, B. Ni, C. Geng, et al., “ Scale-Transferrable Object Detection,” in
CVPR, Salt Lake City, UT, USA, 2018, pp.528–537.
[30] J. Dai, Y. Li, K. He, et al., “ R-FCN: Object Detection via Region-based
Fully Convolutional Networks,” in NIPS, Barcelona, Spain, 2016, pp.379–
387.
[31] W. Liu, D. Anguelov, D. Erhan, et al., “ SSD: Single Shot MultiBox
Detector,” in ECCV, Amsterdam, Netherlands, 2016, pp.21-37.
[32] S. Liu, L. Qi, H. Qin, et al., “ Path Aggregation Network for Instance
Segmentation,” in CVPR, Salt Lake City, UT, USA, 2018, pp.8759-8768.
[33] K. He, X. Zhang, S. Ren, et al., “ Deep Residual Learning for Image
Recognition,” in CVPR, Las Vegas, NV, USA, 2016, pp.770–778.
[34] D. Park, C. Zitnick, D. Ramanan, et al., “Exploring Weak Stabilization for
Motion Feature Extraction,” in CVPR , Portland, Oregon, 2013, pp.2882–
2889.
[35] S. Walk, N. Majer, K. Schindler, et al., “ New Features and Insights for
Pedestrian Detection,” in CVPR, San Francisco, CA, USA, 2010, pp.1030–
1037.
[36] G. Chen, Y. Ding, J. Xiao, et al., “ Detection Evolution with Multi-order
Contextual Co-occurrence,” in CVPR, Portland, Oregon, 2013,pp.1798–
1805.
[37] W. Ouyang, X. Wang, “ Joint Deep Learning for Pedestrian Detection,” in
ICCV, Sydney, Australia, 2013, pp.2056–2063.
[38] B. Wu and R. Nevatia, “ Detection and tracking of multiple, partially oc-
cluded humans by bayesian combination of edgelet based part detectors,”
International Journal of Computer Vision (IJCV), vol. 75, no. 2, pp. 247–
266, 2007.
[39] S. Paisitkriangkrai, C. Shen, A. van den Hengel, “ Strengthening the
Effectiveness of Pedestrian Detection,” in ECCV, Zurich, Switzerland,
2014, pp.546–561.
[40] S. Paisitkriangkrai, C. Shen, A. van den Hengel, “ Efficient pedestrian
detection by directly optimize the partial area under the ROC curve,” in
ICCV, Sydney, Australia, 2013, pp.1057–1064.
[41] P. Sermanet, K. Kavukcuoglu, S. Chintala, et al., “Pedestrian Detection
with Unsupervised Multi-Stage Feature Learning,” in CVPR, Portland,
Oregon, 2013, pp.3626–3633.
[42] B. Zhou, A. Khosla, A. Lapedriza, et al., “ Learning Deep Features for
Discriminative Localization,” in CVPR, Las Vegas, NV, USA, 2016, pp.
2921-2929.
[43] T. Lin, P. Dollar, R. Girshick, et al., “ Feature Pyramid Networks for Object
Detection,” in CVPR, Honolulu, USA, 2017, pp.2117–2125.
[44] S. Zhang, J. Yang, B. Schiele, et al., “ Occluded Pedestrian Detection
Through Guided Attention in CNNs,” in CVPR, Salt Lake City, UT, USA,
2018, pp.6995–7003.
[45] Z. Hao, Y. Liu, H. Qin, et al., “ Scale-Aware Face Detection,” in CVPR,
Salt Lake City, UT, USA, 2018, pp.6186–6195.
[46] S. Gao, M. Cheng, K. Zhao, et al., “ Res2Net: A New Multi-scale
Backbone Architecture,” arXiv preprint arXiv:1904.01169, 2019.
[47] S. Liu, L. Qi, H. Qin, et al., “ Path Aggregation Network for Instance
Segmentation,”in CVPR, Salt Lake City, UT, USA, 2018, pp.8759–8768
[48] C. Fu, W. Liu, A. Ranga, et al., “DSSD: Deconvolutional Single Shot
Detector,” arXiv preprint arXiv:1701.06659, 2017.
[49] F. Yu, D. Wang, E. Shelhamer, et al., “ Deep Layer Aggregation,” in CVPR,
Salt Lake City, UT, USA, 2018, pp.2403–2412
[50] X. Zhang, L. Cheng, B. Li, et al., “ Too Far to See? Not Really! —
Pedestrian Detection with Scale-aware Localization Policy,” IEEE Trans.
On Image Processing , vol. 27, no. 8, pp. 3703–3715, 2018.
[51] P. Felzenszwalb, R. Girshick, D. McAllester, et al., “ Object detection with
discriminatively trained part based models,” IEEE Trans. Pattern Analysis
and Machine Intelligence , vol. 32, no. 9, pp.1627–1645, 2010.
[52] J. Cao, Y. Pang, X. Li, “Pedestrian Detection Inspired by Appearance
Constancy and Shape Symmetry,” IEEE Trans. On Image Processing, vol.
25, no. 12, pp. 5538–5551, 2016.
[53] G. Ghiasi, T. Lin, Q. Le, “NAS-FPN: Learning Scalable Feature Pyramid
Architecture for Object Detection,” in CVPR, Long Beach, CA, USA,
2019, pp. 7036–7045.
[54] Y. Li, Y. Chen, N. Wang, Z. Zhang, “ Scale-Aware Trident Networks for
Object Detection,” in CVPR, Long Beach, CA, USA, 2019.
[55] P. Dollar, C. Wojek, B. Schiele, et al., “ Pedestrian detection: An evalu-
ation of the state of the art,” IEEE Trans. Pattern Analysis and Machine
Intelligence , vol. 34, no. 4, pp. 743–761, 2012.
[56] B. Yang, J. Yan, Z. Lei, S. Li, “ Convolutional Channel Features,” in ICCV,
Santiago, Chile, 2015.
[57] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick., “ Insideoutside net: De-
tecting objects in context with skip pooling and recurrent neural networks,”
in CVPR, Las Vegas, NV, USA, 2016, pp. 2874-2883.
[58] S. Choudhury, R. Padhy, A. Sangaiah, et al., “ Scale Aware Deep Pedes-
trian Detection,” Transactions on Emerging Telecommunications Tech-
nologies , vol. 30, no. 9, pp. 1–14, 2019.
[59] T. Liu, M. Elmikaty, T. Stathaki, et al., “ SAM-RCNN: Scale-Aware Multi-
Resolution Multi-Channel Pedestrian Detection,” in BMVC, Newcastle,
UK, 2018.
XIAOWEI ZHANG received the Ph.D. degree in
computer science from Beihang University, Bei-
jing, China, in 2018, the M.S. degree in com-
puter science from Shandong Normal University,
Jinan, China, in 2013, and the B.S. degree in
computer science from Shanxi Normal University,
Linfen, China, in 2009. He was a visiting stu-
dent at Bioinformatics Institute (BII), A*STAR,
Singapore from 2016 to 2017. Currently, he is
an assistant professor of Computer Science and
Engineering at Qingdao University. His current research interests include
image/video analysis and understanding, computer vision and machine
learning.
SHUAI CAO received the B.S. degree in computer
science and technology from Liaoning University
of Technology, Jinzhou, China, in 2018. He is
currently pursuing the M.S. degree in Computer
Science and Engineering with Qingdao University
of China. His current research interests include
pedestrian detection and machine learning.
CHENGLIZHAO CHEN received the Ph.D. de-
gree in computer science from Beihang University
in 2017. He is currently an assistant professor with
Qingdao University. His research interests include
computer vision, machine learning, and pattern
recognition.