Scale-aware Hierarchical Detection Network for Pedestrian Detection
XIAOWEI ZHANG, SHUAI CAO, AND CHENGLIZHAO CHEN
Shandong Key Laboratory of Intelligent Information Processing, School of Computer Science and Technology, Qingdao University, Qingdao 266071, China.
Corresponding authors: Xiaowei Zhang (e-mail: xiaowei19870119@sina.com) and Chenglizhao Chen (e-mail: cclz123@163.com).
This work was supported in part by the National Natural Science Foundation of China (Grant No.6190070308), and in part by the Natural
Science Foundation of Shandong Province of China (Grant No.ZR2019BF028).
ABSTRACT Spatial scale variation of several or even dozens of times is one of the major bottlenecks for pedestrian detection. Although the Region-based Convolutional Neural Network (R-CNN) family has shown promising results for object detection, it is still limited in detecting pedestrians with large scale variations due to the fixed receptive field sizes of a single convolutional output layer. In contrast to previous methods that simply combine pedestrian predictions on feature maps with different resolutions, we propose a scale-aware hierarchical detection network for pedestrian detection under large scale variations. First, we introduce a cross-scale features aggregation module to accomplish feature augmentation for pedestrian representation by merging the lateral connection, the top-down path, and the bottom-up path. Specifically, the cross-scale features aggregation module adaptively fuses hierarchical features to enhance the feature pyramid representation with robust semantics and accurate localization. Further, we design a scale-aware hierarchical detection network that effectively integrates multiscale pedestrian detection into a unified framework by adaptively perceiving the augmented feature level suited to each pedestrian scale. Experimentally, the proposed scale-aware hierarchical detection network forms a more robust and discriminative model for pedestrian instances with different scales on the widely used ETH and Caltech benchmarks. In particular, compared with the state-of-the-art method FasterRCNN+ATT [44], the log-average miss rate of pedestrian detection is reduced by 11.98% for medium-scale pedestrians (30-80 pixels in height) and 14.12% for whole-scale pedestrians (above 20 pixels in height) on the Caltech benchmark.
INDEX TERMS Scale Variation, Feature Aggregation, Scale-aware Weighting, Hierarchical Detection.
I. INTRODUCTION
PEDESTRIAN detection stands out from traditional object detection tasks in view of its broad application prospects in computer vision, such as video surveillance, autonomous driving, and robotics. Although significant improvements have been made in pedestrian detection [4], [8], [24], [27], [41] over the years, most existing efforts work well only for large-scale pedestrian instances [17]-[19], [23], [34], [36], [52]. Compared with pedestrian detection at large scales, much less attention has been paid to medium- and small-scale instances, as similarly observed in the literature [14], [55].
For autonomous driving systems, detecting medium- and small-size pedestrians is an important topic because it leaves sufficient time to alert the driver. Assuming a vehicle traveling at an urban speed of 15 m/s and a pedestrian 1.8 m tall, a person 80 pixels in height is just 1.5 s away, while a person of 30 pixels is 4 s away. Take the recent AR-Ped [2] as an example: it has been reported that the detector achieves a 6.45% log-average miss rate for pedestrians taller than 50 pixels on the Caltech Pedestrian Benchmark [20], yet the same error rate increases to 49.31% MR for pedestrians of 30-80 pixels in height. Fig. 1(a) shows several failure cases of the state-of-the-art method AR-Ped [2] under large scale variations on the Caltech benchmark. As Fig. 1(b) illustrates the scale distribution of pedestrian heights on the Caltech dataset, we group pedestrians by their image size (height in pixels) into three scales following [55]: near (80 or more pixels), medium (between 30-80 pixels), and far (between 20-30 pixels). Note that about 81.67% of the pedestrians lie in the medium scale on the Caltech dataset.
FIGURE 1: Visual examples of pedestrians at multiple scales. (a) shows exemplars of pedestrian detection using the state-of-the-art method AR-Ped [2]. (b) shows the scale distribution of pedestrian heights on the Caltech dataset; one can observe that medium-size instances dominate the distribution. (c) shows pedestrian instances at different scales (from 10×20 up to 256×512 pixels) on the Caltech dataset.
The degraded performance for pedestrian detection under large scale variations may be attributed to the following inherent challenges. First, small-size pedestrian instances often convey a smaller amount of information while having a greater proportion of noise, with obscure appearance and blurred boundaries; it is in general difficult to distinguish them from background clutter. Second, the visual semantic concepts of an object can emerge at different spatial scales depending on the size of the target object. For a pedestrian instance of interest, visual features are effective only at a proper scale where the optimal response is obtained. This difference is more pronounced in complex scenes containing pedestrian instances of diverse scales.
To address the issue of pedestrian detection under large scale appearance variations, Faster R-CNN [9] exploits a multiscale region proposal network (RPN), which achieves excellent object detection performance. However, its multi-scale detection is generated by sliding a fixed set of filters over a fixed set of convolutional feature maps. This results in an inconsistency between the sizes of objects and filter receptive fields: the scales of objects are variable, yet the sizes of filter receptive fields are fixed. Instead of using a fixed set of receptive fields, most related works [1], [7], [15], [44], [50], [58], [59] that aim to detect multi-scale pedestrians redeploy the receptive fields of convolutions according to object sizes at multiple output layers. However, in our view, these methods, which either simply select multiple output layers based on the sizes of receptive fields [11], [4], [5] or use feature fusion to expand the receptive field on a single output layer [44], [57], lack a way to enhance the entire feature hierarchy for multiscale pedestrian detection. This motivates us to construct an aggregated feature representation that enhances semantic information and localization signals for scale-aware pedestrian detection.
Motivated by the above insight and analysis of the hierarchical feature pyramid representation, we propose a scale-aware hierarchical detection network for pedestrian detection under large scale variations. First, we accomplish feature aggregation based on FPN [43] to enhance the semantic information and localization signals of the feature representation, by merging the lateral connection, the top-down path, and the bottom-up path. Furthermore, in view of the feature differences among pedestrians at different scales, the scale-aware hierarchical detection network is designed to adaptively perceive pedestrian instances within certain scale ranges, by probing the feature differences across scales in the augmented pyramid features.
To sum up, our work makes the following contributions:
1) We introduce a cross-scale features aggregation module to enhance the feature pyramid representation by fusing robust semantics and accurate localization signals for pedestrians at different scales, which accomplishes feature augmentation from the lateral connection, the top-down path, and the bottom-up path.
2) A novel scale perception strategy based on a normalized Gaussian gate function is designed to integrate multiple detection heads into a unified framework, adaptively drawing on the cross-scale features aggregation module to form the scale-aware hierarchical detection network.
3) Experimentally, compared with the state-of-the-art method FasterRCNN+ATT [44], the log-average miss rate of pedestrian detection is reduced by 11.98% for medium-scale pedestrians (30-80 pixels in height) and 14.12% for whole-scale pedestrians (above 20 pixels in height) on the Caltech benchmark.
II. RELATED WORK
Pedestrian detection has been an active research area with a vast literature. Before the emergence of CNNs, hand-crafted features were widely used to obtain good performance for pedestrian detection, including HOG [10], Edgelets [38], ICF [21] and its variants ACF [28], LDCF [17], [22], and SCF [23]. The most popular pedestrian detector is the deformable part model (DPM) [51], which combines a rigid root filter and deformable part filters based on a HOG feature pyramid and latent SVM classifiers for detection.
Deep ConvNets, owing to their stronger feature representation ability, exhibit obvious performance gains on pedestrian detection [2], [13], [18], [28], [29]. CCF [56] absorbs merits from filtered channel features and Convolutional Neural Networks (CNNs), and transfers low-level features from pre-trained CNN models to feed a boosting forest model for pedestrian detection. ConvNet [41] uses an unsupervised method based on convolutional sparse coding to pre-train a CNN for pedestrian detection. DeepParts [13] consists of extensive part detectors, each of which is a strong detector that can detect a pedestrian by observing only a part of a proposal. SDP [11] investigates scale-dependent pooling and layer-wise cascaded rejection classifiers on CNN features to detect objects. CompACT-Deep [16] leverages both hand-crafted and CNN features to form complexity-aware cascaded detectors for an optimal trade-off between accuracy and speed. In particular, Faster R-CNN [9] introduced a multiscale region proposal network that shares full-image convolutional features with the detection network, leading to excellent performance for pedestrian detection.
However, spatial scale variation is one of the main challenges for pedestrian detection due to the large variance of instance scales across scenarios. To address the issue, upsampling or dilated operations [5], [11] are employed to alleviate the limitation of the fixed set of filter receptive fields in Faster R-CNN [9]. MS-CNN [5] combines multiple output layers with feature upsampling by deconvolution to produce a strong multi-scale object detector. SA-FastRCNN [4] exploits multiple built-in subnetworks in a divide-and-conquer strategy to adaptively detect pedestrians across scales. RPN+BF [7] reuses the high-resolution convolutional features of the RPN with cascaded boosted forests for multiscale pedestrian detection. ADM [50] executes sequences of coordinate transformations on multi-layer feature maps to deliver accurate pedestrian locations. TridentNet [54] constructs a parallel multi-branch architecture to expand the receptive fields through dilated convolution for detecting objects of different scales. However, these methods do not effectively fuse the robust semantic information of targets in the high-level convolutional layers with the precise localization signals of the lower convolutional layers for multiscale pedestrian detection.
To exploit strong semantics for prediction, FPN [43] augments a top-down pathway and lateral connections to propagate high-level semantic information for reasonable classification capability. DSSD [48] adopts deconvolution layers to aggregate context and high-level semantics for enhancing shallow features. M2Det [3] presents a multi-level feature pyramid network to fuse multiscale features for detecting objects of different scales. On the other hand, the fine details and higher resolution of low-level feature maps benefit localization accuracy. PANet [47] builds a strong indicator to accurately localize instance segmentation via a pathway with clean lateral connections from the lowest level to the top ones. DLA [49] augments standard architectures with deeper aggregation across layers to obtain stronger layer-wise multi-scale representation capability. STDN [29] is equipped with embedded super-resolution scale-transfer layers to explore the inter-scale consistency across multiple detection scales. Recently, NAS-FPN [53] consists of a series of merging cells that fuse features across scales through a combination of top-down and bottom-up connections. Res2Net [46] constructs hierarchical residual-like connections within a single residual block to capture multi-scale features at a granular level.
Inspired by these observations and the analysis of feature fusion for multiscale detection, in this paper we explore a scale-aware hierarchical detection network for multi-scale pedestrian detection, aggregating the strong semantic information from high-level features and the accurate localization signals from low-level layers to enhance the pyramidal feature representations.
III. APPROACH OVERVIEW
A high-level overview of our architecture is shown in Fig. 2. Our proposed approach consists of two main components: a cross-scale features aggregation module and a scale-aware hierarchical detection network. The cross-scale features aggregation module is built on the Feature Pyramid Network (FPN) [43] to enhance the representation ability of pyramid features. FPN shows significant improvement as a generic feature extractor for object recognition, propagating semantically strong features along the top-down path to enhance pyramid features with reasonable classification capability. At the same time, many fine details and strong responses of local patterns exist in the low-level convolutional layers, which benefit localization accuracy. For this reason, we design a cross-scale features aggregation module to adaptively aggregate features across the pyramid hierarchy and enhance the localization capability.
Further, the scale-aware hierarchical detection network, based on the Fast R-CNN framework [6], combines complementary detection branches on the hierarchical pyramid feature maps from the cross-scale features aggregation module.
[Figure 2 diagram: a ResNet backbone (C1-C5) feeds the cross-scale feature aggregation module, which outputs the augmented pyramid {H3, H4, H5}; region proposals at near, medium, and far scales are routed to three detection heads (RoI pooling, 1024-d FC layers, a 2-d classification score, and an 8-d bounding-box prediction), whose outputs are combined into the scale-aware classification score and box prediction.]
FIGURE 2: The architecture of our proposed scale-aware hierarchical detection network. Our approach uses the cross-scale features aggregation module to enhance semantic robustness and localization accuracy, and the scale-aware hierarchical detection network to adaptively detect pedestrians of specific scales from the augmented feature levels.
The detection heads in the hierarchical detection network, built on ResNet [33] pretrained on ImageNet, share parameters across all proposals and learn scale-aware hierarchical weights by minimizing the error rate for pedestrians of different scales, regardless of their feature levels.
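As a concrete illustration, the following is a minimal PyTorch-style sketch of such a shared detection head (class and variable names are ours, not from any released code; the 7×7 RoI size, 1024-d FC layers, and 2-d/8-d outputs follow Fig. 2 and Section III-B):

```python
import torch.nn as nn

class SharedDetectionHead(nn.Module):
    """Sketch of a detection head whose weights are shared across H3-H5.

    Assumes 1024-channel RoI features pooled to 7x7, as in the paper;
    the two FC layers and the 2-d/8-d outputs follow Fig. 2.
    """

    def __init__(self, in_dim=1024 * 7 * 7, hidden=1024):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
        )
        self.cls_score = nn.Linear(hidden, 2)  # pedestrian vs. background
        self.bbox_pred = nn.Linear(hidden, 8)  # 4 box coordinates per class

    def forward(self, roi_feats):  # roi_feats: (N, 1024, 7, 7)
        x = self.fc(roi_feats)
        return self.cls_score(x), self.bbox_pred(x)
```

The same module instance can be applied to RoI features pooled from H3, H4, or H5, so the head's parameters are learned jointly over all scale ranges.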
A. CROSS-SCALE FEATURES AGGREGATION MODULE
The Feature Pyramid Network (FPN) [43] shows significant improvement as a generic feature extractor for object recognition, propagating semantically strong features to enhance pyramid features with reasonable classification capability. Following previous evidence on the benefits of feature approximation [28], we denote the outputs of the last residual blocks of conv1 to conv5 in ResNet as {C1, C2, C3, C4, C5}, and the multi-scale pyramid features from FPN [43] as {P1, P2, P3, P4, P5}, where P_i represents the feature at pyramid level i. However, the feature fusion in FPN builds only on the lateral connection and the top-down pathway, ignoring the bottom-up path augmentation that would bring in the accurate localization signals of the low-level convolutional layers.
Our goal is to find a transformation function f that can effectively aggregate multi-scale features and output a list of new features: X_out = f(X_in), where X_in may be C_i, P_i, or their union. Different from the feature augmentation of FPN, we propose a cross-scale features aggregation module (CFAM) that merges a bottom-up pathway into FPN. Specifically, we use {H1, H2, H3, H4, H5} to denote the augmented feature pyramid, in which the spatial resolution of the feature maps is upsampled by a factor of 2 from H_i to H_{i-1}. As shown in Fig. 3(b), each feature aggregation module takes a convolutional feature map C_{i-1} with higher resolution, an identity-mapped feature map C_i, and a coarser feature map H_{i+1} with stronger semantics to generate the augmented feature map H_i. Note that we adopt average pooling to downsample the spatially finer feature maps, which directly propagates strong responses of local patterns from the lower pyramid levels along the bottom-up augmented pathway for accurate localization.
The key idea of CFAM is to adaptively aggregate multi-scale context information from the feature maps of convolutional layers at adjacent scales to generate more discriminative features. As shown in Fig. 3(b), each aggregation module merges a top-down path, a lateral connection, and a bottom-up augmented path by addition. This process is iterated to build the augmented feature pyramid up to the finest resolution map H3. At the beginning of the iteration, we apply a 1×1 convolutional layer on C5 to produce the coarsest but semantically strongest map H5. The lower-level feature map C_{i-1} then goes through a 2×2 average pooling layer with stride 2 to reduce its spatial size, yielding the down-sampled feature map of the bottom-up augmented pathway. Each element of the coarser feature map H_{i+1}, the down-sampled feature map, and the identity-mapped feature map C_i are added to generate the fused feature map.
(a) The feature aggregation block in FPN [43] merges the lateral connection and the top-down pathway by addition.
(b) Our cross-scale features aggregation module augments features from the lateral connection, the top-down pathway, and the bottom-up pathway.
FIGURE 3: Illustrations of feature aggregation module design.
Finally, we append a 1×1 convolution on each merged map to generate the final augmented feature map H_i for the following sub-networks, which reduces the aliasing effect of upsampling and downsampling. In the feature aggregation module, the augmented feature maps correspond to {C3, C4, C5} with the same spatial sizes, and we set 1024-channel outputs for each level of the augmented feature pyramid {H3, H4, H5} fed to the scale-aware hierarchical detection network.
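To make the aggregation rule concrete, the following is a minimal PyTorch-style sketch of CFAM under our reading of Fig. 3(b); the nearest-neighbor upsampling and the reuse of one lateral projection per stage are our simplifying assumptions, while the 1×1 projections, the 2×2 stride-2 average pooling, and the 1024-channel outputs follow the text:

```python
import torch.nn as nn
import torch.nn.functional as F

class CFAM(nn.Module):
    """Sketch of the cross-scale features aggregation module.

    Builds H5 from C5, then H4 = conv1x1(up(H5) + lat(C4) + pool(lat(C3)))
    and H3 analogously, merging top-down, lateral, and bottom-up paths.
    Channel sizes assume a ResNet-50 backbone (C2..C5 = 256..2048).
    """

    def __init__(self, out_ch=1024):
        super().__init__()
        self.lat2 = nn.Conv2d(256, out_ch, 1)   # lateral 1x1 projections
        self.lat3 = nn.Conv2d(512, out_ch, 1)
        self.lat4 = nn.Conv2d(1024, out_ch, 1)
        self.lat5 = nn.Conv2d(2048, out_ch, 1)
        self.post3 = nn.Conv2d(out_ch, out_ch, 1)  # reduce aliasing after fusion
        self.post4 = nn.Conv2d(out_ch, out_ch, 1)

    def forward(self, c2, c3, c4, c5):
        h5 = self.lat5(c5)  # coarsest, semantically strongest map
        # H4: top-down (upsampled H5) + lateral C4 + bottom-up (pooled C3).
        h4 = self.post4(F.interpolate(h5, scale_factor=2, mode="nearest")
                        + self.lat4(c4)
                        + F.avg_pool2d(self.lat3(c3), 2, stride=2))
        # H3: top-down (upsampled H4) + lateral C3 + bottom-up (pooled C2).
        h3 = self.post3(F.interpolate(h4, scale_factor=2, mode="nearest")
                        + self.lat3(c3)
                        + F.avg_pool2d(self.lat2(c2), 2, stride=2))
        return h3, h4, h5
```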
B. SCALE-AWARE HIERARCHICAL DETECTION NETWORK
Covering multiple scale ranges is a critical problem for pedestrian detection. Different from the multi-scale mechanism of the RPN [9], we divide the region proposals generated from the higher convolutional layer C4 into three scales (near, medium, and far), and each scale is routed to an augmented feature pyramid level H_i to detect pedestrian instances within a certain scale range, as shown in Fig. 4. We hypothesize that pedestrian instances at different scales can be better modeled by a hierarchical detection network whose filter receptive fields match the corresponding valid ranges. Specifically, each pedestrian anchor scale needs to effectively match the receptive field size of RoI pooling through a different spatial pooling structure.
Let L^m(X_i, Y_i | W) denote the multi-task loss function for each pedestrian proposal at a specific feature level H_m, given by:

L^m(X_i, Y_i | W) = L^m_{cls}(p_i, \hat{p}_i) + \lambda \hat{p}_i L^m_{loc}(b_i, \hat{b}_i),   (1)

where \hat{p}_i is 1 if the anchor is labeled positive and 0 otherwise, and p_i is the predicted probability of the anchor being a proposal. \hat{b}_i = (\hat{b}^x_i, \hat{b}^y_i, \hat{b}^w_i, \hat{b}^h_i) represents the ground-truth box associated with a positive anchor, and b_i = (b^x_i, b^y_i, b^w_i, b^h_i) represents the parameterized coordinates of the predicted bounding box. The classification loss L^m_{cls} is the softmax loss over two classes (pedestrian vs. not) at the specific feature level H_m. For the regression loss, we use L^m_{loc} = R(b_i - \hat{b}_i), where R is the robust loss function (smooth-L1) defined in [6]. The term \hat{p}_i L^m_{loc} means the regression loss is activated only for positive anchors (\hat{p}_i = 1) and disabled otherwise (\hat{p}_i = 0).
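For illustration, a minimal PyTorch-style sketch of this per-level loss follows (function and argument names are ours; the 8-d class-specific box output of Fig. 2 is simplified here to a single 4-d box per proposal):

```python
import torch.nn.functional as F

def level_loss(cls_logits, box_pred, labels, box_targets, lam=1.0):
    """Sketch of the multi-task loss of Eq. (1) for one feature level H_m.

    cls_logits: (N, 2) pedestrian-vs-background logits (softmax loss).
    box_pred, box_targets: (N, 4) parameterized box coordinates.
    labels: (N,) int64 tensor, 1 for positive anchors and 0 for negatives.
    """
    l_cls = F.cross_entropy(cls_logits, labels)
    pos = labels == 1
    # Smooth-L1 regression, activated only for positive anchors (p_hat = 1).
    l_loc = (F.smooth_l1_loss(box_pred[pos], box_targets[pos])
             if pos.any() else box_pred.new_zeros(()))
    return l_cls + lam * l_loc
```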
To adaptively match the valid feature level and anchor scale for multiscale pedestrian detection, SDP [11] adopts a hard isolation strategy based on the pixel height of an object proposal to detect multiscale objects. SA-FastRCNN [4] exploits a soft isolation strategy using a Sigmoid gate function defined over the object proposal sizes to generate scale-aware weights for the multi-scale detection subnetworks. In this paper, we design a novel scale perception strategy based on a normalized Gaussian gate function for the scale-aware hierarchical detection network (SHDN), as shown in Fig. 4, and the model loss function is defined as:

L(W) = \sum_{m=1}^{M} \sum_{i \in U} \omega_m L^m(X_i, Y_i | W),   (2)

where M is the number of hierarchical feature pyramid levels as mentioned in Section III-A, U = {(X_i, Y_i)}_{i=1}^{N} contains the multi-scale training examples of pedestrian instances, and \omega_m is the normalized scale-aware weight of the corresponding hierarchical loss L^m(X_i, Y_i | W), initialized by

\omega_m = e^{\hat{\omega}_m} \big/ \sum_{i=1}^{M} e^{\hat{\omega}_i}, \qquad \hat{\omega}_m = e^{-(s - \bar{s}_m)^2 / (2 \gamma_m^2)}.

Here s = \log_2(h) denotes the height scale of a pedestrian, which has already been normalized to a narrow range prior to detection, and \bar{s}_m and \gamma_m are the average height scale and the scaling coefficient of the specific feature level H_m, respectively. Given a sliding window, a Gaussian function with lower \gamma_m tends to enlarge the gap between the weights of pedestrian instances from different scale ranges. Based on the ResNet structure, the output size of RoI pooling is 7×7 with a stride chosen from {8, 16, 32} for the deep layers {C3, C4, C5}, so the valid receptive fields of the hierarchical feature pyramid {H3, H4, H5} are {56, 112, 224} pixels in terms of bounding-box height, respectively. Consequently, we assign the scale-aware parameters (\bar{s}_m, \gamma_m) as {(5.8, 1.25), (6.8, 2), (7.8, 1.25)} for the hierarchical feature pyramid {H3, H4, H5}, respectively. Note that we optimize the multi-task loss function shunted to the scale-aware hierarchical detection modules by the scale-aware weight parameters (\bar{s}_m, \gamma_m), and all the parameters after the RoI pooling layers are shared across all levels of the hierarchical feature pyramid.
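As a quick numerical illustration, the sketch below evaluates the normalized Gaussian gate for one proposal in plain Python; the (\bar{s}_m, \gamma_m) values are those given above, and note that \log_2 of the receptive heights {56, 112, 224} is approximately {5.8, 6.8, 7.8}:

```python
import math

# (s_bar_m, gamma_m) for {H3, H4, H5}, as assigned in the paper.
SCALE_PARAMS = [(5.8, 1.25), (6.8, 2.0), (7.8, 1.25)]

def scale_aware_weights(box_height):
    """Normalized Gaussian gate: returns (w3, w4, w5) for one proposal."""
    s = math.log2(box_height)
    # Inner Gaussian responses: w_hat_m = exp(-(s - s_bar_m)^2 / (2 gamma_m^2)).
    raw = [math.exp(-(s - s_m) ** 2 / (2.0 * g_m ** 2)) for s_m, g_m in SCALE_PARAMS]
    # Outer normalization: w_m = exp(w_hat_m) / sum_i exp(w_hat_i).
    total = sum(math.exp(r) for r in raw)
    return [math.exp(r) / total for r in raw]

# A 60-pixel-high proposal (s ~ 5.9) weights the H3 branch most heavily.
print(scale_aware_weights(60))
```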
[Figure 4 diagram: the multiscale pedestrian proposals are divided into far, medium, and near scales and routed to the detection heads on H3, H4, and H5, respectively; each head applies RoI pooling and 1024-d FC layers to output a 2-d classification score and an 8-d bounding-box prediction, combined into the scale-aware outputs.]
FIGURE 4: Our proposed scale compensation strategy from the multipath RPN to initial proposals. It uses hierarchical features of the deep convolutional layers to obtain a series of reasonable anchor scales for pedestrian proposals, and each scale focuses on pedestrian instances within a certain scale range in an image.
To efficiently train the scale-aware hierarchical detection network, sampling is used to compensate for the imbalance between the distributions of positive samples U^m_+ and negative samples U^m_-. In this paper, we adopt random sampling and bootstrapped sampling to collect the final set of negative samples, such that |U^m_-| = \zeta |U^m_+|. We use random sampling to select easy negative samples according to a uniform distribution. Because hard negative mining has a large influence on detection accuracy, bootstrapped sampling is exploited to improve detection performance by ranking the negative samples according to their objectness scores. On the other hand, to avoid a heavy asymmetry between the positive samples U^m_+ and negative samples U^m_- at each specific detection layer, the cross-entropy terms of positives and negatives are weighted as in Eq. (3), which guarantees that each detection layer has enough positive samples to cover a certain range of scales.

L_{cls} = \frac{1}{1+\zeta} \frac{1}{|U^m_+|} \sum_{i \in U^m_+} \log p_{\hat{p}_i}(X_i) + \frac{\zeta}{1+\zeta} \frac{1}{|U^m_-|} \sum_{i \in U^m_-} \log p_0(X_i)   (3)
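A minimal PyTorch-style sketch of this two-part negative collection follows (function names and the hard/easy split ratio are our assumptions; \zeta = 3 matches the 32-positive/96-negative mini-batch split reported in Section IV-A):

```python
import torch

def collect_negatives(neg_scores, num_pos, zeta=3, hard_frac=0.5):
    """Collect |U_-| = zeta * |U_+| negatives via bootstrapping + random sampling.

    neg_scores: (K,) objectness scores of all candidate negative proposals.
    """
    num_neg = zeta * num_pos
    num_hard = int(hard_frac * num_neg)
    # Bootstrapped sampling: rank negatives by objectness, keep the hardest.
    hard_idx = torch.argsort(neg_scores, descending=True)[:num_hard]
    # Random sampling: draw the remaining easy negatives uniformly.
    mask = torch.ones_like(neg_scores, dtype=torch.bool)
    mask[hard_idx] = False
    easy_pool = torch.nonzero(mask).squeeze(1)
    perm = torch.randperm(easy_pool.numel())[: num_neg - num_hard]
    return torch.cat([hard_idx, easy_pool[perm]])
```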
IV. EXPERIMENTS
A. EXPERIMENT DETAILS
Starting from ResNet [33] pretrained on ImageNet, we fine-tune the convolutional neural network to extract visual features from the observed video frames of the Caltech training dataset. The convolutional and max pooling layers of the ResNet network serve as the shared convolutional layers before the Region-of-Interest (RoI) pooling layer, producing feature maps from the entire input image. The last convolutional block in ResNet is 2048-d, and we employ a randomly initialized 1024-d 1×1 convolutional layer to reduce the dimension. We use single-scale training, in which the input image is resized so that its shorter side is 600 pixels. The scale-aware feature aggregation network is trained with Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005. As [9], [30] demonstrate that mining from a larger set of candidates (e.g., 2000) has no benefit, we use 300 RoIs for both training and testing in this paper. We fine-tune the scale-aware hierarchical detection network with a learning rate of 0.001 for 20k mini-batches. Each mini-batch consists of 128 randomly sampled object proposals from one randomly selected image, of which 32 are positive object proposals and the remaining 96 are negative. A positive pedestrian label is assigned when the IoU between the object proposal and a ground-truth box is at least 0.5, and a negative label is assigned to RoIs whose IoU is at most 0.3 for all ground-truth boxes. The whole scale-aware hierarchical detection network is trained on a single NVIDIA GeForce GTX TITAN X GPU with 12 GB of memory.
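The labeling rule above can be summarized by the following sketch (plain Python; the paper does not specify how proposals falling between the two IoU thresholds are handled, so leaving them unused is our assumption):

```python
def assign_roi_label(ious_with_gt):
    """Label one proposal given its IoU with every ground-truth box.

    Returns 1 (pedestrian), 0 (background), or None (unused here).
    """
    best = max(ious_with_gt, default=0.0)
    if best >= 0.5:   # positive: IoU >= 0.5 with some ground-truth box
        return 1
    if best <= 0.3:   # negative: IoU <= 0.3 with all ground-truth boxes
        return 0
    return None       # between thresholds: not used in this sketch
```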
B. ABLATION EXPERIMENTS
1) Evaluating the cross-scale features aggregation module
As mentioned in [7], the Region Proposal Network (RPN) in Faster R-CNN performs well as a stand-alone detector, but the downstream classifier degrades pedestrian detection performance. In this subsection, we investigate the cross-scale features aggregation module in terms of detection quality, evaluated by the log-average miss rate of pedestrian detection under IoU = 0.5 on the Caltech dataset.
First of all, we evaluate the high-level convolutional layers of ResNet-50 [33] (C3 to C5) for extracting RoI features to detect pedestrians, using a set of anchor scales from the RPN.
TABLE 1: EVALUATIONS OF PEDESTRIAN DETECTION AT DIFFERENT FEATURE PYRAMID LEVELS BY LOG-AVERAGE MISS RATE (MR) UNDER IOU=0.5 ON THE CALTECH DATASET

Detection Network | Proposals | RoI features | lateral? | top-down? | bottom-up? | MR_all | MR_f | MR_m | MR_n
(a) Baseline on Conv. | C4 | C3 | × | × | × | 95.79% | 100% | 94.25% | 79.31%
(b) Baseline on Conv. | C4 | C4 | × | × | × | 63.45% | 95.76% | 47.92% | 4.83%
(c) Baseline on Conv. | C4 | C5 | × | × | × | 81.86% | – | 78.42% | 4.29%
(d) Baseline on FPN | C4 | P3 | ✓ | ✓ | × | 52.52% | 76.59% | 37.76% | 14.18%
(e) Baseline on FPN | C4 | P4 | ✓ | ✓ | × | 46.78% | 90.38% | 38.65% | 2.74%
(f) Baseline on FPN | C4 | P5 | ✓ | ✓ | × | 78.67% | – | 75.96% | 3.36%
(g) Based on our CFAM | C4 | H3 | ✓ | ✓ | ✓ | 46.52% | 72.83% | 36.50% | 16.35%
(h) Based on our CFAM | C4 | H4 | ✓ | ✓ | ✓ | 43.69% | 86.50% | 38.08% | 2.12%
(i) Based on our CFAM | C4 | H5 | ✓ | ✓ | ✓ | 84.84% | – | 85.82% | 2.51%
As shown in Table 1(a)-(c), the higher convolutional layers (e.g., C4, C5) obviously perform better than the lower-level convolutional layer (e.g., C3) for pedestrian instances at the near scale. This can be attributed to higher-level convolutional features carrying more robust semantic information than lower levels.
Further, compared with adopting a single high-level convolutional layer (e.g., C3, C4, or C5) to detect pedestrians, FPN (e.g., P3, P4, or P5) fuses the semantically strong features from the higher convolutional layers to enhance the pyramid features for classification. In particular, P3 obtains 76.59% MR for pedestrian detection at the far scale and decreases the MR for medium-scale pedestrian instances by 10.16% relative to C4, as shown in Table 1(d). However, FPN builds only on the lateral connection and the top-down pathway for feature fusion, ignoring the bottom-up pathway that benefits accurate localization. In contrast to the improved performance of P3 at the far and medium scales, P4 degrades pedestrian detection performance, as shown in Table 1(e), which may be due to the lack of the accurate localization signals present in the lower convolutional layers. Therefore, we propose a cross-scale features aggregation module (CFAM) that fuses semantic information and localization signals by adding a bottom-up augmented pathway to FPN. As shown in Table 1(g), H3 achieves the best pedestrian detection performance at the far and medium scales, down to 72.83% MR and 36.50% MR, respectively. Note that H4 achieves 43.69% MR for pedestrians over all scales.
2) The role of the scale-aware hierarchical detection network
In this subsection, the contribution of the proposed scale-aware hierarchical detection network is evaluated by the log-average miss rate under IoU = 0.5 on the Caltech testing dataset. We conduct comparison experiments to verify the effectiveness of the proposed method with a single output layer versus multiple output layers for the detection heads. As shown in Table 2(a)(b)(c), we compare the single output layers H3, H4, and H5 from the proposed cross-scale features aggregation module as detection heads for pedestrian detection at different scales. We find that H3 performs better than the other single output layers in log-average miss rate for pedestrian detection at the far and medium scales. For the near scale, H4 achieves the best detection performance among single output layers, down to 2.12% MR, an improvement of 14.23% over the competitor H3.
However, detecting pedestrians from only a single output layer cannot effectively cover multiscale pedestrians under large scale variations, due to the lack of scale complementarity among multiple feature layers with different filter receptive field sizes. To effectively combine multiple output layers of the feature pyramid for pedestrian detection, we adopt the scale-aware parameters (\bar{s}_m, \gamma_m) to initialize the learned hierarchical weights \omega_m for optimizing the multi-task loss function in Eq. (2). Specifically, we assign the scale-aware parameters (\bar{s}_m, \gamma_m) as {(5.8, 1.25), (6.8, 2), (7.8, 1.25)} for the hierarchical feature pyramid {H3, H4, H5}, respectively. In Table 2(d), combining layers {H3, H4} attains 42.58% MR over all scales on the Caltech benchmark, improving by 1.11% over the single output layer H4, and achieves the best detection performance of 65.86% MR for far-scale pedestrian instances. Notably, combining layers {H4, H5} does not improve pedestrian detection performance at the medium and far scales, but achieves a better detection performance of 1.25% MR at the near scale, as shown in Table 2(e). The reason may be that in our proposed scale-aware hierarchical detection network, each detection branch learns a proper pyramid feature layer and focuses on pedestrian instances within a certain scale range. Moreover, by combining layers {H3, H4, H5}, the log-average miss rate is reduced to 40.39% for all-scale pedestrian detection, 28.77% for the medium scale, and 1.08% for the near scale, as shown in Table 2(f). Note that combining {H3, H4, H5} achieves the best performance compared with {C3, C4, C5}, {P3, P4, P5}, and {P2, P3, P4, P5}, as shown in Table 2(g)-(i). The experiments demonstrate that the proposed scale-aware hierarchical detection network is more flexible and is able to take advantage of the different filter receptive field sizes of multiple pyramid feature levels to handle the large variance in pedestrian instance scales.
C. COMPARISON WITH STATE-OF-THE-ART METHODS
In this section, the performance of the proposed algorithm is fully evaluated against state-of-the-art methods on the Caltech [20] and ETH [18] datasets.
TABLE 2: COMPARISONS OF PEDESTRIAN DETECTION RESULTS BY LOG-AVERAGE MISS RATE (MR) UNDER IOU=0.5 ON THE CALTECH DATASET

Detection Network | Proposals | RoI features | MR_all | MR_f | MR_m | MR_n
(a) Based on a single output layer | C4 | H3 | 46.52% | 72.83% | 36.50% | 16.35%
(b) Based on a single output layer | C4 | H4 | 43.69% | 86.50% | 38.08% | 2.12%
(c) Based on a single output layer | C4 | H5 | 84.84% | – | 85.82% | 2.51%
(d) Based on multiple output layers | C4 | {H3, H4} | 42.58% | 65.86% | 30.53% | 1.48%
(e) Based on multiple output layers | C4 | {H4, H5} | 44.37% | 90.56% | 33.66% | 1.25%
(f) Based on multiple output layers | C4 | {H3, H4, H5} | 40.39% | 70.69% | 28.77% | 1.08%
(g) Based on multiple output layers | C4 | {C3, C4, C5} | 50.47% | 78.67% | 37.43% | 2.22%
(h) Based on multiple output layers | C4 | {P3, P4, P5} | 46.69% | 74.42% | 31.16% | 1.79%
(i) Based on multiple output layers | C4 | {P2, P3, P4, P5} | 45.12% | 71.54% | 32.83% | 1.46%
Following the evaluation criteria proposed in [55], the log-average miss rate is used to summarize detector performance. It is computed by averaging the miss rate at FPPI (false positives per image) rates evenly spaced in log-space within the range 10^-3 to 10^0. The experiments demonstrate that combining the cross-scale features aggregation module and the scale-aware hierarchical detection network outperforms state-of-the-art pedestrian detection algorithms, especially on pedestrian instances of small size.
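For reference, the summary metric can be sketched as follows (NumPy; using nine reference FPPI points and log-space (geometric) averaging is our assumption based on the standard Caltech protocol, since the text only specifies the FPPI range):

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Geometric mean of miss rates sampled at log-spaced FPPI points.

    fppi, miss_rate: 1-D arrays tracing the curve, sorted by increasing FPPI.
    """
    refs = np.logspace(-3.0, 0.0, num=9)  # evenly spaced in log-space
    samples = []
    for r in refs:
        below = np.where(fppi <= r)[0]
        # Miss rate at the largest FPPI not exceeding the reference point;
        # fall back to the worst observed miss rate if none qualifies.
        samples.append(miss_rate[below[-1]] if below.size else miss_rate.max())
    return np.exp(np.mean(np.log(np.maximum(samples, 1e-10))))
```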
1) Comparison with state-of-the-art methods on the Caltech dataset
The Caltech pedestrian dataset consists of approximately 10 hours of 640×480, 30 Hz video taken from a vehicle driving through regular traffic in an urban environment, comprising about 250,000 frames with a total of 2,300 unique pedestrians. Similar to relevant previous publications [13], [16], [17], [24], we evaluate our method on pedestrians of different spatial scales on the Caltech testing dataset, and choose the Caltech training dataset and the INRIA training dataset [10] as our training set. Our proposed method is evaluated against the state-of-the-art methods on the Caltech testing dataset, including LDCF [22], ACF+SDt [34], RPN+BF [7], MS-CNN [5], CompACT-Deep [16], TA-CNN [24], SA-FastRCNN [4], FasterRCNN+ATT [44], and AR-Ped [2].
To evaluate the effectiveness of our proposed scale-aware hierarchical detection network, quantitative comparison results are presented for different scale ranges of pedestrian instances on the Caltech dataset. Fig. 5 shows the comparison of log-average miss rates for pedestrians under different scale ranges. It can be observed that our proposed method significantly outperforms the other methods and achieves the lowest log-average miss rate of 28.77% at the medium scale, as shown in Fig. 5(a), which is lower than the state-of-the-art approach FasterRCNN+ATT [44] by 11.98%. Following a similar trend in Fig. 5(b), our approach achieves a 7.41% log-average miss rate for pedestrian instances taller than 50 pixels, second only to the state-of-the-art approach AR-Ped [2].
For pedestrian instances in the far scale range, most methods exhibit dramatic performance drops, as shown in Fig. 5(c).
[Figure 5 panels, log-average miss rate (lower is better):
(a) Medium scale (30-80 pixels in height): TA-CNN 63.62%, LDCF 61.82%, DeepParts 56.42%, RPN+BF 53.93%, CompACT-Deep 53.23%, SA-FastRCNN 51.83%, AR-Ped 49.31%, MS-CNN 49.13%, FasterRCNN+ATT 40.75%, Ours (SHDN) 28.77%.
(b) Reasonable (height ≥ 50 pixels): LDCF 24.80%, TA-CNN 20.86%, DeepParts 11.89%, CompACT-Deep 11.75%, FasterRCNN+ATT 10.33%, MS-CNN 9.95%, SA-FastRCNN 9.68%, RPN+BF 9.58%, Ours (SHDN) 7.41%, AR-Ped 6.45%.
(c) Far scale (20-30 pixels in height): ACF+SDt 100.00%, SA-FastRCNN 100.00%, CompACT-Deep 100.00%, DeepParts 100.00%, RPN+BF 100.00%, TA-CNN 100.00%, LDCF 100.00%, MS-CNN 97.23%, FasterRCNN+ATT 90.94%, Ours (SHDN) 70.69%.
(d) Overall (height ≥ 20 pixels): LDCF 71.25%, TA-CNN 71.22%, DeepParts 64.78%, RPN+BF 64.66%, CompACT-Deep 64.44%, SA-FastRCNN 62.59%, MS-CNN 60.95%, AR-Ped 58.83%, FasterRCNN+ATT 54.51%, Ours (SHDN) 40.39%.]
FIGURE 5: Quantitative results of comparisons on the Caltech dataset.
Although our proposed method outperforms the available state-of-the-art competitors, it remains difficult to reliably identify small-size pedestrian instances under 30 pixels in height. In Fig. 5(c), the log-average miss rate is reduced to 70.69%, an improvement of 20.25% over FasterRCNN+ATT [44]. This resembles human performance, which is also quite good at the large scales but degrades noticeably at the medium and far scales. Significantly, over the whole scale span, our approach achieves a log-average miss rate of 40.39% for all pedestrian instances taller than 20 pixels, better than the current FasterRCNN+ATT [44] by 14.12%, as shown in Fig. 5(d). The comparison results across different scale ranges of pedestrian instances demonstrate that our proposed approach substantially improves pedestrian detection performance.
Fig. 6 shows the detection results of our proposed scale-aware hierarchical detection network on the Caltech dataset.
The green dotted bounding boxes represent true positive windows, where the intersection over union (IoU) between the detected window and the ground truth (green solid bounding box) exceeds 50%; otherwise, false positive windows are denoted by red dotted bounding boxes. As shown in Fig. 6, most pedestrian instances across different scale ranges can be detected by our proposed approach. Moreover, because the augmented feature level with the appropriate resolution is adaptively perceived for each pedestrian scale, medium-size and small-size pedestrian instances can also be detected by the proposed scale-aware hierarchical detection network. The red dotted bounding boxes in Fig. 6 also include positive pedestrians that are not marked in the ground truth. This experiment shows that jointly using the cross-scale features aggregation module and the scale-aware hierarchical detection network for pedestrian detection outperforms the state-of-the-art algorithms, especially for pedestrian instances in the medium and small scale ranges.

FIGURE 6: Detection results of our approach on the Caltech dataset.
2) Comparison with state-of-the-art methods on the ETH dataset
The ETH benchmark dataset consists of 3 testing video sequences with a resolution of 640×480 and a frame rate of 13 FPS. Studies in [7], [37] report that state-of-the-art algorithms achieve remarkable detection performance on the ETH dataset, including ChnFtrs [21], MultiFtr+Motion [35], JointDeep [37], pAUCBoost [40], ConvNet [41], DBN-Mut [12], SpatialPooling [39], TA-CNN [24], and RPN+BF [7]. As most approaches are trained on the INRIA training dataset [10], our proposed method is also trained on the INRIA training dataset. As shown in Fig. 7(a), the log-average miss rate of our proposed approach reaches 44.75%, next to the state-of-the-art SpatialPooling [39] at 43.36%, for pedestrians at the medium scale. In a similar trend for pedestrians at the near scale, our approach achieves a 20.49% log-average miss rate, second only to the best available competitor RPN+BF [7], as shown in Fig. 7(b). Significantly, for pedestrian instances taller than 80 pixels in height, our approach obtains a 16.84% log-average miss rate, improving by 0.78% over the state-of-the-art RPN+BF [7], as shown in Fig. 7(c). Moreover, for the more challenging setting with large scale variation (above 20 pixels in height), the log-average miss rate of our approach is reduced by 3.98% over RPN+BF [7] on the ETH dataset, as shown in Fig. 7(d). The results demonstrate that our proposed method has substantially better detection performance for multiscale pedestrian instances appearing with large scale variations in natural scenes.
The pedestrian detection results of our proposed method on the ETH dataset are shown in Fig. 8, where the green dotted boxes show the detections of our approach. Our proposed approach adaptively perceives the augmented feature level to generate the final detection results for each pedestrian scale through the scale-aware hierarchical detection network. Small-size pedestrian instances can also be detected; the red dotted bounding boxes in Fig. 8 represent positive pedestrians that are not marked in the ground truth. One can observe that our method successfully detects most of the pedestrian instances, especially pedestrians with large scale variations.

FIGURE 8: Detection results of our approach on the ETH dataset.
V. CONCLUSION
This study describes an effective approach to detecting pedestrian instances across different scale ranges. The proposed cross-scale features aggregation module adaptively fuses hierarchical features to enhance the feature pyramid representation by merging the lateral connection, the top-down path, and the bottom-up path. Moreover, by probing the differences among local features with different receptive field sizes, the proposed scale-aware hierarchical detection network effectively integrates multiscale pedestrian detection into a unified framework, adaptively perceiving the augmented feature level suited to each pedestrian scale. Experimentally, compared with the state-of-the-art FasterRCNN+ATT [44], the log-average miss rate of pedestrian detection is reduced by 11.98% for medium-scale pedestrians (30-80 pixels in height) and by 14.12% for whole-scale pedestrians (above 20 pixels in height) on the Caltech benchmark.
[Figure 7 panels, log-average miss rate (lower is better):
(a) Medium scale (30-80 pixels in height): MultiFtr+Motion 67.77%, ConvNet 65.71%, DBN-Mut 59.59%, JointDeep 58.78%, pAUCBoost 55.42%, ChnFtrs 53.55%, RPN+BF 53.38%, TA-CNN 48.69%, Ours (SHDN) 44.75%, SpatialPooling 43.36%.
(b) Near scale (height ≥ 80 pixels): ChnFtrs 48.34%, MultiFtr+Motion 45.44%, JointDeep 40.75%, pAUCBoost 39.72%, ConvNet 39.23%, DBN-Mut 34.73%, SpatialPooling 29.66%, TA-CNN 23.24%, Ours (SHDN) 20.49%, RPN+BF 17.63%.
(c) Reasonable (height ≥ 50 pixels): MultiFtr+Motion 59.99%, ChnFtrs 57.47%, ConvNet 50.27%, pAUCBoost 49.06%, JointDeep 45.32%, DBN-Mut 41.07%, SpatialPooling 37.37%, TA-CNN 34.98%, RPN+BF 30.23%, Ours (SHDN) 29.45%.
(d) Overall (height ≥ 20 pixels): MultiFtr+Motion 70.12%, ChnFtrs 61.86%, ConvNet 57.80%, JointDeep 54.32%, pAUCBoost 53.56%, DBN-Mut 51.28%, SpatialPooling 43.19%, TA-CNN 42.92%, RPN+BF 39.46%, Ours (SHDN) 35.48%.]
FIGURE 7: Quantitative results of comparisons on the ETH dataset.
REFERENCES
[1] Z. Chen, L. Zhang, A. M. Khattak, et al., “Deep feature fusion by competitive attention for pedestrian detection,” IEEE Access, vol. 7, pp. 21981–21989, 2019.
[2] G. Brazil, X. Liu, “Pedestrian Detection With Autoregressive Network
Phases,” in CVPR, Long Beach, CA, USA, 2019, pp.7231-7240.
[3] Q. Zhao, T. Sheng, Y. Wang, et al., “M2Det: A Single-Shot Object Detector Based on Multi-Level Feature Pyramid Network,” in AAAI, Honolulu, Hawaii, USA, 2019.
[4] J. Li, X. Liang, S. Shen, et al., “Scale-aware Fast R-CNN for Pedestrian
Detection,” in CVPR, Honolulu, Hawaii, 2017, pp.985–996.
[5] Z. Cai, Q. Fan, R. Feris, et al., “A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection,” in ECCV, Amsterdam, Netherlands, 2016, pp. 354–370.
[6] R. Girshick, “Fast R-CNN,” in ICCV, Santiago, Chile, 2015, pp.1440–
1448.
[7] L. Zhang, L. Lin, X. Liang, et al., “Is Faster R-CNN Doing Well for
Pedestrian Detection?,” in ECCV, Springer, Cham, 2016, pp. 443–457.
[8] C. Fei, B. Liu, Z. Chen, et al., “Learning Pixel-Level and Instance-
Level Context-Aware Features for Pedestrian Detection in Crowds,” IEEE
Access, vol. 7, pp. 94944–94953, 2019.
[9] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-
time object detection with region proposal networks,” in NIPS, Montreal,
Canada, 2015, pp.1–9.
[10] N. Dalal, B. Triggs, Histograms of oriented gradients for human detec-
tion,” in CVPR, San Diego, California, USA, 2005.
[11] F. Yang, W. Choi, Y. Lin, “Exploit All the Layers: Fast and Accurate CNN
Object Detector with Scale Dependent Pooling and Cascaded Rejection
Classifiers,” in CVPR, Las Vegas, USA, 2016, pp.2129–2137.
[12] W. Ouyang, X. Zeng, X. Wang, Modeling Mutual Visibility Relationship
with a Deep Model in Pedestrian Detection,” in CVPR, Portland, Oregon,
2013, pp.3222–3229.
[13] Y. Tian, P. Luo, X. Wang, et al., Deep learning strong parts for pedestrian
detection,” in ICCV, Santiago, Chile, 2015, pp.1904–1912.
[14] D. Hoiem, Y. Chodpathumwan, Q. Dai, “Diagnosing error in object
detectors,” in ECCV, Florence, Italy, 2012, pp.340–353.
[15] J. Yan, X. Zhang, Z. Lei, et al., “Robust Multi-Resolution Pedestrian
Detection in Traffic Scenes, in CVPR, Portland, Oregon, 2013, pp.3033–
3040.
[16] Z. Cai , M. Saberian, N. Vasconcelos, “Learning complexity-aware cas-
cades for deep pedestrian detection” in ICCV, Santiago, Chile, 2015,
pp.3361–3369.
[17] S. Zhang, R. Benenson, B. Schiele, “Filtered channel features for pedes-
trian detection,” in CVPR, Boston, Massachusetts, 2015, pp.1751–1760.
[18] P. Dollar, S. Belongie, P. Perona, “The fastest pedestrian detector in the
west,” in BMVC, Aberystwyth, UK, 2010.
[19] P. Dollar, R. Appel, W. Kienzle, “Crosstalk cascades for frame-rate pedes-
trian detection,” in ECCV, Florence, Italy, 2012, pp.645–659.
[20] P. Dollar, C. Wojek, B. Schiele, et al., “Pedestrian Detection: A Bench-
mark,” in CVPR, Miami, Florida, USA, 2009, pp.304–311.
[21] P. Dollar, Z. Tu, P. Perona, et al., “Integral channel features,” in BMVC,
London, 2009.
[22] W. Nam, P. Dollar, J. Han, “Local decorrelation for improved pedestrian
detection,” in NIPS, Montreal, Canada, 2014, pp.424–432.
[23] R. Benenson, M. Omran, J. Hosang, et al., “Ten years of pedestrian
detection, what have we learned?, in ECCV, Zurich, Switzerland, 2014,
pp.613–627.
[24] Y. Tian, P. Luo, X. Wang, et al., “Pedestrian detection aided by deep learn-
ing semantic tasks,” in CVPR, Boston, Massachusetts, 2015, pp.5079–
5087.
[25] J. Hosang, M. Omran, R. Benenson, et al., “Taking a deeper look at
pedestrians,” in CVPR, Boston, Massachusetts, 2015, pp.4073–4082.
[26] M. Zeiler, R. Fergus, Visualizing and understanding convolutional net-
works,” in ECCV, Zurich, Switzerland, 2014, pp.818–833.
[27] B. Yang, J. Yan, Z. Lei, et al., Convolutional channel features, in ICCV,
Santiago, Chile, 2015, pp.82–90.
[28] P. Dollar, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for object detection,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 36, no. 8, pp. 1532–1545, 2014.
[29] P. Zhou, B. Ni, C. Geng, et al., Scale-Transferrable Object Detection, in
CVPR, Salt Lake City, UT, USA, 2018, pp.528–537.
[30] J. Dai, Y. Li, K. He, et al., R-FCN: Object Detection via Region-based
Fully Convolutional Networks, in NIPS, Barcelona, Spain, 2016, pp.379–
387.
[31] W. Liu, D. Anguelov, D. Erhan, et al., SSD: Single Shot MultiBox
Detector, in ECCV, Amsterdam, Netherlands, 2016, pp.21-37.
[32] S. Liu, L. Qi, H. Qin, et al., Path Aggregation Network for Instance
Segmentation,” in CVPR, Salt Lake City, UT, USA, 2018, pp.8759-8768.
[33] K. He, X. Zhang, S. Ren, et al., Deep Residual Learning for Image
Recognition,” in CVPR, Las Vegas, NV, USA, 2016, pp.770–778.
[34] D. Park, C. Zitnick, D. Ramanan, et al., “Exploring Weak Stabilization for
Motion Feature Extraction,” in CVPR , Portland, Oregon, 2013, pp.2882–
2889.
[35] S. Walk, N. Majer, K. Schindler, et al., New Features and Insights for
Pedestrian Detection,” in CVPR, San Francisco, CA, USA, 2010, pp.1030–
1037.
[36] G. Chen, Y. Ding, J. Xiao, et al., Detection Evolution with Multi-order
Contextual Co-occurrence,” in CVPR, Portland, Oregon, 2013,pp.1798–
1805.
[37] W. Ouyang, X. Wang, Joint Deep Learning for Pedestrian Detection,” in
ICCV, Sydney, Australia, 2013, pp.2056–2063.
[38] B. Wu and R. Nevatia, Detection and tracking of multiple, partially oc-
cluded humans by bayesian combination of edgelet based part detectors,”
International Journal of Computer Vision (IJCV), vol. 75, no. 2, pp. 247–
266, 2007.
[39] S. Paisitkriangkrai, C. Shen, A. van den Hengel, Strengthening the
Effectiveness of Pedestrian Detection, in ECCV, Zurich, Switzerland,
2014, pp.546–561.
[40] S. Paisitkriangkrai, C. Shen, A. van den Hengel, “Efficient pedestrian detection by directly optimizing the partial area under the ROC curve,” in ICCV, Sydney, Australia, 2013, pp. 1057–1064.
[41] P. Sermanet, K. Kavukcuoglu, S. Chintala, et al., “Pedestrian Detection with Unsupervised Multi-Stage Feature Learning,” in CVPR, Portland, Oregon, 2013, pp. 3626–3633.
[42] B. Zhou, A. Khosla, A. Lapedriza, et al., “Learning Deep Features for Discriminative Localization,” in CVPR, Las Vegas, NV, USA, 2016, pp. 2921–2929.
[43] T. Lin, P. Dollar, R. Girshick, et al., “Feature Pyramid Networks for Object Detection,” in CVPR, Honolulu, USA, 2017, pp. 2117–2125.
[44] S. Zhang, J. Yang, B. Schiele, “Occluded Pedestrian Detection Through Guided Attention in CNNs,” in CVPR, Salt Lake City, UT, USA, 2018, pp. 6995–7003.
[45] Z. Hao, Y. Liu, H. Qin, et al., “Scale-Aware Face Detection,” in CVPR, Honolulu, USA, 2017, pp. 6186–6195.
[46] S. Gao, M. Cheng, K. Zhao, et al., “Res2Net: A New Multi-scale Backbone Architecture,” arXiv preprint arXiv:1904.01169, 2019.
[47] S. Liu, L. Qi, H. Qin, et al., “Path Aggregation Network for Instance Segmentation,” in CVPR, Salt Lake City, UT, USA, 2018, pp. 8759–8768.
[48] C. Fu, W. Liu, A. Ranga, et al., “DSSD: Deconvolutional Single Shot Detector,” arXiv preprint arXiv:1701.06659, 2017.
[49] F. Yu, D. Wang, E. Shelhamer, et al., “Deep Layer Aggregation,” in CVPR, Salt Lake City, UT, USA, 2018, pp. 2403–2412.
[50] X. Zhang, L. Cheng, B. Li, et al., “Too Far to See? Not Really! Pedestrian Detection with Scale-aware Localization Policy,” IEEE Trans. Image Processing, vol. 27, no. 8, pp. 3703–3715, 2018.
[51] P. Felzenszwalb, R. Girshick, D. McAllester, et al., “Object detection with discriminatively trained part based models,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
[52] J. Cao, Y. Pang, X. Li, “Pedestrian Detection Inspired by Appearance Constancy and Shape Symmetry,” IEEE Trans. Image Processing, vol. 25, no. 12, pp. 5538–5551, 2016.
[53] G. Ghiasi, T. Lin, Q. Le, “NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection,” in CVPR, Long Beach, CA, USA, 2019, pp. 7036–7045.
[54] Y. Li, Y. Chen, N. Wang, Z. Zhang, “Scale-Aware Trident Networks for Object Detection,” in ICCV, Seoul, Korea, 2019.
[55] P. Dollar, C. Wojek, B. Schiele, et al., “Pedestrian detection: An evaluation of the state of the art,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, 2012.
[56] B. Yang, J. Yan, Z. Lei, S. Li, “Convolutional Channel Features,” in ICCV, Santiago, Chile, 2015.
[57] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick, “Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks,” in CVPR, Las Vegas, NV, USA, 2016, pp. 2874–2883.
[58] S. Choudhury, R. Padhy, A. Sangaiah, et al., “Scale Aware Deep Pedestrian Detection,” Transactions on Emerging Telecommunications Technologies, vol. 30, no. 9, pp. 1–14, 2019.
[59] T. Liu, M. Elmikaty, T. Stathaki, et al., “SAM-RCNN: Scale-Aware Multi-Resolution Multi-Channel Pedestrian Detection,” in BMVC, Newcastle, UK, 2018.
XIAOWEI ZHANG received the Ph.D. degree in computer science from Beihang University, Beijing, China, in 2018, the M.S. degree in computer science from Shandong Normal University, Jinan, China, in 2013, and the B.S. degree in computer science from Shanxi Normal University, Linfen, China, in 2009. He was a visiting student at the Bioinformatics Institute (BII), A*STAR, Singapore, from 2016 to 2017. Currently, he is an assistant professor of Computer Science and Engineering at Qingdao University. His current research interests include image/video analysis and understanding, computer vision, and machine learning.

SHUAI CAO received the B.S. degree in computer science and technology from Liaoning University of Technology, Jinzhou, China, in 2018. He is currently pursuing the M.S. degree in Computer Science and Engineering with Qingdao University, China. His current research interests include pedestrian detection and machine learning.

CHENGLIZHAO CHEN received the Ph.D. degree in computer science from Beihang University in 2017. He is currently an assistant professor with Qingdao University. His research interests include computer vision, machine learning, and pattern recognition.