Feature Refinement and Filter Network for Person Re-identification

Xin Ning, Member, IEEE, Ke Gong, Weijun Li, Senior Member, IEEE, Liping Zhang, Member, IEEE, Xiao Bai, and Shengwei Tian

Manuscript submitted on April 27, 2020. This work was supported by the National Natural Science Foundation of China (Grant No. 61901436) and Shenzhen Wave Kingdom Co., Ltd. (Corresponding authors: Ke Gong and Weijun Li.)
X. Ning and L. Zhang are with the Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China, and the Cognitive Computing Technology Joint Laboratory, Wave Group, Beijing 102208, China (e-mail: ningxin@semi.ac.cn; zliping@semi.ac.cn).
K. Gong is with the Cognitive Computing Technology Joint Laboratory, Wave Group, Beijing 102208, China (e-mail: gongke@wavewisdom-bj.com).
W. Li is with the Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China, the Center of Materials Science and Optoelectronics Engineering & School of Microelectronics, University of Chinese Academy of Sciences, Beijing 100049, China, and the Cognitive Computing Technology Joint Laboratory, Wave Group, Beijing 102208, China (e-mail: wjli@semi.ac.cn).
X. Bai is with the School of Computer Science and Engineering, Beihang University, Beijing 100191, China (e-mail: baixiao@buaa.edu.cn).
S. Tian is with the School of Software, Xinjiang University, Xinjiang 830000, China.
Copyright © 2020 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.
Abstract--In the task of person re-identification, the attention mechanism and fine-grained information have been proved to be effective. However, it has been observed that models often focus on the extraction of features with strong discrimination, and neglect other valuable features. The extracted fine-grained information may include redundancies. In addition, current methods lack an effective scheme to remove background interference. Therefore, this paper proposes the feature refinement and filter network to solve the above problems from three aspects: first, by weakening the high response features, we aim to identify highly valuable features and extract the complete features of persons, thereby enhancing the robustness of the model; second, by positioning and intercepting the high response areas of persons, we eliminate the interference arising from background information and strengthen the response of the model to the complete features of persons; finally, valuable fine-grained features are selected using a multi-branch attention network for person re-identification to enhance the performance of the model. Our extensive experiments on the benchmark Market-1501, DukeMTMC-reID, CUHK03 and MSMT17 person re-identification datasets demonstrate that the performance of our method is comparable to that of state-of-the-art approaches.

Index Terms--Person re-identification; deep learning; attention; person search
I. INTRODUCTION
Person re-identification, a technology for retrieving images of a specific person from cameras in multiple non-overlapping areas, has pivotal applications in the security field, including target tracking and person retrieval. In such tasks, the resolution of the person images is too low for identification through face recognition. Moreover, the images often contain intricate backgrounds, occlusions and variations in a person's pose. Because cameras with disparate orientations normally have dissimilar viewing angles, the
difficulty of person recognition is further increased [1]. Hence, person re-identification has long been a challenging task.
Fig. 1. Visualization of model feature maps. (a) Person image; (b) and (c) show the feature maps of ResNet and of our model, respectively. ResNet fails to cover all features of the person and also responds to background information; our model greatly alleviates both problems.
The performance of person re-identification, which is a sub-
topic of image recognition, largely depends on the
representation of a person’s features. In recent years, image
recognition has entered a new stage owing to multilayer
convolution-based deep learning methods. However, models that perform excellently in generic image classification, such as ResNet [2], InceptionNet [3] and VGG [4], can hardly attain desirable results here. This is because person
images have similar appearance features and relatively low inter-image distinction. Owing to factors such as pose, viewing angle, lighting, occlusion and background interference, the classification becomes even more difficult. Moreover, as shown in Fig. 1, existing methods tend to focus merely on the parts of the image that contribute most to classification, instead of on all the features of the person; importantly, the neglected parts often have recognition value. In addition, some background information is also used for recognition, which further degrades performance.
In the course of deep learning research, Li et al. [5] found that, when recognizing images, a model sometimes focuses its attention on image backgrounds that are unrelated to the recognition target. Tian et al. [6] experimentally quantified the effect of background interference on the performance of person re-identification algorithms and eliminated it with a stochastic background method. There are also plentiful studies that locate human body parts with the aid of models for attribute recognition [7], key-point location [8] and semantic image segmentation [9], thereby removing background influences. However, because these methods introduce models built for other tasks, biases can be brought into the person re-identification system, compromising its performance.
Fig. 2. Not all features in person pictures are beneficial to person re-identification. (a) Occlusion; (b) complex backgrounds; (c) distinguishing features. Different features of a person contribute differently to re-identification: the features in the red frame in (c) are clearly more important than the others.
Current research emphasizes the extraction of diverse detailed local features. Specifically, a person picture is segmented into n parts of identical size, deeper features are extracted from each part separately, and the parts are then combined as the discriminative feature. This method [10][11][12][13], which extracts features for different image parts, often yields excellent results. However, as shown in Fig. 2, some local features are not necessarily useful, and even among useful features their importance differs. Jointly extracting and processing features that have no effect on person recognition inevitably hurts model performance. Compared with extracting local features from fixed-size partitions, filtering discriminative regions and extracting detailed features from them is more meaningful.
In addition to the above methods of obtaining local features, there have also been attempts to obtain local features with diverse attention. In [37], singular value decomposition is applied to the weight matrix of the last layer of the network to reduce the correlation between features; although satisfactory results are achieved, the computational cost is high. A diversity regularization term based on the Hellinger distance is used in [39] to make different branches focus on different parts, and [29] avoids excessive concentration by constraining the distance between the row indexes of high-response features in the feature maps. These attention diversity methods are similar to those in [43]: they can acquire different attention features when there are few attention branches, but when there are too many, feature redundancy easily degrades the quality of the features. To address this problem, we propose a new attention diversity loss.
In this paper, the feature refinement and filter network is proposed to address person re-identification from three aspects. First, complete pedestrian features are extracted: by weakening the features of the high-response region, the model pays attention to more useful features and its robustness is enhanced. Second, on the basis of the complete features, the high-response features of the person are further located, the interference of background information is removed, and the generalization ability of the features is enhanced. Finally, valuable fine-grained features for person re-identification are selected through the multi-branch attention network, yielding better model performance.
The main contributions of this work can be summarized as
follows:
(1) By weakening the feature values of high-response areas,
the model can mine more valuable areas in the image, and it not
only ensures the stability of the training process, but also learns
the complete feature of a person.
(2) By locating the high-response feature based on (1), the
interference arising from background information of the input
picture is removed, thereby improving the extraction of the
global interference-free feature.
(3) In order to obtain the local fine-grained features of a
person, a multi-branch attention network with diversity loss is
designed, and the local features are obtained via adaptive
filtering by removing interference information.
(4) We conducted extensive experiments on the mainstream Market-1501, DukeMTMC-reID, CUHK03 and MSMT17 datasets; notably, we illustrate the outstanding performance of the proposed method through heat maps.
II. RELATED WORK
This study involves feature representation, feature
suppression, the attention mechanism and attention diversity.
An overview of recent work along these directions in person re-identification is given below.
A. Feature representation
Feature representation is the core problem of pedestrian
recognition. Many methods are based on extracting
discriminative features with stability in different scenes to
enhance the performance of the model.
Fig. 3. The framework of the proposed feature refinement and filter network for person re-identification. The network includes a backbone composed of Weaken Feature Conv Blocks and the multi-branch network. Weaken Feature Conv Blocks 1~4 are based on the four ResNet blocks, while the additional Conv Block 4 is based on the last block of ResNet.
Conventional person recognition generally adopts the idea of
manual features plus classifiers, such as methods in [14][15].
With conventional methods, the feature representation relies
primarily on the manually designed features, which require
professional knowledge and a complex process of parameter
adjustment. The development of deep learning has witnessed
impressive results of person re-identification achieved on
plenty of challenging datasets [16][17][18][19]. In spite of this, factors such as occlusion, view changes and background information limit further improvement of model performance, which has led some recent studies to focus on the local information of pictures. In [20][21] the global and local features
were obtained via region segmentation, where each local
feature corresponded to a dissimilar segmentation block. A
network named part-based convolutional baseline (PCB) was
put forward in [10], which divided the image features into six
horizontal blocks of identical size. Moreover, each local feature
generated an independent ID prediction loss. Later, this method
of feature segmentation was inherited and popularized in
[11][12][13][18][22][23]. Nonetheless, dividing a picture into fixed-size parts cannot guarantee the validity of the divided blocks, nor does it filter for valid blocks. Consequently, computing resources are wasted and the algorithmic performance is affected to some extent.
Local information can be acquired not only by feature segmentation but also under the guidance of auxiliary networks. By utilizing a person parsing network, Tian et al. [6] separately obtained the characteristics of the head, the trunk and the legs, and pixel-level semantic segmentation was used in [9][24] to extract the features of the foreground, the head, the upper body, the lower body and the shoes; these features were eventually fused for recognition. Other studies [7][17] located the body parts with an attribute recognition or skeletal keypoint model before extracting local features. However, introducing models designed for other tasks brings their biases into the person re-identification task and can degrade its performance.
B. Feature suppression
Feature suppression can also be called "feature drop". Prior to this work the concept did not appear explicitly in the literature, but the following studies fall into this category. The early work in [24] made the model more robust to noise and occlusion, and improved its performance, by randomly occluding parts of the sample photos. Similarly, in [26] a data augmentation method named Cutout was proposed to diminish overfitting and enhance network generalization. Cutout not only allowed the model to learn to distinguish features, but also to better integrate context and thus notice some minor local features. Along the same lines, Hu et al. [27] achieved data augmentation by occluding local areas in the sample pictures, yielding a model with better generalization and robustness.
With the deepening of research, some scholars have found that feature occlusion can be unexpectedly effective. According to [28], when the most discriminative block is occluded, the network is forced to find other relevant parts; this idea was applied to unsupervised semantic segmentation and to the visual interpretation of neural networks in [5]. In this regard, the model performance in person re-identification was improved in [29][30][31] by suppressing the most recognizable feature, which allowed the networks to learn comprehensive saliency maps.
Although suppressing features in this way is somewhat conducive to model performance, feature suppression can make training difficult, as mentioned in [31]: a vital feature found in one training iteration is erased (suppressed) in the next, which adds to the difficulty of training the model. For this problem, we propose a scheme for weakening features, which is described in the following section.
Fig. 4. The framework of the Weaken Feature Conv blocks. Each block is based on a ResNet block (such as conv2_x, conv3_x, conv4_x and conv5_x of ResNet-50), whose output is reduced to a single-channel feature map. The weaken-salient-feature module attenuates the high-response features and performs a bitwise multiplication with the input to form a new input.
C. Attention and diversity
Attention mechanisms have been applied extensively to person re-identification. The most widely used are channel attention [32] and spatial attention [33], which have been integrated into several studies [18][23][24] for better performance. In addition, Yuan et al. [74] proposed a vertical spatial sequence attention. Based on the principles of attention mechanisms, some scholars have put forward anti-attention mechanisms, initially applied to image semantic segmentation [34] and object detection [35]. An integration of channel, spatial, anti-channel and anti-spatial attention was adopted in [36] to enhance feature representation in person re-identification. Obtaining rich and diverse features is often essential to this goal. Thus, when applying attention mechanisms, we normally hope that the generated attention maps are diverse, especially for multi-branch attention. Building on [37], weighted soft orthogonality is adopted in [38] to ensure the diversity of the extracted features. Aside from constraining weight orthogonality, Li et al. [39] used a Hellinger-distance-based diversity regularization term to make the network attend to different parts. In this paper, we use feature activation and Gaussian normalization to measure the distance between attention maps and thereby achieve attention diversity.
III. METHOD
In this section, we describe the proposed feature refinement and filter network, whose overall structure is presented in Fig. 3. It primarily comprises the global feature enhancement network, the multi-attention network, and the attention diversity loss.
A. Feature refinement and filter network
To achieve optimal feature learning, a common approach is to identify the regions with discriminative features in the sample pictures and extract the most discriminative features from them. However, as mentioned above, this approach often fails to produce the desired result. Hence, this study attempts to eliminate the interference information in the pictures and extract all the feature information that is useful for re-identification, rather than merely extracting some of the most discriminative information; similar ideas have proven useful in [17][29]. To demonstrate the performance of the proposed method, we choose ResNet-50 as the backbone network, as in other studies.
Weaken salient feature: To this end, we designed the Weaken
Feature Conv blocks based on the Resnet module, as shown in
Fig. 4. This module can also be designed based on other
network modules like VGG and Inception Net. For an image I,
we can express its feature map produced by a Conv block (e.g., a ResNet block) as $F \in \mathbb{R}^{C \times H \times W}$, where $H$, $W$ and $C$ denote the height, the width and the number of channels of the feature layer, respectively. For more convenient implementation of the next operation, we reduce the feature map to a single channel via a channel-like attention, obtaining the feature map $A \in \mathbb{R}^{H \times W}$. The operational procedure is as follows:

$f_c = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} F_c(i, j)$  (1)

$f = \frac{1}{C}\sum_{c=1}^{C} f_c \odot F_c$  (2)

$A = \mathrm{sigmoid}(up\_sample(f))$  (3)

where $\odot$ denotes bitwise (element-wise) multiplication and $up\_sample$ represents the up-sampling operation. The feature map is extended to the same size as the input of the Conv module so that they can be multiplied bitwise.
Through the above operations, we derive the attention feature map $A$ of the Conv module, in which higher-value regions indicate areas of the original image that receive more attention from the model, and lower-value regions indicate areas that receive little or none. Here, we weaken the high-response areas in the feature map $A$, which forces the network to focus on the features of areas other than the high-response ones. In this way, all useful features in the input image can be given importance by the model, and the interference of background noise and other useless features is reduced. To be specific, we set a threshold $\theta$ and regard the areas of the feature map $A$ whose values exceed this threshold as the high-response areas. Afterwards, a weakening factor $\alpha \in (0, 1)$ is introduced, and the weakening factor matrix $M$ is specified as in Eq. (4). When the
feature value $A(i, j)$ is greater than $\theta$, it is considered a high-response feature and the corresponding weakening factor is $\alpha$; otherwise, it is considered a low-response feature, no weakening is required, and the factor is 1.

$M(i, j) = \begin{cases} \alpha, & \text{if } A(i, j) > \theta \\ 1, & \text{otherwise} \end{cases}$  (4)
Lastly, as shown in Fig. 4, the input I of the Conv module is multiplied bitwise by the weakening factor matrix $M$ to obtain a weakened input $I'$. By doing so, the originally high-response areas are weakened while other areas are relatively strengthened, allowing the model to focus more on areas outside the high-response regions and dig out all useful features in the input image. To better comprehend the Weaken Feature Conv blocks, the feature sizes of the main operations are listed in Table I. Blocks 1~4 are based on the corresponding blocks of ResNet, with an input image size of 384 × 128.
TABLE I. SIZES OF FEATURES OF THE MAIN OPERATIONS IN FIG. 4

Module  | Input of the module (I) | Output of the module (F) | Feature map (f) | Output of weaken salient feature (M)
Block 1 | (64, 96, 32)  | (256, 96, 32)  | (96, 32) | (96, 32)
Block 2 | (256, 96, 32) | (512, 48, 16)  | (48, 16) | (96, 32)
Block 3 | (512, 48, 16) | (1024, 24, 8)  | (24, 8)  | (48, 16)
Block 4 | (1024, 24, 8) | (2048, 24, 8)  | (24, 8)  | (24, 8)
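To make the above operations concrete, the sketch below shows one possible PyTorch implementation of the weaken-salient-feature step in Eqs. (1)-(4). The class name, the use of bilinear up-sampling, and the default values of θ and α are our assumptions rather than details taken from the paper.

```python
# Possible PyTorch sketch of the weaken-salient-feature step (Eqs. (1)-(4)).
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeakenSalientFeature(nn.Module):
    def __init__(self, theta=0.5, alpha=0.4):
        super().__init__()
        self.theta = theta   # threshold marking high-response positions
        self.alpha = alpha   # weakening factor applied to those positions

    def forward(self, block_input, block_output):
        # Eq. (1): per-channel spatial average f_c of the block output F
        f_c = block_output.mean(dim=(2, 3), keepdim=True)        # (B, C, 1, 1)
        # Eq. (2): channel-weighted average -> single-channel map f
        f = (f_c * block_output).mean(dim=1, keepdim=True)       # (B, 1, H, W)
        # Eq. (3): up-sample to the block-input size and squash to (0, 1)
        a = torch.sigmoid(F.interpolate(f, size=block_input.shape[2:],
                                        mode='bilinear', align_corners=False))
        # Eq. (4): weakening-factor matrix M (alpha where A > theta, 1 elsewhere)
        m = torch.where(a > self.theta,
                        torch.full_like(a, self.alpha),
                        torch.ones_like(a))
        # Bitwise (element-wise) multiplication yields the weakened input I'
        return block_input * m
```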
Weakening factor: In the previous section, we introduced the weakening factor, whose value is a key parameter for model performance. Although this value can be specified manually, a manually chosen value is not necessarily optimal, and finding the optimum requires extensive experiments and substantial computation. To address this problem, we put forward a better solution.
Fig. 5. The effect of different weakening factors α and thresholds θ. (a) The feature-weakening effect of various weakening factors (feature map and α = 0, 0.4, 0.5, 0.6, 0.8). (b) The high-response areas corresponding to varied thresholds (feature map and θ = 0.3, 0.4, 0.5, 0.6, 0.8).
Fig. 5 shows the process of feature weakening and the effect of the weakening factor α and the threshold θ on the high-response region. The weakening is applied only to the high-response features, and different weakening factors correspond to different degrees of coverage, as shown in Fig. 5(a). Further, it is evident from Fig. 5(b) that the size of the high-response area changes with θ. When the considered high-response area is large and the weakening factor is relatively small, the high-response features are completely covered and no useful features remain for the model to exploit. In contrast, when the considered high-response area is small and completely covered, the model is forced to focus on other areas for person recognition, which is more conducive to obtaining comprehensive features. Thus, the weakening factor should have a negative correlation with the size of the considered high-response area.
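As a purely illustrative example, α could be set adaptively from the fraction of high-response pixels so that it follows the stated negative correlation; the linear mapping and the clamping range below are assumptions, not the scheme actually used in the paper.

```python
# Illustrative only: alpha shrinks as the high-response area grows (negative correlation).
import torch


def adaptive_alpha(attention_map, theta=0.5, alpha_min=0.2, alpha_max=0.8):
    """attention_map: (H, W) tensor with values in [0, 1]."""
    area_ratio = (attention_map > theta).float().mean()   # fraction of high-response pixels
    alpha = 1.0 - area_ratio                               # larger area -> smaller alpha
    return float(torch.clamp(alpha, alpha_min, alpha_max))
```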
Fig. 6. Elimination of useless information to enhance valuable features. (a) The original image of a person; (b) the located high-response area; (c) the high-response area enlarged to the same size as the input. When the person image in (a) is input to the network, the feature map in (b) is generated by Weaken Feature Conv Block 4. The high-response area is then located and the background region is discarded. After the cropped person image is enlarged to the input size, it re-enters the network to continue training.
Feature strengthening: In this section, we describe how features carrying more information are strengthened and how irrelevant or interfering information is eliminated.
After passing through the saliency-weakening network shown in Fig. 3 and Fig. 4, Weaken Feature Conv Block 4 outputs features $F \in \mathbb{R}^{2048 \times H \times W}$, where $H$ and $W$ are 1/16 of the original image size. The feature map $A \in \mathbb{R}^{H \times W \times 1}$ is obtained by Eqs. (1) and (2), the sigmoid activation rescales the values of $A$, and the map is then enlarged by Eq. (3) to obtain $A^* \in \mathbb{R}^{H^* \times W^* \times 1}$, which has the same size as the original image. The high-response area of this feature map corresponds to the features of the person, whereas the low-response area corresponds to the background and noise in the picture. To locate the high-response area, we need the indexes of the high-response points in the feature map $A^*$. Assume that the index matrix of the high-response area is $L \in \mathbb{R}^{N \times 2}$, where N is the number of high-response points, the first column of L holds the row indexes, and the second column holds the column indexes. The following operations determine the high-response rectangular region with corners $(x_1, y_1)$, $(x_2, y_1)$, $(x_2, y_2)$, $(x_1, y_2)$:
$x_1 = \max(\min(L[:, 0]), 0)$
$x_2 = \min(\max(L[:, 0]), H^*)$
$y_1 = \max(\min(L[:, 1]), 0)$
$y_2 = \min(\max(L[:, 1]), W^*)$  (5)
By bit-wise multiplication of the original image and $A^*$, Fig. 6(b) is obtained. To remove the background and noise in the original image, we locate the maximal high-response area by Eq. (5), as shown by the red box in Fig. 6(b). Then, the part of the original image corresponding to the red box is enlarged to the size of the original image, as shown in Fig. 6(c). Thus, a new image is obtained that minimizes the influence of the background and highlights the principal features of the person. Subsequently, the image shown in Fig. 6(c) is fed back into the network for a secondary extraction and screening of features, thereby enhancing the useful features.
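A rough sketch of this feature-strengthening step is given below, assuming the attention map A* has already been up-sampled to the input resolution; the threshold used to collect the high-response points and the bilinear resize are assumptions.

```python
# Rough sketch of the feature-strengthening step (Eq. (5) and Fig. 6).
import torch
import torch.nn.functional as F


def strengthen(images, attention, threshold=0.5):
    """images: (B, 3, H, W); attention: (B, 1, H, W) with values in [0, 1]."""
    outputs = []
    for img, att in zip(images, attention):
        rows, cols = torch.nonzero(att[0] > threshold, as_tuple=True)  # index matrix L
        if rows.numel() == 0:            # nothing exceeds the threshold: keep the image
            outputs.append(img)
            continue
        # Eq. (5): tightest rectangle around the high-response points
        x1, x2 = rows.min().item(), rows.max().item()
        y1, y2 = cols.min().item(), cols.max().item()
        crop = img[:, x1:x2 + 1, y1:y2 + 1]
        # Enlarge the cropped person region back to the network input size
        crop = F.interpolate(crop.unsqueeze(0), size=img.shape[1:],
                             mode='bilinear', align_corners=False)
        outputs.append(crop.squeeze(0))
    return torch.stack(outputs)
```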
Fig. 7. Details of the attention generator module.
Algorithm 1: Feature Refinement and Filter Network
Input: Training images I, pretrained hyperparameters W of the feature-weakening backbone network, and hyperparameters of the attention branches.
Output: Backbone hyperparameters W and attention-branch hyperparameters.
Repeat:
1: Obtain the complete pedestrian features F = W(I) through the feature-weakening backbone network.
2: Calculate the feature map A* of F by Eqs. (1)(2)(3).
3: Locate the high-response area of A*.
4: Enlarge the high-response area of the original image to the size of the original image, giving I'.
5: Obtain complete and interference-free features F1 = W(I') and update W by Eqs. (10)(11); calculate the attention maps constrained by Eq. (9).
6: Obtain the multi-branch attention embedding features F2, F3, ..., FN and update the attention-branch hyperparameters by Eqs. (10)(11).
Until convergence or the maximum number of iterations.
B. Feature screening
In the foregoing section, feature screening has already been
mentioned. Here we perform further feature selection with the
aid of multi-branch attention and attention diversity.
Multi-branch attention: Inspired by [40], the multi-branch attention comprises an attention generator (Fig. 7) and attention branches. For the N attention maps to be useful, they should jointly integrate all the information of the output features, so we give up the one-sided channel attention that merely weights important channels as in SENet [32]. Instead, we design an attention that captures the long-range dependence between channels; by integrating the features of each channel, it generates N attention maps to guide the subsequent attention branches. The input feature $F \in \mathbb{R}^{C \times H \times W}$ is reduced to N channels by a convolution layer with kernel size $1 \times 1$ to lower the amount of computation; the result is reshaped, and two such reshaped tensors are matrix-multiplied to compute the dependence between channels, which is then normalized by a sigmoid. The reduced and reshaped input is finally multiplied with this $N \times N$ relation matrix and reshaped into feature maps of size $N \times H \times W$. As shown in Fig. 3, two branches are split after Weaken Feature Conv Block 4: one predicts the ID score, and the other generates N attention maps that are bitwise-multiplied separately with the output of Conv Block 3. The outcomes are fed into the new Conv Block 4 to form N new attention branches. For each branch, the softmax loss with label smoothing regularization and the triplet loss with batch hard sampling are combined for parameter optimization.
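The following sketch shows one possible reading of the attention generator in Fig. 7; the module name, the use of two separate 1×1 convolutions, and the final sigmoid on the output maps are assumptions, since the paper describes the operations only in prose.

```python
# A possible reading of the attention generator (Fig. 7).
import torch
import torch.nn as nn


class AttentionGenerator(nn.Module):
    def __init__(self, in_channels=2048, num_maps=5):
        super().__init__()
        # 1x1 convolutions reduce C channels to N to keep the relation matrix small
        self.reduce_a = nn.Conv2d(in_channels, num_maps, kernel_size=1)
        self.reduce_b = nn.Conv2d(in_channels, num_maps, kernel_size=1)
        self.num_maps = num_maps

    def forward(self, feat):                          # feat: (B, C, H, W)
        b, _, h, w = feat.shape
        a = self.reduce_a(feat).flatten(2)            # (B, N, H*W)
        bmap = self.reduce_b(feat).flatten(2)         # (B, N, H*W)
        # N x N relation matrix capturing the long-range channel dependence
        rel = torch.sigmoid(torch.bmm(a, bmap.transpose(1, 2)))    # (B, N, N)
        # Re-weight the reduced maps by the relation matrix and restore spatial size
        out = torch.bmm(rel, bmap).view(b, self.num_maps, h, w)    # (B, N, H, W)
        return torch.sigmoid(out)                     # N attention maps in (0, 1)
```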
Attention diversity: To make each attention branch extract discriminative features, we apply a hybrid loss function to each branch (described in the next section). If these attention branches are left unconstrained in the design of the network, they will all focus on the most discriminative area of the original image; the areas they attend to will be the same and the features they extract will be roughly identical. As a result, the multi-attention effect is lost and the network degenerates, which we avert by designing an attention diversity loss.
Fig. 8. Training process on the Market-1501 dataset. The curves of Rank-1, mAP and loss during training until convergence are shown.
This paper avoids the overlapping of attention areas by restricting the distance between attention centers. Specifically, we take the position of the maximum response of each attention feature map as the center of that attention, so that the task becomes constraining the centers of the various attention maps not to coincide. Accordingly, we devise a diversity loss function whose value becomes considerably large when the center distance of any two attention maps is excessively small.
To better exert the effect of the diversity loss, the attention map of each branch is expected to follow a Gaussian distribution; that is, the extreme values of an attention map should be concentrated. Thus, the generated attention maps are normalized so that they approximately obey a Gaussian distribution.
For a feature map $A \in \mathbb{R}^{H \times W}$ generated by the attention generator:

$a = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} A(i, j)$  (6)

$\sigma^2 = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} \left(A(i, j) - a\right)^2$  (7)

$\hat{A}(i, j) = \frac{A(i, j) - a}{\sqrt{\sigma^2 + \epsilon}}$  (8)
After the above operation, the maximum-value position of each feature map is taken as the central location of that attention map, and the inter-center distances of the various attention maps are constrained so that the attention centers do not coincide. A reasonable threshold $D$ should be set in light of the number of attention maps and the width of the image: if the center distance is shorter than $D$, the loss becomes larger. Accordingly, we define the loss of attention diversity as:

$loss\_dist = \sum_{batch\_size}\sum_{i \neq j} \max(0, D - d_{ij})$  (9)

where $batch\_size$ denotes the number of training samples in a batch, $i$ and $j$ index different attention maps, and $d_{ij}$ is the distance between the centers of attention maps $i$ and $j$.
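An illustrative implementation of the normalization in Eqs. (6)-(8) and the diversity loss in Eq. (9) might look as follows; using the arg-max position as the attention center follows the text, while the Euclidean center distance and the ε value are assumptions.

```python
# Illustrative implementation of Eqs. (6)-(9).
import torch


def diversity_loss(att_maps, D=2.0, eps=1e-6):
    """att_maps: (B, N, H, W) attention maps produced by the attention generator."""
    b, n, h, w = att_maps.shape
    flat = att_maps.view(b, n, -1)
    mean = flat.mean(dim=2, keepdim=True)                 # Eq. (6)
    var = flat.var(dim=2, unbiased=False, keepdim=True)   # Eq. (7)
    norm = (flat - mean) / torch.sqrt(var + eps)          # Eq. (8)
    # Attention centers = positions of the maximum normalized response
    idx = norm.argmax(dim=2)                              # (B, N)
    centers = torch.stack((torch.div(idx, w, rounding_mode='floor'), idx % w),
                          dim=2).float()                  # (B, N, 2) as (row, col)
    loss = att_maps.new_zeros(())
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = (centers[:, i] - centers[:, j]).norm(dim=1)   # center distance
            loss = loss + torch.clamp(D - d_ij, min=0).sum()     # Eq. (9)
    return loss
```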
Fig. 9. Sensitivity of person re-ID accuracy to (a) the number of attention branches, (b) the threshold D, (c) the weight coefficient λ and (d) the threshold θ. Rank-1 and mAP on Market-1501 are shown.
C. Discussion
In this paper, the saliency-weakening method is used to eliminate background information and obtain more useful local features; weakening rather than directly erasing also keeps the training process stable. After more useful features are obtained, the corresponding region of the original image is cropped and re-entered into the network, which follows the same principle as an attention mechanism in strengthening features: the new input allows features to be extracted, refined and screened once again. Subsequently, each attention branch attends to different features, which is a process of screening dissimilar salient features. Through the above operations, the extraction of complete person features, as well as of diverse local fine-grained features, can be ensured.
The study in [43] shows that the optimal number of attention branches is 3 on person re-identification datasets: too many attention branches attend to background information or cause information redundancy that harms the performance of the model. This paper avoids this situation by obtaining more comprehensive person features, removing background interference through feature strengthening, and finally applying Gaussian normalization and the attention diversity loss.
The implementation of our method is summarized in Algorithm 1, and the convergence of the algorithm during training is shown in Fig. 8.
D. Loss function
For the sake of better generalization, all classification losses are softmax losses with label smoothing regularization [41] combined with triplet losses with batch hard sampling [42]; both are extremely common in person re-identification:

$L_{sof\_LS} = -\frac{1}{N}\sum_{i=1}^{N}\left((1-\varepsilon)g_i + \frac{\varepsilon}{S}\right)\log p_i$  (10)

where $g_i$ is the true label of the sample, $p_i$ is the probability predicted by the network, $\varepsilon$ denotes the smoothing factor, $S$ is the total number of person identity (ID) labels participating in training, and $N$ is the number of pictures in a batch.
Meanwhile, the triplet loss with batch hard sampling selects only the hardest samples within a mini-batch, which ensures stable training and shortens the training time, as shown in Eq. (11):

$L_{tri} = \frac{1}{PQ}\sum_{a \in batch}\left(\max_{p \in A} d_{a,p} - \min_{n \in B} d_{a,n} + m\right)_{+}$  (11)

where a batch contains $P$ person identities, each with $Q$ pictures; for a picture $a$, $A$ is the set of pictures with the same ID and $B$ is the set with different IDs; $d_{a,p}$ is the distance between the features of pictures $a$ and $p$, and $d_{a,n}$ is that between pictures $a$ and $n$; $m$ is a margin threshold; and $(z)_{+}$ is equivalent to $\max(z, 0)$.
In Fig. 3, each branch, including the global branch, uses a combination of the above two loss functions for parameter optimization. The overall loss can be expressed as:

$Loss_{all} = \sum_{i=0}^{M}\left(L_{sof\_LS} + L_{tri}\right) + \lambda \cdot loss\_dist$  (12)

where $M$ is the number of branches, including the global branch, and $\lambda$ is the coefficient that adjusts the weight between the losses.
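A hedged sketch of the three training losses in Eqs. (10)-(12) is given below; the margin value, the default λ and the per-branch bookkeeping are assumptions rather than specifications from the paper.

```python
# Sketch of the label-smoothing softmax, batch-hard triplet, and combined loss.
import torch
import torch.nn.functional as F


def label_smoothing_ce(logits, labels, eps=0.1):
    """Eq. (10): cross entropy with smoothed targets over S identity classes."""
    log_p = F.log_softmax(logits, dim=1)
    num_classes = logits.size(1)
    smooth = torch.full_like(log_p, eps / num_classes)
    smooth.scatter_(1, labels.unsqueeze(1), 1.0 - eps + eps / num_classes)
    return -(smooth * log_p).sum(dim=1).mean()


def batch_hard_triplet(features, labels, margin=0.3):
    """Eq. (11): hardest positive / hardest negative within the batch."""
    dist = torch.cdist(features, features)               # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = dist.masked_fill(~same, float('-inf')).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()


def total_loss(branch_logits, branch_feats, labels, div_loss, lam=0.1):
    """Eq. (12): sum over all branches plus the weighted diversity loss."""
    loss = sum(label_smoothing_ce(lg, labels) + batch_hard_triplet(ft, labels)
               for lg, ft in zip(branch_logits, branch_feats))
    return loss + lam * div_loss
```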
IV. EXPERIMENTS AND RESULTS
To verify the performance of the model herein, we performed
experiments by choosing the public datasets Market-1501 [44],
DukeMTMC-reID [45], CUHK03-NP [46] and MSMT17 [61], which are universally used in re-ID tasks. In the comparison experiments, we adopt the Cumulative Matching Characteristic (CMC) [47] at Rank-1, which takes the label with the highest score as the predicted label when computing accuracy, and the mean Average Precision (mAP), on all the datasets.
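For reference, the two metrics can be computed from a query-gallery distance matrix roughly as follows; the standard camera-ID filtering of the Market-1501 protocol is omitted here for brevity.

```python
# Sketch of CMC Rank-1 and mAP from a precomputed distance matrix.
import numpy as np


def rank1_and_map(dist, q_ids, g_ids):
    """dist: (num_query, num_gallery); q_ids, g_ids: identity labels."""
    rank1_hits, aps = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                          # gallery sorted by distance
        matches = (g_ids[order] == q_ids[i]).astype(np.float32)
        rank1_hits.append(matches[0])                        # Rank-1: is the top match correct?
        hit_positions = np.where(matches == 1)[0]
        if len(hit_positions) == 0:
            continue
        # Average precision: precision evaluated at each true-match position
        precisions = [(k + 1) / (pos + 1) for k, pos in enumerate(hit_positions)]
        aps.append(np.mean(precisions))
    return np.mean(rank1_hits), np.mean(aps)
```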
The Market-1501 dataset was collected on the campus of
Tsinghua University, which includes 1,501 persons captured by
6 cameras. Among them, 751 are in the training set containing
12,936 pictures and 750 are in the test set with 19,732 pictures.
For these 750 persons, an image is randomly selected from each camera as a query, giving at most six queries per person and 3,368 query images in total.
The DukeMTMC-reID dataset is a subset of DukeMTMC,
which incorporates 36,411 pictures of 1,812 people collected
by 8 cameras. There are 1,404 people who appear under two
cameras, and 408 people appearing under simply one camera.
The training set consists of 16,522 pictures from 702 people,
while 17,661 pictures from another 702 people are used as the
test set. For the 702 people in the test set, an image is selected
randomly from each camera as a query, with 2,228 images in
total.
TABLE II. COMPARISON OF OUR PROPOSED METHOD WITH THE STATE-OF-THE-ART METHODS ON THE MARKET-1501 DATASET

Method | mAP(%) | Rank-1(%)
DuATM (CVPR'18) [48] | 76.2 | 91.3
MLFN (CVPR'18) [49] | 74.4 | 90.0
MGCAM (CVPR'18) [50] | 74.25 | 83.55
HA-CNN (CVPR'18) [18] | 75.7 | 91.2
KPM (CVPR'18) [51] | 75.3 | 90.1
CRF (CVPR'18) [52] | 81.6 | 93.5
AACN (CVPR'18) [7] | 66.9 | 85.9
Mancs (ECCV'18) [53] | 82.3 | 93.1
PCB+RPP (ECCV'18) [10] | 81.6 | 93.8
ABD-Net (ICCV'19) [38] | 88.28 | 95.6
BDB (ICCV'19) [30] | 86.7 | 95.3
SONA (ICCV'19) [54] | 88.67 | 95.68
Auto-ReID (ICCV'19) [55] | 85.1 | 94.5
OSNet (ICCV'19) [56] | 84.9 | 94.8
IAN (CVPR'19) [57] | 83.1 | 94.4
CAMA (CVPR'19) [43] | 84.5 | 94.7
MHN (CVPR'19) [58] | 85.0 | 95.1
Pyramid (CVPR'19) [13] | 88.2 | 95.7
BNNeck (CVPR'19) [59] | 85.9 | 94.5
CASN (CVPR'19) [29] | 83.2 | 94.8
AANet-50 (CVPR'19) [7] | 82.45 | 93.89
SIF (TIP'20) [60] | 87.6 | 95.2
st-ReID+RE [71] | 87.6 | 98.1
PAN (TCSVT'18) [62] | 69.33 | 86.67
BISAA (TCSVT'19) [63] | 75.1 | 92.1
CRAN (TCSVT'19) [64] | 84.9 | 94.9
HPDN [67] | 82.3 | 95.2
SMC+ECD [68] | 59.73 | 80.38
DropEasy [69] | 78.3 | 93.8
Gconv [73] | 72.3 | 88.1
BNNeck+k-reciprocal [72] | 92.4 | 95.2
BNNeck+END [70]+Jaccard | 93.2 | 95.5
Ours | 89.6 | 96.7
Ours+k-reciprocal [72] | 92.7 | 97.8
Ours+ECN [8] | 93.6 | 97.8
Ours+END+Jaccard [70] | 94.2 | 98.3
The CUHK03-NP dataset contains 14,097 pictures of 1,467 individuals. We evaluate the re-ID performance under the new training/testing protocol proposed in [19], which is more challenging. First, the new protocol has a larger pool of candidates, covering 5,332 images of 700 identities, whereas the original protocol had only 100 images of 100 identities. Second, the new protocol has a smaller training set (767 identities) than the original protocol (1,467 identities). In the "Detected" set, all bounding boxes are generated by DPM, while in the "Labeled" set the boxes are drawn manually. We report results on both sets.
MSMT17 is currently the largest public person re-identification dataset, consisting of 126,441 images of 4,101 identities captured by 12 outdoor cameras and 3 indoor cameras. The dataset covers different weather conditions and three time periods: morning, noon and afternoon. It contains 30,284 training images of 1,041 identities, 11,659 query images for testing, and 82,161 gallery images of 3,060 identities.
TABLE III. COMPARISON OF OUR PROPOSED METHOD WITH THE STATE-OF-THE-ART METHODS ON THE DUKEMTMC-REID DATASET

Method | mAP(%) | Rank-1(%)
DuATM (CVPR'18) [48] | 64.6 | 81.8
MLFN (CVPR'18) [49] | 62.8 | 81.2
HA-CNN (CVPR'18) [18] | 63.8 | 80.5
AACN (CVPR'18) [7] | 59.25 | 76.84
PCB+RPP (ECCV'18) [10] | 69.2 | 83.3
Mancs (ECCV'18) [53] | 71.8 | 84.9
IAN (CVPR'19) [57] | 73.4 | 87.1
CAMA (CVPR'19) [43] | 72.9 | 85.8
MHN (CVPR'19) [58] | 77.2 | 89.1
BNNeck (CVPR'19) [59] | 76.4 | 86.4
AANet-50 (CVPR'19) [7] | 72.56 | 86.42
CASN (CVPR'19) [29] | 73.7 | 87.7
ABD-Net (ICCV'19) [38] | 78.59 | 89.0
BDB (ICCV'19) [30] | 76.0 | 89.0
SONA (ICCV'19) [54] | 78.05 | 89.25
Auto-ReID (ICCV'19) [55] | 75.1 | 88.5
OSNet (ICCV'19) [56] | 73.5 | 88.6
SIF (TIP'20) [60] | 79.4 | 89.8
st-ReID+RE [71] | 83.9 | 94.4
PAN (TCSVT'18) [62] | 51.51 | 71.59
CRAN (TCSVT'19) [64] | 74.7 | 87.6
HPDN [67] | 68.0 | 83.6
DropEasy [69] | 58.9 | 78.6
Gconv [73] | 61.7 | 77.3
BNNeck+k-reciprocal [72] | 89.1 | 90.0
BNNeck+END [70]+Jaccard | 87.9 | 91.6
Ours | 81.6 | 93.4
Ours+k-reciprocal [72] | 87.7 | 94.2
Ours+ECN [8] | 86.3 | 94.5
Ours+END+Jaccard [70] | 90.3 | 94.7
A. Implementation Details
ResNet-50, pre-trained on ImageNet, is chosen to build the backbone network. As in PCB [10], the downsampling in ResNet Block 4 is removed, and the batch size is set to 32. Pictures are uniformly resized to 384 × 128, and the data are augmented by random flipping and random erasing. The smoothing factor ε in Eq. (10) is set to 0.1. We set the number of attention branches to 5 and the attention-distance threshold D to 2. On an NVIDIA GTX-1080Ti GPU, 300 epochs are trained at an initial learning rate
of 0.00035, which is decayed by half every 50 epochs. In the
test phase, the global features are combined with those of all
attention branches, which serve as the final query features for
score matching.
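A minimal sketch of this test-time matching is shown below; the L2 normalization before distance computation is an assumption, as the paper does not specify it.

```python
# Test-time matching: concatenate global and attention-branch features, rank the gallery.
import torch


def extract_query_feature(global_feat, branch_feats):
    """global_feat: (B, D); branch_feats: list of (B, d_i) tensors."""
    feat = torch.cat([global_feat] + list(branch_feats), dim=1)
    return torch.nn.functional.normalize(feat, dim=1)    # L2-normalize before matching


def rank_gallery(query_feat, gallery_feats):
    # Smaller Euclidean distance = better match
    dist = torch.cdist(query_feat, gallery_feats)         # (num_query, num_gallery)
    return dist.argsort(dim=1)
```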
TABLE IV. COMPARISON OF OUR PROPOSED METHOD WITH THE STATE-OF-THE-ART METHODS ON THE CUHK03 DATASET

Method | Labeled mAP(%) | Labeled Rank-1(%) | Detected mAP(%) | Detected Rank-1(%)
HA-CNN (CVPR'18) [18] | 41.0 | 44.4 | 38.6 | 41.7
MLFN (CVPR'18) [49] | 49.2 | 54.7 | 47.8 | 52.8
DaRe+RE (CVPR'18) [21] | 61.6 | 66.1 | 59.0 | 63.3
PCB+RPP (ECCV'18) [10] | - | - | 57.5 | 63.7
Mancs (ECCV'18) [53] | 63.9 | 69.0 | 60.5 | 65.5
MGN (ACM MM'18) [20] | 67.4 | 68.0 | 66.0 | 66.8
BDB (ICCV'19) [30] | 76.7 | 79.4 | 73.5 | 76.4
Auto-ReID (ICCV'19) [55] | 73.0 | 77.9 | 69.3 | 73.3
CASN (CVPR'19) [29] | 68.0 | 73.7 | 64.4 | 71.5
MHN (CVPR'19) [58] | 72.4 | 77.2 | 65.4 | 71.7
SONA (ICCV'19) [54] | 79.23 | 81.85 | 76.35 | 79.10
SIF (TIP'20) [60] | 77.0 | 79.5 | 73.9 | 76.6
PAN (TCSVT'18) [62] | 45.8 | 43.9 | 43.8 | 41.9
CRAN (TCSVT'19) [64] | 68.2 | 72.7 | 64.9 | 69.9
HPDN [67] | 58.2 | 64.3 | 56.8 | 63.1
SMC+ECD [68] | - | - | - | 71.76
DropEasy [69] | - | - | 50.4 | 55.9
Gconv [73] | 91.5 | 85.9 | 89.0 | 83.1
Ours | 81.2 | 85.1 | 78.3 | 80.9
Ours+k-reciprocal [72] | 84.5 | 87.2 | 79.3 | 82.6
Ours+ECN [8] | 84.6 | 87.1 | 80.2 | 83.4
Ours+END+Jaccard [70] | 84.9 | 88.6 | 80.6 | 84.2
TABLE V. COMPARISON OF OUR PROPOSED METHOD WITH THE STATE-OF-THE-ART METHODS ON THE MSMT17 DATASET

Method | mAP(%) | Rank-1(%)
Auto-ReID (ICCV'19) [55] | 52.5 | 78.2
OSNet (ICCV'19) [56] | 52.9 | 78.7
IAN (CVPR'19) [57] | 46.8 | 75.5
HA-CNN (CVPR'18) [18] | 37.2 | 64.7
BISAA (TCSVT'19) [63] | 39.1 | 68.7
CRAN (TCSVT'19) [64] | 52.4 | 78.7
DI-REID (CVPR'20) [65] | 47.1 | 75.5
RGA-SC (CVPR'20) [66] | 57.5 | 80.3
Ours | 59.6 | 81.5
Ours+k-reciprocal [72] | 63.9 | 83.2
Ours+ECN [8] | 65.5 | 84.6
Ours+END+Jaccard [70] | 66.7 | 84.9
TABLE VI. RESULTS OF ABLATION STUDY ON MARKET-1501

ResNet-50 | Weak Salient Feature | Feature strengthening | Multi-attention branches | mAP(%) | Rank-1(%) | Time
✓ |   |   |   | 68.2 | 84.2 | 35.6
✓ | ✓ |   |   | 72.8 | 87.9 | 38.3
✓ | ✓ | ✓ |   | 78.5 | 89.5 | 70.8
✓ | ✓ |   | ✓ | 86.1 | 94.9 | 64.2
✓ | ✓ | ✓ | ✓ | 89.6 | 96.7 | 78.6
Note: Time denotes the running time per epoch (s).
B. Parameter analysis
We analyze some crucial parameters of our method on the Market-1501 dataset; the same optimized parameters are then used for all the datasets.
a) The number of attention branches: A vital parameter of our method is the number of multi-attention branches. Fig. 9(a) shows that without the attention branches, the model cannot extract fine-grained features and the distinguishability of the features is insufficient. Hence, adding attention branches remarkably improves the performance of the model. As the number of branches increases, the accuracy first improves and reaches its optimum at 5 branches; beyond that, the additional branches generate redundant features and the model performance dwindles slightly.
b) The value of threshold D: The attention diversity loss contains a parameter D. We evaluate the sensitivity of the accuracy to D, as shown in Fig. 9(b). The size of the attention map is 24 × 8, so we report Rank-1 and mAP while changing D from 0.5 to 5. The best performance is obtained at D = 2.
c) The value of the weight coefficient λ: the coefficient λ used in Eq. (12) is an important parameter of the loss function. As Fig. 9(c) shows, when λ varies over different magnitudes, the performance of the model first increases and then decreases; the best precision is obtained when λ is set to 0.1. If λ is too large, it limits the learning ability of the network, which highlights the necessity of a proper λ.
d) The value of the threshold θ: In Eq. (4), the threshold θ determines whether a region is treated as high response. The weakening of salient features is applied to ResNet-50 and evaluated on the Market-1501 dataset; the result is shown in Fig. 9(d). The best performance occurs at around θ = 0.4 and θ = 0.5, with similar behavior on the other datasets.
Comparison with state-of-the-art methods: On the four datasets above, the performance of our model is compared with algorithms published in top conferences and journals in recent years; the comparison results are listed in Tables II-V. As the data show, the proposed method is superior to state-of-the-art methods including DuATM [48], MLFN [49], HA-CNN [18], AACN [7], Mancs [53], PCB [10], ABD-Net [38], AANet-50 [7], CASN [29], MGN [20], CAMA [43], and DropEasy [69].
In addition, to obtain better results, we adopt several re-ranking methods, namely k-reciprocal [72], ECN [8] and END+Jaccard [70]. After re-ranking, mAP/Rank-1 reaches 94.2%/98.3%, 90.3%/94.7%, 84.2%/88.3% and 66.7%/84.9% on Market-1501, DukeMTMC-reID, CUHK03 (Labeled) and MSMT17, respectively. Among the compared methods, the Gabor-based deep learning model proposed in [73] has an advantage over traditional convolution; its results on the CUHK03 dataset are better than ours, but its results on Market-1501 and DukeMTMC-reID are worse than ours. Without re-ranking, our method achieves the best results on the MSMT17 dataset. The performance of our method is comparable to that of st-ReID+RE [71] on Market-1501 and slightly worse on DukeMTMC-reID. This is because our method, like most methods, uses purely visual information, whereas st-ReID+RE introduces spatio-temporal information; by fusing multimodal information (spatio-temporal and visual), it achieves an effect that surpasses all purely visual supervised methods on the Market-1501 and DukeMTMC-reID datasets.
C. Visualization
Visualization of feature maps for the global and multi-attention branches: Fig. 10 displays the visualization of the global feature and the 5 attention branches for 3 sample pictures. It is clear that the global branch attends to essentially all useful features of the person, while each of the five attention branches focuses on different local features. This suggests that the proposed method learns global information together with diverse local information, which is the reason for the model's high performance.
Fig. 10 Visualization for the global branch and 5 attention branches. The
proposed architecture allows the model to learn global and diverse local detailed
features.
D. Ablation Studies
According to Tables II-V, our method attains competitive results on Market-1501, DukeMTMC-reID, CUHK03 and MSMT17. There are two probable causes: (1) the weak-salient-feature and feature-strengthening methods enable the model to remove background information and acquire more comprehensive person features; (2) the attention branches and branch diversity help the model acquire more detailed local information. Furthermore, combining the global features with the local features of the multi-attention branches makes the features more discriminative and distinguishable. As can be seen from Table VI, as the network complexity increases, the performance of the model improves, and the corresponding training time also increases. The average inference time for an image is 392 ms.
Benefit of Weak Salient Feature: Under the same conditions as in [48], with a batch size of 64, ResNet-50 reaches mAP/Rank-1 of 71.4%/87.5%. In this paper, with the batch size set to 32, ResNet-50 reaches only 68.2%/84.2%. As indicated in Table VI, after the addition of the Weak Salient Feature, mAP and Rank-1 increase by 4.6% and 3.7%, respectively. The global branch in Fig. 10 displays the activated feature map after feature weakening, and the model is seen to focus on more features. After feature strengthening is added, mAP/Rank-1 further increases by 5.7%/1.6%, while removing the Weak Salient Feature operation from our model reduces mAP/Rank-1 by 3.3%/1.6%, indicating the powerful role of this operation. Because the feature weakening module does not introduce new parameters, the training time increases little compared with ResNet-50.
The impact of feature strengthening: We compared the effect of adding the feature strengthening module to the backbone network and of deleting it from the whole network, and found that the module is beneficial. After adding the feature strengthening module, the performance of the backbone network improves from mAP/Rank-1 = 72.8%/87.9% to 78.5%/89.5%. After removing it from the whole network, mAP and Rank-1 decrease by 3.3% and 1.6%, respectively. However, the feature strengthening module also significantly increases the training time.
The impact of multi-attention branches: Since diverse attention branches play a key role in this work, we add the multi-branch attention separately behind the Weak Salient Feature model and behind the full model, as shown in the last two rows of Table VI. Their mAP/Rank-1 are 13.2%/7.0% and 11.1%/7.2% higher than before adding the attention. From the heat maps of the attention branches in Fig. 10, the multiple branches attend to different local information, which realizes the diversity of model features and greatly improves performance. The attention branches also increase the number of parameters, so the training time increases significantly.
V. CONCLUSION
This paper reports a feature selection network that combines
global and local fine-grained features to realize person re-
identification. The proposed model explores more valuable
features by weakening the salient features, and obtaining
diverse fine-grained features after eliminating interference
information. Through experiments, the state-of-the-art
performance of the Feature refinement and filter network on the
mainstream datasets for person re-identification is verified.
In the future, we expect to apply the proposed method fused
with the temporal attention mechanism to video-based person
re-identification tasks, to identify a person in different frames.
Additionally, we intend to explore the feasibility of using this
method for other deep learning-based tasks.
VI. ACKNOWLEDGMENT
This work is supported by the National Natural Science Foundation of China (Grant No. 61901436).
REFERENCES
[1] S. Li, H. Yu, and R. Hu, “Attributes-aided part detection and
refinement for person re-identification,” Pattern Recognit., vol. 97,
2020.
[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” Proc. IEEE Comput. Soc. Conf. Comput. Vis.
Pattern Recognit., vol. 2016-Decem, pp. 770778, 2016.
This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication.
The final version of record is available at http://dx.doi.org/10.1109/TCSVT.2020.3043026
Copyright (c) 2020 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) <
11
[3] C. Szegedy et al., “Going deeper with convolutions,” in Proceedings
of the IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, 2015.
[4] K. Simonyan and A. Zisserman, “Very deep convolutional networks
for large-scale image recognition,” 3rd Int. Conf. Learn. Represent.
ICLR 2015 - Conf. Track Proc., pp. 114, 2015.
[5] K. Li, Z. Wu, K. C. Peng, J. Ernst, and Y. Fu, “Tell Me Where to
Look: Guided Attention Inference Network,” Proc. IEEE Comput.
Soc. Conf. Comput. Vis. Pattern Recognit., pp. 92159223, 2018.
[6] M. Tian et al., “Eliminating Background-bias for Robust Person Re-
identification,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern
Recognit., pp. 57945803, 2018.
[7] C. P. Tay, S. Roy, and K. H. Yap, “AANet: Attribute attention network
for person re-identifications,” Proc. IEEE Comput. Soc. Conf.
Comput. Vis. Pattern Recognit., vol. 2019-June, pp. 7127–7136, 2019.
[8] M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen, “A Pose-
Sensitive Embedding for Person Re-identification with Expanded
Cross Neighborhood Re-ranking,” Proc. IEEE Comput. Soc. Conf.
Comput. Vis. Pattern Recognit., pp. 420429, 2018.
[9] M. M. Kalayeh, E. Basaran, M. Gokmen, M. E. Kamasak, and M.
Shah, “Human Semantic Parsing for Person Re-identification,” Proc.
IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 1062
1071, 2018.
[10] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, “Beyond Part
Models: Person Retrieval with Refined Part Pooling (and A Strong
Convolutional Baseline),” Lect. Notes Comput. Sci. (including
Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol.
11208 LNCS, pp. 501–518, 2018.
[11] L. Sun, J. Liu, Y. Zhu, and Z. Jiang, “Local to Global with Multi-
Scale Attention Network for Person Re-Identification,” Proc. - Int.
Conf. Image Process. ICIP, vol. 2019-September, pp. 22542258,
2019.
[12] X. Sun, N. Zhang, Q. Chen, Y. Cao, and B. Liu, “People re-identification
by multi-branch CNN with multi-scale features,” 2019 IEEE Int. Conf. Image Process., pp.
2269–2273, 2019.
[13] F. Zheng et al., “Pyramidal person re-identification via multi-loss
dynamic training,” Proc. IEEE Comput. Soc. Conf. Comput. Vis.
Pattern Recognit., vol. 2019-June, pp. 8506–8514, 2019.
[14] S. Karanam, M. Gou, Z. Wu, A. Rates-Borras, O. Camps, and R. J.
Radke, “A Systematic Evaluation and Benchmark for Person Re-
Identification: Features, Metrics, and Datasets,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 41, no. 3, pp. 523536, 2019.
[15] L. Zheng, Y. Yang, and A. G. Hauptmann, “Person Re-identification:
Past, Present and Future,” arXiv Comput. Sci., vol. 14, no. 8, pp. 1
20, 2016.
[16] L. He, J. Liang, H. Li, and Z. Sun, “Deep Spatial Feature
Reconstruction for Partial Person Re-identification: Alignment-free
Approach,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern
Recognit., vol. 2, pp. 70737082, 2018.
[17] J. Xu, R. Zhao, F. Zhu, H. Wang, and W. Ouyang, “Attention-Aware
Compositional Network for Person Re-identification,” Proc. IEEE
Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 21192128,
2018.
[18] W. Li, X. Zhu, and S. Gong, “Harmonious Attention Network for
Person Re-identification,” Proc. IEEE Comput. Soc. Conf. Comput.
Vis. Pattern Recognit., no. I, pp. 22852294, 2018.
[19] Z. Zhang, C. Lan, W. Zeng, and Z. Chen, “Densely semantically
aligned person re-identification,” Proc. IEEE Comput. Soc. Conf.
Comput. Vis. Pattern Recognit., vol. 2019-June, no. d, pp. 667676,
2019.
[20] G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou, “Learning
discriminative features with multiple granularities for person re-
identification,” MM 2018 - Proc. 2018 ACM Multimed. Conf., pp.
274282, 2018.
[21] Y. Wang et al., “Resource Aware Person Re-identification Across
Multiple Resolutions,” Proc. IEEE Comput. Soc. Conf. Comput. Vis.
Pattern Recognit., pp. 80428051, 2018.
[22] B. Xie, X. Wu, S. Zhang, S. Zhao, and M. Li, “Learning Diverse
Features with Part-Level Resolution for Person Re-Identification.
(arXiv:2001.07442v1 [cs.CV]),” arXiv Comput. Sci., pp. 18, 2019.
[23] Y. Zhu, X. Guo, J. Liu, and Z. Jiang, “Multi-branch context-aware
network for person re-identification,” 2019 IEEE Int. Conf. Image Process., pp.
2274–2278, 2019.
[24] H. Guo, H. Wu, C. Zhao, H. Zhang, J. Wang, and H. Lu, “Cascade
attention network for person re-identification,” 2019 IEEE Int. Conf. Image Process., pp.
2264–2268, 2019.
[25] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random Erasing
Data Augmentation,” Proc. AAAI Conf. Artif. Intell., vol. 34, no. 07,
pp. 1300113008, 2020.
[26] T. DeVries and G. W. Taylor, “Improved Regularization of
Convolutional Neural Networks with Cutout,” arXiv, no.
1708.04552v2, 2017.
[27] T. Hu, H. Qi, Q. Huang, and Y. Lu, “See Better Before Looking
Closer: Weakly Supervised Data Augmentation Network for Fine-
Grained Visual Classification,” no. 1, 2019.
[28] K. K. Singh and Y. J. Lee, “Hide-and-Seek: Forcing a Network to be
Meticulous for Weakly-Supervised Object and Action Localization,”
Proc. IEEE Int. Conf. Comput. Vis., vol. 2017-October, pp. 3544
3553, 2017.
[29] M. Zheng, S. Karanam, Z. Wu, and R. J. Radke, “Re-identification
with consistent attentive siamese networks,” Proc. IEEE Comput. Soc.
Conf. Comput. Vis. Pattern Recognit., vol. 2019-June, pp. 57285737,
2019.
[30] Z. Dai, M. Chen, X. Gu, S. Zhu, and P. Tan, “Batch dropblock
network for person re-identification and beyond,” Proc. IEEE Int.
Conf. Comput. Vis., vol. 2019-October, pp. 36903700, 2019.
[31] M. Zheng, S. Karanam, T. Chen, R. J. Radke, and Z. Wu, “Learning
Similarity Attention,” arXiv, 2019.
[32] J. Hu, L. Shen, and G. Sun, “Squeeze-and-Excitation Networks,” in
Proceedings of the IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, 2018.
[33] S. Woo, J. Park, J. Y. Lee, and I. S. Kweon, “CBAM: Convolutional
block attention module,” Lect. Notes Comput. Sci. (including Subser.
Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 11211
LNCS, pp. 319, 2018.
[34] Q. Huang et al., “Semantic segmentation with reverse attention,” Br.
Mach. Vis. Conf. 2017, BMVC 2017, 2017.
[35] S. Chen, X. Tan, B. Wang, and X. Hu, “Reverse attention for salient
object detection,” Lect. Notes Comput. Sci. (including Subser. Lect.
Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 11213 LNCS, pp.
236252, 2018.
[36] S. Liu, L. Qi, Y. Zhang, and W. Shi, “Dual Reverse Attention
Networks for Person Re-Identification,” 2019 IEEE Int. Conf. Image
Process., pp. 12321236, 2019.
[37] Y. Sun, L. Zheng, W. Deng, and S. Wang, “SVDNet for Pedestrian
Retrieval,” in Proceedings of the IEEE International Conference on
Computer Vision, 2017.
[38] T. Chen et al., “ABD-net: Attentive but diverse person re-
identification,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2019-Octob,
pp. 8350–8360, 2019.
[39] S. Li, S. Bak, P. Carr, and X. Wang, “Diversity Regularized
Spatiotemporal Attention for Video-Based Person Re-identification,”
Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp.
369378, 2018.
[40] H. Fukui, T. Hirakawa, T. Yamashita, and H. Fujiyoshi, “Attention
branch network: Learning of attention mechanism for visual
explanation,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern
Recognit., vol. 2019-June, no. January, pp. 1069710706, 2019.
[41] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna,
“Rethinking the Inception Architecture for Computer Vision,” in
Proceedings of the IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, 2016.
[42] A. Hermans, L. Beyer, and B. Leibe, “In Defense of the Triplet Loss
for Person Re-Identification,” CoRR, vol. abs/1703.0, 2017.
[43] W. Yang, H. Huang, Z. Zhang, X. Chen, K. Huang, and S. Zhang,
“Towards rich feature discovery with class activation maps
augmentation for person re-identification,” in Proceedings of the
IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, 2019.
[44] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable
person re-identification: A benchmark,” Proc. IEEE Int. Conf.
Comput. Vis., vol. 2015 Inter, no. November, pp. 11161124, 2015.
[45] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi,
“Performance measures and a data set for multi-target, multi-camera
tracking,” in Lecture Notes in Computer Science (including subseries
Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), 2016.
[46] W. Li, R. Zhao, T. Xiao, and X. Wang, “DeepReID: Deep filter
pairing neural network for person re-identification,” in Proceedings
of the IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, 2014.
[47] R. M. Bolle, J. H. Connell, S. Pankanti, N. K. Ratha, and A. W. Senior,
“The relation between the ROC curve and the CMC,” in Proceedings
- Fourth IEEE Workshop on Automatic Identification Advanced
Technologies, AUTO ID 2005, 2005.
[48] J. Si et al., “Dual Attention Matching Network for Context-Aware
Feature Sequence Based Person Re-identification,” Proc. IEEE
Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 53635372,
2018.
[49] X. Chang, T. M. Hospedales, and T. Xiang, “Multi-level
Factorisation Net for Person Re-identification,” Proc. IEEE Comput.
Soc. Conf. Comput. Vis. Pattern Recognit., pp. 21092118, 2018.
[50] C. Song, Y. Huang, W. Ouyang, and L. Wang, “Mask-Guided
Contrastive Attention Model for Person Re-identification,” Proc.
IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 1179
1188, 2018.
[51] Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang, “End-to-End Deep
Kronecker-Product Matching for Person Re-identification,” Proc.
IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 6886
6895, 2018.
[52] D. Chen, D. Xu, H. Li, N. Sebe, and X. Wang, “Group Consistent
Similarity Learning via Deep CRF for Person Re-identification,”
Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp.
86498658, 2018.
[53] C. Wang, Q. Zhang, C. Huang, W. Liu, and X. Wang, “Mancs: A
Multi-task Attentional Network with Curriculum Sampling for
Person Re-Identification,” Lect. Notes Comput. Sci. (including
Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol.
11208 LNCS, pp. 384400, 2018.
[54] B. Bryan, Y. Gong, Y. Zhang, and C. Poellabauer, “Second-order
non-local attention networks for person re-identification,” Proc. IEEE
Int. Conf. Comput. Vis., vol. 2019-Octob, no. November, pp. 3759
3768, 2019.
[55] R. Quan, X. Dong, Y. Wu, L. Zhu, and Y. Yang, “Auto-reID:
Searching for a part-aware convnet for person re-identification,” Proc.
IEEE Int. Conf. Comput. Vis., vol. 2019-October, pp. 37493758,
2019.
[56] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang, “Omni-scale feature
learning for person re-identification,” Proc. IEEE Int. Conf. Comput.
Vis., vol. 2019-October, no. d, pp. 37013711, 2019.
[57] R. Hou, B. Ma, H. Chang, X. Gu, S. Shan, and X. Chen, “Interaction-
and-aggregation network for person re-identification,” Proc. IEEE
Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 9317–9326, 2019.
[58] B. Chen, W. Deng, and J. Hu, “Mixed high-order attention network
for person re-identification,” Proc. IEEE Int. Conf. Comput. Vis., vol.
2019-October, pp. 371381, 2019.
[59] H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang, “Bag of tricks and a
strong baseline for deep person re-identification,” IEEE Comput. Soc.
Conf. Comput. Vis. Pattern Recognit. Work., vol. 2019-June, pp.
14871495, 2019.
[60] L. Wei et al., "SIF: Self-Inspirited Feature Learning for Person Re-
Identification," in IEEE Transactions on Image Processing, vol. 29,
pp. 4942-4951, 2020.
[61] L. Wei, S. Zhang, W. Gao, and Q. Tian, “Person Transfer GAN to
Bridge Domain Gap for Person Re-identification,” Proc. IEEE
Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 7988, 2018.
[62] Z. Zheng, L. Zheng, and Y. Yang, “Pedestrian alignment network for
large-scale person re-identification,” IEEE Trans. Circuits Syst.
Video Technol., vol. 29, no. 10, pp. 30373045, 2019.
[63] X. Liu, S. Bi, S. Fang, and A. Bouridane, “Bayesian Inferred Self-
Attentive Aggregation for Multi-Shot Person Re-Identification,”
IEEE Trans. Circuits Syst. Video Technol., vol. PP, no. c, pp. 11,
2019.
[64] C. Han, R. Zheng, C. Gao, and N. Sang, “Complementation-
Reinforced Attention Network for Person Re-Identification,” IEEE
Trans. Circuits Syst. Video Technol., vol. PP, no. XX, pp. 11, 2019.
[65] Y. Huang, Z.-J. Zha, X. Fu, R. Hong, and L. Li, “Real-world Person
Re-Identification via Degradation Invariance Learning,” pp. 14084–
14094, Apr. 2020.
[66] Z. Zhang, C. Lan, W. Zeng, X. Jin, and Z. Chen, “Relation-Aware
Global Attention for Person Re-identification,” 2019.
[67] H. Wang, T. Fang, Y. Fan, and W. Wu, “Person re-identification
based on dropeasy method,” IEEE Access, vol. 7, pp. 97021–97031,
2019.
[68] A. Borgia, Y. Hua, E. Kodirov, and N. M. Robertson, “Cross-View
Discriminative Feature Learning for Person Re-Identification,” IEEE
Trans. Image Process., vol. 27, no. 11, pp. 53385349, 2018.
[69] H. Wang, T. Fang, Y. Fan, and W. Wu, “Person re-identification
based on dropeasy method,” IEEE Access, vol. 7, pp. 97021–97031,
2019.
[70] J. Lv, Z. Li, K. Nai, Y. Chen, and J. Yuan, “Person re-identification
with expanded neighborhoods distance re-ranking,” Image Vis.
Comput., vol. 95, p. 103875, 2020.
[71] G. Wang, J. Lai, P. Huang, and X. Xie, “Spatial-Temporal Person Re-
Identification,” Proc. AAAI Conf. Artif. Intell., vol. 33, pp. 8933–
8940, 2019.
[72] J. V. C. I. R et al., “Person re-identification based on re-ranking with
expanded k-reciprocal nearest neighbors,” J. Vis. Commun. Image
Represent., vol. 58, pp. 486–494, 2019.
[73] Y. Yuan, J. Zhang, and Q. Wang, “Deep Gabor convolution network
for person re-identification,” Neurocomputing, vol. 378, pp. 387–398,
2020.
[74] Y. Yuan, Z. Xiong, and Q. Wang, “VSSA-NET: Vertical Spatial
Sequence Attention Network for Traffic Sign Detection,” IEEE Trans.
Image Process., vol. 28, no. 7, pp. 34233434, 2019.
Xin Ning received his Ph.D. in 2017 from
Institute of Semiconductors, Chinese
Academy of Sciences. He is currently an
Assistant Professor of Artificial
Intelligence at the Institute of
Semiconductors, Chinese Academy of
Sciences. His research interests include
deep learning, machine art, pattern
recognition, and image cognitive
computation. He is a member of IEEE.
Ke Gong received his bachelor's degree from
China University of Petroleum (Beijing) in
2018. He is currently working at Beijing
Wave Security Technology Company
Limited, Cognitive Computing Technology
Joint Laboratory, Wave Group.
Weijun Li received his Ph.D. in 2004 from
Institute of Semiconductors, Chinese
Academy of Sciences. He is currently a
Professor of Artificial Intelligence at the
Institute of Semiconductors, Chinese
Academy of Sciences (ISCAS) and the
University of Chinese Academy of
Sciences. He is in charge of the Artificial
Intelligence Research Center of ISCAS and is
also the Director of the Lab of High-speed Circuits & Neural
Networks of ISCAS. His research interests include deep
modeling, machine art, pattern recognition, artificial neural
networks, and intelligent systems. He is a senior member of IEEE.
Liping Zhang received her Ph.D. from
Institute of Semiconductors, Chinese
Academy of Sciences in 2018. Currently,
she is an assistant research fellow in the
Laboratory of High-speed Circuit and
Artificial Neural networks at Institute of
Semiconductors, Chinese Academy of
Sciences. Her research interests include
biometrics and pattern analysis. She is a
member of IEEE.
Xiao Bai received the B.Eng. degree in
computer science from Beihang
University of China, Beijing, China, in
2001, and the Ph.D. degree in computer
science from the University of York,
York, U.K., in 2006.
He was a Research Officer (Fellow,
Scientist) with the Computer Science
Department, University of Bath, until 2008. He is currently a
Full Professor with the School of Computer Science and
Engineering, Beihang University. He has authored or co-
authored more than 100 papers in journals and refereed
conferences. His current research interests include pattern
recognition, image processing, and remote sensing image
analysis. He is an Associate Editor of the journals Pattern
Recognition and Signal Processing.
Shengwei Tian received the Ph.D. degree
in computer science and technology from
Xinjiang University in 2010. He is
currently a Professor with the Xinjiang
University of Technology.
His research interests include intelligent
computing, image processing, and natural
language processing.