Feature Refinement and Filter Network for Person Re-identification

Xin Ning, Member, IEEE, Ke Gong, Weijun Li, Senior Member, IEEE, Liping Zhang, Member, IEEE, Xiao Bai, and Shengwei Tian

Manuscript submitted on April 27, 2020. This work was supported by the National Natural Science Foundation of China (Grant No. 61901436) and Shenzhen Wave Kingdom Co., Ltd. (Corresponding authors: Ke Gong and Weijun Li.)
X. Ning and L. Zhang are with the Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China, and the Cognitive Computing Technology Joint Laboratory, Wave Group, Beijing 102208, China (e-mail: ningxin@semi.ac.cn; zliping@semi.ac.cn).
K. Gong is with the Cognitive Computing Technology Joint Laboratory, Wave Group, Beijing 102208, China (e-mail: gongke@wavewisdom-bj.com).
W. Li is with the Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China, the Center of Materials Science and Optoelectronics Engineering & School of Microelectronics, University of Chinese Academy of Sciences, Beijing 100049, China, and the Cognitive Computing Technology Joint Laboratory, Wave Group, Beijing 102208, China (e-mail: wjli@semi.ac.cn).
X. Bai is with the School of Computer Science and Engineering, Beihang University, Beijing 100191, China (e-mail: baixiao@buaa.edu.cn).
S. Tian is with the School of Software, Xinjiang University, Xinjiang 830000, China.
Copyright © 2020 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.
Abstract--In the task of person re-identification, the attention mechanism and fine-grained information have been proved to be effective. However, it has been observed that models often focus on the extraction of features with strong discrimination, and neglect other valuable features. The extracted fine-grained information may include redundancies. In addition, current methods lack an effective scheme to remove background interference. Therefore, this paper proposes the feature refinement and filter network to solve the above problems from three aspects: first, by weakening the high response features, we aim to identify highly valuable features and extract the complete features of persons, thereby enhancing the robustness of the model; second, by positioning and intercepting the high response areas of persons, we eliminate the interference arising from background information and strengthen the response of the model to the complete features of persons; finally, valuable fine-grained features are selected using a multi-branch attention network for person re-identification to enhance the performance of the model. Our extensive experiments on the benchmark Market-1501, DukeMTMC-reID, CUHK03 and MSMT17 person re-identification datasets demonstrate that the performance of our method is comparable to that of state-of-the-art approaches.

Index Terms--Person re-identification; deep learning; attention; person search
I. INTRODUCTION
Person re-identification, a technology for retrieving images of a specific person from cameras in multiple non-overlapping areas, has pivotal applications in the security field, including target tracking and person retrieval. In such tasks, the resolution of the person images is too low for identification through face recognition. Moreover, the images often contain intricate backgrounds, occlusions and variations in a person's pose. Because cameras with disparate orientations normally have dissimilar viewing angles, the
difficulty of person recognition is further increased [1]. Hence, person re-identification has long been a challenging task.
Fig. 1. Visualization of model feature maps. (a) Person image; (b) and (c) show the feature maps of ResNet and of our model, respectively. ResNet fails to cover all features of the person and also responds to background information; our model greatly alleviates both problems.
The performance of person re-identification, which is a sub-
topic of image recognition, largely depends on the
representation of a person’s features. In recent years, image
recognition has entered a new stage owing to multilayer
convolution-based deep learning methods. However, models that perform excellently in generic image classification, such as ResNet [2], InceptionNet [3] and VGG [4], can hardly attain desirable results here. This is because person
images have similar appearance features and relatively low inter-image distinction. Owing to factors such as pose, viewing angle, lighting, occlusion and background interference, the classification becomes even more difficult. Moreover, as shown in Fig. 1, existing methods tend to focus merely on the parts of the image that contribute most to classification, instead of on all the features of the person; importantly, the neglected parts often have recognition value. In addition, some background information is also used for recognition, which further degrades performance.
In the course of deep learning research, Li et al. [5] found that, when recognizing images, a model sometimes focuses its attention on image backgrounds that are unrelated to the recognition target. Tian et al. [6] experimentally quantified the effect of background interference on the performance of person re-identification algorithms and eliminated it with a stochastic background method. There are also plentiful studies that locate human body parts with the aid of models for attribute recognition [7], key-point location [8] and semantic image segmentation [9], thereby removing background influences. However, because these methods introduce models built for other tasks, biases can be brought into the person re-identification system, compromising its performance.
Fig. 2. Not all features in person pictures are beneficial to person re-identification. (a) Occlusion; (b) complex backgrounds; (c) distinguishing features. Different features of a person contribute differently to re-identification: the features in the red frame in (c) are clearly more important than the others.
Current research emphasizes the extraction of diverse detailed local features. Specifically, a person picture is segmented into n parts of identical size, deeper features are extracted from each part separately, and the parts are then combined as the discriminative feature. This method [10][11][12][13], which extracts features for different image parts, often yields excellent results. However, as shown in Fig. 2, some local features are not necessarily useful, and even among useful features their importance differs. Jointly extracting and processing features that have no effect on person recognition inevitably hurts model performance. Compared with extracting local features from fixed-size partitions, filtering discriminative regions and extracting detailed features from them is more meaningful.
In addition to the above methods of obtaining local features, there have also been attempts to obtain local features with diverse attention. In [37], singular value decomposition is applied to the weight matrix of the last layer of the network to reduce the correlation between features; although satisfactory results are achieved, the computational cost is high. A diversity regularization term based on the Hellinger distance is used in [39] to make different branches focus on different parts, and [29] avoids excessive concentration by constraining the distance between the row indexes of high-response features in the feature maps. These attention diversity methods are similar to those in [43]: they can acquire different attention features when there are few attention branches, but when there are too many, feature redundancy easily degrades the quality of the features. To address this problem, we propose a new attention diversity loss.
In this paper, the feature refinement and filter network is proposed to address person re-identification from three aspects. First, complete pedestrian features are extracted: by weakening the features of the high-response region, the model pays attention to more useful features and its robustness is enhanced. Second, on the basis of the complete features, the high-response features of the person are further located, the interference of background information is removed, and the generalization ability of the features is enhanced. Finally, valuable fine-grained features for person re-identification are selected through the multi-branch attention network, yielding better model performance.
The main contributions of this work can be summarized as
follows:
(1) By weakening the feature values of high-response areas,
the model can mine more valuable areas in the image, and it not
only ensures the stability of the training process, but also learns
the complete feature of a person.
(2) By locating the high-response feature based on (1), the
interference arising from background information of the input
picture is removed, thereby improving the extraction of the
global interference-free feature.
(3) In order to obtain the local fine-grained features of a
person, a multi-branch attention network with diversity loss is
designed, and the local features are obtained via adaptive
filtering by removing interference information.
(4) We conducted extensive experiments on the mainstream Market-1501, DukeMTMC-reID, CUHK03 and MSMT17 datasets; notably, we illustrate the outstanding performance of the proposed method through heat maps.
II. RELATED WORK
This study involves feature representation, feature
suppression, the attention mechanism and attention diversity.
An overview of recent work along these directions in person re-identification is given below.
A. Feature representation
Feature representation is the core problem of pedestrian
recognition. Many methods are based on extracting
discriminative features with stability in different scenes to
enhance the performance of the model.
Fig. 3. The framework of the proposed feature refinement and filter network for person re-identification. The network includes a backbone composed of Weaken Feature Conv Blocks and the multi-branch network. Weaken Feature Conv Blocks 1~4 are based on the four ResNet blocks, while the additional Conv Block 4 is based on the last block of ResNet.
Conventional person recognition generally adopts the idea of
manual features plus classifiers, such as methods in [14][15].
With conventional methods, the feature representation relies
primarily on the manually designed features, which require
professional knowledge and a complex process of parameter
adjustment. The development of deep learning has witnessed
impressive results of person re-identification achieved on
plenty of challenging datasets [16][17][18][19]. In spite of this, factors such as occlusion, view changes and background information limit further improvement of model performance, which has led some recent studies to focus on the local information of pictures. In [20][21] the global and local features
were obtained via region segmentation, where each local
feature corresponded to a dissimilar segmentation block. A
network named part-based convolutional baseline (PCB) was
put forward in [10], which divided the image features into six
horizontal blocks of identical size. Moreover, each local feature
generated an independent ID prediction loss. Later, this method
of feature segmentation was inherited and popularized in
[11][12][13][18][22][23]. Nonetheless, dividing a picture into fixed-size parts cannot guarantee the validity of the divided blocks, nor does it filter for valid blocks. Consequently, computing resources are wasted and the algorithmic performance is affected to some extent.
Local information can be acquired not only by feature segmentation but also under the guidance of auxiliary networks. By utilizing a person parsing network, Tian et al. [6] separately obtained the characteristics of the head, the trunk and the legs, and pixel-level semantic segmentation was used in [9][24] to extract the features of the foreground, the head, the upper body, the lower body and the shoes; these features were eventually fused for recognition. Other studies [7][17] located the body parts with an attribute recognition or skeletal keypoint model before extracting local features. However, introducing models designed for other tasks brings their biases into the person re-identification task and can degrade its performance.
B. Feature suppression
Feature suppression can also be called "feature drop". Prior to this work the concept did not appear explicitly in the literature, but the following studies fall into this category. The early work in [24] made the model more robust to noise and occlusion, and improved its performance, by randomly occluding parts of the sample photos. Similarly, in [26] a data augmentation method named Cutout was proposed to diminish overfitting and enhance network generalization. Cutout not only allowed the model to learn to distinguish features, but also to better integrate context and thus notice some minor local features. Along the same lines, Hu et al. [27] achieved data augmentation by occluding local areas in the sample pictures, yielding a model with better generalization and robustness.
With the deepening of research, some scholars have found that feature occlusion can be unexpectedly effective. According to [28], when the most discriminative block is occluded, the network is forced to find other relevant parts; this idea was applied to unsupervised semantic segmentation and to the visual interpretation of neural networks in [5]. In this regard, the model performance in person re-identification was improved in [29][30][31] by suppressing the most recognizable feature, which allowed the networks to learn comprehensive saliency maps.
Although suppressing features in this way is somewhat conducive to model performance, feature suppression can make training difficult, as mentioned in [31]: a vital feature found in one training iteration is erased (suppressed) in the next, which adds to the difficulty of training the model. For this problem, we propose a scheme for weakening features, which is described in the following section.
Fig. 4. The framework of the Weaken Feature Conv blocks. Each block is based on a ResNet block (such as conv2_x, conv3_x, conv4_x and conv5_x of ResNet-50), whose output is reduced to a single-channel feature map. The weaken-salient-feature module attenuates the high-response features and performs a bitwise multiplication with the input to form a new input.
C. Attention and diversity
Attention mechanisms have been applied extensively to person re-identification. The most widely used are channel attention [32] and spatial attention [33], which have been integrated into several studies [18][23][24] for better performance. In addition, Yuan et al. [74] proposed a vertical spatial sequence attention. Based on the principles of attention mechanisms, some scholars have put forward anti-attention mechanisms, initially applied to image semantic segmentation [34] and object detection [35]. An integration of channel, spatial, anti-channel and anti-spatial attention was adopted in [36] to enhance feature representation in person re-identification. Obtaining rich and diverse features is often essential to this goal. Thus, when applying attention mechanisms, we normally hope that the generated attention maps are diverse, especially for multi-branch attention. Building on [37], weighted soft orthogonality is adopted in [38] to ensure the diversity of the extracted features. Aside from constraining weight orthogonality, Li et al. [39] used a Hellinger-distance-based diversity regularization term to make the network attend to different parts. In this paper, we use feature activation and Gaussian normalization to measure the distance between attention maps and thereby achieve attention diversity.
III. METHOD
In this section, we describe the proposed feature refinement and filter network, whose overall structure is presented in Fig. 3. It primarily comprises the global feature enhancement network, the multi-attention network, and the attention diversity loss.
A. Feature refinement and filter network
To achieve optimal feature learning, a common approach is to identify the regions with discriminative features in the sample pictures and extract the most discriminative features from them. However, as mentioned above, this approach often fails to produce the desired result. Hence, this study attempts to eliminate the interference information in the pictures and extract all the feature information that is useful for re-identification, rather than merely extracting some of the most discriminative information; similar ideas have proven useful in [17][29]. To demonstrate the performance of the proposed method, we choose ResNet-50 as the backbone network, as in other studies.
Weaken salient feature: To this end, we designed the Weaken
Feature Conv blocks based on the Resnet module, as shown in
Fig. 4. This module can also be designed based on other
network modules like VGG and Inception Net. For an image I,
we can express its feature map produced by a Conv block (e.g., a ResNet block) as $F \in \mathbb{R}^{C \times H \times W}$, where $H$, $W$ and $C$ denote the height, the width and the number of channels of the feature layer, respectively. For more convenient implementation of the next operation, we reduce the feature map to a single channel via a channel-like attention, obtaining the feature map $A \in \mathbb{R}^{H \times W}$. The operational procedure is as follows:

$f_c = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} F_c(i, j)$  (1)

$f = \frac{1}{C}\sum_{c=1}^{C} f_c \odot F_c$  (2)

$A = \mathrm{sigmoid}(up\_sample(f))$  (3)

where $\odot$ denotes bitwise (element-wise) multiplication and $up\_sample$ represents the up-sampling operation. The feature map is extended to the same size as the input of the Conv module so that they can be multiplied bitwise.
Through the above operations, we derive the attention feature map $A$ of the Conv module, in which higher-value regions indicate areas of the original image that receive more attention from the model, and lower-value regions indicate areas that receive little or none. Here, we weaken the high-response areas in the feature map $A$, which forces the network to focus on the features of areas other than the high-response ones. In this way, all useful features in the input image can be given importance by the model, and the interference of background noise and other useless features is reduced. To be specific, we set a threshold $\theta$ and regard the areas of the feature map $A$ whose values exceed this threshold as the high-response areas. Afterwards, a weakening factor $\alpha \in (0, 1)$ is introduced, and the weakening factor matrix $M$ is specified as in Eq. (4). When the
feature value $A(i, j)$ is greater than $\theta$, it is considered a high-response feature and the corresponding weakening factor is $\alpha$; otherwise, it is considered a low-response feature, no weakening is required, and the factor is 1.

$M(i, j) = \begin{cases} \alpha, & \text{if } A(i, j) > \theta \\ 1, & \text{otherwise} \end{cases}$  (4)
Lastly, as shown in Fig. 4, the input I of the Conv module is multiplied bitwise by the weakening factor matrix $M$ to obtain a weakened input $I'$. By doing so, the originally high-response areas are weakened while other areas are relatively strengthened, allowing the model to focus more on areas outside the high-response regions and dig out all useful features in the input image. To better comprehend the Weaken Feature Conv blocks, the feature sizes of the main operations are listed in Table I. Blocks 1~4 are based on the corresponding blocks of ResNet, with an input image size of 384 × 128.
TABLE I. SIZES OF FEATURES OF THE MAIN OPERATIONS IN FIG. 4

Module  | Input of the module (I) | Output of the module (F) | Feature map (f) | Output of weaken salient feature (M)
Block 1 | (64, 96, 32)  | (256, 96, 32)  | (96, 32) | (96, 32)
Block 2 | (256, 96, 32) | (512, 48, 16)  | (48, 16) | (96, 32)
Block 3 | (512, 48, 16) | (1024, 24, 8)  | (24, 8)  | (48, 16)
Block 4 | (1024, 24, 8) | (2048, 24, 8)  | (24, 8)  | (24, 8)
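To make the above operations concrete, the sketch below shows one possible PyTorch implementation of the weaken-salient-feature step in Eqs. (1)-(4). The class name, the use of bilinear up-sampling, and the default values of θ and α are our assumptions rather than details taken from the paper.

```python
# Possible PyTorch sketch of the weaken-salient-feature step (Eqs. (1)-(4)).
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeakenSalientFeature(nn.Module):
    def __init__(self, theta=0.5, alpha=0.4):
        super().__init__()
        self.theta = theta   # threshold marking high-response positions
        self.alpha = alpha   # weakening factor applied to those positions

    def forward(self, block_input, block_output):
        # Eq. (1): per-channel spatial average f_c of the block output F
        f_c = block_output.mean(dim=(2, 3), keepdim=True)        # (B, C, 1, 1)
        # Eq. (2): channel-weighted average -> single-channel map f
        f = (f_c * block_output).mean(dim=1, keepdim=True)       # (B, 1, H, W)
        # Eq. (3): up-sample to the block-input size and squash to (0, 1)
        a = torch.sigmoid(F.interpolate(f, size=block_input.shape[2:],
                                        mode='bilinear', align_corners=False))
        # Eq. (4): weakening-factor matrix M (alpha where A > theta, 1 elsewhere)
        m = torch.where(a > self.theta,
                        torch.full_like(a, self.alpha),
                        torch.ones_like(a))
        # Bitwise (element-wise) multiplication yields the weakened input I'
        return block_input * m
```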
Weakening factor: In the previous section, we introduced the weakening factor, whose value is a key parameter for model performance. Although this value can be specified manually, a manually chosen value is not necessarily optimal, and finding the optimum requires extensive experiments and substantial computation. To address this problem, we put forward a better solution.
Fig. 5. The effect of different weakening factors α and thresholds θ. (a) The feature-weakening effect of various weakening factors (feature map and α = 0, 0.4, 0.5, 0.6, 0.8). (b) The high-response areas corresponding to varied thresholds (feature map and θ = 0.3, 0.4, 0.5, 0.6, 0.8).
Fig. 5 shows the process of feature weakening and the effect of the weakening factor α and the threshold θ on the high-response region. The weakening is applied only to the high-response features, and different weakening factors correspond to different degrees of coverage, as shown in Fig. 5(a). Further, it is evident from Fig. 5(b) that the size of the high-response area changes with θ. When the considered high-response area is large and the weakening factor is relatively small, the high-response features are completely covered and no useful features remain for the model to exploit. In contrast, when the considered high-response area is small and completely covered, the model is forced to focus on other areas for person recognition, which is more conducive to obtaining comprehensive features. Thus, the weakening factor should have a negative correlation with the size of the considered high-response area.
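As a purely illustrative example, α could be set adaptively from the fraction of high-response pixels so that it follows the stated negative correlation; the linear mapping and the clamping range below are assumptions, not the scheme actually used in the paper.

```python
# Illustrative only: alpha shrinks as the high-response area grows (negative correlation).
import torch


def adaptive_alpha(attention_map, theta=0.5, alpha_min=0.2, alpha_max=0.8):
    """attention_map: (H, W) tensor with values in [0, 1]."""
    area_ratio = (attention_map > theta).float().mean()   # fraction of high-response pixels
    alpha = 1.0 - area_ratio                               # larger area -> smaller alpha
    return float(torch.clamp(alpha, alpha_min, alpha_max))
```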
Fig. 6. Elimination of useless information to enhance valuable features. (a) The original image of a person; (b) the located high-response area; (c) the high-response area enlarged to the same size as the input. When the person image in (a) is input to the network, the feature map in (b) is generated by Weaken Feature Conv Block 4. The high-response area is then located and the background region is discarded. After the cropped person image is enlarged to the input size, it re-enters the network to continue training.
Feature strengthening: In this section, we describe how features carrying more information are strengthened and how irrelevant or interfering information is eliminated.
After passing through the saliency-weakening network shown in Fig. 3 and Fig. 4, Weaken Feature Conv Block 4 outputs features $F \in \mathbb{R}^{2048 \times H \times W}$, where $H$ and $W$ are 1/16 of the original image size. The feature map $A \in \mathbb{R}^{H \times W \times 1}$ is obtained by Eqs. (1) and (2), the sigmoid activation rescales the values of $A$, and the map is then enlarged by Eq. (3) to obtain $A^* \in \mathbb{R}^{H^* \times W^* \times 1}$, which has the same size as the original image. The high-response area of this feature map corresponds to the features of the person, whereas the low-response area corresponds to the background and noise in the picture. To locate the high-response area, we need the indexes of the high-response points in the feature map $A^*$. Assume that the index matrix of the high-response area is $L \in \mathbb{R}^{N \times 2}$, where N is the number of high-response points, the first column of L holds the row indexes, and the second column holds the column indexes. The following operations determine the high-response rectangular region with corners $(x_1, y_1)$, $(x_2, y_1)$, $(x_2, y_2)$, $(x_1, y_2)$:
$x_1 = \max(\min(L[:, 0]), 0)$
$x_2 = \min(\max(L[:, 0]), H^*)$
$y_1 = \max(\min(L[:, 1]), 0)$
$y_2 = \min(\max(L[:, 1]), W^*)$  (5)
By bit-wise multiplication of the original image and $A^*$, Fig. 6(b) is obtained. To remove the background and noise in the original image, we locate the maximal high-response area by Eq. (5), as shown by the red box in Fig. 6(b). Then, the part of the original image corresponding to the red box is enlarged to the size of the original image, as shown in Fig. 6(c). Thus, a new image is obtained that minimizes the influence of the background and highlights the principal features of the person. Subsequently, the image shown in Fig. 6(c) is fed back into the network for a secondary extraction and screening of features, thereby enhancing the useful features.
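A rough sketch of this feature-strengthening step is given below, assuming the attention map A* has already been up-sampled to the input resolution; the threshold used to collect the high-response points and the bilinear resize are assumptions.

```python
# Rough sketch of the feature-strengthening step (Eq. (5) and Fig. 6).
import torch
import torch.nn.functional as F


def strengthen(images, attention, threshold=0.5):
    """images: (B, 3, H, W); attention: (B, 1, H, W) with values in [0, 1]."""
    outputs = []
    for img, att in zip(images, attention):
        rows, cols = torch.nonzero(att[0] > threshold, as_tuple=True)  # index matrix L
        if rows.numel() == 0:            # nothing exceeds the threshold: keep the image
            outputs.append(img)
            continue
        # Eq. (5): tightest rectangle around the high-response points
        x1, x2 = rows.min().item(), rows.max().item()
        y1, y2 = cols.min().item(), cols.max().item()
        crop = img[:, x1:x2 + 1, y1:y2 + 1]
        # Enlarge the cropped person region back to the network input size
        crop = F.interpolate(crop.unsqueeze(0), size=img.shape[1:],
                             mode='bilinear', align_corners=False)
        outputs.append(crop.squeeze(0))
    return torch.stack(outputs)
```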
Fig. 7. Details of the attention generator module.
Algorithm 1: Feature Refinement and Filter Network
Input: Training images I, pretrained hyperparameters W of the feature-weakening backbone network, and hyperparameters of the attention branches.
Output: Backbone hyperparameters W and attention-branch hyperparameters.
Repeat:
1: Obtain the complete pedestrian features F = W(I) through the feature-weakening backbone network.
2: Calculate the feature map A* of F by Eqs. (1)(2)(3).
3: Locate the high-response area of A*.
4: Enlarge the high-response area of the original image to the size of the original image, giving I'.
5: Obtain complete and interference-free features F1 = W(I') and update W by Eqs. (10)(11); calculate the attention maps constrained by Eq. (9).
6: Obtain the multi-branch attention embedding features F2, F3, ..., FN and update the attention-branch hyperparameters by Eqs. (10)(11).
Until convergence or the maximum number of iterations.
B. Feature screening
In the foregoing section, feature screening has already been
mentioned. Here we perform further feature selection with the
aid of multi-branch attention and attention diversity.
Multi-branch attention: Inspired by [40], the multi-branch attention comprises an attention generator (Fig. 7) and attention branches. For the N attention maps to be useful, they should jointly integrate all the information of the output features, so we give up the one-sided channel attention that merely weights important channels as in SENet [32]. Instead, we design an attention that captures the long-range dependence between channels; by integrating the features of each channel, it generates N attention maps to guide the subsequent attention branches. The input feature $F \in \mathbb{R}^{C \times H \times W}$ is reduced to N channels by a convolution layer with kernel size $1 \times 1$ to lower the amount of computation; the result is reshaped, and two such reshaped tensors are matrix-multiplied to compute the dependence between channels, which is then normalized by a sigmoid. The reduced and reshaped input is finally multiplied with this $N \times N$ relation matrix and reshaped into feature maps of size $N \times H \times W$. As shown in Fig. 3, two branches are split after Weaken Feature Conv Block 4: one predicts the ID score, and the other generates N attention maps that are bitwise-multiplied separately with the output of Conv Block 3. The outcomes are fed into the new Conv Block 4 to form N new attention branches. For each branch, the softmax loss with label smoothing regularization and the triplet loss with batch hard sampling are combined for parameter optimization.
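The following sketch shows one possible reading of the attention generator in Fig. 7; the module name, the use of two separate 1×1 convolutions, and the final sigmoid on the output maps are assumptions, since the paper describes the operations only in prose.

```python
# A possible reading of the attention generator (Fig. 7).
import torch
import torch.nn as nn


class AttentionGenerator(nn.Module):
    def __init__(self, in_channels=2048, num_maps=5):
        super().__init__()
        # 1x1 convolutions reduce C channels to N to keep the relation matrix small
        self.reduce_a = nn.Conv2d(in_channels, num_maps, kernel_size=1)
        self.reduce_b = nn.Conv2d(in_channels, num_maps, kernel_size=1)
        self.num_maps = num_maps

    def forward(self, feat):                          # feat: (B, C, H, W)
        b, _, h, w = feat.shape
        a = self.reduce_a(feat).flatten(2)            # (B, N, H*W)
        bmap = self.reduce_b(feat).flatten(2)         # (B, N, H*W)
        # N x N relation matrix capturing the long-range channel dependence
        rel = torch.sigmoid(torch.bmm(a, bmap.transpose(1, 2)))    # (B, N, N)
        # Re-weight the reduced maps by the relation matrix and restore spatial size
        out = torch.bmm(rel, bmap).view(b, self.num_maps, h, w)    # (B, N, H, W)
        return torch.sigmoid(out)                     # N attention maps in (0, 1)
```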
Attention diversity: To make each attention branch extract discriminative features, we apply a hybrid loss function to each branch (described in the next section). If these attention branches are left unconstrained in the design of the network, they will all focus on the most discriminative area of the original image; the areas they attend to will be the same and the features they extract will be roughly identical. As a result, the multi-attention effect is lost and the network degenerates, which we avert by designing an attention diversity loss.
Fig. 8. Training process on the Market-1501 dataset. The curves of Rank-1, mAP and loss during training until convergence are shown.
This paper avoids the overlapping of attention areas by restricting the distance between attention centers. Specifically, we take the position of the maximum response of each attention feature map as the center of that attention, so that the task becomes constraining the centers of the various attention maps not to coincide. Accordingly, we devise a diversity loss function whose value becomes considerably large when the center distance of any two attention maps is excessively small.
To better exert the effect of the diversity loss, the attention map of each branch is expected to follow a Gaussian distribution; that is, the extreme values of an attention map should be concentrated. Thus, the generated attention maps are normalized so that they approximately obey a Gaussian distribution.
For a feature map $A \in \mathbb{R}^{H \times W}$ generated by the attention generator:

$a = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} A(i, j)$  (6)

$\sigma^2 = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} \left(A(i, j) - a\right)^2$  (7)

$\hat{A}(i, j) = \frac{A(i, j) - a}{\sqrt{\sigma^2 + \epsilon}}$  (8)
After the above operation, the maximum-value position of each feature map is taken as the central location of that attention map, and the inter-center distances of the various attention maps are constrained so that the attention centers do not coincide. A reasonable threshold $D$ should be set in light of the number of attention maps and the width of the image: if the center distance is shorter than $D$, the loss becomes larger. Accordingly, we define the loss of attention diversity as:

$loss\_dist = \sum_{batch\_size}\sum_{i \neq j} \max(0, D - d_{ij})$  (9)

where $batch\_size$ denotes the number of training samples in a batch, $i$ and $j$ index different attention maps, and $d_{ij}$ is the distance between the centers of attention maps $i$ and $j$.
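An illustrative implementation of the normalization in Eqs. (6)-(8) and the diversity loss in Eq. (9) might look as follows; using the arg-max position as the attention center follows the text, while the Euclidean center distance and the ε value are assumptions.

```python
# Illustrative implementation of Eqs. (6)-(9).
import torch


def diversity_loss(att_maps, D=2.0, eps=1e-6):
    """att_maps: (B, N, H, W) attention maps produced by the attention generator."""
    b, n, h, w = att_maps.shape
    flat = att_maps.view(b, n, -1)
    mean = flat.mean(dim=2, keepdim=True)                 # Eq. (6)
    var = flat.var(dim=2, unbiased=False, keepdim=True)   # Eq. (7)
    norm = (flat - mean) / torch.sqrt(var + eps)          # Eq. (8)
    # Attention centers = positions of the maximum normalized response
    idx = norm.argmax(dim=2)                              # (B, N)
    centers = torch.stack((torch.div(idx, w, rounding_mode='floor'), idx % w),
                          dim=2).float()                  # (B, N, 2) as (row, col)
    loss = att_maps.new_zeros(())
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = (centers[:, i] - centers[:, j]).norm(dim=1)   # center distance
            loss = loss + torch.clamp(D - d_ij, min=0).sum()     # Eq. (9)
    return loss
```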
Fig. 9. Sensitivity of person re-ID accuracy to (a) the number of attention branches, (b) the threshold D, (c) the weight coefficient λ and (d) the threshold θ. Rank-1 and mAP on Market-1501 are shown.
C. Discussion
In this paper, the saliency-weakening method is used to eliminate background information and obtain more useful local features; weakening rather than directly erasing also keeps the training process stable. After more useful features are obtained, the corresponding region of the original image is cropped and re-entered into the network, which follows the same principle as an attention mechanism in strengthening features: the new input allows features to be extracted, refined and screened once again. Subsequently, each attention branch attends to different features, which is a process of screening dissimilar salient features. Through the above operations, the extraction of complete person features, as well as of diverse local fine-grained features, can be ensured.
The study in [43] shows that the optimal number of attention branches is 3 on person re-identification datasets: too many attention branches attend to background information or cause information redundancy that harms the performance of the model. This paper avoids this situation by obtaining more comprehensive person features, removing background interference through feature strengthening, and finally applying Gaussian normalization and the attention diversity loss.
The implementation of our method is summarized in Algorithm 1, and the convergence of the algorithm during training is shown in Fig. 8.
D. Loss function
For the sake of better generalization, all classification losses are softmax losses with label smoothing regularization [41] combined with triplet losses with batch hard sampling [42]; both are extremely common in person re-identification:

$L_{sof\_LS} = -\frac{1}{N}\sum_{i=1}^{N}\left((1-\varepsilon)g_i + \frac{\varepsilon}{S}\right)\log p_i$  (10)

where $g_i$ is the true label of the sample, $p_i$ is the probability predicted by the network, $\varepsilon$ denotes the smoothing factor, $S$ is the total number of person identity (ID) labels participating in training, and $N$ is the number of pictures in a batch.
Meanwhile, the triplet loss with batch hard sampling selects only the hardest samples within a mini-batch, which ensures stable training and shortens the training time, as shown in Eq. (11):

$L_{tri} = \frac{1}{PQ}\sum_{a \in batch}\left(\max_{p \in A} d_{a,p} - \min_{n \in B} d_{a,n} + m\right)_{+}$  (11)

where a batch contains $P$ person identities, each with $Q$ pictures; for a picture $a$, $A$ is the set of pictures with the same ID and $B$ is the set with different IDs; $d_{a,p}$ is the distance between the features of pictures $a$ and $p$, and $d_{a,n}$ is that between pictures $a$ and $n$; $m$ is a margin threshold; and $(z)_{+}$ is equivalent to $\max(z, 0)$.
In Fig. 3, each branch, including the global branch, uses a combination of the above two loss functions for parameter optimization. The overall loss can be expressed as:

$Loss_{all} = \sum_{i=0}^{M}\left(L_{sof\_LS} + L_{tri}\right) + \lambda \cdot loss\_dist$  (12)

where $M$ is the number of branches, including the global branch, and $\lambda$ is the coefficient that adjusts the weight between the losses.
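A hedged sketch of the three training losses in Eqs. (10)-(12) is given below; the margin value, the default λ and the per-branch bookkeeping are assumptions rather than specifications from the paper.

```python
# Sketch of the label-smoothing softmax, batch-hard triplet, and combined loss.
import torch
import torch.nn.functional as F


def label_smoothing_ce(logits, labels, eps=0.1):
    """Eq. (10): cross entropy with smoothed targets over S identity classes."""
    log_p = F.log_softmax(logits, dim=1)
    num_classes = logits.size(1)
    smooth = torch.full_like(log_p, eps / num_classes)
    smooth.scatter_(1, labels.unsqueeze(1), 1.0 - eps + eps / num_classes)
    return -(smooth * log_p).sum(dim=1).mean()


def batch_hard_triplet(features, labels, margin=0.3):
    """Eq. (11): hardest positive / hardest negative within the batch."""
    dist = torch.cdist(features, features)               # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = dist.masked_fill(~same, float('-inf')).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()


def total_loss(branch_logits, branch_feats, labels, div_loss, lam=0.1):
    """Eq. (12): sum over all branches plus the weighted diversity loss."""
    loss = sum(label_smoothing_ce(lg, labels) + batch_hard_triplet(ft, labels)
               for lg, ft in zip(branch_logits, branch_feats))
    return loss + lam * div_loss
```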
IV. EXPERIMENTS AND RESULTS
To verify the performance of the model herein, we performed
experiments by choosing the public datasets Market-1501 [44],
DukeMTMC-reID [45], CUHK03-NP [46] and MSMT17 [61], which are universally used in re-ID tasks. In the comparison experiments, we adopt the Cumulative Matching Characteristic (CMC) [47] at Rank-1, which takes the label with the highest score as the predicted label when computing accuracy, and the mean Average Precision (mAP), on all the datasets.
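For reference, the two metrics can be computed from a query-gallery distance matrix roughly as follows; the standard camera-ID filtering of the Market-1501 protocol is omitted here for brevity.

```python
# Sketch of CMC Rank-1 and mAP from a precomputed distance matrix.
import numpy as np


def rank1_and_map(dist, q_ids, g_ids):
    """dist: (num_query, num_gallery); q_ids, g_ids: identity labels."""
    rank1_hits, aps = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                          # gallery sorted by distance
        matches = (g_ids[order] == q_ids[i]).astype(np.float32)
        rank1_hits.append(matches[0])                        # Rank-1: is the top match correct?
        hit_positions = np.where(matches == 1)[0]
        if len(hit_positions) == 0:
            continue
        # Average precision: precision evaluated at each true-match position
        precisions = [(k + 1) / (pos + 1) for k, pos in enumerate(hit_positions)]
        aps.append(np.mean(precisions))
    return np.mean(rank1_hits), np.mean(aps)
```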
The Market-1501 dataset was collected on the campus of
Tsinghua University, which includes 1,501 persons captured by
6 cameras. Among them, 751 are in the training set containing
12,936 pictures and 750 are in the test set with 19,732 pictures.
For these 750 persons, an image is randomly selected from each camera as a query, giving at most six queries per person and 3,368 query images in total.
The DukeMTMC-reID dataset is a subset of DukeMTMC,
which incorporates 36,411 pictures of 1,812 people collected
by 8 cameras. There are 1,404 people who appear under two
cameras, and 408 people appearing under simply one camera.
The training set consists of 16,522 pictures from 702 people,
while 17,661 pictures from another 702 people are used as the
test set. For the 702 people in the test set, an image is selected
randomly from each camera as a query, with 2,228 images in
total.
TABLE II. COMPARISON OF OUR PROPOSED METHOD WITH THE STATE-OF-THE-ART METHODS ON THE MARKET-1501 DATASET

Method | mAP(%) | Rank-1(%)
DuATM (CVPR'18) [48] | 76.2 | 91.3
MLFN (CVPR'18) [49] | 74.4 | 90.0
MGCAM (CVPR'18) [50] | 74.25 | 83.55
HA-CNN (CVPR'18) [18] | 75.7 | 91.2
KPM (CVPR'18) [51] | 75.3 | 90.1
CRF (CVPR'18) [52] | 81.6 | 93.5
AACN (CVPR'18) [7] | 66.9 | 85.9
Mancs (ECCV'18) [53] | 82.3 | 93.1
PCB+RPP (ECCV'18) [10] | 81.6 | 93.8
ABD-Net (ICCV'19) [38] | 88.28 | 95.6
BDB (ICCV'19) [30] | 86.7 | 95.3
SONA (ICCV'19) [54] | 88.67 | 95.68
Auto-ReID (ICCV'19) [55] | 85.1 | 94.5
OSNet (ICCV'19) [56] | 84.9 | 94.8
IAN (CVPR'19) [57] | 83.1 | 94.4
CAMA (CVPR'19) [43] | 84.5 | 94.7
MHN (CVPR'19) [58] | 85.0 | 95.1
Pyramid (CVPR'19) [13] | 88.2 | 95.7
BNNeck (CVPR'19) [59] | 85.9 | 94.5
CASN (CVPR'19) [29] | 83.2 | 94.8
AANet-50 (CVPR'19) [7] | 82.45 | 93.89
SIF (TIP'20) [60] | 87.6 | 95.2
st-ReID+RE [71] | 87.6 | 98.1
PAN (TCSVT'18) [62] | 69.33 | 86.67
BISAA (TCSVT'19) [63] | 75.1 | 92.1
CRAN (TCSVT'19) [64] | 84.9 | 94.9
HPDN [67] | 82.3 | 95.2
SMC+ECD [68] | 59.73 | 80.38
DropEasy [69] | 78.3 | 93.8
Gconv [73] | 72.3 | 88.1
BNNeck+k-reciprocal [72] | 92.4 | 95.2
BNNeck+END [70]+Jaccard | 93.2 | 95.5
Ours | 89.6 | 96.7
Ours+k-reciprocal [72] | 92.7 | 97.8
Ours+ECN [8] | 93.6 | 97.8
Ours+END+Jaccard [70] | 94.2 | 98.3
The CUHK03-NP dataset contains 14,097 pictures of 1,467 individuals. We evaluate the re-ID performance under the new training/testing protocol proposed in [19], which is more challenging. First, the new protocol has a larger pool of candidates, covering 5,332 images of 700 identities, whereas the original protocol had only 100 images of 100 identities. Second, the new protocol has a smaller training set (767 identities) than the original protocol (1,467 identities). In the "Detected" set, all bounding boxes are generated by DPM, while in the "Labeled" set the boxes are drawn manually. We report results on both sets.
MSMT17 is currently the largest public person re-identification dataset, consisting of 126,441 images of 4,101 identities captured by 12 outdoor cameras and 3 indoor cameras. The dataset covers different weather conditions and three time periods: morning, noon and afternoon. It contains 30,284 training images of 1,041 identities, 11,659 query images for testing, and 82,161 gallery images of 3,060 identities.
TABLE III. COMPARISON OF OUR PROPOSED METHOD WITH THE STATE-OF-THE-ART METHODS ON THE DUKEMTMC-REID DATASET

Method | mAP(%) | Rank-1(%)
DuATM (CVPR'18) [48] | 64.6 | 81.8
MLFN (CVPR'18) [49] | 62.8 | 81.2
HA-CNN (CVPR'18) [18] | 63.8 | 80.5
AACN (CVPR'18) [7] | 59.25 | 76.84
PCB+RPP (ECCV'18) [10] | 69.2 | 83.3
Mancs (ECCV'18) [53] | 71.8 | 84.9
IAN (CVPR'19) [57] | 73.4 | 87.1
CAMA (CVPR'19) [43] | 72.9 | 85.8
MHN (CVPR'19) [58] | 77.2 | 89.1
BNNeck (CVPR'19) [59] | 76.4 | 86.4
AANet-50 (CVPR'19) [7] | 72.56 | 86.42
CASN (CVPR'19) [29] | 73.7 | 87.7
ABD-Net (ICCV'19) [38] | 78.59 | 89.0
BDB (ICCV'19) [30] | 76.0 | 89.0
SONA (ICCV'19) [54] | 78.05 | 89.25
Auto-ReID (ICCV'19) [55] | 75.1 | 88.5
OSNet (ICCV'19) [56] | 73.5 | 88.6
SIF (TIP'20) [60] | 79.4 | 89.8
st-ReID+RE [71] | 83.9 | 94.4
PAN (TCSVT'18) [62] | 51.51 | 71.59
CRAN (TCSVT'19) [64] | 74.7 | 87.6
HPDN [67] | 68.0 | 83.6
DropEasy [69] | 58.9 | 78.6
Gconv [73] | 61.7 | 77.3
BNNeck+k-reciprocal [72] | 89.1 | 90.0
BNNeck+END [70]+Jaccard | 87.9 | 91.6
Ours | 81.6 | 93.4
Ours+k-reciprocal [72] | 87.7 | 94.2
Ours+ECN [8] | 86.3 | 94.5
Ours+END+Jaccard [70] | 90.3 | 94.7
A. Implementation Details
ResNet-50, pre-trained on ImageNet, is chosen to build the backbone network. As in PCB [10], the downsampling in ResNet Block 4 is removed, and the batch size is set to 32. Pictures are uniformly resized to 384 × 128, and the data are augmented by random flipping and random erasing. The smoothing factor ε in Eq. (10) is set to 0.1. We set the number of attention branches to 5 and the attention-distance threshold D to 2. On an NVIDIA GTX-1080Ti GPU, 300 epochs are trained at an initial learning rate
of 0.00035, which is decayed by half every 50 epochs. In the
test phase, the global features are combined with those of all
attention branches, which serve as the final query features for
score matching.
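A minimal sketch of this test-time matching is shown below; the L2 normalization before distance computation is an assumption, as the paper does not specify it.

```python
# Test-time matching: concatenate global and attention-branch features, rank the gallery.
import torch


def extract_query_feature(global_feat, branch_feats):
    """global_feat: (B, D); branch_feats: list of (B, d_i) tensors."""
    feat = torch.cat([global_feat] + list(branch_feats), dim=1)
    return torch.nn.functional.normalize(feat, dim=1)    # L2-normalize before matching


def rank_gallery(query_feat, gallery_feats):
    # Smaller Euclidean distance = better match
    dist = torch.cdist(query_feat, gallery_feats)         # (num_query, num_gallery)
    return dist.argsort(dim=1)
```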
TABLE IV. COMPARISON OF OUR PROPOSED METHOD WITH THE STATE-OF-THE-ART METHODS ON THE CUHK03 DATASET

Method | Labeled mAP(%) | Labeled Rank-1(%) | Detected mAP(%) | Detected Rank-1(%)
HA-CNN (CVPR'18) [18] | 41.0 | 44.4 | 38.6 | 41.7
MLFN (CVPR'18) [49] | 49.2 | 54.7 | 47.8 | 52.8
DaRe+RE (CVPR'18) [21] | 61.6 | 66.1 | 59.0 | 63.3
PCB+RPP (ECCV'18) [10] | - | - | 57.5 | 63.7
Mancs (ECCV'18) [53] | 63.9 | 69.0 | 60.5 | 65.5
MGN (ACM MM'18) [20] | 67.4 | 68.0 | 66.0 | 66.8
BDB (ICCV'19) [30] | 76.7 | 79.4 | 73.5 | 76.4
Auto-ReID (ICCV'19) [55] | 73.0 | 77.9 | 69.3 | 73.3
CASN (CVPR'19) [29] | 68.0 | 73.7 | 64.4 | 71.5
MHN (CVPR'19) [58] | 72.4 | 77.2 | 65.4 | 71.7
SONA (ICCV'19) [54] | 79.23 | 81.85 | 76.35 | 79.10
SIF (TIP'20) [60] | 77.0 | 79.5 | 73.9 | 76.6
PAN (TCSVT'18) [62] | 45.8 | 43.9 | 43.8 | 41.9
CRAN (TCSVT'19) [64] | 68.2 | 72.7 | 64.9 | 69.9
HPDN [67] | 58.2 | 64.3 | 56.8 | 63.1
SMC+ECD [68] | - | - | - | 71.76
DropEasy [69] | - | - | 50.4 | 55.9
Gconv [73] | 91.5 | 85.9 | 89.0 | 83.1
Ours | 81.2 | 85.1 | 78.3 | 80.9
Ours+k-reciprocal [72] | 84.5 | 87.2 | 79.3 | 82.6
Ours+ECN [8] | 84.6 | 87.1 | 80.2 | 83.4
Ours+END+Jaccard [70] | 84.9 | 88.6 | 80.6 | 84.2
TABLE V. COMPARISON OF OUR PROPOSED METHOD WITH THE STATE-OF-THE-ART METHODS ON THE MSMT17 DATASET

Method | mAP(%) | Rank-1(%)
Auto-ReID (ICCV'19) [55] | 52.5 | 78.2
OSNet (ICCV'19) [56] | 52.9 | 78.7
IAN (CVPR'19) [57] | 46.8 | 75.5
HA-CNN (CVPR'18) [18] | 37.2 | 64.7
BISAA (TCSVT'19) [63] | 39.1 | 68.7
CRAN (TCSVT'19) [64] | 52.4 | 78.7
DI-REID (CVPR'20) [65] | 47.1 | 75.5
RGA-SC (CVPR'20) [66] | 57.5 | 80.3
Ours | 59.6 | 81.5
Ours+k-reciprocal [72] | 63.9 | 83.2
Ours+ECN [8] | 65.5 | 84.6
Ours+END+Jaccard [70] | 66.7 | 84.9
TABLE VI. RESULTS OF ABLATION STUDY ON MARKET-1501

ResNet-50 | Weak Salient Feature | Feature strengthening | Multi-attention branches | mAP(%) | Rank-1(%) | Time
✓ |   |   |   | 68.2 | 84.2 | 35.6
✓ | ✓ |   |   | 72.8 | 87.9 | 38.3
✓ | ✓ | ✓ |   | 78.5 | 89.5 | 70.8
✓ | ✓ |   | ✓ | 86.1 | 94.9 | 64.2
✓ | ✓ | ✓ | ✓ | 89.6 | 96.7 | 78.6
Note: Time denotes the running time per epoch (s).
B. Parameter analysis
We analyze some crucial parameters of our method on the Market-1501 dataset; the same optimized parameters are then used for all the datasets.
a) The number of attention branches: A vital parameter of our method is the number of multi-attention branches. Fig. 9(a) shows that without the attention branches, the model cannot extract fine-grained features and the distinguishability of the features is insufficient. Hence, adding attention branches remarkably improves the performance of the model. As the number of branches increases, the accuracy first improves and reaches its optimum at 5 branches; beyond that, the additional branches generate redundant features and the model performance dwindles slightly.
b) The value of threshold D: The attention diversity loss contains a parameter D. We evaluate the sensitivity of the accuracy to D, as shown in Fig. 9(b). The size of the attention map is 24 × 8, so we report Rank-1 and mAP while changing D from 0.5 to 5. The best performance is obtained at D = 2.
c) The value of the weight coefficient λ: the coefficient λ used in Eq. (12) is an important parameter of the loss function. As Fig. 9(c) shows, when λ varies over different magnitudes, the performance of the model first increases and then decreases; the best precision is obtained when λ is set to 0.1. If λ is too large, it limits the learning ability of the network, which highlights the necessity of a proper λ.
d) The value of the threshold θ: In Eq. (4), the threshold θ determines whether a region is treated as high response. The weakening of salient features is applied to ResNet-50 and evaluated on the Market-1501 dataset; the result is shown in Fig. 9(d). The best performance occurs at around θ = 0.4 and θ = 0.5, with similar behavior on the other datasets.
Comparison with state-of-the-art methods: On the four datasets above, the performance of our model is compared with algorithms published in top conferences and journals in recent years; the comparison results are listed in Tables II-V. As the data show, the proposed method is superior to state-of-the-art methods including DuATM [48], MLFN [49], HA-CNN [18], AACN [7], Mancs [53], PCB [10], ABD-Net [38], AANet-50 [7], CASN [29], MGN [20], CAMA [43], and DropEasy [69].
In addition, to obtain better results, we adopt several re-ranking methods, namely k-reciprocal [72], ECN [8] and END+Jaccard [70]. After re-ranking, mAP/Rank-1 reaches 94.2%/98.3%, 90.3%/94.7%, 84.2%/88.3% and 66.7%/84.9% on Market-1501, DukeMTMC-reID, CUHK03 (Labeled) and MSMT17, respectively. Among the compared methods, the Gabor-based deep learning model proposed in [73] has an advantage over traditional convolution; its results on the CUHK03 dataset are better than ours, but its results on Market-1501 and DukeMTMC-reID are worse than ours. Without re-ranking, our method achieves the best results on the MSMT17 dataset. The performance of our method is comparable to that of st-ReID+RE [71] on Market-1501 and slightly worse on DukeMTMC-reID. This is because our method, like most methods, uses purely visual information, whereas st-ReID+RE introduces spatio-temporal information; by fusing multimodal information (spatio-temporal and visual), it achieves an effect that surpasses all purely visual supervised methods on the Market-1501 and DukeMTMC-reID datasets.
C. Visualization
Visualization of feature maps for the global and multi-attention branches: Fig. 10 displays the visualization of the global feature and the 5 attention branches for 3 sample pictures. It is clear that the global branch attends to essentially all useful features of the person, while each of the five attention branches focuses on different local features. This suggests that the proposed method learns global information together with diverse local information, which is the reason for the model's high performance.
Fig. 10 Visualization for the global branch and 5 attention branches. The
proposed architecture allows the model to learn global and diverse local detailed
features.
D. Ablation Studies
According to Tables II-V, our method attains competitive results on Market-1501, DukeMTMC-reID, CUHK03 and MSMT17. There are two probable causes: (1) the weak-salient-feature and feature-strengthening methods enable the model to remove background information and acquire more comprehensive person features; (2) the attention branches and branch diversity help the model acquire more detailed local information. Furthermore, combining the global features with the local features of the multi-attention branches makes the features more discriminative and distinguishable. As can be seen from Table VI, as the network complexity increases, the performance of the model improves, and the corresponding training time also increases. The average inference time for an image is 392 ms.
Benefit of Weak Salient Feature: Under the same conditions as in [48], with a batch size of 64, ResNet-50 reaches mAP/Rank-1 of 71.4%/87.5%. In this paper, with the batch size set to 32, ResNet-50 reaches only 68.2%/84.2%. As indicated in Table VI, after the addition of the Weak Salient Feature, mAP and Rank-1 increase by 4.6% and 3.7%, respectively. The global branch in Fig. 10 displays the activated feature map after feature weakening, and the model is seen to focus on more features. After feature strengthening is added, mAP/Rank-1 further increases by 5.7%/1.6%, while removing the Weak Salient Feature operation from our model reduces mAP/Rank-1 by 3.3%/1.6%, indicating the powerful role of this operation. Because the feature weakening module does not introduce new parameters, the training time increases little compared with ResNet-50.
The impact of feature strengthening: We compared the effect of adding the feature strengthening module to the backbone network and of deleting it from the whole network, and found that the module is beneficial. After adding the feature strengthening module, the performance of the backbone network improves from mAP/Rank-1 = 72.8%/87.9% to 78.5%/89.5%. After removing it from the whole network, mAP and Rank-1 decrease by 3.3% and 1.6%, respectively. However, the feature strengthening module also significantly increases the training time.
The impact of multi-attention branches: Since diverse attention branches play a key role in this work, we add the multi-branch attention separately behind the Weak Salient Feature model and behind the full model, as shown in the last two rows of Table VI. Their mAP/Rank-1 are 13.2%/7.0% and 11.1%/7.2% higher than before adding the attention. From the heat maps of the attention branches in Fig. 10, the multiple branches attend to different local information, which realizes the diversity of model features and greatly improves performance. The attention branches also increase the number of parameters, so the training time increases significantly.
V. CONCLUSION
This paper reports a feature selection network that combines
global and local fine-grained features to realize person re-
identification. The proposed model explores more valuable
features by weakening the salient features, and obtaining
diverse fine-grained features after eliminating interference
information. Through experiments, the state-of-the-art
performance of the Feature refinement and filter network on the
mainstream datasets for person re-identification is verified.
In the future, we expect to apply the proposed method fused
with the temporal attention mechanism to video-based person
re-identification tasks, to identify a person in different frames.
Additionally, we intend to explore the feasibility of using this
method for other deep learning-based tasks.
VI. ACKNOWLEDGMENT
This work is supported by the National Natural Science Foundation of China (Grant No. 61901436).
REFERENCES
[1] S. Li, H. Yu, and R. Hu, “Attributes-aided part detection and
refinement for person re-identification,” Pattern Recognit., vol. 97,
2020.
[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” Proc. IEEE Comput. Soc. Conf. Comput. Vis.
Pattern Recognit., vol. 2016-Decem, pp. 770778, 2016.
This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication.
The final version of record is available at http://dx.doi.org/10.1109/TCSVT.2020.3043026
Copyright (c) 2020 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) <
11
[3] C. Szegedy et al., “Going deeper with convolutions,” in Proceedings
of the IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, 2015.
[4] K. Simonyan and A. Zisserman, “Very deep convolutional networks
for large-scale image recognition,” 3rd Int. Conf. Learn. Represent.
ICLR 2015 - Conf. Track Proc., pp. 114, 2015.
[5] K. Li, Z. Wu, K. C. Peng, J. Ernst, and Y. Fu, “Tell Me Where to
Look: Guided Attention Inference Network,” Proc. IEEE Comput.
Soc. Conf. Comput. Vis. Pattern Recognit., pp. 92159223, 2018.
[6] M. Tian et al., “Eliminating Background-bias for Robust Person Re-
identification,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern
Recognit., pp. 57945803, 2018.
[7] C. P. Tay, S. Roy, and K. H. Yap, “AANet: Attribute attention network
for person re-identifications,” Proc. IEEE Comput. Soc. Conf.
Comput. Vis. Pattern Recognit., vol. 2019-June, pp. 7127–7136, 2019.
[8] M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen, “A Pose-
Sensitive Embedding for Person Re-identification with Expanded
Cross Neighborhood Re-ranking,” Proc. IEEE Comput. Soc. Conf.
Comput. Vis. Pattern Recognit., pp. 420429, 2018.
[9] M. M. Kalayeh, E. Basaran, M. Gokmen, M. E. Kamasak, and M.
Shah, “Human Semantic Parsing for Person Re-identification,” Proc.
IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 1062
1071, 2018.
[10] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, “Beyond Part
Models: Person Retrieval with Refined Part Pooling (and A Strong
Convolutional Baseline),” Lect. Notes Comput. Sci. (including
Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol.
11208 LNCS, pp. 501–518, 2018.
[11] L. Sun, J. Liu, Y. Zhu, and Z. Jiang, “Local to Global with Multi-
Scale Attention Network for Person Re-Identification,” Proc. - Int.
Conf. Image Process. ICIP, vol. 2019-September, pp. 22542258,
2019.
[12] X. Sun, N. Zhang, Q. Chen, Y. Cao, and B. Liu, “People re-identification
by multi-branch CNN with multi-scale features,” 2019 IEEE Int. Conf. Image Process., pp.
2269–2273, 2019.
[13] F. Zheng et al., “Pyramidal person re-identification via multi-loss
dynamic training,” Proc. IEEE Comput. Soc. Conf. Comput. Vis.
Pattern Recognit., vol. 2019-June, pp. 8506–8514, 2019.
[14] S. Karanam, M. Gou, Z. Wu, A. Rates-Borras, O. Camps, and R. J.
Radke, “A Systematic Evaluation and Benchmark for Person Re-
Identification: Features, Metrics, and Datasets,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 41, no. 3, pp. 523536, 2019.
[15] L. Zheng, Y. Yang, and A. G. Hauptmann, “Person Re-identification:
Past, Present and Future,” arXiv Comput. Sci., vol. 14, no. 8, pp. 1
20, 2016.
[16] L. He, J. Liang, H. Li, and Z. Sun, “Deep Spatial Feature
Reconstruction for Partial Person Re-identification: Alignment-free
Approach,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern
Recognit., vol. 2, pp. 70737082, 2018.
[17] J. Xu, R. Zhao, F. Zhu, H. Wang, and W. Ouyang, “Attention-Aware
Compositional Network for Person Re-identification,” Proc. IEEE
Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 21192128,
2018.
[18] W. Li, X. Zhu, and S. Gong, “Harmonious Attention Network for
Person Re-identification,” Proc. IEEE Comput. Soc. Conf. Comput.
Vis. Pattern Recognit., no. I, pp. 22852294, 2018.
[19] Z. Zhang, C. Lan, W. Zeng, and Z. Chen, “Densely semantically
aligned person re-identification,” Proc. IEEE Comput. Soc. Conf.
Comput. Vis. Pattern Recognit., vol. 2019-June, no. d, pp. 667676,
2019.
[20] G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou, “Learning
discriminative features with multiple granularities for person re-
identification,” MM 2018 - Proc. 2018 ACM Multimed. Conf., pp.
274282, 2018.
[21] Y. Wang et al., “Resource Aware Person Re-identification Across
Multiple Resolutions,” Proc. IEEE Comput. Soc. Conf. Comput. Vis.
Pattern Recognit., pp. 80428051, 2018.
[22] B. Xie, X. Wu, S. Zhang, S. Zhao, and M. Li, “Learning Diverse
Features with Part-Level Resolution for Person Re-Identification.
(arXiv:2001.07442v1 [cs.CV]),” arXiv Comput. Sci., pp. 18, 2019.
[23] Y. Zhu, X. Guo, J. Liu, and Z. Jiang, “Multi-branch context-aware
network for person re-identification,” 2019 IEEE Int. Conf. Image Process., pp.
2274–2278, 2019.
[24] H. Guo, H. Wu, C. Zhao, H. Zhang, J. Wang, and H. Lu, “Cascade
attention network for person re-identification,” 2019 IEEE Int. Conf. Image Process., pp.
2264–2268, 2019.
[25] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random Erasing
Data Augmentation,” Proc. AAAI Conf. Artif. Intell., vol. 34, no. 07,
pp. 1300113008, 2020.
[26] T. DeVries and G. W. Taylor, “Improved Regularization of
Convolutional Neural Networks with Cutout,” arXiv, no.
1708.04552v2, 2017.
[27] T. Hu, H. Qi, Q. Huang, and Y. Lu, “See Better Before Looking
Closer: Weakly Supervised Data Augmentation Network for Fine-
Grained Visual Classification,” no. 1, 2019.
[28] K. K. Singh and Y. J. Lee, “Hide-and-Seek: Forcing a Network to be
Meticulous for Weakly-Supervised Object and Action Localization,”
Proc. IEEE Int. Conf. Comput. Vis., vol. 2017-October, pp. 3544
3553, 2017.
[29] M. Zheng, S. Karanam, Z. Wu, and R. J. Radke, “Re-identification
with consistent attentive siamese networks,” Proc. IEEE Comput. Soc.
Conf. Comput. Vis. Pattern Recognit., vol. 2019-June, pp. 57285737,
2019.
[30] Z. Dai, M. Chen, X. Gu, S. Zhu, and P. Tan, “Batch dropblock
network for person re-identification and beyond,” Proc. IEEE Int.
Conf. Comput. Vis., vol. 2019-October, pp. 36903700, 2019.
[31] M. Zheng, S. Karanam, T. Chen, R. J. Radke, and Z. Wu, “Learning
Similarity Attention,” arXiv, 2019.
[32] J. Hu, L. Shen, and G. Sun, “Squeeze-and-Excitation Networks,” in
Proceedings of the IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, 2018.
[33] S. Woo, J. Park, J. Y. Lee, and I. S. Kweon, “CBAM: Convolutional
block attention module,” Lect. Notes Comput. Sci. (including Subser.
Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 11211
LNCS, pp. 319, 2018.
[34] Q. Huang et al., “Semantic segmentation with reverse attention,” Br.
Mach. Vis. Conf. 2017, BMVC 2017, 2017.
[35] S. Chen, X. Tan, B. Wang, and X. Hu, “Reverse attention for salient
object detection,” Lect. Notes Comput. Sci. (including Subser. Lect.
Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 11213 LNCS, pp.
236252, 2018.
[36] S. Liu, L. Qi, Y. Zhang, and W. Shi, “Dual Reverse Attention
Networks for Person Re-Identification,” 2019 IEEE Int. Conf. Image
Process., pp. 12321236, 2019.
[37] Y. Sun, L. Zheng, W. Deng, and S. Wang, “SVDNet for Pedestrian
Retrieval,” in Proceedings of the IEEE International Conference on
Computer Vision, 2017.
[38] T. Chen et al., “ABD-net: Attentive but diverse person re-
identification,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2019-Octob,
pp. 8350–8360, 2019.
[39] S. Li, S. Bak, P. Carr, and X. Wang, “Diversity Regularized
Spatiotemporal Attention for Video-Based Person Re-identification,”
Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp.
369378, 2018.
[40] H. Fukui, T. Hirakawa, T. Yamashita, and H. Fujiyoshi, “Attention
branch network: Learning of attention mechanism for visual
explanation,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern
Recognit., vol. 2019-June, no. January, pp. 1069710706, 2019.
[41] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna,
“Rethinking the Inception Architecture for Computer Vision,” in
Proceedings of the IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, 2016.
[42] A. Hermans, L. Beyer, and B. Leibe, “In Defense of the Triplet Loss
for Person Re-Identification,” CoRR, vol. abs/1703.0, 2017.
[43] W. Yang, H. Huang, Z. Zhang, X. Chen, K. Huang, and S. Zhang,
“Towards rich feature discovery with class activation maps
augmentation for person re-identification,” in Proceedings of the
IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, 2019.
[44] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable
person re-identification: A benchmark,” Proc. IEEE Int. Conf.
Comput. Vis., vol. 2015 Inter, no. November, pp. 11161124, 2015.
[45] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi,
“Performance measures and a data set for multi-target, multi-camera
tracking,” in Lecture Notes in Computer Science (including subseries
Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), 2016.
[46] W. Li, R. Zhao, T. Xiao, and X. Wang, “DeepReID: Deep filter
pairing neural network for person re-identification,” in Proceedings
of the IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, 2014.
[47] R. M. Bolle, J. H. Connell, S. Pankanti, N. K. Ratha, and A. W. Senior,
“The relation between the ROC curve and the CMC,” in Proceedings
- Fourth IEEE Workshop on Automatic Identification Advanced
Technologies, AUTO ID 2005, 2005.
[48] J. Si et al., “Dual Attention Matching Network for Context-Aware
Feature Sequence Based Person Re-identification,” Proc. IEEE
Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 53635372,
2018.
[49] X. Chang, T. M. Hospedales, and T. Xiang, “Multi-level
Factorisation Net for Person Re-identification,” Proc. IEEE Comput.
Soc. Conf. Comput. Vis. Pattern Recognit., pp. 21092118, 2018.
[50] C. Song, Y. Huang, W. Ouyang, and L. Wang, “Mask-Guided
Contrastive Attention Model for Person Re-identification,” Proc.
IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 1179
1188, 2018.
[51] Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang, “End-to-End Deep
Kronecker-Product Matching for Person Re-identification,” Proc.
IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 6886
6895, 2018.
[52] D. Chen, D. Xu, H. Li, N. Sebe, and X. Wang, “Group Consistent
Similarity Learning via Deep CRF for Person Re-identification,”
Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp.
86498658, 2018.
[53] C. Wang, Q. Zhang, C. Huang, W. Liu, and X. Wang, “Mancs: A
Multi-task Attentional Network with Curriculum Sampling for
Person Re-Identification,” Lect. Notes Comput. Sci. (including
Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol.
11208 LNCS, pp. 384400, 2018.
[54] B. Bryan, Y. Gong, Y. Zhang, and C. Poellabauer, “Second-order
non-local attention networks for person re-identification,” Proc. IEEE
Int. Conf. Comput. Vis., vol. 2019-Octob, no. November, pp. 3759
3768, 2019.
[55] R. Quan, X. Dong, Y. Wu, L. Zhu, and Y. Yang, “Auto-reID:
Searching for a part-aware convnet for person re-identification,” Proc.
IEEE Int. Conf. Comput. Vis., vol. 2019-October, pp. 37493758,
2019.
[56] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang, “Omni-scale feature
learning for person re-identification,” Proc. IEEE Int. Conf. Comput.
Vis., vol. 2019-October, no. d, pp. 37013711, 2019.
[57] R. Hou, B. Ma, H. Chang, X. Gu, S. Shan, and X. Chen, “Interaction-
and-aggregation network for person re-identification,” Proc. IEEE
Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 9317–9326, 2019.
[58] B. Chen, W. Deng, and J. Hu, “Mixed high-order attention network
for person re-identification,” Proc. IEEE Int. Conf. Comput. Vis., vol.
2019-October, pp. 371381, 2019.
[59] H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang, “Bag of tricks and a
strong baseline for deep person re-identification,” IEEE Comput. Soc.
Conf. Comput. Vis. Pattern Recognit. Work., vol. 2019-June, pp.
14871495, 2019.
[60] L. Wei et al., "SIF: Self-Inspirited Feature Learning for Person Re-
Identification," in IEEE Transactions on Image Processing, vol. 29,
pp. 4942-4951, 2020.
[61] L. Wei, S. Zhang, W. Gao, and Q. Tian, “Person Transfer GAN to
Bridge Domain Gap for Person Re-identification,” Proc. IEEE
Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 7988, 2018.
[62] Z. Zheng, L. Zheng, and Y. Yang, “Pedestrian alignment network for
large-scale person re-identification,” IEEE Trans. Circuits Syst.
Video Technol., vol. 29, no. 10, pp. 30373045, 2019.
[63] X. Liu, S. Bi, S. Fang, and A. Bouridane, “Bayesian Inferred Self-
Attentive Aggregation for Multi-Shot Person Re-Identification,”
IEEE Trans. Circuits Syst. Video Technol., vol. PP, no. c, pp. 11,
2019.
[64] C. Han, R. Zheng, C. Gao, and N. Sang, “Complementation-
Reinforced Attention Network for Person Re-Identification,” IEEE
Trans. Circuits Syst. Video Technol., vol. PP, no. XX, pp. 11, 2019.
[65] Y. Huang, Z.-J. Zha, X. Fu, R. Hong, and L. Li, “Real-world Person
Re-Identification via Degradation Invariance Learning,” pp. 14084–
14094, Apr. 2020.
[66] Z. Zhang, C. Lan, W. Zeng, X. Jin, and Z. Chen, “Relation-Aware
Global Attention for Person Re-identification,” 2019.
[67] H. Wang, T. Fang, Y. Fan, and W. Wu, “Person re-identification
based on dropeasy method,” IEEE Access, vol. 7, pp. 97021–97031,
2019.
[68] A. Borgia, Y. Hua, E. Kodirov, and N. M. Robertson, “Cross-View
Discriminative Feature Learning for Person Re-Identification,” IEEE
Trans. Image Process., vol. 27, no. 11, pp. 53385349, 2018.
[69] H. Wang, T. Fang, Y. Fan, and W. Wu, “Person re-identification
based on dropeasy method,” IEEE Access, vol. 7, pp. 97021–97031,
2019.
[70] J. Lv, Z. Li, K. Nai, Y. Chen, and J. Yuan, “Person re-identification
with expanded neighborhoods distance re-ranking,” Image Vis.
Comput., vol. 95, p. 103875, 2020.
[71] G. Wang, J. Lai, P. Huang, and X. Xie, “Spatial-Temporal Person Re-
Identification,” Proc. AAAI Conf. Artif. Intell., vol. 33, pp. 8933–
8940, 2019.
[72] J. V. C. I. R et al., “Person re-identification based on re-ranking with
expanded k-reciprocal nearest neighbors,” J. Vis. Commun. Image
Represent., vol. 58, pp. 486–494, 2019.
[73] Y. Yuan, J. Zhang, and Q. Wang, “Deep Gabor convolution network
for person re-identification,” Neurocomputing, vol. 378, pp. 387–398,
2020.
[74] Y. Yuan, Z. Xiong, and Q. Wang, “VSSA-NET: Vertical Spatial
Sequence Attention Network for Traffic Sign Detection,” IEEE Trans.
Image Process., vol. 28, no. 7, pp. 34233434, 2019.
Xin Ning received his Ph.D. in 2017 from
Institute of Semiconductors, Chinese
Academy of Sciences. He is currently an
Assistant Professor of Artificial
Intelligence at the Institute of
Semiconductors, Chinese Academy of
Sciences. His research interests include
deep learning, machine art, pattern
recognition, and image cognitive
computation. He is a member of IEEE.
Ke Gong received his bachelor's degree from
China University of Petroleum (Beijing) in
2018. He is currently working at Beijing
Wave Security Technology Company
Limited, Cognitive Computing Technology
Joint Laboratory, Wave Group.
Weijun Li received his Ph.D. in 2004 from
Institute of Semiconductors, Chinese
Academy of Sciences. He is currently a
Professor of Artificial Intelligence at the
Institute of Semiconductors, Chinese
Academy of Sciences (ISCAS) and the
University of Chinese Academy of
Sciences. He is in charge of the Artificial
Intelligence Research Center of ISCAS and is
also the Director of the Lab of High-speed Circuits & Neural
Networks of ISCAS. His research interests include deep
modeling, machine art, pattern recognition, artificial neural
networks, and intelligent systems. He is a senior member of IEEE.
Liping Zhang received her Ph.D. from
Institute of Semiconductors, Chinese
Academy of Sciences in 2018. Currently,
she is an assistant research fellow in the
Laboratory of High-speed Circuit and
Artificial Neural networks at Institute of
Semiconductors, Chinese Academy of
Sciences. Her research interests include
biometrics and pattern analysis. She is a
member of IEEE.
Xiao Bai received the B.Eng. degree in
computer science from Beihang
University of China, Beijing, China, in
2001, and the Ph.D. degree in computer
science from the University of York,
York, U.K., in 2006.
He was a Research Officer (Fellow,
Scientist) with the Computer Science
Department, University of Bath, until 2008. He is currently a
Full Professor with the School of Computer Science and
Engineering, Beihang University. He has authored or co-
authored more than 100 papers in journals and refereed
conferences. His current research interests include pattern
recognition, image processing, and remote sensing image
analysis. He is an Associate Editor of the journals Pattern
Recognition and Signal Processing.
Shengwei Tian received the Ph.D. degree
in computer science and technology from
Xinjiang University in 2010. He is
currently a Professor with the Xinjiang
University of Technology.
His research interests include intelligent
computing, image processing, and natural
language processing.