Task-conditioned Domain Adaptation for
Pedestrian Detection in Thermal Imagery
My Kieu, Andrew D. Bagdanov, Marco Bertini, and Alberto del Bimbo
Media Integration and Communication Center - University of Florence, Italy
{firstname.lastname}@unifi.it
Abstract. Pedestrian detection is a core problem in computer vision
that sees broad application in video surveillance and, more recently, in
advanced driving assistance systems. Despite its broad application and
interest, it remains a challenging problem in part due to the vast range of
conditions under which it must be robust. Pedestrian detection at night-
time and during adverse weather conditions is particularly challenging,
which is one of the reasons why thermal and multispectral approaches
have become popular in recent years. In this paper, we propose a
novel approach to domain adaptation that significantly improves pedes-
trian detection performance in the thermal domain. The key idea behind
our technique is to adapt an RGB-trained detection network to simul-
taneously solve two related tasks. An auxiliary classification task that
distinguishes between daytime and nighttime thermal images is added
to the main detection task during domain adaptation. The internal rep-
resentation learned to perform this classification task is used to con-
dition a YOLOv3 detector at multiple points in order to improve its
adaptation to the thermal domain. We validate the effectiveness of task-
conditioned domain adaptation by comparing with the state-of-the-art
on the KAIST Multispectral Pedestrian Detection Benchmark. To the
best of our knowledge, our proposed task-conditioned approach achieves
the best single-modality detection results.
Keywords: object detection, pedestrian detection, thermal imagery, task-
conditioning, domain adaptation, conditioning network
1 Introduction
Object detection and, in particular, pedestrian detection is one of the most
important problems in computer vision due to its central role in diverse practical
applications such as safety and security, surveillance, and autonomous driving.
The detection problem is particularly challenging in many common contexts
such as limited illumination (nighttime) or adverse weather conditions (fog, rain,
dust) [22, 19]. In such conditions the majority of detectors [4, 27, 40] using visible
spectrum imagery can fail.
For these reasons, detectors exploiting thermal imagery have been proposed
as suitable for robust pedestrian detection [19, 38, 25, 20, 5, 22, 23, 14]. A
growing number of works have also investigated multispectral detectors that
combine visible and thermal images for robust pedestrian detection [36, 1, 29,
24, 38, 25, 20, 39, 5, 14, 22, 23].
However, multispectral detectors, in order to make the most out of both
modalities, typically need to resort to additional (and expensive) annotations,
and are usually based on far more complex network architectures than single-
modality methods (see table 3). Moreover, due to the cost of deploying multiple
aligned sensors (thermal and visible) at inference time, multispectral models
can have limited applicability in real-world applications. Aside from the techni-
cal and economic reasons, the privacy-preserving affordances offered by thermal
imagery are also a motivation for preferring thermal-only detection [19]. Because
of this, several recent works do not use visible images, but focus only on thermal
images for pedestrian detection [18, 16, 3, 7, 19, 15]. They typically yield lower
performance than multispectral detectors since robust pedestrian detection using
only thermal data is nontrivial and there is still potential for improvement.
In this paper we propose a task-conditioned network architecture for domain
adaptation of pedestrian detectors to thermal imagery. Our key idea is to aug-
ment a detector with an auxiliary network that solves a simpler classification
task and then to exploit the learned representation of this auxiliary network to
inject conditioning parameters into strategically chosen convolutional layers of
the main detection network. The resulting, adapted network operates entirely
in the thermal domain and achieves excellent performance compared to other
single-modality approaches.
The contributions of this work are:
– we propose a novel task-conditioned network architecture based on YOLOv3 [33] that uses the auxiliary task of day/night classification to aid adaptation to the thermal domain;
– we conduct extensive ablative analyses probing the effectiveness of various task-conditioning architectures and adaptation schedules;
– to the best of our knowledge, our task-conditioned detection networks outperform all single-modality detection approaches on the KAIST Multispectral Pedestrian Detection Benchmark [17]; and
– exploiting only thermal imagery, we outperform many state-of-the-art multispectral pedestrian detectors on the KAIST benchmark at nighttime.
The rest of the paper is organized as follows. In the next section we review
the scientific literature related to our proposed domain adaptation approach. In
section 3 we describe our approach to conditioning thermal domain adaptation
on the auxiliary task of day/night discrimination. We report in section 4 on
an extensive set of experiments performed to evaluate the effectiveness of task-
conditioning, and in section 5 we conclude with a discussion of our contribution.
2 Related work
Pedestrian detection has attracted much attention from the scientific community
over the years because of its usefulness in many applications. Thanks to the
reduction of costs and availability of thermal cameras, many recent works have
investigated how to perform it in multispectral and thermal domains.
2.1 Pedestrian detection in the visible spectrum
The main challenges to robust pedestrian detection in the visible spectrum arise
from a variety of environmental conditions such as occlusion, changing illumi-
nation, and variation of viewpoint and background [29]. In [36] discriminative
detectors are learned by jointly optimizing them along with semantic tasks such
as pedestrian and scene attributes detection; in [29] joint estimation of visibility
of multiple pedestrians and recognition of overlapping pedestrians is done us-
ing a mutual visibility deep model; in [5] semantic segmentation is used as
additional supervision to improve detection. In [40] the Region Proposal
Network (RPN) originally proposed in Faster R-CNN is used for standalone
pedestrian detection; in [24] multiple scales are handled with specialized
sub-networks based on Fast R-CNN; prediction of pedestrian
centers and scales in one pass and without anchors was recently proposed in [27].
2.2 Multispectral pedestrian detection approaches
Many recent works have used both thermal and RGB images to improve detec-
tion results [38, 25, 20, 39, 22, 23], combining visible and thermal images for
training and testing. The authors of [38] investigated two types of fusion net-
works to exploit visible and thermal image pairs. Four different network fusion
approaches (early, halfway, late, and score fusion) for the multispectral pedes-
trian detection task were also introduced in [25]. The cross-modality learning
framework including a Region Reconstruction Network (RRN) and Multi-Scale
Detection Network (MDN) of [39] used thermal image features to improve de-
tection results in visible data.
Because the combination of visible and thermal images works well in two-
stage network architectures, most top-performing multispectral pedestrian
detectors are based on the approach originally used in Fast-/Faster R-CNN.
For instance, the Faster R-CNN detector was used to perform multispectral
pedestrian detection in Illumination-aware Faster R-CNN (IAF R-CNN) [23].
The authors in [20] detected persons in multispectral video with a combination
of a Fully Convolutional RPN and a Boosted Decision Trees Classifier (BDT).
The generalization ability of RPN was also investigated in [10], evaluating which
multispectral dataset results in better generalization. MSDS-RCNN [22] is a
fusion of a multispectral proposal network (MPN) and a multispectral classi-
fication network (MCN). In [41] an Aligned Region CNN is proposed to deal
with weakly aligned multispectral data. Box-level segmentation via a supervised
learning framework was proposed in [6], eliminating the need of anchor boxes.
Approaches based on one-stage detectors have also been investigated. The
authors in [37] used YOLOv2 [32] as a fast single-pass network architecture
for multispectral detection. A deconvolutional single-shot multi-box detector
(DSSD) was also leveraged by authors in [21] to exploit the correlation between
visible and thermal features. The work in [43] adopted two Single Shot Detec-
tors (SSDs) to investigate the potential of fusing color and thermal features with
Gated Fusion Units (GFU).
2.3 Pedestrian detection in thermal imagery
A few works have addressed pedestrian detection using thermal (IR) imagery
only. Adaptive fuzzy C-means for IR image segmentation and a CNN for pedestrian
detection were proposed in [18]. A combination of Thermal Position Inten-
sity Histogram of Oriented Gradients (TPIHOG) and the additive kernel SVM
(AKSVM) was proposed by [3] for nighttime-only detection in thermal imagery.
Thermal images augmented with saliency maps as an attention mechanism
have been used to train a Faster R-CNN detector in [12]. In [16] several video
preprocessing steps are performed to make thermal images look more similar
to grayscale images converted from RGB, then a pre-trained and fine-tuned
SSD detector is used. Recently, the authors in [7] used Cycle-GAN for image-
to-image translation of thermal to pseudo-RGB data, using it to fine-tune a
multimodal Faster R-CNN detector. Instead, the authors in [15] used a GAN to
transform visible images to synthetic thermal images, as a data augmentation
step to train a pedestrian detector on thermal-only imagery. Other recent
work on domain adaptation proposed Top-down and Bottom-up Domain Adaptation
approaches [19] for pedestrian detection in thermal imagery; the bottom-up
adaptation obtains state-of-the-art single-modality results at nighttime on the
KAIST dataset [17].
2.4 Task-conditioned networks
There are a few task-conditioning approaches, such as conditional generative
models like those based on adversarial networks [28], and the seminal work in [31]
that proposed architecture guidelines for training Deep Convolutional GANs.
In particular, our approach is inspired by the general conditioning layer called
Feature-wise Linear Modulation (FiLM) proposed in [30] for conditioning visual
reasoning tasks.
In this paper we perform pedestrian detection on thermal imagery only. Our
method is based on the single-stage detector YOLOv3 [33], whose computa-
tional efficiency makes it particularly well-suited to practical applications with
real-time requirements. We extend the YOLOv3 architecture by integrating con-
ditioning layers to better specialize the network to deal with day- and nighttime
images. We evaluate conditioning of residual groups, detection heads, and their
combination during domain adaptation.
3 Task-conditioned domain adaptation
In this section we describe our approach to conditioning a detector during adap-
tation to the thermal domain. Our central idea is that robust pedestrian de-
tection naturally depends on low-level semantic qualities of input images – for
example whether an image is captured during the day or at night. This aux-
iliary information should be useful for learning representations upon which we
can condition the adaptation of internal representations used for the primary de-
tection task. In the next section we describe the architecture of an auxiliary
classification network that is connected to the main detection network, and in
section 3.2 we describe the conditioning layers that can be strategically inserted
into the network to modify internal representations. We describe two alternative
conditioning architectures for YOLOv3 in section 3.3, and in section 3.4 we put
everything together into a description of the combined adaptation loss.
3.1 Auxiliary classification network
Let $D_{\Theta_d}(x)$ represent the detector network (YOLOv3 in our case) parameterized
by $\Theta_d$, and let $F_i(x)$ represent the output of the $i$-th convolutional layer of the
detection network. We define an auxiliary classification network as follows. The
output of an early convolutional layer (e.g. $F_4(x)$ as in Fig. 1) is average pooled
to form a feature that is then fed to two fully-connected layers of size $C$ with
ReLU activations. The resulting feature representation is then passed to a final
fully connected layer with a single output and a sigmoid activation. We denote
the output of this auxiliary network $A_{\Theta_a}(x)$.
During training we use the following loss attached to the output of the aux-
iliary network:

$$\mathcal{L}_a(x_i, y_i; \Theta_a) = -\left[ y_i \cdot \log A_{\Theta_a}(x_i) + (1 - y_i) \cdot \log(1 - A_{\Theta_a}(x_i)) \right], \quad (1)$$

where for all training images $x_i$ we associate an auxiliary training label $y_i$. Since
we experiment on the KAIST dataset, which distinguishes daytime and nighttime
images in its annotations and evaluation protocol, we define $y_i = 0$ if $x_i$ was
captured during the day, and $y_i = 1$ if $x_i$ was captured at night. In this case the
auxiliary network has the task of classifying images as daytime or nighttime.
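To make the auxiliary branch concrete, the following is a minimal PyTorch sketch of such a classifier attached to an early feature map of the detector. The class and variable names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class AuxiliaryDayNightClassifier(nn.Module):
    """Sketch of A_Theta_a: average-pool an early feature map F_i(x), apply two
    fully-connected layers of size C with ReLU, then a single sigmoid output
    (y = 0 for daytime, y = 1 for nighttime)."""

    def __init__(self, in_channels: int, C: int = 1024):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # (B, in_ch, H, W) -> (B, in_ch, 1, 1)
        self.fc1 = nn.Linear(in_channels, C)
        self.fc2 = nn.Linear(C, C)
        self.head = nn.Linear(C, 1)           # single logit, squashed by sigmoid

    def forward(self, feat: torch.Tensor):
        z = self.pool(feat).flatten(1)
        z = torch.relu(self.fc1(z))
        z = torch.relu(self.fc2(z))           # internal representation reused for conditioning
        return torch.sigmoid(self.head(z)), z

# The loss of Eq. (1) is plain binary cross-entropy on the sigmoid output:
# prob, rep = aux(F4); loss_a = nn.BCELoss()(prob.squeeze(1), y.float())
```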
3.2 Conditioning layers
Our idea is to use the internal, $C$-dimensional representation learned in the auxil-
iary classification network (i.e. the representation after the two fully-connected
layers used for classification) rather than its output. See Figure 1 for a schematic
representation of the conditioning process. This representation is task-specific: in
our experiments it is learned to capture the salient information useful for deter-
mining whether an image was captured during the day or at night. At strategic
points in the main detection network we use this representation to generate
conditioning parameters that modulate the convolutional feature maps.
Consider an arbitrary convolutional output $F_i(x)$ of the main detector net-
work $D_{\Theta_d}$, and let $d_i$ be the number of convolutional feature maps in $F_i(x)$. We
generate conditioning parameters $\gamma_i$ and $\beta_i$:

$$\gamma_i = \mathrm{ReLU}[W^i_\gamma A(x) + b^i_\gamma]$$
$$\beta_i = \mathrm{ReLU}[W^i_\beta A(x) + b^i_\beta],$$
Fig. 1. Conditioning layer and auxiliary classification network. The auxiliary network
learns an internal representation used to solve a classification task. This representation
is then leveraged by conditioning layers to adjust internal convolutional feature maps
in the detection network.
where $W^i_\gamma, W^i_\beta \in \mathbb{R}^{d_i \times C}$ and $b^i_\gamma, b^i_\beta \in \mathbb{R}^{d_i}$ are the weights and biases, respectively,
of two new fully connected layers of $d_i$ units added to the network (purple layers
in Fig. 1). These new layers are responsible for generating the parameters used
for conditioning $F_i$.
$F_i$ is substituted by the conditioned version:

$$F'_i(x) = \mathrm{ReLU}[(1 \oplus \gamma_i) \odot F_i(x) \oplus \beta_i],$$

where $\odot$ and $\oplus$ are, respectively, the elementwise multiplication and addition
operations broadcast to cover the spatial dimensions of the feature maps $F_i(x)$.
In this way, the generated $\gamma_i$ parameters can scale feature maps independently
and the $\beta_i$ parameters independently translate them.
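A minimal PyTorch sketch of one such conditioning layer follows the equations above (a FiLM-style scale and shift); the $(1 \oplus \gamma_i)$ form means that zero-valued parameters leave the feature map unchanged up to the ReLU. Module names are illustrative.

```python
import torch
import torch.nn as nn

class ConditioningLayer(nn.Module):
    """Sketch of a conditioning layer: map the C-dimensional auxiliary
    representation A(x) to per-channel parameters gamma_i, beta_i of size d_i,
    then scale and shift the feature map F_i(x)."""

    def __init__(self, C: int, d_i: int):
        super().__init__()
        self.to_gamma = nn.Linear(C, d_i)   # W_gamma, b_gamma
        self.to_beta = nn.Linear(C, d_i)    # W_beta, b_beta

    def forward(self, F_i: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        gamma = torch.relu(self.to_gamma(a))[:, :, None, None]  # broadcast over H, W
        beta = torch.relu(self.to_beta(a))[:, :, None, None]
        # F'_i(x) = ReLU[(1 + gamma_i) * F_i(x) + beta_i]
        return torch.relu((1.0 + gamma) * F_i + beta)

# Example: condition a (B, 256, H, W) feature map on a (B, 1024) representation:
# cond = ConditioningLayer(C=1024, d_i=256); F_prime = cond(F_i, a)
```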
3.3 Conditioned network architectures
YOLOv3 is a very deep detection network with three detection heads for de-
tecting objects at different scales [33]. In order to investigate the effectiveness
of conditioning YOLOv3 during domain adaptation, we experimented with two
different strategies for injecting conditioning layers into the network. In sec-
tion 4.3 we report on a series of ablation experiments performed to evaluate
these different architectural possibilities for conditioning the network.
Conditioning residual groups (TC Res Group). YOLOv3 uses a 52-layer,
fully-convolutional residual network as its backbone. The network is coarsely
structured into five residual groups, each consisting of one or more residual blocks
of two convolutional layers with residual connections adding the input of each
block to the output.
A natural conditioning point is at each of these residual groups. This strategy
is illustrated in figure 2, which also reports the size of the layers of the
conditioning network ($C = 1024$). After each group of residual blocks, we insert
a conditioning layer after the last convolutional layer and before the final residual
connection of the group.

Fig. 2. TC Res Group: Conditioning residual groups of YOLOv3. The pre-ReLU
activations of the last layer of each convolutional group are modified by parameters
$\gamma_i$ and $\beta_i$. Conditioning is done before the final residual connection of each group.
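As a sketch of the TC Res Group placement just described (reusing the ConditioningLayer above; conv1, conv2, and cond are hypothetical module handles):

```python
def conditioned_residual_block(x, conv1, conv2, cond, a):
    """Sketch of the TC Res Group placement: condition the pre-ReLU output of
    the last block in a group, before the final residual connection.
    conv1/conv2/cond are hypothetical module handles; a is A(x)."""
    h = conv2(conv1(x))  # two-convolution body of the last residual block
    h = cond(h, a)       # conditioning layer (applies its own ReLU)
    return x + h         # final residual connection of the group
```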
Conditioning detection heads (TC Det). A natural alternative to condi-
tioning residual groups is to condition each of the three detection heads branch-
ing off of the YOLOv3 backbone. The intuition here is to condition the network
closer to where the actual detections are being made.
Detection heads in YOLOv3 consist of one convolutional block for the large-
scale detection head, and three convolutional blocks for the other two. We insert
the conditioning layer after the last convolution of these blocks and before the
final 1×1 convolutional layer producing the detection head output. Figure 3
gives a schematic illustration of the detection-head conditioning architecture and
reports the size of the layers of the conditioning network ($C = 512$).

Fig. 3. TC Det: Conditioning the detection heads of YOLOv3. Feature maps used for
detection are conditioned using the internal representation of the auxiliary network.
3.4 Adaptation loss
The final loss function used for domain adaptation is:

$$\mathcal{L}(x_i, \mathbf{y}_i, y_i; \Theta_d, \Theta_a) = \mathcal{L}_d(x_i, \mathbf{y}_i) + \mathcal{L}_a(x_i, y_i),$$

where $x_i$ is a training thermal image, $\mathcal{L}_d$ is the standard detection loss based
on the structured target detections $\mathbf{y}_i$, and $\mathcal{L}_a$ is the auxiliary classification loss
defined in equation (1).
When we backpropagate error from the auxiliary loss $\mathcal{L}_a$ we are improving
the internal representation of the auxiliary network $A_{\Theta_a}$, making it better for
classifying day/night. When we backpropagate error from the detection loss,
we simultaneously improve the generated conditioning parameters ($\gamma_i$, $\beta_i$) and
the internal representation in the YOLOv3 backbone. Our intuition is that this
adapts feature maps to be conditionable based on the representation learned
in the auxiliary classification network.
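A sketch of one adaptation step under this combined loss; here `detector` is assumed to return both the detection loss and the auxiliary day/night probability, which is an assumption about the interface rather than the authors' actual code.

```python
import torch.nn.functional as F

def adaptation_step(detector, optimizer, images, targets, day_night_labels):
    """One optimization step of L = L_d + L_a (unweighted sum)."""
    optimizer.zero_grad()
    det_loss, aux_prob = detector(images, targets)  # assumed interface
    aux_loss = F.binary_cross_entropy(aux_prob.squeeze(1),
                                      day_night_labels.float())
    loss = det_loss + aux_loss
    loss.backward()    # gradients reach the backbone, the conditioning FCs,
    optimizer.step()   # and the auxiliary network simultaneously
    return float(loss)
```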
4 Experimental results
In this section we report results of a number of experiments we performed to
evaluate the effectiveness of task-conditioned domain adaptation. In section 4.1
we describe the characteristics of the KAIST Multispectral Pedestrian Detection
benchmark, and in section 4.3 we present two ablation studies we conducted to
evaluate the various architectural parameters of our approach. In section 4.4
we compare with state-of-the-art single- and multimodal pedestrian detection
approaches.
4.1 Dataset and evaluation metrics
Our experiments were conducted on the KAIST Multispectral Pedestrian Bench-
mark dataset [17]. KAIST is the only large-scale dataset with well-aligned visi-
ble/thermal pairs [7], and it contains videos captured both during the day and
at night.
The KAIST dataset consists of 95,328 aligned visible/thermal image pairs
split into 50,172 for training and 45,156 for testing. As is common practice, we use
the reasonable setting [9, 17, 22, 25], and use the improved training annotations
from [22] and test annotations from [25]. We sample every two frames from
training videos and exclude heavily occluded and small person instances (<50
pixels). The final training set contains 7,601 images. The test set contains 2,252
image pairs sampled every 20 frames. Figure 4 shows some example images with
our detection results on KAIST.
We used standard evaluation metrics for object detection, namely miss rate
as a function of False Positives Per Image (FPPI), and log-average miss rate for
thresholds in the range $[10^{-2}, 10^{0}]$. For computing miss rates, an Intersection
over Union (IoU) threshold of 0.5 is used to calculate True Positives (TP), False
Positives (FP), and False Negatives (FN).
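For reference, a small NumPy sketch of the log-average miss rate computation under the standard Caltech/KAIST protocol, assuming `fppi` is sorted ascending with `miss_rate` holding the corresponding curve values:

```python
import numpy as np

def log_average_miss_rate(fppi: np.ndarray, miss_rate: np.ndarray) -> float:
    """Geometric mean of the miss rate sampled at nine FPPI reference points
    evenly spaced in log-space over [1e-2, 1e0]."""
    refs = np.logspace(-2, 0, 9)
    samples = []
    for r in refs:
        idx = np.where(fppi <= r)[0]
        # take the miss rate at the largest FPPI not exceeding the reference;
        # if the curve never gets that low, count a miss rate of 1.0
        samples.append(miss_rate[idx[-1]] if idx.size else 1.0)
    return float(np.exp(np.mean(np.log(np.maximum(samples, 1e-10)))))
```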
Fig. 4. Examples of KAIST thermal images with detections. The first two rows are
daytime images and the last two are nighttime. The first and the third rows show
detection results without conditioning, and the second and last rows show detections
with our TC Det detector. Blue boxes are true positive detections, green boxes are false
negatives, and red boxes indicate false positives. See section 4.3 for detailed analysis.
4.2 Implementation and training
All of our networks were implemented in PyTorch, and source code and pretrained
models are available at https://github.com/mrkieumy/task-conditioned. During
training, at each epoch we set aside 10% of the
training images for validation for that epoch. We use the same hyperparameter
settings of the original YOLOv3 model [33] and use weights pretrained on MS
COCO as a starting point. We use Stochastic Gradient Descent (SGD) with
an initial learning rate of 0.0001. When the validation performance no longer
improves, we reduce the learning rate by a factor of 10. Training is halted after
decreasing the learning rate twice in this way. All models were trained for a
maximum of 50 epochs with a batch size of 8 and an input image size of 640×512.
In most cases, training stops at around 30 epochs and requires about 12 hours
on an NVIDIA GTX 1080.
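The schedule above can be summarized as a small sketch, with `train_one_epoch` and `validate` standing in for the actual training and validation routines, which are not part of the paper:

```python
import torch

def adapt(model, train_loader, train_one_epoch, validate, max_epochs=50):
    """Sketch of the training schedule: SGD at 1e-4, reduce the learning rate
    by 10x when validation stops improving, halt after two reductions."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
    best_mr, drops = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader, optimizer)
        mr = validate(model)          # log-average miss rate on the held-out 10%
        if mr < best_mr:
            best_mr = mr
        else:
            for g in optimizer.param_groups:
                g["lr"] *= 0.1        # reduce learning rate by a factor of 10
            drops += 1
            if drops == 2:            # training halts after two reductions
                return model
    return model
```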
Fig. 5. Ablation study of different conditioning points. Plots report miss rate as a
function of false positives per image, and log-average miss rates are given in the legends.
4.3 Ablation studies
In this section we report on a series of experiments we conducted to explore the
design space for task-conditioned adaptation of a pretrained YOLOv3 detector to
the thermal domain. We first consider the where-aspect of task-conditioning (i.e.
at which points in the YOLOv3 architecture task-conditioning is most effective),
and then consider the when-aspect of task conditioning by exploring the many
possibilities of conditioning adaptation phases.
Comparison of conditioning points. YOLOv3 is a very deep network which
presents many options for intervening with conditioning layers. It has 23 residual
blocks, each consisting of two convolutional layers and one residual connection.
These 23 residual blocks are organized into five groups as illustrated in figure 2.
Inspired by [30], in which the authors demonstrate that conditioning
residual blocks can be effective, we performed an architectural ablation on where
to condition the network by considering conditioning of all residual blocks versus
conditioning each residual group. We investigate also conditioning of the three
detection heads, both alone and in combination with residual group conditioning.
The configurations investigated are:
– No Conditioning (direct fine-tuning on thermal): the YOLOv3 network
pretrained on MS COCO is directly fine-tuned on KAIST thermal images.
– TC Res Group (conditioning of residual groups): the conditioning scheme
described in section 3.3 and illustrated in figure 2. We insert conditioning
layers into all residual groups at the final residual block.
– TC Res All (conditioning of all residual blocks): similar to group condi-
tioning, but conditioning all residual blocks of the YOLOv3 network.
– TC Det (conditioning of detection heads): the scheme described in sec-
tion 3.3 and illustrated in figure 3.
– TC Res Group + Det (conditioned residual groups and detection heads):
a combination of TC Res Group and TC Det.
In figure 5 we plot the miss rate as a function of False Positives Per Image
(FPPI) for the five different conditioning options. Note that all task-conditioned
networks result in improvement over the No Conditioning network trained
Table 1. Ablation on adaptation schedules for TC Det. Results are on KAIST in
terms of log-average miss rate (lower is better). NC indicates the modality is used for
adaptation with no conditioning, C indicates the modality is used with conditioning
of detection heads, ✗ indicates the modality is not used during adaptation, and ✓
marks the modality used at test time.

Training              Testing               Miss Rate
visible   thermal     visible   thermal     all     day     night
NC        ✗           ✓         ✗           36.67   32.83   45.00
C         ✗           ✓         ✗           34.73   29.53   46.09
✗         NC          ✗         ✓           31.06   37.34   16.69
NC        NC          ✗         ✓           30.50   37.45   15.73
C         NC          ✗         ✓           28.48   35.86   12.97
✗         C           ✗         ✓           29.95   38.16   12.61
NC        C           ✗         ✓           28.53   36.59   11.03
C         C           ✗         ✓           27.11   34.81   10.31
with standard fine-tuning. TC Det performs best overall and performs especially
well at nighttime with a miss rate of only 10.31% – an improvement of 6.38%
over the No Conditioning network.
While conditioning residual groups (TC Res Group) is also effective com-
pared to fine-tuning, adding more conditioning layers results in worse perfor-
mance. One reason for this might be that conditioning layers add parameters to
the network and, depending on the size of the feature maps being conditioned,
could lead to overfitting on the KAIST training set.
In figure 4 we give example detections from the TC Det and No Condition-
ing detectors. TC Det yields more true positive and fewer false positive detec-
tions with respect to simple fine-tuning. On daytime images (first two columns
of figure 4), the detector without conditioning (top row) produces a number of
false positives and missed detections which TC Det does not. The difference is
even more pronounced at nighttime (second two columns of figure 4).
This ablation analysis indicates that conditioning only detection layers (TC
Det) is most effective when compared to conditioning of residual blocks – an-
swering the where of task-conditioning. In all of the following experiments we
consider only the TC Det task-conditioned network.
Comparison of conditional adaptation schedules. In this set of experi-
ments we compare the many options of conditioning when adapting a pretrained
detector from the visible to the thermal domain. Starting from a pretrained de-
tector, we can fine-tune (with or without conditioning) on KAIST RGB images,
then fine-tune (again with or without conditioning) on KAIST thermal images.
In table 1 we give results of an ablation study considering all these possibili-
ties. Adapting first using RGB images, rather than going directly to thermal,
is generally useful. In fact, the best adaptation schedule is to fine-tune a condi-
tioning network on visible spectrum images, and then fine-tune that conditioned
network on thermal imagery.
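The best schedule (bottom row of table 1) can be sketched as two conditioned fine-tuning stages; `fine_tune` is a placeholder for the training loop of section 4.2:

```python
def best_adaptation_schedule(model, visible_loader, thermal_loader, fine_tune):
    """Sketch of the C -> C schedule from table 1: conditioned fine-tuning on
    KAIST visible images first, then on KAIST thermal images."""
    fine_tune(model, visible_loader)   # stage 1: RGB, with conditioning active
    fine_tune(model, thermal_loader)   # stage 2: thermal, with conditioning active
    return model
```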
Fig. 6. The effects of conditioning during daytime and nighttime. The first two columns
show results for a thermal detector without conditioning and with conditioning. Blue
boxes are true positive detections, green boxes are false negatives, and red boxes indi-
cate false positives. See text for detailed analysis.
Visualizing the effects of conditioning. Figure 6 illustrates the effect con-
ditioning has on the feature maps of YOLOv3. The heatmaps in this figure were
generated by averaging the convolutional feature maps input to the medium-
scale detection head of YOLOv3 and superimposing this on the original thermal
image. The third column is the average feature map of a non-conditioned thermal
detector (TD), and the fourth and fifth columns are, respectively, the average
feature maps before and after conditioning.
From the heatmaps in figure 6 we note that pedestrians show more contrast
with the background in the task-conditioned feature maps for both daytime
and nighttime. Also, the thermal detector without conditioning misses several
pedestrians and produces one false positive at nighttime, while TC Det correctly
detects these and does not produce false positive detections. Task-conditioning
also helps eliminate one false positive in the daytime image.
4.4 Comparison with the state-of-the-art
In this section we compare our approaches with the state-of-the-art on KAIST.
Since our approach focuses on detection only in thermal images at test time,
we first compare with state-of-the-art single-modality detectors using only
visible or only thermal images. Then, we compare our approaches with state-of-
the-art multispectral detectors using both visible and thermal images.
Comparison with single-modality detectors. Table 2 compares our ap-
proaches with single-modality detectors that use thermal-only or visible-only
images at training and testing time. TC Det obtains the best results, with a
27.11% miss rate overall and a 10.31% miss rate at nighttime. Our results outperform
all existing single-modality methods by a large margin in all conditions (day,
night, and all). To the best of our knowledge, our detectors outperform all state-
of-the-art single-modality approaches on the KAIST dataset.
Table 2. Comparison with state-of-the-art single-modality approaches on KAIST
in terms of log-average miss rate (lower is better). Best results highlighted in
underlined bold, second best in bold.

Detectors                   MR all   MR day   MR night   Test images
FasterRCNN-C [25]           48.59    42.51    64.39      RGB
RRN+MDN [39]                49.55    47.30    54.78      RGB
FasterRCNN-T [25]           47.59    50.13    40.93      thermal
TPIHOG [3]                  -        -        57.38      thermal
SSD300 [16]                 69.81    -        -          thermal
Saliency Maps [12]          -        30.40    21.00      thermal
VGG16-two-stage [15]        46.30    53.37    31.63      thermal
ResNet101-two-stage [15]    42.65    49.59    26.70      thermal
Bottom-up [19]              35.20    40.00    20.50      thermal
Ours TC Visible             34.73    29.53    46.09      RGB
Ours TC Thermal             28.53    36.59    11.03      thermal
Ours TC Det                 27.11    34.81    10.31      thermal
Table 3. Comparison with state-of-the-art multimodal approaches in terms of log-
average miss rate on the KAIST dataset (lower is better). All approaches use both
visible and thermal spectra at training and test time, while ours use only thermal
imagery for testing. Results for methods indicated with * were computed using
detections provided by the authors. Best results highlighted in underlined bold,
second best in bold.

Method                   MR all   MR day   MR night   Detector Architecture
KAIST baseline [17]      64.76    64.17    63.99      ACF [8]
Late Fusion [38]         43.80    46.15    37.00      RCNN [13]
Halfway Fusion [25]      36.99    36.84    35.49      Faster R-CNN [34]
RPN+BDT [20]             29.83    30.51    27.62      VGG-16 + BF [35, 2]
IATDNN+IAMSS [14]        26.37    27.29    24.41      VGG-16 + RPN [35, 20]
IAF R-CNN* [23]          20.95    21.85    18.96      VGG-16 + Faster R-CNN [35, 34]
MSDS-RCNN [22]           11.63    10.60    13.73      VGG-16 + RPN [35]
MSDS sanitized* [22]     10.89    12.22    7.82       VGG-16 + RPN [35]
YOLO TLV [37]            31.20    35.10    22.70      YOLOv2 [32]
DSSD-HC [21]             34.32    -        -          DSSD [11]
GFD-SSD [43]             28.00    25.80    30.03      SSD [26]
Ours Thermal             31.06    37.34    16.69      YOLOv3 [33]
Ours TC Res Group        28.69    34.95    14.97      YOLOv3 [33]
Ours TC Det              27.11    34.81    10.31      YOLOv3 [33]
Comparison with multimodal detectors. Table 3 compares our detectors
with state-of-the-art multimodal approaches. Some multispectral methods using
both visible and thermal images for training and testing such as MSDS [22],
IAF [23] or IATDNN+IAMSS [14] are superior in terms of combined day/night
miss rate (all). This is due to the advantage they have in exploiting both visi-
ble and thermal imagery, which particularly benefits pedestrian detection during the
day. In fact, the authors of MSDS [22] proposed a set of manually “sanitized”
annotations for KAIST that correct problems in the original annotations and
their sanitized results at night-time (indicated by *) are better than the origi-
nal results due to misalignment correction. Another key difference is that most
state-of-the-art multispectral approaches use more complex, two-stage detection
architectures like Faster R-CNN (last column of table 3). Note, however, that
both TC Res Group and TC Det surpass many multimodal techniques, while
TC Det performs second-best at night.
We note that recent advances in the state-of-the-art on KAIST have been
made by augmenting and/or correcting the original dataset annotations. For ex-
ample, the authors of AR-CNN [42] completely re-annotated the KAIST dataset,
correcting localization errors, adding relationships, and labeling unpaired ob-
jects, resulting in significantly improved performance. Use of additional manual
annotations, however, renders their results incomparable with those of other
approaches, and they are thus excluded from our comparison.
Speed analysis. The average inference time for YOLOv3 is 28.57 milliseconds
per image (35 FPS). Our TC Det network requires 33.17 milliseconds per
image (30 FPS), and TC Res Group 35.01 milliseconds per image (29
FPS). Thus, task conditioning does not significantly increase the complexity of
the network – in fact our TC Det network requires less than five milliseconds
more for single-image inference compared to the original YOLOv3 detector.
5 Conclusions
In this paper we proposed a task-conditioned architecture for adapting visible-
spectrum detectors to the thermal domain. Our approach exploits the internal
learned representation of an auxiliary day/night classification network to inject
conditioning parameters at strategic points in the detector network. Our exper-
iments demonstrate that task-based conditioning of the YOLOv3 detection net-
work can significantly improve thermal-only pedestrian detection performance.
Task-conditioned networks preserve the efficiency of the single-shot YOLOv3
architecture and perform respectably even compared to some multispectral de-
tectors. However, they are outperformed by more complex, two-stage multispec-
tral detectors such as MSDS [22]. We think, however, that our task-conditioning
approach can also be fruitfully applied to such detectors by conditioning both
region proposal and classification subnetworks.
Acknowledgments
The authors thank NVIDIA for the generous donation of GPUs. This work was
partially supported by the project ARS01 00421: “PON IDEHA - Innovazioni
per l’elaborazione dei dati nel settore del Patrimonio Culturale.”
References
1. Angelova, A., Krizhevsky, A., Vanhoucke, V., Ogale, A., Ferguson, D.: Real-time
pedestrian detection with deep network cascades. In: Proc. of British Machine
Vision Conference (BMVC) (2015)
2. Appel, R., Fuchs, T., Dollár, P., Perona, P.: Quickly boosting decision trees –
pruning underachieving features early. In: International conference on machine
learning. pp. 594–602 (2013)
3. Baek, J., Hong, S., Kim, J., Kim, E.: Efficient pedestrian detection at nighttime
using a thermal camera. Sensors 17(8), 1850 (2017)
4. Benenson, R., Omran, M., Hosang, J., Schiele, B.: Ten years of pedestrian detec-
tion, what have we learned? In: Proc. of European Conference on Computer Vision
(ECCV) (2014)
5. Brazil, G., Yin, X., Liu, X.: Illuminating pedestrians via simultaneous detection
& segmentation. In: Proc. of IEEE International Conference on Computer Vision
(ICCV) (2017)
6. Cao, Y., Guan, D., Wu, Y., Yang, J., Cao, Y., Yang, M.Y.: Box-level segmen-
tation supervised deep neural networks for accurate and real-time multispectral
pedestrian detection. ISPRS Journal of Photogrammetry and Remote Sensing 150,
70–79 (2019)
7. Devaguptapu, C., Akolekar, N., M Sharma, M., N Balasubramanian, V.: Borrow
from anywhere: Pseudo multi-modal object detection in thermal imagery. In: Proc.
of IEEE Conference on Computer Vision and Pattern Recognition Workshops
(CVPR-W) (2019)
8. Dollár, P., Appel, R., Belongie, S., Perona, P.: Fast feature pyramids for object
detection. IEEE transactions on pattern analysis and machine intelligence 36(8),
1532–1545 (2014)
9. Dollar, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: An evalua-
tion of the state of the art. IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI) 34(4), 743–761 (2011)
10. Fritz, K., König, D., Klauck, U., Teutsch, M.: Generalization ability of region
proposal networks for multispectral person detection. In: Proc. of Automatic Tar-
get Recognition XXIX. vol. 10988. International Society for Optics and Photonics
(2019)
11. Fu, C.Y., Liu, W., Ranga, A., Tyagi, A., Berg, A.C.: DSSD: Deconvolutional single
shot detector. arXiv preprint arXiv:1701.06659 (2017)
12. Ghose, D., Desai, S.M., Bhattacharya, S., Chakraborty, D., Fiterau, M., Rahman,
T.: Pedestrian detection in thermal images using saliency maps. In: Proc. of IEEE
Conference on Computer Vision and Pattern Recognition Workshops (CVPR-W)
(2019)
13. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for ac-
curate object detection and semantic segmentation. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. pp. 580–587 (2014)
14. Guan, D., Cao, Y., Yang, J., Cao, Y., Yang, M.Y.: Fusion of multispectral data
through illumination-aware deep neural networks for pedestrian detection. Infor-
mation Fusion 50, 148–157 (2019)
15. Guo, T., Huynh, C.P., Solh, M.: Domain-adaptive pedestrian detection in thermal
images. In: Proc. of IEEE International Conference on Image Processing (ICIP)
(2019)
16. Herrmann, C., Ruf, M., Beyerer, J.: CNN-based thermal infrared person detection
by domain adaptation. In: Proc. of Autonomous Systems: Sensors, Vehicles, Secu-
rity, and the Internet of Everything. vol. 10643. International Society for Optics
and Photonics (2018)
17. Hwang, S., Park, J., Kim, N., Choi, Y., Kweon, I.: Multispectral pedestrian detec-
tion: Benchmark dataset and baseline. In: Proc. of IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (2015)
18. John, V., Mita, S., Liu, Z., Qi, B.: Pedestrian detection in thermal images using
adaptive fuzzy c-means clustering and convolutional neural networks. In: Proc.
of IAPR International Conference on Machine Vision Applications (MVA). pp.
246–249 (2015)
19. Kieu, M., Bagdanov, A.D., Bertini, M., Del Bimbo, A.: Domain adaptation for
privacy-preserving pedestrian detection in thermal imagery. In: Proc. of Interna-
tional Conference on Image Analysis and Processing (ICIAP) (2019)
20. Konig, D., Adam, M., Jarvers, C., Layher, G., Neumann, H., Teutsch, M.: Fully
convolutional region proposal networks for multispectral person detection. In: Proc.
of IEEE Conference on Computer Vision and Pattern Recognition Workshops
(CVPR-W) (2017)
21. Lee, Y., Bui, T.D., Shin, J.: Pedestrian detection based on deep fusion network us-
ing feature correlation. In: Proc. of Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference (APSIPA ASC) (2018)
22. Li, C., Song, D., Tong, R., Tang, M.: Multispectral pedestrian detection via simul-
taneous detection and segmentation. In: Proc. of British Machine Vision Confer-
ence (BMVC) (2018)
23. Li, C., Song, D., Tong, R., Tang, M.: Illumination-aware faster R-CNN for robust
multispectral pedestrian detection. Pattern Recognition 85, 161–171 (2019)
24. Li, J., Liang, X., Shen, S., Xu, T., Feng, J., Yan, S.: Scale-aware fast R-CNN for
pedestrian detection. IEEE Transactions on Multimedia (TMM) 20(4), 985–996
(2017)
25. Liu, J., Zhang, S., Wang, S., Metaxas, D.N.: Multispectral deep neural networks
for pedestrian detection. arXiv preprint arXiv:1611.02644 (2016)
26. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.:
SSD: Single shot multibox detector. In: European conference on computer vision.
pp. 21–37. Springer (2016)
27. Liu, W., Liao, S., Ren, W., Hu, W., Yu, Y.: High-level semantic feature detec-
tion: A new perspective for pedestrian detection. In: Proc. of IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) (2019)
28. Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint
arXiv:1411.1784 (2014)
29. Ouyang, W., Zeng, X., Wang, X.: Learning mutual visibility relationship for pedes-
trian detection with a deep model. International Journal of Computer Vision
(IJCV) 120(1), 14–27 (2016)
30. Perez, E., Strub, F., de Vries, H., Dumoulin, V., Courville, A.C.: FiLM: Visual
reasoning with a general conditioning layer. In: Proc. of AAAI Conference on
Artificial Intelligence (AAAI) (2017)
31. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep
convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
32. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: Proc. of IEEE
Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
33. Redmon, J., Farhadi, A.: YOLOv3: An incremental improvement. arXiv preprint
arXiv:1804.02767 (2018)
34. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detec-
tion with region proposal networks. In: Advances in neural information processing
systems. pp. 91–99 (2015)
35. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
36. Tian, Y., Luo, P., Wang, X., Tang, X.: Pedestrian detection aided by deep learning
semantic tasks. In: Proc. of IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (2015)
37. Vandersteegen, M., Van Beeck, K., Goedemé, T.: Real-time multispectral pedes-
trian detection with a single-pass deep neural network. In: Proc. of International
Conference Image Analysis and Recognition (ICIAR) (2018)
38. Wagner, J., Fischer, V., Herman, M., Behnke, S.: Multispectral pedestrian de-
tection using deep fusion convolutional neural networks. In: Proc. of European
Symposium on Artificial Neural Networks (ESANN) (2016)
39. Xu, D., Ouyang, W., Ricci, E., Wang, X., Sebe, N.: Learning cross-modal deep
representations for robust pedestrian detection. In: Proc. of IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) (2017)
40. Zhang, L., Lin, L., Liang, X., He, K.: Is faster R-CNN doing well for pedestrian
detection? In: Proc. of European Conference on Computer Vision (ECCV) (2016)
41. Zhang, L., Liu, Z., Chen, X., Yang, X.: The cross-modality disparity problem in
multispectral pedestrian detection. arXiv preprint arXiv:1901.02645 (2019)
42. Zhang, L., Zhu, X., Chen, X., Yang, X., Lei, Z., Liu, Z.: Weakly aligned cross-
modal learning for multispectral pedestrian detection. In: Proceedings of the IEEE
International Conference on Computer Vision. pp. 5127–5137 (2019)
43. Zheng, Y., Izzat, I.H., Ziaee, S.: GFD-SSD: Gated fusion double SSD for multi-
spectral pedestrian detection. arXiv preprint arXiv:1903.06999 (2019)
... Not only does the visual signature of the scene and objects within it change with seasons, but it also changes significantly between day and night. Thermal concept drift is an increasingly researched topic [1,2], and is crucial for real-world deployments of thermal vision systems. Typically, the aim is focused on identifying distinct concept-drift factors or assuming the presence of distinct distributions. ...
... Multi-task learning has become an increasingly popular method for training generalized image-recognition models [2][3][4][5][6][7], but it mostly focuses on using auxiliary branches that are somewhat task-adjacent, where an intuitive connection to the primary task can be drawn. Each task contributes to converting the latent representation into a more generalized representation, which often increases performance for all tasks [4,6]. ...
... In recent years, including auxiliary optimization tasks has been shown to greatly improve the performance of the downstream task, whether used as a pre-text task (as often seen with vision transformers [32][33][34][35]), or jointly optimized with the downstream task [36,37]. Using auxiliary tasks to guide a primary task by introducing aspects that cannot be properly captured in the downstream task's optimization objective, has shown great promise in improving the performance and generalization of a downstream task [2]. Depending on the model architecture and desired purpose of this weather-conditioned representation, it can be leveraged as a constraining parameter to enforce the inclusion of the auxiliary representation directly [2], thereby forcing the network to adjust to being aware of the contextual information induced. ...
Preprint
Full-text available
Deployments of real-world object-detection systems often experience a degradation in performance over time due to concept drift. Systems that leverage thermal cameras are especially susceptible because the respective thermal signatures of objects and their surroundings are highly sensitive to environmental changes. In this study, a conditioning method is investigated. The method aims to guide the training loop of thermal object detection systems by leveraging an auxiliary branch to predict the weather, while directly or indirectly conditioning the baseline detection system. Leveraging such an approach to train detection networks does not necessarily improve the performance of native architectures, however, it can be observed that conditioned networks manage to extract a signal from thermal images that guides the network to detect objects that baseline models miss. As the extracted signal appears to be quite noisy and very challenging to regress accurately, further work is needed to identify an ideal optimization vector.
... In particular, robust pedestrian detection in challenging scenarios is essential in autonomous driving application since it is directly related to human safety. However, modern RGB-based pedestrian detection methods failed to operate in challenging environments characterized by low illumination, rain, and fog [5][6][7][8]. To alleviate this problem, several methods [5,9,10] have emerged that leverage a thermal camera as a sensor complementary to the RGB camera already in use. ...
... F rgb ,F ther = I NSA(F rgb , F ther ) = FFN(Cross-Attn(F rgb ,F ther )) (8) In Equation (8), FFN means the feed-forward network. The output feature maps of the INSA module contain both modality-specific and cross-modality complementary information, improving the feature fusion and detection performance. ...
Article
Full-text available
Pedestrian detection is a critical task for safety-critical systems, but detecting pedestrians is challenging in low-light and adverse weather conditions. Thermal images can be used to improve robustness by providing complementary information to RGB images. Previous studies have shown that multi-modal feature fusion using convolution operation can be effective, but such methods rely solely on local feature correlations, which can degrade the performance capabilities. To address this issue, we propose an attention-based novel fusion network, referred to as INSANet (INtra-INter Spectral Attention Network), that captures global intra- and inter-information. It consists of intra- and inter-spectral attention blocks that allow the model to learn mutual spectral relationships. Additionally, we identified an imbalance in the multispectral dataset caused by several factors and designed an augmentation strategy that mitigates concentrated distributions and enables the model to learn the diverse locations of pedestrians. Extensive experiments demonstrate the effectiveness of the proposed methods, which achieve state-of-the-art performance on the KAIST dataset and LLVIP dataset. Finally, we conduct a regional performance evaluation to demonstrate the effectiveness of our proposed network in various regions.
... The proposed method provides promising results compared to the baseline on the KAIST benchmark. Kieu et al. [110] introduced a task-conditioned training method to help domain adaptation of YOLO v3 to the thermal spectrum. The primary detection network was augmented by adding an auxiliary classification task of day and nighttime thermal images. ...
Preprint
Full-text available
Pedestrian detection remains a critical problem in various domains, such as computer vision, surveillance, and autonomous driving. In particular, accurate and instant detection of pedestrians in low-light conditions and reduced visibility is of utmost importance for autonomous vehicles to prevent accidents and save lives. This paper aims to comprehensively survey various pedestrian detection approaches, baselines, and datasets that specifically target low-light conditions. The survey discusses the challenges faced in detecting pedestrians at night and explores state-of-the-art methodologies proposed in recent years to address this issue. These methodologies encompass a diverse range, including deep learning-based, feature-based, and hybrid approaches, which have shown promising results in enhancing pedestrian detection performance under challenging lighting conditions. Furthermore, the paper highlights current research directions in the field and identifies potential solutions that merit further investigation by researchers. By thoroughly examining pedestrian detection techniques in low-light conditions, this survey seeks to contribute to the advancement of safer and more reliable autonomous driving systems and other applications related to pedestrian safety. Accordingly, most of the current approaches in the field use deep learning-based image fusion methodologies (i.e., early, halfway, and late fusion) for accurate and reliable pedestrian detection. Moreover, the majority of the works in the field (approximately 48%) have been evaluated on the KAIST dataset, while the real-world video feeds recorded by authors have been used in less than six percent of the works.
... Chan et al. [21] proposed a privacy-preserving approach to crowd monitoring that did not require person detection but required special-purpose cameras that output low-level features. Kieu et al. [22] suggested using thermal cameras to protect privacy, but did not substantiate their claim that person identification is difficult or impossible. Bentafat et al. [6] provide a solution for real-time privacy-preserving video surveillance by encrypting the regions of faces, though they do not address any other attributes that can be used for recognition. ...
Article
Full-text available
Visual analysis tasks, including crowd management, often require resource-intensive machine learning models, posing challenges for deployment on edge hardware. Consequently, cloud computing emerges as a prevalent solution. To address privacy concerns associated with offloading video data to remote cloud platforms, we present a novel approach using adversarial training to develop a lightweight obfuscator neural network. Our method focuses on pedestrian detection as an example of visual analysis, allowing the transformation of video frames on the camera itself to retain only essential information for pedestrian detection while preserving privacy. Importantly, the obfuscated data remains compatible with publicly available object detectors, requiring no modifications or significant loss in accuracy. Additionally, our technique overcomes the common limitation of relying on labeled sensitive attributes for privacy preservation. By demonstrating the inability of pedestrian attribute recognition models to detect attributes in obfuscated videos, we validate the efficacy of our privacy protection method. Our results suggest that this scalable approach holds promise for enabling camera usage in video analytics while upholding personal privacy.
... The other is an end-to-end single-stage strategy, including YOLO series [16][17][18], SSD [19], RetinaNet [20], etc., which tries to solve the problem of object boundary box localization using regression methods and completes object detection and localization simultaneously without the need for additional region proposal steps. It significantly surpasses the twostage model in speed and can provide real-time detection, making it particularly well suited for scenarios with stringent real-time requirements, like running on edge devices [21]. Hence, more and more scholars focused on single-stage detection algorithms and developing more robust and efficient network models to design detectors. ...
Article
Full-text available
Thermal infrared detection technology can enable night vision and is robust in complex environments, making it highly advantageous for various fields. However, infrared images have low resolution and high noise, resulting in limited detailed information being available about the target object. This difficulty is further amplified when detecting small targets, which are prone to occlusion. In response to these challenges, we propose a model for infrared target detection designed to achieve efficient feature representation. Firstly, an interval sampling weighted (ISW) module is proposed, which strengthens the fusion network’s spatial relationship modeling, thereby elevating the model’s generalization capability across diverse target-density regions. Next, a detection head founded on 3D attention (TAHNet) is introduced, which helps the network more comprehensively understand the feature details of the target. This enhances the accuracy of the model in identifying the target object’s location, reduces false positives and false negatives, and optimizes the network’s performance. Furthermore, to our model, we introduce the C2f module to transfer gradient information across multiple branches. The features learned using diverse branches interact and fuse in subsequent stages, further enhancing the model’s representation ability and understanding of the target. Experimental outcomes validate the efficacy of the proposed model, showcasing state-of-the-art detection performance on FLIR and KAIST thermal infrared datasets and showing strong antiocclusion and robustness in complex scenes.
... Existing RGB-based works have found it hard to cope with the challenges brought by some harsh environments [7,8], such as rain, fog occlusion, and low-light conditions. Meanwhile, some works [9,10] have studied infrared-based object-detection methods because infrared has good penetrating ability and works well in low-light conditions [11,12]. However, infrared-based detectors are susceptible to interference from heat and highlight sources [13]. ...
Article
Full-text available
Object detection based on RGB and infrared images has emerged as a crucial research area in computer vision, and the synergy of RGB-Infrared ensures the robustness of object-detection algorithms under varying lighting conditions. However, captured RGB-IR image pairs typically exhibit spatial misalignment due to sensor discrepancies, leading to compromised localization performance. Furthermore, because the distributions of deep features from the two modalities are inconsistent, directly fusing multi-modal features weakens the feature difference between object and background, thereby degrading RGB-Infrared object-detection performance. To address these issues, we propose an adaptive dual-discrepancy calibration network (ADCNet) for misaligned RGB-Infrared object detection, comprising spatial-discrepancy and domain-discrepancy calibration. Specifically, the spatial discrepancy calibration module conducts an adaptive affine transformation to achieve spatial alignment of features. Then, the domain discrepancy calibration module separately aligns object and background features from the different modalities, making the object and background distributions in the fused features easier to distinguish and thereby enhancing the effectiveness of RGB-Infrared object detection. Our ADCNet outperforms the baseline by 3.3% and 2.5% in mAP50 on the FLIR and misaligned M3FD datasets, respectively. Experimental results demonstrate the superiority of our proposed method over state-of-the-art approaches.
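The abstract does not give the calibration module's internals; the general mechanism of predicting an affine transform and warping one modality's features onto the other can be sketched with a spatial-transformer-style layer. Architecture and names below are assumptions, not ADCNet's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineCalibration(nn.Module):
    """Sketch: regress 6 affine parameters from concatenated RGB/IR feature
    maps, then warp the infrared features onto the RGB grid."""
    def __init__(self, channels):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(2 * channels, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 6))
        # Initialize the regressor to the identity transform.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, rgb_feat, ir_feat):
        theta = self.loc(torch.cat([rgb_feat, ir_feat], dim=1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, ir_feat.size(), align_corners=False)
        return F.grid_sample(ir_feat, grid, align_corners=False)
```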
... Evaluation Settings. As in many works on pedestrian detection [11], [19], [42], we adopt the log-average Miss Rate (LAMR) as the evaluation metric, where the range of False Positives Per Image (FPPI) is set to [10^{-2}, 10^{0}] in log-space and the IoU threshold for calculating the miss rate is 0.5. Note that LAMR is denoted as MR in the remainder of this section for concision; a lower score is better. ...
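For reference, a common implementation of LAMR samples the miss rate at nine FPPI points evenly spaced in log-space over [10^{-2}, 10^{0}] and takes their geometric mean; conventions vary slightly across codebases. A NumPy sketch:

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Geometric mean of the miss rate sampled at nine log-spaced FPPI
    reference points. fppi, miss_rate: arrays tracing the MR-vs-FPPI curve."""
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)
    refs = np.logspace(-2.0, 0.0, num=9)
    samples = []
    for r in refs:
        # Miss rate at the largest FPPI not exceeding the reference point;
        # fall back to the first curve point if none qualifies.
        idx = np.where(fppi <= r)[0]
        samples.append(miss_rate[idx[-1]] if len(idx) else miss_rate[0])
    samples = np.clip(samples, 1e-10, None)   # guard against log(0)
    return np.exp(np.mean(np.log(samples)))
```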
Preprint
RGB-Thermal (RGB-T) pedestrian detection aims to locate pedestrians in RGB-T image pairs, exploiting the complementarity between the two modalities to improve detection robustness in extreme conditions. Most existing algorithms assume that the RGB-T image pairs are well registered, while in the real world they are not ideally aligned due to parallax or the different fields-of-view of the cameras. Pedestrians in misaligned image pairs may appear at different positions in the two images, which results in two challenges: 1) how to achieve inter-modality complementation using spatially misaligned RGB-T pedestrian patches, and 2) how to recognize unpaired pedestrians at the image boundary. To deal with these issues, we propose a new paradigm for unregistered RGB-T pedestrian detection, which predicts two separate pedestrian locations in the RGB and thermal images, respectively. Specifically, we propose a cross-modality proposal-guided feature mining (CPFM) mechanism to extract two precise fusion features representing the pedestrian in the two modalities, even if the RGB-T image pair is unaligned. This enables us to effectively exploit the complementarity between the two modalities. With the CPFM mechanism, we build a two-stream dense detector; it predicts the two pedestrian locations in the two modalities based on the corresponding fusion features mined by the CPFM mechanism. In addition, we design a data augmentation method, named Homography, to simulate the discrepancy in scales and views between images. We also investigate two non-maximum suppression (NMS) methods for post-processing. Favorable experimental results demonstrate the effectiveness and robustness of our method in dealing with unregistered pedestrians under different shifts.
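The exact parameterization of the Homography augmentation is not given in the abstract; one plausible sketch, assuming OpenCV and a hypothetical corner-jitter range, warps one modality with a random perspective transform to simulate parallax and field-of-view discrepancy:

```python
import cv2
import numpy as np

def random_homography_warp(img, max_shift=0.05, seed=None):
    """Jitter the four image corners by up to max_shift (fraction of the
    image size) and warp with the induced homography. The jitter range is an
    assumption, not the paper's setting."""
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = rng.uniform(-max_shift, max_shift, size=(4, 2))
    dst = (src + jitter * np.float32([w, h])).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(img, H, (w, h))
```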
... Multimodal object detection can be divided into three categories according to the fusion strategy: early-fusion, late-fusion, and mid-fusion [5,13,19,24,29]. ...
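Schematically, the three categories differ only in where the modalities meet. The toy modules below are placeholders standing in for real backbones and heads, not any particular detector:

```python
import torch
import torch.nn as nn

# Minimal stand-ins: one conv as "backbone", one conv as "head".
backbone_rgb = nn.Conv2d(3, 16, 3, padding=1)
backbone_ir = nn.Conv2d(1, 16, 3, padding=1)
backbone_early = nn.Conv2d(4, 16, 3, padding=1)
head = nn.Conv2d(16, 8, 1)

def early_fusion(rgb, ir):
    # Fuse at the input: stack channels before any feature extraction.
    return head(backbone_early(torch.cat([rgb, ir], dim=1)))

def mid_fusion(rgb, ir):
    # Fuse intermediate features from modality-specific backbones.
    return head(backbone_rgb(rgb) + backbone_ir(ir))

def late_fusion(rgb, ir):
    # Run two full pipelines and merge their outputs (here: averaging).
    return (head(backbone_rgb(rgb)) + head(backbone_ir(ir))) / 2

rgb, ir = torch.randn(1, 3, 64, 64), torch.randn(1, 1, 64, 64)
for f in (early_fusion, mid_fusion, late_fusion):
    print(f.__name__, f(rgb, ir).shape)
```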
Article
Full-text available
This review article comprehensively delves into the rapidly evolving field of domain adaptation in computer and robotic vision. It offers a detailed technical analysis of the opportunities and challenges associated with this topic. Domain adaptation methods play a pivotal role in facilitating seamless knowledge transfer and enhancing the generalization capabilities of computer and robotic vision systems. Our methodology involves systematic data collection and preparation, followed by the application of diverse assessment metrics to evaluate the efficacy of domain adaptation strategies. This study assesses the effectiveness and versatility of conventional, deep learning-based, and hybrid domain adaptation techniques within the domains of computer and robotic vision. Through a cross-domain analysis, we scrutinize the performance of these approaches in different contexts, shedding light on their strengths and limitations. The findings gleaned from our evaluation of specific domains and models offer valuable insights for practical applications while reinforcing the validity of the proposed methodologies.
Article
Full-text available
Detecting pedestrians in cluttered scenes is a challenging problem in computer vision. The difficulty is compounded when several pedestrians overlap in images and occlude each other. We observe, however, that the occlusion/visibility statuses of overlapping pedestrians provide a useful mutual relationship for visibility estimation: the visibility estimate of one pedestrian facilitates the visibility estimation of another. In this paper, we propose a mutual visibility deep model that jointly estimates the visibility statuses of overlapping pedestrians. The visibility relationship among pedestrians is learned from the deep model for recognizing co-existing pedestrians. The evidence of co-existing pedestrians is then used to improve single-pedestrian detection results. Compared with existing image-based pedestrian detection approaches, our approach has the lowest average miss rate on the Caltech-Train dataset and the ETH dataset. Experimental results show that the mutual visibility deep model effectively improves pedestrian detection results, leading to 6–15% improvements on multiple benchmark datasets.
Conference Paper
Full-text available
Thermal images are mainly used to detect the presence of people at night or in bad lighting conditions, but perform poorly in the daytime. To solve this problem, most state-of-the-art techniques employ a fusion network that uses features from paired thermal and color images. Instead, we propose to augment thermal images with their saliency maps, which serve as an attention mechanism for the pedestrian detector, especially during daytime. We investigate how such an approach improves pedestrian detection using only thermal images, eliminating the need for paired color images. For our experiments, we train Faster R-CNN for pedestrian detection and report the added effect of saliency maps generated using static and deep methods (PiCA-Net and R³-Net). Our best performing model results in an absolute reduction of miss rate by 13.4% and 19.4% over the baseline on day and night images, respectively. We also annotate and release pixel-level masks of pedestrians on a subset of the KAIST Multispectral Pedestrian Detection dataset, the first publicly available dataset for salient pedestrian detection.
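The abstract does not fix how the saliency map is attached to the thermal input; one simple realization, sketched below, builds a three-channel input for an RGB-pretrained detector from the thermal image, its saliency map, and their product. The cited paper may compose the channels differently.

```python
import numpy as np

def augment_with_saliency(thermal, saliency):
    """Compose a 3-channel detector input from a thermal image and its
    saliency map. thermal, saliency: (H, W) float arrays in [0, 1]."""
    attended = thermal * saliency           # saliency acts as soft attention
    return np.stack([thermal, saliency, attended], axis=-1)
```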
Article
Multispectral images of color-thermal pairs have been shown to be more effective than a single color channel for pedestrian detection, especially under challenging illumination conditions. However, there is still a lack of studies on how to fuse the two modalities effectively. In this paper, we deeply compare six different convolutional network fusion architectures and analyse their adaptations, enabling a vanilla architecture to obtain detection performance comparable to state-of-the-art results. Further, we discover that pedestrian detection confidences from color or thermal images are correlated with illumination conditions. With this in mind, we propose an Illumination-aware Faster R-CNN (IAF R-CNN). Specifically, an Illumination-aware Network is introduced to give an illumination measure of the input image. We then adaptively merge the color and thermal sub-networks via a gate function defined over the illumination value. Experimental results on the KAIST Multispectral Pedestrian Benchmark validate the effectiveness of the proposed IAF R-CNN.
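The gating idea in the abstract can be sketched as a small network mapping the RGB image to a scalar weight that mixes the two sub-networks' scores. The layer sizes below are assumptions; the real illumination-aware network and gate function are more elaborate.

```python
import torch
import torch.nn as nn

class IlluminationGate(nn.Module):
    """Sketch of illumination-aware gating: w near 1 trusts the color
    sub-network (bright scenes), w near 0 trusts the thermal one."""
    def __init__(self):
        super().__init__()
        self.illum_net = nn.Sequential(
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(3 * 8 * 8, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid())   # per-image illumination measure

    def forward(self, rgb_image, score_color, score_thermal):
        w = self.illum_net(rgb_image)
        return w * score_color + (1 - w) * score_thermal
```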
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
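As a rough sketch of the R-CNN recipe (warp each bottom-up region proposal to a fixed size, extract CNN features, score them per class): propose_regions and classifier_head below are hypothetical stand-ins for the paper's selective-search proposals and per-class SVMs, and the backbone choice is an assumption (torchvision >= 0.13).

```python
import torch
from torchvision.models import resnet18
from torchvision.transforms.functional import resized_crop

backbone = resnet18(weights="DEFAULT")
backbone.fc = torch.nn.Identity()          # expose 512-d pooled features
backbone.eval()

def rcnn_detect(image, propose_regions, classifier_head, top=100):
    detections = []
    for (x1, y1, x2, y2) in propose_regions(image)[:top]:
        # Warp each proposal to a fixed size, as R-CNN does.
        crop = resized_crop(image, y1, x1, y2 - y1, x2 - x1, [224, 224])
        with torch.no_grad():
            feat = backbone(crop.unsqueeze(0))   # CNN features per region
        scores = classifier_head(feat)           # one score per class
        detections.append(((x1, y1, x2, y2), scores))
    return detections
```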