Two-dimensional subspace alignment for convolutional activations adaptation
Hao Lu, Zhiguo Cao, Yang Xiao, Yanjun Zhu
National Key Laboratory of Science and Technology on Multi-Spectral Information Processing,
School of Automation, Huazhong University of Science and Technology, Wuhan 430074, PR China
Abstract
In real-world computer vision applications, many intrinsic and extrinsic variations can cause a significant domain shift. Although
deep convolutional models have provided us with better domain-invariant features, existing mechanisms to adapt convolutional
activations are still limited. Noticing that convolutional activations are intrinsically represented as tensors, in this paper we develop a two-dimensional subspace alignment (2DSA) approach based on 2D principal component analysis (PCA) to better adapt convolutional activations. Extensive experiments demonstrate the advantages of 2DSA over its counterpart SA in both effectiveness and efficiency. In particular, when trying to explain why 2DSA works well, we find that the best classification performance has low correlation with the global domain discrepancy measure. In an effort to find a better way to compare domains, we introduce within- and between-class domain divergence measures to characterize the class-level differences. The proposed measures somewhat shed
light on what a good alignment might be for classification. Furthermore, we also demonstrate a novel domain adaptation application
in agriculture and create a dataset for the problem.
Keywords: Visual domain adaptation, Subspace alignment, Convolutional activations, Two-dimensional PCA, Domain divergence
measure
1. Introduction
In real-world computer vision applications, many intrinsic
and extrinsic variations, such as color, pose, illumination, back-
ground, viewpoint, blurring, or image resolution, can cause a significant domain shift, so that a model built on the source domain may not perform well on data with different distributions from the target domain. Indeed, a stream of studies has reported that algorithm performance degrades significantly across datasets [1–3]. This is a typical problem called domain mismatch. A rich body of studies has attempted to alleviate this challenge under the names of covariate shift, class imbalance, dataset bias, transfer learning, multi-view analysis, and more recently, domain adaptation (DA) [4].
Deep convolutional neural networks (CNNs) have brought us
the state-of-the-art visual descriptor, which has benchmarked a series of computer vision tasks, such as image classification [5] and object detection [6]. Indeed, features can be more transferable when learned in deep networks [7, 8]. Strong evidence includes the impressive results in DA achieved with a deep CNN based approach [9], implying an important role of domain-invariant feature representation. That is, the feature matters.
The most common and effective way to adapt CNN features is to fine-tune an end-to-end CNN model so that the parameters can be adjusted to better fit the target dataset [6, 10]. Fine-tuning is good as long as we have free access to the supervision and
a sufficient number of training data. Yet, we intend to seek in
this paper whether there exists another mechanism to correct
this kind of shift. In particular, we consider the scenarios where
no supervision is provided in the target domain or the labeled
target data alone is deficient to build a good classifier, which is
exactly the case of DA.
DA is a frequently studied issue in statistics, machine learning, pattern recognition, natural language processing, and recently, in computer vision. Over the years, many theoretical methods have been developed to address this problem with a moderate degree of success [9, 11–15]. However, to our knowledge, most methods only formulate the problem in the vector-form paradigm. That is, the input features must be vectors. Noticing that convolutional activations are intrinsically represented as tensors, it may be more natural to model them as matrices or tensors, rather than vectors [16]. Also, it has been shown that, when using convolutional activations, we can prevent object deformation by feeding an image of arbitrary size [17] and reuse the features by building a mapping between the raw image and the feature map [18].
Recently, a subspace alignment (SA) based unsupervised DA approach [14] stands out due to its effectiveness and simplicity. Our work is built within this framework. Specifically, we propose to perform two-dimensional subspace alignment (2DSA). A 2DPCA [19] based approach is consequently developed to adapt convolutional activations effectively and efficiently. Compared with its counterpart SA, 2DSA requires less training data, and its parameter learning is more accurate and efficient. Experiments on several datasets validate the effectiveness of 2DSA and show that 2DSA significantly outperforms SA by large margins.
Accepted by Pattern Recognition
http://dx.doi.org/10.1016/j.patcog.2017.06.010
June 16, 2017
Figure 1. Three typical situations in subspace alignment based domain adaptation. Black denotes the source domain, and red the target. A marker denotes a specific class. The "alignment" indicates a transformation that moves the source subspace to the target one. The left is an ideal situation, the middle the situation occurring in the SA paradigm, and the right the 2DSA one. SA aligns the two domains well but mixes instances coming from different classes (target data cannot be classified correctly), whilst 2DSA only aligns the two domains moderately but preserves good margins between different classes (target data can still be separated linearly). This finding motivates us to ponder a fundamental question: to what extent is an alignment enough for classification?
In some cases, SA even worsens the classification performance.
We are interested in explaining why 2DSA works better. Our
analysis from the reconstruction error perspective shows that
2DPCA generates a better subspace than PCA (the reconstruc-
tion error of 2DPCA is lower than PCA). Statistically, when ex-
ploiting a global
H
H
-divergence [
20
] to measure the domain-
level discrepancy, we surprisingly find that results are beyond ex-
pectation. The best classification performance conversely yields
the worst
H
H
value. After visualizing the data distribution,
we observe two interesting patterns shown in Fig. 1. One is
that SA aligns two domains well but mixes instances coming
from dierent classes. The other is that 2DSA only aligns two
domains moderately but preserves good margins between dif-
ferent classes. This motivates us to ponder a fundamental issue:
to what extend is an alignment enough for classification? We
answer this question by giving a new perspective at local class
distributions. We believe that, a good alignment in classification
indeed needs to push two distributions of the same class close,
but more importantly, it should enlarge or at least preserve the
margins between dierent classes. To formalize this idea, two
novel domain discrepancy measures called within-class diver-
gence
Hw
Hw
and between-class divergence
Hb
Hb
are con-
sequently proposed. Dierent from the
H
H
-divergence that
only characterizes the domain-level discrepancy, the proposed
Hw
Hw
and
Hb
Hb
divergences are able to characterize the
class-level dierences and thus can be viewed as a class-level
extension of
H
H
-divergence. By measuring the domain dis-
crepancy from a fine-grained perspective, our results somewhat
shed light on what a good alignment might be for classification.
In addition, we further describe an interesting DA application
in agriculture. The application involves categorizing three types of maize tassel flowering status (MTFS): non-flowering, partially-flowering, and fully-flowering. A dataset termed MTFS3–DA is also constructed. The dataset includes 10 domains and 1500 images covering a 5-year timespan, 4 maize
cultivars and 3 geographical locations. Extensive experiments
on this dataset also show that 2DSA outperforms SA. We hope
this dataset could inspire interests from the pattern recognition
community to address cross-field challenges in agriculture.
Overall, the contributions of this paper include:
• 2DSA: a two-dimensional subspace alignment approach is developed for better convolutional activations adaptation. It is very effective, computationally efficient, and easy to implement;
• HwΔHw & HbΔHb: two novel divergence measures capable of quantifying within- and between-class variations are proposed to characterize the class-level domain discrepancy. They encourage new perspectives on cross-dataset generalization for classification;
• MTFS3–DA: a new dataset concerning three types of flowering status of maize tassels is created for cross-field evaluations in agriculture. It consists of 10 domains and 1500 images.
The dataset and source code are made available online. 1
2. Related work
DA is set in one of the possible settings of transfer learning [21]. Over the years, DA has been extensively studied in both theory and practice, such as probabilistic inference in statistics [22], generalization bounds in machine learning [20, 23], distribution analysis in pattern recognition [24, 25], as well as various applications in natural language processing [26–28] and computer vision [11, 12, 29–31]. Recent works in the computer vision field mainly focus on the visual recognition problem in either the unsupervised (only unlabeled data are used from the target domain) [9, 32] or the semi-supervised (a limited amount of labeled data are used from the target domain) [15, 30, 33] setting. Readers can refer to [4] for a comprehensive survey. In this paper,
1 The dataset and source code are made available at: https://sites.google.com/site/poppinace/.
Figure 2. The framework of subspace alignment based visual domain adaptation.
we concentrate on the most challenging case—unsupervised visual DA (some literature also refers to it as transductive transfer learning).
According to whether source labels are utilized in the optimization process of DA, we simply divide existing unsupervised DA approaches into two categories: domain-orientated and domain-classification-orientated. The first category only aims at the adaptation between two domains. This line of approaches usually seeks a way to build explicit connections or find implicit commonalities between two domains. Some representative works include TCA [34], SGF [12], GFK [13], SA [14], TJM [35] and LSSA [36]. Also, since current DA approaches are usually evaluated in the context of classification, the second category prefers to model the adaptation and classification jointly. This line of works often involves iterative optimization between the adaptation and classification taking source labels into account, expecting to achieve good classification performance and a fine overlap between domains simultaneously. Some works worth mentioning include (A)SYMM [11], ARCT [30], DAM [37], STM [38], MMDT [33], HFA [15], and recent deep learning based approaches (DDA [9] and DAN [39]). Our proposed method, 2DSA, belongs to the first category.
Our proposed method, 2DSA, belongs to the first category.
Our work is of particular relevance to subspace-based DA approaches. These works share the idea of exploiting low-dimensional data structures that are intrinsic to domains. In particular, [12] proposes sampling a finite number of intermediate subspaces and building geodesic flows to connect the source and target domains. Gong et al. [13] extend the above work by constructing a geodesic flow kernel that projects image representations into infinite-dimensional feature vectors, expecting to encapsulate incremental changes between subspaces that underlie the difference and commonality between domains. Different from these two ideas, Fernando et al. [14] argue that it is more appropriate to align the two domains directly. The basic idea is to learn a transformation matrix by minimizing the Bregman matrix divergence. Intuitively, the transformation matrix defines a movement that potentially pushes the source subspace close to the target one. More recently, [36] further extends [14] in a landmarks-based kernelized paradigm via selecting potential landmarks and incorporating further non-linearity with a Gaussian kernel.
Our work is closely related to [14], because we are built in the same subspace alignment based framework. The main difference, however, is that SA [14] performs stronger feature-wise alignment, while our method, 2DSA, only performs partial alignment because the subspace analysis is carried out on a smaller space. Our analysis in Sec. 4 shows that it is adequate to move two subspaces only close to each other to achieve superior classification results. Moreover, 2DSA is very fast when tackling high-dimensional data, such as convolutional activations, which facilitates parameter tuning during cross validation.
3. Subspace alignment based visual domain adaptation
We start by reviewing the subspace alignment based DA framework [14] to give readers a global view. We then discuss the seminal vector-form formulation of SA in Sec. 3.1. Next, in Sec. 3.2, we present our matrix-form extension 2DSA in detail. In particular, we follow the conventional nomenclature of denoting vectors by lowercase boldface letters, like x, matrices by uppercase boldface letters, like X, and tensors by calligraphic letters, like 𝒳. We allow the input image to be of arbitrary size,
so a simple spatial pooling is applied as a normalization step,
ensuring the consistency of dimensionality. Concretely, any convolutional activations of size H×W×D will be normalized to K×K×D by max pooling. Note that, to preserve spatial information, pooled activations are not vectorized in the fashion of spatial pyramid pooling (SPP) in [17]. Intuitively, this process is illustrated in Fig. 3.
Figure 3. Illustration of spatial pooling normalization. Any activations within a spatial bin are pooled by the max operation.
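To make the normalization step concrete, the following minimal sketch (our own illustration, not the authors' released code) max-pools an arbitrary-sized activation tensor into a fixed K×K×D grid; the function name and the evenly spaced bin edges are our assumptions.

```python
# A minimal sketch of the spatial pooling normalization: an H x W x D
# activation tensor is max-pooled into a fixed K x K x D grid regardless of
# the input spatial size. H and W are assumed to be >= K.
import numpy as np

def spatial_pool(activation, K=6):
    H, W, D = activation.shape
    h_edges = np.linspace(0, H, K + 1, dtype=int)   # bin boundaries along height
    w_edges = np.linspace(0, W, K + 1, dtype=int)   # bin boundaries along width
    pooled = np.zeros((K, K, D), dtype=activation.dtype)
    for i in range(K):
        for j in range(K):
            cell = activation[h_edges[i]:h_edges[i + 1],
                              w_edges[j]:w_edges[j + 1], :]
            pooled[i, j, :] = cell.max(axis=(0, 1))  # max over the spatial bin
    return pooled

# Example: a CONV5 map from an arbitrary-sized image becomes 6 x 6 x 512.
print(spatial_pool(np.random.rand(13, 17, 512), K=6).shape)  # (6, 6, 512)
```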
The framework of subspace alignment based visual DA is shown in Fig. 2. The scenario is that we use the training data from the source domain to generate the subspace spanned by Xs, and data from the target domain to generate Xt (Xs and Xt are generated by PCA in [14] and by 2DPCA in 2DSA, which will be explained later). Yet, the domain shift between Xs and Xt is quite large (Δst ≫ 0), so the subspace Xs is aligned by M to correct this shift. Conceptually, M defines a movement that pushes Xs close to Xt. The resulting aligned subspace is denoted by Xa (Xa = XsM). At this time, Xa looks similar to Xt (Δat ≈ 0). Finally, labeled instances from the source domain are projected by Xa and are used to train a linear SVM at the training stage. At the test stage, unlabeled instances from the target domain are projected by Xt and are predicted with the learned model. The more appropriate an alignment is, the better the classification results should be.
When learning the transformation matrix M, [14] chooses to minimize the following Bregman matrix divergence:
$$F(\mathbf{M}) = \left\| \mathbf{X}_s \mathbf{M} - \mathbf{X}_t \right\|_F^2\,, \qquad (1)$$
where ‖·‖_F denotes the Frobenius norm. Under this paradigm, a closed-form solution can be obtained as M = Xs^T Xt, and Xa = Xs Xs^T Xt.
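As an illustration, a minimal sketch of this closed-form alignment (our own code, assuming the PCA bases are stored column-wise; the helper name is an assumption) could look as follows.

```python
# A minimal sketch of the SA closed-form solution: M = Xs^T Xt and
# Xa = Xs Xs^T Xt, with Xs and Xt the top-d PCA bases (column-wise) of the
# source and target data.
import numpy as np
from sklearn.decomposition import PCA

def subspace_alignment(source, target, d):
    """source, target: (n_samples, n_features) arrays; d: subspace dimension."""
    Xs = PCA(n_components=d).fit(source).components_.T   # (n_features, d)
    Xt = PCA(n_components=d).fit(target).components_.T   # (n_features, d)
    M = Xs.T @ Xt            # closed-form alignment matrix
    Xa = Xs @ M              # aligned source subspace, Xa = Xs Xs^T Xt
    # Source data are projected with Xa and target data with Xt before a
    # linear SVM is trained on the projected labeled source instances.
    return source @ Xa, target @ Xt
```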
3.1. Problems in vector-form formulation
In the context of the vector-form formulation, each K×K×D tensor activation 𝒳 has to be vectorized into a long vector x of size K²D×1 (note that we have restricted our object to convolutional activations). However, the resulting vectorized representations are high-dimensional. When applying PCA to generate a subspace, we need to solve an SVD on an extremely large matrix of size K²D×K²D, but solving a high-dimensional SVD is quite slow. More importantly, it is not tractable in practice, because the DA problem is exactly the case in which we do not have enough training data from the target domain to get an exact solution of the SVD. For instance, assume that we only have N (N ≪ M, M = K²D) training instances, denote them as a_i ∈ R^M, i = 1, ..., N, and combine them in a matrix A ∈ R^{M×N}. The corresponding covariance matrix G_sa can be derived as
$$\mathbf{G}_{sa} = \frac{1}{N}\mathbf{A}\mathbf{A}^{T}. \qquad (2)$$
Algorithm 1 2DSA: Two-dimensional Subspace Alignment
Input: Source features Fs, Target features Ft, Source labels Ls, Subspace dimensionality d
Output: Target labels Lt
1: Xs ← 2DPCA(Fs, d)
2: Xt ← 2DPCA(Ft, d)
3: Xa ← Xs Xs^T Xt
4: Pa ← Fs Xa
5: Pt ← Ft Xt
6: Lt ← SVM(Pa, Pt, Ls)
However, note that rank(G_sa) = rank(A A^T) = rank(A) ≤ N, which means we will only get at most N nonzero eigenvalues when solving the SVD on G_sa. In other words, the exact solution is limited by the number of training data, and an appropriate subspace may not be generated (our empirical study in Sec. 6.4 justifies this point). In fact, according to the widely-cited rule of thumb in [40], we expect to have at least 10 times as many training samples as the feature dimensionality. Therefore, we argue that directly aligning vector-form convolutional activations may not be a good choice. Inspired by [16], this motivates us to reconsider modeling them in their intrinsic structure.
3.2. 2DSA: matrix formulation with 2DPCA
2DSA formulates the problem with matrix-form convolutional activations. Specifically, we resort to 2DPCA [19] to generate subspaces. First, each tensor activation of size K×K×D is reshaped into a D×K² matrix. Given a set of matrix-form descriptors A_i ∈ R^{D×K²}, i = 1, ..., N, the covariance matrix G_2dsa can be evaluated as
$$\mathbf{G}_{2dsa} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{A}_i^{T}\mathbf{A}_i\,, \qquad (3)$$
where G_2dsa ∈ R^{K²×K²}. In a physical sense, G_2dsa actually models the global dependency between different filter activations across all pair-wise spatial locations. By solving the SVD on G_2dsa, all feature maps share their eigenvectors instead of having eigenvectors in a cube of features (many 2D feature maps). It is efficient to derive Xs and Xt in the corresponding domains, because K² is usually a small value. Also, since K² is small, 2DSA does not require a substantial amount of training data. We can then reuse Eq. 1 to compute the transformation matrix and align the subspace in the same vein. Notice that the orthogonality constraint in both PCA and 2DPCA is important to preserve good class separations in their subspace representations. The pseudo-code of this approach is summarized in Algorithm 1, which is analogous to the algorithm presented in [14].
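For readers who prefer code to pseudo-code, the following minimal sketch (our own, not the authors' release) mirrors Algorithm 1, under the assumption that each pooled activation has already been reshaped to a D×K² matrix; all function and variable names are illustrative only.

```python
# A minimal sketch mirroring Algorithm 1. two_d_pca eigen-decomposes the
# small K^2 x K^2 covariance of Eq. (3).
import numpy as np
from sklearn.svm import LinearSVC

def two_d_pca(samples, d):
    """samples: (N, D, K^2) stack of matrix descriptors; returns a (K^2, d) basis."""
    G = np.einsum('nij,nik->jk', samples, samples) / samples.shape[0]  # (1/N) sum_i Ai^T Ai
    eigval, eigvec = np.linalg.eigh(G)        # ascending eigenvalues
    return eigvec[:, ::-1][:, :d]             # top-d eigenvectors

def two_d_subspace_alignment(Fs, Ft, Ls, d):
    """Fs, Ft: (N, D, K^2) source/target features; Ls: source labels."""
    Xs, Xt = two_d_pca(Fs, d), two_d_pca(Ft, d)
    Xa = Xs @ Xs.T @ Xt                       # aligned source subspace
    Pa = (Fs @ Xa).reshape(len(Fs), -1)       # project and flatten: (N, D*d)
    Pt = (Ft @ Xt).reshape(len(Ft), -1)
    clf = LinearSVC().fit(Pa, Ls)             # train on projected source data
    return clf.predict(Pt)                    # predicted target labels Lt
```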
In both formulations, we only need to tune one hyperparameter d that controls the dimensionality of the subspace. To address this, we choose to leverage the theoretical bound deduced by [14] to select the maximum dimensionality d_max and guide the selection process. According to the variant of the consistency theorem [14], given a confidence δ > 0 and a fixed deviation γ > 0, d_max can be selected if it satisfies
$$\left(\lambda^{\min}_{d_{\max}} - \lambda^{\min}_{d_{\max}+1}\right) \geq \left(1 + \sqrt{\frac{\ln(2/\delta)}{2}}\right)\left(\frac{16\, d^{3/2} B}{\gamma \sqrt{n_{\min}}}\right), \qquad (4)$$
where (λ^min_{d_max} − λ^min_{d_max+1}) = min[(λ^s_d − λ^s_{d+1}), (λ^t_d − λ^t_{d+1})], and λ^b_a is the a-th eigenvalue (in descending order) computed from domain b. B is selected so that for any vector x, ‖x‖ ≤ B. n_min = min(N_s, N_t), where N_s and N_t are the numbers of training data in the source and target domains, respectively. Once d_max is identified, for any d ≤ d_max, one can get a reliable solution of M in Eq. 1.
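A minimal sketch of how this selection rule might be applied is given below; the particular values of γ and δ are placeholders, and the scan-and-keep-largest strategy is our reading of the criterion, not the authors' code.

```python
# A minimal sketch of selecting d_max with the bound in Eq. (4): scan d and
# keep the largest d for which the smaller of the two consecutive eigenvalue
# gaps still exceeds the bound. gamma and delta values are placeholders.
import numpy as np

def select_dmax(eig_s, eig_t, n_min, B, gamma=1e3, delta=0.1):
    """eig_s, eig_t: eigenvalues (descending) of the source/target covariance."""
    d_max = 1
    for d in range(1, min(len(eig_s), len(eig_t))):
        gap = min(eig_s[d - 1] - eig_s[d], eig_t[d - 1] - eig_t[d])
        bound = (1 + np.sqrt(np.log(2 / delta) / 2)) * \
                (16 * d ** 1.5 * B) / (gamma * np.sqrt(n_min))
        if gap >= bound:
            d_max = d
    return d_max
```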
4. Domain discrepancy analysis
In this section, we draw upon domain divergence measures to
analyze the domain discrepancy.
4.1. Quantifying domain discrepancy based on the HΔH-divergence
According to our experiments, we find that 2DSA achieves
higher classification accuracy than SA. In this section, we at-
tempt to explain why it works. From the statistical perspective, a
common way is to use a distribution measure to estimate the do-
main discrepancy. The pioneering work of Ben-David et al. [20, 23]
established the theoretical risk bounds for DA. Since our analy-
sis highly depends on these developments, we begin with a brief
introduction to their theoretical results.
Following Theorem 2 in [23], given a hypothesis space H of VC-dimension d̃, and instance sets U_s, U_t, each of size m̃, sampled i.i.d. from distributions D_s and D_t respectively, then with probability at least 1 − δ, for every h ∈ H, the corresponding generalization error on the target set can be bounded as
$$\epsilon_t(h) \leq \epsilon_s(h) + \frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(U_s, U_t) + 4\sqrt{\frac{2\tilde{d}\log(2\tilde{m}) + \log(2/\delta)}{\tilde{m}}} + \tilde{\lambda}\,, \qquad (5)$$
where ε_s(h) is the source error, and λ̃ equals the combined error ε_t(h) + ε_s(h) of the ideal joint hypothesis, which can be supposed to be a negligible term in the case of DA. The bound shows that the source error and d̂_HΔH(U_s, U_t) (also called the HΔH-divergence) are the most relevant quantities in computing the target error. In particular, we are interested in quantifying d̂_HΔH(U_s, U_t), because we may understand why 2DSA works better if the performance correlates well with this measure. Next, we shall give a first look at its counterpart, the H-divergence d_H(D_s, D_t), which plays a vital role in the rest of our analysis.
d_H(D_s, D_t) is also known as the A-distance or total variation distance derived from the statistical distance family, which is used to measure the difference between two probability distributions. Formally, it is defined in [20] as
$$d_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) = 2\sup_{h\in\mathcal{H}}\left|P_s(h) - P_t(h)\right|, \qquad (6)$$
where P_s(h) and P_t(h) denote the probability of event h under distributions D_s and D_t, respectively. Intuitively, it describes the largest possible difference between the probabilities that two probability distributions can assign to the same event. With these notions, the symmetric difference hypothesis space HΔH can be further defined [23] as
$$\mathcal{H}\Delta\mathcal{H} = \left\{g(\mathbf{x}) \mid g(\mathbf{x}) = h(\mathbf{x}) \oplus h'(\mathbf{x})\right\}, \quad h, h' \in \mathcal{H}\,, \qquad (7)$$
where ⊕ denotes the XOR operation. In other words, g(x) will be positive in HΔH if and only if a pair of hypotheses h(x) and h′(x) disagree with each other. Thus, d_HΔH(D_s, D_t) means computing the A-distance over the symmetric difference hypothesis space. However, directly computing d_HΔH(D_s, D_t) is not tractable in practice, so an alternative is to compute its empirical version d̂_HΔH(U_s, U_t). In particular, estimating d̂_HΔH(U_s, U_t) requires learning a linear classifier ĥ to see whether source and target instances can be differentiated. More specifically, it involves the following steps:
Step 1. Pseudo-labeling the source and target instances with +1 and −1;
Step 2. Randomly sampling two sets of instances as the training and test set, respectively;
Step 3. Learning a linear classifier ĥ on the training set and verifying its performance on the test set;
Step 4. Estimating the distance as d̂_HΔH(U_s, U_t) = 2(1 − 2·err(ĥ)) [20], where err(ĥ) is the test error.
If two distributions perfectly overlap with each other, err(ĥ) → 0.5, and d̂_HΔH(U_s, U_t) → 0. Conversely, if two distributions have large enough margins, err(ĥ) → 0, and d̂_HΔH(U_s, U_t) → 2. Therefore, d̂_HΔH(U_s, U_t) ∈ [0, 2]. The lower the value is, the better the two distributions align. In other words, a low divergence value should imply high classification performance.
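The four steps above can be sketched as follows (our own illustration with scikit-learn; the split ratio and classifier settings are assumptions, not the authors' exact setup).

```python
# A minimal sketch of Steps 1-4: pseudo-label the two domains, train a linear
# classifier to separate them, and convert its test error into the empirical
# divergence 2 * (1 - 2 * err).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def proxy_domain_divergence(source, target, test_size=0.5, seed=0):
    """source, target: (n, n_features) arrays of (projected) activations."""
    X = np.vstack([source, target])
    y = np.hstack([np.ones(len(source)), -np.ones(len(target))])     # Step 1
    X_tr, X_te, y_tr, y_te = train_test_split(                        # Step 2
        X, y, test_size=test_size, random_state=seed, stratify=y)
    clf = LinearSVC().fit(X_tr, y_tr)                                 # Step 3
    err = 1.0 - clf.score(X_te, y_te)
    return 2.0 * (1.0 - 2.0 * err)                                    # Step 4
```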
Now we can empirically evaluate the domain discrepancy of SA and 2DSA. To our surprise, this measure does not correlate with the classification performance. Fig. 4 illustrates a typical case of adapting images from Amazon to Caltech (see Sec. 6.1 for details). The highest classification performance does not correspond to the lowest HΔH measure. According to the visualization of the data distributions, we observe that both approaches do push the same class of different domains close to each other (looking at the "+" class), and the classes aligned by SA generally overlap better than those aligned by 2DSA (looking at two of the other classes, for example). Why, then, does SA achieve inferior classification accuracy? This interesting phenomenon motivates us to ponder a fundamental issue: to what extent is an alignment enough for classification? We shall give our answer in the next subsection.
4.2. Measuring domain discrepancy with local class divergence
Note that both SA and 2DSA are global alignments in the sense that all the training data are used to generate the subspaces; the difference is that SA performs stronger feature-wise adaptation. However, if an alignment is too strong, it may even align data coming from different classes, resulting in the cases shown by the yellow circles in Fig. 4. That is, data from different classes are intermixed in the SA adaptation. In this case, the alignment makes no sense. In addition, let us revisit the
[Figure 4 panel titles: SA Adaptation, HΔH = 1.23, Recognition Accuracy = 58.14; 2DSA Adaptation, HΔH = 1.45, Recognition Accuracy = 78.93.]
Figure 4. Category-specific data visualization using t-SNE [41] over a typical DA task from Amazon (red) to Caltech (black) in the Office–Caltech10 dataset. HΔH and recognition accuracy are indicated in each sub-figure title. Each category is denoted by a certain type of marker (the Office–Caltech10 dataset has 10 categories).
Table 1. Cultivar information of each sequence in the MTFS3–DA dataset.
Sequence         Jundan No.20   Wuyue No.3   Nongda No.108   Zhengdan No.958
Zhengzhou 2010        X             —             —               —
Zhengzhou 2011        X             —             —               —
Zhengzhou 2012        —             —             —               X
Taian 2010–1          —             X             —               —
Taian 2010–2          —             X             —               —
Taian 2011–1          —             —             X               —
Taian 2011–2          —             —             X               —
Taian 2012–1          —             —             —               X
Taian 2012–2          —             —             —               X
Gucheng 2014          —             —             —               X
“+” class in Fig. 4. In both the SA and 2DSA scenarios, this class is aligned only moderately, but if we classify the data, it may turn out to be the most easily separated class. Therefore, our point is that in classification we do not actually need to enforce two domains to overlap exactly with each other; favorable performance can be achieved as long as classes keep enough margins from other classes. Hence, as per these observations, we deem that, in the context of classification, it is adequate for the same class to be aligned only moderately close across domains and for different classes to keep large enough margins.
To formalize our idea, two novel domain discrepancy measures, called the within-class divergence HwΔHw and the between-class divergence HbΔHb, are proposed to characterize the class-level differences. A natural way to characterize these differences is to compute a distance over specific distributions. Let us denote the within-class and between-class distances as d_w(P^i_s, P^i_t) and d_b(P^i_{s,t}, P^j_{s,t}), respectively, where the superscript denotes the class and the subscript the domain. Thus, it is clear that d_w(P^i_s, P^i_t) is computed from a certain class between the two domains, and d_b(P^i_{s,t}, P^j_{s,t}) is computed by considering both domains as a whole between different classes. Moreover, we further impose two kinds of constraints on the distances:
$$d_w(P^i_s, P^i_t) < \gamma_w\,, \qquad (8)$$
$$d_b(P^i_{s,t}, P^j_{s,t}) > \gamma_b\,, \qquad (9)$$
where γ_w and γ_b ensure a relatively small within-class distance and a large enough between-class distance, respectively. With these two inequalities, HwΔHw and HbΔHb can be expressed by incorporating Eq. 8 and Eq. 9 into hinge-loss-like formulations:
$$\mathcal{H}_w\Delta\mathcal{H}_w = \frac{1}{C}\sum_{i=1}^{C}\max\big(0,\, d_w(P^i_s, P^i_t) - \gamma_w\big)\,, \qquad (10)$$
$$\mathcal{H}_b\Delta\mathcal{H}_b = \frac{1}{C(C-1)}\sum_{i=1}^{C}\sum_{j=1,\, j\neq i}^{C}\max\big(0,\, \gamma_b - d_b(P^i_{s,t}, P^j_{s,t})\big)\,. \qquad (11)$$
We can see that only those distances that violate the inequality constraints contribute losses to the measures. Intuitively,
Figure 5. Examples of maize tassel images in the MTFS3–DA dataset from 10 different fields. In each field, from left to right, images denote the flowering status of non-flowering, partially-flowering, and fully-flowering, respectively. Images are rescaled for better viewing.
HwΔHw assesses how well two distributions locally align, and HbΔHb scores how well an alignment suits classification. Also, we observe that the larger γ_w and the smaller γ_b are, the looser those inequalities constrain. Intuitively, how small should d_w(P^i_s, P^i_t) be? We think it should not exceed γ_w, so that data from the same class are close enough and have a high probability of being classified correctly. Meanwhile, how large should d_b(P^i_{s,t}, P^j_{s,t}) be? We believe it should be at least larger than γ_b, so that data from different classes can be separated easily. As a consequence, when γ_w gradually decreases and γ_b gradually increases, two kinds of curves can be drawn to demonstrate the domain discrepancy under various distance levels.
To make a direct comparison with the HΔH divergence, we choose to estimate d_w(P^i_s, P^i_t) and d_b(P^i_{s,t}, P^j_{s,t}) in a similar vein to d̂_HΔH(U_s, U_t). In addition, when plotting the curves, we can conveniently leverage the numerical range of the A-distance to reduce one parameter by setting γ_b = γ and γ_w = 2 − γ, where γ gradually increases in the interval [1, 2]. In Sec. 6.5, we show that HbΔHb correlates well with the classification accuracy, and HwΔHw is also consistent with the global HΔH. Our results imply that HwΔHw can be seen as a local version of the HΔH measure, and HbΔHb further extends HΔH by endowing it with the ability to measure local variations between classes.
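A minimal sketch of how Eqs. 10 and 11 could be evaluated is given below; the dist argument stands for any pairwise distribution-distance estimator (for instance, the proxy divergence sketched in Sec. 4.1), the parameterization γ_b = γ, γ_w = 2 − γ follows the text, and all function and variable names are ours.

```python
# A minimal sketch of the within-class (Eq. 10) and between-class (Eq. 11)
# divergence measures. `dist` estimates a distance between two sample sets.
import numpy as np

def class_level_divergences(Ps, Pt, ys, yt, gamma, dist):
    """Ps, Pt: projected source/target data; ys, yt: their labels."""
    classes = np.unique(ys)
    C = len(classes)
    gamma_w, gamma_b = 2.0 - gamma, gamma
    # Within-class divergence (Eq. 10): same class, source vs. target.
    hw = np.mean([max(0.0, dist(Ps[ys == c], Pt[yt == c]) - gamma_w)
                  for c in classes])
    # Between-class divergence (Eq. 11): class i vs. class j, domains pooled.
    hb = 0.0
    for i in classes:
        for j in classes:
            if i == j:
                continue
            Pi = np.vstack([Ps[ys == i], Pt[yt == i]])
            Pj = np.vstack([Ps[ys == j], Pt[yt == j]])
            hb += max(0.0, gamma_b - dist(Pi, Pj))
    hb /= C * (C - 1)
    return hw, hb
```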
5. MTFS3–DA dataset
The image acquisition device is described in [42]. In all, 10 maize sequences are collected to construct the MTFS3–DA dataset. The dataset covers a 5-year timespan from 2010 to 2014, 4 different maize cultivars (Wuyue No.3, Jundan No.20, Nongda No.108 and Zhengdan No.958), and 3 different geographical locations (Zhengzhou, Henan province, China; Taian, Shandong province, China; and Gucheng, Hebei province, China). In practice, cross-field domain shifts in agriculture are mainly caused by these three factors. The information of each sequence is summarized in Table 1.
As the camera monitors the growth of maize from the tasseling stage to the flowering stage (two critical growth stages of maize), we find that maize tassels exhibit three types of flowering status [43]: the initial non-flowering status, the intermediate partially-flowering status, and the final fully-flowering status. Some example images of each sequence are illustrated in Fig. 5. We observe that there exist only subtle textural differences between the different types of flowering status, so it can be viewed as a typical cross-domain textural categorization problem. We hope this problem can inspire interest from the pattern recognition community in addressing cross-field challenges in agriculture.
Concretely, we choose to leverage the off-the-shelf bounding box annotations released in our previous work [44] to crop the tassel images from the full-resolution images (extra annotations have been made on the Gucheng 2014 sequence). By doing this, we can relieve the influence of the background as much as possible, and it can also be viewed as a coarse pose normalization. In addition, an agrometeorological observer with more than 10 years of experience is invited to help us annotate all sub-images to ensure the correctness of the labels. For each sequence, we manually select 50 images from each class. In all, we have 150 images in each visual field and 1500 images in the MTFS3–DA dataset. Notice that the dataset originally released in [44] is mainly developed for the evaluation of the detection problem and does not involve any image-level annotations, while the MTFS3–DA dataset is tailored to the DA problem and is set in the context of visual recognition.
6. Experiments and discussion
We first evaluate our approach in the context of visual recognition on standard DA datasets and follow the same experimental protocol as in [11, 13, 14]. In addition, we also perform evaluations on other widely-used image classification datasets and our constructed MTFS3–DA dataset. Along with these numerical results, we further present empirical studies to explain why our method works.
6.1. Experimental dataset and protocol
Office–Caltech10 dataset. The Office–Caltech10 dataset [13] extends the Office31 dataset [11] by adding another Caltech domain, leading to 4 domains: Amazon, DSLR, webcam, and Caltech. 10 common categories are chosen from these domains, resulting in about 2500 images. Overall, we have 12 DA problems.
Office31 dataset. The Office31 dataset is originally introduced by [11]. It consists of 31 categories and 3 domains. We add another 5 images downloaded from the Internet with the same image resolution to the ruler category of the DSLR domain (only
7 images are contained in the original dataset) so that experiments can be conducted under the same protocol. This dataset has 6 DA problems.
Figure 6. Illustration of selecting a subspace dimensionality with the guide of the theoretical bound. The curves show the theoretical bound and the eigenvalue difference λ^min_d − λ^min_{d+1} against the subspace dimensionality.
ImageNet–VOC2007 dataset. We also evaluate our method on the widely-used ImageNet and PASCAL VOC2007 datasets. We choose the same 20 categories as the VOC2007 dataset from ImageNet 2012 to constitute the source domain, and the VOC2007 dataset is regarded as the target domain. Since the categories included are very different from those of the above datasets, experiments performed on this dataset can somewhat demonstrate the generality of our method.
MTFS3–DA dataset. Since our dataset comprises 10 different domains, it leads to a total of A^2_10 = 90 different DA problems. Instead of blindly evaluating all DA problems, we gradually increase the domain shift and organize experiments in a hierarchical manner (see Sec. 6.3 for details). For short, each domain is denoted by {Location}{Year}{Cultivar}{Sequence Number}, where the Sequence Number only appears in the Taian sequences. For instance, the Zhengzhou 2010 domain is denoted by Z10J, and the Taian 2011–1 domain by T11N1.
Experimental protocol. Each DA problem is denoted by Source→Target. For the Office–Caltech10, Office31 and our MTFS3–DA datasets, the average multi-class recognition accuracy across 10 categories over 20 trials is reported on the target domain. In each trial, 20 images are randomly sampled from each category of the source domain as the training set (8 images if the source domain is webcam or DSLR), and the target data are used during both the training and testing stages. Note that the experimental protocol we use on the Office–Caltech10 and Office31 datasets is exactly the same as in [11, 13, 14] except that we use different feature representations, i.e., convolutional activations. Since better feature representations are used, the baseline accuracy is substantially higher than the results reported in their papers (the conventional SURF feature is used in [11, 13, 14]). For the ImageNet–VOC2007 dataset, 50 images are randomly sampled from each category of the ImageNet 2012 subset as the source domain, images from the test set of VOC2007 are used as the target domain, and the average precision for each category is reported, respectively. Since we have sufficient data in the source domain, Sec. 6.6 will present
additional results with other settings on this dataset.
Parameter settings. The optimal dimensionality d in both the SA and 2DSA scenarios is determined by two-fold cross validation over the labeled source data with the guide of the theoretical bound (Sec. 3.2), using the range of values 2^k, k = 0, 1, 2, ..., log2(d_max). Fig. 6 illustrates how to find a stable solution with the guide of the theoretical bound: the optimal dimensionality d should be identified before the intersection of the two types of lines. Since we focus on the adaptation of convolutional activations, methods taking fully-connected activations, like the DeCAF feature [8], as the representation are not employed for comparison. Generally, the CONV5 activations extracted from a pretrained 7-layer CNN model (imagenet-vgg-m [45]) are considered as the feature representation (D = 512), and K = 6 is set in the spatial pooling step (Sec. 3). Thus, the feature dimensionality is K²×D = 18432. Additionally, one-vs-rest linear SVMs [46] are used as the classifier, and the penalty factor C is determined by two-fold cross validation on the source domain using the range of values 10^p, p = −3, −2, −1, 0, 1, 2, 3.
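The search over d and C described above could be sketched as follows (our own illustration, not the authors' code; build_features is a hypothetical callable returning the projected source representation for a given d, and the grids follow the ranges stated in the text).

```python
# A minimal sketch of the parameter search: d from powers of two up to d_max,
# C from 10^p with p = -3..3, both chosen by two-fold cross validation on the
# labeled source data.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

def tune_parameters(source_labels, d_max, build_features):
    d_grid = [2 ** k for k in range(int(np.log2(d_max)) + 1)]
    c_grid = [10.0 ** p for p in range(-3, 4)]
    best_d, best_c, best_score = None, None, -np.inf
    for d in d_grid:
        feats = build_features(d)                         # (n_source, dim) array
        search = GridSearchCV(LinearSVC(), {'C': c_grid}, cv=2)
        search.fit(feats, source_labels)
        if search.best_score_ > best_score:
            best_d, best_c = d, search.best_params_['C']
            best_score = search.best_score_
    return best_d, best_c
```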
6.2. Visual recognition results on standard datasets
Several baseline methods are employed to compare against our 2DSA approach:
• No Adaptation (NA): NA is the basic baseline; the classifier trained on the source domain is directly applied to the target domain.
• Geodesic Flow Kernel (GFK) [13]: GFK is a kernel-based DA method that uses an infinite number of subspaces along the geodesic flow to bridge two domains.
• Transfer Joint Matching (TJM) [35]: TJM formulates feature matching and instance reweighting in a joint optimization problem.
• Landmarks Selection Subspace Alignment (LSSA) [36]: LSSA extends SA by projecting samples onto landmarks and adding further nonlinearity with a Gaussian kernel. Both TJM and LSSA can work at the instance level.
• Subspace Alignment (SA) [14]: SA is described above. This is our most closely related work and the direct baseline.
• SA and 2DSA (Gaussian kernel variants): it may be interesting to see how SA and 2DSA work with nonlinearity. We add these two variants that use an SVM with a Gaussian kernel as the classifier. Similar to the penalty factor C, the kernel parameter σ is also tuned by two-fold cross validation.
• 2DSA (feature-mode variant): a variant of 2DSA that adopts A′_i ∈ R^{K²×D}, i = 1, ..., N, as the matrix descriptor, so it solves a D×D covariance matrix as per Eq. 3. In contrast to 2DSA, which performs spatial-mode adaptation, this variant performs feature-mode adaptation. This variant will show what exactly makes 2DSA different from other approaches.
• 2DSA (vectorized feature-map variant): in 2DSA, all feature maps are summed together and sent to 2DPCA. In this variant, each feature map is vectorized into a 1×K² vector and considered as a specific pattern. All these patterns are sent to standard PCA so that all feature maps share the same eigenvectors. We add such a baseline to justify whether 2DPCA is what makes the difference, and not the fact that eigenvectors are shared across feature maps.
In the following, the vector-form representation is prefixed with v-, and the matrix-form representation with m-. Convolutional activations are denoted as CONV for short. Conventional approaches receive vCONV as the feature representation, while 2DSA and its variants receive mCONV as the representation. Tables 2, 3 and 4 list the numerical results. We can make the following observations:
• Classification results on all three standard DA datasets demonstrate that 2DSA almost consistently and significantly outperforms SA by large margins. In particular, 2DSA achieves the highest mean average accuracy of 74.1% and 55.6% on the Office–Caltech10 and Office31 datasets, respectively. 2DSA also exhibits consistently lower standard deviations than SA, which implies that 2DSA is very stable. In addition, 2DSA also ranks second on the challenging VOC2007 test set (one of the 2DSA variants wins first place);
• SA with vCONV does not see notable improvements in accuracy and sometimes even worsens the classification performance. Similar results are also reported in [47], where SA falls behind NA when 4096-dimensional fully-connected activations are used. One reason is perhaps that the number of training data affects the quality of the generated subspace. This point will be further justified in Sections 6.4 and 6.6;
• The results of the feature-mode variant show that the improvement from feature-mode adaptation of CONV is marginal, and it even degrades the classification performance significantly on the ImageNet–VOC2007 dataset. This may justify that the spatial-mode adaptation mechanism in 2DSA matters. We also consider this to be what distinguishes 2DSA from the other compared approaches—the feature mode is not explicitly adapted. Such a phenomenon may inspire a further exploration: is it necessary to adapt features when the feature representations at hand are already good enough? We leave this question open at present.
• It is interesting that the vectorized feature-map variant also works considerably well. We think the reason is again that it performs 2D adaptation and adapts the spatial mode only. However, compared to 2DSA, its classification accuracy is slightly lower, and its standard deviation is generally higher (especially on the ImageNet–VOC2007 dataset). Perhaps when all feature maps are summed together, as in 2DSA, the covariance matrix can appropriately capture the holistic information of the samples, which is more stable than blindly modeling individual feature maps. In addition, the advantage of 2DSA
Table 2. Recognition accuracy (%) on the Office–Caltech10 dataset over 20 trials. The highest accuracy is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses (we focus on the comparison between SA and 2DSA; results of other approaches are reported for reference).
Method A→C C→A A→D D→A A→W W→A C→D D→C C→W W→C D→W W→D mean
NA 68.3(2.4) 81.2(3.0) 67.9(4.2) 66.4(2.7) 56.7(4.7) 57.9(3.6) 68.7(3.6) 52.6(1.9) 58.8(3.9) 47.2(3.5) 89.2(2.7) 90.5(2.3) 67.1
GFK 77.2(2.2) 85.2(2.2) 78.1(3.6) 80.4(2.6) 74.3(2.8) 70.4(6.5) 79.2(2.7) 63.9(7.0) 74.7(3.1) 59.6(5.0) 86.9(3.2) 86.4(3.0) 76.4
TJM 80.7(1.5) 88.6(1.6) 79.8(7.6) 74.2(6.2) 75.3(5.8) 62.8(7.4) 84.4(3.8) 59.6(7.2) 79.5(4.8) 52.2(4.8) 94.0(2.5) 92.3(2.1) 76.9
LSSA 78.7(0.9) 87.0(1.7) 78.3(2.2) 77.7(4.2) 70.6(2.3) 63.9(5.1) 77.8(2.7) 64.5(4.1) 68.2(5.2) 58.0(3.4) 95.4(2.2) 96.0(1.6) 76.3
SA67.4(4.8) 75.8(3.3) 42.3(14.9) 59.1(19.2) 45.5(10.9) 29.8(24.2) 34.7(11.4) 56.5(4.6) 43.5(12.5) 36.9(20.3) 70.7(36.0) 79.2(17.1) 53.5
SA 69.3(3.7) 84.6(3.8) 64.7(7.6) 66.7(12.5) 53.7(5.2) 58.7(10.4) 70.6(5.2) 56.4(12.1) 58.5(4.3) 50.3(8.5) 83.4(6.2) 84.6(5.8) 66.8
2DSA69.8(14.8) 84.9(16.1) 68.0(15.7) 56.4(24.9) 58.2(17.9) 53.6(23.0) 73.8(14.0) 47.3(22.2) 69.4(13.2) 42.9(20.6) 76.0(34.0) 82.6(31.4) 65.2
2DSA68.6(2.4) 85.0(2.4) 67.2(7.0) 69.3(7.0) 60.3(4.3) 58.8(7.8) 72.5(2.4) 53.0(5.9) 58.4(4.0) 48.3(5.5) 81.3(6.6) 85.0(4.2) 67.3
2DSA74.8(3.2) 85.4(2.6) 72.7(6.7) 70.7(4.4) 67.4(4.0) 58.2(5.1) 77.3(4.9) 58.5(4.2) 70.7(3.7) 50.1(3.2) 91.0(2.8) 92.9(3.4) 72.5
2DSA 75.2(3.0) 85.9(1.8) 75.4(5.2) 73.4(5.0) 66.8(3.9) 63.8(5.6) 76.6(5.0) 62.5(4.7) 69.5(2.9) 55.1(5.1) 91.8(3.0) 93.0(2.5) 74.1
Table 3. Recognition accuracy (%) on the Office31 dataset over 20 trials. The highest accuracy is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses (we focus on the comparison between SA and 2DSA; results of other approaches are reported for reference).
Method A→D D→A A→W W→A D→W W→D mean
NA 41.1(3.2) 30.2(2.3) 33.1(2.8) 26.4(2.2) 74.0(2.6) 76.6(1.5) 46.9
GFK 44.5(2.4) 31.1(4.2) 37.1(3.0) 27.3(2.0) 73.3(2.2) 75.6(2.2) 48.1
TJM 44.9(5.1) 34.8(3.3) 37.5(4.3) 31.3(5.1) 73.9(1.8) 76.6(3.2) 49.8
LSSA 38.9(9.6) 34.5(4.5) 29.0(11.7) 33.1(4.4) 80.4(2.4) 79.3(2.3) 49.2
SA27.8(3.4) 28.9(6.0) 27.9(3.3) 23.1(8.5) 74.5(3.3) 73.0(5.6) 42.5
SA 43.8(9.2) 40.1(3.3) 40.9(7.7) 35.8(6.0) 81.5(2.6) 75.6(4.7) 52.9
2DSA47.8(6.5) 35.6(4.1) 40.3(5.8) 29.1(7.1) 86.8(2.8) 88.4(2.6) 54.7
2DSA45.4(2.7) 35.7(2.0) 37.3(3.4) 32.4(2.1) 78.7(2.0) 80.5(2.9) 51.7
2DSA45.2(4.2) 32.3(5.3) 39.2(3.8) 28.3(3.4) 81.3(2.5) 84.2(2.1) 51.8
2DSA 47.3(4.2) 37.6(1.7) 39.2(6.0) 35.8(1.4) 85.7(2.1) 88.3(2.2) 55.6
over the vectorized feature-map variant is obvious when the number of source samples is limited (D/W → A/C), which means 2DSA is more suitable for small sample sizes. As a consequence, 2DPCA seems a better choice for 2D adaptation.
• The Gaussian kernel variant of 2DSA achieves the highest average precision on the ImageNet–VOC2007 dataset. Yet, the nonlinearity used in the kernel variants of SA and 2DSA does not always benefit classification. On the Office–Caltech10 and Office31 datasets, introducing the Gaussian kernel has a negative effect on the classification accuracy. The kernel variants also exhibit much higher standard deviations than their linear counterparts on the Office–Caltech10 dataset. Hence, one should be careful when using nonlinearity in practice.
• Although TJM achieves higher classification accuracy than 2DSA on the Office–Caltech10 dataset, TJM does not work well when tackling complicated classification problems (31 categories are included in the Office31 dataset) or when inferring classes with complex backgrounds (the VOC2007 dataset). Here is a plausible explanation. Since TJM optimizes an instance reweighting procedure, it works at the instance level. However, as shown in Fig. 7, the Office31 dataset contains some images with inaccurate labels, and the VOC2007 dataset is a typical multi-label dataset. Ambiguous labels are very likely to lead to sample shifts from one class to the other. If these samples are assigned larger weights, the quality of the adaptation will be largely affected. In contrast, 2DSA is a subspace-based approach and works at the domain level. It is not that sensitive to the variations of individual instances. This may explain why 2DSA outperforms TJM on the Office31 and ImageNet–VOC2007 datasets.
• The reason why 2DSA outperforms LSSA may be similar to that for TJM. LSSA also contains an instance reweighting process, so it may suffer from the same problem as TJM. In LSSA, the source and target data are projected onto a shared space using a Gaussian kernel with respect to the selected landmarks. If the selected landmarks contain noisy samples, the resulting nonlinear representations may also be unreliable. With unreliable representations, the data distributions may not change in the way we expect in the projected space to benefit linear classification.
In fact, we think the performance degradation also has something to do with the use of deep features. According to a recent work [48], deep features are considered fragile—features are separable but not discriminative enough (the
Table 4. Average precision (%) on the ImageNet–VOC2007 dataset. The highest average precision is boldfaced, the second best is shown in red, and the standard
deviation is shown in parentheses (we focus on the comparison between SA and 2DSA, results of other approaches are reported for reference).
VOC2007 aero bike bird boat bottle bus car cat chair cow
NA 68.7(6.5) 60.2(4.1) 49.3(4.5) 60.2(7.7) 25.8(1.9) 51.1(3.1) 65.0(2.9) 65.3(2.0) 16.1(4.6) 22.7(7.2)
GFK 59.6(7.6) 57.6(5.9) 33.3(17.6) 40.3(9.9) 23.7(4.3) 48.5(3.2) 64.4(2.6) 46.3(12.0) 13.5(4.9) 19.6(7.1)
TJM 70.8(1.3) 63.9(2.6) 14.8(8.7) 40.2(14.9) 14.9(3.3) 49.0(3.7) 68.7(1.8) 51.9(10.7) 14.3(5.9) 17.0(14.2)
LSSA 54.2(2.3) 58.3(3.1) 25.6(3.2) 21.9(3.1) 22.1(2.0) 39.3(2.4) 63.8(1.4) 45.6(4.3) 27.5(4.0) 16.5(3.0)
SA70.6(3.9) 62.2(5.1) 55.2(10.2) 61.5(11.5) 22.8(4.4) 58.8(4.1) 71.4(2.5) 64.0(6.8) 15.5(6.1) 30.1(6.0)
SA 60.1(12.8) 51.6(18.4) 28.3(11.3) 41.8(18.5) 17.8(3.7) 45.9(12.7) 66.4(7.4) 48.6(12.2) 19.4(9.8) 27.2(7.1)
2DSA78.8(3.6) 73.4(3.0) 68.5(6.1) 74.4(4.2) 30.4(2.6) 64.3(3.6) 75.4(2.5) 74.9(2.8) 20.5(6.2) 48.2(5.8)
2DSA54.2(2.3) 58.3(3.1) 25.6(3.2) 21.9(3.1) 22.1(2.0) 39.3(2.4) 63.8(1.4) 45.6(4.3) 27.5(4.0) 16.5(3.0)
2DSA69.3(3.3) 63.5(3.8) 60.9(3.8) 67.3(5.4) 30.0(2.8) 52.2(4.1) 69.5(2.2) 69.6(3.6) 13.9(4.4) 32.4(4.9)
2DSA 68.7(1.3) 66.3(1.2) 50.7(2.9) 65.9(2.1) 30.8(2.5) 53.8(2.0) 74.6(1.1) 67.7(0.8) 29.1(5.4) 33.5(2.8)
table dog horse mbike person plant sheep sofa train tv mean
NA 27.4(4.0) 42.2(10.8) 37.4(13.2) 51.1(4.4) 70.8(2.2) 18.1(2.5) 44.7(4.7) 36.9(5.6) 69.2(2.8) 47.8(5.3) 46.5
GFK 26.4(6.6) 18.1(4.4) 18.6(17.9) 52.3(8.2) 73.9(2.3) 14.7(0.7) 27.4(15.7) 35.3(5.7) 58.8(5.8) 31.1(6.3) 38.2
TJM 29.2(3.8) 24.5(16.0) 22.6(27.0) 53.7(2.2) 67.6(2.4) 15.3(2.0) 10.0(12.3) 39.3(2.2) 64.3(2.6) 33.9(2.3) 38.3
LSSA 25.3(2.7) 40.8(1.7) 40.4(7.6) 41.7(3.4) 74.3(2.8) 11.0(3.1) 23.7(4.5) 27.7(3.6) 52.4(3.9) 29.5(3.4) 37.1
SA32.1(6.3) 47.6(7.4) 51.4(14.9) 64.0(3.7) 75.5(2.1) 17.6(2.6) 42.7(8.9) 48.3(7.0) 72.9(5.0) 48.9(6.5) 50.7
SA 33.6(4.8) 47.2(8.1) 55.9(15.6) 49.2(13.9) 69.9(9.4) 13.4(5.3) 36.6(12.9) 24.9(13.2) 62.8(10.1) 32.7(5.5) 41.7
2DSA38.5(5.3) 59.3(3.5) 65.1(6.7) 70.9(4.6) 77.7(2.2) 18.5(2.0) 64.1(3.6) 58.6(5.3) 78.4(2.8) 61.3(5.8) 60.1
2DSA25.3(2.7) 40.8(1.7) 40.4(7.6) 41.7(3.4) 74.3(2.8) 11.0(3.1) 23.7(4.5) 27.7(3.6) 52.4(3.9) 29.5(3.4) 37.1
2DSA37.3(6.2) 57.6(4.0) 48.0(9.7) 59.4(3.7) 72.8(2.3) 17.1(3.0) 56.8(3.8) 38.3(3.6) 72.2(4.3) 55.5(3.8) 52.2
2DSA 35.3(2.5) 56.3(3.2) 48.4(5.6) 58.5(3.6) 77.5(1.6) 24.5(1.9) 52.2(4.0) 43.2(2.0) 75.0(0.9) 57.2(2.0) 53.5
intra-class variations are still large). [48] shows that deep features typically present bubble-like shapes in the feature space; different bubbles indicating different classes may easily intersect if a disturbance appears. Such a problem becomes serious in the context of CONV adaptation. The disturbances can come from the problem nature of DA (distribution mismatch) or from the poor estimation of parameters due to high dimensionality. Nevertheless, the good news is that the spatial-mode adaptation mechanism in 2DSA does not seem to ruin the good class separation of CONV.
• It can be concluded that, when aligning convolutional activations, it is better to formulate the problem in the two-dimensional paradigm. Moreover, if we have desirable domain-invariant feature representations, a simple linear adaptation already seems adequate.
6.3. Visual recognition results on MTFS3–DA dataset
For the MTFS3–DA dataset, we organize our experiments in a hierarchical manner. In particular, we gradually increase the domain shifts and evaluate the recognition performance under single-type, double-type and triple-type variations. More specifically, three types of variations are considered: years, cultivars and geographical locations. On this dataset, we only compare the performance of 2DSA against NA and SA. Accuracy improvements of around 10% over the NA baseline are underlined, indicating a significant improvement.
Figure 7. Images shown in the first row are labeled as ruler in the Office31 dataset, and the second row shows images with multiple labels in the VOC2007 dataset. These images with ambiguous labels may affect the performance of instance-level DA methods.
6.3.1. Performance degradation
Before we evaluate these DA problems, we first highlight the problem of cross-field performance degradation. Concretely, we choose 3 typical domains, Z10J, T11N1 and G14Z, as the source domains, respectively, and test the recognition performance on the other 9 target domains. The mean recognition accuracy is reported. Numerical results listed in Table 5 show that the performance degrades significantly in all cases when directly applying the classifier trained on the source domain. This is an important problem that is often ignored in field-based visual applications in agriculture. The factors that make plants differ from year to year or from location to location are complicated. For instance, the quality of seeds, the variations
Table 5. Performance degradation from one domain to the other. The performance in the first column is obtained by testing data from the same domain, and the standard deviation is shown in parentheses.
Source Z10J, self: 77.1(5.0). Targets: Z11J 56.6(4.0), Z12Z 56.7(5.0), T10W1 58.7(5.7), T10W2 55.1(5.1), T11N1 47.2(4.9), T11N2 50.4(4.5), T12Z1 55.4(4.6), T12Z2 51.8(4.2), G14Z 52.6(6.1).
Source T11N1, self: 77.1(5.1). Targets: Z10J 48.5(5.3), Z11J 43.3(4.9), Z12Z 42.5(3.3), T10W1 52.9(5.3), T10W2 46.5(5.0), T11N2 44.7(4.7), T12Z1 51.5(4.6), T12Z2 48.6(4.4), G14Z 50.2(6.2).
Source G14Z, self: 72.8(4.4). Targets: Z10J 44.4(4.8), Z11J 40.7(4.5), Z12Z 36.0(3.6), T10W1 49.4(5.2), T10W2 45.3(3.6), T11N1 46.2(4.8), T11N2 45.8(3.9), T12Z1 48.7(4.2), T12Z2 44.1(3.9).
Table 6. Recognition accuracy (%) under the same cultivar and geographical location but different years over 2 DA problems. The highest accuracy is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method Z10J→Z11J Z11J→Z10J mean
NA 56.6(4.0) 51.6(4.1) 54.1
SA 55.5(7.9) 48.2(12.7) 51.9
2DSA 61.1(4.9) 56.8(5.6) 59.0
of weather and the nutritional status of the soil all largely affect the growth of plants. In addition, different plants will encounter interspecific competition. This is the reason why different plants tend to exhibit different flowering status even if they are seeded at the same time.
6.3.2. DA evaluation under single-type variation
In the first series of evaluations, we consider DA problems caused by only a single type of variation. In particular, two types of variations, years and geographical locations, are evaluated, respectively. Note that the scenario of single cultivar variation is not included, because plants of different cultivars are currently not planted within the same year and geographical location.
Same cultivar and geographical location but different years. Here, we only allow the year to vary while the other two factors are fixed, leading to the 2 DA problems shown in Table 6. In this situation, the weather condition is the main factor that affects the growth of plants. Results show that 2DSA improves the cross-field classification performance and also outperforms SA, which means the shifts caused by weather conditions can be corrected appropriately.
Same cultivar and year but different geographical locations. In this setting, we restrict the cultivar and year to be the same and only vary the geographical locations, resulting in the 4 DA problems shown in Table 7. Plants in different locations are greatly influenced by the soil conditions. The results demonstrate a similar tendency to the first experiment.
6.3.3. DA evaluation under double-type variation
In the second series of DA evaluations, we consider three
kinds of double-type variations. Concretely, they are as follows.
Same geographical location but different years and cultivars. We simultaneously vary the shifts with respect to years and cultivars but require the geographical location to be the same place. This gives rise to 24 DA problems. When different cultivars are considered, maize tassels tend to exhibit significant appearance variations, e.g., different colors. Results are listed in Table 8. It is surprising to see that 2DSA significantly improves the classification performance in 13 out of 24 DA problems, implying that the shifts caused by years and cultivars are not that serious.
Same cultivar but different years and geographical locations. Similarly, we fix the cultivar and change the other two factors in this setting. The 6 DA problems in Table 9 also demonstrate the effectiveness of 2DSA, and 3 of them exhibit a notable performance improvement of over 10%.
Same year but different cultivars and geographical locations. In this setting, only cultivars and geographical locations change simultaneously, and 8 DA problems in all are evaluated. According to the results shown in Table 10, 2DSA significantly improves the accuracy on only one DA task, and 2DSA does not work on the T10W2→Z10J problem. Hence, on the basis of the above results, we conclude that the geographical location is a more important factor in causing domain shifts than the cultivar and the year. Indeed, this is in accordance with our intuition that the various soil conditions of different locations greatly affect the growth of plants.
6.3.4. DA evaluation under triple-type variation
In the final experiment, all three kinds of variations can vary
simultaneously, resulting in the most challenging setting.
Different years, cultivars and geographical locations. Overall, we have 36 DA problems. Numerical results are listed in Table 11. It is interesting that all DA tasks with significant improvements involve the G14 domain, which means the shifts caused by this domain are not easily adapted. For the other DA tasks that do not involve the G14 domain, we find that, although 2DSA still works, the recognition baseline is generally lower than in the single-type and double-type cases. Domain shifts seem serious when all variations are involved, because 22 problems do not exhibit notable accuracy improvements. In addition, SA even works better than 2DSA in two DA problems. As per these observations, we believe that the classification performance indeed has a close relation to the specific data distributions.
Table 7. Recognition accuracy (%) under the same cultivar and year but different geographical locations over 4 DA problems. The highest average precision is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method    Z12Z→T12Z1    T12Z1→Z12Z    Z12Z→T12Z2    T12Z2→Z12Z    mean
NA        51.2(5.8)     54.9(5.2)     47.6(4.1)     48.9(5.5)     50.6
SA        49.6(11.0)    49.4(5.8)     43.2(9.4)     47.8(6.8)     47.5
2DSA      57.8(5.6)     59.4(3.0)     55.2(5.3)     54.3(6.5)     56.7
Table 8. Recognition accuracy (%) under the same geographical location but different years and cultivars over 24 DA problems. The highest average precision is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method    T10W1→T11N1    T11N1→T10W1    T10W1→T11N2    T11N2→T10W1    T10W1→T12Z1    T12Z1→T10W1    T10W1→T12Z2    T12Z2→T10W1
NA        58.8(6.3)      52.9(5.3)      49.9(3.8)      54.1(6.5)      61.4(5.6)      59.4(4.4)      58.0(5.8)      58.7(5.5)
SA        57.8(8.9)      59.4(11.1)     50.7(7.1)      48.7(9.9)      60.2(7.5)      56.8(11.7)     52.2(9.9)      55.8(9.4)
2DSA      66.0(3.7)      62.8(5.0)      59.5(4.0)      62.8(5.9)      66.5(4.6)      69.5(4.1)      67.6(4.2)      66.9(5.9)

Method    T10W2→T11N1    T11N1→T10W2    T10W2→T11N2    T11N2→T10W2    T10W2→T12Z1    T12Z1→T10W2    T10W2→T12Z2    T12Z2→T10W2
NA        55.2(5.1)      46.5(5.0)      52.4(4.7)      52.9(6.2)      59.3(6.7)      56.6(4.6)      57.9(4.7)      51.3(6.1)
SA        53.7(9.9)      49.4(9.0)      53.3(7.1)      48.4(8.1)      53.6(14.8)     57.8(6.9)      45.3(9.9)      50.0(9.2)
2DSA      66.6(4.0)      58.8(5.0)      59.7(5.9)      60.1(5.0)      67.5(3.6)      67.2(3.4)      64.9(5.9)      64.5(5.9)

Method    T11N1→T12Z1    T12Z1→T11N1    T11N1→T12Z2    T12Z2→T11N1    T11N2→T12Z1    T12Z1→T11N2    T11N2→T12Z2    T12Z2→T11N2    mean
NA        51.5(4.6)      50.1(4.9)      48.6(4.4)      52.3(5.5)      54.6(6.5)      48.6(4.0)      50.0(5.1)      46.2(4.8)      53.6
SA        55.4(5.2)      53.3(5.7)      48.7(6.3)      44.4(9.4)      51.1(12.1)     52.4(10.2)     43.7(8.7)      51.9(6.8)      52.2
2DSA      60.7(3.8)      61.6(4.4)      57.5(5.7)      61.9(4.0)      61.6(6.0)      59.8(5.6)      55.4(6.6)      56.2(4.6)      62.7
Table 9. Recognition accuracy (%) under the same cultivar but different years and geographical locations over 6 DA problems. The highest average precision is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method    G14Z→T12Z1    T12Z1→G14Z    G14Z→T12Z2    T12Z2→G14Z    G14Z→Z12Z    Z12Z→G14Z    mean
NA        48.7(4.2)     55.0(6.7)     44.1(3.9)     53.0(6.3)     36.0(3.6)    48.9(4.7)    47.6
SA        54.3(7.2)     51.5(8.3)     50.6(7.3)     51.8(8.0)     47.6(8.7)    48.6(8.9)    50.7
2DSA      62.3(7.8)     62.7(2.9)     61.4(3.6)     59.2(5.3)     57.1(6.5)    57.4(8.8)    60.0
Table 10. Recognition accuracy (%) under the same year but different cultivars and geographical locations over 8 DA problems. The highest average precision is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method    Z10J→T10W1    T10W1→Z10J    Z10J→T10W2    T10W2→Z10J    Z11J→T11N1    T11N1→Z11J    Z11J→T11W2    T11W2→Z11J    mean
NA        58.7(5.7)     59.6(4.1)     55.1(5.1)     60.6(5.3)     45.1(6.5)     43.3(4.9)     44.9(4.6)     49.5(7.3)     52.1
SA        54.3(9.2)     60.0(5.5)     55.3(9.6)     53.6(10.7)    45.3(7.0)     50.2(8.6)     44.5(6.3)     48.7(7.3)     51.5
2DSA      67.4(4.7)     65.3(3.1)     64.4(4.3)     60.6(4.9)     51.0(8.6)     55.4(6.9)     50.7(6.1)     56.7(4.5)     59.0
6.4. Subspace analysis by measuring the reconstruction error
As previously stated, we conjecture that the quality of the generated subspaces affects the performance. To justify this, we assess the subspace quality from the perspective of reconstruction error. Fig. 8 illustrates the results. Note that Q shown in the figure denotes the widely-used energy parameter that controls the subspace dimensionality. It is clear that the reconstruction error of 2DPCA is generally lower than that of PCA. Also, we note that PCA exhibits a relatively high error even when Q equals 100%, while 2DPCA is already close to zero. This gives evidence that PCA cannot appropriately reconstruct convolutional activations with a limited number of training data.
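
To make this comparison concrete, the following Python/NumPy sketch (not the original MATLAB implementation) estimates the reconstruction error of PCA on vectorized activations and of 2DPCA on the original activation matrices. The toy tensor shape, the energy threshold, and all variable names are illustrative assumptions.

import numpy as np

def pca_recon_error(X, q=0.9):
    # PCA on vectorized samples X of shape (n, d); keep components covering energy q.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # economy SVD, avoids the d x d covariance
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(energy, q) + 1)
    P = Vt[:k].T                                         # (d, k) basis
    R = Xc @ P @ P.T                                     # project and reconstruct
    return np.linalg.norm(Xc - R) ** 2 / len(X)

def pca2d_recon_error(A, q=0.9):
    # 2DPCA on matrix samples A of shape (n, h, w) with a right-multiplying projection.
    Ac = A - A.mean(axis=0)
    G = np.einsum('nij,nik->jk', Ac, Ac) / len(A)        # w x w image covariance
    vals, vecs = np.linalg.eigh(G)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    energy = np.cumsum(vals) / np.sum(vals)
    k = int(np.searchsorted(energy, q) + 1)
    P = vecs[:, :k]                                      # (w, k) basis
    R = Ac @ P @ P.T                                     # A_i P P^T
    return np.linalg.norm(Ac - R) ** 2 / len(A)

# Toy activations: 50 samples, 512 channels, 36 spatial locations (an assumed shape).
A = np.random.rand(50, 512, 36)
X = A.reshape(50, -1)
print(pca_recon_error(X, 0.9), pca2d_recon_error(A, 0.9))

Because 2DPCA only eigendecomposes a small w x w image covariance, a given energy level can be reached with far fewer training samples than PCA needs for its much larger covariance, which is consistent with the behaviour reported above.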
6.5. Quantifying the domain discrepancy using divergence measures
Here, we evaluate the domain discrepancy on four typical DA problems on the Office–Caltech10 dataset, using both the global and the proposed local divergence measures. Concretely, we compute the HΔH-divergence measure and the corresponding recognition accuracy over the selected DA problems. Results are listed in Table 12. They demonstrate a tendency similar to our observations in Sec. 4.1. That is, a lower HΔH value does not imply a good classification result, which means the superiority of 2DSA cannot be explained from the global sense. To this end, we further compute the within-class divergence HwΔHw and the between-class divergence HbΔHb, expecting to infer the results from a local perspective. Concretely, we plot the γ-curves for HwΔHw and HbΔHb over the same adaptation tasks in Fig. 9 and Fig. 10, respectively.
Table 11. Recognition accuracy (%) under different years, cultivars and geographical locations over 36 DA problems. The highest average precision is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method    Z10J→T11N1    T11N1→Z10J    Z10J→T11N2    T11N2→Z10J    Z10J→T12Z1    T12Z1→Z10J    Z10J→T12Z2    T12Z2→Z10J    Z10J→G14Z
NA        47.2(4.9)     48.5(5.3)     50.4(4.5)     51.0(5.3)     55.4(4.6)     58.1(4.3)     51.8(4.2)     52.6(4.7)     52.6(6.1)
SA        51.7(6.2)     55.4(9.4)     54.0(5.9)     51.8(8.4)     55.5(5.8)     56.0(11.7)    48.4(9.9)     55.4(10.1)    52.5(6.1)
2DSA      65.2(5.0)     59.1(6.2)     60.9(5.5)     61.5(6.7)     64.7(4.8)     62.1(5.0)     56.8(4.2)     60.3(6.5)     63.1(4.2)

Method    G14Z→Z10J     Z11J→T10W1    T10W1→Z11J    Z11J→T10W2    T10W2→Z11J    Z11J→T12Z1    T12Z1→Z11J    Z11J→T12Z2    T12Z2→Z11J
NA        44.4(4.8)     46.5(6.0)     52.8(4.9)     49.8(6.2)     56.2(4.7)     47.1(6.6)     47.8(6.3)     43.7(4.5)     47.3(5.5)
SA        60.9(6.9)     50.0(9.5)     49.5(10.1)    48.2(10.7)    45.1(9.3)     45.4(10.3)    49.6(9.0)     49.3(8.3)     49.3(9.0)
2DSA      59.1(7.2)     55.0(6.5)     59.7(5.9)     53.8(7.4)     60.2(4.6)     54.8(8.0)     49.8(5.0)     52.1(3.7)     49.1(7.3)

Method    Z11J→G14Z     G14Z→Z11J     Z12Z→T10W1    T10W1→Z12Z    Z12Z→T10W2    T10W2→Z12Z    Z12Z→T11N1    T11N1→Z12Z    Z12Z→T11N2
NA        46.4(7.0)     40.7(4.5)     53.3(5.1)     50.3(5.3)     53.6(4.0)     54.8(4.9)     44.0(3.9)     42.5(3.3)     46.5(3.8)
SA        41.0(8.1)     56.1(7.1)     58.0(12.4)    51.8(10.6)    50.0(10.2)    52.3(7.0)     46.3(6.8)     46.8(6.6)     47.9(7.8)
2DSA      53.4(5.3)     54.5(7.4)     62.7(6.1)     61.8(4.7)     61.4(6.1)     62.6(4.6)     56.6(5.1)     55.6(5.7)     55.0(4.7)

Method    T11N2→Z12Z    T10W1→G14Z    G14Z→T10W1    T10W2→G14Z    G14Z→T10W2    T11N1→G14Z    G14Z→T11N1    T11N2→G14Z    G14Z→T11N2    mean
NA        49.5(3.0)     56.8(3.5)     49.4(5.2)     50.9(6.8)     45.3(3.6)     50.2(6.2)     46.2(4.8)     48.6(5.3)     45.8(3.9)     49.4
SA        47.4(9.3)     50.8(9.9)     54.6(6.4)     51.0(9.4)     51.0(7.8)     47.6(8.0)     48.3(7.9)     46.0(7.2)     47.3(7.6)     50.6
2DSA      52.6(5.1)     63.2(4.4)     61.6(5.1)     60.5(6.0)     60.9(7.8)     56.1(4.5)     60.5(8.9)     57.9(4.8)     58.2(4.9)     58.4
Table 12. HΔH domain discrepancy measure and the corresponding recognition accuracy (%) (in parentheses) of different approaches over a specific trial of 4 adaptation problems on the Office–Caltech10 dataset. The lowest HΔH is boldfaced and the highest accuracy is underlined.
Method    A→C           A→D           C→D           C→W
NA        1.33 (69.3)   1.99 (76.3)   1.79 (65.0)   1.66 (64.3)
SA        1.23 (58.1)   0.89 (54.4)   0.94 (52.7)   1.13 (62.9)
2DSA      1.45 (78.9)   2.00 (83.7)   1.94 (83.4)   1.65 (74.9)
We observe that the tendency in HwΔHw is analogous to that of HΔH, though some fluctuations occur. That is to say, HwΔHw can be seen as a local version of HΔH to some degree. Finally, when resorting to the between-class divergence, we find that HbΔHb correlates well with the recognition accuracy. In general, a lower HbΔHb implies good recognition performance. According to these results, we can see that HwΔHw characterizes how good an alignment is, and HbΔHb depicts how well the classification performs. Thus, we believe one should pay more attention to the local class distributions when considering cross-domain classification problems, especially the between-class distributions.
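
For readers who want to reproduce this kind of analysis, the sketch below estimates a proxy divergence with a linear domain classifier, together with a naive class-conditional variant. It only illustrates the general recipe; it is not the exact definition of HΔH, HwΔHw or HbΔHb given earlier in the paper, and the function names and the per-class averaging are our own assumptions.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def proxy_divergence(Xs, Xt):
    # Proxy A-distance-style estimate 2 * (1 - 2 * err) from a linear domain classifier.
    X = np.vstack([Xs, Xt])
    y = np.hstack([np.zeros(len(Xs)), np.ones(len(Xt))])
    acc = cross_val_score(LinearSVC(dual=False), X, y, cv=3).mean()
    err = 1.0 - acc
    return 2.0 * (1.0 - 2.0 * err)

def within_class_divergence(Xs, ys, Xt, yt):
    # Naive class-conditional variant: average the proxy divergence over shared classes.
    vals = []
    for c in np.intersect1d(np.unique(ys), np.unique(yt)):
        vals.append(proxy_divergence(Xs[ys == c], Xt[yt == c]))
    return float(np.mean(vals))

In practice the class-conditional estimate requires labels on both domains, so in the unsupervised setting the target labels would have to be replaced by pseudo-labels.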
6.6. Do we need more training data?
As aforementioned, we partially ascribe the inferior performance of SA to the limited number of training data, and we have also justified this point from the perspective of reconstruction error. In this section, we further conduct experiments to see whether the performance can be enhanced if we add more training data. Specifically, we use the ImageNet–VOC2007 dataset. We continuously change the number of training data sampled from each category, denoted by Nclass, and monitor the variations of average precision (AP). Results for 9 typical classes are illustrated in Fig. 11. We observe that, the more training data we use, the better the performance generally becomes. This trend is obvious when looking at the methods employing vector-form representations (NA and SA). Furthermore, we find that only 2DSA achieves favorable results even with a limited number of training data (Nclass = 8 or Nclass = 16), implying that the performance of 2DSA is not that sensitive to changes in Nclass. This also implies that 2DSA can be applied in the small-sample-size situation, which is common in real-world applications. Based on the results presented, we believe that the performance of SA indeed has a close relation to the number of training data.
6.7. Does the feature really matter?
In this section, we analyze the performance of CONV features from different layers to emphasize the role of feature representation. One intuition about deep convolutional models is that the deeper the layer, the better the representation [49, 50]. To this end, we analyze the performance of 2DSA with different layers of CONV activations on the Office–Caltech10 dataset, following the standard experimental setting. Numerical results are listed in Table 13. Generally, deeper representations yield better accuracy. For instance, we observe a surprisingly significant accuracy improvement from 28.5% to 75.2% under the A→C task. We have to admit that good features really matter.
Here is our point. Indeed, DA methods really count, but domain-invariant features also play a vital role. What are the factors that cause domain shift? As mentioned at the beginning of the main text, they are those intrinsic and extrinsic variations. Hence, it may be a good idea to devote ourselves to developing powerful features that achieve invariance to poses, scales, rotations, illuminations and background, just like those efforts that endow convolutional models with an ability to identify spatial transformations [51].
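
As a rough illustration of how per-layer activations can be extracted for such an analysis, the following sketch uses PyTorch/torchvision with a pretrained AlexNet rather than the MatConvNet model used in the paper; the weight specifier, layer indices, preprocessing and reshaping are assumptions for illustration only, and image.jpg is a hypothetical input path.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained AlexNet has five convolutional layers; the indices below follow
# torchvision's alexnet.features layout (an assumption, not the paper's model).
net = models.alexnet(weights="IMAGENET1K_V1").features.eval()
conv_ids = [0, 3, 6, 8, 10]    # conv1..conv5

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

def conv_activations(image_path):
    # Return one C x (H*W) activation matrix per convolutional layer.
    x = preprocess(Image.open(image_path).convert('RGB')).unsqueeze(0)
    outs = []
    with torch.no_grad():
        for i, layer in enumerate(net):
            x = layer(x)
            if i in conv_ids:
                outs.append(x.squeeze(0).reshape(x.shape[1], -1).numpy())
    return outs

acts = conv_activations('image.jpg')   # hypothetical path

Each returned matrix can then be fed to a 2D alignment step in its natural two-dimensional form, which is the representation the layer-wise comparison in Table 13 relies on.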
[Figure 8: four panels (Amazon, Caltech, DSLR, web-cam); x-axis: energy parameter Q (%); y-axis: reconstruction error (×10^5); curves: SA and 2DSA.]
Figure 8. Reconstruction error of different approaches with changing energy parameter Q (%) in four domains.
[Figure 9: four panels (Amazon→Caltech, Amazon→DSLR, Caltech→DSLR, Caltech→web-cam); x-axis: γ; y-axis: HwΔHw; curves: NA, SA and 2DSA.]
Figure 9. γ-curves regarding local within-class divergence HwΔHw over four DA tasks. In this case, γ = 2γw.
[Figure 10: four panels (Amazon→Caltech, Amazon→DSLR, Caltech→DSLR, Caltech→web-cam); x-axis: γ; y-axis: HbΔHb; curves: NA, SA and 2DSA.]
Figure 10. γ-curves regarding local between-class divergence HbΔHb over four DA tasks. In this case, γ = γb.
Table 13. Recognition accuracy (%) of 2DSA with different layers of CONV activations on the Office–Caltech10 dataset over 20 trials. The arrow indicates the change compared with the previous row, and the standard deviation is shown in parentheses.
Feature    A→C        C→A        A→D        D→A        A→W        W→A        C→D        D→C        C→W        W→C        D→W        W→D        mean
mCONV1     28.5(2.6)  40.5(4.8)  22.7(5.7)  17.1(2.4)  17.5(4.6)  15.6(4.8)  27.6(4.6)  17.9(2.4)  21.2(5.2)  16.3(1.6)  43.8(5.0)  40.4(7.7)  25.8
mCONV2     46.5(2.2)  56.8(8.8)  39.9(6.2)  27.9(5.2)  31.9(5.0)  29.0(4.1)  43.3(6.9)  22.1(3.5)  33.8(3.8)  22.1(2.4)  70.1(10.1) 66.3(7.7)  40.8
mCONV3     44.0(6.3)  61.7(5.7)  38.2(6.0)  22.6(5.0)  31.0(7.3)  30.9(5.9)  47.1(5.3)  20.8(3.6)  37.5(4.8)  22.7(2.9)  71.1(5.5)  66.4(10.0) 41.2
mCONV4     64.7(2.7)  76.8(2.3)  60.7(12.0) 56.6(4.7)  55.8(10.3) 44.4(13.0) 65.1(6.3)  44.7(3.9)  57.3(5.1)  39.3(3.7)  90.7(3.2)  90.6(2.9)  62.2
mCONV5     75.2(3.0)  85.9(1.8)  75.4(5.2)  73.4(5.0)  66.8(3.9)  63.8(5.6)  76.6(5.0)  62.5(4.7)  69.5(2.9)  55.1(5.1)  91.8(3.0)  93.0(2.5)  74.1
Table 14. Average evaluation time (s) of each trial with varying feature dimensionality. (OS: Windows 7 64-bit, CPU: Intel i3-2120 3.30GHz, RAM: 16 GB)
Dimensionality 1152 1568 3872 7200 9248 16928
SA 1.03 2.54 34.57 210.40 444.83 2655.25
2DSA 0.32 0.37 0.45 0.63 0.81 2.14
6.8. Efficiency comparison between SA and 2DSA
As aforementioned, compared with SA, 2DSA offers another important attraction in computational efficiency. Here, we verify this. Concretely, the single-core CPU runtime is tested as the feature dimensionality varies, and the average evaluation time of each trial is reported.
[Figure 11: nine panels (aeroplane, bicycle, bus, car, cow, dog, motorbike, person, train); x-axis: Nclass (log scale); y-axis: AP; curves: NA, SA and 2DSA.]
Figure 11. The performance of different methods with the varying number of training data sampled from each category on the ImageNet–VOC2007 dataset.
According to the numerical results in Table 14, 2DSA is significantly faster than SA when dealing with high-dimensional data, implying that 2DSA would be particularly attractive in practice due to its high efficiency.
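
The gap is easy to reproduce with a small timing sketch. The one below, in Python/NumPy, contrasts a naive eigendecomposition of the full covariance used for an SA-style basis with the much smaller image covariance used for a 2DSA-style basis; the shapes and the naive covariance route are illustrative assumptions, not the paper's exact implementation.

import time
import numpy as np

def time_basis(n=100, h=64, w=36):
    # Compare the cost of building the alignment bases: PCA on vectors of
    # dimension h*w (SA-style) vs. 2DPCA on the w x w image covariance (2DSA-style).
    A = np.random.rand(n, h, w)
    X = A.reshape(n, -1)

    t0 = time.perf_counter()
    C = np.cov(X, rowvar=False)                 # (h*w) x (h*w) covariance
    np.linalg.eigh(C)
    t_sa = time.perf_counter() - t0

    t0 = time.perf_counter()
    Ac = A - A.mean(axis=0)
    G = np.einsum('nij,nik->jk', Ac, Ac) / n    # w x w image covariance
    np.linalg.eigh(G)
    t_2dsa = time.perf_counter() - t0
    return t_sa, t_2dsa

print(time_basis())

The asymmetry is expected: the vector-form route eigendecomposes an (hw) x (hw) matrix, whereas the 2D route only needs a w x w one, which is consistent with the widening runtime gap in Table 14 as the dimensionality grows.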
6.9. Problems within HΔH-divergence
Finally, we would like to emphasize an important problem within HΔH-divergence to inspire further studies. Fig. 12 illustrates two typical relative positions between two domains: separation and tangency. If we estimate the HΔH-divergence for these two situations according to the steps mentioned in Sec. 4.1, the HΔH values will make no difference. Since the domains in both situations are linearly separable, their HΔH values will be close to 2. However, our analysis shows that, when two domains are close enough, they have a high probability of being classified correctly. Hence, it is necessary for a domain divergence measure to differentiate these two situations.
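
The point can be illustrated numerically. In the toy sketch below, two synthetic 2-D domains are generated either far apart or just touching; a linear domain classifier separates them almost perfectly in both cases, so a proxy divergence computed from its accuracy saturates near 2 regardless of how close the domains actually are. The Gaussian construction and the proxy formula are illustrative assumptions, not the estimation protocol of Sec. 4.1.

import numpy as np
from sklearn.svm import LinearSVC

def domain_separability(gap):
    # Two 2-D Gaussian domains whose centers are 'gap' apart along the x-axis.
    rng = np.random.default_rng(0)
    Xs = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(200, 2))
    Xt = rng.normal(loc=[gap, 0.0], scale=0.3, size=(200, 2))
    X = np.vstack([Xs, Xt])
    y = np.hstack([np.zeros(200), np.ones(200)])
    acc = LinearSVC(dual=False).fit(X, y).score(X, y)
    return 2.0 * (2.0 * acc - 1.0)              # proxy divergence from domain accuracy

# A far-apart ('separation') and a just-touching ('tangency') configuration both come out near 2.
print(domain_separability(gap=10.0), domain_separability(gap=1.5))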
7. Conclusion
In this paper, we showed that it is better to align convolutional activations in the two-dimensional world. In particular, we proposed a 2DSA approach to adapt convolutional activations. We gave our insight on why 2DSA works better and further introduced two novel domain divergence measures, termed HwΔHw and HbΔHb, that take labels into account. Extensive experiments justified that 2DSA significantly outperforms SA in both effectiveness and efficiency and also shows superior or at least comparable classification performance with respect to existing benchmark approaches. In addition, an interesting DA application in agriculture was demonstrated as well.
[Figure 12: two schematic panels, each showing a source domain and a target domain.]
Figure 12. Two typical relative positions between two domains. The left denotes that the source domain is separate from the target, and the right indicates that the source is tangent to the target. However, since both domains in these two situations can be linearly separated, it makes no difference to the HΔH-divergence.
Notice that the proposed 2DSA does have limitations. Since 2DSA is only a linear adaptation method, when the distributions of the two domains are significantly distinct, a linear alignment is typically not sufficient and thus 2DSA as proposed may not work. Moreover, in real-world applications, one may encounter the situation that a new test set arrives whose transformed subspace is not aligned with the subspace of the target domain. Under such a circumstance, 2DSA may also fail. Perhaps one possible solution is to realign the new subspace.
For future work, it could be interesting to assign pseudo-labels to the target data and iteratively optimize both the within- and between-class measures so that they could be used as a guiding criterion for choosing a good adaptation in an unsupervised DA context. Moreover, it is worth noting that the introduced measures are independent of a specific distance metric. It is also interesting to explore whether we can learn some kind of metric that achieves both low within- and between-class divergences simultaneously. In addition, we plan to formulate the three-dimensional subspace alignment problem for unsupervised DA, as adapting 3D tensors may be a stronger way to model convolutional activations and may lead to interesting applications, e.g., the adaptation of CNNs not only for new domains but also for new tasks.
Acknowledgment
The authors would like to thank the anonymous reviewers
for their insightful comments. This work is jointly supported by
the National High-tech R&D Program of China (863 Program)
(Grant No. 2015AA015904) and the National Natural Science
Foundation of China (Grant No. 61502187).
References
[1] F. Perronnin, J. Sánchez, T. Mensink, Improving the fisher kernel for large-scale image classification, in: Proc. European Conference on Computer Vision (ECCV), 2010, pp. 143–156. doi:10.1007/978-3-642-15561-1_11.
[2] A. Torralba, A. A. Efros, Unbiased look at dataset bias, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1521–1528. doi:10.1109/CVPR.2011.5995347.
[3] P. Dollar, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: An evaluation of the state of the art, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2012) 743–761. doi:10.1109/TPAMI.2011.155.
[4] V. M. Patel, R. Gopalan, R. Li, R. Chellappa, Visual domain adaptation: A survey of recent advances, IEEE Signal Processing Magazine 32 (2015) 53–69. doi:10.1109/MSP.2014.2347059.
[5] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
[6] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 580–587. doi:10.1109/CVPR.2014.81.
[7] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks?, in: Advances in Neural Information Processing Systems (NIPS), 2014, pp. 3320–3328.
[8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, DeCAF: A deep convolutional activation feature for generic visual recognition, in: Proc. International Conference on Machine Learning (ICML), 2014, pp. 647–655.
[9] Y. Ganin, V. Lempitsky, Unsupervised domain adaptation by backpropagation, in: Proc. International Conference on Machine Learning (ICML), 2015, pp. 1180–1189. URL: http://jmlr.org/proceedings/papers/v37/ganin15.pdf.
[10] N. Zhang, J. Donahue, R. Girshick, T. Darrell, Part-based R-CNNs for fine-grained category detection, in: Proc. European Conference on Computer Vision (ECCV), 2014, pp. 834–849. doi:10.1007/978-3-319-10590-1_54.
[11] K. Saenko, B. Kulis, M. Fritz, T. Darrell, Adapting visual category models to new domains, in: Proc. European Conference on Computer Vision (ECCV), 2010, pp. 213–226. doi:10.1007/978-3-642-15561-1_16.
[12] R. Gopalan, R. Li, R. Chellappa, Domain adaptation for object recognition: An unsupervised approach, in: Proc. IEEE International Conference on Computer Vision (ICCV), 2011, pp. 999–1006. doi:10.1109/ICCV.2011.6126344.
[13] B. Gong, Y. Shi, F. Sha, K. Grauman, Geodesic flow kernel for unsupervised domain adaptation, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2066–2073. doi:10.1109/CVPR.2012.6247911.
[14] B. Fernando, A. Habrard, M. Sebban, T. Tuytelaars, Unsupervised visual domain adaptation using subspace alignment, in: Proc. IEEE International Conference on Computer Vision (ICCV), 2013, pp. 2960–2967. doi:10.1109/ICCV.2013.368.
[15] W. Li, L. Duan, D. Xu, I. W. Tsang, Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2014) 1134–1148. doi:10.1109/TPAMI.2013.167.
[16] H. Pirsiavash, D. Ramanan, C. C. Fowlkes, Bilinear classifiers for visual recognition, in: Advances in Neural Information Processing Systems (NIPS), 2009, pp. 1482–1490.
[17] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (2015) 1904–1916. doi:10.1109/TPAMI.2015.2389824.
[18] R. Girshick, Fast R-CNN, in: Proc. IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448. doi:10.1109/ICCV.2015.169.
[19] J. Yang, D. Zhang, A. Frangi, J.-Y. Yang, Two-dimensional PCA: a new approach to appearance-based face representation and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2004) 131–137. doi:10.1109/TPAMI.2004.1261097.
[20] S. Ben-David, J. Blitzer, K. Crammer, F. Pereira, et al., Analysis of representations for domain adaptation, in: Advances in Neural Information Processing Systems (NIPS), volume 19, 2007, p. 137.
[21] S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22 (2010) 1345–1359. doi:10.1109/TKDE.2009.191.
[22] H. Shimodaira, Improving predictive inference under covariate shift by weighting the log-likelihood function, Journal of Statistical Planning and Inference 90 (2000) 227–244. doi:10.1016/S0378-3758(00)00115-4.
[23] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, J. W. Vaughan, A theory of learning from different domains, Machine Learning 79 (2010) 151–175. doi:10.1007/s10994-009-5152-4.
[24] J. Tao, F. lai Chung, S. Wang, On minimum distribution discrepancy support vector machine for domain adaptation, Pattern Recognition 45 (2012) 3962–3984. doi:10.1016/j.patcog.2012.04.014.
[25] A. S. Mozafari, M. Jamzad, A SVM-based model-transferring method for heterogeneous domain adaptation, Pattern Recognition 56 (2016) 142–158. doi:10.1016/j.patcog.2016.03.009.
[26] J. Blitzer, R. McDonald, F. Pereira, Domain adaptation with structural correspondence learning, in: Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2006, pp. 120–128.
[27] H. Daumé III, Frustratingly easy domain adaptation, in: Proc. Association for Computational Linguistics (ACL), 2007.
[28] Q.-F. Wang, F. Yin, C.-L. Liu, Unsupervised language model adaptation for handwritten Chinese text recognition, Pattern Recognition 47 (2014) 1202–1216. doi:10.1016/j.patcog.2013.09.015.
[29] A. Bergamo, L. Torresani, Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach, in: Advances in Neural Information Processing Systems (NIPS), 2010, pp. 181–189.
[30] B. Kulis, K. Saenko, T. Darrell, What you saw is not what you get: Domain adaptation using asymmetric kernel transforms, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1785–1792. doi:10.1109/CVPR.2011.5995702.
[31] X. Li, M. Fang, J.-J. Zhang, J. Wu, Learning coupled classifiers with RGB images for RGB-D object recognition, Pattern Recognition 61 (2017) 433–446. doi:10.1016/j.patcog.2016.08.016.
[32] E. Kodirov, T. Xiang, Z. Fu, S. Gong, Unsupervised domain adaptation for zero-shot learning, in: Proc. IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2452–2460. doi:10.1109/ICCV.2015.282.
[33] J. Hoffman, E. Rodner, J. Donahue, T. Darrell, K. Saenko, Efficient learning of domain-invariant image representations, CoRR abs/1301.3224 (2013).
[34] S. J. Pan, I. W. Tsang, J. T. Kwok, Q. Yang, Domain adaptation via transfer component analysis, IEEE Transactions on Neural Networks 22 (2011) 199–210. doi:10.1109/TNN.2010.2091281.
[35] M. Long, J. Wang, G. Ding, J. Sun, P. S. Yu, Transfer joint matching for unsupervised domain adaptation, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1410–1417. doi:10.1109/CVPR.2014.183.
[36] R. Aljundi, R. Emonet, D. Muselet, M. Sebban, Landmarks-based kernelized subspace alignment for unsupervised domain adaptation, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 56–63. doi:10.1109/CVPR.2015.7298600.
[37] L. Duan, D. Xu, I. W.-H. Tsang, Domain adaptation from multiple sources: A domain-dependent regularization approach, IEEE Transactions on Neural Networks and Learning Systems 23 (2012) 504–518. doi:10.1109/TNNLS.2011.2178556.
[38] W.-S. Chu, F. De La Torre, J. F. Cohn, Selective transfer machine for personalized facial action unit detection, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013. doi:10.1109/CVPR.2013.451.
[39] M. Long, Y. Cao, J. Wang, M. I. Jordan, Learning transferable features with deep adaptation networks, in: Proc. International Conference on Machine Learning (ICML), 2015.
[40] J. W. Osborne, A. B. Costello, Sample size and subject to item ratio in principal components analysis, Practical Assessment, Research & Evaluation 9 (2004) 8.
[41] L. Van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008) 85.
[42] H. Lu, Z. Cao, Y. Xiao, Z. Fang, Y. Zhu, Toward good practices for fine-grained maize cultivar identification with filter-specific convolutional activations, IEEE Transactions on Automation Science and Engineering (2016). doi:10.1109/TASE.2016.2616485.
[43] H. Lu, Z. Cao, Y. Xiao, Z. Fang, Y. Zhu, Towards fine-grained maize tassel flowering status recognition: dataset, theory and practice, Applied Soft Computing 56 (2017) 34–45. doi:10.1016/j.asoc.2017.02.026.
[44] H. Lu, Z. Cao, Y. Xiao, Z. Fang, Y. Zhu, K. Xian, Fine-grained maize tassel trait characterization with multi-view representations, Computers and Electronics in Agriculture 118 (2015) 143–158. doi:10.1016/j.compag.2015.08.027.
[45] A. Vedaldi, K. Lenc, MatConvNet: Convolutional neural networks for MATLAB, in: Proc. ACM International Conference on Multimedia, 2015, pp. 689–692.
[46] R.-E. Fan, X.-R. Wang, C.-J. Lin, LIBLINEAR: A library for large linear classification, Journal of Machine Learning Research 9 (2014) 1871–1874.
[47] B. Sun, J. Feng, K. Saenko, Return of frustratingly easy domain adaptation, in: Proc. AAAI Conference on Artificial Intelligence, 2016.
[48] Y. Wen, K. Zhang, Z. Li, Y. Qiao, A discriminative feature learning approach for deep face recognition, in: Proc. European Conference on Computer Vision (ECCV), Springer, 2016, pp. 499–515.
[49] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1556 (2014). URL: http://arxiv.org/abs/1409.1556.
[50] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[51] M. Jaderberg, K. Simonyan, A. Zisserman, K. Kavukcuoglu, Spatial transformer networks, in: Advances in Neural Information Processing Systems (NIPS), 2015, pp. 2008–2016.
The emergence of depth images opens a new dimension to address the challenging object recognition tasks. However, when only a small amount of labeled data is available, we cannot learn a discriminative classifier directly using the RGB-D images. To cope with this problem, we proposed a new method, Learning Coupled Classifiers with RGB images for RGB-D object recognition (LCCRRD). We learn the coupled classifiers using RGB images from source domain, the combined RGB and depth images from target domain and RGB images from target domain. The predicted results of the two target classifiers are made to be similar to make them more accurate. We also utilize the correlation between source and target RGB images to boost the relevant features and eliminate the irrelevant features. It also has the capacity to incorporate the manifold structure into our model. Furthermore, a unified objective function is presented to learn the classifier parameters. To evaluate our LCCRRD method, we apply it to five cross domain datasets. The experimental results demonstrate that our method can achieve competing performance against the state-of-art methods for object recognition tasks.