Two-dimensional subspace alignment for convolutional activations adaptation
Hao Lua, Zhiguo Caoa,∗, Yang Xiaoa, Yanjun Zhua
aNational Key Laboratory of Science and Technology on Multi-Spectral Information Processing
School of Automation, Huazhong University of Science and Technology, Wuhan 430074, PR China
Abstract
In real-world computer vision applications, many intrinsic and extrinsic variations can cause a significant domain shift. Although
deep convolutional models have provided us with better domain-invariant features, existing mechanisms to adapt convolutional
activations are still limited. Noticing that convolutional activations are intrinsically represented as tensors, in this paper we develop a two-dimensional subspace alignment (2DSA) approach based on 2D principal component analysis (2DPCA) to better adapt convolutional activations. Extensive experiments demonstrate the advantages of 2DSA over its counterpart SA in both effectiveness and
efficiency. In particular, when trying to explain why 2DSA works well, we find that the best classification performance has low
correlation with the global domain discrepancy measure. In an effort to find a better way to compare domains, we introduce within-
and between-class domain divergence measures to characterize the class-level differences. The proposed measures somewhat shed
light on what a good alignment might be for classification. Furthermore, we also demonstrate a novel domain adaptation application
in agriculture and create a dataset for the problem.
Keywords: Visual domain adaptation, Subspace alignment, Convolutional activations, Two-dimensional PCA, Domain divergence
measure
1. Introduction
In real-world computer vision applications, many intrinsic
and extrinsic variations, such as color, pose, illumination, back-
ground, view point, blurring, or image resolution, can cause a
significant domain variance so that a model built on the source
domain may not perform well on the data with different dis-
tributions from the target domain. Indeed, a stream of studies has reported that algorithm performance degrades significantly
across datasets [1–3]. This is a typical problem called domain
mismatch. A rich body of studies has attempted to alleviate
this challenge under the name of covariate shift, class imbal-
ance, dataset bias, transfer learning, multi-view analysis, and
more recently, domain adaptation (DA) [4].
Deep convolutional neural networks (CNNs) have brought us state-of-the-art visual descriptors, which have benchmarked a series of computer vision tasks, such as image classification [5] and object detection [6]. Indeed, features can be more transferable when learned in deep networks [7, 8]. Strong evidence includes impressive results in DA with a deep-CNN-based approach [9], implying an important role of domain-invariant feature representation. That is, the feature matters.
The most common and effective way to adapt CNN features
is to fine-tune an end-to-end CNN model so that the parameters
can be adjusted to better fit the target dataset [6, 10]. Fine-tuning
is good as long as we have free access to the supervision and a sufficient number of training data. Yet, we seek in this paper whether there exists another mechanism to correct this kind of shift. In particular, we consider the scenarios where no supervision is provided in the target domain or the labeled target data alone are insufficient to build a good classifier, which is exactly the case of DA.
∗Corresponding author. Email addresses: poppinace@hust.edu.cn (Hao Lu), zgcao@hust.edu.cn (Zhiguo Cao), Yang_Xiao@hust.edu.cn (Yang Xiao), yjzhu@hust.edu.cn (Yanjun Zhu).
DA is a frequently studied issue in statistics, machine learning, pattern recognition, natural language processing, and recently, in computer vision. Over the years, many theoretical methods have been developed to address this problem with a moderate degree of success [9, 11–15]. However, to our knowledge, most methods only formulate the problem in the vector-form paradigm. That is, the input features must be vectors. Noticing that convolutional activations are intrinsically represented as tensors, it may be more natural to model them as matrices or tensors, rather than vectors [16]. Also, it has been shown that, when using convolutional activations, we can prevent object deformation by feeding an image of arbitrary size [17] and reuse the features by building a mapping between the raw image and the feature map [18].
Recently, a subspace alignment (SA) based unsupervised DA approach [14] stands out due to its effectiveness and simplicity. Our work is built within this framework. Specifically, we propose to perform two-dimensional subspace alignment (2DSA). A 2DPCA [19] based approach is consequently developed to adapt convolutional activations effectively and efficiently. Compared with its counterpart SA, 2DSA requires less training data, and its parameters can be learned more accurately and efficiently. Experiments on several datasets validate the effectiveness of 2DSA and show that 2DSA outperforms SA by large margins.
Accepted by Pattern Recognition
http://dx.doi.org/10.1016/j.patcog.2017.06.010
June 16, 2017
Figure 1. Three typical situations in the subspace alignment based domain adaptation. Black denotes the source domain, and red the target. A marker denotes a
specific class. The “alignment” indicates a transformation that moves the source subspace to the target one. The left is an ideal situation, middle the situation occurring
in the SA paradigm, and right the 2DSA. SA aligns two domains well but mixes instances coming from different classes (target data cannot be classified correctly),
whilst 2DSA only aligns two domains moderately but preserves good margins between different classes (target data still can be separated linearly). This finding
motivates us to ponder a fundamental question: to what extent is an alignment enough for classification?
In some cases, SA even worsens the classification performance.
We are interested in explaining why 2DSA works better. Our analysis from the reconstruction error perspective shows that 2DPCA generates a better subspace than PCA (the reconstruction error of 2DPCA is lower than that of PCA). Statistically, when exploiting a global H∆H-divergence [20] to measure the domain-level discrepancy, we surprisingly find that the results are beyond expectation. The best classification performance conversely yields the worst H∆H value. After visualizing the data distribution, we observe two interesting patterns shown in Fig. 1. One is that SA aligns two domains well but mixes instances coming from different classes. The other is that 2DSA only aligns two domains moderately but preserves good margins between different classes. This motivates us to ponder a fundamental issue: to what extent is an alignment enough for classification? We answer this question by taking a new perspective on local class distributions. We believe that a good alignment in classification indeed needs to push two distributions of the same class close, but more importantly, it should enlarge or at least preserve the margins between different classes. To formalize this idea, two novel domain discrepancy measures called the within-class divergence Hw∆Hw and the between-class divergence Hb∆Hb are consequently proposed. Different from the H∆H-divergence that only characterizes the domain-level discrepancy, the proposed Hw∆Hw and Hb∆Hb divergences are able to characterize the class-level differences and thus can be viewed as a class-level extension of the H∆H-divergence. By measuring the domain discrepancy from a fine-grained perspective, our results somewhat shed light on what a good alignment might be for classification.
In addition, we further describe an interesting DA application
in agriculture. The application involves categorizing three types
of maize tassel flowering status (MTFS): non-flowering, partially-flowering, and fully-flowering. A dataset termed MTFS3–DA is also constructed. The dataset includes 10 domains and 1500 images covering a 5-year timespan, 4 maize cultivars, and 3 geographical locations. Extensive experiments
on this dataset also show that 2DSA outperforms SA. We hope
this dataset could inspire interests from the pattern recognition
community to address cross-field challenges in agriculture.
Overall, the contributions of this paper include:
• 2DSA: a two-dimensional subspace alignment approach is developed for better convolutional activations adaptation. It is very effective, computationally efficient, and easy to implement;
• Hw∆Hw & Hb∆Hb: two novel divergence measures capable of quantifying within- and between-class variations are proposed to characterize the class-level domain discrepancy. They encourage new perspectives when considering cross-dataset generalization for classification;
• MTFS3–DA: a new dataset concerning three types of flowering status of maize tassel is created for cross-field evaluations in agriculture. It consists of 10 domains and 1500 images.
The dataset and source code are made available online.¹
2. Related work
DA is set in one of the possible settings of transfer learning [21]. Over the years, DA has been extensively studied in both theory and practice, such as the probabilistic inference in statistics [22], generalization bounds in machine learning [20, 23], distribution analysis in pattern recognition [24, 25], as well as various applications in natural language processing [26–28] and computer vision [11, 12, 29–31]. Recent works in the computer vision field mainly focus on the visual recognition problem in either the unsupervised (only unlabeled data are used from the target domain) [9, 32] or the semi-supervised (a limited amount of labeled data are used from the target domain) [15, 30, 33] setting. Readers can refer to [4] for a comprehensive survey. In this paper,
¹The dataset and source code are made available at: https://sites.google.com/site/poppinace/.
Figure 2. The framework of subspace alignment based visual domain adaptation.
we concentrate on the most challenging case: unsupervised visual DA (some literature also refers to it as transductive transfer learning).
According to whether source labels are utilized in the optimization process of DA, we simply divide existing unsupervised DA approaches into two categories: domain-orientated and domain-classification-orientated. The first category only aims at the adaptation between two domains. This line of approaches usually seeks a way to build explicit connections or find implicit commonalities between two domains. Some representative works include TCA [34], SGF [12], GFK [13], SA [14], TJM [35] and LSSA [36]. Also, since current DA approaches are usually evaluated in the context of classification, the second category prefers to model the adaptation and classification jointly. This line of works often involves iterative optimization between the adaptation and classification taking source labels into account, expecting to achieve good classification performance and a fine overlap between domains simultaneously. Some works worth mentioning include (A)SYMM [11], ARCT [30], DAM [37], STM [38], MMDT [33], HFA [15], and recent deep learning based approaches (DDA [9] and DAN [39]).
Our proposed method, 2DSA, belongs to the first category.
Our work is of particular relevance to subspace-based DA approaches. These works share the idea of exploiting low-dimensional data structures that are intrinsic to domains. In particular, [12] proposes sampling a finite number of intermediate subspaces and building geodesic flows to connect the source and target domains. Gong et al. [13] extend the above work by constructing a geodesic flow kernel that projects image representations into infinite-dimensional feature vectors, expecting to encapsulate incremental changes between subspaces that underlie the difference and commonness between domains. Different from these two ideas, Fernando et al. [14] argue that it is more appropriate to align the two domains directly. The basic idea is to learn a transformation matrix by minimizing the Bregman matrix divergence. Intuitively, the transformation matrix defines a movement that potentially pushes the source subspace close to the target one. More recently, [36] further extends [14] in a landmarks-based kernelized paradigm via selecting potential landmarks and incorporating further non-linearity with a Gaussian kernel.
Our work is closely related to [14], because it is built in the same subspace alignment based framework. The main difference, however, is that SA [14] performs a stronger feature-wise alignment, while our method, 2DSA, only performs a partial alignment because the subspace analysis is carried out on a smaller space. Our analysis in Sec. 4 shows that it is adequate to move two subspaces only close to each other to achieve superior classification results. Moreover, 2DSA is very fast when tackling high-dimensional data, such as convolutional activations, which facilitates parameter tuning during cross validation.
3. Subspace alignment based visual domain adaptation
We start by reviewing the subspace alignment based DA framework [14] to give readers a global view. We then discuss the seminal vector-form formulation of SA in Sec. 3.1. Next, in Sec. 3.2, we present our matrix-form extension 2DSA in detail. In particular, we follow the conventional nomenclature of denoting vectors by lowercase boldface letters, like x, matrices by uppercase boldface letters, like X, and tensors by calligraphic letters, like 𝒳. We allow the input image to be of arbitrary size, so a simple spatial pooling is applied as a normalization step, ensuring the consistency of dimensionality. Concretely, any convolutional activations of size H×W×D will be normalized
Figure 3. Illustration of spatial pooling normalization. Any activations within a spatial bin will be pooled by the max operation.
to K×K×D by max pooling. Note that, to preserve spatial information, pooled activations are not vectorized in the fashion of spatial pyramid pooling (SPP) in [17]. Intuitively, this process is illustrated in Fig. 3.
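The normalization step above amounts to adaptive max pooling over a K×K grid of spatial bins. A minimal NumPy sketch, assuming the usual SPP-style bin partitioning (the paper only specifies max pooling into a K×K×D output, so the exact bin boundaries are our assumption):

```python
import numpy as np

def spatial_max_pool(activations, K):
    """Normalize an H x W x D activation tensor to K x K x D by max pooling
    each of K x K spatial bins, without vectorizing the result."""
    H, W, D = activations.shape
    pooled = np.zeros((K, K, D))
    # Evenly spaced bin edges (rounded), as in SPP-style pooling.
    h_edges = np.linspace(0, H, K + 1).astype(int)
    w_edges = np.linspace(0, W, K + 1).astype(int)
    for i in range(K):
        for j in range(K):
            bin_ = activations[h_edges[i]:h_edges[i + 1],
                               w_edges[j]:w_edges[j + 1], :]
            pooled[i, j, :] = bin_.max(axis=(0, 1))
    return pooled
```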
The framework of subspace alignment based visual DA is shown in Fig. 2. The scenario is that we use the training data from the source domain to generate the subspace spanned by Xs, and data from the target domain to generate Xt (Xs and Xt are generated by PCA in [14] and by 2DPCA in 2DSA, which will be explained later). Yet, the domain shift between Xs and Xt is quite large (∆st ≫ 0), so the subspace Xs is aligned by M to correct this shift. Conceptually, M defines a movement that pushes Xs close to Xt. The resulting aligned subspace is denoted by Xa (Xa = Xs M). At this time, Xa looks similar to Xt (∆at ≈ 0). Finally, labeled instances from the source domain are projected by Xa and are used to train a linear SVM at the training stage. At the test stage, unlabeled instances from the target domain are projected by Xt and are predicted with the learned model. The more appropriate an alignment is, the better the classification results should be.
When learning the transformation matrix M, [14] chooses to minimize the following Bregman matrix divergence:

F(M) = ‖Xs M − Xt‖²_F ,  (1)

where ‖·‖_F denotes the Frobenius norm. Under this paradigm, a closed-form solution can be obtained as M* = Xsᵀ Xt, and Xa = Xs Xsᵀ Xt.
3.1. Problems in vector-form formulation
In the context of the vector-form formulation, each K×K×D tensor activation 𝒳 has to be vectorized into a long vector x of size K²D×1 (note that we have restricted our object to convolutional activations). However, the resulting vectorized representations are high-dimensional. When applying PCA to generate a subspace, we need to solve an SVD on an extremely large matrix of size K²D×K²D, but solving a high-dimensional SVD is quite slow. More importantly, it is not tractable in practice, because the DA problem is exactly the case where we do not have enough training data from the target domain to get an exact solution of the SVD. For instance, assume that we only have N (N ≪ M, M = K²D) training instances; let us denote them as aᵢ ∈ ℝᴹ, i = 1, ..., N, and combine them in a matrix A ∈ ℝᴹˣᴺ. The corresponding covariance matrix Gsa can be derived as

Gsa = (1/N) A Aᵀ .  (2)
Algorithm 1 2DSA: Two-dimensional Subspace Alignment
Input: Source features Fs, target features Ft, source labels Ls, subspace dimensionality d
Output: Target labels Lt
1: Xs ← 2DPCA(Fs, d)
2: Xt ← 2DPCA(Ft, d)
3: Xa ← Xs Xsᵀ Xt
4: Pa ← Fs Xa
5: Pt ← Ft Xt
6: Lt ← SVM(Pa, Pt, Ls)
However, note that rank(Gsa) = rank(A Aᵀ) = rank(A) ≤ N, which means we will only get at most N nonzero eigenvalues when solving the SVD on Gsa. In other words, the exact solution is limited by the number of training data, and an appropriate subspace may not be generated (our empirical study in Sec. 6.4 justifies this point). In fact, according to the widely-cited rule of thumb in [40], we expect to have at least 10 times as many training data as the feature dimensionality. Therefore, we argue that directly aligning vector-form convolutional activations may not be a good choice. Inspired by [16], this motivates us to reconsider modeling them in their intrinsic structure.
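This rank limitation is easy to verify numerically; a small sketch with arbitrary illustrative dimensions (M = 500 standing in for K²D, and N = 20 training instances):

```python
import numpy as np

# N = 20 training instances of dimensionality M = 500 (N << M),
# mimicking the vectorized-activation case where M = K^2 * D.
rng = np.random.default_rng(0)
N, M = 20, 500
A = rng.standard_normal((M, N))

# The covariance matrix G_sa = (1/N) A A^T is M x M ...
G = (A @ A.T) / N

# ... but its rank is bounded by N, so at most N nonzero eigenvalues exist
# no matter how large M is.
rank = np.linalg.matrix_rank(G)
print(rank)  # at most 20
```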
3.2. 2DSA: matrix formulation with 2DPCA
2DSA formulates the matrix-form convolutional activations. Specifically, we resort to 2DPCA [19] to generate subspaces. First, each tensor activation of size K×K×D is reshaped into a D×K² matrix. Given a set of matrix-form descriptors Aᵢ ∈ ℝᴰˣᴷ², i = 1, ..., N, the covariance matrix G2dsa can be evaluated as

G2dsa = (1/N) Σ_{i=1}^{N} Aᵢᵀ Aᵢ ,  (3)

where G2dsa ∈ ℝᴷ²ˣᴷ². In the physical sense, G2dsa actually models the global dependency between different filter activations across all pair-wise spatial locations. By solving an SVD on G2dsa, all feature maps share their eigenvectors instead of having eigenvectors in a cube of features (many 2D feature maps). It is efficient to derive Xs and Xt in the corresponding domains, because K² is usually a small value. Also, since K² is small, 2DSA does not require a substantial amount of training data. We can then reuse Eq. 1 to compute the transformation matrix and align the subspace in the same vein. Notice that the orthogonality constraint in both PCA and 2DPCA is important to preserve good class separations in their subspace representations. The pseudo-code of this approach is summarized in Algorithm 1, which is analogous to the algorithm presented in [14].
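Lines 1–5 of Algorithm 1 can be sketched in a few lines of NumPy; the classifier step (line 6) is omitted, since any linear SVM implementation can be plugged in on the flattened projections:

```python
import numpy as np

def twod_pca(F, d):
    """2DPCA subspace: F has shape (N, D, K2); returns the top-d
    eigenvectors (K2 x d) of G = (1/N) * sum_i A_i^T A_i."""
    N = F.shape[0]
    G = sum(A.T @ A for A in F) / N
    eigvals, eigvecs = np.linalg.eigh(G)   # ascending eigenvalue order
    return eigvecs[:, ::-1][:, :d]         # keep the top-d components

def twod_subspace_alignment(Fs, Ft, d):
    """2DSA (Algorithm 1, lines 1-5): align the source 2DPCA subspace
    to the target one and project both domains."""
    Xs = twod_pca(Fs, d)
    Xt = twod_pca(Ft, d)
    Xa = Xs @ (Xs.T @ Xt)                  # aligned source subspace
    Pa = np.stack([A @ Xa for A in Fs])    # projected source features
    Pt = np.stack([A @ Xt for A in Ft])    # projected target features
    return Pa, Pt
```

A linear SVM would then be trained on the flattened Pa with the source labels and applied to the flattened Pt, as in line 6 of Algorithm 1.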
In both formulations, we only need to tune one hyperparameter d that controls the dimensionality of the subspace. To address this, we choose to leverage the theoretical bound deduced by [14] to select the maximum dimensionality dmax to guide the selection process. According to the variant of the consistency theorem [14], given a confidence δ > 0 and a fixed deviation γ > 0, dmax can be selected if it satisfies

(λ^min_{dmax} − λ^min_{dmax+1}) ≥ (1 + √(ln(2/δ)/2)) · 16 d^{3/2} B / (γ √nmin) ,  (4)
where (λ^min_{dmax} − λ^min_{dmax+1}) = min[(λˢ_d − λˢ_{d+1}), (λᵗ_d − λᵗ_{d+1})], and λᵇ_a is the a-th eigenvalue (in descending order) computed from domain b. B is selected so that for any vector x, ‖x‖ ≤ B. nmin = min(Ns, Nt), where Ns and Nt are the numbers of training data in the source and target domain, respectively. Once dmax is identified, for any d ≤ dmax, one can get a reliable solution of M in Eq. 1.
4. Domain discrepancy analysis
In this section, we draw upon domain divergence measures to
analyze the domain discrepancy.
4.1. Quantifying domain discrepancy based on the H∆H-divergence
According to our experiments, we find that 2DSA achieves higher classification accuracy than SA. In this section, we attempt to explain why it works. From the statistical perspective, a common way is to use a distribution measure to estimate the domain discrepancy. The pioneering work of Ben-David et al. [20, 23] established theoretical risk bounds for DA. Since our analysis highly depends on these developments, we begin with a brief introduction to their theoretical results.
Following Theorem 2 in [23], given a hypothesis space H of VC-dimension d̃, and instance sets Us, Ut of size m̃ each, sampled i.i.d. from distributions Ds and Dt respectively, then with probability at least 1 − δ, for every h ∈ H, the corresponding generalization error on the target set can be bounded as

εt(h) ≤ εs(h) + (1/2) d̂H∆H(Us, Ut) + 4 √( (2 d̃ log(2m̃) + log(2/δ)) / m̃ ) + λ̃ ,  (5)
where εs(h) is the source error, and λ̃ equals the combined error εt(h) + εs(h) of the ideal joint hypothesis, which can be supposed to be a negligible term in the case of DA. The bound shows that the source error and d̂H∆H(Us, Ut) (also called the H∆H-divergence) are the most relevant quantities in computing the target error. In particular, we are interested in quantifying d̂H∆H(Us, Ut), because we may understand why 2DSA works better if the performance correlates well with this measure. Next, we shall give a first look at its counterpart, the H-divergence dH(Ds, Dt), which plays a vital role in the rest of our analysis.
dH(Ds, Dt) is also known as the A-distance or total variation distance derived from the statistical distance family, which is used to measure the difference between two probability distributions. Formally, it is defined in [20] as

dH(Ds, Dt) = 2 sup_{h∈H} |Ps(h) − Pt(h)| ,  (6)
where Ps(h) and Pt(h) denote the probability of event h under distributions Ds and Dt, respectively. Intuitively, it describes the largest possible difference between the probabilities that two probability distributions can assign to the same event. With these notions, the symmetric difference hypothesis space H∆H can be further defined [23] as

H∆H = { g(x) | g(x) = h(x) ⊕ h′(x), h, h′ ∈ H } ,  (7)
where ⊕ denotes the XOR operation. In other words, g(x) will be positive in H∆H if and only if a couple of hypotheses h(x) and h′(x) disagree with each other. Thus, dH∆H(Ds, Dt) means computing the A-distance over the symmetric difference hypothesis space. However, directly computing dH∆H(Ds, Dt) is not tractable in practice, so an alternative is to compute its empirical version d̂H∆H(Us, Ut). In particular, estimating d̂H∆H(Us, Ut) requires learning a linear classifier ĥ to see whether source and target instances can be differentiated. More specifically, it involves the following steps:
Step 1. Pseudo-labeling the source and target instances with +1 and −1;
Step 2. Randomly sampling two sets of instances as the training and test set, respectively;
Step 3. Learning a linear classifier ĥ on the training set and verifying its performance on the test set;
Step 4. Estimating the distance as d̂H∆H(Us, Ut) = 2(1 − 2·err(ĥ)) [20], where err(ĥ) is the test error.
If two distributions perfectly overlap with each other, err(ĥ) ≈ 0.5, and d̂H∆H(Us, Ut) ≈ 0. Conversely, if two distributions have large enough margins, err(ĥ) ≈ 0, and d̂H∆H(Us, Ut) ≈ 2. Therefore, d̂H∆H(Us, Ut) ∈ [0, 2]. The lower the value is, the better the two distributions align. In other words, a low divergence value should imply high classification performance.
Now we can empirically evaluate the domain discrepancy of SA and 2DSA. To our surprise, this measure does not correlate with the classification performance. Fig. 4 illustrates a typical case of adapting images from Amazon to Caltech (for details see Sec. 6.1). The highest classification performance does not correspond to the lowest H∆H measure. According to the visualization of data distributions, we observe that both approaches have indeed pushed the same class of different domains close to each other (looking at the “+” class), and the classes aligned by SA generally overlap better than those aligned by 2DSA (looking at, e.g., the “▲” class). Why does SA achieve inferior classification accuracy? This interesting phenomenon motivates us to ponder a fundamental issue: to what extent is an alignment enough for classification? We shall give our answer in the next subsection.
4.2. Measuring domain discrepancy with local class divergence
Note that both SA and 2DSA are global alignments in the sense that all the training data are used to generate the subspaces; the difference is that SA performs a stronger feature-wise adaptation. However, if an alignment is too strong, it may even align data coming from different classes, resulting in the cases shown by the yellow circles in Fig. 4. That is, data from different classes are mixed up under the SA adaptation. In this case, the alignment makes no sense. In addition, let us revisit the
[Figure 4 panels: “No Adaptation, H∆H = 1.33, Recognition Accuracy = 69.32”; “SA Adaptation, H∆H = 1.23, Recognition Accuracy = 58.14”; “2DSA Adaptation, H∆H = 1.45, Recognition Accuracy = 78.93”.]
Figure 4. Category-specific data visualization using t-SNE [41] over a typical DA task from Amazon (red) to Caltech (black) in the Office–Caltech10 dataset. H∆H and recognition accuracy are indicated in each sub-figure title. Each category is denoted by a certain type of marker (the Office–Caltech10 dataset has 10 categories).
Table 1. Cultivar information of each sequence in the MTFS3–DA dataset.
Sequence       | Jundan No.20 | Wuyue No.3 | Nongda No.108 | Zhengdan No.958
Zhengzhou 2010 | X | — | — | —
Zhengzhou 2011 | X | — | — | —
Zhengzhou 2012 | — | — | — | X
Taian 2010–1   | — | X | — | —
Taian 2010–2   | — | X | — | —
Taian 2011–1   | — | — | X | —
Taian 2011–2   | — | — | X | —
Taian 2012–1   | — | — | — | X
Taian 2012–2   | — | — | — | X
Gucheng 2014   | — | — | — | X
“+” class in Fig. 4. In both the SA and 2DSA scenarios, this class is aligned only moderately, but if we classify the data, it may turn out to be the most easily separated class. Therefore, our point is that in classification we actually do not need to enforce two domains to overlap exactly with each other; favorable performance can be achieved as long as each class has large enough margins with the other classes. Hence, given these observations, we deem that it is adequate in the context of classification for the same class to be aligned only moderately close, and for different classes to keep large enough margins.
To formalize our idea, two novel domain discrepancy measures called the within-class divergence Hw∆Hw and the between-class divergence Hb∆Hb are proposed to characterize the class-level differences. A natural way to characterize these differences is to compute a distance over specific distributions. Let us denote the within-class and between-class distances as dw(P^i_s, P^i_t) and db(P^i_{s,t}, P^j_{s,t}), respectively, where the superscript denotes the class, and the subscript the domain. Thus, it is clear that dw(P^i_s, P^i_t) is computed for a certain class between two domains, and db(P^i_{s,t}, P^j_{s,t}) is computed between different classes by considering both domains as a whole. Moreover, we further impose two kinds of constraints on the distances:

dw(P^i_s, P^i_t) < γw ,  (8)
db(P^i_{s,t}, P^j_{s,t}) > γb ,  (9)
where γw and γb ensure a relatively small within-class distance and a large enough between-class distance, respectively. With these two inequalities, Hw∆Hw and Hb∆Hb can be expressed by incorporating Eq. 8 and Eq. 9 into hinge-loss-like formulations:

Hw∆Hw = (1/C) Σ_{i=1}^{C} max(0, dw(P^i_s, P^i_t) − γw) ,  (10)

Hb∆Hb = (1/(C(C−1))) Σ_{i=1}^{C} Σ_{j=1, j≠i}^{C} max(0, γb − db(P^i_{s,t}, P^j_{s,t})) .  (11)

We can see that only those distances that violate the inequality constraints will contribute losses to the measures. Intuitively,
Figure 5. Examples of maize tassel images in the MTFS3–DA dataset from 10 different fields. In each field, from left to right, images denote the flowering status of non-flowering, partially-flowering, and fully-flowering, respectively. Images are rescaled for better viewing.
Hw∆Hw assesses how well two distributions locally align, and Hb∆Hb scores how well an alignment suits classification. Also, we observe that the larger γw and the smaller γb are, the looser those inequality constraints become. Intuitively, how small should dw(P^i_s, P^i_t) be? We think it should not exceed γw, so that data from the same class are close enough and have a high probability of being classified correctly. Meanwhile, how large should db(P^i_{s,t}, P^j_{s,t}) be? We believe it should be at least larger than γb, so that data from different classes can be separated easily. As a consequence, when γw gradually decreases and γb gradually increases, two kinds of curves can be drawn to demonstrate the domain discrepancy under various distance levels.
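Given estimated per-class distances, Eqs. 10 and 11 reduce to averaged hinge losses. A sketch, where the array-based interface (a length-C vector of within-class distances and a C×C matrix of between-class distances) is our own illustration:

```python
import numpy as np

def class_level_divergences(d_within, d_between, gamma_w, gamma_b):
    """Hinge-loss-like within-/between-class divergences (Eqs. 10 and 11).
    d_within:  length-C array, d_w for each class i between the two domains.
    d_between: C x C array, d_b between classes i and j (domains pooled);
               the diagonal is ignored."""
    d_within = np.asarray(d_within, dtype=float)
    d_between = np.asarray(d_between, dtype=float)
    C = len(d_within)
    # Eq. 10: penalize classes whose cross-domain distance exceeds gamma_w.
    hw = np.maximum(0.0, d_within - gamma_w).sum() / C
    # Eq. 11: penalize class pairs whose margin falls below gamma_b.
    off_diag = ~np.eye(C, dtype=bool)
    hb = np.maximum(0.0, gamma_b - d_between[off_diag]).sum() / (C * (C - 1))
    return hw, hb
```

Sweeping γb = γ and γw = 2 − γ for γ ∈ [1, 2], as done in the experiments, traces out the two discrepancy curves.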
To make a direct comparison with the H∆H divergence, we choose to estimate dw(P^i_s, P^i_t) and db(P^i_{s,t}, P^j_{s,t}) in a similar vein as d̂H∆H(Us, Ut). In addition, when plotting the curves, we can conveniently leverage the numerical range of the A-distance to reduce one parameter by setting γb = γ, γw = 2 − γ. Also, γ will gradually increase in the interval [1, 2]. In Sec. 6.5, we show that Hb∆Hb correlates well with the classification accuracy, and Hw∆Hw is also consistent with the global H∆H. Our results imply that Hw∆Hw can be seen as a local version of the H∆H measure, and Hb∆Hb also extends H∆H by endowing it with the ability to measure local variances between classes.
5. MTFS3–DA dataset
The image acquisition device is described in [42]. 10 maize sequences in all are collected to construct the MTFS3–DA dataset. The dataset covers a 5-year timespan from 2010 to 2014, 4 different maize cultivars (Wuyue No.3, Jundan No.20, Nongda No.108 and Zhengdan No.958), and 3 different geographical locations (Zhengzhou, Henan province, China; Taian, Shandong province, China; and Gucheng, Hebei province, China). In practice, cross-field domain shifts in agriculture are mainly caused by these three factors. The information of each sequence is summarized in Table 1.
As the camera monitors the growth of maize from the tasseling stage to the flowering stage (two critical growth stages of maize), we find that maize tassels exhibit three types of flowering status [43]: the initial non-flowering status, the intermediate partially-flowering status, and the final fully-flowering status. Some example images of each sequence are illustrated in Fig. 5. We observe that there only exist subtle textural differences between different types of flowering status, so it can be viewed as a typical cross-domain textural categorization problem. We hope this problem can inspire interest from the pattern recognition community in addressing cross-field challenges in agriculture.
Concretely, we choose to leverage the off-the-shelf bounding box annotations released in our previous work [44] to crop the tassel images from the full-resolution images (extra annotations have been made on the Gucheng 2014 sequence). By doing this, we can relieve the influence of background as much as possible, and it can also be viewed as a coarse pose normalization. In addition, an agrometeorological observer with more than 10 years of experience was invited to help us annotate all sub-images to ensure the correctness of labels. For each sequence, we manually select 50 images from each class. In all, we have 150 images in each visual field and 1500 images in the MTFS3–DA dataset. Notice that the dataset originally released in [44] is mainly developed for the evaluation of the detection problem and does not involve any image-level annotations, while the MTFS3–DA dataset is tailored to the DA problem and is set in the context of visual recognition.
6. Experiments and discussion
We first evaluate our approach in the context of visual recognition on standard DA datasets, following the same experimental protocol as in [11, 13, 14]. In addition, we also perform evaluations on other widely-used image classification datasets and on our constructed MTFS3–DA dataset. Along with these numerical results, we further present empirical studies to explain why our method works.
6.1. Experimental dataset and protocol
Office–Caltech10 dataset. The Office–Caltech10 dataset [13] extends the Office31 dataset [11] by adding another Caltech domain, leading to 4 domains: Amazon, DSLR, webcam, and Caltech. 10 common categories are chosen from these domains, resulting in about 2500 images. Overall, we have 12 DA problems.
Office31 dataset. The Office31 dataset is originally introduced by [11]. It consists of 31 categories and 3 domains. We add another 5 images downloaded from the Internet with the same image resolution to the ruler category of the DSLR domain (only 7 images are contained in the original dataset) so that experiments can be conducted under the same protocol. This dataset has 6 DA problems.
Figure 6. Illustration of selecting a subspace dimensionality with the guide of the theoretical bound (x-axis: subspace dimensionality; y-axis: eigenvalue difference; curves: the theoretical bound and λ^d_min − λ^{d+1}_min).
ImageNet–VOC2007 dataset. We also evaluate our method on the widely-used ImageNet and PASCAL VOC2007 datasets. We choose the same 20 categories as in VOC2007 from ImageNet 2012 to constitute the source domain, and the VOC2007 dataset is regarded as the target domain. Since the categories included are very different from those in the above datasets, experiments performed on this dataset can somewhat demonstrate the generality of our method.
MTFS3–DA dataset. Since our dataset comprises 10 different domains, it leads to a total of A^2_10 = 90 different DA problems. Instead of blindly evaluating all DA problems, we gradually increase the domain shift and organize experiments in a hierarchical manner (see Sec. 6.3 for details). For short, each domain is denoted by {Location}{Year}{Cultivar}{Sequence Number}. The Sequence Number only appears in the Taian sequences. For instance, the Zhengzhou 2010 domain is denoted by Z10J, and the Taian 2011–1 domain by T11N1.
Experimental protocol. Each DA problem is denoted by Source→Target. For the Office–Caltech10, Office31 and our MTFS3–DA datasets, the average multi-class recognition accuracy across all categories over 20 trials is reported on the target domain. In each trial, 20 images are randomly sampled from each category of the source domain as the training set (8 images if the source domain is webcam or DSLR), and the target data is used during both the training and testing stages. Note that the experimental protocol we use on the Office–Caltech10 and Office31 datasets is exactly the same as in [11, 13, 14], except that we use different feature representations, i.e., convolutional activations. Since better feature representations are used, the baseline accuracy is substantially higher than the results reported in their papers (the conventional SURF feature is used in [11, 13, 14]). For the ImageNet–VOC2007 dataset, 50 images are randomly sampled from each category of the ImageNet 2012 subset as the source domain, images from the test set of VOC2007 are used as the target domain, and the average precision for each category is reported. Since we have sufficient data in the source domain, Sec. 6.6 will present additional results with other settings on this dataset.
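As a sketch, a single trial of this protocol without any adaptation step (i.e., the NA baseline) might look as follows; the function and variable names are ours and purely illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

def run_trial(Xs, ys, Xt, yt, n_per_class=20, seed=0):
    """One evaluation trial: sample n_per_class source images per category,
    train a one-vs-rest linear SVM, and report target-domain accuracy."""
    rng = np.random.default_rng(seed)
    # Randomly pick n_per_class training samples from each source category.
    idx = np.concatenate([
        rng.choice(np.flatnonzero(ys == c), n_per_class, replace=False)
        for c in np.unique(ys)
    ])
    clf = LinearSVC().fit(Xs[idx], ys[idx])
    return clf.score(Xt, yt)
```

Averaging `run_trial` over 20 seeds reproduces the reporting scheme described above; the adaptation methods would transform `Xs` and `Xt` before the SVM is trained.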
Parameters setting. The optimal dimensionality d* in both the SA and 2DSA scenarios is determined by two-fold cross validation over the labeled source data with the guide of the theoretical bound (Sec. 3.2), using the range of values 2^k, k = 0, 1, 2, ..., log2(dmax). Fig. 6 illustrates how to find a stable solution with the guide of the theoretical bound. The optimal dimensionality d* should be identified before the intersection of the two types of lines. Since we focus on the adaptation of convolutional activations, methods taking fully-connected activations, such as the DeCAF feature [8], as the representation are not employed for comparison. Generally, the CONV5 activations extracted from a pretrained 7-layer CNN model (imagenet-vgg-m [45]) are used as the feature representation (D = 512), and K = 6 is set in the spatial pooling step (Sec. 3). Thus, the feature dimensionality is K^2 × D = 18432. Additionally, one-vs-rest linear SVMs [46] are used as the classifier, and the penalty factor C is determined by two-fold cross validation on the source domain using the range of values 10^p, p = −3, −2, −1, 0, 1, 2, 3.
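The cross-validated search over powers of two can be sketched as follows (a simplified sketch that omits the theoretical-bound check of Sec. 3.2; all names are illustrative, not part of any released code):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def select_dimensionality(X_src, y_src, d_max):
    """Pick d* by two-fold cross validation over candidates d = 2^k,
    k = 0, 1, ..., log2(d_max), on the labeled source data only."""
    candidates = [2 ** k for k in range(int(np.log2(d_max)) + 1)]
    best_d, best_acc = candidates[0], -np.inf
    for d in candidates:
        # Project the source data onto a d-dimensional PCA subspace.
        Z = PCA(n_components=d).fit_transform(X_src)
        acc = cross_val_score(LinearSVC(), Z, y_src, cv=2).mean()
        if acc > best_acc:
            best_d, best_acc = d, acc
    return best_d
```

The SVM penalty C would be tuned by the same two-fold loop over 10^p, p = −3, ..., 3.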
6.2. Visual recognition results on standard datasets
Several baseline methods are employed to compare against
our 2DSA approach:
• No Adaptation (NA): NA is the basic baseline; the classifier trained on the source domain is directly applied to the target domain.
• Geodesic Flow Kernel (GFK) [13]: GFK is a kernel-based DA method that uses an infinite number of subspaces along the geodesic flow to bridge two domains.
• Transfer Joint Matching (TJM) [35]: TJM formulates feature matching and instance reweighting as a joint optimization problem.
• Landmarks Selection Subspace Alignment (LSSA) [36]: LSSA extends SA by projecting samples onto landmarks and adding further nonlinearity with a Gaussian kernel. Both TJM and LSSA can work at the instance level.
• Subspace Alignment (SA) [14]: SA is described above; it is our most closely related work and serves as the direct baseline.
• SA∗ and 2DSA∗: It may be interesting to see how SA and 2DSA work with nonlinearity. We add these two variants, which use an SVM with a Gaussian kernel as the classifier. Similar to the penalty factor C, the kernel parameter σ is also tuned by two-fold cross validation.
• 2DSA†: A variant of 2DSA that adopts A′_i ∈ R^(K^2×D), i = 1, ..., N, as the matrix descriptor, so 2DSA† solves a D×D covariance matrix as per Eq. 3. In contrast to 2DSA, which performs spatial-mode adaptation, 2DSA† performs feature-mode adaptation. This variant shows what exactly makes 2DSA different from other approaches.
• 2DSA‡: In 2DSA, all feature maps are summed together and sent to 2DPCA. In 2DSA‡, each feature map is vectorized into a 1×K^2 vector and considered as a specific pattern. All these patterns are sent to standard PCA so that all feature maps share the same eigenvectors. We add this baseline to test whether it is 2DPCA that makes the difference, rather than the fact that eigenvectors are shared across feature maps.
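For reference, the SA baseline [14] can be sketched in a few lines (a minimal sketch assuming features are already normalized; the function name is ours):

```python
import numpy as np
from sklearn.decomposition import PCA

def subspace_alignment(S, T, d):
    """SA baseline: learn a linear map that aligns the source PCA basis
    to the target PCA basis, then compare data in the target subspace.
    S: (ns, D) source features; T: (nt, D) target features; d: subspace dim."""
    Xs = PCA(n_components=d).fit(S).components_.T  # (D, d) source basis
    Xt = PCA(n_components=d).fit(T).components_.T  # (D, d) target basis
    M = Xs.T @ Xt                                  # (d, d) alignment matrix
    S_aligned = S @ Xs @ M                         # source, aligned to target
    T_projected = T @ Xt                           # target in its own subspace
    return S_aligned, T_projected
```

A classifier trained on `S_aligned` is then evaluated on `T_projected`; 2DSA applies the analogous alignment to 2DPCA subspaces of the matrix descriptors.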
In the following, the vector-form representation is prefixed with v- and the matrix-form representation with m-. Convolutional activations are denoted as CONV for short. Conventional approaches receive vCONV as the feature representation, while 2DSA and its variants receive mCONV. Tables 2, 3 and 4 list the numerical results. We can make the following observations:
• Classification results on all three standard DA datasets demonstrate that 2DSA almost always outperforms SA, often by large margins. In particular, 2DSA achieves the highest mean accuracy of 74.1% and 55.6% on the Office–Caltech10 and Office31 datasets, respectively. 2DSA also exhibits consistently lower standard deviations than SA, which implies that 2DSA is very stable. In addition, 2DSA ranks second on the challenging VOC2007 test set (2DSA∗ ranks first);
• SA with vCONV does not bring notable improvements in accuracy and sometimes even worsens the classification performance. Similar results are reported in [47], where SA falls behind NA when 4096-dimensional fully-connected activations are used. One possible reason is that the number of training samples affects the quality of the generated subspace. This point will be further justified in Sections 6.4 and 6.6;
• The results of 2DSA† show that the improvement from feature-mode adaptation of CONV is marginal, and 2DSA† even degrades the classification performance significantly on the ImageNet–VOC2007 dataset. This suggests that the spatial-mode adaptation mechanism in 2DSA is what matters. We also consider this to be what distinguishes 2DSA from the other compared approaches: the feature mode is not explicitly adapted. Such a phenomenon may inspire further exploration: is it necessary to adapt features when the feature representations at hand are already good enough? We leave this question open at present.
• It is interesting that 2DSA‡ also works considerably well. We think the reason is again that 2DSA‡ performs 2D adaptation and adapts the spatial mode only. However, compared to 2DSA, its classification accuracy is slightly lower, and its standard deviation is generally higher (especially on the ImageNet–VOC2007 dataset). Perhaps when all feature maps are summed together, as in 2DSA, the covariance matrix can appropriately capture the holistic information of samples, which is more stable than blindly modeling individual feature maps. In addition, the advantage of 2DSA
Table 2. Recognition accuracy (%) on the Office–Caltech10 dataset over 20 trials. The highest accuracy is boldfaced, the second best is shown in red, and the standard
deviation is shown in parentheses (we focus on the comparison between SA and 2DSA, results of other approaches are reported for reference).
Method A→C C→A A→D D→A A→W W→A C→D D→C C→W W→C D→W W→D mean
NA 68.3(2.4) 81.2(3.0) 67.9(4.2) 66.4(2.7) 56.7(4.7) 57.9(3.6) 68.7(3.6) 52.6(1.9) 58.8(3.9) 47.2(3.5) 89.2(2.7) 90.5(2.3) 67.1
GFK 77.2(2.2) 85.2(2.2) 78.1(3.6) 80.4(2.6) 74.3(2.8) 70.4(6.5) 79.2(2.7) 63.9(7.0) 74.7(3.1) 59.6(5.0) 86.9(3.2) 86.4(3.0) 76.4
TJM 80.7(1.5) 88.6(1.6) 79.8(7.6) 74.2(6.2) 75.3(5.8) 62.8(7.4) 84.4(3.8) 59.6(7.2) 79.5(4.8) 52.2(4.8) 94.0(2.5) 92.3(2.1) 76.9
LSSA 78.7(0.9) 87.0(1.7) 78.3(2.2) 77.7(4.2) 70.6(2.3) 63.9(5.1) 77.8(2.7) 64.5(4.1) 68.2(5.2) 58.0(3.4) 95.4(2.2) 96.0(1.6) 76.3
SA∗ 67.4(4.8) 75.8(3.3) 42.3(14.9) 59.1(19.2) 45.5(10.9) 29.8(24.2) 34.7(11.4) 56.5(4.6) 43.5(12.5) 36.9(20.3) 70.7(36.0) 79.2(17.1) 53.5
SA 69.3(3.7) 84.6(3.8) 64.7(7.6) 66.7(12.5) 53.7(5.2) 58.7(10.4) 70.6(5.2) 56.4(12.1) 58.5(4.3) 50.3(8.5) 83.4(6.2) 84.6(5.8) 66.8
2DSA∗ 69.8(14.8) 84.9(16.1) 68.0(15.7) 56.4(24.9) 58.2(17.9) 53.6(23.0) 73.8(14.0) 47.3(22.2) 69.4(13.2) 42.9(20.6) 76.0(34.0) 82.6(31.4) 65.2
2DSA† 68.6(2.4) 85.0(2.4) 67.2(7.0) 69.3(7.0) 60.3(4.3) 58.8(7.8) 72.5(2.4) 53.0(5.9) 58.4(4.0) 48.3(5.5) 81.3(6.6) 85.0(4.2) 67.3
2DSA‡ 74.8(3.2) 85.4(2.6) 72.7(6.7) 70.7(4.4) 67.4(4.0) 58.2(5.1) 77.3(4.9) 58.5(4.2) 70.7(3.7) 50.1(3.2) 91.0(2.8) 92.9(3.4) 72.5
2DSA 75.2(3.0) 85.9(1.8) 75.4(5.2) 73.4(5.0) 66.8(3.9) 63.8(5.6) 76.6(5.0) 62.5(4.7) 69.5(2.9) 55.1(5.1) 91.8(3.0) 93.0(2.5) 74.1
Table 3. Recognition accuracy (%) on the Office31 dataset over 20 trials. The highest accuracy is boldfaced, the second best is shown in red, and the standard
deviation is shown in parentheses (we focus on the comparison between SA and 2DSA, results of other approaches are reported for reference).
Method A→D D→A A→W W→A D→W W→D mean
NA 41.1(3.2) 30.2(2.3) 33.1(2.8) 26.4(2.2) 74.0(2.6) 76.6(1.5) 46.9
GFK 44.5(2.4) 31.1(4.2) 37.1(3.0) 27.3(2.0) 73.3(2.2) 75.6(2.2) 48.1
TJM 44.9(5.1) 34.8(3.3) 37.5(4.3) 31.3(5.1) 73.9(1.8) 76.6(3.2) 49.8
LSSA 38.9(9.6) 34.5(4.5) 29.0(11.7) 33.1(4.4) 80.4(2.4) 79.3(2.3) 49.2
SA∗ 27.8(3.4) 28.9(6.0) 27.9(3.3) 23.1(8.5) 74.5(3.3) 73.0(5.6) 42.5
SA 43.8(9.2) 40.1(3.3) 40.9(7.7) 35.8(6.0) 81.5(2.6) 75.6(4.7) 52.9
2DSA∗ 47.8(6.5) 35.6(4.1) 40.3(5.8) 29.1(7.1) 86.8(2.8) 88.4(2.6) 54.7
2DSA† 45.4(2.7) 35.7(2.0) 37.3(3.4) 32.4(2.1) 78.7(2.0) 80.5(2.9) 51.7
2DSA‡ 45.2(4.2) 32.3(5.3) 39.2(3.8) 28.3(3.4) 81.3(2.5) 84.2(2.1) 51.8
2DSA 47.3(4.2) 37.6(1.7) 39.2(6.0) 35.8(1.4) 85.7(2.1) 88.3(2.2) 55.6
over 2DSA‡ is obvious when the number of source samples is limited (D/W→A/C), which means 2DSA is more suitable for small sample sizes than 2DSA‡. As a consequence, 2DPCA seems the better choice for 2D adaptation.
• 2DSA∗ achieves the highest average precision on the ImageNet–VOC2007 dataset. Yet, the nonlinearity used in SA∗ and 2DSA∗ does not always benefit classification. On the Office–Caltech10 and Office31 datasets, introducing the Gaussian kernel has a negative effect on the classification accuracy. SA∗ and 2DSA∗ also exhibit much higher standard deviations than their linear counterparts on the Office–Caltech10 dataset. Hence, one should be careful when using nonlinearity in practice.
• Although TJM achieves higher classification accuracy than 2DSA on the Office–Caltech10 dataset, TJM does not work well when tackling complicated classification problems (31 categories are included in the Office31 dataset) or when inferring classes with complex backgrounds (VOC2007 dataset). Here is a plausible explanation. Since TJM optimizes an instance reweighting procedure, it works at the instance level. However, as shown in Fig. 7, the Office31 dataset contains some images with inaccurate labels, and the VOC2007 dataset is a typical multi-label dataset. Ambiguous labels are very likely to lead to sample shifts from one class to another. If such samples are assigned larger weights, the quality of adaptation is largely affected. In contrast, 2DSA is a subspace-based approach and works at the domain level; it is not that sensitive to the variations of individual instances. This may explain why 2DSA outperforms TJM on the Office31 and ImageNet–VOC2007 datasets.
• The reason why 2DSA outperforms LSSA may be similar to that for TJM. LSSA also contains an instance reweighting process, so it may suffer from the same problem as TJM. In LSSA, the source and target data are projected onto a shared space using a Gaussian kernel with respect to the selected landmarks. If the selected landmarks contain noisy samples, the resulting nonlinear representations may also be unreliable. With unreliable representations, the data distributions in the projected space may not change in the way we expect, and linear classification does not benefit.
• In fact, we think the performance degradation also has something to do with the use of deep features. According to a recent work [48], deep features are considered fragile: features are separable but not discriminative enough (the
Table 4. Average precision (%) on the ImageNet–VOC2007 dataset. The highest average precision is boldfaced, the second best is shown in red, and the standard
deviation is shown in parentheses (we focus on the comparison between SA and 2DSA, results of other approaches are reported for reference).
VOC2007 aero bike bird boat bottle bus car cat chair cow
NA 68.7(6.5) 60.2(4.1) 49.3(4.5) 60.2(7.7) 25.8(1.9) 51.1(3.1) 65.0(2.9) 65.3(2.0) 16.1(4.6) 22.7(7.2)
GFK 59.6(7.6) 57.6(5.9) 33.3(17.6) 40.3(9.9) 23.7(4.3) 48.5(3.2) 64.4(2.6) 46.3(12.0) 13.5(4.9) 19.6(7.1)
TJM 70.8(1.3) 63.9(2.6) 14.8(8.7) 40.2(14.9) 14.9(3.3) 49.0(3.7) 68.7(1.8) 51.9(10.7) 14.3(5.9) 17.0(14.2)
LSSA 54.2(2.3) 58.3(3.1) 25.6(3.2) 21.9(3.1) 22.1(2.0) 39.3(2.4) 63.8(1.4) 45.6(4.3) 27.5(4.0) 16.5(3.0)
SA∗ 70.6(3.9) 62.2(5.1) 55.2(10.2) 61.5(11.5) 22.8(4.4) 58.8(4.1) 71.4(2.5) 64.0(6.8) 15.5(6.1) 30.1(6.0)
SA 60.1(12.8) 51.6(18.4) 28.3(11.3) 41.8(18.5) 17.8(3.7) 45.9(12.7) 66.4(7.4) 48.6(12.2) 19.4(9.8) 27.2(7.1)
2DSA∗ 78.8(3.6) 73.4(3.0) 68.5(6.1) 74.4(4.2) 30.4(2.6) 64.3(3.6) 75.4(2.5) 74.9(2.8) 20.5(6.2) 48.2(5.8)
2DSA† 54.2(2.3) 58.3(3.1) 25.6(3.2) 21.9(3.1) 22.1(2.0) 39.3(2.4) 63.8(1.4) 45.6(4.3) 27.5(4.0) 16.5(3.0)
2DSA‡ 69.3(3.3) 63.5(3.8) 60.9(3.8) 67.3(5.4) 30.0(2.8) 52.2(4.1) 69.5(2.2) 69.6(3.6) 13.9(4.4) 32.4(4.9)
2DSA 68.7(1.3) 66.3(1.2) 50.7(2.9) 65.9(2.1) 30.8(2.5) 53.8(2.0) 74.6(1.1) 67.7(0.8) 29.1(5.4) 33.5(2.8)
table dog horse mbike person plant sheep sofa train tv mean
NA 27.4(4.0) 42.2(10.8) 37.4(13.2) 51.1(4.4) 70.8(2.2) 18.1(2.5) 44.7(4.7) 36.9(5.6) 69.2(2.8) 47.8(5.3) 46.5
GFK 26.4(6.6) 18.1(4.4) 18.6(17.9) 52.3(8.2) 73.9(2.3) 14.7(0.7) 27.4(15.7) 35.3(5.7) 58.8(5.8) 31.1(6.3) 38.2
TJM 29.2(3.8) 24.5(16.0) 22.6(27.0) 53.7(2.2) 67.6(2.4) 15.3(2.0) 10.0(12.3) 39.3(2.2) 64.3(2.6) 33.9(2.3) 38.3
LSSA 25.3(2.7) 40.8(1.7) 40.4(7.6) 41.7(3.4) 74.3(2.8) 11.0(3.1) 23.7(4.5) 27.7(3.6) 52.4(3.9) 29.5(3.4) 37.1
SA∗ 32.1(6.3) 47.6(7.4) 51.4(14.9) 64.0(3.7) 75.5(2.1) 17.6(2.6) 42.7(8.9) 48.3(7.0) 72.9(5.0) 48.9(6.5) 50.7
SA 33.6(4.8) 47.2(8.1) 55.9(15.6) 49.2(13.9) 69.9(9.4) 13.4(5.3) 36.6(12.9) 24.9(13.2) 62.8(10.1) 32.7(5.5) 41.7
2DSA∗ 38.5(5.3) 59.3(3.5) 65.1(6.7) 70.9(4.6) 77.7(2.2) 18.5(2.0) 64.1(3.6) 58.6(5.3) 78.4(2.8) 61.3(5.8) 60.1
2DSA† 25.3(2.7) 40.8(1.7) 40.4(7.6) 41.7(3.4) 74.3(2.8) 11.0(3.1) 23.7(4.5) 27.7(3.6) 52.4(3.9) 29.5(3.4) 37.1
2DSA‡ 37.3(6.2) 57.6(4.0) 48.0(9.7) 59.4(3.7) 72.8(2.3) 17.1(3.0) 56.8(3.8) 38.3(3.6) 72.2(4.3) 55.5(3.8) 52.2
2DSA 35.3(2.5) 56.3(3.2) 48.4(5.6) 58.5(3.6) 77.5(1.6) 24.5(1.9) 52.2(4.0) 43.2(2.0) 75.0(0.9) 57.2(2.0) 53.5
intra-class variations are still large). [48] shows that deep features typically present bubble-like shapes in the feature space; different bubbles indicating different classes may easily intersect if a disturbance appears. This problem becomes serious in the context of CONV adaptation. The disturbance can be the very nature of DA (distribution mismatch) or a poor estimation of parameters caused by high dimensionality. Nevertheless, the good news is that the spatial-mode adaptation mechanism in 2DSA does not seem to ruin the good class separation of CONV.
• It can be concluded that, when aligning convolutional activations, it is better to formulate the problem in the two-dimensional paradigm. Moreover, if desirable domain-invariant feature representations are available, a simple linear adaptation already seems adequate.
6.3. Visual recognition results on MTFS3–DA dataset
For the MTFS3–DA dataset, we organize our experiments in a hierarchical manner. In particular, we gradually increase the domain shifts and evaluate the recognition performance under single-type, double-type and triple-type variations. More specifically, three types of variations (years, cultivars and geographical locations) are considered. On this dataset, we only compare the performance of 2DSA against NA and SA. Accuracy improvements over the NA baseline of around 10% are underlined, indicating a significant improvement.
Figure 7. Images shown in the first row are labeled as ruler in the Office31 dataset, and the second row shows images with multiple labels in the VOC2007 dataset. Such images with ambiguous labels may affect the performance of instance-level DA methods.
6.3.1. Performance degradation
Before we evaluate these DA problems, we first highlight the problem of cross-field performance degradation. Concretely, we choose 3 typical domains, Z10J, T11N1 and G14Z, as the source domains, respectively, and test the recognition performance on the other 9 target domains. The mean recognition accuracy is reported. Numerical results listed in Table 5 show that the performance degrades significantly in all cases when the classifier trained on the source domain is directly applied. This is an important problem that is often ignored in field-based visual applications in agriculture. The factors that make plants differ from year to year or from location to location are complicated. For instance, the quality of seeds, the variations
Table 5. Performance degradation from one domain to the other. The performance in the first column is obtained by testing the data from the same domain, and the
standard deviation is shown in parentheses.
Source Target
Z10J: Z11J Z12Z T10W1 T10W2 T11N1 T11N2 T12Z1 T12Z2 G14Z
77.1(5.0) 56.6(4.0)↓ 56.7(5.0)↓ 58.7(5.7)↓ 55.1(5.1)↓ 47.2(4.9)↓ 50.4(4.5)↓ 55.4(4.6)↓ 51.8(4.2)↓ 52.6(6.1)↓
T11N1: Z10J Z11J Z12Z T10W1 T10W2 T11N2 T12Z1 T12Z2 G14Z
77.1(5.1) 48.5(5.3)↓ 43.3(4.9)↓ 42.5(3.3)↓ 52.9(5.3)↓ 46.5(5.0)↓ 44.7(4.7)↓ 51.5(4.6)↓ 48.6(4.4)↓ 50.2(6.2)↓
G14Z: Z10J Z11J Z12Z T10W1 T10W2 T11N1 T11N2 T12Z1 T12Z2
72.8(4.4) 44.4(4.8)↓ 40.7(4.5)↓ 36.0(3.6)↓ 49.4(5.2)↓ 45.3(3.6)↓ 46.2(4.8)↓ 45.8(3.9)↓ 48.7(4.2)↓ 44.1(3.9)↓
Table 6. Recognition accuracy (%) under the same cultivar and geographical location but different years over 2 DA problems. The highest accuracy is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method Z10J→Z11J Z11J→Z10J mean
NA 56.6(4.0) 51.6(4.1) 54.1
SA 55.5(7.9) 48.2(12.7) 51.9
2DSA 61.1(4.9) 56.8(5.6) 59.0
of weather and the nutritional status of the soil all largely affect the growth of plants. In addition, different plants encounter interspecific competition. This is why different plants tend to exhibit different flowering status even if they are seeded at the same time.
6.3.2. DA evaluation under single-type variation
In the first series of evaluations, we consider DA problems caused by only a single type of variation. In particular, the two variation types of years and geographical locations are evaluated, respectively. Note that the scenario of single cultivar variation is not included, because plants of different cultivars are currently not planted within the same year and geographical location.
Same cultivar and geographical location but different years. Here, we only allow the year to vary while the other two factors are fixed, leading to the 2 DA problems shown in Table 6. In this situation, the weather condition is the main factor that affects the growth of plants. Results show that 2DSA improves the cross-field classification performance and also outperforms SA, which means the shifts caused by weather conditions can be corrected appropriately.
Same cultivar and year but different geographical locations. In this setting, we restrict the cultivar and year to be the same and only vary the geographical locations, resulting in the 4 DA problems shown in Table 7. Plants in different locations are greatly influenced by soil conditions. Results demonstrate a tendency similar to that of the first experiment.
6.3.3. DA evaluation under double-type variation
In the second series of DA evaluations, we consider three
kinds of double-type variations. Concretely, they are as follows.
Same geographical location but different years and cultivars. We simultaneously vary the years and cultivars but require the geographical location to be the same, giving rise to 24 DA problems. When different cultivars are considered, maize tassels tend to exhibit significant appearance variations, e.g., different colors. Results are listed in Table 8. It is surprising to see that 2DSA significantly improves the classification performance in 13 out of 24 DA problems, implying that the shifts caused by years and cultivars are not that serious.
Same cultivar but different years and geographical locations. Similarly, in this setting we fix the cultivar and change the other two factors. The 6 DA problems in Table 9 also demonstrate the effectiveness of 2DSA, and 3 of them exhibit a notable performance improvement of over 10%.
Same year but different cultivars and geographical locations. In this context, only cultivars and geographical locations change simultaneously, and 8 DA problems in all are evaluated. According to the results shown in Table 10, 2DSA significantly improves the accuracy on only one DA task, and 2DSA does not work on the T10W2→Z10J problem. Hence, on the basis of the above results, we conclude that the geographical location is a more important factor in causing domain shifts than the cultivar and the year. Indeed, this is in accordance with our intuition that the various soil conditions of different locations greatly affect the growth of plants.
6.3.4. DA evaluation under triple-type variation
In the final experiment, all three kinds of variations can vary
simultaneously, resulting in the most challenging setting.
Different years, cultivars and geographical locations. Overall, we have 36 DA problems. Numerical results are listed in Table 11. It is interesting that all DA tasks with significant improvements involve the G14Z domain, which means the shifts caused by such a domain are not easily adapted. For the other DA tasks that do not involve the G14Z domain, we find that, although 2DSA still works, the recognition baseline is generally lower than in the single-type and double-type cases. Domain shifts seem serious when all variations are involved, because 22 problems do not exhibit notable accuracy improvements. In addition, SA even works better than 2DSA in two DA problems. As per these observations, we believe that the classification performance indeed has a close relation to the specific data distributions.
Table 7. Recognition accuracy (%) under the same cultivar and year but different geographical locations over 4 DA problems. The highest accuracy is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method Z12Z→T12Z1 T12Z1→Z12Z Z12Z→T12Z2 T12Z2→Z12Z mean
NA 51.2(5.8) 54.9(5.2) 47.6(4.1) 48.9(5.5) 50.6
SA 49.6(11.0) 49.4(5.8) 43.2(9.4) 47.8(6.8) 47.5
2DSA 57.8(5.6) 59.4(3.0) 55.2(5.3) 54.3(6.5) 56.7
Table 8. Recognition accuracy (%) under the same geographical location but different years and cultivars over 24 DA problems. The highest accuracy is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method T10W1→T11N1 T11N1→T10W1 T10W1→T11N2 T11N2→T10W1 T10W1→T12Z1 T12Z1→T10W1 T10W1→T12Z2 T12Z2→T10W1
NA 58.8(6.3) 52.9(5.3) 49.9(3.8) 54.1(6.5) 61.4(5.6) 59.4(4.4) 58.0(5.8) 58.7(5.5)
SA 57.8(8.9) 59.4(11.1) 50.7(7.1) 48.7(9.9) 60.2(7.5) 56.8(11.7) 52.2(9.9) 55.8(9.4)
2DSA 66.0(3.7) 62.8(5.0) 59.5(4.0) 62.8(5.9) 66.5(4.6) 69.5(4.1) 67.6(4.2) 66.9(5.9)
T10W2→T11N1 T11N1→T10W2 T10W2→T11N2 T11N2→T10W2 T10W2→T12Z1 T12Z1→T10W2 T10W2→T12Z2 T12Z2→T10W2
NA 55.2(5.1) 46.5(5.0) 52.4(4.7) 52.9(6.2) 59.3(6.7) 56.6(4.6) 57.9(4.7) 51.3(6.1)
SA 53.7(9.9) 49.4(9.0) 53.3(7.1) 48.4(8.1) 53.6(14.8) 57.8(6.9) 45.3(9.9) 50.0(9.2)
2DSA 66.6(4.0) 58.8(5.0) 59.7(5.9) 60.1(5.0) 67.5(3.6) 67.2(3.4) 64.9(5.9) 64.5(5.9)
T11N1→T12Z1 T12Z1→T11N1 T11N1→T12Z2 T12Z2→T11N1 T11N2→T12Z1 T12Z1→T11N2 T11N2→T12Z2 T12Z2→T11N2 mean
NA 51.5(4.6) 50.1(4.9) 48.6(4.4) 52.3(5.5) 54.6(6.5) 48.6(4.0) 50.0(5.1) 46.2(4.8) 53.6
SA 55.4(5.2) 53.3(5.7) 48.7(6.3) 44.4(9.4) 51.1(12.1) 52.4(10.2) 43.7(8.7) 51.9(6.8) 52.2
2DSA 60.7(3.8) 61.6(4.4) 57.5(5.7) 61.9(4.0) 61.6(6.0) 59.8(5.6) 55.4(6.6) 56.2(4.6) 62.7
Table 9. Recognition accuracy (%) under the same cultivar but different years and geographical locations over 6 DA problems. The highest accuracy is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method G14Z→T12Z1 T12Z1→G14Z G14Z→T12Z2 T12Z2→G14Z G14Z→Z12Z Z12Z→G14Z mean
NA 48.7(4.2) 55.0(6.7) 44.1(3.9) 53.0(6.3) 36.0(3.6) 48.9(4.7) 47.6
SA 54.3(7.2) 51.5(8.3) 50.6(7.3) 51.8(8.0) 47.6(8.7) 48.6(8.9) 50.7
2DSA 62.3(7.8) 62.7(2.9) 61.4(3.6) 59.2(5.3) 57.1(6.5) 57.4(8.8) 60.0
Table 10. Recognition accuracy (%) under the same year but different cultivars and geographical locations over 8 DA problems. The highest accuracy is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method Z10J→T10W1 T10W1→Z10J Z10J→T10W2 T10W2→Z10J Z11J→T11N1 T11N1→Z11J Z11J→T11N2 T11N2→Z11J mean
NA 58.7(5.7) 59.6(4.1) 55.1(5.1) 60.6(5.3) 45.1(6.5) 43.3(4.9) 44.9(4.6) 49.5(7.3) 52.1
SA 54.3(9.2) 60.0(5.5) 55.3(9.6) 53.6(10.7) 45.3(7.0) 50.2(8.6) 44.5(6.3) 48.7(7.3) 51.5
2DSA 67.4(4.7) 65.3(3.1) 64.4(4.3) 60.6(4.9) 51.0(8.6) 55.4(6.9) 50.7(6.1) 56.7(4.5) 59.0
6.4. Subspace analysis by measuring the reconstruction error
As previously stated, we conjecture the quality of generated
subspaces affects the performance. To justify this, here we as-
sess the subspace quality from the perspective of reconstruction
error. Fig. 8illustrates the results. Note that
Q
shown in the fig-
ure denotes the widely-used energy parameter that controls the
subspace dimensionality. It is clear that the reconstruction error
of 2DPCA is generally lower than PCA. Also, we note that PCA
exhibits a relatively high error even when
Q
equals to 100%,
while 2DPCA is already close to zero. This gives evidence that
PCA cannot appropriately reconstruct convolutional activations
with a limited number of training data.
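The comparison can be sketched as follows (a simplified sketch that keeps a fixed number of components d rather than thresholding the energy parameter Q; the matrix descriptors and the exact 2DPCA formulation of Eq. 3 may differ in detail from our setup):

```python
import numpy as np

def reconstruction_errors(A, d):
    """Relative reconstruction error of 2DPCA vs. vectorized PCA.
    A: (N, m, n) stack of matrix descriptors; d: components kept."""
    N, m, n = A.shape
    mean = A.mean(axis=0)
    Ac = A - mean
    # 2DPCA: n x n image covariance G = (1/N) sum_i Ac_i^T Ac_i.
    G = np.einsum('ima,imb->ab', Ac, Ac) / N
    w, V = np.linalg.eigh(G)
    X = V[:, np.argsort(w)[::-1][:d]]          # top-d eigenvectors (n, d)
    A2d = Ac @ X @ X.T + mean                  # 2DPCA reconstruction
    err_2dpca = np.linalg.norm(A - A2d) / np.linalg.norm(A)
    # PCA on vectorized samples (rank limited by N - 1 when N is small).
    Av = Ac.reshape(N, -1)
    _, _, Vt = np.linalg.svd(Av, full_matrices=False)
    P = Vt[:d].T                               # (m*n, d) principal directions
    Apca = (Av @ P @ P.T).reshape(N, m, n) + mean
    err_pca = np.linalg.norm(A - Apca) / np.linalg.norm(A)
    return err_2dpca, err_pca
```

The intuition matches Fig. 8: with few samples N, the vectorized covariance is rank-deficient, while the small n×n image covariance of 2DPCA is estimated from N·m row vectors and is therefore much better conditioned.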
6.5. Quantifying the domain discrepancy using divergence measures
Here, we evaluate the domain discrepancy on four typical DA problems from the Office–Caltech10 dataset, using both the global and the proposed local divergence measures. Concretely, we compute the H∆H-divergence measure and the corresponding recognition accuracy over the selected DA problems. Results are listed in Table 12. They demonstrate a tendency similar to our observations in Sec. 4.1. That is, a lower H∆H value does not imply a good classification result, which means the superiority of 2DSA cannot be explained in the global sense. To this end, we further compute the within-class divergence Hw∆Hw and the between-class divergence Hb∆Hb, expecting to infer the results from a local perspective. Concretely, we plot the γ-curves for Hw∆Hw
Table 11. Recognition accuracy (%) under different years, cultivars and geographical locations over 36 DA problems. The highest accuracy is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method Z10J→T11N1 T11N1→Z10J Z10J→T11N2 T11N2→Z10J Z10J→T12Z1 T12Z1→Z10J Z10J→T12Z2 T12Z2→Z10J Z10J→G14Z
NA 47.2(4.9) 48.5(5.3) 50.4(4.5) 51.0(5.3) 55.4(4.6) 58.1(4.3) 51.8(4.2) 52.6(4.7) 52.6(6.1)
SA 51.7(6.2) 55.4(9.4) 54.0(5.9) 51.8(8.4) 55.5(5.8) 56.0(11.7) 48.4(9.9) 55.4(10.1) 52.5(6.1)
2DSA 65.2(5.0) 59.1(6.2) 60.9(5.5) 61.5(6.7) 64.7(4.8) 62.1(5.0) 56.8(4.2) 60.3(6.5) 63.1(4.2)
G14Z→Z10J Z11J→T10W1 T10W1→Z11J Z11J→T10W2 T10W2→Z11J Z11J→T12Z1 T12Z1→Z11J Z11J→T12Z2 T12Z2→Z11J
NA 44.4(4.8) 46.5(6.0) 52.8(4.9) 49.8(6.2) 56.2(4.7) 47.1(6.6) 47.8(6.3) 43.7(4.5) 47.3(5.5)
SA 60.9(6.9) 50.0(9.5) 49.5(10.1) 48.2(10.7) 45.1(9.3) 45.4(10.3) 49.6(9.0) 49.3(8.3) 49.3(9.0)
2DSA 59.1(7.2) 55.0(6.5) 59.7(5.9) 53.8(7.4) 60.2(4.6) 54.8(8.0) 49.8(5.0) 52.1(3.7) 49.1(7.3)
Z11J→G14Z G14Z→Z11J Z12Z→T10W1 T10W1→Z12Z Z12Z→T10W2 T10W2→Z12Z Z12Z→T11N1 T11N1→Z12Z Z12Z→T11N2
NA 46.4(7.0) 40.7(4.5) 53.3(5.1) 50.3(5.3) 53.6(4.0) 54.8(4.9) 44.0(3.9) 42.5(3.3) 46.5(3.8)
SA 41.0(8.1) 56.1(7.1) 58.0(12.4) 51.8(10.6) 50.0(10.2) 52.3(7.0) 46.3(6.8) 46.8(6.6) 47.9(7.8)
2DSA 53.4(5.3) 54.5(7.4) 62.7(6.1) 61.8(4.7) 61.4(6.1) 62.6(4.6) 56.6(5.1) 55.6(5.7) 55.0(4.7)
T11N2→Z12Z T10W1→G14Z G14Z→T10W1 T10W2→G14Z G14Z→T10W2 T11N1→G14Z G14Z→T11N1 T11N2→G14Z G14Z→T11N2 mean
NA 49.5(3.0) 56.8(3.5) 49.4(5.2) 50.9(6.8) 45.3(3.6) 50.2(6.2) 46.2(4.8) 48.6(5.3) 45.8(3.9) 49.4
SA 47.4(9.3) 50.8(9.9) 54.6(6.4) 51.0(9.4) 51.0(7.8) 47.6(8.0) 48.3(7.9) 46.0(7.2) 47.3(7.6) 50.6
2DSA 52.6(5.1) 63.2(4.4) 61.6(5.1) 60.5(6.0) 60.9(7.8) 56.1(4.5) 60.5(8.9) 57.9(4.8) 58.2(4.9) 58.4
Table 12. H∆H domain discrepancy measure and the corresponding recognition accuracy (%) (in parentheses) of different approaches over a specific trial of 4 adaptation problems on the Office–Caltech10 dataset. The lowest H∆H is boldfaced and the highest accuracy is underlined.
Method A→C A→D C→D C→W
NA 1.33 (69.3) 1.99 (76.3) 1.79 (65.0) 1.66 (64.3)
SA 1.23 (58.1) 0.89 (54.4) 0.94 (52.7) 1.13 (62.9)
2DSA 1.45 (78.9) 2.00 (83.7) 1.94 (83.4) 1.65 (74.9)
and Hb∆Hb over the same adaptation tasks in Fig. 9 and Fig. 10, respectively. We observe that the tendency in Hw∆Hw is analogous to that of H∆H, though some fluctuations occur. That is to say, Hw∆Hw can be seen as a local version of H∆H to some degree. Finally, when resorting to the between-class divergence, we find that Hb∆Hb correlates well with the recognition accuracy. In general, a lower Hb∆Hb implies good recognition performance.
According to these results, we can see that Hw∆Hw characterizes how good an alignment is, while Hb∆Hb depicts how well the classification performs. Thus, we believe one should pay more attention to the local class distributions when considering cross-domain classification problems, especially the between-class distributions.
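The exact definitions of Hw∆Hw and Hb∆Hb are given in Sec. 4; purely to illustrate what conditioning a domain comparison on class labels means, a toy centroid-based analogue can be sketched as follows. Both helpers (`within_class_gap`, `between_class_gap`) are hypothetical stand-ins for exposition, not the measures used in our experiments, and their sign conventions need not match those of Hw∆Hw and Hb∆Hb:

```python
import numpy as np

def class_centroids(X, y):
    """Per-class mean feature vectors of a domain."""
    return {c: X[y == c].mean(0) for c in np.unique(y)}

def within_class_gap(Xs, ys, Xt, yt):
    """Average cross-domain distance between SAME-class centroids:
    an illustrative class-conditional analogue of a within-class divergence."""
    cs, ct = class_centroids(Xs, ys), class_centroids(Xt, yt)
    shared = sorted(set(cs) & set(ct))
    return float(np.mean([np.linalg.norm(cs[c] - ct[c]) for c in shared]))

def between_class_gap(Xs, ys, Xt, yt):
    """Average cross-domain distance between DIFFERENT-class centroids:
    an illustrative class-conditional analogue of a between-class divergence."""
    cs, ct = class_centroids(Xs, ys), class_centroids(Xt, yt)
    shared = sorted(set(cs) & set(ct))
    gaps = [np.linalg.norm(cs[a] - ct[b]) for a in shared for b in shared if a != b]
    return float(np.mean(gaps))
```

For a well-aligned domain pair with separated classes, the within-class gap is small while the between-class gap stays large, which is the regime in which cross-domain classification succeeds.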
6.6. Do we need more training data?
As aforementioned, we partially ascribe the inferior performance of SA to the limited number of training data, and we have also justified this point from the perspective of reconstruction error. In this section, we further conduct experiments to see whether the performance can be enhanced by adding more training data. Specifically, we use the ImageNet–VOC2007 dataset. We continuously change the number of training data sampled from each category, denoted by Nclass, and monitor the variation of average precision (AP). Results for 9 typical classes are illustrated in Fig. 11. We observe that the more training data we use, the better the performance generally becomes. This trend is obvious for the methods employing vector-form representations (NA and SA). Furthermore, we find that only 2DSA achieves favorable results even with a limited number of training data (Nclass=8 or Nclass=16), implying that the performance of 2DSA is not particularly sensitive to changes in Nclass. This also implies that 2DSA can be applied in small-sample-size situations, which are common in real-world applications. Based on the results presented, we believe that the performance of SA indeed has a close relation to the number of training data.
6.7. Does the feature really matter?
In this section, we analyze the performance of CONV activations from different layers to emphasize the role of feature representation. One intuition about deep convolutional models is that the deeper the layers are, the more expressive the representation becomes [49, 50]. To this end, we evaluate 2DSA with different layers of CONV activations on the Office–Caltech10 dataset, following the standard experimental setting. Numerical results are listed in Table 13. Generally, deeper representations yield better accuracy. For instance, we observe a remarkable accuracy improvement from 28.5% to 75.2% on the A→C task. We have to admit that good features really matter.
Here is our point: DA methods really count, but domain-invariant features also play a vital role. What are the factors that cause domain shift? As mentioned at the beginning of the main text, they are the intrinsic and extrinsic variations. Hence, it may be a good idea to devote ourselves to developing powerful features that achieve invariance to pose, scale, rotation, illumination and background, just like the efforts that endow convolutional models with the ability to identify spatial transformations [51].
[Figure: four panels (Amazon, Caltech, DSLR, web-cam) plotting reconstruction error (×10^5) against the energy parameter Q (%), with one curve for SA and one for 2DSA per panel.]
Figure 8. Reconstruction error of different approaches with changing energy parameter Q (%) in four domains.
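The reconstruction-error comparison of Fig. 8 can be sketched with the standard PCA and 2DPCA formulations. The snippet below is an illustrative numpy version operating on random data in place of CONV activations (it does not reproduce the paper's exact feature pipeline); Q is the percentage of retained eigenvalue energy:

```python
import numpy as np

def pca_recon_error(X, Q):
    """Flattened-vector PCA (as in SA): keep enough components to retain
    Q% of the spectral energy, then measure the squared reconstruction error."""
    Xc = X - X.mean(0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    energy = np.cumsum(S**2) / np.sum(S**2)
    k = int(np.searchsorted(energy, Q / 100.0) + 1)  # smallest k reaching Q%
    Xr = (Xc @ Vt[:k].T) @ Vt[:k]
    return float(np.sum((Xc - Xr) ** 2))

def pca2d_recon_error(T, Q):
    """2DPCA on matrix-form samples T of shape (n, h, w) (as in 2DSA):
    eigen-decompose only the w-by-w image covariance and project each matrix."""
    M = T - T.mean(0)
    G = np.einsum('nhw,nhv->wv', M, M)               # w x w covariance
    vals, vecs = np.linalg.eigh(G)
    vals, vecs = vals[::-1], vecs[:, ::-1]           # descending order
    energy = np.cumsum(vals) / np.sum(vals)
    k = int(np.searchsorted(energy, Q / 100.0) + 1)
    V = vecs[:, :k]
    Mr = np.einsum('nhk,wk->nhw', np.einsum('nhw,wk->nhk', M, V), V)
    return float(np.sum((M - Mr) ** 2))
```

In both cases the error decreases monotonically as Q grows, which is the behaviour plotted in Fig. 8; the 2DPCA variant achieves this with a much smaller eigen-problem.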
[Figure: four panels (Amazon→Caltech, Amazon→DSLR, Caltech→DSLR, Caltech→web-cam) plotting Hw∆Hw against γ ∈ [1, 2] for NA, SA, and 2DSA.]
Figure 9. γ-curves regarding local within-class divergence Hw∆Hw over four DA tasks. In this case, γ = 2 − γw.
[Figure: four panels (Amazon→Caltech, Amazon→DSLR, Caltech→DSLR, Caltech→web-cam) plotting Hb∆Hb against γ ∈ [1, 2] for NA, SA, and 2DSA.]
Figure 10. γ-curves regarding local between-class divergence Hb∆Hb over four DA tasks. In this case, γ = γb.
Table 13. Recognition accuracy (%) of 2DSA with different layers of CONV activations on the Office–Caltech10 dataset over 20 trials. The arrow indicates the change compared with the previous row, and the standard deviation is shown in parentheses.

Feature  A→C         C→A         A→D          D→A         A→W          W→A          C→D         D→C         C→W         W→C         D→W          W→D          mean
mCONV1   28.5(2.6)   40.5(4.8)   22.7(5.7)    17.1(2.4)   17.5(4.6)    15.6(4.8)    27.6(4.6)   17.9(2.4)   21.2(5.2)   16.3(1.6)   43.8(5.0)    40.4(7.7)    25.8
mCONV2   46.5(2.2)↑  56.8(8.8)↑  39.9(6.2)↑   27.9(5.2)↑  31.9(5.0)↑   29.0(4.1)↑   43.3(6.9)↑  22.1(3.5)↑  33.8(3.8)↑  22.1(2.4)↑  70.1(10.1)↑  66.3(7.7)↑   40.8↑
mCONV3   44.0(6.3)↓  61.7(5.7)↑  38.2(6.0)↓   22.6(5.0)↓  31.0(7.3)↓   30.9(5.9)↑   47.1(5.3)↑  20.8(3.6)↓  37.5(4.8)↑  22.7(2.9)↑  71.1(5.5)↑   66.4(10.0)↑  41.2↑
mCONV4   64.7(2.7)↑  76.8(2.3)↑  60.7(12.0)↑  56.6(4.7)↑  55.8(10.3)↑  44.4(13.0)↑  65.1(6.3)↑  44.7(3.9)↑  57.3(5.1)↑  39.3(3.7)↑  90.7(3.2)↑   90.6(2.9)↑   62.2↑
mCONV5   75.2(3.0)↑  85.9(1.8)↑  75.4(5.2)↑   73.4(5.0)↑  66.8(3.9)↑   63.8(5.6)↑   76.6(5.0)↑  62.5(4.7)↑  69.5(2.9)↑  55.1(5.1)↑  91.8(3.0)↑   93.0(2.5)↑   74.1↑
Table 14. Average evaluation time (s) of each trial with varying feature dimensionality. (OS: Windows 7 64-bit, CPU: Intel i3-2120 3.30GHz, RAM: 16 GB)
Dimensionality 1152 1568 3872 7200 9248 16928
SA 1.03 2.54 34.57 210.40 444.83 2655.25
2DSA 0.32 0.37 0.45 0.63 0.81 2.14
6.8. Efficiency comparison between SA and 2DSA
As aforementioned, compared with SA, 2DSA offers another important attraction: computational efficiency. Here, we verify this claim. Concretely, we measure the single-core CPU runtime as the feature dimensionality varies, and report the average evaluation time of each trial. According to the numerical
[Figure: nine panels (aeroplane, bicycle, bus, car, cow, dog, motorbike, person, train) plotting AP against Nclass on a log scale (10^1–10^2) for NA, SA, and 2DSA.]
Figure 11. The performance of different methods with the varying number of training data sampled from each category on the ImageNet–VOC2007 dataset.
results in Table 14, 2DSA is significantly faster than SA when dealing with high-dimensional data, implying that 2DSA is particularly attractive in practice due to its high efficiency.
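The efficiency gap has a simple structural explanation: SA solves an eigen-problem whose size grows with the flattened feature dimension d, whereas a 2DSA-style method only needs one in the much smaller matrix width w. The numpy sketch below shows both alignment cores; it assumes 2DSA aligns 2DPCA bases analogously to SA (the exact formulation is given earlier in the paper), and the function names are illustrative:

```python
import numpy as np

def sa_align(Xs, Xt, k):
    """SA (Fernando et al.): k-dim PCA bases per domain on flattened
    d-dim features, source basis aligned via M = Bs^T Bt. The
    decomposition cost grows quickly with d (cf. Table 14)."""
    Bs = np.linalg.svd(Xs - Xs.mean(0), full_matrices=False)[2][:k].T  # d x k
    Bt = np.linalg.svd(Xt - Xt.mean(0), full_matrices=False)[2][:k].T
    M = Bs.T @ Bt                                   # k x k alignment matrix
    return (Xs - Xs.mean(0)) @ (Bs @ M), (Xt - Xt.mean(0)) @ Bt

def pca2d_basis(T, k):
    """Right 2DPCA basis from the w x w image covariance of tensors (n, h, w)."""
    M = T - T.mean(0)
    G = np.einsum('nhw,nhv->wv', M, M) / len(M)     # only w x w, w << h*w
    vals, vecs = np.linalg.eigh(G)
    return vecs[:, ::-1][:, :k]                     # w x k, descending order

def sa2d_align(Ts, Tt, k):
    """2DSA-style alignment: same M = Vs^T Vt idea, but the eigen-problem
    is only w x w, which is the source of the speed-up in Table 14."""
    Vs, Vt = pca2d_basis(Ts, k), pca2d_basis(Tt, k)
    M = Vs.T @ Vt
    return (Ts - Ts.mean(0)) @ (Vs @ M), (Tt - Tt.mean(0)) @ Vt
```

With d = 16928 (the last column of Table 14), SA must factor a matrix of that dimension, while the 2D variant works with a width of a few dozen, matching the three orders of magnitude observed in runtime.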
6.9. Problems within H∆H-divergence
Finally, we wish to emphasize an important problem within the H∆H-divergence to inspire further studies. Fig. 12 illustrates two typical relative positions between two domains: separation and tangency. If we estimate the H∆H-divergence for these two situations according to the steps mentioned in Sec. 4.1, the resulting H∆H values will be indistinguishable: since the domains are linearly separable in both situations, their H∆H values will be close to 2. However, our analysis shows that, when two domains are close enough, they have a high probability of being classified correctly. Hence, it is necessary for a domain divergence measure to differentiate these two situations.
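This saturation can be reproduced with the proxy A-distance commonly used in the literature to estimate the H∆H-divergence: train a domain classifier and map its held-out error ε to 2(1 − 2ε). The snippet below is a minimal numpy sketch on synthetic 2D domains, not the exact estimation pipeline of Sec. 4.1:

```python
import numpy as np

def proxy_a_distance(Xs, Xt, seed=0):
    """Proxy A-distance: label source -1 and target +1, fit a simple
    linear domain classifier (least-squares hyperplane) on half the data,
    and convert its held-out error eps to 2 * (1 - 2 * eps)."""
    rng = np.random.default_rng(seed)
    X = np.vstack([Xs, Xt])
    y = np.concatenate([-np.ones(len(Xs)), np.ones(len(Xt))])
    idx = rng.permutation(len(X))
    X, y = X[idx], y[idx]
    half = len(X) // 2
    A = np.hstack([X[:half], np.ones((half, 1))])        # bias column
    w = np.linalg.lstsq(A, y[:half], rcond=None)[0]      # train split
    At = np.hstack([X[half:], np.ones((len(X) - half, 1))])
    eps = np.mean(np.sign(At @ w) != y[half:])           # held-out error
    return 2.0 * (1.0 - 2.0 * min(eps, 1.0 - eps))

rng = np.random.default_rng(1)
src = rng.normal(0.0, 0.5, (300, 2))
tangent = rng.normal(2.0, 0.5, (300, 2))     # domains nearly touching
separate = rng.normal(20.0, 0.5, (300, 2))   # domains far apart
d_tan = proxy_a_distance(src, tangent)
d_sep = proxy_a_distance(src, separate)
```

Both values come out near the maximum of 2: the proxy cannot tell a tangent pair of domains from a widely separated one, which is exactly the deficiency Fig. 12 highlights.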
7. Conclusion
In this paper, we showed that it is better to align convolutional activations in their native two-dimensional form. In particular, we proposed a 2DSA approach to adapt convolutional activations. We offered insight into why 2DSA works better and further introduced two novel domain divergence measures, termed Hw∆Hw and Hb∆Hb, that take labels into account. Extensive experiments demonstrated that 2DSA significantly outperforms SA in both effectiveness and efficiency, with classification performance superior, or at least comparable, to existing benchmark approaches. In addition, an interesting DA application in agriculture was demonstrated as well.
Figure 12. Two typical relative positions between two domains. Left: the source domain is separate from the target; right: the source is tangent to the target. Since the domains in both situations are linearly separable, the H∆H-divergence cannot distinguish them.
Notice that the proposed 2DSA does have limitations. Since 2DSA is only a linear adaptation method, when the distributions of the two domains are significantly distinct, a linear alignment is typically insufficient and 2DSA as proposed may not work. Moreover, in real-world applications, one may encounter a new test set whose transformed subspace is not aligned with the subspace of the target domain. Under such a circumstance, 2DSA may also fail. One possible solution is to realign the new subspace.
For future work, it could be interesting to assign pseudo-labels to the target data and iteratively optimize both the within- and between-class measures so that they can serve as a guiding criterion for choosing a good adaptation in an unsupervised DA context. Moreover, it is worth noting that the introduced measures are independent of any specific distance metric; it would also be interesting to explore whether one can learn a metric that achieves both low within-class and low between-class divergences simultaneously. In addition, we plan to formulate the three-dimensional subspace alignment problem for unsupervised DA, as adapting 3D tensors may be a stronger way to model convolutional activations and may lead to interesting applications, e.g., adapting CNNs not only to new domains but also to new tasks.
Acknowledgment
The authors would like to thank the anonymous reviewers
for their insightful comments. This work is jointly supported by
the National High-tech R&D Program of China (863 Program)
(Grant No. 2015AA015904) and the National Natural Science
Foundation of China (Grant No. 61502187).
References
[1] F. Perronnin, J. Sánchez, T. Mensink, Improving the Fisher kernel for large-scale image classification, in: Proc. European Conference on Computer Vision (ECCV), 2010, pp. 143–156. doi:10.1007/978-3-642-15561-1_11.
[2] A. Torralba, A. A. Efros, Unbiased look at dataset bias, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1521–1528. doi:10.1109/CVPR.2011.5995347.
[3] P. Dollar, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: An evaluation of the state of the art, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2012) 743–761. doi:10.1109/TPAMI.2011.155.
[4] V. M. Patel, R. Gopalan, R. Li, R. Chellappa, Visual domain adaptation: A survey of recent advances, IEEE Signal Processing Magazine 32 (2015) 53–69. doi:10.1109/MSP.2014.2347059.
[5] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
[6] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 580–587. doi:10.1109/CVPR.2014.81.
[7] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks?, in: Advances in Neural Information Processing Systems (NIPS), 2014, pp. 3320–3328.
[8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, DeCAF: A deep convolutional activation feature for generic visual recognition, in: Proc. International Conference on Machine Learning (ICML), 2014, pp. 647–655.
[9] Y. Ganin, V. Lempitsky, Unsupervised domain adaptation by backpropagation, in: Proc. International Conference on Machine Learning (ICML), 2015, pp. 1180–1189. URL: http://jmlr.org/proceedings/papers/v37/ganin15.pdf.
[10] N. Zhang, J. Donahue, R. Girshick, T. Darrell, Part-based R-CNNs for fine-grained category detection, in: Proc. European Conference on Computer Vision (ECCV), 2014, pp. 834–849. doi:10.1007/978-3-319-10590-1_54.
[11] K. Saenko, B. Kulis, M. Fritz, T. Darrell, Adapting visual category models to new domains, in: Proc. European Conference on Computer Vision (ECCV), 2010, pp. 213–226. doi:10.1007/978-3-642-15561-1_16.
[12] R. Gopalan, R. Li, R. Chellappa, Domain adaptation for object recognition: An unsupervised approach, in: Proc. IEEE International Conference on Computer Vision (ICCV), 2011, pp. 999–1006. doi:10.1109/ICCV.2011.6126344.
[13] B. Gong, Y. Shi, F. Sha, K. Grauman, Geodesic flow kernel for unsupervised domain adaptation, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2066–2073. doi:10.1109/CVPR.2012.6247911.
[14] B. Fernando, A. Habrard, M. Sebban, T. Tuytelaars, Unsupervised visual domain adaptation using subspace alignment, in: Proc. IEEE International Conference on Computer Vision (ICCV), 2013, pp. 2960–2967. doi:10.1109/ICCV.2013.368.
[15] W. Li, L. Duan, D. Xu, I. W. Tsang, Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2014) 1134–1148. doi:10.1109/TPAMI.2013.167.
[16] H. Pirsiavash, D. Ramanan, C. C. Fowlkes, Bilinear classifiers for visual recognition, in: Advances in Neural Information Processing Systems (NIPS), 2009, pp. 1482–1490.
[17] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (2015) 1904–1916. doi:10.1109/TPAMI.2015.2389824.
[18] R. Girshick, Fast R-CNN, in: Proc. IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448. doi:10.1109/ICCV.2015.169.
[19] J. Yang, D. Zhang, A. Frangi, J.-Y. Yang, Two-dimensional PCA: A new approach to appearance-based face representation and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2004) 131–137. doi:10.1109/TPAMI.2004.1261097.
[20] S. Ben-David, J. Blitzer, K. Crammer, F. Pereira, et al., Analysis of representations for domain adaptation, in: Advances in Neural Information Processing Systems (NIPS), volume 19, 2007, p. 137.
[21] S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22 (2010) 1345–1359. doi:10.1109/TKDE.2009.191.
[22] H. Shimodaira, Improving predictive inference under covariate shift by weighting the log-likelihood function, Journal of Statistical Planning and Inference 90 (2000) 227–244. doi:10.1016/S0378-3758(00)00115-4.
[23] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, J. W. Vaughan, A theory of learning from different domains, Machine Learning 79 (2010) 151–175. doi:10.1007/s10994-009-5152-4.
[24] J. Tao, F.-L. Chung, S. Wang, On minimum distribution discrepancy support vector machine for domain adaptation, Pattern Recognition 45 (2012) 3962–3984. doi:10.1016/j.patcog.2012.04.014.
[25] A. S. Mozafari, M. Jamzad, A SVM-based model-transferring method for heterogeneous domain adaptation, Pattern Recognition 56 (2016) 142–158. doi:10.1016/j.patcog.2016.03.009.
[26] J. Blitzer, R. McDonald, F. Pereira, Domain adaptation with structural correspondence learning, in: Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2006, pp. 120–128.
[27] H. Daumé III, Frustratingly easy domain adaptation, in: Proc. Association for Computational Linguistics (ACL), 2007.
[28] Q.-F. Wang, F. Yin, C.-L. Liu, Unsupervised language model adaptation for handwritten Chinese text recognition, Pattern Recognition 47 (2014) 1202–1216. doi:10.1016/j.patcog.2013.09.015.
[29] A. Bergamo, L. Torresani, Exploiting weakly-labeled web images to improve object classification: A domain adaptation approach, in: Advances in Neural Information Processing Systems (NIPS), 2010, pp. 181–189.
[30] B. Kulis, K. Saenko, T. Darrell, What you saw is not what you get: Domain adaptation using asymmetric kernel transforms, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1785–1792. doi:10.1109/CVPR.2011.5995702.
[31] X. Li, M. Fang, J.-J. Zhang, J. Wu, Learning coupled classifiers with RGB images for RGB-D object recognition, Pattern Recognition 61 (2017) 433–446. doi:10.1016/j.patcog.2016.08.016.
[32] E. Kodirov, T. Xiang, Z. Fu, S. Gong, Unsupervised domain adaptation for zero-shot learning, in: Proc. IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2452–2460. doi:10.1109/ICCV.2015.282.
[33] J. Hoffman, E. Rodner, J. Donahue, T. Darrell, K. Saenko, Efficient learning of domain-invariant image representations, CoRR abs/1301.3224 (2013).
[34] S. J. Pan, I. W. Tsang, J. T. Kwok, Q. Yang, Domain adaptation via transfer component analysis, IEEE Transactions on Neural Networks 22 (2011) 199–210. doi:10.1109/TNN.2010.2091281.
[35] M. Long, J. Wang, G. Ding, J. Sun, P. S. Yu, Transfer joint matching for unsupervised domain adaptation, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1410–1417. doi:10.1109/CVPR.2014.183.
[36] R. Aljundi, R. Emonet, D. Muselet, M. Sebban, Landmarks-based kernelized subspace alignment for unsupervised domain adaptation, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 56–63. doi:10.1109/CVPR.2015.7298600.
[37] L. Duan, D. Xu, I. W.-H. Tsang, Domain adaptation from multiple sources: A domain-dependent regularization approach, IEEE Transactions on Neural Networks and Learning Systems 23 (2012) 504–518. doi:10.1109/TNNLS.2011.2178556.
[38] W.-S. Chu, F. De La Torre, J. F. Cohn, Selective transfer machine for personalized facial action unit detection, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013. doi:10.1109/CVPR.2013.451.
[39] M. Long, Y. Cao, J. Wang, M. I. Jordan, Learning transferable features with deep adaptation networks, in: Proc. International Conference on Machine Learning (ICML), 2015.
[40] J. W. Osborne, A. B. Costello, Sample size and subject to item ratio in principal components analysis, Practical Assessment, Research & Evaluation 9 (2004) 8.
[41] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605.
[42] H. Lu, Z. Cao, Y. Xiao, Z. Fang, Y. Zhu, Toward good practices for fine-grained maize cultivar identification with filter-specific convolutional activations, IEEE Transactions on Automation Science and Engineering (2016). doi:10.1109/TASE.2016.2616485.
[43] H. Lu, Z. Cao, Y. Xiao, Z. Fang, Y. Zhu, Towards fine-grained maize tassel flowering status recognition: Dataset, theory and practice, Applied Soft Computing 56 (2017) 34–45. doi:10.1016/j.asoc.2017.02.026.
[44] H. Lu, Z. Cao, Y. Xiao, Z. Fang, Y. Zhu, K. Xian, Fine-grained maize tassel trait characterization with multi-view representations, Computers and Electronics in Agriculture 118 (2015) 143–158. doi:10.1016/j.compag.2015.08.027.
[45] A. Vedaldi, K. Lenc, MatConvNet: Convolutional neural networks for MATLAB, in: Proc. ACM International Conference on Multimedia, 2015, pp. 689–692.
[46] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, LIBLINEAR: A library for large linear classification, Journal of Machine Learning Research 9 (2008) 1871–1874.
[47] B. Sun, J. Feng, K. Saenko, Return of frustratingly easy domain adaptation, in: Proc. AAAI Conference on Artificial Intelligence, 2016.
[48] Y. Wen, K. Zhang, Z. Li, Y. Qiao, A discriminative feature learning approach for deep face recognition, in: Proc. European Conference on Computer Vision (ECCV), Springer, 2016, pp. 499–515.
[49] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1556 (2014). URL: http://arxiv.org/abs/1409.1556.
[50] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[51] M. Jaderberg, K. Simonyan, A. Zisserman, K. Kavukcuoglu, Spatial transformer networks, in: Advances in Neural Information Processing Systems (NIPS), 2015, pp. 2008–2016.