Two-dimensional subspace alignment for convolutional activations adaptation
Hao Lu, Zhiguo Cao, Yang Xiao, Yanjun Zhu
National Key Laboratory of Science and Technology on Multi-Spectral Information Processing,
School of Automation, Huazhong University of Science and Technology, Wuhan 430074, PR China
Abstract
In real-world computer vision applications, many intrinsic and extrinsic variations can cause a significant domain shift. Although
deep convolutional models have provided us with better domain-invariant features, existing mechanisms to adapt convolutional
activations are still limited. Noticing that convolutional activations are intrinsically represented as tensors, in this paper we develop a two-dimensional subspace alignment (2DSA) approach based on 2D principal component analysis (PCA) to better adapt convolutional activations. Extensive experiments demonstrate the advantages of 2DSA over its counterpart SA in both effectiveness and efficiency. In particular, when trying to explain why 2DSA works well, we find that the best classification performance has low correlation with the global domain discrepancy measure. In an effort to find a better way to compare domains, we introduce within- and between-class domain divergence measures to characterize the class-level differences. The proposed measures somewhat shed
light on what a good alignment might be for classification. Furthermore, we also demonstrate a novel domain adaptation application
in agriculture and create a dataset for the problem.
Keywords: Visual domain adaptation, Subspace alignment, Convolutional activations, Two-dimensional PCA, Domain divergence
measure
1. Introduction
In real-world computer vision applications, many intrinsic
and extrinsic variations, such as color, pose, illumination, back-
ground, viewpoint, blurring, or image resolution, can cause a significant domain shift, so that a model built on the source domain may not perform well on data with different distributions from the target domain. Indeed, a stream of studies has reported that algorithm performance degrades significantly across datasets [1–3]. This is a typical problem called domain mismatch. A rich body of studies has attempted to alleviate this challenge under the names of covariate shift, class imbalance, dataset bias, transfer learning, multi-view analysis, and more recently, domain adaptation (DA) [4].
Deep convolutional neural networks (CNNs) have brought us
the state-of-the-art visual descriptor, which has benchmarked a series of computer vision tasks, such as image classification [5] and object detection [6]. Indeed, features can be more transferable when learned in deep networks [7, 8]. Strong evidence includes the impressive results in DA achieved with a deep CNN based approach [9], implying an important role of domain-invariant feature representation. That is, the feature matters.
The most common and effective way to adapt CNN features is to fine-tune an end-to-end CNN model so that the parameters can be adjusted to better fit the target dataset [6, 10]. Fine-tuning is good as long as we have free access to the supervision and
a sufficient number of training data. Yet, we intend to seek in
this paper whether there exists another mechanism to correct
this kind of shift. In particular, we consider the scenarios where
no supervision is provided in the target domain or the labeled
target data alone is deficient to build a good classifier, which is
exactly the case of DA.
DA is a frequently studied issue in statistics, machine learning, pattern recognition, natural language processing, and recently, in computer vision. Over the years, many theoretical methods have been developed to address this problem with a moderate degree of success [9, 11–15]. However, to our knowledge, most methods only formulate the problem in the vector-form paradigm. That is, the input features must be vectors. Noticing that convolutional activations are intrinsically represented as tensors, it may be more natural to model them as matrices or tensors, rather than vectors [16]. Also, it has been shown that, when using convolutional activations, we can prevent object deformation by feeding an image of arbitrary size [17] and reuse the features by building a mapping between the raw image and the feature map [18].
Recently, a subspace alignment (SA) based unsupervised DA approach [14] stands out due to its effectiveness and simplicity. Our work is built within this framework. Specifically, we propose to perform two-dimensional subspace alignment (2DSA). A 2DPCA [19] based approach is consequently developed to adapt convolutional activations effectively and efficiently. Compared with its counterpart SA, 2DSA requires less training data, and its parameter learning is more accurate and efficient. Experiments on several datasets validate the effectiveness of 2DSA and show that 2DSA significantly outperforms SA by large margins.
Accepted by Pattern Recognition
http://dx.doi.org/10.1016/j.patcog.2017.06.010
June 16, 2017
Figure 1. Three typical situations in subspace alignment based domain adaptation. Black denotes the source domain, and red the target. A marker denotes a specific class. The "alignment" indicates a transformation that moves the source subspace to the target one. The left is an ideal situation, the middle the situation occurring in the SA paradigm, and the right the 2DSA one. SA aligns the two domains well but mixes instances coming from different classes (target data cannot be classified correctly), whilst 2DSA only aligns the two domains moderately but preserves good margins between different classes (target data can still be separated linearly). This finding motivates us to ponder a fundamental question: to what extent is an alignment enough for classification?
In some cases, SA even worsens the classification performance.
We are interested in explaining why 2DSA works better. Our
analysis from the reconstruction error perspective shows that
2DPCA generates a better subspace than PCA (the reconstruc-
tion error of 2DPCA is lower than PCA). Statistically, when ex-
ploiting a global
H
H
-divergence [
20
] to measure the domain-
level discrepancy, we surprisingly find that results are beyond ex-
pectation. The best classification performance conversely yields
the worst
H
H
value. After visualizing the data distribution,
we observe two interesting patterns shown in Fig. 1. One is
that SA aligns two domains well but mixes instances coming
from dierent classes. The other is that 2DSA only aligns two
domains moderately but preserves good margins between dif-
ferent classes. This motivates us to ponder a fundamental issue:
to what extend is an alignment enough for classification? We
answer this question by giving a new perspective at local class
distributions. We believe that, a good alignment in classification
indeed needs to push two distributions of the same class close,
but more importantly, it should enlarge or at least preserve the
margins between dierent classes. To formalize this idea, two
novel domain discrepancy measures called within-class diver-
gence
Hw
Hw
and between-class divergence
Hb
Hb
are con-
sequently proposed. Dierent from the
H
H
-divergence that
only characterizes the domain-level discrepancy, the proposed
Hw
Hw
and
Hb
Hb
divergences are able to characterize the
class-level dierences and thus can be viewed as a class-level
extension of
H
H
-divergence. By measuring the domain dis-
crepancy from a fine-grained perspective, our results somewhat
shed light on what a good alignment might be for classification.
In addition, we further describe an interesting DA application
in agriculture. The application involves categorizing three types of maize tassel flowering status (MTFS): non-flowering, partially-flowering, and fully-flowering. A dataset termed MTFS3–DA is also constructed. The dataset includes 10 domains and 1500 images covering a 5-year timespan, 4 maize
cultivars and 3 geographical locations. Extensive experiments
on this dataset also show that 2DSA outperforms SA. We hope
this dataset could inspire interests from the pattern recognition
community to address cross-field challenges in agriculture.
Overall, the contributions of this paper include:
• 2DSA: a two-dimensional subspace alignment approach is developed for better convolutional activations adaptation. It is very effective, computationally efficient, and easy to implement;
• HwΔHw & HbΔHb: two novel divergence measures capable of quantifying within- and between-class variations are proposed to characterize the class-level domain discrepancy. They encourage new perspectives on cross-dataset generalization for classification;
• MTFS3–DA: a new dataset concerning three types of flowering status of maize tassels is created for cross-field evaluations in agriculture. It consists of 10 domains and 1500 images.
The dataset and source code are made available online. 1
2. Related work
DA is set in one of the possible settings of transfer learning [21]. Over the years, DA has been extensively studied in both theory and practice, such as probabilistic inference in statistics [22], generalization bounds in machine learning [20, 23], distribution analysis in pattern recognition [24, 25], as well as various applications in natural language processing [26–28] and computer vision [11, 12, 29–31]. Recent works in the computer vision field mainly focus on the visual recognition problem in either the unsupervised (only unlabeled data are used from the target domain) [9, 32] or the semi-supervised (a limited amount of labeled data are used from the target domain) [15, 30, 33] setting. Readers can refer to [4] for a comprehensive survey. In this paper,
1 The dataset and source code are made available at: https://sites.google.com/site/poppinace/.
Figure 2. The framework of subspace alignment based visual domain adaptation.
we concentrate on the most challenging case—unsupervised visual DA (some literature also refers to it as transductive transfer learning).
According to whether source labels are utilized in the optimization process of DA, we simply divide existing unsupervised DA approaches into two categories: domain-orientated and domain-classification-orientated. The first category only aims at the adaptation between two domains. This line of approaches usually seeks a way to build explicit connections or find implicit commonalities between two domains. Some representative works include TCA [34], SGF [12], GFK [13], SA [14], TJM [35] and LSSA [36]. Also, since current DA approaches are usually evaluated in the context of classification, the second category prefers to model the adaptation and classification jointly. This line of works often involves iterative optimization between the adaptation and classification taking source labels into account, expecting to achieve good classification performance and a fine overlap between domains simultaneously. Some works worth mentioning include (A)SYMM [11], ARCT [30], DAM [37], STM [38], MMDT [33], HFA [15], and recent deep learning based approaches (DDA [9] and DAN [39]). Our proposed method, 2DSA, belongs to the first category.
Our proposed method, 2DSA, belongs to the first category.
Our work is of particular relevance to subspace-based DA approaches. These works share the idea of exploiting low-dimensional data structures that are intrinsic to domains. In particular, [12] proposes sampling a finite number of intermediate subspaces and building geodesic flows to connect the source and target domains. Gong et al. [13] extend the above work by constructing a geodesic flow kernel that projects image representations into infinite-dimensional feature vectors, expecting to encapsulate incremental changes between subspaces that underlie the difference and commonality between domains. Different from these two ideas, Fernando et al. [14] argue that it is more appropriate to align the two domains directly. The basic idea is to learn a transformation matrix by minimizing the Bregman matrix divergence. Intuitively, the transformation matrix defines a movement that potentially pushes the source subspace close to the target one. More recently, [36] further extends [14] in a landmarks-based kernelized paradigm via selecting potential landmarks and incorporating further non-linearity with a Gaussian kernel.
Our work is closely related to [14], because we are built in the same subspace alignment based framework. The main difference, however, is that SA [14] performs stronger feature-wise alignment, while our method, 2DSA, only performs partial alignment because the subspace analysis is carried out on a smaller space. Our analysis in Sec. 4 shows that it is adequate to move two subspaces only close to each other to achieve superior classification results. Moreover, 2DSA is very fast when tackling high-dimensional data, such as convolutional activations, which facilitates parameter tuning during cross validation.
3. Subspace alignment based visual domain adaptation
We start by reviewing the subspace alignment based DA framework [14] to give readers a global view. We then discuss the seminal vector-form formulation of SA in Sec. 3.1. Next, in Sec. 3.2, we present our matrix-form extension 2DSA in detail. In particular, we follow the conventional nomenclature of denoting vectors by lowercase boldface letters, like x, matrices by uppercase boldface letters, like X, and tensors by calligraphic letters, like 𝒳. We allow the input image to be of arbitrary size,
so a simple spatial pooling is applied as a normalization step,
ensuring the consistency of dimensionality. Concretely, any convolutional activations of size H×W×D will be normalized to K×K×D by max pooling. Note that, to preserve spatial information, pooled activations are not vectorized in the fashion of spatial pyramid pooling (SPP) in [17]. Intuitively, this process is illustrated in Fig. 3.
Figure 3. Illustration of spatial pooling normalization. Any activations within a spatial bin are pooled by the max operation.
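To make the normalization step concrete, the following minimal sketch (our own illustration, not the authors' released code) max-pools an arbitrary-sized activation tensor into a fixed K×K×D grid; the function name and the evenly spaced bin edges are our assumptions.

```python
# A minimal sketch of the spatial pooling normalization: an H x W x D
# activation tensor is max-pooled into a fixed K x K x D grid regardless of
# the input spatial size. H and W are assumed to be >= K.
import numpy as np

def spatial_pool(activation, K=6):
    H, W, D = activation.shape
    h_edges = np.linspace(0, H, K + 1, dtype=int)   # bin boundaries along height
    w_edges = np.linspace(0, W, K + 1, dtype=int)   # bin boundaries along width
    pooled = np.zeros((K, K, D), dtype=activation.dtype)
    for i in range(K):
        for j in range(K):
            cell = activation[h_edges[i]:h_edges[i + 1],
                              w_edges[j]:w_edges[j + 1], :]
            pooled[i, j, :] = cell.max(axis=(0, 1))  # max over the spatial bin
    return pooled

# Example: a CONV5 map from an arbitrary-sized image becomes 6 x 6 x 512.
print(spatial_pool(np.random.rand(13, 17, 512), K=6).shape)  # (6, 6, 512)
```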
The framework of subspace alignment based visual DA is shown in Fig. 2. The scenario is that we use the training data from the source domain to generate the subspace spanned by Xs, and data from the target domain to generate Xt (Xs and Xt are generated by PCA in [14] and by 2DPCA in 2DSA, which will be explained later). Yet, the domain shift between Xs and Xt is quite large (Δst ≫ 0), so the subspace Xs is aligned by M to correct this shift. Conceptually, M defines a movement that pushes Xs close to Xt. The resulting aligned subspace is denoted by Xa (Xa = XsM). At this time, Xa looks similar to Xt (Δat ≈ 0). Finally, labeled instances from the source domain are projected by Xa and are used to train a linear SVM at the training stage. At the test stage, unlabeled instances from the target domain are projected by Xt and are predicted with the learned model. The more appropriate an alignment is, the better the classification results should be.
When learning the transformation matrix M, [14] chooses to minimize the following Bregman matrix divergence:
$$F(\mathbf{M}) = \left\| \mathbf{X}_s \mathbf{M} - \mathbf{X}_t \right\|_F^2\,, \qquad (1)$$
where ‖·‖_F denotes the Frobenius norm. Under this paradigm, a closed-form solution can be obtained as M = Xs^T Xt, and Xa = Xs Xs^T Xt.
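As an illustration, a minimal sketch of this closed-form alignment (our own code, assuming the PCA bases are stored column-wise; the helper name is an assumption) could look as follows.

```python
# A minimal sketch of the SA closed-form solution: M = Xs^T Xt and
# Xa = Xs Xs^T Xt, with Xs and Xt the top-d PCA bases (column-wise) of the
# source and target data.
import numpy as np
from sklearn.decomposition import PCA

def subspace_alignment(source, target, d):
    """source, target: (n_samples, n_features) arrays; d: subspace dimension."""
    Xs = PCA(n_components=d).fit(source).components_.T   # (n_features, d)
    Xt = PCA(n_components=d).fit(target).components_.T   # (n_features, d)
    M = Xs.T @ Xt            # closed-form alignment matrix
    Xa = Xs @ M              # aligned source subspace, Xa = Xs Xs^T Xt
    # Source data are projected with Xa and target data with Xt before a
    # linear SVM is trained on the projected labeled source instances.
    return source @ Xa, target @ Xt
```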
3.1. Problems in vector-form formulation
In the context of the vector-form formulation, each K×K×D tensor activation 𝒳 has to be vectorized into a long vector x of size K²D×1 (note that we have restricted our object to convolutional activations). However, the resulting vectorized representations are high-dimensional. When applying PCA to generate a subspace, we need to solve an SVD on an extremely large matrix of size K²D×K²D, but solving a high-dimensional SVD is quite slow. More importantly, it is not tractable in practice, because the DA problem is exactly the case in which we do not have enough training data from the target domain to get an exact solution of the SVD. For instance, assume that we only have N (N ≪ M, M = K²D) training instances, denote them as a_i ∈ R^M, i = 1, ..., N, and combine them in a matrix A ∈ R^{M×N}. The corresponding covariance matrix G_sa can be derived as
$$\mathbf{G}_{sa} = \frac{1}{N}\mathbf{A}\mathbf{A}^{T}. \qquad (2)$$
Algorithm 1 2DSA: Two-dimensional Subspace Alignment
Input: Source features Fs, Target features Ft, Source labels Ls, Subspace dimensionality d
Output: Target labels Lt
1: Xs ← 2DPCA(Fs, d)
2: Xt ← 2DPCA(Ft, d)
3: Xa ← Xs Xs^T Xt
4: Pa ← Fs Xa
5: Pt ← Ft Xt
6: Lt ← SVM(Pa, Pt, Ls)
However, note that rank(G_sa) = rank(A A^T) = rank(A) ≤ N, which means we will only get at most N nonzero eigenvalues when solving the SVD on G_sa. In other words, the exact solution is limited by the number of training data, and an appropriate subspace may not be generated (our empirical study in Sec. 6.4 justifies this point). In fact, according to the widely-cited rule of thumb in [40], we expect to have at least 10 times as many training samples as the feature dimensionality. Therefore, we argue that directly aligning vector-form convolutional activations may not be a good choice. Inspired by [16], this motivates us to reconsider modeling them in their intrinsic structure.
3.2. 2DSA: matrix formulation with 2DPCA
2DSA formulates the problem with matrix-form convolutional activations. Specifically, we resort to 2DPCA [19] to generate subspaces. First, each tensor activation of size K×K×D is reshaped into a D×K² matrix. Given a set of matrix-form descriptors A_i ∈ R^{D×K²}, i = 1, ..., N, the covariance matrix G_2dsa can be evaluated as
$$\mathbf{G}_{2dsa} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{A}_i^{T}\mathbf{A}_i\,, \qquad (3)$$
where G_2dsa ∈ R^{K²×K²}. In a physical sense, G_2dsa actually models the global dependency between different filter activations across all pair-wise spatial locations. By solving the SVD on G_2dsa, all feature maps share their eigenvectors instead of having eigenvectors in a cube of features (many 2D feature maps). It is efficient to derive Xs and Xt in the corresponding domains, because K² is usually a small value. Also, since K² is small, 2DSA does not require a substantial amount of training data. We can then reuse Eq. 1 to compute the transformation matrix and align the subspace in the same vein. Notice that the orthogonality constraint in both PCA and 2DPCA is important to preserve good class separations in their subspace representations. The pseudo-code of this approach is summarized in Algorithm 1, which is analogous to the algorithm presented in [14].
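For readers who prefer code to pseudo-code, the following minimal sketch (our own, not the authors' release) mirrors Algorithm 1, under the assumption that each pooled activation has already been reshaped to a D×K² matrix; all function and variable names are illustrative only.

```python
# A minimal sketch mirroring Algorithm 1. two_d_pca eigen-decomposes the
# small K^2 x K^2 covariance of Eq. (3).
import numpy as np
from sklearn.svm import LinearSVC

def two_d_pca(samples, d):
    """samples: (N, D, K^2) stack of matrix descriptors; returns a (K^2, d) basis."""
    G = np.einsum('nij,nik->jk', samples, samples) / samples.shape[0]  # (1/N) sum_i Ai^T Ai
    eigval, eigvec = np.linalg.eigh(G)        # ascending eigenvalues
    return eigvec[:, ::-1][:, :d]             # top-d eigenvectors

def two_d_subspace_alignment(Fs, Ft, Ls, d):
    """Fs, Ft: (N, D, K^2) source/target features; Ls: source labels."""
    Xs, Xt = two_d_pca(Fs, d), two_d_pca(Ft, d)
    Xa = Xs @ Xs.T @ Xt                       # aligned source subspace
    Pa = (Fs @ Xa).reshape(len(Fs), -1)       # project and flatten: (N, D*d)
    Pt = (Ft @ Xt).reshape(len(Ft), -1)
    clf = LinearSVC().fit(Pa, Ls)             # train on projected source data
    return clf.predict(Pt)                    # predicted target labels Lt
```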
In both formulations, we only need to tune one hyperparameter d that controls the dimensionality of the subspace. To address this, we choose to leverage the theoretical bound deduced by [14] to select the maximum dimensionality d_max and guide the selection process. According to the variant of the consistency theorem [14], given a confidence δ > 0 and a fixed deviation γ > 0, d_max can be selected if it satisfies
$$\left(\lambda^{\min}_{d_{\max}} - \lambda^{\min}_{d_{\max}+1}\right) \geq \left(1 + \sqrt{\frac{\ln(2/\delta)}{2}}\right)\left(\frac{16\, d^{3/2} B}{\gamma \sqrt{n_{\min}}}\right), \qquad (4)$$
where (λ^min_{d_max} − λ^min_{d_max+1}) = min[(λ^s_d − λ^s_{d+1}), (λ^t_d − λ^t_{d+1})], and λ^b_a is the a-th eigenvalue (in descending order) computed from domain b. B is selected so that for any vector x, ‖x‖ ≤ B. n_min = min(N_s, N_t), where N_s and N_t are the numbers of training data in the source and target domains, respectively. Once d_max is identified, for any d ≤ d_max, one can get a reliable solution of M in Eq. 1.
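A minimal sketch of how this selection rule might be applied is given below; the particular values of γ and δ are placeholders, and the scan-and-keep-largest strategy is our reading of the criterion, not the authors' code.

```python
# A minimal sketch of selecting d_max with the bound in Eq. (4): scan d and
# keep the largest d for which the smaller of the two consecutive eigenvalue
# gaps still exceeds the bound. gamma and delta values are placeholders.
import numpy as np

def select_dmax(eig_s, eig_t, n_min, B, gamma=1e3, delta=0.1):
    """eig_s, eig_t: eigenvalues (descending) of the source/target covariance."""
    d_max = 1
    for d in range(1, min(len(eig_s), len(eig_t))):
        gap = min(eig_s[d - 1] - eig_s[d], eig_t[d - 1] - eig_t[d])
        bound = (1 + np.sqrt(np.log(2 / delta) / 2)) * \
                (16 * d ** 1.5 * B) / (gamma * np.sqrt(n_min))
        if gap >= bound:
            d_max = d
    return d_max
```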
4. Domain discrepancy analysis
In this section, we draw upon domain divergence measures to
analyze the domain discrepancy.
4.1. Quantifying domain discrepancy based on the HΔH-divergence
According to our experiments, we find that 2DSA achieves
higher classification accuracy than SA. In this section, we at-
tempt to explain why it works. From the statistical perspective, a
common way is to use a distribution measure to estimate the do-
main discrepancy. The pioneering work of Ben-David et al. [20, 23]
established the theoretical risk bounds for DA. Since our analy-
sis highly depends on these developments, we begin with a brief
introduction to their theoretical results.
Following Theorem 2 in [23], given a hypothesis space H of VC-dimension d̃, and instance sets U_s, U_t, each of size m̃, sampled i.i.d. from distributions D_s and D_t respectively, then with probability at least 1 − δ, for every h ∈ H, the corresponding generalization error on the target set can be bounded as
$$\epsilon_t(h) \leq \epsilon_s(h) + \frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(U_s, U_t) + 4\sqrt{\frac{2\tilde{d}\log(2\tilde{m}) + \log(2/\delta)}{\tilde{m}}} + \tilde{\lambda}\,, \qquad (5)$$
where ε_s(h) is the source error, and λ̃ equals the combined error ε_t(h) + ε_s(h) of the ideal joint hypothesis, which can be supposed to be a negligible term in the case of DA. The bound shows that the source error and d̂_HΔH(U_s, U_t) (also called the HΔH-divergence) are the most relevant quantities in computing the target error. In particular, we are interested in quantifying d̂_HΔH(U_s, U_t), because we may understand why 2DSA works better if the performance correlates well with this measure. Next, we shall give a first look at its counterpart, the H-divergence d_H(D_s, D_t), which plays a vital role in the rest of our analysis.
d_H(D_s, D_t) is also known as the A-distance or total variation distance derived from the statistical distance family, which is used to measure the difference between two probability distributions. Formally, it is defined in [20] as
$$d_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) = 2\sup_{h\in\mathcal{H}}\left|P_s(h) - P_t(h)\right|, \qquad (6)$$
where P_s(h) and P_t(h) denote the probability of event h under distributions D_s and D_t, respectively. Intuitively, it describes the largest possible difference between the probabilities that two probability distributions can assign to the same event. With these notions, the symmetric difference hypothesis space HΔH can be further defined [23] as
$$\mathcal{H}\Delta\mathcal{H} = \left\{g(\mathbf{x}) \mid g(\mathbf{x}) = h(\mathbf{x}) \oplus h'(\mathbf{x})\right\}, \quad h, h' \in \mathcal{H}\,, \qquad (7)$$
where ⊕ denotes the XOR operation. In other words, g(x) will be positive in HΔH if and only if a pair of hypotheses h(x) and h′(x) disagree with each other. Thus, d_HΔH(D_s, D_t) means computing the A-distance over the symmetric difference hypothesis space. However, directly computing d_HΔH(D_s, D_t) is not tractable in practice, so an alternative is to compute its empirical version d̂_HΔH(U_s, U_t). In particular, estimating d̂_HΔH(U_s, U_t) requires learning a linear classifier ĥ to see whether source and target instances can be differentiated. More specifically, it involves the following steps:
Step 1. Pseudo-labeling the source and target instances with +1 and −1;
Step 2. Randomly sampling two sets of instances as the training and test set, respectively;
Step 3. Learning a linear classifier ĥ on the training set and verifying its performance on the test set;
Step 4. Estimating the distance as d̂_HΔH(U_s, U_t) = 2(1 − 2·err(ĥ)) [20], where err(ĥ) is the test error.
If two distributions perfectly overlap with each other, err(ĥ) → 0.5, and d̂_HΔH(U_s, U_t) → 0. Conversely, if two distributions have large enough margins, err(ĥ) → 0, and d̂_HΔH(U_s, U_t) → 2. Therefore, d̂_HΔH(U_s, U_t) ∈ [0, 2]. The lower the value is, the better the two distributions align. In other words, a low divergence value should imply high classification performance.
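The four steps above can be sketched as follows (our own illustration with scikit-learn; the split ratio and classifier settings are assumptions, not the authors' exact setup).

```python
# A minimal sketch of Steps 1-4: pseudo-label the two domains, train a linear
# classifier to separate them, and convert its test error into the empirical
# divergence 2 * (1 - 2 * err).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def proxy_domain_divergence(source, target, test_size=0.5, seed=0):
    """source, target: (n, n_features) arrays of (projected) activations."""
    X = np.vstack([source, target])
    y = np.hstack([np.ones(len(source)), -np.ones(len(target))])     # Step 1
    X_tr, X_te, y_tr, y_te = train_test_split(                        # Step 2
        X, y, test_size=test_size, random_state=seed, stratify=y)
    clf = LinearSVC().fit(X_tr, y_tr)                                 # Step 3
    err = 1.0 - clf.score(X_te, y_te)
    return 2.0 * (1.0 - 2.0 * err)                                    # Step 4
```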
Now we can empirically evaluate the domain discrepancy of SA and 2DSA. To our surprise, this measure does not correlate with the classification performance. Fig. 4 illustrates a typical case of adapting images from Amazon to Caltech (see Sec. 6.1 for details). The highest classification performance does not correspond to the lowest HΔH measure. According to the visualization of the data distributions, we observe that both approaches do push the same class of different domains close to each other (looking at the "+" class), and the classes aligned by SA generally overlap better than those aligned by 2DSA (looking at two of the other classes, for example). Why, then, does SA achieve inferior classification accuracy? This interesting phenomenon motivates us to ponder a fundamental issue: to what extent is an alignment enough for classification? We shall give our answer in the next subsection.
4.2. Measuring domain discrepancy with local class divergence
Note that both SA and 2DSA are global alignments in the sense that all the training data are used to generate the subspaces; the difference is that SA performs stronger feature-wise adaptation. However, if an alignment is too strong, it may even align data coming from different classes, resulting in the cases shown by the yellow circles in Fig. 4. That is, data from different classes are intermixed in the SA adaptation. In this case, the alignment makes no sense. In addition, let us revisit the
[Figure 4 panel titles: SA Adaptation, HΔH = 1.23, Recognition Accuracy = 58.14; 2DSA Adaptation, HΔH = 1.45, Recognition Accuracy = 78.93.]
Figure 4. Category-specific data visualization using t-SNE [41] over a typical DA task from Amazon (red) to Caltech (black) in the Office–Caltech10 dataset. HΔH and recognition accuracy are indicated in each sub-figure title. Each category is denoted by a certain type of marker (the Office–Caltech10 dataset has 10 categories).
Table 1. Cultivar information of each sequence in the MTFS3–DA dataset.
Sequence         Jundan No.20   Wuyue No.3   Nongda No.108   Zhengdan No.958
Zhengzhou 2010        X             —             —               —
Zhengzhou 2011        X             —             —               —
Zhengzhou 2012        —             —             —               X
Taian 2010–1          —             X             —               —
Taian 2010–2          —             X             —               —
Taian 2011–1          —             —             X               —
Taian 2011–2          —             —             X               —
Taian 2012–1          —             —             —               X
Taian 2012–2          —             —             —               X
Gucheng 2014          —             —             —               X
“+” class in Fig. 4. In both the SA and 2DSA scenarios, this class is aligned only moderately, but if we classify the data, it may turn out to be the most easily separated class. Therefore, our point is that in classification we do not actually need to enforce two domains to overlap exactly with each other; favorable performance can be achieved as long as classes keep enough margins from other classes. Hence, as per these observations, we deem that, in the context of classification, it is adequate for the same class to be aligned only moderately close across domains and for different classes to keep large enough margins.
To formalize our idea, two novel domain discrepancy measures, called the within-class divergence HwΔHw and the between-class divergence HbΔHb, are proposed to characterize the class-level differences. A natural way to characterize these differences is to compute a distance over specific distributions. Let us denote the within-class and between-class distances as d_w(P^i_s, P^i_t) and d_b(P^i_{s,t}, P^j_{s,t}), respectively, where the superscript denotes the class and the subscript the domain. Thus, it is clear that d_w(P^i_s, P^i_t) is computed from a certain class between the two domains, and d_b(P^i_{s,t}, P^j_{s,t}) is computed by considering both domains as a whole between different classes. Moreover, we further impose two kinds of constraints on the distances:
$$d_w(P^i_s, P^i_t) < \gamma_w\,, \qquad (8)$$
$$d_b(P^i_{s,t}, P^j_{s,t}) > \gamma_b\,, \qquad (9)$$
where γ_w and γ_b ensure a relatively small within-class distance and a large enough between-class distance, respectively. With these two inequalities, HwΔHw and HbΔHb can be expressed by incorporating Eq. 8 and Eq. 9 into hinge-loss-like formulations:
$$\mathcal{H}_w\Delta\mathcal{H}_w = \frac{1}{C}\sum_{i=1}^{C}\max\big(0,\, d_w(P^i_s, P^i_t) - \gamma_w\big)\,, \qquad (10)$$
$$\mathcal{H}_b\Delta\mathcal{H}_b = \frac{1}{C(C-1)}\sum_{i=1}^{C}\sum_{j=1,\, j\neq i}^{C}\max\big(0,\, \gamma_b - d_b(P^i_{s,t}, P^j_{s,t})\big)\,. \qquad (11)$$
We can see that only those distances that violate the inequality constraints contribute losses to the measures. Intuitively,
Figure 5. Examples of maize tassel images in the MTFS3–DA dataset from 10 different fields. In each field, from left to right, images denote the flowering status of non-flowering, partially-flowering, and fully-flowering, respectively. Images are rescaled for better viewing.
HwΔHw assesses how well two distributions locally align, and HbΔHb scores how well an alignment suits classification. Also, we observe that the larger γ_w and the smaller γ_b are, the looser those inequalities constrain. Intuitively, how small should d_w(P^i_s, P^i_t) be? We think it should not exceed γ_w, so that data from the same class are close enough and have a high probability of being classified correctly. Meanwhile, how large should d_b(P^i_{s,t}, P^j_{s,t}) be? We believe it should be at least larger than γ_b, so that data from different classes can be separated easily. As a consequence, when γ_w gradually decreases and γ_b gradually increases, two kinds of curves can be drawn to demonstrate the domain discrepancy under various distance levels.
To make a direct comparison with the HΔH divergence, we choose to estimate d_w(P^i_s, P^i_t) and d_b(P^i_{s,t}, P^j_{s,t}) in a similar vein to d̂_HΔH(U_s, U_t). In addition, when plotting the curves, we can conveniently leverage the numerical range of the A-distance to reduce one parameter by setting γ_b = γ and γ_w = 2 − γ, where γ gradually increases in the interval [1, 2]. In Sec. 6.5, we show that HbΔHb correlates well with the classification accuracy, and HwΔHw is also consistent with the global HΔH. Our results imply that HwΔHw can be seen as a local version of the HΔH measure, and HbΔHb further extends HΔH by endowing it with the ability to measure local variations between classes.
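A minimal sketch of how Eqs. 10 and 11 could be evaluated is given below; the dist argument stands for any pairwise distribution-distance estimator (for instance, the proxy divergence sketched in Sec. 4.1), the parameterization γ_b = γ, γ_w = 2 − γ follows the text, and all function and variable names are ours.

```python
# A minimal sketch of the within-class (Eq. 10) and between-class (Eq. 11)
# divergence measures. `dist` estimates a distance between two sample sets.
import numpy as np

def class_level_divergences(Ps, Pt, ys, yt, gamma, dist):
    """Ps, Pt: projected source/target data; ys, yt: their labels."""
    classes = np.unique(ys)
    C = len(classes)
    gamma_w, gamma_b = 2.0 - gamma, gamma
    # Within-class divergence (Eq. 10): same class, source vs. target.
    hw = np.mean([max(0.0, dist(Ps[ys == c], Pt[yt == c]) - gamma_w)
                  for c in classes])
    # Between-class divergence (Eq. 11): class i vs. class j, domains pooled.
    hb = 0.0
    for i in classes:
        for j in classes:
            if i == j:
                continue
            Pi = np.vstack([Ps[ys == i], Pt[yt == i]])
            Pj = np.vstack([Ps[ys == j], Pt[yt == j]])
            hb += max(0.0, gamma_b - dist(Pi, Pj))
    hb /= C * (C - 1)
    return hw, hb
```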
5. MTFS3–DA dataset
The image acquisition device is described in [42]. In all, 10 maize sequences are collected to construct the MTFS3–DA dataset. The dataset covers a 5-year timespan from 2010 to 2014, 4 different maize cultivars (Wuyue No.3, Jundan No.20, Nongda No.108 and Zhengdan No.958), and 3 different geographical locations (Zhengzhou, Henan province, China; Taian, Shandong province, China; and Gucheng, Hebei province, China). In practice, cross-field domain shifts in agriculture are mainly caused by these three factors. The information of each sequence is summarized in Table 1.
As the camera monitors the growth of maize from the tasseling stage to the flowering stage (two critical growth stages of maize), we find that maize tassels exhibit three types of flowering status [43]: the initial non-flowering status, the intermediate partially-flowering status, and the final fully-flowering status. Some example images of each sequence are illustrated in Fig. 5. We observe that there exist only subtle textural differences between the different types of flowering status, so it can be viewed as a typical cross-domain textural categorization problem. We hope this problem can inspire interest from the pattern recognition community in addressing cross-field challenges in agriculture.
Concretely, we choose to leverage the off-the-shelf bounding box annotations released in our previous work [44] to crop the tassel images from the full-resolution images (extra annotations have been made on the Gucheng 2014 sequence). By doing this, we can relieve the influence of the background as much as possible, and it can also be viewed as a coarse pose normalization. In addition, an agrometeorological observer with more than 10 years of experience is invited to help us annotate all sub-images to ensure the correctness of the labels. For each sequence, we manually select 50 images from each class. In all, we have 150 images in each visual field and 1500 images in the MTFS3–DA dataset. Notice that the dataset originally released in [44] is mainly developed for the evaluation of the detection problem and does not involve any image-level annotations, while the MTFS3–DA dataset is tailored to the DA problem and is set in the context of visual recognition.
6. Experiments and discussion
We first evaluate our approach in the context of visual recognition on standard DA datasets and follow the same experimental protocol as in [11, 13, 14]. In addition, we also perform evaluations on other widely-used image classification datasets and our constructed MTFS3–DA dataset. Along with these numerical results, we further present empirical studies to explain why our method works.
6.1. Experimental dataset and protocol
Office–Caltech10 dataset. The Office–Caltech10 dataset [13] extends the Office31 dataset [11] by adding another Caltech domain, leading to 4 domains: Amazon, DSLR, webcam, and Caltech. 10 common categories are chosen from these domains, resulting in about 2500 images. Overall, we have 12 DA problems.
Office31 dataset. The Office31 dataset is originally introduced by [11]. It consists of 31 categories and 3 domains. We add another 5 images downloaded from the Internet with the same image resolution to the ruler category of the DSLR domain (only
7 images are contained in the original dataset) so that experiments can be conducted under the same protocol. This dataset has 6 DA problems.
Figure 6. Illustration of selecting a subspace dimensionality with the guide of the theoretical bound. The curves show the theoretical bound and the eigenvalue difference λ^min_d − λ^min_{d+1} against the subspace dimensionality.
ImageNet–VOC2007 dataset. We also evaluate our method on the widely-used ImageNet and PASCAL VOC2007 datasets. We choose the same 20 categories as the VOC2007 dataset from ImageNet 2012 to constitute the source domain, and the VOC2007 dataset is regarded as the target domain. Since the categories included are very different from those of the above datasets, experiments performed on this dataset can somewhat demonstrate the generality of our method.
MTFS3–DA dataset. Since our dataset comprises 10 different domains, it leads to a total of A^2_10 = 90 different DA problems. Instead of blindly evaluating all DA problems, we gradually increase the domain shift and organize experiments in a hierarchical manner (see Sec. 6.3 for details). For short, each domain is denoted by {Location}{Year}{Cultivar}{Sequence Number}, where the Sequence Number only appears in the Taian sequences. For instance, the Zhengzhou 2010 domain is denoted by Z10J, and the Taian 2011–1 domain by T11N1.
Experimental protocol. Each DA problem is denoted by Source→Target. For the Office–Caltech10, Office31 and our MTFS3–DA datasets, the average multi-class recognition accuracy across 10 categories over 20 trials is reported on the target domain. In each trial, 20 images are randomly sampled from each category of the source domain as the training set (8 images if the source domain is webcam or DSLR), and the target data are used during both the training and testing stages. Note that the experimental protocol we use on the Office–Caltech10 and Office31 datasets is exactly the same as in [11, 13, 14] except that we use different feature representations, i.e., convolutional activations. Since better feature representations are used, the baseline accuracy is substantially higher than the results reported in their papers (the conventional SURF feature is used in [11, 13, 14]). For the ImageNet–VOC2007 dataset, 50 images are randomly sampled from each category of the ImageNet 2012 subset as the source domain, images from the test set of VOC2007 are used as the target domain, and the average precision for each category is reported, respectively. Since we have sufficient data in the source domain, Sec. 6.6 will present
additional results with other settings on this dataset.
Parameter settings. The optimal dimensionality d in both the SA and 2DSA scenarios is determined by two-fold cross validation over the labeled source data with the guide of the theoretical bound (Sec. 3.2), using the range of values 2^k, k = 0, 1, 2, ..., log2(d_max). Fig. 6 illustrates how to find a stable solution with the guide of the theoretical bound: the optimal dimensionality d should be identified before the intersection of the two types of lines. Since we focus on the adaptation of convolutional activations, methods taking fully-connected activations, like the DeCAF feature [8], as the representation are not employed for comparison. Generally, the CONV5 activations extracted from a pretrained 7-layer CNN model (imagenet-vgg-m [45]) are considered as the feature representation (D = 512), and K = 6 is set in the spatial pooling step (Sec. 3). Thus, the feature dimensionality is K²×D = 18432. Additionally, one-vs-rest linear SVMs [46] are used as the classifier, and the penalty factor C is determined by two-fold cross validation on the source domain using the range of values 10^p, p = −3, −2, −1, 0, 1, 2, 3.
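The search over d and C described above could be sketched as follows (our own illustration, not the authors' code; build_features is a hypothetical callable returning the projected source representation for a given d, and the grids follow the ranges stated in the text).

```python
# A minimal sketch of the parameter search: d from powers of two up to d_max,
# C from 10^p with p = -3..3, both chosen by two-fold cross validation on the
# labeled source data.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

def tune_parameters(source_labels, d_max, build_features):
    d_grid = [2 ** k for k in range(int(np.log2(d_max)) + 1)]
    c_grid = [10.0 ** p for p in range(-3, 4)]
    best_d, best_c, best_score = None, None, -np.inf
    for d in d_grid:
        feats = build_features(d)                         # (n_source, dim) array
        search = GridSearchCV(LinearSVC(), {'C': c_grid}, cv=2)
        search.fit(feats, source_labels)
        if search.best_score_ > best_score:
            best_d, best_c = d, search.best_params_['C']
            best_score = search.best_score_
    return best_d, best_c
```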
6.2. Visual recognition results on standard datasets
Several baseline methods are employed to compare against our 2DSA approach:
• No Adaptation (NA): NA is the basic baseline; the classifier trained on the source domain is directly applied to the target domain.
• Geodesic Flow Kernel (GFK) [13]: GFK is a kernel-based DA method that uses an infinite number of subspaces along the geodesic flow to bridge two domains.
• Transfer Joint Matching (TJM) [35]: TJM formulates feature matching and instance reweighting in a joint optimization problem.
• Landmarks Selection Subspace Alignment (LSSA) [36]: LSSA extends SA by projecting samples onto landmarks and adding further nonlinearity with a Gaussian kernel. Both TJM and LSSA can work at the instance level.
• Subspace Alignment (SA) [14]: SA is described above. This is our most closely related work and the direct baseline.
• SA and 2DSA (Gaussian kernel variants): it may be interesting to see how SA and 2DSA work with nonlinearity. We add these two variants that use an SVM with a Gaussian kernel as the classifier. Similar to the penalty factor C, the kernel parameter σ is also tuned by two-fold cross validation.
• 2DSA (feature-mode variant): a variant of 2DSA that adopts A′_i ∈ R^{K²×D}, i = 1, ..., N, as the matrix descriptor, so it solves a D×D covariance matrix as per Eq. 3. In contrast to 2DSA, which performs spatial-mode adaptation, this variant performs feature-mode adaptation. This variant will show what exactly makes 2DSA different from other approaches.
• 2DSA (vectorized feature-map variant): in 2DSA, all feature maps are summed together and sent to 2DPCA. In this variant, each feature map is vectorized into a 1×K² vector and considered as a specific pattern. All these patterns are sent to standard PCA so that all feature maps share the same eigenvectors. We add such a baseline to justify whether 2DPCA is what makes the difference, and not the fact that eigenvectors are shared across feature maps.
In the following, the vector-form representation is prefixed with v-, and the matrix-form representation with m-. Convolutional activations are denoted as CONV for short. Conventional approaches receive vCONV as the feature representation, while 2DSA and its variants receive mCONV as the representation. Tables 2, 3 and 4 list the numerical results. We can make the following observations:
• Classification results on all three standard DA datasets demonstrate that 2DSA almost consistently and significantly outperforms SA by large margins. In particular, 2DSA achieves the highest mean average accuracy of 74.1% and 55.6% on the Office–Caltech10 and Office31 datasets, respectively. 2DSA also exhibits consistently lower standard deviations than SA, which implies that 2DSA is very stable. In addition, 2DSA also ranks second on the challenging VOC2007 test set (one of the 2DSA variants wins first place);
• SA with vCONV does not see notable improvements in accuracy and sometimes even worsens the classification performance. Similar results are also reported in [47], where SA falls behind NA when 4096-dimensional fully-connected activations are used. One reason is perhaps that the number of training data affects the quality of the generated subspace. This point will be further justified in Sections 6.4 and 6.6;
• The results of the feature-mode variant show that the improvement from feature-mode adaptation of CONV is marginal, and it even degrades the classification performance significantly on the ImageNet–VOC2007 dataset. This may justify that the spatial-mode adaptation mechanism in 2DSA matters. We also consider this to be what distinguishes 2DSA from the other compared approaches—the feature mode is not explicitly adapted. Such a phenomenon may inspire a further exploration: is it necessary to adapt features when the feature representations at hand are already good enough? We leave this question open at present.
• It is interesting that the vectorized feature-map variant also works considerably well. We think the reason is again that it performs 2D adaptation and adapts the spatial mode only. However, compared to 2DSA, its classification accuracy is slightly lower, and its standard deviation is generally higher (especially on the ImageNet–VOC2007 dataset). Perhaps when all feature maps are summed together, as in 2DSA, the covariance matrix can appropriately capture the holistic information of the samples, which is more stable than blindly modeling individual feature maps. In addition, the advantage of 2DSA
Table 2. Recognition accuracy (%) on the Office–Caltech10 dataset over 20 trials. The highest accuracy is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses (we focus on the comparison between SA and 2DSA; results of other approaches are reported for reference).
Method A→C C→A A→D D→A A→W W→A C→D D→C C→W W→C D→W W→D mean
NA 68.3(2.4) 81.2(3.0) 67.9(4.2) 66.4(2.7) 56.7(4.7) 57.9(3.6) 68.7(3.6) 52.6(1.9) 58.8(3.9) 47.2(3.5) 89.2(2.7) 90.5(2.3) 67.1
GFK 77.2(2.2) 85.2(2.2) 78.1(3.6) 80.4(2.6) 74.3(2.8) 70.4(6.5) 79.2(2.7) 63.9(7.0) 74.7(3.1) 59.6(5.0) 86.9(3.2) 86.4(3.0) 76.4
TJM 80.7(1.5) 88.6(1.6) 79.8(7.6) 74.2(6.2) 75.3(5.8) 62.8(7.4) 84.4(3.8) 59.6(7.2) 79.5(4.8) 52.2(4.8) 94.0(2.5) 92.3(2.1) 76.9
LSSA 78.7(0.9) 87.0(1.7) 78.3(2.2) 77.7(4.2) 70.6(2.3) 63.9(5.1) 77.8(2.7) 64.5(4.1) 68.2(5.2) 58.0(3.4) 95.4(2.2) 96.0(1.6) 76.3
SA67.4(4.8) 75.8(3.3) 42.3(14.9) 59.1(19.2) 45.5(10.9) 29.8(24.2) 34.7(11.4) 56.5(4.6) 43.5(12.5) 36.9(20.3) 70.7(36.0) 79.2(17.1) 53.5
SA 69.3(3.7) 84.6(3.8) 64.7(7.6) 66.7(12.5) 53.7(5.2) 58.7(10.4) 70.6(5.2) 56.4(12.1) 58.5(4.3) 50.3(8.5) 83.4(6.2) 84.6(5.8) 66.8
2DSA69.8(14.8) 84.9(16.1) 68.0(15.7) 56.4(24.9) 58.2(17.9) 53.6(23.0) 73.8(14.0) 47.3(22.2) 69.4(13.2) 42.9(20.6) 76.0(34.0) 82.6(31.4) 65.2
2DSA68.6(2.4) 85.0(2.4) 67.2(7.0) 69.3(7.0) 60.3(4.3) 58.8(7.8) 72.5(2.4) 53.0(5.9) 58.4(4.0) 48.3(5.5) 81.3(6.6) 85.0(4.2) 67.3
2DSA74.8(3.2) 85.4(2.6) 72.7(6.7) 70.7(4.4) 67.4(4.0) 58.2(5.1) 77.3(4.9) 58.5(4.2) 70.7(3.7) 50.1(3.2) 91.0(2.8) 92.9(3.4) 72.5
2DSA 75.2(3.0) 85.9(1.8) 75.4(5.2) 73.4(5.0) 66.8(3.9) 63.8(5.6) 76.6(5.0) 62.5(4.7) 69.5(2.9) 55.1(5.1) 91.8(3.0) 93.0(2.5) 74.1
Table 3. Recognition accuracy (%) on the Office31 dataset over 20 trials. The highest accuracy is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses (we focus on the comparison between SA and 2DSA; results of other approaches are reported for reference).
Method A→D D→A A→W W→A D→W W→D mean
NA 41.1(3.2) 30.2(2.3) 33.1(2.8) 26.4(2.2) 74.0(2.6) 76.6(1.5) 46.9
GFK 44.5(2.4) 31.1(4.2) 37.1(3.0) 27.3(2.0) 73.3(2.2) 75.6(2.2) 48.1
TJM 44.9(5.1) 34.8(3.3) 37.5(4.3) 31.3(5.1) 73.9(1.8) 76.6(3.2) 49.8
LSSA 38.9(9.6) 34.5(4.5) 29.0(11.7) 33.1(4.4) 80.4(2.4) 79.3(2.3) 49.2
SA27.8(3.4) 28.9(6.0) 27.9(3.3) 23.1(8.5) 74.5(3.3) 73.0(5.6) 42.5
SA 43.8(9.2) 40.1(3.3) 40.9(7.7) 35.8(6.0) 81.5(2.6) 75.6(4.7) 52.9
2DSA47.8(6.5) 35.6(4.1) 40.3(5.8) 29.1(7.1) 86.8(2.8) 88.4(2.6) 54.7
2DSA45.4(2.7) 35.7(2.0) 37.3(3.4) 32.4(2.1) 78.7(2.0) 80.5(2.9) 51.7
2DSA45.2(4.2) 32.3(5.3) 39.2(3.8) 28.3(3.4) 81.3(2.5) 84.2(2.1) 51.8
2DSA 47.3(4.2) 37.6(1.7) 39.2(6.0) 35.8(1.4) 85.7(2.1) 88.3(2.2) 55.6
over the vectorized feature-map variant is obvious when the number of source samples is limited (D/W → A/C), which means 2DSA is more suitable for small sample sizes. As a consequence, 2DPCA seems a better choice for 2D adaptation.
• The Gaussian kernel variant of 2DSA achieves the highest average precision on the ImageNet–VOC2007 dataset. Yet, the nonlinearity used in the kernel variants of SA and 2DSA does not always benefit classification. On the Office–Caltech10 and Office31 datasets, introducing the Gaussian kernel has a negative effect on the classification accuracy. The kernel variants also exhibit much higher standard deviations than their linear counterparts on the Office–Caltech10 dataset. Hence, one should be careful when using nonlinearity in practice.
• Although TJM achieves higher classification accuracy than 2DSA on the Office–Caltech10 dataset, TJM does not work well when tackling complicated classification problems (31 categories are included in the Office31 dataset) or when inferring classes with complex backgrounds (the VOC2007 dataset). Here is a plausible explanation. Since TJM optimizes an instance reweighting procedure, it works at the instance level. However, as shown in Fig. 7, the Office31 dataset contains some images with inaccurate labels, and the VOC2007 dataset is a typical multi-label dataset. Ambiguous labels are very likely to lead to sample shifts from one class to the other. If these samples are assigned larger weights, the quality of the adaptation will be largely affected. In contrast, 2DSA is a subspace-based approach and works at the domain level. It is not that sensitive to the variations of individual instances. This may explain why 2DSA outperforms TJM on the Office31 and ImageNet–VOC2007 datasets.
• The reason why 2DSA outperforms LSSA may be similar to that for TJM. LSSA also contains an instance reweighting process, so it may suffer from the same problem as TJM. In LSSA, the source and target data are projected onto a shared space using a Gaussian kernel with respect to the selected landmarks. If the selected landmarks contain noisy samples, the resulting nonlinear representations may also be unreliable. With unreliable representations, the data distributions may not change in the way we expect in the projected space to benefit linear classification.
In fact, we think the performance degradation also has something to do with the use of deep features. According to a recent work [48], deep features are considered fragile—features are separable but not discriminative enough (the
Table 4. Average precision (%) on the ImageNet–VOC2007 dataset. The highest average precision is boldfaced, the second best is shown in red, and the standard
deviation is shown in parentheses (we focus on the comparison between SA and 2DSA, results of other approaches are reported for reference).
VOC2007 aero bike bird boat bottle bus car cat chair cow
NA 68.7(6.5) 60.2(4.1) 49.3(4.5) 60.2(7.7) 25.8(1.9) 51.1(3.1) 65.0(2.9) 65.3(2.0) 16.1(4.6) 22.7(7.2)
GFK 59.6(7.6) 57.6(5.9) 33.3(17.6) 40.3(9.9) 23.7(4.3) 48.5(3.2) 64.4(2.6) 46.3(12.0) 13.5(4.9) 19.6(7.1)
TJM 70.8(1.3) 63.9(2.6) 14.8(8.7) 40.2(14.9) 14.9(3.3) 49.0(3.7) 68.7(1.8) 51.9(10.7) 14.3(5.9) 17.0(14.2)
LSSA 54.2(2.3) 58.3(3.1) 25.6(3.2) 21.9(3.1) 22.1(2.0) 39.3(2.4) 63.8(1.4) 45.6(4.3) 27.5(4.0) 16.5(3.0)
SA70.6(3.9) 62.2(5.1) 55.2(10.2) 61.5(11.5) 22.8(4.4) 58.8(4.1) 71.4(2.5) 64.0(6.8) 15.5(6.1) 30.1(6.0)
SA 60.1(12.8) 51.6(18.4) 28.3(11.3) 41.8(18.5) 17.8(3.7) 45.9(12.7) 66.4(7.4) 48.6(12.2) 19.4(9.8) 27.2(7.1)
2DSA78.8(3.6) 73.4(3.0) 68.5(6.1) 74.4(4.2) 30.4(2.6) 64.3(3.6) 75.4(2.5) 74.9(2.8) 20.5(6.2) 48.2(5.8)
2DSA54.2(2.3) 58.3(3.1) 25.6(3.2) 21.9(3.1) 22.1(2.0) 39.3(2.4) 63.8(1.4) 45.6(4.3) 27.5(4.0) 16.5(3.0)
2DSA69.3(3.3) 63.5(3.8) 60.9(3.8) 67.3(5.4) 30.0(2.8) 52.2(4.1) 69.5(2.2) 69.6(3.6) 13.9(4.4) 32.4(4.9)
2DSA 68.7(1.3) 66.3(1.2) 50.7(2.9) 65.9(2.1) 30.8(2.5) 53.8(2.0) 74.6(1.1) 67.7(0.8) 29.1(5.4) 33.5(2.8)
table dog horse mbike person plant sheep sofa train tv mean
NA 27.4(4.0) 42.2(10.8) 37.4(13.2) 51.1(4.4) 70.8(2.2) 18.1(2.5) 44.7(4.7) 36.9(5.6) 69.2(2.8) 47.8(5.3) 46.5
GFK 26.4(6.6) 18.1(4.4) 18.6(17.9) 52.3(8.2) 73.9(2.3) 14.7(0.7) 27.4(15.7) 35.3(5.7) 58.8(5.8) 31.1(6.3) 38.2
TJM 29.2(3.8) 24.5(16.0) 22.6(27.0) 53.7(2.2) 67.6(2.4) 15.3(2.0) 10.0(12.3) 39.3(2.2) 64.3(2.6) 33.9(2.3) 38.3
LSSA 25.3(2.7) 40.8(1.7) 40.4(7.6) 41.7(3.4) 74.3(2.8) 11.0(3.1) 23.7(4.5) 27.7(3.6) 52.4(3.9) 29.5(3.4) 37.1
SA32.1(6.3) 47.6(7.4) 51.4(14.9) 64.0(3.7) 75.5(2.1) 17.6(2.6) 42.7(8.9) 48.3(7.0) 72.9(5.0) 48.9(6.5) 50.7
SA 33.6(4.8) 47.2(8.1) 55.9(15.6) 49.2(13.9) 69.9(9.4) 13.4(5.3) 36.6(12.9) 24.9(13.2) 62.8(10.1) 32.7(5.5) 41.7
2DSA38.5(5.3) 59.3(3.5) 65.1(6.7) 70.9(4.6) 77.7(2.2) 18.5(2.0) 64.1(3.6) 58.6(5.3) 78.4(2.8) 61.3(5.8) 60.1
2DSA25.3(2.7) 40.8(1.7) 40.4(7.6) 41.7(3.4) 74.3(2.8) 11.0(3.1) 23.7(4.5) 27.7(3.6) 52.4(3.9) 29.5(3.4) 37.1
2DSA37.3(6.2) 57.6(4.0) 48.0(9.7) 59.4(3.7) 72.8(2.3) 17.1(3.0) 56.8(3.8) 38.3(3.6) 72.2(4.3) 55.5(3.8) 52.2
2DSA 35.3(2.5) 56.3(3.2) 48.4(5.6) 58.5(3.6) 77.5(1.6) 24.5(1.9) 52.2(4.0) 43.2(2.0) 75.0(0.9) 57.2(2.0) 53.5
intra-class variations are still large). [48] shows that deep features typically present bubble-like shapes in the feature space; different bubbles indicating different classes may easily intersect if a disturbance appears. Such a problem becomes serious in the context of CONV adaptation. The disturbances can come from the problem nature of DA (distribution mismatch) or from the poor estimation of parameters due to high dimensionality. Nevertheless, the good news is that the spatial-mode adaptation mechanism in 2DSA does not seem to ruin the good class separation of CONV.
• It can be concluded that, when aligning convolutional activations, it is better to formulate the problem in the two-dimensional paradigm. Moreover, if we have desirable domain-invariant feature representations, a simple linear adaptation already seems adequate.
6.3. Visual recognition results on MTFS3–DA dataset
For the MTFS3–DA dataset, we organize our experiments in a hierarchical manner. In particular, we gradually increase the domain shifts and evaluate the recognition performance under single-type, double-type and triple-type variations. More specifically, three types of variations are considered: years, cultivars and geographical locations. On this dataset, we only compare the performance of 2DSA against NA and SA. Accuracy improvements of around 10% over the NA baseline are underlined, indicating a significant improvement.
Figure 7. Images shown in the first row are labeled as ruler in the Office31 dataset, and the second row shows images with multiple labels in the VOC2007 dataset. These images with ambiguous labels may affect the performance of instance-level DA methods.
6.3.1. Performance degradation
Before we evaluate these DA problems, we first highlight the problem of cross-field performance degradation. Concretely, we choose 3 typical domains, Z10J, T11N1 and G14Z, as the source domains, respectively, and test the recognition performance on the other 9 target domains. The mean recognition accuracy is reported. Numerical results listed in Table 5 show that the performance degrades significantly in all cases when directly applying the classifier trained on the source domain. This is an important problem that is often ignored in field-based visual applications in agriculture. The factors that make plants differ from year to year or from location to location are complicated. For instance, the quality of seeds, the variations
Table 5. Performance degradation from one domain to the other. The performance in the first column is obtained by testing data from the same domain, and the standard deviation is shown in parentheses.
Source Z10J, self: 77.1(5.0). Targets: Z11J 56.6(4.0), Z12Z 56.7(5.0), T10W1 58.7(5.7), T10W2 55.1(5.1), T11N1 47.2(4.9), T11N2 50.4(4.5), T12Z1 55.4(4.6), T12Z2 51.8(4.2), G14Z 52.6(6.1).
Source T11N1, self: 77.1(5.1). Targets: Z10J 48.5(5.3), Z11J 43.3(4.9), Z12Z 42.5(3.3), T10W1 52.9(5.3), T10W2 46.5(5.0), T11N2 44.7(4.7), T12Z1 51.5(4.6), T12Z2 48.6(4.4), G14Z 50.2(6.2).
Source G14Z, self: 72.8(4.4). Targets: Z10J 44.4(4.8), Z11J 40.7(4.5), Z12Z 36.0(3.6), T10W1 49.4(5.2), T10W2 45.3(3.6), T11N1 46.2(4.8), T11N2 45.8(3.9), T12Z1 48.7(4.2), T12Z2 44.1(3.9).
Table 6. Recognition accuracy (%) under the same cultivar and geographical location but different years over 2 DA problems. The highest accuracy is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method Z10J→Z11J Z11J→Z10J mean
NA 56.6(4.0) 51.6(4.1) 54.1
SA 55.5(7.9) 48.2(12.7) 51.9
2DSA 61.1(4.9) 56.8(5.6) 59.0
of weather and the nutritional status of the soil all largely affect the growth of plants. In addition, different plants will encounter interspecific competition. This is the reason why different plants tend to exhibit different flowering status even if they are seeded at the same time.
6.3.2. DA evaluation under single-type variation
In the first series of evaluations, we consider DA problems caused by only a single type of variation. In particular, two types of variations, years and geographical locations, are evaluated, respectively. Note that the scenario of single cultivar variation is not included, because plants of different cultivars are currently not planted within the same year and geographical location.
Same cultivar and geographical location but different years. Here, we only allow the year to vary while the other two factors are fixed, leading to the 2 DA problems shown in Table 6. In this situation, the weather condition is the main factor that affects the growth of plants. Results show that 2DSA improves the cross-field classification performance and also outperforms SA, which means the shifts caused by weather conditions can be corrected appropriately.
Same cultivar and year but different geographical locations. In this setting, we restrict the cultivar and year to be the same and only vary the geographical locations, resulting in the 4 DA problems shown in Table 7. Plants in different locations are greatly influenced by the soil conditions. The results demonstrate a similar tendency to the first experiment.
6.3.3. DA evaluation under double-type variation
In the second series of DA evaluations, we consider three
kinds of double-type variations. Concretely, they are as follows.
Same geographical location but different years and cultivars. We simultaneously vary the shifts with respect to years and cultivars but require the geographical location to be the same place. This gives rise to 24 DA problems. When different cultivars are considered, maize tassels tend to exhibit significant appearance variations, e.g., different colors. Results are listed in Table 8. It is surprising to see that 2DSA significantly improves the classification performance in 13 out of 24 DA problems, implying that the shifts caused by years and cultivars are not that serious.
Same cultivar but different years and geographical locations. Similarly, we fix the cultivar and change the other two factors in this setting. The 6 DA problems in Table 9 also demonstrate the effectiveness of 2DSA, and 3 of them exhibit a notable performance improvement of over 10%.
Same year but different cultivars and geographical locations. In this setting, only cultivars and geographical locations change simultaneously, and 8 DA problems in all are evaluated. According to the results shown in Table 10, 2DSA significantly improves the accuracy on only one DA task, and 2DSA does not work on the T10W2→Z10J problem. Hence, on the basis of the above results, we conclude that the geographical location is a more important factor in causing domain shifts than the cultivar and the year. Indeed, this is in accordance with our intuition that the various soil conditions of different locations greatly affect the growth of plants.
6.3.4. DA evaluation under triple-type variation
In the final experiment, all three kinds of variations can vary
simultaneously, resulting in the most challenging setting.
Different years, cultivars and geographical locations. Overall, we have 36 DA problems. Numerical results are listed in Table 11. It is interesting that all DA tasks with significant improvements involve the G14 domain, which means the shifts caused by this domain are not easily adapted. For the other DA tasks that do not involve the G14 domain, we find that, although 2DSA still works, the recognition baseline is generally lower than in the single-type and double-type cases. Domain shifts seem serious when all variations are involved, because 22 problems do not exhibit notable accuracy improvements. In addition, SA even works better than 2DSA in two DA problems. As per these observations, we believe that the classification performance indeed has a close relation to the specific data distributions.
Table 7. Recognition accuracy (%) under the same cultivar and year but different geographical locations over 4 DA problems. The highest average precision is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method    Z12Z→T12Z1    T12Z1→Z12Z    Z12Z→T12Z2    T12Z2→Z12Z    mean
NA        51.2(5.8)     54.9(5.2)     47.6(4.1)     48.9(5.5)     50.6
SA        49.6(11.0)    49.4(5.8)     43.2(9.4)     47.8(6.8)     47.5
2DSA      57.8(5.6)     59.4(3.0)     55.2(5.3)     54.3(6.5)     56.7
Table 8. Recognition accuracy (%) under the same geographical location but different years and cultivars over 24 DA problems. The highest average precision is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method    T10W1→T11N1    T11N1→T10W1    T10W1→T11N2    T11N2→T10W1    T10W1→T12Z1    T12Z1→T10W1    T10W1→T12Z2    T12Z2→T10W1
NA        58.8(6.3)      52.9(5.3)      49.9(3.8)      54.1(6.5)      61.4(5.6)      59.4(4.4)      58.0(5.8)      58.7(5.5)
SA        57.8(8.9)      59.4(11.1)     50.7(7.1)      48.7(9.9)      60.2(7.5)      56.8(11.7)     52.2(9.9)      55.8(9.4)
2DSA      66.0(3.7)      62.8(5.0)      59.5(4.0)      62.8(5.9)      66.5(4.6)      69.5(4.1)      67.6(4.2)      66.9(5.9)

Method    T10W2→T11N1    T11N1→T10W2    T10W2→T11N2    T11N2→T10W2    T10W2→T12Z1    T12Z1→T10W2    T10W2→T12Z2    T12Z2→T10W2
NA        55.2(5.1)      46.5(5.0)      52.4(4.7)      52.9(6.2)      59.3(6.7)      56.6(4.6)      57.9(4.7)      51.3(6.1)
SA        53.7(9.9)      49.4(9.0)      53.3(7.1)      48.4(8.1)      53.6(14.8)     57.8(6.9)      45.3(9.9)      50.0(9.2)
2DSA      66.6(4.0)      58.8(5.0)      59.7(5.9)      60.1(5.0)      67.5(3.6)      67.2(3.4)      64.9(5.9)      64.5(5.9)

Method    T11N1→T12Z1    T12Z1→T11N1    T11N1→T12Z2    T12Z2→T11N1    T11N2→T12Z1    T12Z1→T11N2    T11N2→T12Z2    T12Z2→T11N2    mean
NA        51.5(4.6)      50.1(4.9)      48.6(4.4)      52.3(5.5)      54.6(6.5)      48.6(4.0)      50.0(5.1)      46.2(4.8)      53.6
SA        55.4(5.2)      53.3(5.7)      48.7(6.3)      44.4(9.4)      51.1(12.1)     52.4(10.2)     43.7(8.7)      51.9(6.8)      52.2
2DSA      60.7(3.8)      61.6(4.4)      57.5(5.7)      61.9(4.0)      61.6(6.0)      59.8(5.6)      55.4(6.6)      56.2(4.6)      62.7
Table 9. Recognition accuracy (%) under the same cultivar but different years and geographical locations over 6 DA problems. The highest average precision is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method    G14Z→T12Z1    T12Z1→G14Z    G14Z→T12Z2    T12Z2→G14Z    G14Z→Z12Z    Z12Z→G14Z    mean
NA        48.7(4.2)     55.0(6.7)     44.1(3.9)     53.0(6.3)     36.0(3.6)    48.9(4.7)    47.6
SA        54.3(7.2)     51.5(8.3)     50.6(7.3)     51.8(8.0)     47.6(8.7)    48.6(8.9)    50.7
2DSA      62.3(7.8)     62.7(2.9)     61.4(3.6)     59.2(5.3)     57.1(6.5)    57.4(8.8)    60.0
Table 10. Recognition accuracy (%) under the same year but different cultivars and geographical locations over 8 DA problems. The highest average precision is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method    Z10J→T10W1    T10W1→Z10J    Z10J→T10W2    T10W2→Z10J    Z11J→T11N1    T11N1→Z11J    Z11J→T11W2    T11W2→Z11J    mean
NA        58.7(5.7)     59.6(4.1)     55.1(5.1)     60.6(5.3)     45.1(6.5)     43.3(4.9)     44.9(4.6)     49.5(7.3)     52.1
SA        54.3(9.2)     60.0(5.5)     55.3(9.6)     53.6(10.7)    45.3(7.0)     50.2(8.6)     44.5(6.3)     48.7(7.3)     51.5
2DSA      67.4(4.7)     65.3(3.1)     64.4(4.3)     60.6(4.9)     51.0(8.6)     55.4(6.9)     50.7(6.1)     56.7(4.5)     59.0
6.4. Subspace analysis by measuring the reconstruction error
As previously stated, we conjecture that the quality of the generated subspaces affects the performance. To justify this, we assess the subspace quality from the perspective of reconstruction error. Fig. 8 illustrates the results. Note that Q shown in the figure denotes the widely-used energy parameter that controls the subspace dimensionality. It is clear that the reconstruction error of 2DPCA is generally lower than that of PCA. Also, we note that PCA exhibits a relatively high error even when Q equals 100%, while 2DPCA is already close to zero. This gives evidence that PCA cannot appropriately reconstruct convolutional activations with a limited number of training data.
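
To make this comparison concrete, the following Python/NumPy sketch (not the original MATLAB implementation) estimates the reconstruction error of PCA on vectorized activations and of 2DPCA on the original activation matrices. The toy tensor shape, the energy threshold, and all variable names are illustrative assumptions.

import numpy as np

def pca_recon_error(X, q=0.9):
    # PCA on vectorized samples X of shape (n, d); keep components covering energy q.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # economy SVD, avoids the d x d covariance
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(energy, q) + 1)
    P = Vt[:k].T                                         # (d, k) basis
    R = Xc @ P @ P.T                                     # project and reconstruct
    return np.linalg.norm(Xc - R) ** 2 / len(X)

def pca2d_recon_error(A, q=0.9):
    # 2DPCA on matrix samples A of shape (n, h, w) with a right-multiplying projection.
    Ac = A - A.mean(axis=0)
    G = np.einsum('nij,nik->jk', Ac, Ac) / len(A)        # w x w image covariance
    vals, vecs = np.linalg.eigh(G)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    energy = np.cumsum(vals) / np.sum(vals)
    k = int(np.searchsorted(energy, q) + 1)
    P = vecs[:, :k]                                      # (w, k) basis
    R = Ac @ P @ P.T                                     # A_i P P^T
    return np.linalg.norm(Ac - R) ** 2 / len(A)

# Toy activations: 50 samples, 512 channels, 36 spatial locations (an assumed shape).
A = np.random.rand(50, 512, 36)
X = A.reshape(50, -1)
print(pca_recon_error(X, 0.9), pca2d_recon_error(A, 0.9))

Because 2DPCA only eigendecomposes a small w x w image covariance, a given energy level can be reached with far fewer training samples than PCA needs for its much larger covariance, which is consistent with the behaviour reported above.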
6.5. Quantifying the domain discrepancy using divergence measures
Here, we evaluate the domain discrepancy on four typical DA problems on the Office–Caltech10 dataset, using both the global and the proposed local divergence measures. Concretely, we compute the HΔH-divergence measure and the corresponding recognition accuracy over the selected DA problems. Results are listed in Table 12. They demonstrate a tendency similar to our observations in Sec. 4.1. That is, a lower HΔH value does not imply a good classification result, which means the superiority of 2DSA cannot be explained from the global sense. To this end, we further compute the within-class divergence HwΔHw and the between-class divergence HbΔHb, expecting to infer the results from a local perspective. Concretely, we plot the γ-curves for HwΔHw and HbΔHb over the same adaptation tasks in Fig. 9 and Fig. 10, respectively.
Table 11. Recognition accuracy (%) under different years, cultivars and geographical locations over 36 DA problems. The highest average precision is boldfaced, the second best is shown in red, and the standard deviation is shown in parentheses.
Method    Z10J→T11N1    T11N1→Z10J    Z10J→T11N2    T11N2→Z10J    Z10J→T12Z1    T12Z1→Z10J    Z10J→T12Z2    T12Z2→Z10J    Z10J→G14Z
NA        47.2(4.9)     48.5(5.3)     50.4(4.5)     51.0(5.3)     55.4(4.6)     58.1(4.3)     51.8(4.2)     52.6(4.7)     52.6(6.1)
SA        51.7(6.2)     55.4(9.4)     54.0(5.9)     51.8(8.4)     55.5(5.8)     56.0(11.7)    48.4(9.9)     55.4(10.1)    52.5(6.1)
2DSA      65.2(5.0)     59.1(6.2)     60.9(5.5)     61.5(6.7)     64.7(4.8)     62.1(5.0)     56.8(4.2)     60.3(6.5)     63.1(4.2)

Method    G14Z→Z10J     Z11J→T10W1    T10W1→Z11J    Z11J→T10W2    T10W2→Z11J    Z11J→T12Z1    T12Z1→Z11J    Z11J→T12Z2    T12Z2→Z11J
NA        44.4(4.8)     46.5(6.0)     52.8(4.9)     49.8(6.2)     56.2(4.7)     47.1(6.6)     47.8(6.3)     43.7(4.5)     47.3(5.5)
SA        60.9(6.9)     50.0(9.5)     49.5(10.1)    48.2(10.7)    45.1(9.3)     45.4(10.3)    49.6(9.0)     49.3(8.3)     49.3(9.0)
2DSA      59.1(7.2)     55.0(6.5)     59.7(5.9)     53.8(7.4)     60.2(4.6)     54.8(8.0)     49.8(5.0)     52.1(3.7)     49.1(7.3)

Method    Z11J→G14Z     G14Z→Z11J     Z12Z→T10W1    T10W1→Z12Z    Z12Z→T10W2    T10W2→Z12Z    Z12Z→T11N1    T11N1→Z12Z    Z12Z→T11N2
NA        46.4(7.0)     40.7(4.5)     53.3(5.1)     50.3(5.3)     53.6(4.0)     54.8(4.9)     44.0(3.9)     42.5(3.3)     46.5(3.8)
SA        41.0(8.1)     56.1(7.1)     58.0(12.4)    51.8(10.6)    50.0(10.2)    52.3(7.0)     46.3(6.8)     46.8(6.6)     47.9(7.8)
2DSA      53.4(5.3)     54.5(7.4)     62.7(6.1)     61.8(4.7)     61.4(6.1)     62.6(4.6)     56.6(5.1)     55.6(5.7)     55.0(4.7)

Method    T11N2→Z12Z    T10W1→G14Z    G14Z→T10W1    T10W2→G14Z    G14Z→T10W2    T11N1→G14Z    G14Z→T11N1    T11N2→G14Z    G14Z→T11N2    mean
NA        49.5(3.0)     56.8(3.5)     49.4(5.2)     50.9(6.8)     45.3(3.6)     50.2(6.2)     46.2(4.8)     48.6(5.3)     45.8(3.9)     49.4
SA        47.4(9.3)     50.8(9.9)     54.6(6.4)     51.0(9.4)     51.0(7.8)     47.6(8.0)     48.3(7.9)     46.0(7.2)     47.3(7.6)     50.6
2DSA      52.6(5.1)     63.2(4.4)     61.6(5.1)     60.5(6.0)     60.9(7.8)     56.1(4.5)     60.5(8.9)     57.9(4.8)     58.2(4.9)     58.4
Table 12. HΔH domain discrepancy measure and the corresponding recognition accuracy (%) (in parentheses) of different approaches over a specific trial of 4 adaptation problems on the Office–Caltech10 dataset. The lowest HΔH is boldfaced and the highest accuracy is underlined.
Method    A→C           A→D           C→D           C→W
NA        1.33 (69.3)   1.99 (76.3)   1.79 (65.0)   1.66 (64.3)
SA        1.23 (58.1)   0.89 (54.4)   0.94 (52.7)   1.13 (62.9)
2DSA      1.45 (78.9)   2.00 (83.7)   1.94 (83.4)   1.65 (74.9)
We observe that the tendency in HwΔHw is analogous to that of HΔH, though some fluctuations occur. That is to say, HwΔHw can be seen as a local version of HΔH to some degree. Finally, when resorting to the between-class divergence, we find that HbΔHb correlates well with the recognition accuracy. In general, a lower HbΔHb implies good recognition performance. According to these results, we can see that HwΔHw characterizes how good an alignment is, and HbΔHb depicts how well the classification performs. Thus, we believe one should pay more attention to the local class distributions when considering cross-domain classification problems, especially the between-class distributions.
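
For readers who want to reproduce this kind of analysis, the sketch below estimates a proxy divergence with a linear domain classifier, together with a naive class-conditional variant. It only illustrates the general recipe; it is not the exact definition of HΔH, HwΔHw or HbΔHb given earlier in the paper, and the function names and the per-class averaging are our own assumptions.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def proxy_divergence(Xs, Xt):
    # Proxy A-distance-style estimate 2 * (1 - 2 * err) from a linear domain classifier.
    X = np.vstack([Xs, Xt])
    y = np.hstack([np.zeros(len(Xs)), np.ones(len(Xt))])
    acc = cross_val_score(LinearSVC(dual=False), X, y, cv=3).mean()
    err = 1.0 - acc
    return 2.0 * (1.0 - 2.0 * err)

def within_class_divergence(Xs, ys, Xt, yt):
    # Naive class-conditional variant: average the proxy divergence over shared classes.
    vals = []
    for c in np.intersect1d(np.unique(ys), np.unique(yt)):
        vals.append(proxy_divergence(Xs[ys == c], Xt[yt == c]))
    return float(np.mean(vals))

In practice the class-conditional estimate requires labels on both domains, so in the unsupervised setting the target labels would have to be replaced by pseudo-labels.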
6.6. Do we need more training data?
As aforementioned, we partially ascribe the inferior performance of SA to the limited number of training data, and we have also justified this point from the perspective of reconstruction error. In this section, we further conduct experiments to see whether the performance can be enhanced if we add more training data. Specifically, we use the ImageNet–VOC2007 dataset. We continuously change the number of training data sampled from each category, denoted by Nclass, and monitor the variations of average precision (AP). Results for 9 typical classes are illustrated in Fig. 11. We observe that, the more training data we use, the better the performance generally becomes. This trend is obvious when looking at the methods employing vector-form representations (NA and SA). Furthermore, we find that only 2DSA achieves favorable results even with a limited number of training data (Nclass = 8 or Nclass = 16), implying that the performance of 2DSA is not that sensitive to changes in Nclass. This also implies that 2DSA can be applied in the small-sample-size situation, which is common in real-world applications. Based on the results presented, we believe that the performance of SA indeed has a close relation to the number of training data.
6.7. Does the feature really matter?
In this section, we analyze the performance of CONV features from different layers to emphasize the role of feature representation. One intuition about deep convolutional models is that the deeper the layer, the better the representation [49, 50]. To this end, we analyze the performance of 2DSA with different layers of CONV activations on the Office–Caltech10 dataset, following the standard experimental setting. Numerical results are listed in Table 13. Generally, deeper representations yield better accuracy. For instance, we observe a surprisingly significant accuracy improvement from 28.5% to 75.2% under the A→C task. We have to admit that good features really matter.
Here is our point. Indeed, DA methods really count, but domain-invariant features also play a vital role. What are the factors that cause domain shift? As mentioned at the beginning of the main text, they are those intrinsic and extrinsic variations. Hence, it may be a good idea to devote ourselves to developing powerful features that achieve invariance to poses, scales, rotations, illuminations and background, just like those efforts that endow convolutional models with an ability to identify spatial transformations [51].
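
As a rough illustration of how per-layer activations can be extracted for such an analysis, the following sketch uses PyTorch/torchvision with a pretrained AlexNet rather than the MatConvNet model used in the paper; the weight specifier, layer indices, preprocessing and reshaping are assumptions for illustration only, and image.jpg is a hypothetical input path.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained AlexNet has five convolutional layers; the indices below follow
# torchvision's alexnet.features layout (an assumption, not the paper's model).
net = models.alexnet(weights="IMAGENET1K_V1").features.eval()
conv_ids = [0, 3, 6, 8, 10]    # conv1..conv5

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

def conv_activations(image_path):
    # Return one C x (H*W) activation matrix per convolutional layer.
    x = preprocess(Image.open(image_path).convert('RGB')).unsqueeze(0)
    outs = []
    with torch.no_grad():
        for i, layer in enumerate(net):
            x = layer(x)
            if i in conv_ids:
                outs.append(x.squeeze(0).reshape(x.shape[1], -1).numpy())
    return outs

acts = conv_activations('image.jpg')   # hypothetical path

Each returned matrix can then be fed to a 2D alignment step in its natural two-dimensional form, which is the representation the layer-wise comparison in Table 13 relies on.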
[Figure 8: four panels (Amazon, Caltech, DSLR, web-cam); x-axis: energy parameter Q (%); y-axis: reconstruction error (×10^5); curves: SA and 2DSA.]
Figure 8. Reconstruction error of different approaches with changing energy parameter Q (%) in four domains.
[Figure 9: four panels (Amazon→Caltech, Amazon→DSLR, Caltech→DSLR, Caltech→web-cam); x-axis: γ; y-axis: HwΔHw; curves: NA, SA and 2DSA.]
Figure 9. γ-curves regarding local within-class divergence HwΔHw over four DA tasks. In this case, γ = 2γw.
[Figure 10: four panels (Amazon→Caltech, Amazon→DSLR, Caltech→DSLR, Caltech→web-cam); x-axis: γ; y-axis: HbΔHb; curves: NA, SA and 2DSA.]
Figure 10. γ-curves regarding local between-class divergence HbΔHb over four DA tasks. In this case, γ = γb.
Table 13. Recognition accuracy (%) of 2DSA with different layers of CONV activations on the Office–Caltech10 dataset over 20 trials. The arrow indicates the change compared with the previous row, and the standard deviation is shown in parentheses.
Feature    A→C        C→A        A→D        D→A        A→W        W→A        C→D        D→C        C→W        W→C        D→W        W→D        mean
mCONV1     28.5(2.6)  40.5(4.8)  22.7(5.7)  17.1(2.4)  17.5(4.6)  15.6(4.8)  27.6(4.6)  17.9(2.4)  21.2(5.2)  16.3(1.6)  43.8(5.0)  40.4(7.7)  25.8
mCONV2     46.5(2.2)  56.8(8.8)  39.9(6.2)  27.9(5.2)  31.9(5.0)  29.0(4.1)  43.3(6.9)  22.1(3.5)  33.8(3.8)  22.1(2.4)  70.1(10.1) 66.3(7.7)  40.8
mCONV3     44.0(6.3)  61.7(5.7)  38.2(6.0)  22.6(5.0)  31.0(7.3)  30.9(5.9)  47.1(5.3)  20.8(3.6)  37.5(4.8)  22.7(2.9)  71.1(5.5)  66.4(10.0) 41.2
mCONV4     64.7(2.7)  76.8(2.3)  60.7(12.0) 56.6(4.7)  55.8(10.3) 44.4(13.0) 65.1(6.3)  44.7(3.9)  57.3(5.1)  39.3(3.7)  90.7(3.2)  90.6(2.9)  62.2
mCONV5     75.2(3.0)  85.9(1.8)  75.4(5.2)  73.4(5.0)  66.8(3.9)  63.8(5.6)  76.6(5.0)  62.5(4.7)  69.5(2.9)  55.1(5.1)  91.8(3.0)  93.0(2.5)  74.1
Table 14. Average evaluation time (s) of each trial with varying feature dimensionality. (OS: Windows 7 64-bit, CPU: Intel i3-2120 3.30GHz, RAM: 16 GB)
Dimensionality 1152 1568 3872 7200 9248 16928
SA 1.03 2.54 34.57 210.40 444.83 2655.25
2DSA 0.32 0.37 0.45 0.63 0.81 2.14
6.8. Efficiency comparison between SA and 2DSA
As aforementioned, compared with SA, 2DSA offers another important attraction in computational efficiency. Here, we verify this. Concretely, the single-core CPU runtime is tested as the feature dimensionality varies, and the average evaluation time of each trial is reported.
[Figure 11: nine panels (aeroplane, bicycle, bus, car, cow, dog, motorbike, person, train); x-axis: Nclass (log scale); y-axis: AP; curves: NA, SA and 2DSA.]
Figure 11. The performance of different methods with the varying number of training data sampled from each category on the ImageNet–VOC2007 dataset.
According to the numerical results in Table 14, 2DSA is significantly faster than SA when dealing with high-dimensional data, implying that 2DSA would be particularly attractive in practice due to its high efficiency.
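
The gap is easy to reproduce with a small timing sketch. The one below, in Python/NumPy, contrasts a naive eigendecomposition of the full covariance used for an SA-style basis with the much smaller image covariance used for a 2DSA-style basis; the shapes and the naive covariance route are illustrative assumptions, not the paper's exact implementation.

import time
import numpy as np

def time_basis(n=100, h=64, w=36):
    # Compare the cost of building the alignment bases: PCA on vectors of
    # dimension h*w (SA-style) vs. 2DPCA on the w x w image covariance (2DSA-style).
    A = np.random.rand(n, h, w)
    X = A.reshape(n, -1)

    t0 = time.perf_counter()
    C = np.cov(X, rowvar=False)                 # (h*w) x (h*w) covariance
    np.linalg.eigh(C)
    t_sa = time.perf_counter() - t0

    t0 = time.perf_counter()
    Ac = A - A.mean(axis=0)
    G = np.einsum('nij,nik->jk', Ac, Ac) / n    # w x w image covariance
    np.linalg.eigh(G)
    t_2dsa = time.perf_counter() - t0
    return t_sa, t_2dsa

print(time_basis())

The asymmetry is expected: the vector-form route eigendecomposes an (hw) x (hw) matrix, whereas the 2D route only needs a w x w one, which is consistent with the widening runtime gap in Table 14 as the dimensionality grows.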
6.9. Problems within HΔH-divergence
Finally, we would like to emphasize an important problem within HΔH-divergence to inspire further studies. Fig. 12 illustrates two typical relative positions between two domains: separation and tangency. If we estimate the HΔH-divergence for these two situations according to the steps mentioned in Sec. 4.1, the HΔH values will make no difference. Since the domains in both situations are linearly separable, their HΔH values will be close to 2. However, our analysis shows that, when two domains are close enough, they have a high probability of being classified correctly. Hence, it is necessary for a domain divergence measure to differentiate these two situations.
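
The point can be illustrated numerically. In the toy sketch below, two synthetic 2-D domains are generated either far apart or just touching; a linear domain classifier separates them almost perfectly in both cases, so a proxy divergence computed from its accuracy saturates near 2 regardless of how close the domains actually are. The Gaussian construction and the proxy formula are illustrative assumptions, not the estimation protocol of Sec. 4.1.

import numpy as np
from sklearn.svm import LinearSVC

def domain_separability(gap):
    # Two 2-D Gaussian domains whose centers are 'gap' apart along the x-axis.
    rng = np.random.default_rng(0)
    Xs = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(200, 2))
    Xt = rng.normal(loc=[gap, 0.0], scale=0.3, size=(200, 2))
    X = np.vstack([Xs, Xt])
    y = np.hstack([np.zeros(200), np.ones(200)])
    acc = LinearSVC(dual=False).fit(X, y).score(X, y)
    return 2.0 * (2.0 * acc - 1.0)              # proxy divergence from domain accuracy

# A far-apart ('separation') and a just-touching ('tangency') configuration both come out near 2.
print(domain_separability(gap=10.0), domain_separability(gap=1.5))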
7. Conclusion
In this paper, we showed that it is better to align convolutional activations in the two-dimensional world. In particular, we proposed a 2DSA approach to adapt convolutional activations. We gave our insight on why 2DSA works better and further introduced two novel domain divergence measures, termed HwΔHw and HbΔHb, that take labels into account. Extensive experiments justified that 2DSA significantly outperforms SA in both effectiveness and efficiency and also shows superior or at least comparable classification performance with respect to existing benchmark approaches. In addition, an interesting DA application in agriculture was demonstrated as well.
[Figure 12: two schematic panels, each showing a source domain and a target domain.]
Figure 12. Two typical relative positions between two domains. The left denotes that the source domain is separate from the target, and the right indicates that the source is tangent to the target. However, since both domains in these two situations can be linearly separated, it makes no difference to the HΔH-divergence.
Notice that the proposed 2DSA does have limitations. Since 2DSA is only a linear adaptation method, when the distributions of the two domains are significantly distinct, a linear alignment is typically not sufficient and thus 2DSA as proposed may not work. Moreover, in real-world applications, one may encounter the situation that a new test set arrives whose transformed subspace is not aligned with the subspace of the target domain. Under such a circumstance, 2DSA may also fail. Perhaps one possible solution is to realign the new subspace.
For future work, it could be interesting to assign pseudo-labels to the target data and iteratively optimize both the within- and between-class measures so that they could be used as a guiding criterion for choosing a good adaptation in an unsupervised DA context. Moreover, it is worth noting that the introduced measures are independent of a specific distance metric. It is also interesting to explore whether we can learn some kind of metric that achieves both low within- and between-class divergences simultaneously. In addition, we plan to formulate the three-dimensional subspace alignment problem for unsupervised DA, as adapting 3D tensors may be a stronger way to model convolutional activations and may lead to interesting applications, e.g., the adaptation of CNNs not only for new domains but also for new tasks.
Acknowledgment
The authors would like to thank the anonymous reviewers
for their insightful comments. This work is jointly supported by
the National High-tech R&D Program of China (863 Program)
(Grant No. 2015AA015904) and the National Natural Science
Foundation of China (Grant No. 61502187).
References
[1] F. Perronnin, J. Sánchez, T. Mensink, Improving the fisher kernel for large-scale image classification, in: Proc. European Conference on Computer Vision (ECCV), 2010, pp. 143–156. doi:10.1007/978-3-642-15561-1_11.
[2] A. Torralba, A. A. Efros, Unbiased look at dataset bias, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1521–1528. doi:10.1109/CVPR.2011.5995347.
[3] P. Dollar, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: An evaluation of the state of the art, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2012) 743–761. doi:10.1109/TPAMI.2011.155.
[4] V. M. Patel, R. Gopalan, R. Li, R. Chellappa, Visual domain adaptation: A survey of recent advances, IEEE Signal Processing Magazine 32 (2015) 53–69. doi:10.1109/MSP.2014.2347059.
[5] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
[6] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 580–587. doi:10.1109/CVPR.2014.81.
[7] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks?, in: Advances in Neural Information Processing Systems (NIPS), 2014, pp. 3320–3328.
[8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, DeCAF: A deep convolutional activation feature for generic visual recognition, in: Proc. International Conference on Machine Learning (ICML), 2014, pp. 647–655.
[9] Y. Ganin, V. Lempitsky, Unsupervised domain adaptation by backpropagation, in: Proc. International Conference on Machine Learning (ICML), 2015, pp. 1180–1189. URL: http://jmlr.org/proceedings/papers/v37/ganin15.pdf.
[10] N. Zhang, J. Donahue, R. Girshick, T. Darrell, Part-based R-CNNs for fine-grained category detection, in: Proc. European Conference on Computer Vision (ECCV), 2014, pp. 834–849. doi:10.1007/978-3-319-10590-1_54.
[11] K. Saenko, B. Kulis, M. Fritz, T. Darrell, Adapting visual category models to new domains, in: Proc. European Conference on Computer Vision (ECCV), 2010, pp. 213–226. doi:10.1007/978-3-642-15561-1_16.
[12] R. Gopalan, R. Li, R. Chellappa, Domain adaptation for object recognition: An unsupervised approach, in: Proc. IEEE International Conference on Computer Vision (ICCV), 2011, pp. 999–1006. doi:10.1109/ICCV.2011.6126344.
[13] B. Gong, Y. Shi, F. Sha, K. Grauman, Geodesic flow kernel for unsupervised domain adaptation, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2066–2073. doi:10.1109/CVPR.2012.6247911.
[14] B. Fernando, A. Habrard, M. Sebban, T. Tuytelaars, Unsupervised visual domain adaptation using subspace alignment, in: Proc. IEEE International Conference on Computer Vision (ICCV), 2013, pp. 2960–2967. doi:10.1109/ICCV.2013.368.
[15] W. Li, L. Duan, D. Xu, I. W. Tsang, Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2014) 1134–1148. doi:10.1109/TPAMI.2013.167.
[16] H. Pirsiavash, D. Ramanan, C. C. Fowlkes, Bilinear classifiers for visual recognition, in: Advances in Neural Information Processing Systems (NIPS), 2009, pp. 1482–1490.
[17] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (2015) 1904–1916. doi:10.1109/TPAMI.2015.2389824.
[18] R. Girshick, Fast R-CNN, in: Proc. IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448. doi:10.1109/ICCV.2015.169.
[19] J. Yang, D. Zhang, A. Frangi, J.-Y. Yang, Two-dimensional PCA: a new approach to appearance-based face representation and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2004) 131–137. doi:10.1109/TPAMI.2004.1261097.
[20] S. Ben-David, J. Blitzer, K. Crammer, F. Pereira, et al., Analysis of representations for domain adaptation, in: Advances in Neural Information Processing Systems (NIPS), volume 19, 2007, p. 137.
[21] S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22 (2010) 1345–1359. doi:10.1109/TKDE.2009.191.
[22] H. Shimodaira, Improving predictive inference under covariate shift by weighting the log-likelihood function, Journal of Statistical Planning and Inference 90 (2000) 227–244. doi:10.1016/S0378-3758(00)00115-4.
[23] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, J. W. Vaughan, A theory of learning from different domains, Machine Learning 79 (2010) 151–175. doi:10.1007/s10994-009-5152-4.
[24] J. Tao, F. lai Chung, S. Wang, On minimum distribution discrepancy support vector machine for domain adaptation, Pattern Recognition 45 (2012) 3962–3984. doi:10.1016/j.patcog.2012.04.014.
[25] A. S. Mozafari, M. Jamzad, A SVM-based model-transferring method for heterogeneous domain adaptation, Pattern Recognition 56 (2016) 142–158. doi:10.1016/j.patcog.2016.03.009.
[26] J. Blitzer, R. McDonald, F. Pereira, Domain adaptation with structural correspondence learning, in: Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2006, pp. 120–128.
[27] H. Daumé III, Frustratingly easy domain adaptation, in: Proc. Association for Computational Linguistics (ACL), 2007.
[28] Q.-F. Wang, F. Yin, C.-L. Liu, Unsupervised language model adaptation for handwritten Chinese text recognition, Pattern Recognition 47 (2014) 1202–1216. doi:10.1016/j.patcog.2013.09.015.
[29] A. Bergamo, L. Torresani, Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach, in: Advances in Neural Information Processing Systems (NIPS), 2010, pp. 181–189.
[30] B. Kulis, K. Saenko, T. Darrell, What you saw is not what you get: Domain adaptation using asymmetric kernel transforms, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1785–1792. doi:10.1109/CVPR.2011.5995702.
[31] X. Li, M. Fang, J.-J. Zhang, J. Wu, Learning coupled classifiers with RGB images for RGB-D object recognition, Pattern Recognition 61 (2017) 433–446. doi:10.1016/j.patcog.2016.08.016.
[32] E. Kodirov, T. Xiang, Z. Fu, S. Gong, Unsupervised domain adaptation for zero-shot learning, in: Proc. IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2452–2460. doi:10.1109/ICCV.2015.282.
[33] J. Hoffman, E. Rodner, J. Donahue, T. Darrell, K. Saenko, Efficient learning of domain-invariant image representations, CoRR abs/1301.3224 (2013).
[34] S. J. Pan, I. W. Tsang, J. T. Kwok, Q. Yang, Domain adaptation via transfer component analysis, IEEE Transactions on Neural Networks 22 (2011) 199–210. doi:10.1109/TNN.2010.2091281.
[35] M. Long, J. Wang, G. Ding, J. Sun, P. S. Yu, Transfer joint matching for unsupervised domain adaptation, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1410–1417. doi:10.1109/CVPR.2014.183.
[36] R. Aljundi, R. Emonet, D. Muselet, M. Sebban, Landmarks-based kernelized subspace alignment for unsupervised domain adaptation, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 56–63. doi:10.1109/CVPR.2015.7298600.
[37] L. Duan, D. Xu, I. W.-H. Tsang, Domain adaptation from multiple sources: A domain-dependent regularization approach, IEEE Transactions on Neural Networks and Learning Systems 23 (2012) 504–518. doi:10.1109/TNNLS.2011.2178556.
[38] W.-S. Chu, F. De La Torre, J. F. Cohn, Selective transfer machine for personalized facial action unit detection, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013. doi:10.1109/CVPR.2013.451.
[39] M. Long, Y. Cao, J. Wang, M. I. Jordan, Learning transferable features with deep adaptation networks, in: Proc. International Conference on Machine Learning (ICML), 2015.
[40] J. W. Osborne, A. B. Costello, Sample size and subject to item ratio in principal components analysis, Practical Assessment, Research & Evaluation 9 (2004) 8.
[41] L. Van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008) 85.
[42] H. Lu, Z. Cao, Y. Xiao, Z. Fang, Y. Zhu, Toward good practices for fine-grained maize cultivar identification with filter-specific convolutional activations, IEEE Transactions on Automation Science and Engineering (2016). doi:10.1109/TASE.2016.2616485.
[43] H. Lu, Z. Cao, Y. Xiao, Z. Fang, Y. Zhu, Towards fine-grained maize tassel flowering status recognition: dataset, theory and practice, Applied Soft Computing 56 (2017) 34–45. doi:10.1016/j.asoc.2017.02.026.
[44] H. Lu, Z. Cao, Y. Xiao, Z. Fang, Y. Zhu, K. Xian, Fine-grained maize tassel trait characterization with multi-view representations, Computers and Electronics in Agriculture 118 (2015) 143–158. doi:10.1016/j.compag.2015.08.027.
[45] A. Vedaldi, K. Lenc, MatConvNet: Convolutional neural networks for MATLAB, in: Proc. ACM International Conference on Multimedia, 2015, pp. 689–692.
[46] R.-E. Fan, X.-R. Wang, C.-J. Lin, LIBLINEAR: A library for large linear classification, Journal of Machine Learning Research 9 (2014) 1871–1874.
[47] B. Sun, J. Feng, K. Saenko, Return of frustratingly easy domain adaptation, in: Proc. AAAI Conference on Artificial Intelligence, 2016.
[48] Y. Wen, K. Zhang, Z. Li, Y. Qiao, A discriminative feature learning approach for deep face recognition, in: Proc. European Conference on Computer Vision (ECCV), Springer, 2016, pp. 499–515.
[49] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1556 (2014). URL: http://arxiv.org/abs/1409.1556.
[50] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[51] M. Jaderberg, K. Simonyan, A. Zisserman, K. Kavukcuoglu, Spatial transformer networks, in: Advances in Neural Information Processing Systems (NIPS), 2015, pp. 2008–2016.
The emergence of depth images opens a new dimension to address the challenging object recognition tasks. However, when only a small amount of labeled data is available, we cannot learn a discriminative classifier directly using the RGB-D images. To cope with this problem, we proposed a new method, Learning Coupled Classifiers with RGB images for RGB-D object recognition (LCCRRD). We learn the coupled classifiers using RGB images from source domain, the combined RGB and depth images from target domain and RGB images from target domain. The predicted results of the two target classifiers are made to be similar to make them more accurate. We also utilize the correlation between source and target RGB images to boost the relevant features and eliminate the irrelevant features. It also has the capacity to incorporate the manifold structure into our model. Furthermore, a unified objective function is presented to learn the classifier parameters. To evaluate our LCCRRD method, we apply it to five cross domain datasets. The experimental results demonstrate that our method can achieve competing performance against the state-of-art methods for object recognition tasks.