Technical Report
Tae-Kyun Kim
20 August 2006
Department of Engineering
University of Cambridge
Chapter 1
Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations
Tae-Kyun Kim^1, Josef Kittler^2, Roberto Cipolla^1
1 : Engineering Department, University of Cambridge, Cambridge, CB2 1PZ, UK
2 : CVSSP, University of Surrey, Guildford, GU2 7XH, UK
Abstract
We address the problem of comparing sets of images for object recognition, where the
sets may represent variations in an object’s appearance due to changing camera pose and
lighting conditions. Canonical Correlations (also known as principal or canonical angles),
which can be thought of as the angles between two d-dimensional subspaces, have recently
attracted attention for image set matching. Canonical correlations offer many benefits in
accuracy, efficiency, and robustness compared to the two main classical methods: para-
metric distribution-based and non-parametric sample-based matching of sets. Here, this
is first demonstrated experimentally for reasonably sized data sets using existing methods
exploiting canonical correlations. Motivated by their proven effectiveness, a novel discrim-
inative learning method over sets is proposed for set classification. Specifically, inspired by
classical Linear Discriminant Analysis (LDA), we develop a linear discriminant function
that maximizes the canonical correlations of within-class sets and minimizes the canonical
correlations of between-class sets. Image sets transformed by the discriminant function are
then compared by the canonical correlations. The proposed method is evaluated on various
object recognition problems using face image sets with arbitrary motion captured under dif-
ferent illuminations and image sets of five hundred general objects taken at different views.
The method is also applied to object category recognition using the ETH-80 database. The
proposed method is shown to outperform the state-of-the-art methods in terms of accuracy
and efficiency.
1.1 Introduction
Many computer vision tasks can be cast as learning problems over vector or image sets. In
object recognition, for example, a set of vectors may represent a variation in an object’s
appearance, be it due to camera pose changes, non-rigid deformations or variations in
illumination conditions. The objective of this work is to classify an unknown set of vectors
to one of the training classes, each also represented by vector sets. More robust object
recognition performance can be achieved by efficiently using set information rather than
a single vector or image as input. Examples of pattern sets of an object are shown in
Figure 1.1.
Whereas most of the previous work on matching image sets for object recognition ex-
ploits temporal coherence between consecutive images [1, 2, 3, 4, 5], this study does not
make any such assumption. Sets may be derived from sparse and unordered observations acquired by multiple still shots of a three-dimensional object, or by long-term monitoring of a scene, as exemplified e.g. by surveillance systems, where a subject would not face the camera all the time. This also makes it more convenient to augment training sets in the proposed framework. As this work does not exploit any data semantics explicitly, the proposed method is expected to be applicable to many other problems requiring a set comparison.
Relevant previous approaches to set matching for set classification can be broadly parti-
tioned into parametric model-based [6, 7] and non-parametric sample-based methods [8, 9].
In the model-based approaches, each set is represented by a parametric distribution func-
tion, typically Gaussian. The closeness of the two distributions is then measured by the
Kullback-Leibler Divergence (KLD) [30]. Due to the difficulty of parameter estimation
under limited training data, these methods easily fail when the training and novel test sets
do not have strong statistical relationships.
Rather more relevant methods for comparing sets are based on matching of pairwise
samples of sets, e.g. Nearest Neighbour (NN) and Hausdorff distance matching [8, 9]. The
methods are based on the premise that similarity of a pair of sets is reflected by the sim-
ilarity of the modes (or NN samples) of the two respective sets. This is certainly useful
in many computer vision applications where the data acquisition conditions may change
dramatically over time. For example, as shown in Figure 1.1 (a), when two sets contain im-
ages of an object taken from different views but with a certain overlap in views, global data
characteristics of the sets are significantly different, making the model-based approaches un-
successful. To recognise the two sets as the same class, the most effective solution would
be to find the common views and measure the similarity of those parts of data. In spite
of their rational basis, the non-parametric sample-based methods easily fail, as they do not
take into account the effect of outliers as well as the natural variability of the sensory data
due to the 3D nature of the observed objects. Note also that such methods are very time
consuming as they require a comparison of every pair of samples drawn from the two sets.
The above discussion is concerned purely with how to quantify the degree of match
between two sets, that is, how to define similarity of two sets. However, the other impor-
tant problem in set classification is how to learn a discriminative function from training data associated with a given similarity function. To our knowledge, the topic of discriminative learning over sets has not been given proper attention in the literature. In this study, we interpret the classical Linear Discriminant Analysis (LDA) [9, 10] and its non-parametric variants, Non-parametric Discriminant Analysis (NDA) [18], as techniques of discriminative learning over sets (see Section 1.2.1). LDA has been recognized as a powerful method for face recognition based on a single face image as input. The methods based on LDA have been widely advocated in the literature [10, 11, 12, 13, 14, 17]. However, note that these methods do not consider multiple input images. When they are directly applied to set classification based on sample matching, they inherit the drawbacks of the classical non-parametric sample-based methods, as discussed above.

Figure 1.1: Examples of image sets. (a) Two sets (top and bottom) contain images of a 3D object taken from different views but with a certain overlap in their views. (b) Two face image sets (top and bottom) collected from videos taken under different illumination settings. Face patterns of the two sets vary in both lighting and pose. The sets contain different pattern variations caused by different views and lighting.
Relatively recently the concept of canonical correlations has attracted increasing at-
tention for image set matching in [15, 19, 20, 21, 22], following the early works [23, 24,
25, 26]. Each set is represented by a linear subspace and the angles between two high-
dimensional subspaces are exploited as a similarity measure of two sets (See Section 1.2.2
for more details). As a method for comparing sets, the benefits of canonical correlations
over both parametric distribution-based and sample-based matching, have been noted in
our earlier work [15] as well as in [7]. They include efficiency, accuracy and robust-
ness. This will be discussed and demonstrated in a more detailed and rigorous manner
in Section 1.2.2 and Section 1.5. A nonlinear extension of canonical correlation has been
proposed in [15, 20, 38] and a feature selection scheme for the method in [15]. The Con-
strained Mutual Subspace Method (CMSM) [21, 22] is the most closely related to the approach of
this study. In CMSM, a constrained subspace is defined as the subspace in which the en-
tire class population exhibits small variance. The authors showed that the sets of different
classes in the constrained subspace had small canonical correlations. However, the prin-
ciple of CMSM is rather heuristic, especially the process of selecting the dimensionality
of the constrained subspace. If the dimensionality is too low, the subspace will be a null
space. In the opposite case, the subspace simply captures all the energy of the original data
and thus cannot play the role of a discriminant function.
This study presents a novel method of object recognition using image sets, which is
based on canonical correlations. The previous conference version [16] has been extended
by a more detailed discussion of the key ingredients of the method and the convergence
properties of the proposed learning, as well as by reporting the results of additional exper-
iments on face recognition and general object category recognition using the ETH80 [36] database. The main contributions of this study are as follows. First, as a method of
comparing sets of images, the benefits of canonical correlations of linear subspaces are ex-
plained and evaluated. Extensive experiments comparing canonical correlations with both
classical methods (parametric model-based and non-parametric sample-based matching)
are carried out to demonstrate these advantages empirically. A novel method of discrim-
inant analysis of canonical correlations is then proposed. A linear discriminant function
that maximizes the canonical correlations of within-class sets and minimizes the canoni-
cal correlations of between-class sets is defined, by analogy to the optimization concept
of LDA. The linear mapping is found by a novel iterative optimization algorithm. Image
sets transformed by the discriminant function are then compared by canonical correlations.
The discriminative capability of the proposed method is shown to be significantly bet-
ter than both the method [19] that simply aggregates canonical correlations and the kNN
method applied to image vectors transformed by LDA. Interestingly, the method exhibits
very good accuracy as well as other attractive properties: low computational matching cost
and simplicity of feature selection. The proposed iterative solution is further compared
with classical orthogonal subspace method (OSM) [31], also devised to improve the simple
canonical correlation method. As canonical correlations are only determined up to rota-
tions within subspaces, the canonical correlations of subspaces of between-class sets can
be minimized by orthogonalizing those subspaces. To our knowledge, the close relationship
of the orthogonal subspace method and canonical correlations has not been noted before.
It is also interesting to see that OSM has a close affinity to CMSM. The proposed method
and OSM are assessed experimentally on diverse object recognition problems: faces with
arbitrary motion under different lighting, general 3D objects observed from different view
points and the ETH80 general object category database. The new techniques are shown
to outperform the state-of-the-art methods, including OSM/CMSM and a commercial face
recognition software, in terms of accuracy and efficiency.
The chapter is organized as follows. The relevant background methods are briefly re-
viewed and discussed in Section 1.2. Section 1.3 highlights the problem of discriminant
analysis over sets and presents a novel iterative solution. In Section 1.4, the orthogonal
subspace method is explained and related to both the proposed method and the prior art.
The experimental results and their discussion are presented in Section 1.5. Conclusions are
drawn in Section 1.6.
1.2 Key Ingredients of the Proposed Learning
1.2.1 Parametric/Non-parametric Linear Discriminant Analysis
Assume that a data matrix $X = \{x_1, x_2, ..., x_M\} \in \mathbb{R}^{N \times M}$ is given, where $x_i \in \mathbb{R}^N$ is an $N$-dimensional column vector obtained by raster-scanning an image. Each vector belongs to one of the object classes denoted by $C_i$. Classical linear discriminant analysis (LDA) finds a transformation $T \in \mathbb{R}^{N \times n}$ ($n \leq N$) which maps a vector $x$ to $\tilde{x} = T^T x \in \mathbb{R}^n$ such that the transformed data have maximum separation between classes and minimum separation within classes. The between-class and within-class scatter matrices in LDA [10] are given by

$$B = \sum_c M_c (m_c - m)(m_c - m)^T, \qquad W = \sum_c \sum_{x \in C_c} (x - m_c)(x - m_c)^T,$$

where $m_c$ denotes the class mean, $m$ is the global mean of the entire sample set and $M_c$ denotes the number of samples in class $c$. With the assumption that all classes have Gaussian distributions with equal covariance matrix, $\mathrm{trace}(B)$ and $\mathrm{trace}(W)$ measure the scatter of vectors in the between-class and within-class populations respectively. A nonparametric form of these scatter matrices is also proposed in [18] with the definition of the between-class and within-class neighbours of a sample $x_i \in C_c$ given by
$$B = \frac{1}{M} \sum_{i=1}^{M} w_i\, \Delta_i^B (\Delta_i^B)^T, \qquad W = \frac{1}{M} \sum_{i=1}^{M} \Delta_i^W (\Delta_i^W)^T \tag{1.1}$$

where $\Delta_i^B = x_i - x_i^B$, $\Delta_i^W = x_i - x_i^W$, $x^B = \{x' \notin C_c \mid \|x' - x\| \leq \|z - x\|,\ \forall z \notin C_c\}$ and $x^W = \{x' \in C_c \mid \|x' - x\| \leq \|z - x\|,\ \forall z \in C_c\}$. $w_i$ is a sample weight used to deemphasize samples away from class boundaries. LDA or Nonparametric Discriminant Analysis (NDA) finds the optimal $T$ which maximizes $\mathrm{trace}(\tilde{B})$ and minimizes $\mathrm{trace}(\tilde{W})$, where $\tilde{B}, \tilde{W}$ are the scatter matrices of the transformed data. As these are explicitly represented in terms of $T$ by $\tilde{B} = T^T B T$, $\tilde{W} = T^T W T$, the solution $T$ can be easily obtained by solving the generalized eigen-problem $BT = WT\Lambda$, where $\Lambda$ is the eigenvalue matrix.
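As a minimal sketch (an illustration, not part of the original toolchain), the generalized eigen-problem can be solved directly with SciPy, assuming $B$ and $W$ are the symmetric scatter matrices defined above and $W$ is positive definite:

```python
import numpy as np
from scipy.linalg import eigh

def lda_nda_transform(B, W, n):
    # Solve B t = lambda W t for the symmetric-definite pair (B, W).
    # eigh returns eigenvalues in ascending order, so reverse and keep n.
    evals, evecs = eigh(B, W)
    return evecs[:, ::-1][:, :n]   # T in R^{N x n}: top-n discriminant directions
```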
When we regard the training data of each class as a set, LDA or NDA can be viewed
as the discriminant analysis of the vector sets based on similarity of parametric model-
based and non-parametric sample-based matching of sets respectively. In LDA, each set
(i.e. a class) is assumed to be normally distributed with equal covariance matrix and these
parametric distributions are optimally separated. On the other hand, in NDA, set similarity
is measured by the aggregated distance of a certain number of nearest neighbour samples
and the separation of the sets is optimized based on this set similarity.
It is also worth noting that the between-class and within-class scatter measures based on pairwise vector-distance in LDA/NDA can be related to pairwise vector-correlation in many pattern recognition problems. The magnitude of a data vector is often normalized so that $\|x\| = 1$. As $\mathrm{trace}(AB) = \mathrm{trace}(BA)$ for any matrices $A, B$, and $\|x\| = 1$, $\mathrm{trace}(W)$ in (1.1) equals $\frac{1}{M}\mathrm{trace}\big(\sum_i 2(1 - x_i^T x_i^W)\big)$. The problem of minimizing $\mathrm{trace}(W)$ can thus be changed into the maximization of $\mathrm{trace}(W')$ and, similarly, the maximization of $\mathrm{trace}(B)$ into the minimization of $\mathrm{trace}(B')$, where

$$B' = \sum_i x_i^T x_i^B, \qquad W' = \sum_i x_i^T x_i^W \tag{1.2}$$

and $x_i^B, x_i^W$ indicate the closest between-class and within-class vectors of a given vector $x_i$. Note the weight $w_i$ is omitted for simplicity, and the total number of training samples $M$ does not change the direction of the desired components. We now see the optimization problem of classical NDA defined by correlations of pairwise vectors. Rather than dealing with correlations of every pair of vectors, the proposed method exploits canonical correlations of pairwise linear subspaces of sets (see Section 1.3.1 for the proposed problem formulation). By resorting to canonical correlations the proposed method overcomes the shortcomings of both classical model-based and sample-based approaches in set comparison.
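To make the correlation-based reading of (1.2) concrete, here is a small hedged sketch (the names `X`, `y` are illustrative; columns of `X` are assumed unit-norm, and the weight $w_i$ is omitted as above):

```python
import numpy as np

def correlation_scatters(X, y):
    """B' and W' of Eq. (1.2): correlation of each sample with its closest
    between-class and within-class neighbour (X: N x M, unit-norm columns)."""
    y = np.asarray(y)
    C = X.T @ X                      # pairwise correlations x_i^T x_j
    dist = 2.0 - 2.0 * C             # squared distance for unit-norm vectors
    np.fill_diagonal(dist, np.inf)   # a sample is not its own neighbour
    same = y[:, None] == y[None, :]
    B_prime, W_prime = 0.0, 0.0
    for i in range(X.shape[1]):
        W_prime += C[i, np.where(same[i], dist[i], np.inf).argmin()]
        B_prime += C[i, np.where(same[i], np.inf, dist[i]).argmin()]
    return B_prime, W_prime
```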
1.2.2 Definition and Solution of Canonical Correlations
Canonical correlations, which are cosines of principal angles 0 θ
1
. . . θ
d
(π/2)
between any two d-dimensional linear subspaces L
1
and L
2
are uniquely defined as:
cos θ
i
= max
u
i
∈L
1
max
v
i
∈L
2
u
T
i
v
i
(1.3)
subject to u
T
i
u
i
= v
T
i
v
i
= 1, u
T
i
u
j
= v
T
i
v
j
= 0, i 6= j. There are various ways to solve
this problem. They are all equivalent but the Singular Value Decomposition (SVD) solu-
tion [26] is more numerically stable than the others, as the number of free parameters to esti-
mate is smaller. A comparison with the method called MSM [19] is given in Appendix .0.1.
The SVD solution is as follows. Assume that $P_1 \in \mathbb{R}^{N \times d}$ and $P_2 \in \mathbb{R}^{N \times d}$ form unitary orthogonal bases for the two linear subspaces $\mathcal{L}_1$ and $\mathcal{L}_2$. Let the SVD of $P_1^T P_2 \in \mathbb{R}^{d \times d}$ be

$$P_1^T P_2 = Q_{12} \Lambda Q_{21}^T \quad \text{s.t.} \quad \Lambda = \mathrm{diag}(\sigma_1, ..., \sigma_d) \tag{1.4}$$

where $Q_{12}^T Q_{12} = Q_{21}^T Q_{21} = Q_{12} Q_{12}^T = Q_{21} Q_{21}^T = I_d$. The canonical correlations are the singular values, and the associated canonical vectors (whose correlations are the canonical correlations) are given by

$$U = P_1 Q_{12} = [u_1, ..., u_d], \qquad V = P_2 Q_{21} = [v_1, ..., v_d] \tag{1.5}$$
Canonical vectors are orthonormal in each subspace, and $Q_{12}, Q_{21}$ can be seen as rotation matrices of $P_1, P_2$. The concept is illustrated in Figure 1.2.
Intuitively, the first canonical correlation tells us how close the closest vectors of the two subspaces are. Similarly, the higher canonical correlations tell us about the proximity of vectors of the two subspaces in the other dimensions (perpendicular to the previous ones) of the embedding space. Note that a set of high-dimensional pattern vectors can usually be well confined to a low-dimensional subspace which retains most of the energy of the set. See Figure 1.3 for the canonical vectors computed from the sample image sets given in Figure 1.1. The common modes (views and/or illuminations) of the two different sets are well captured by the first few canonical vectors found. Each canonical vector of one set is very similar to the corresponding canonical vector of the other set despite the data changes across the sets. The canonical vectors of different dimensions represent different variations of the patterns. Compared with the parametric distribution-based matching, this concept is more flexible, as it effectively places a uniform prior over the subspace of possible pattern variations. Compared with the NN matching of samples, this approach is much more stable, as the samples are confined to a certain subspace. The complexity of the SVD of a $d \times d$ matrix is also very low.

Figure 1.2: Conceptual illustration of canonical correlations. Two sets are represented as linear subspaces, which are planes here. Canonical vectors on the planes are found to yield maximum correlations. In a two-dimensional subspace case, the second canonical vectors $u_2, v_2$ are determined to be perpendicular to the first ones.

Figure 1.3: Principal components vs. canonical vectors. (a) The first 5 principal components computed from the four image sets shown in Figure 1.1. The principal components of the different image sets are significantly different. (b) The first 5 canonical vectors of the four image sets, computed for each pair of the two image sets of the same object. Every pair of canonical vectors (each column) of $U, V$ well captures the common modes (views and illuminations) of the two sets containing the same object. The pairwise canonical vectors are quite similar. The canonical vectors of different dimensions $u_1, ..., u_5$ and $v_1, ..., v_5$ represent different pattern variations, e.g. in pose or lighting.
1.3 Discriminant-analysis of Canonical Correlations (DCC)
As shown in Figure 1.3, canonical correlations of two different image sets of the same object acquired in different conditions proved to be a promising measure of the similarity of the two sets. This suggests that, by matching based on image sets, one could achieve a robust solution to the problem of object recognition even when the observation data is subject to extensive variations. However, it is further required to suppress the contribution to similarity of canonical vectors of two image sets that is due to common environmental conditions. The optimal discriminant function is proposed to transform image sets so that canonical correlations of within-class sets are maximized while canonical correlations of between-class sets are minimized in the transformed data space.
1.3.1 Problem Formulation
Assume $m$ sets of vectors are given as $\{X_1, ..., X_m\}$, where $X_i$ describes a data matrix of the $i$-th set containing observation vectors (or images) in its columns. Each set belongs to one of the object classes denoted by $C_i$. A $d$-dimensional linear subspace of the $i$-th set is represented by an orthonormal basis matrix $P_i \in \mathbb{R}^{N \times d}$ s.t. $X_i X_i^T \simeq P_i \Lambda_i P_i^T$, where $\Lambda_i, P_i$ are the eigenvalue and eigenvector matrices of the $d$ largest eigenvalues respectively and $N$ denotes the vector dimension. We define a transformation matrix $T = [t_1, ..., t_n] \in \mathbb{R}^{N \times n}$, where $n \leq N$, $\|t_i\| = 1$, s.t. $T: X_i \rightarrow Y_i = T^T X_i$. The matrix $T$ transforms images so that the transformed image sets are class-wise more discriminative using canonical correlations.
Representation. Orthonormal basis matrices of the subspaces of the transformed data are obtained from the previous matrix factorization of $X_i X_i^T$:

$$Y_i Y_i^T = (T^T X_i)(T^T X_i)^T \simeq (T^T P_i) \Lambda_i (T^T P_i)^T \tag{1.6}$$

Except when $T$ is an orthogonal matrix, $T^T P_i$ is not generally an orthonormal basis matrix. Note that canonical correlations are only defined for orthonormal basis matrices of subspaces. Any orthonormal components of $T^T P_i$, now defined by $T^T P'_i$, can represent an orthonormal basis matrix of the transformed data. See Section 1.3.2 for details.
Set Similarity. The similarity of any two transformed data sets, represented by $T^T P'_i$ and $T^T P'_j$, is defined as the sum of canonical correlations by

$$F_{ij} = \max_{Q_{ij}, Q_{ji}} \mathrm{tr}(M_{ij}), \tag{1.7}$$

$$M_{ij} = Q_{ij}^T {P'_i}^T T T^T P'_j Q_{ji} \quad \text{or} \quad T^T P'_j Q_{ji} Q_{ij}^T {P'_i}^T T, \tag{1.8}$$

as $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ for any matrices $A, B$. $Q_{ij}, Q_{ji}$ are the rotation matrices defined, as in the SVD solution of canonical correlations (1.4), for the two transformed subspaces.
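As an illustration, once the bases have been orthonormalized (the QR-based normalization of Section 1.3.2), $F_{ij}$ of (1.7) reduces to one small SVD; the following sketch assumes exactly that setting:

```python
import numpy as np

def orthonormalize(T, Pi):
    # P'_i of Section 1.3.2: QR-decompose T^T P_i and absorb the triangular factor
    _, R = np.linalg.qr(T.T @ Pi)
    return Pi @ np.linalg.inv(R)

def set_similarity(T, Pi, Pj):
    # F_ij of Eq. (1.7): sum of canonical correlations of the transformed sets,
    # i.e. the singular values of (T^T P'_i)^T (T^T P'_j)
    Pi_n, Pj_n = orthonormalize(T, Pi), orthonormalize(T, Pj)
    sigma = np.linalg.svd(Pi_n.T @ T @ T.T @ Pj_n, compute_uv=False)
    return sigma.sum()
```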
Figure 1.4: Conceptual illustration of the proposed method. Three sets are drawn, represented by the basis vector matrices $P_i, i = 1, ..., 3$. We assume that the two sets $P_1, P_2$ are within-class sets and the third one comes from another class. The canonical vectors $P_i Q_{ij}, i = 1, ..., 3, j \neq i$ are equivalent to the basis vectors $P_i$ in this simple drawing, where each set occupies a one-dimensional space. Basis vectors are projected on the discriminative subspace by $T$ and normalized such that $\|T^T P'\| = 1$. Then, the principal angle of the within-class sets, $\theta$, becomes zero and the angles of the between-class sets, $\phi_1, \phi_2$, are maximized.
Discriminant Function. The discriminative function (or matrix) $T$ is found so as to maximize the similarities of any pair of within-class sets while minimizing the similarities of pairwise sets of different classes. The matrix $T$ is defined by the objective function $J$ as

$$T = \arg\max_T J = \arg\max_T \frac{\sum_{i=1}^{m} \sum_{k \in W_i} F_{ik}}{\sum_{i=1}^{m} \sum_{l \in B_i} F_{il}} \tag{1.9}$$

where the index sets are defined as $W_i = \{j \mid X_j \in C_i\}$ and $B_i = \{j \mid X_j \notin C_i\}$. That is, the two index sets $W_i, B_i$ denote, respectively, the within-class and between-class sets for a given set of class $i$, by analogy to [18]. See Figure 1.4 for the concept of the proposed problem. In the discriminative subspace represented by $T$, canonical correlations of within-class sets are to be maximized and canonical correlations of between-class sets to be minimized.
1.3.2 Iterative Learning
The optimization problem of $T$ involves the variables $Q$ and $P'$ as well as $T$. As the other variables are not explicitly represented in terms of $T$, a closed-form solution for $T$ is hard to find. We propose an iterative optimization algorithm. Specifically, we compute an optimal solution for one of the three variables at a time, fixing the other two, and repeat this for a certain number of iterations. The proposed iterative optimization thus comprises three main steps: normalization of $P$, optimization of the matrices $Q$, and optimization of $T$. Each step is explained below:
Normalization. The matrix $P_i$ is normalized to $P'_i$ for a fixed $T$ so that the columns of $T^T P'_i$ are orthonormal. QR-decomposition of $T^T P_i$ is performed s.t. $T^T P_i = \Phi_i \Delta_i$, where $\Phi_i \in \mathbb{R}^{N \times d}$ is the orthonormal matrix composed of the first $d$ columns and $\Delta_i \in \mathbb{R}^{d \times d}$ is the $d \times d$ invertible upper-triangular matrix. From (1.6), $Y_i = T^T P_i \Lambda_i = \Phi_i \Delta_i \Lambda_i$. As $\Delta_i \Lambda_i$ is still an upper-triangular matrix, $\Phi_i$ can represent an orthonormal basis matrix of the transformed data $Y_i$. As $\Delta_i$ is invertible,

$$\Phi_i = T^T (P_i \Delta_i^{-1}) \quad \Rightarrow \quad P'_i = P_i \Delta_i^{-1}. \tag{1.10}$$
Computation of rotation matrices Q. The rotation matrices $Q_{ij}$ for every pair $i, j$ are obtained for fixed $T$ and $P'_i$. The correlation matrix $M_{ij}$ defined in the left-hand form of (1.8) can be conveniently used for the optimization of $Q_{ij}$, as it has $Q_{ij}$ outside of the matrix product. Let the SVD of ${P'_i}^T T T^T P'_j$ be

$${P'_i}^T T T^T P'_j = Q_{ij} \Lambda Q_{ji}^T \tag{1.11}$$

where $\Lambda$ is the singular value matrix and $Q_{ij}, Q_{ji}$ are orthogonal rotation matrices. Note that the matrices which are singular-value decomposed have only $d^2$ elements.
Computation of T. The optimal discriminant transformation matrix $T$ is computed for given $P'_i$ and $Q_{ij}$ by using the definition of $M_{ij}$ in the right-hand form of (1.8) and (1.9). With $T$ on the outside of the matrix product $M_{ij}$, it is convenient to solve for. The discriminative function is found by

$$T = \arg\max_T\ \mathrm{tr}(T^T S_b T)\,/\,\mathrm{tr}(T^T S_w T) \tag{1.12}$$

$$S_b = \sum_{i=1}^{m} \sum_{l \in B_i} (P'_l Q_{li} - P'_i Q_{il})(P'_l Q_{li} - P'_i Q_{il})^T,$$

$$S_w = \sum_{i=1}^{m} \sum_{k \in W_i} (P'_k Q_{ki} - P'_i Q_{ik})(P'_k Q_{ki} - P'_i Q_{ik})^T,$$

where $B_i = \{j \mid X_j \notin C_i\}$ and $W_i = \{j \mid X_j \in C_i\}$. Note that no loss of generality is incurred from (1.9), as

$$A^T B = I - \tfrac{1}{2}(A - B)^T (A - B),$$

where $A = T^T P'_i Q_{ij}$ and $B = T^T P'_j Q_{ji}$. The solution $\{t_i\}_{i=1}^{n}$ is obtained by solving the generalized eigenvalue problem $S_b t = \lambda S_w t$. When $S_w$ is non-singular, the optimal $T$ is computed by eigen-decomposition of $S_w^{-1} S_b$. Note also that the proposed learning can avoid a singular case of $S_w$ by pre-applying PCA to the data, similarly to the Fisherface method [10], and can be sped up by using a small number of nearest neighbouring sets in $B_i, W_i$, similarly to [18]. Canonical correlation analysis for multiple sets [37] is also noteworthy here with regard to fast learning. It may help speed up the learning by reformulating the between-class and within-class scatter matrices in (1.12) via the canonical correlation analysis of multiple sets, thus avoiding the computation of the rotation matrices of every pair of image sets in the iterations.

Algorithm 1. Discriminant-analysis of Canonical Correlations (DCC)
Input: all $P_i \in \mathbb{R}^{N \times d}$
Output: $T \in \mathbb{R}^{N \times n}$
1. $T \leftarrow I_N$
2. Iterate steps 3-6:
3. For all $i$, do QR-decomposition: $T^T P_i = \Phi_i \Delta_i$, $P'_i = P_i \Delta_i^{-1}$
4. For every pair $i, j$, do SVD: ${P'_i}^T T T^T P'_j = Q_{ij} \Lambda Q_{ji}^T$
5. Compute $S_b = \sum_{i=1}^{m} \sum_{l \in B_i} (P'_l Q_{li} - P'_i Q_{il})(P'_l Q_{li} - P'_i Q_{il})^T$ and
   $S_w = \sum_{i=1}^{m} \sum_{k \in W_i} (P'_k Q_{ki} - P'_i Q_{ik})(P'_k Q_{ki} - P'_i Q_{ik})^T$
6. Compute the eigenvectors $\{t_i\}_{i=1}^{N}$ of $S_w^{-1} S_b$, $T \leftarrow [t_1, ..., t_N]$
7. End
8. $T \leftarrow [t_1, ..., t_n]$

Figure 1.5: Proposed iterative algorithm for finding T, which maximizes class separation in terms of canonical correlations.
With the identity matrix $I \in \mathbb{R}^{N \times N}$ as the initial value of $T$, the algorithm is iterated until it converges to a stable point. Pseudo-code for the learning is given in Algorithm 1. Once a $T$ that maximizes the canonical correlations of within-class sets and minimizes those of between-class sets in the training data is found, a comparison of any two novel sets is achieved by transforming them by $T$ and then computing canonical correlations (see (1.7)).
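A compact NumPy sketch of Algorithm 1 is given below. It is a hedged reading of the pseudo-code above, not the authors' released code: the small ridge added to $S_w$ stands in for the PCA pre-processing suggested for the singular case, and all names are illustrative:

```python
import numpy as np
from scipy.linalg import eigh

def dcc_learn(P, labels, n, n_iter=5, ridge=1e-6):
    """Sketch of Algorithm 1 (DCC). P: list of m orthonormal N x d basis
    matrices, labels: class of each set. Returns the N x n transform T."""
    N, m = P[0].shape[0], len(P)
    T = np.eye(N)                                   # step 1
    for _ in range(n_iter):                         # step 2
        # Step 3: P'_i = P_i Delta_i^{-1} via QR of T^T P_i
        Pn = [Pi @ np.linalg.inv(np.linalg.qr(T.T @ Pi)[1]) for Pi in P]
        Sb, Sw = np.zeros((N, N)), np.zeros((N, N))
        G = T @ T.T
        for i in range(m):
            for j in range(m):
                if i == j:
                    continue
                # Step 4: SVD of P'_i^T T T^T P'_j gives Q_ij and Q_ji
                Qij, _, QjiT = np.linalg.svd(Pn[i].T @ G @ Pn[j])
                D = Pn[j] @ QjiT.T - Pn[i] @ Qij
                # Step 5: accumulate within/between-class scatters
                if labels[i] == labels[j]:
                    Sw += D @ D.T
                else:
                    Sb += D @ D.T
        # Step 6: eigenvectors of Sw^{-1} Sb (ridge guards a singular Sw)
        _, evecs = eigh(Sb, Sw + ridge * np.eye(N))
        T = evecs[:, ::-1]                          # descending eigenvalues
    return T[:, :n]                                 # step 8
```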
1.3.3 Discussion about Convergence
Although we do not provide a proof of convergence or uniqueness of the proposed optimization process, its convergence to a global maximum was confirmed experimentally. See Figure 1.6 for examples of the iterative learning, each using a different training data set. The value of the objective function $J$ in all cases becomes stable after the first few iterations, starting with the initial value $T = I$. This fast and stable convergence is very favorable for keeping the learning cost low. Furthermore, as shown at the bottom right of Figure 1.6, it was observed that the proposed algorithm converged to the same point irrespective of the initial value of $T$. These results are indicative of the defined criterion being a quadratic convex function with respect to the joint set of variables as well as each individual variable, as argued in [32, 33].
For all of the experiments in Section 1.5, the number of iterations was fixed at 5. The proposed learning took about 50 seconds for the face experiments on a Pentium IV PC using non-optimized Matlab code, while the OSM/CMSM methods took around 5 seconds. Note the learning is performed once, in an off-line manner. On-line matching by the three recognition methods is highly time-efficient. See the experimental section for more information about the time complexity of the methods.

Figure 1.6: Convergence characteristics of the optimization. The cost $J$ for a given training set is shown as a function of the number of iterations. The bottom right shows the convergence to a unique maximum with different random initializations of $T$.
1.4 Alternative Methods of Discriminative Canonical Correlations for Set Classification
1.4.1 Orthogonal Subspace Method (OSM)
Orthogonality of two subspaces means that any vector of one subspace is orthogonal to
any vector of the other subspace [31]. This requirement is equivalent to that of each basis
vector of one subspace being orthogonal to each basis vector of the other. When recalling
that canonical correlations are defined as maximal correlations between any two vectors
of two subspaces, as given in (1.3), it is clear that the canonical correlations of any two orthogonal subspaces are zero. Thus, measuring canonical correlations of class-specific orthogonal subspaces might serve as a basis for classifying image sets.
Let us assume that the subspaces of the between-class sets $B_i = \{j \mid X_j \notin C_i\}$ of a given data set $X_i$ are orthogonal to the subspace of the set $X_i$. If the subspaces are orthogonal, all canonical correlations of those subspaces would also be zero, as

$$P_i^T P_{l \in B_i} = O \in \mathbb{R}^{d \times d} \ \Rightarrow\ \mathrm{trace}(Q_{il}^T P_i^T P_l Q_{li}) = 0 \tag{1.13}$$

where $O$ is a zero matrix and $P_i$ is a basis matrix of the set $X_i$. The classical orthogonal subspace method (OSM) [31] has been developed as a method designed to obtain class-specific orthogonal subspaces. The OSM finds the subspace, represented by the basis matrix denoted $P_0$, in which data sets of different classes are orthogonal. See Appendix .0.2 for the details of the OSM solution. By orthogonalizing the subspaces of between-class sets, the discrimination of image sets in terms of canonical correlations is achieved.
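For illustration only, one classical way to realize such orthogonalization is a Fukunaga-Koontz-style whitening of the total projection matrix. This sketch is an assumption on our part, since the exact OSM construction lives in Appendix .0.2, which is not reproduced in this section:

```python
import numpy as np

def osm_transform(P_list, eps=1e-10):
    # Whiten G = sum_i P_i P_i^T so that, after applying P0^T, the subspaces
    # of different classes become as close to orthogonal as the data allow.
    G = sum(Pi @ Pi.T for Pi in P_list)
    evals, V = np.linalg.eigh(G)
    keep = evals > eps                        # discard the null space of G
    P0 = V[:, keep] * (evals[keep] ** -0.5)   # columns scaled by lambda^{-1/2}
    return P0

# The transformed class bases P0^T P_i should be re-orthonormalized (e.g. by
# QR) before measuring canonical correlations between them.
```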
Comparison with the Proposed Solution, DCC. Note that the orthogonality of subspaces is a restrictive condition, at least when the number of classes is large. It is often the case that the OSM subspaces represented by $P_i$ and $P_{l \in B_i}$ are correlated. If $P_i^T P_l$ has non-zero values, the canonical correlations could be much greater than zero, as

$$q_{il}^T P_i^T P_l q_{li} \gg 0 \tag{1.14}$$

where $q$ is a column of the rotation matrix $Q$ in the definition of canonical correlations. Generally, the problem of minimizing the correlations of the basis matrices, $P_i^T P_l$, in OSM is not equivalent to the proposed problem formulation, where the canonical correlations $q_{il}^T P_i^T P_l q_{li}$ are minimized.
Note again that the principal components of $P$ are sensitive to data changes, whereas the canonical vectors $PQ$ are consistent, as shown in Figure 1.3. Thus, the proposed optimization by canonical correlations is expected to be more robust to possible data changes than the OSM solution based on $P$. Moreover, the orthogonal subspace method does not ex-
plicitly attempt to maximize canonical correlations of the within-class sets. It combines all
examples of a class together. See Appendix .0.2 for details. The better accuracy of DCC
over OSM was evident when the number of training classes was large or the conditions for
obtaining the training and test data were different (See the experimental section).
1.4.2 Constrained Mutual Subspace Method (CMSM)
It is worth noting that CMSM [21, 22] can be seen to be closely related to the orthog-
onal subspace method. For the details of CMSM, refer to Appendix .0.3. CMSM finds
the constrained subspace where the total projection operators have small variances. Each
class is represented by a subspace which maximally represents the class data variances,
then the class subspace is projected into the constrained subspace. The projected data
subspace compromises the maximum representation of each class and the minimum repre-
sentation of a mixture of all the other classes. This is similar in concept to the orthogonal
subspace method explained in Appendix .0.2. Both methods try to minimize the correla-
tion of between-class subspaces defined by $P_i^T P_{l \in B_i}$. However, the dimensionality of the
constrained subspace of CMSM should be optimised for each application. If the dimen-
sionality is too low, the constrained subspace will be a null space. In the opposite case,
the constrained subspace simply retains all the energy of the original data and thus can
not play a role as a discriminant function. This dependence of CMSM on the parame-
ter (dimensionality) selection makes it rather empirical. In contrast, there is no need to
choose any subspace from the discriminative space represented by the rotation matrix $P_0$ in the orthogonal subspace method. A full dimension of the matrix can simply be adopted.
Note the proposed method, DCC, also exhibited insensitivity to dimensionality, thus being
practically, as well as theoretically, very appealing (See the experimental section).
1.5 Experimental Results and Discussion
The proposed method (code available at http://mi.eng.cam.ac.uk/tkk22) is evaluated on various object and object category recognition problems: face image sets with arbitrary motion captured under different illuminations, image sets of five hundred general objects taken at different views, and 8 general object categories, each of which contains several different objects. The task in all of the experiments is to classify an unknown set of vectors to one of the training classes, each also represented by vector sets.
1.5.1 Database of Face Image Sets
We have collected a database called the Cambridge-Toshiba Face Video Database, with 100 individuals of varying age and ethnicity and equally represented genders; examples are shown in Figure 1.7. For each person, 14 (7 illuminations × two recordings) video sequences of
the person in arbitrary motion were collected. Each sequence was recorded in a different
illumination setting for 10s at 10fps and at 320×240 pixel resolution. See Figure 1.8 for
samples from an original image sequence and seven different lightings. Following auto-
matic localization using a cascaded face detector [27] and cropping to a uniform scale of
20×20 pixels, images of faces were histogram equalized. Note that the face localization
was performed automatically on the images of uncontrolled quality. Thus it was not as
accurate as any conventional face registration with either manual or automatic eye posi-
tions performed on high quality face images. Our experimental conditions are closer to the
conditions given for typical surveillance systems.
1.5.2 Comparative Methods and Parameter Setting
We compared the performance of:

- the KL-Divergence algorithm (KLD) [6] as a representative parametric model-based method,
- non-parametric sample-based methods such as k-Nearest Neighbour (kNN) and Hausdorff Distance ($d(S_1, S_2) = \min_{x_1 \in S_1} \max_{x_2 \in S_2} d(x_1, x_2)$) [9] of images transformed by (i) PCA and (ii) LDA [10] subspaces, which are estimated from training data similarly to [8],
- Nearest Neighbour (NN) by FaceIt (v.5.0), the commercial face recognition system from Identix, which ranked top overall in the Face Recognition Vendor Tests 2000 and 2002 [34, 35],
- the Mutual Subspace Method (MSM) [19], which is equivalent to a simple aggregation of canonical correlations,
- Constrained MSM (CMSM) [21, 22], used in a state-of-the-art commercial system called FacePass [29],
- the Orthogonal Subspace Method (OSM) [31],
- and the proposed iterative discriminative learning, DCC.

Figure 1.7: Examples of the Face Video Database. The data set contains 100 face classes of varying age, ethnicity, and gender. Each class has about 1400 images from the 14 image sequences captured under 7 different lighting conditions.

Figure 1.8: Example images of the face data sets. (a) Frames of a typical face video sequence with automatic face detection. (b) Face prototypes of the 7 different illuminations.
To compare different algorithms, important parameters of each method were adjusted
and the optimal ones in terms of test identification rates were selected. In KLD, 96% of data
energy was explained by the principal subspace of training data used [6]. In kNN methods,
the dimension of PCA subspace was chosen to be 150, which represents more than 98%
of training data energy (Note that removing the first 3 components improved the accuracy
in the face recognition experiment as similarly observed in [10]). The best dimension of
LDA subspace was also found at around 150. The number of nearest neighbors used was
chosen from one to ten. In MSM/CMSM/OSM/DCC, the dimension of the linear subspace
of each image set represented 98% of data energy of the set, which was around 10. PCA
was performed for each set in the MSM/CMSM/DCC methods.
Dimension Selection of the Discriminative Subspaces in CMSM/OSM/DCC. As shown
in Figure 1.9 (a), CMSM exhibited a sharp peak in the relationship between accuracy and the dimensionality of the constrained subspace, whereas the proposed method, DCC, provided constant identification rates regardless of the dimensionality of $T$ beyond a certain point.
The best dimension of the constrained subspace of CMSM was found to be at around 360
and was fixed. For DCC, we fixed the dimension at 150 for all experiments (the full dimen-
sion can also be conveniently exploited without any feature selection). The full dimension
was also used for the rotation matrix $P_0$ in OSM. Note that the proposed method DCC
and OSM do not require any elaborate feature selection and this behaviour of DCC/OSM
is highly attractive from the practical point of view, compared to CMSM. Without feature
selection the accuracy of CMSM in the full space drops dramatically to the level equiva-
lent to that of MSM, which is a simple aggregation of canonical correlations without any
discriminative transformation.
Number of Canonical Correlations. Figure 1.9 (b) shows the accuracy of MSM/DCC
according to the number of canonical correlations used. Basically, this parameter does not
affect the accuracy of the methods as much as the dimension of the discriminative subspace,
as shown in Figure 1.9 (a). The proposed method, DCC, was shown to be less sensitive to
this parameter than MSM. The number of canonical correlations was fixed to be the same
(i.e. this was set as the dimension of linear subspaces of image sets) for all the methods,
MSM/CMSM/OSM/DCC.
1.5.3 Face Recognition Experiments
Training of all the algorithms was performed with data sequences acquired in a single il-
lumination setting and testing with a single other setting. We used 18 randomly selected
training/test combinations of the sequences for reporting identification rates. The perfor-
mance of the evaluated recognition algorithms is shown in Figure 1.10 and Table 1.1. The 18 experiments were divided into two parts according to the degree of difference between the training and the test data, as measured by the KL-Divergence between the training and test data. Figure 1.10 shows the cumulative recognition rates for the averaged results of all 18 experiments and Table 1.1 shows the results separately for the first (easier) and the second (more difficult) parts of the experiments.

Figure 1.9: (a) The effect of the dimensionality of the discriminative subspace on the proposed iterative method (DCC) and CMSM. The accuracy of CMSM at 400 is equivalent to that of MSM, a simple aggregation of canonical correlations. (b) The effect of the number of canonical correlations on DCC and MSM.

Figure 1.10: Cumulative recognition plot for the MSM/kNN-LDA/CMSM/OSM/DCC methods.
In this experiment, all training samples of a class were drawn from a single video se-
quence of arbitrary head movement, so they were randomly divided into two sets for the
within-class sets in the proposed learning. Note that the proposed method with this random
partition still worked well. The test recognition rates changed by less than 1-2 % for the dif-
ferent trials of random partitioning. If samples of a class can be partitioned according to the
data semantics, the concept of the within-class sets would be more useful and reasonable,
as is the case in the experiments that follow.
Table 1.1: Evaluation results. The mean and standard deviation of recognition rates of different methods, shown separately for the first (easier) and the second (more difficult) parts of the experiments.

            KLD        HD-PCA     1NN-PCA    10NN-PCA   FaceIt S/W
1st half    0.49±0.14  0.60±0.07  0.95±0.03  0.96±0.03  0.90±0.09
2nd half    0.24±0.13  0.47±0.09  0.71±0.20  0.71±0.21  0.86±0.05

            10NN-LDA   MSM        CMSM       OSM        DCC
1st half    0.98±0.01  0.94±0.03  0.98±0.01  0.98±0.01  0.98±0.01
2nd half    0.87±0.07  0.91±0.02  0.93±0.06  0.94±0.06  0.95±0.04

In Table 1.1, most of the methods generally had lower recognition rates for the experiments with larger KL-Divergence between the training and test data. The KLD method achieved by far the worst recognition rate. Considering that the illumination conditions varied across the data and that the face motion was largely unconstrained, the distribution of
within-class face patterns was very broad, making this result unsurprising. In the methods
of non-parametric sample-based matching, the Hausdorff-Distance (HD) measure provided
far poorer results than the k-Nearest Neighbors (kNN) methods defined in the PCA sub-
space. 10NN-PCA yielded the best accuracy of the sample-based methods defined in the
PCA subspace, which is worse than MSM by 8.6% on average. Its performance greatly var-
ied across the experiments. Note that MSM showed robust performance with a large margin
over kNN-PCA method under the different experimental conditions. The improvement of
MSM over both KLD and HD/kNN-PCA methods was very impressive. The benefits of us-
ing canonical correlations over both classical approaches for set classification, which have
been explained throughout the previous sections, were confirmed.
The commercial face recognition software FaceIt (v.5.0) yielded performance which, on average, lies between those of the kNN-PCA and kNN-LDA methods. Although the
NN method using FaceIt is based on individual sample matching, it delivered more robust
performance for the data changes (the difference in accuracy between the first half and
the second half is not as large as those of kNN-PCA/LDA methods). This is reasonable,
considering that FaceIt was trained independently with the training images used for other
methods.
Table 1.1 also gives a comparison for the methods combined with discriminative learn-
ing. kNN-LDA yielded a big improvement over kNN-PCA but the accuracy of the method
again greatly varied across the experiments. Note that 10NN-LDA outperformed MSM for
similar conditions between the training and test sets, but it became noticeably inferior as
the conditions changed. It delivered similar accuracy to MSM on average, which is also
shown in Figure 1.10. The proposed method DCC, CMSM and OSM constantly provided
a significant improvement over both MSM and the kNN-LDA method, as shown in Table 1.1 as
well as in Figure 1.10.
Further Comparison of DCC, OSM and CMSM. Note that CMSM/OSM can be considered as measuring correlation between subspaces defined by the basis matrix $P$ in a simple way, which is different from the canonical correlations defined by $PQ$. In spite of this difference, the accuracy of CMSM/OSM was impressive in this experiment. As explained above, when an ideal solution of CMSM/OSM exists and $Q$ only provides a rotation within the subspace, the solution of CMSM/OSM can be close to that of the proposed method DCC. However, if class subspaces cannot be made orthogonal to each other, then the direct optimization of canonical correlations offered by DCC is preferred. The novel data space $PQ$ is robust to environmental changes, as shown in Figure 1.3, making the solution of DCC, which is obtained by directly optimizing the $PQ$ space, also robust. Note that the proposed method was better than CMSM/OSM for the second half of the experiments in Table 1.1 (although the margin there is small).

Figure 1.11: Confusion matrices for the MSM/CMSM/OSM/DCC methods: (a) MSM, (b) CMSM, (c) OSM, (d) DCC. The diagonal and off-diagonal values in the DCC confusion matrix can be distinguished much better.
The differences of the three methods are clearly apparent from the associated confusion
matrices of the training data. We trained the three methods using both training and test sets
of the worst experimental case for the methods (See the last two of Figure 1.8 (b)), and
compared their confusion matrices of the total class data with that of MSM, as shown in
Figure 1.11. Both OSM and CMSM considerably improved the ability of class discrimi-
nation over MSM, but they were still far from optimal compared with DCC for the given
data. As discussed above, both the proposed method, DCC, and OSM are preferable to
CMSM as they do not involve the selection of dimensionality of the discriminative sub-
spaces. While the best dimension for CMSM had to be identified with reference to the test
results, the full dimension of the discriminative space can simply be adopted for any new
test data in the DCC and OSM methods.
We designed another face experiment with more face image sets from the Cambridge-Toshiba face video database. The database involves two sets of videos acquired at different times, each of which consists of seven different illumination sequences for each person. We used one time set for training and the other set for testing, thus having more variation between the training and testing data (see Figure 1.12 for an example of the two different time sets acquired in the same illumination). Note that the training and testing sets in the previous experimental setting were drawn from the same time set. In this experiment, using a single illumination set for training, the full 49 combinations of the different lighting settings were exploited. We also increased the number of image sets per class for training.
Figure 1.12: Example of the two time sets (top and bottom) of a person acquired in a single
lighting setting. They contain significant variations in pose and expression.
Figure 1.13: Recognition rates of the CMSM/OSM/DCC methods when using single, double and triple image sets in training.
We randomly drew a combination of different illumination sequences for training and used
all 7 illumination sequences for testing. 10-fold cross validation was performed for these
experiments. Figure 1.13 shows the mean and standard deviations of recognition rates of
all experiments. The proposed method significantly outperformed OSM/CMSM methods
when the test sets were very different from the training sets. These results are consistent
with those of the methods in the 2nd part of the experiment in Table 1.1 (but the difference
is much clearer here). Overall, all three methods improved their accuracy by using more
image sets in training.
Matching complexity. The complexity of the methods based on canonical correlations (MSM/CMSM/OSM/DCC), $O(d^3)$, is much lower than that of the sample-based matching methods (kNN-PCA/LDA), $O(m^2 n)$, where $d$ is the subspace dimension of each set, $m$ is the number of samples in each set and $n$ is the dimensionality of the feature vectors, since $d \ll m, n$. In the face experiments, the unit matching time for comparing two image sets which contain about 100 images each is 0.004 seconds for the canonical-correlation-based methods and 1.1 seconds for the kNN method.
1.5.4 Experiment on Large Scale General Object Database
The ALOI database [28], with 500 general object categories taken at different viewing angles, provides another experimental data set for the proposed method. Object images were segmented from the simple background and scaled to 20×20 pixel size. A training set and five test sets were set up with different viewing angles of the objects, as shown in Figure 1.14 (a) and (b). Note that the pose of all the images in the test sets differed by at least 5 degrees from every sample of the training set. The methods of MSM, kNN-LDA, CMSM and OSM were compared with the proposed method in terms of identification rate. The parameters were selected in the same way as in the face recognition experiment. The dimension of the linear subspace of each image set was fixed to 5, representing more than 98% of the data energy in the MSM/CMSM/OSM/DCC methods. The best number of nearest neighbors in the kNN-LDA method was found to be five.

Figure 1.14: ALOI experiment. (a) The training set consists of 18 images taken at 10-degree intervals. (b) Two test sets are shown. Each test set contains 9 images at 10-degree intervals, different from the training set.

Figure 1.15: Identification rates for the 5 different test sets. The object viewing angles of the test sets differ from those of the training set to a varying extent.
Judging from Figure 1.15 and Figure 1.16, kNN-LDA yielded better accuracy than
MSM in all the cases. This contrasted with the findings in the face recognition experi-
ment. This may have been caused by the somewhat artificial experimental setting. The
nearest neighbours of the training and test set differed only slightly due to the five degree
pose difference. Please note that the two sets had no changes in lighting and had accu-
rate localization of the objects as well. Further note that the accuracy of MSM could be
improved by using only the first canonical correlation, similarly to the results shown in Fig-
ure 1.9 (b). Here again, CMSM, OSM and the proposed method DCC were substantially
superior to MSM. Overall, the accuracy of CMSM/OSM was similar to that of the kNN-LDA method, as shown in Figure 1.16. The proposed iterative method, DCC, constantly outperformed all the other methods, including OSM/CMSM as well as kNN-LDA. Please note this experiment involved a larger number of classes compared with the face experiments. Furthermore, the sets of images of the training classes had quite different pose distributions from those of the test sets. The accuracy of the CMSM/OSM methods might be degraded by all these factors, whereas the proposed method remains robust.

Figure 1.16: Cumulative recognition rates of the MSM/kNN-LDA/CMSM/OSM/DCC methods for the ALOI experiment.
1.5.5 Object Category Recognition using ETH80 database
An interesting problem of object category recognition was addressed using the public ETH80 database. As shown in Figure 1.17, there are 8 categories, each containing 10 objects, with 41 images of different views per object. More details about the database can be found in [36]. We randomly partitioned the 10 objects into two sets of five objects for training and testing. In Experiment 1, we used all 41 view images of the objects. In Experiment 2, we used all 41 views for training but a random subset of 15 view images for testing. 10-fold cross-validation was carried out for both experiments. Parameters such as the dimension of the linear subspaces, the number of principal angles and the number of nearest neighbors were selected as in the previous experiments. The dimension of the constrained subspace of CMSM was again optimised for its best accuracy.
From Table 1.2, it is worth noting that the accuracy of the kNN-PCA method is similar (but slightly inferior) to that of the PCA method reported in [36]. Note that we used only 5 objects per category, in contrast to [36], where 9 objects were used for training. The recognition rates for individual object categories also showed behaviour similar to that of [36].
As shown in Table 1.2, the kNN methods were much inferior to the methods based
on canonical correlations. The sample-based matching method was very sensitive to the
variations in different objects of the same categories, failing in object categorization. The
methods using canonical correlations provided much more accurate results. The proposed
method (DCC) delivered the best accuracy over all tested methods. The improvement of
Figure 1.17: The object category database (ETH80) contains (a) 8 different object categories and (b) 10 different objects for each category.
Table 1.2: Evaluation results of object categorization. The mean recognition rate and its standard deviation for all experiments.

        kNN-PCA      kNN-LDA      MSM          CMSM         OSM          DCC
exp.1   0.762±0.21   0.752±0.17   0.865±0.13   0.897±0.10   0.905±0.09   0.917±0.09
exp.2   -            -            -            0.852±0.21   0.865±0.18   0.912±0.13
The improvement of DCC over CMSM/OSM was greater in the second experiment, where only a subset of the images of each object was used in testing, making the test sets very different from the training sets. The major principal components of the image sets are highly sensitive to variations in pose: the accuracy of the CMSM/OSM methods decreased considerably in the presence of this variation, while the DCC method maintained almost the same accuracy.
1.6 Conclusions
A novel discriminative learning framework has been proposed for set classification based on canonical correlations. It is based on iterative learning, which is both theoretically and practically appealing. The proposed method has been evaluated on various object and object category recognition problems. The new technique facilitates effective discriminative learning over sets and exhibits an impressive set classification accuracy. It significantly outperformed the KLD method, representing parametric distribution-based matching, and the kNN methods in both PCA and LDA subspaces, as examples of non-parametric sample-based matching. It also largely outperformed the method based on a simple aggregation of canonical correlations.
The proposed DCC method not only achieved better accuracy but also possesses many good properties compared with the CMSM/OSM methods. CMSM had to be optimised a posteriori by feature selection, whereas DCC does not need any feature selection. DCC exhibited robust performance over a wide range of dimensions of the discriminative subspace, as well as of the number of canonical correlations used. Although CMSM/OSM delivered accuracy comparable to DCC in particular cases, in general they lagged behind the proposed method.
The canonical-correlation-based methods, including the proposed method, were also shown to be highly time-efficient in matching, thus offering an attractive tool for recognition involving a large-scale database.
.0.1 Equivalence of SVD solution to Mutual Subspace Method [19]
In the Mutual Subspace Method (MSM), canonical correlations are defined as the eigenvalues of the matrix $P_1 P_1^T P_2 P_2^T P_1 P_1^T \in \mathbb{R}^{N \times N}$, where $P_i \in \mathbb{R}^{N \times d}$ is a basis matrix of data set $i$. The SVD solution in (1.4) for computing canonical correlations is symmetric. That is,

$$Q_{12}^T P_1^T P_2 Q_{21} = \Lambda, \qquad Q_{21}^T P_2^T P_1 Q_{12} = \Lambda.$$

By multiplying the above two equations, we obtain

$$(Q_{12}^T P_1^T P_2 Q_{21})(Q_{21}^T P_2^T P_1 Q_{12}) = \Lambda^2,$$
$$Q_{12}^T P_1^T P_2 P_2^T P_1 Q_{12} = \Lambda^2,$$
$$P_1 P_1^T P_2 P_2^T P_1 P_1^T = P_1 Q_{12} \Lambda^2 Q_{12}^T P_1^T,$$

as $Q_{12} Q_{12}^T = Q_{21} Q_{21}^T = I$. Thus $P_1 Q_{12}$ and $\Lambda^2$ are, respectively, the eigenvector and eigenvalue matrices of the matrix $P_1 P_1^T P_2 P_2^T P_1 P_1^T$. That is, the canonical correlations of MSM are simply the squares of the canonical correlations of the SVD solution. Please note that the dimension of the matrix $P_1^T P_2 \in \mathbb{R}^{d \times d}$ is low compared with that of $P_1 P_1^T P_2 P_2^T P_1 P_1^T \in \mathbb{R}^{N \times N}$.
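This equivalence is straightforward to verify numerically. The following sketch (our illustration, not part of the original text) draws two random orthonormal bases and checks that the leading eigenvalues of $P_1 P_1^T P_2 P_2^T P_1 P_1^T$ coincide with the squared singular values of the much smaller matrix $P_1^T P_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 5

# Two random orthonormal basis matrices P1, P2 of size N x d.
P1, _ = np.linalg.qr(rng.standard_normal((N, d)))
P2, _ = np.linalg.qr(rng.standard_normal((N, d)))

# SVD solution: canonical correlations = singular values of P1^T P2 (d x d).
cc = np.linalg.svd(P1.T @ P2, compute_uv=False)

# MSM definition: eigenvalues of P1 P1^T P2 P2^T P1 P1^T (N x N, symmetric).
M = P1 @ P1.T @ P2 @ P2.T @ P1 @ P1.T
eig = np.sort(np.linalg.eigvalsh(M))[::-1][:d]

# The MSM eigenvalues are the squares of the SVD canonical correlations.
assert np.allclose(eig, np.sort(cc ** 2)[::-1], atol=1e-10)
```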
.0.2 OSM solution
Denote the correlation matrices of the $C$ classes by $C_1, \ldots, C_C$ and the respective a priori probabilities by $\pi_1, \ldots, \pi_C$ [31]. Then the matrix $C_0 = \sum_{i=1}^{C} \pi_i C_i$ is the correlation matrix of the mixture of all the classes. The matrix $C_0$ can be diagonalized by $B C_0 B^T = \Lambda$. Denoting $P_0 = \Lambda^{-1/2} B$, we have $P_0 C_0 P_0^T = I$. Then,

$$\pi_1 P_0 C_1 P_0^T + \cdots + \pi_C P_0 C_C P_0^T = I.$$

This means that the matrices $\pi_i P_0 C_i P_0^T$ and $\sum_{j \neq i} \pi_j P_0 C_j P_0^T$ have the same eigenvectors, but the eigenvalues $\lambda_k^i$ of $\pi_i P_0 C_i P_0^T$ and $\bar{\lambda}_k^i$ of $\sum_{j \neq i} \pi_j P_0 C_j P_0^T$ are related by $\bar{\lambda}_k^i = 1 - \lambda_k^i$. That is, in the space rotated by the matrix $P_0$, the most important basis vectors of class $i$, which are the eigenvectors of $\pi_i P_0 C_i P_0^T$ corresponding to the largest eigenvalues, are at the same time the least significant basis vectors for the ensemble of the remaining classes. Let $P_i$ be such an eigenvector matrix, so that

$$\pi_i P_i^T P_0 C_i P_0^T P_i = \Lambda_i.$$

Then,

$$\sum_{j \neq i} \pi_j P_i^T P_0 C_j P_0^T P_i = I - \Lambda_i.$$

Since every matrix $\pi_j P_0 C_j P_0^T$ for $j \neq i$ is positive semidefinite, $\pi_j P_i^T P_0 C_j P_0^T P_i$ should be a diagonal matrix having elements smaller than $1 - \lambda_i$. If we denote the eigendecomposition of the $j$-th class by $\pi_j P_0 C_j P_0^T \approx P_j \Lambda_j P_j^T$, the matrix $P_i^T P_j \Lambda_j P_j^T P_i$ then has small diagonal elements. Accordingly, $P_i^T P_j$ should have all its elements close to zero. In the ideal case, when $\pi_i P_0 C_i P_0^T$ has eigenvalues exactly equal to one, the matrix $P_i^T P_j$ would be a zero matrix for all $j \neq i$. The two subspaces defined by $P_i, P_j$ are then called orthogonal subspaces; that is, every column of $P_i$ is perpendicular to every column of $P_j$.

Note that the OSM method does not exploit the concept of multiple sets in a single class (or within-class sets). The method assumes that all data vectors of a single class $i$ are represented by a single set $P_i$. From the above, the matrix $P_0$ can be regarded as an alternative discriminative space in which the canonical correlations of between-class sets are minimized. Note also that the matrix $P_0$ is conceptually a rotation matrix and is therefore square.
.0.3 Constrained Mutual Subspace Method [21]
The constrained subspace $D$ is spanned by the $N_d$ eigenvectors $\mathbf{d}$ of the matrix $G = \sum_{i=1}^{C} P_i P_i^T$ s.t.

$$G \mathbf{d} = \lambda \mathbf{d},$$

where $C$ is the number of training classes, $P_i$ is a basis matrix of the original $i$-th class data, and the eigenvectors $\mathbf{d}$ correspond to the $N_d$ smallest eigenvalues. The optimal dimension $N_d$ of the constrained subspace is set experimentally. Each subspace $P_i$ is projected onto $D$, and the orthogonal components of the projected subspace, normalised to unit length, are used as inputs for computing canonical correlations by the method of MSM [19].
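A minimal sketch of this procedure follows (our illustration; the function names and the choice of QR for orthonormalization are assumptions):

```python
import numpy as np

def constrained_subspace(bases, n_d):
    """Basis of the constrained subspace D: the eigenvectors of
    G = sum_i P_i P_i^T with the n_d smallest eigenvalues."""
    G = sum(P @ P.T for P in bases)
    _, V = np.linalg.eigh(G)       # eigenvalues in ascending order
    return V[:, :n_d]              # N x n_d basis of D

def project_onto_constrained(P, D):
    """Project a class subspace P onto D and orthonormalize the result,
    giving the input basis for MSM in the constrained subspace."""
    proj = D.T @ P                 # coordinates of P's columns in D
    Q, _ = np.linalg.qr(proj)      # orthonormal, unit-length components
    return Q
```

Canonical correlations between two classes are then computed from the projected bases, e.g. as the singular values of $Q_i^T Q_j$, as in MSM.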
Bibliography
[1] K. Lee, M. Yang, and D. Kriegman. Video-based face recognition using probabilistic appearance manifolds. Proc. Computer Vision and Pattern Recognition, pp. 313–320, 2003.
[2] S. Zhou, V. Krueger, and R. Chellappa. Probabilistic recognition of human faces from video. Computer Vision and Image Understanding, vol. 91, no. 1, pp. 214–245, 2003.
[3] Y. Li, S. Gong, and H. Liddell. Recognising the dynamics of faces across multiple
views. Proc. British Machine Vision Conference, pp. 242–251, 2000.
[4] X. Liu and T. Chen. Video-Based Face Recognition Using Adaptive Hidden Markov
Models. Proc. Computer Vision and Pattern Recognition, pp. 340–345, 2003.
[5] A. Hadid and M. Pietikainen. From Still Image to Video-Based Face Recognition:
An Experimental Analysis. Sixth IEEE International Conference on Automatic Face
and Gesture Recognition, pp. 813–818, 2004.
[6] G. Shakhnarovich, J. W. Fisher, and T. Darrel. Face recognition from long-term ob-
servations. Proc. European Conf. Computer Vision, pp. 851–868, 2002.
[7] O. Arandjelović, G. Shakhnarovich, J. Fisher, R. Cipolla, and T. Darrell. Face recognition with image sets using manifold density divergence. Proc. Computer Vision and Pattern Recognition, pp. 581–588, 2005.
[8] S. Satoh. Comparative Evaluation of Face Sequence Matching for Content-based Video Access. Proc. Int'l Conf. on Automatic Face and Gesture Recognition, pp. 163–168, 2000.
[9] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley & Sons, Inc., New York, 2nd edition, 2000.
[10] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman. Eigenfaces vs. Fisherfaces:
Recognition Using Class Specific Linear Projection. IEEE Trans. Pattern Analysis
and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.
[11] W.Y. Zhao, R. Chellappa, and A. Krishnaswamy. Discriminant Analysis of Principal
Components for Face Recognition. Proc. Int’l Conf. on Automatic Face and Gesture
Recognition, pp. 336–341, 1998.
[12] M.T. Sadeghi and J.V. Kittler. Decision Making in the LDA Space: Generalised Gra-
dient Direction Metric. Proc. Int’l Conf. on Automatic Face and Gesture Recognition,
pp. 248–253, 2004.
[13] X. Wang and X. Tang. Random Sampling LDA for Face Recognition. Proc. Computer
Vision and Pattern Recognition, pp. 259–265, 2004.
[14] T-K. Kim and J. Kittler. Locally Linear Discriminant Analysis for Multimodally Dis-
tributed Classes for Face Recognition with a Single Model Image. IEEE Trans. Pat-
tern Analysis and Machine Intelligence, vol. 27, no.3, pp. 318–327, 2005.
[15] T-K. Kim, O. Arandjelović, and R. Cipolla. Learning over Sets using Boosted Manifold Principal Angles (BoMPA). Proc. British Machine Vision Conference, pp. 779–788, 2005.
[16] T-K. Kim, J. Kittler and R. Cipolla, Learning Discriminative Canonical Correlations
for Object Recognition with Image Sets. Proc. European Conf. Computer Vision, pp.
251–262, 2006.
[17] M.-H. Yang. Kernel Eigenfaces vs. Kernel Fisherfaces: Face Recognition Using Ker-
nel Methods. Proc. Int’l Conf. on Automatic Face and Gesture Recognition, pp. 215–
220, 2002.
[18] M. Bressan and J. Vitria. Nonparametric discriminant analysis and nearest neighbor classification. Pattern Recognition Letters, vol. 24, no. 15, pp. 2743–2749, 2003.
[19] O. Yamaguchi, K. Fukui, and K. Maeda. Face recognition using temporal image
sequence. Proc. Int’l Conf. on Automatic Face and Gesture Recognition, pp. 318–
323, 1998.
[20] L. Wolf and A. Shashua. Learning over sets using kernel principal angles. J. Machine
Learning Research, vol. 4, no. 10, pp. 913–931, 2003.
[21] K. Fukui and O. Yamaguchi. Face recognition using multi-viewpoint patterns for
robot vision. Int’l Symp. of Robotics Research, pp. 192–201, 2003.
[22] M. Nishiyama, O. Yamaguchi, and K. Fukui. Face Recognition with the Multiple Constrained Mutual Subspace Method. Proc. of Audio- and Video-based Biometric Person Authentication, pp. 71–80, 2005.
[23] H. Hotelling. Relations between two sets of variates. Biometrika, vol. 28, no. 3/4, pp. 321–372, 1936.
[24] T. Kailath. A view of three decades of linear filtering theory. IEEE Trans. Information
Theory, vol. 20, no. 2, pp. 146–181, 1974.
[25] R. Gittins. Canonical analysis: A review with applications in ecology. Springer-
Verlag, Berlin, Germany, 1985.
[26] Å. Björck and G.H. Golub. Numerical methods for computing angles between linear subspaces. Mathematics of Computation, vol. 27, no. 123, pp. 579–594, 1973.
[27] P. Viola and M. Jones. Robust real-time face detection. Int’l J. Computer Vision, vol.
57, no. 2, pp. 137–154, 2004.
[28] J.M. Geusebroek, G.J. Burghouts, and A.W.M. Smeulders. The Amsterdam library of
object images. Int’l J. Computer Vision, vol. 61, no. 1, pp. 103–112, January, 2005.
[29] Toshiba Corporation, Facepass. http://www.toshiba.co.jp/mmlab/tech/w31e.htm.
[30] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.
[31] E. Oja. Subspace Methods of Pattern Recognition. Research Studies Press, 1983.
[32] D.D. Lee and H.S. Seung. Algorithms for Non-Negative Matrix Factorization. Advances in Neural Information Processing Systems, pp. 556–562, 2001.
[33] D.D. Lee and H.S. Seung, Learning the Parts of Objects by Non-Negative Matrix
Factorization. Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[34] P.J. Phillips, P. Grother, R.J. Micheals, D.M. Blackburn, E. Tabassi, and J.M. Bone. FRVT 2002: Evaluation Report, Mar. 2003. http://www.frvt.org/FRVT2002/.
[35] D.M. Blackburn, M. Bone, and P.J. Phillips, Facial Recognition Vendor Test 2000:
Evaluation Report, 2000.
[36] B. Leibe and B. Schiele, Analyzing appearance and contour based methods for object
categorization. Proc. Computer Vision and Pattern Recognition, pp. 409–415, 2003.
[37] J. Via, I. Santamaria, and J. Perez. Canonical Correlation Analysis (CCA) Algorithms for Multiple Data Sets: Application to Blind SIMO Equalization. 13th European Signal Processing Conference, Antalya, Turkey, 2005.
[38] D. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.