IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 22, NO. 8, AUGUST 2020 1985
Jointly Learning Kernel Representation Tensor and
Affinity Matrix for Multi-View Clustering
Yongyong Chen, Xiaolin Xiao, and Yicong Zhou, Senior Member, IEEE
Abstract—Multi-view clustering refers to the task of partitioning
numerous unlabeled multimedia data into several distinct clusters
using multiple features. In this paper, we propose a novel nonlinear
method called joint learning multi-view clustering (JLMVC) to
jointly learn kernel representation tensor and affinity matrix.
The proposed JLMVC has three advantages: (1) unlike existing
low-rank representation-based multi-view clustering methods that
learn the representation tensor and affinity matrix in two separate
steps, JLMVC jointly learns them both; (2) using the “kernel trick,”
JLMVC can handle nonlinear data structures for various real
applications; and (3) different from most existing methods that
treat representations of all views equally, JLMVC automatically
learns a reasonable weight for each view. Based on the alternating
direction method of multipliers, an effective algorithm is designed
to solve the proposed model. Extensive experiments on eight
multimedia datasets demonstrate the superiority of the proposed
JLMVC over state-of-the-art methods.
Index Terms—Multi-view clustering, low-rank tensor represen-
tation, kernel trick, affinity matrix, adaptive weight.
I. INTRODUCTION
IN MANY real-world applications, multimedia data such as
images, videos, audio, and documents, are usually repre-
sented by different features or collected from various fields
(called multi-view data) [1]–[3]. For example, in multimedia
retrieval [2], images can be represented by color, textures, and
edges. In video surveillance [3], the same scene is monitored by
multiple cameras from different viewpoints. In natural language
processing [4], documents can be translated into multiple
languages such as Chinese, English, and French. Because
multi-view data can substantially improve clustering
performance, multi-view clustering has attracted great research
interest in many fields, including the multimedia data mining,
machine learning, and pattern recognition communities [5]–[8].
Manuscript received June 5, 2019; revised September 28, 2019; accepted
October 29, 2019. Date of publication November 11, 2019; date of current
version July 24, 2020. This work was supported in part by the Science and
Technology Development Fund, Macau SAR (File no. 189/2017/A3), and in part
by the Research Committee at University of Macau under Grants
MYRG2016-00123-FST and MYRG2018-00136-FST. The associate editor coordinating the
review of this manuscript and approving it for publication was Dr. Marco Carli.
(Corresponding author: Yicong Zhou.)
Y. Chen and Y. Zhou are with the Department of Computer and Information
Science, University of Macau, Macau 999078, China (e-mail:
YongyongChen.cn@gmail.com; yicongzhou@um.edu.mo).
X. Xiao is with the School of Computer Science and Engineering, South
China University of Technology, Guangzhou 510006, China, and also with the
Department of Computer and Information Science, University of Macau, Macau
999078, China (e-mail: shellyxiaolin@gmail.com).
Color versions of one or more of the figures in this article are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2019.2952984
Multi-view features extracted from the original multimedia
data are used to partition all unlabeled multimedia
data into several distinct clusters. Numerous clustering
approaches have been proposed. Whether single-view or
multi-view, they usually follow two main steps: 1)
constructing a symmetric affinity matrix (also called similarity
matrix) to describe the pairwise relations between multimedia
data points and 2) performing the spectral clustering algorithm [9]
to obtain clustering results. The core of these methods
is the construction of the affinity matrix, so the quality
of the learned affinity matrix largely determines the clustering
performance. In the literature, the affinity matrix is constructed
from either the raw multimedia features or computed
representations [10], [11], leading to the following
three categories: 1) graph-based methods [12]–[19], 2) subspace
clustering-based methods [5]–[8], [11], [20]–[24], and 3) their
combinations [10], [25], [26]. For example, due to its simplicity
and effectiveness, k-Nearest Neighbor using cosine or heat
kernel distances [27] has become an intuitive way to construct
the affinity matrix. Following the idea that local connectivity of
multimedia data can be measured by the Euclidean distance, the
work in [12] constructed the affinity matrix by assigning adap-
tive neighbors to each multimedia data point. In [13], Nie et al.
adopted the l1-norm distance instead of the Euclidean distance
and proposed a graph clustering relaxation. Based on the fact
that the affinity matrix should obey the block diagonal property,
Nie et al. [14] imposed the rank constraint on the Laplacian
matrix for graph-based clustering. To better exploit the
complementary information of multi-view features, the approaches
in [17] and [18] extended the adaptive neighbor strategy [12]
and the rank constraint [14], respectively, from the single-view
setting to the multi-view one. Following this, Wang et al. [19]
pursued a unified affinity matrix from the affinity matrices of all
views and used the rank function to partition the multimedia
data points into the optimal number of clusters. However, these
graph-based approaches, e.g., [16], [18], [19], usually construct
the affinity matrix directly from the raw multimedia features,
which are often corrupted by noise and outliers. Thus, they may
obtain an unreliable and inaccurate affinity matrix [10], [26].
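The two-step pipeline described above (affinity construction followed by spectral clustering) can be illustrated with a minimal NumPy sketch. It is not taken from any of the cited methods: the function names, the Gaussian bandwidth `sigma`, and the toy data are illustrative assumptions, and the final k-means step on the rows of the embedding is omitted.

```python
import numpy as np

def knn_heat_kernel_affinity(X, k=5, sigma=1.0):
    """Build a symmetric k-NN affinity matrix from the columns of X (d x n)."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    # Squared Euclidean distances between all pairs of columns.
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    W = np.exp(-dist2 / (2.0 * sigma ** 2))      # heat-kernel weights
    np.fill_diagonal(W, 0.0)                     # no self-loops
    # For every point, keep only its k nearest neighbours (excluding itself).
    nearest = np.argsort(dist2 + np.diag(np.full(n, np.inf)), axis=1)[:, :k]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.arange(n)[:, None], nearest] = True
    mask = mask | mask.T                         # symmetrise the graph
    return np.where(mask, W, 0.0)

def spectral_embedding(W, c):
    """Eigenpairs of the normalised Laplacian with the c smallest eigenvalues."""
    deg = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = np.eye(W.shape[0]) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)         # ascending eigenvalues
    return eigvals[:c], eigvecs[:, :c]

rng = np.random.default_rng(0)
# Toy data (assumption): two well-separated clusters of 10 points each in 2-D.
X = np.hstack([rng.normal(0.0, 0.1, (2, 10)), rng.normal(5.0, 0.1, (2, 10))])
W = knn_heat_kernel_affinity(X, k=3, sigma=0.5)
eigvals, embedding = spectral_embedding(W, c=2)
```

In a full spectral clustering pipeline, the rows of `embedding` would then be clustered (for example with k-means) to obtain the final labels.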
As the second category, subspace clustering-based methods
have become the mainstream due to their excellent interpretabil-
ity and performance. The goal of subspace clustering is to
simultaneously find low-dimensional subspaces and partition
multimedia data points into multiple subspaces. Specifically,
1520-9210 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: University Town Library of Shenzhen. Downloaded on September 12,2020 at 09:23:51 UTC from IEEE Xplore. Restrictions apply.
sparse subspace clustering (SSC) [21] and low-rank represen-
tation (LRR) [20] are two representative works, resulting in a
local representation matrix and a global one, respectively. Since
SSC learns the representation matrix with the l1-norm, it imposes
sparsity on all entries of the representation matrix. In contrast,
LRR constructs the representation matrix with a low-rank regularizer,
which imposes sparsity on the singular values. Beyond
the low-rankness and sparsity, some extra structures underlying
data, such as the local similarity structure and nonnegativity [28],
may not be fully considered. Instead of the fixed dictionary, i.e.,
the original multimedia feature, the work in [29] proposed to
learn a locality-preserving dictionary to capture the intrinsic ge-
ometric structure of the dictionary for LRR. Yin et al. [26] pro-
posed to integrate LRR and the graph construction in a unified
framework to learn an adaptive low-rank graph affinity matrix. A
similar idea was adopted in [10], [25]. A major limitation is that,
when handling multi-view features, these methods may suffer a significant
performance degradation since they focus only on single-view
features.
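The contrast drawn above between SSC (sparsity on the entries) and LRR (sparsity on the singular values) can be made concrete with the two standard shrinkage operators associated with the l1-norm and the nuclear norm. The sketch below is an illustration of these regularizers only, not an implementation of SSC or LRR; the function names and the toy matrix are assumptions.

```python
import numpy as np

def soft_threshold(M, tau):
    """Entry-wise shrinkage: the proximal operator of tau * l1-norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def singular_value_threshold(M, tau):
    """Shrink the singular values: the proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

rng = np.random.default_rng(0)
# Toy matrix (assumption): rank-2 signal plus small dense noise.
M = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 6))
M = M + 0.01 * rng.standard_normal((6, 6))

sparse_M = soft_threshold(M, 0.5)               # many entries become exactly zero
lowrank_M = singular_value_threshold(M, 0.5)    # small singular values become zero
```

The first operator zeroes small entries (a "local" effect), while the second zeroes small singular values and therefore acts on the matrix as a whole (a "global" effect).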
Recently, considerable efforts based on deep neural network
have been expended for clustering. For example, Ji et al. [30]
proposed a deep neural network by introducing a self-expressive
layer into the auto-encoder framework for clustering. To build
a deep structure, the authors in [31] adopted semi-nonnegative
matrix factorization for multi-view clustering. In [32], a
highly-economized scalable image clustering method was proposed
to cluster large-scale multi-view images. Besides, to deal with
multi-view clustering with missing features, Chao et al. [33]
presented an enhanced multi-view co-clustering method. For a
comprehensive survey on clustering, please refer to [34] and the
references therein.
A. Related Work
The existing low-rank-based approaches for multi-view clus-
tering can be roughly grouped into two categories: two-
dimension matrix-based low-rank methods [5], [23], [35]–[40]
and three-dimension tensor-based low-rank ones [6]–[8]. For
example, to deal with multiple multimedia features, the work
in [35] proposed to concatenate all heterogeneous features and
then perform LRR [20]. Xia et al. [36] exploited the low-rank
and sparse matrix decomposition to uncover a shared transition
probability matrix under the Markov chain method. In addition to
consistency among multi-view features, the work in [38] took
local view-specific information into consideration for multi-view
clustering. Similarly, Tang et al. [5] proposed a multi-view
clustering method by learning a joint affinity graph. In [5], [38], the
consistency measures the common properties among all views
while the specificity captures the inherent difference in each
view. Different from these approaches that use the nuclear norm
to depict the low-rank property of the representation matrices,
Wang et al. [23] proposed to factorize each representation matrix
as the product of symmetric low-rank data-cluster matrices, such
that the singular value decomposition can be avoided. Following
this, Liu et al. [40] proposed to mine a consensus representation
of all views by multi-view non-negative matrix factorization.
Fig. 1. Comparison of existing low-rank tensor representation-based MVC
methods (the red dashed rectangle) and our proposed JLMVC (the blue dashed
rectangle). Existing methods construct the representation matrix (a) and the
affinity matrix (b) in two separate steps without considering their correlation.
JLMVC learns the representation tensor and the affinity matrix (d) in a unified
framework. Additionally, the kernel-induced mapping is adopted to map the
original multimedia data (usually nonlinearly separable) into a new linear space.
The most representative methods of the second category are
the tensor unfolding-based method (LT-MSC) [6] and t-singular
value decomposition (t-SVD)-based one (t-SVD-MSC) [7]. As
shown in Fig. 1(a), each representation matrix is stored as the
frontal slice of a tensor, resulting in a third-order tensor (called
representation tensor). The main difference between [6] and [7]
is the tensor rank approximation which aims to explore the
high order correlations among multi-views. By organizing all
multi-view features into a third-order tensor, the work in [41]
exploited the sparsity and tensor nuclear norm penalty with
self-expressiveness to construct the representation tensor.
Although these approaches have greatly advanced
multi-view clustering, they may suffer from the following
challenges: 1) their performance may degrade sharply in real applications
when the multimedia data come from nonlinear subspaces.
The intuitive reason is that they were originally designed to deal
with the data that lie within multiple linear subspaces [8], [42],
[43]. 2) the correlation between the representation tensor and
affinity matrix may not be fully exploited. They learn the rep-
resentation tensor via different low-rank tensor representations,
and then construct the affinity matrix as shown in Figs. 1(a) and
(b) in two separate steps. This means that a globally optimal
affinity matrix cannot be guaranteed. 3) the importance of each
view in the construction of the affinity matrix is not considered.
For example, methods in [6], [7], [44] simply average all repre-
sentation matrices with the same weight. The approach in [44]
overcomes the first limitation, but fails to address the other two
challenges. To the best of our knowledge, no existing work
addresses these three challenges simultaneously.
B. Our Contributions
To address the above three challenges, we propose a unified model
to jointly learn the kernel representation tensor and affinity
matrix for multi-view clustering (JLMVC). JLMVC learns the
representation tensor and affinity matrix jointly such that their
correlations can be well exploited, handles the nonlinear mul-
timedia data using a kernel-induced mapping, and adopts the
adaptive weight strategy to form a unified affinity matrix. Fig. 1
compares the proposed JLMVC with two state-of-the-art low-
rank tensor representation-based MVC methods LT-MSC [6]
and t-SVD-MSC [7]. As can be observed, under the assumption
that the original data lie within multiple linear subspaces,
existing low-rank tensor representation-based MVC methods
learn the representation tensor from the original multimedia data.
However, this assumption may not hold in real applications.
To achieve nonlinear multi-view clustering, JLMVC maps
the original multimedia data from the input data space into a
new feature space such that the mapped data points can reside in
multiple linear subspaces, as shown in the middle of Fig. 1(c).
JLMVC then learns the representation tensor and affinity matrix
simultaneously. Finally, the learned unified affinity matrix is fed
to the input of the spectral clustering algorithm [9] to obtain the
clustering results.
The contributions and novelty of this paper are summarized
as follows:
• We propose a joint learning multi-view clustering
(JLMVC) model to jointly learn the kernel representation tensor
and affinity matrix for multi-view clustering. JLMVC
exploits the correlation between the representation
tensor and affinity matrix, handles nonlinear data
using a kernel-induced mapping, and adopts an adaptive
weight strategy to form a unified affinity matrix.
• JLMVC uses the tensor nuclear norm to encode the low-rank
property of the representation tensor and adaptively
learns different weights for different views’ representation
matrices. This greatly benefits the construction of the
unified affinity matrix.
• An effective algorithm is designed to solve the JLMVC
model via the alternating direction method of multipliers.
Extensive experiments on eight popular multimedia
datasets validate the superiority of
JLMVC over ten state-of-the-art approaches.
C. Organization of the Paper
The rest of this paper is structured as follows. Section II intro-
duces some notations and preliminaries, especially the t-SVD-
based tensor nuclear norm which is used to depict the low-rank
property of the representation tensor. In Section III, we introduce
JLMVC and design an iterative algorithm under the alternating
direction method of multipliers framework. We evaluate
the performance of the proposed JLMVC on eight real-world
multi-view datasets in Section IV and conclude the whole paper
in Section V.
TABLE I
BASIC NOTATIONS AND THEIR DESCRIPTIONS
II. NOTATIONS AND PRELIMINARIES
In this section, we aim to introduce some notations used
throughout this paper and the t-SVD-based tensor nuclear norm
(see Definition 2.2) that will be used to depict the low-rank
property of the representation tensor. Some basic notations are
summarized in Table I.
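Before the definition of t-SVD [45], several operators are first
introduced. For a tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, its block circular
matrix $\mathrm{bcirc}(\mathcal{X})$ and block diagonal matrix $\mathrm{bdiag}(\mathcal{X})$ are defined
as
$$
\mathrm{bcirc}(\mathcal{X}) =
\begin{bmatrix}
X^{(1)} & X^{(n_3)} & \cdots & X^{(2)} \\
X^{(2)} & X^{(1)} & \cdots & X^{(3)} \\
\vdots & \ddots & \ddots & \vdots \\
X^{(n_3)} & X^{(n_3-1)} & \cdots & X^{(1)}
\end{bmatrix},
\qquad
\mathrm{bdiag}(\mathcal{X}) =
\begin{bmatrix}
X^{(1)} & & & \\
& X^{(2)} & & \\
& & \ddots & \\
& & & X^{(n_3)}
\end{bmatrix}.
$$
The block vectorization is defined as $\mathrm{bvec}(\mathcal{X}) = [X^{(1)}; \cdots; X^{(n_3)}]$.
The inverse operations of bvec and bdiag are defined as
$\mathrm{bvfold}(\mathrm{bvec}(\mathcal{X})) = \mathcal{X}$ and $\mathrm{bdfold}(\mathrm{bdiag}(\mathcal{X})) = \mathcal{X}$,
respectively. Let $\mathcal{Y} \in \mathbb{R}^{n_2 \times n_4 \times n_3}$. The t-product $\mathcal{X} * \mathcal{Y}$ is
an $n_1 \times n_4 \times n_3$ tensor, $\mathcal{X} * \mathcal{Y} = \mathrm{bvfold}(\mathrm{bcirc}(\mathcal{X}) \cdot \mathrm{bvec}(\mathcal{Y}))$.
The transpose of $\mathcal{X}$ is $\mathcal{X}^T \in \mathbb{R}^{n_2 \times n_1 \times n_3}$, obtained by transposing
each of the frontal slices and then reversing the order
of transposed frontal slices 2 through $n_3$. The identity tensor
$\mathcal{I} \in \mathbb{R}^{n_1 \times n_1 \times n_3}$ is a tensor whose first frontal slice is an
$n_1 \times n_1$ identity matrix and whose remaining frontal slices are zero. A
tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times n_1 \times n_3}$ is orthogonal if it satisfies
$\mathcal{X}^T * \mathcal{X} = \mathcal{X} * \mathcal{X}^T = \mathcal{I}$.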
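The operators defined above can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' code: all function names are our own, tensors are stored as arrays of shape `(n1, n2, n3)` with frontal slices along the last axis, and `t_product_fft` uses the well-known equivalence between the block-circulant product and slice-wise products in the Fourier domain.

```python
import numpy as np

def bcirc(X):
    """Block circulant matrix: block (i, j) is frontal slice (i - j) mod n3."""
    n1, n2, n3 = X.shape
    out = np.zeros((n1 * n3, n2 * n3))
    for i in range(n3):
        for j in range(n3):
            out[i * n1:(i + 1) * n1, j * n2:(j + 1) * n2] = X[:, :, (i - j) % n3]
    return out

def bvec(X):
    """Stack the frontal slices vertically."""
    return np.concatenate([X[:, :, k] for k in range(X.shape[2])], axis=0)

def bvfold(M, n3):
    """Inverse of bvec: split the rows into n3 blocks and stack them as slices."""
    n1 = M.shape[0] // n3
    return np.stack([M[k * n1:(k + 1) * n1] for k in range(n3)], axis=2)

def t_product(X, Y):
    """t-product via the definition bvfold(bcirc(X) . bvec(Y))."""
    return bvfold(bcirc(X) @ bvec(Y), X.shape[2])

def t_product_fft(X, Y):
    """Same product computed slice-wise in the Fourier domain."""
    Xf, Yf = np.fft.fft(X, axis=2), np.fft.fft(Y, axis=2)
    return np.real(np.fft.ifft(np.einsum('ijk,jlk->ilk', Xf, Yf), axis=2))

def t_transpose(X):
    """Transpose each slice, then reverse the order of slices 2..n3."""
    Xt = np.transpose(X, (1, 0, 2))
    return np.concatenate([Xt[:, :, :1], Xt[:, :, :0:-1]], axis=2)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))
Y = rng.standard_normal((4, 2, 5))
I = np.zeros((4, 4, 5))
I[:, :, 0] = np.eye(4)        # identity tensor: first slice is the identity
XY = t_product(X, Y)
```

The FFT-based version avoids forming the large block circulant matrix, which is why it is usually preferred in practice.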
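Definition 2.1: (t-SVD) Given $\mathcal{X}$, its t-SVD is defined as
$$\mathcal{X} = \mathcal{U} * \mathcal{G} * \mathcal{V}^T,$$
where $\mathcal{U} \in \mathbb{R}^{n_1 \times n_1 \times n_3}$ and $\mathcal{V} \in \mathbb{R}^{n_2 \times n_2 \times n_3}$ are orthogonal tensors,
and $\mathcal{G} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ is an f-diagonal tensor, i.e., each of its frontal
slices is a diagonal matrix.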
Fig. 2 shows the t-SVD of a third-order tensor. The t-SVD-
based tensor nuclear norm (t-SVD-TNN) is given as follows.
Fig. 2. The t-SVD of a tensor of size n1×n2×n3.
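Definition 2.2: (t-SVD-TNN) The t-SVD-TNN of a tensor
$\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, denoted as $\|\mathcal{X}\|_{\circledast}$, is defined as the sum of
singular values of all the frontal slices of $\hat{\mathcal{X}}$, i.e.,
$$\|\mathcal{X}\|_{\circledast} = \sum_{i=1}^{\min\{n_1, n_2\}} \sum_{k=1}^{n_3} |\hat{\mathcal{G}}(i, i, k)|. \tag{1}$$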
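Definitions 2.1 and 2.2 can be sketched numerically as follows. The sketch assumes that the hat denotes the discrete Fourier transform along the third dimension (the usual convention in the t-SVD literature) and follows Eq. (1) as printed, without any normalization by $n_3$. Function names are our own. The second half of the Fourier slices is filled by conjugate symmetry so that the factors are real after the inverse transform.

```python
import numpy as np

def t_svd(X):
    """Return the Fourier-domain factors (Uf, Gf, Vf) of the t-SVD of a real tensor."""
    n1, n2, n3 = X.shape
    r = min(n1, n2)
    Xf = np.fft.fft(X, axis=2)
    Uf = np.zeros((n1, n1, n3), dtype=complex)
    Gf = np.zeros((n1, n2, n3), dtype=complex)
    Vf = np.zeros((n2, n2, n3), dtype=complex)
    for k in range(n3 // 2 + 1):
        slice_k = Xf[:, :, k]
        if k == 0 or 2 * k == n3:
            slice_k = slice_k.real            # these slices are real for real X
        U, s, Vh = np.linalg.svd(slice_k)
        Uf[:, :, k] = U
        Gf[np.arange(r), np.arange(r), k] = s
        Vf[:, :, k] = Vh.conj().T
    for k in range(n3 // 2 + 1, n3):          # conjugate symmetry of the FFT
        Uf[:, :, k] = Uf[:, :, n3 - k].conj()
        Gf[:, :, k] = Gf[:, :, n3 - k].conj()
        Vf[:, :, k] = Vf[:, :, n3 - k].conj()
    return Uf, Gf, Vf

def reconstruct(Uf, Gf, Vf):
    """Multiply the factors slice-wise in the Fourier domain and transform back."""
    n3 = Uf.shape[2]
    Xf = np.stack([Uf[:, :, k] @ Gf[:, :, k] @ Vf[:, :, k].conj().T
                   for k in range(n3)], axis=2)
    return np.real(np.fft.ifft(Xf, axis=2))

def tnn(X):
    """t-SVD-TNN as in Eq. (1): sum of all Fourier-domain singular values."""
    _, Gf, _ = t_svd(X)
    return float(np.real(Gf).sum())

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3, 5))
Uf, Gf, Vf = t_svd(X)
X_rec = reconstruct(Uf, Gf, Vf)
```

Applying `np.fft.ifft(..., axis=2)` to `Uf`, `Gf`, and `Vf` gives the spatial-domain tensors $\mathcal{U}$, $\mathcal{G}$, and $\mathcal{V}$ of Definition 2.1.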
III. JOINT LEARNING MULTI-VIEW CLUSTERING
In this section, we first elaborate the proposed JLMVC model
in Section III-A, and then solve this model by the alternating
direction method of multipliers (ADMM) in Section III-B.
Considering that, in real-world applications, the multimedia data
may be drawn from multiple nonlinear subspaces, JLMVC first
uses the kernel trick to handle the nonlinearity. Based on the
self-expression property [20], [21], JLMVC carries out joint
learning of the representation tensor and unified affinity matrix.
A. Problem Formulation
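The existing multi-view clustering method t-SVD-MSC [7]
learns the representation tensor $\mathcal{Z}$ by
$$
\begin{aligned}
\min_{\mathcal{Z}, E}\ & \|\mathcal{Z}\|_{\circledast} + \alpha \sum_{v=1}^{V} \|E^{(v)}\|_{2,1} \\
\text{s.t. } & X^{(v)} = X^{(v)} Z^{(v)} + E^{(v)},\ v = 1, \ldots, V, \\
& \mathcal{Z} = \Phi(Z^{(1)}, Z^{(2)}, \ldots, Z^{(V)}),
\end{aligned}
\tag{2}
$$
where $X^{(v)} \in \mathbb{R}^{d_v \times n}$ denotes the v-th view feature; $\alpha > 0$ is
the regularization parameter; $E$ denotes noise and outliers; $\Phi(\cdot)$
is an operator to stack all representation matrices $\{Z^{(v)}\}$ into a
third-order tensor $\mathcal{Z}$ as shown in Fig. 1(a).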
Once $\mathcal{Z}$ is obtained from Eq. (2), the affinity matrix $S$ is
constructed by averaging all frontal slices of $\mathcal{Z}$. This means that,
in the construction of $S$, the correlation between $S$ and $\mathcal{Z}$ is
fixed. This scheme, however, may not ensure the optimal affinity
matrix since different view features characterize specific and
partly independent information of the dataset. Therefore, to address
this issue, different weights should be assigned to different
views. We then give the following model:
$$
\begin{aligned}
\min_{\mathcal{Z}, S, \omega}\ & \|\mathcal{Z}\|_{\circledast} + \sum_{v=1}^{V} \left( \alpha \|X^{(v)} - X^{(v)} Z^{(v)}\|_{2,1} + \lambda \omega^{(v)} \|Z^{(v)} - S\|_F^2 \right) + \eta \|\omega\|_2^2 \\
\text{s.t. } & \mathcal{Z} = \Phi(Z^{(1)}, Z^{(2)}, \ldots, Z^{(V)}),\ \omega \ge 0,\ \textstyle\sum_v \omega^{(v)} = 1,
\end{aligned}
\tag{3}
$$
where $\alpha$, $\lambda$, and $\eta$ are three positive parameters to balance the
contributions of all terms in the objective function; $\omega^{(v)}$ is the
relative weight of the v-th view; the last term is used to smooth
the weight distribution and avoid the trivial solution [46]. However,
in model (3), the self-expression property is encoded in
the original input data space (i.e., the second term), where
real-world datasets usually exhibit nonlinear structure. Here,
we seek new feature spaces in which the multi-view data become
linearly separable. Borrowing the idea of kernel methods [42], [43],
for the v-th feature, let $\phi^{(v)}: \mathbb{R}^{d_v} \to \mathcal{H}^{(v)}$ be a kernel mapping
from the original data space to the kernel space. As shown in
Eq. (6) below, $\phi^{(v)}$ does not need to be defined explicitly.
Let $K^{(v)} \in \mathbb{R}^{n \times n}$ be a positive semidefinite kernel Gram matrix, i.e.,
$$K^{(v)} = \phi^{(v)}(X^{(v)})^T \phi^{(v)}(X^{(v)}). \tag{4}$$
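Then, we encode the self-expression property on the new feature
space. This is also the reason that the proposed JLMVC can
handle the nonlinearity problem. Based on the above analysis,
model (3) can be formulated as
$$
\begin{aligned}
\min_{\mathcal{Z}, S, \omega}\ & \|\mathcal{Z}\|_{\circledast} + \sum_{v=1}^{V} \left( \alpha \|\phi^{(v)}(X^{(v)}) - \phi^{(v)}(X^{(v)}) Z^{(v)}\|_{2,1} + \lambda \omega^{(v)} \|Z^{(v)} - S\|_F^2 \right) + \eta \|\omega\|_2^2 \\
\text{s.t. } & \mathcal{Z} = \Phi(Z^{(1)}, Z^{(2)}, \ldots, Z^{(V)}),\ \omega \ge 0,\ \textstyle\sum_v \omega^{(v)} = 1.
\end{aligned}
\tag{5}
$$
Note that the second term of Eq. (5) can be rewritten as
$$\|\phi^{(v)}(X^{(v)}) - \phi^{(v)}(X^{(v)}) Z^{(v)}\|_{2,1} = \sum_{i=1}^{n} \left( P_i^{(v)T} K^{(v)} P_i^{(v)} \right)^{\frac{1}{2}}, \tag{6}$$
where $P^{(v)} = I - Z^{(v)}$ and $P_i^{(v)}$ is the i-th column of $P^{(v)}$. From
Eq. (6), it is easy to see that the kernel mapping $\phi^{(v)}$ appears only
in the form of the inner product, i.e., $\phi^{(v)}(X^{(v)})^T \phi^{(v)}(X^{(v)})$,
leading to the kernel Gram matrix $K^{(v)}$. Therefore, $\phi^{(v)}$ is
implicitly defined. For simplicity, we denote
$g^{(v)}(P^{(v)}) = \sum_{i=1}^{n} \left( P_i^{(v)T} K^{(v)} P_i^{(v)} \right)^{\frac{1}{2}}$ to be the reconstruction error in the
kernel space. Finally, the proposed JLMVC model can be formulated as
$$
\begin{aligned}
\min_{\mathcal{Z}, P^{(v)}, S, \omega}\ & \|\mathcal{Z}\|_{\circledast} + \sum_{v=1}^{V} \left( \alpha g^{(v)}(P^{(v)}) + \lambda \omega^{(v)} \|Z^{(v)} - S\|_F^2 \right) + \eta \|\omega\|_2^2 \\
\text{s.t. } & \mathcal{Z} = \Phi(Z^{(1)}, Z^{(2)}, \ldots, Z^{(V)}), \\
& \mathcal{P} = \Phi(P^{(1)}, P^{(2)}, \ldots, P^{(V)}), \\
& \mathcal{P} = \mathcal{I} - \mathcal{Z},\ \omega \ge 0,\ \textstyle\sum_v \omega^{(v)} = 1,
\end{aligned}
\tag{7}
$$
where the first term, i.e., $\|\mathcal{Z}\|_{\circledast}$ defined in Eq. (1), is used to
explore the low-rankness of $\mathcal{Z}$; the second term can handle the
nonlinear structures; the third term with the adaptive weight
strategy aims to learn a unified affinity matrix $S$.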
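The identity in Eq. (6) can be checked numerically. The sketch below evaluates the kernel-space reconstruction error $g^{(v)}$ directly from a Gram matrix and confirms that, for a linear kernel, it coincides with the $\ell_{2,1}$ norm (sum of column $\ell_2$ norms) of $X - XZ$. The Gaussian kernel and its bandwidth `sigma` are illustrative assumptions; this section does not state which kernel is used.

```python
import numpy as np

def l21_norm(M):
    """Sum of the Euclidean norms of the columns of M."""
    return np.linalg.norm(M, axis=0).sum()

def kernel_reconstruction_error(K, Z):
    """g(P) = sum_i sqrt(p_i^T K p_i) with P = I - Z, as in Eq. (6)."""
    P = np.eye(K.shape[0]) - Z
    quad = np.einsum('ji,jk,ki->i', P, K, P)     # p_i^T K p_i for every column i
    return np.sqrt(np.maximum(quad, 0.0)).sum()

def gaussian_kernel(X, sigma=1.0):
    """Gaussian (RBF) Gram matrix of the columns of X (an assumed kernel choice)."""
    sq = np.sum(X ** 2, axis=0)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    return np.exp(-dist2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))          # one view: d = 5 features, n = 8 samples
Z = 0.1 * rng.standard_normal((8, 8))    # an arbitrary representation matrix

K_linear = X.T @ X                       # linear kernel: phi is the identity map
err_kernel = kernel_reconstruction_error(K_linear, Z)
err_direct = l21_norm(X - X @ Z)         # explicit l2,1 reconstruction error

K_rbf = gaussian_kernel(X, sigma=2.0)
err_rbf = kernel_reconstruction_error(K_rbf, Z)
```

With a nonlinear kernel such as `K_rbf`, the mapping $\phi^{(v)}$ is never formed explicitly; only the Gram matrix enters the computation.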
B. Optimization
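It is difficult to solve the proposed model in Eq. (7) directly, since
it is not jointly convex and its terms are coupled through the variable $\mathcal{Z}$.
Therefore, we solve Eq. (7) under the ADMM framework. We can
reformulate Eq. (7) as:
$$
\begin{aligned}
\min_{\mathcal{Z}, \mathcal{Y}, \mathcal{P}, S, \omega}\ & \|\mathcal{Y}\|_{\circledast} + \sum_{v=1}^{V} \left( \alpha g^{(v)}(P^{(v)}) + \lambda \omega^{(v)} \|Z^{(v)} - S\|_F^2 \right) + \eta \|\omega\|_2^2 \\
\text{s.t. } & \mathcal{Z} = \Phi(Z^{(1)}, Z^{(2)}, \ldots, Z^{(V)}), \\
& \mathcal{P} = \Phi(P^{(1)}, P^{(2)}, \ldots, P^{(V)}), \\
& \mathcal{P} = \mathcal{I} - \mathcal{Z},\ \omega \ge 0,\ \textstyle\sum_v \omega^{(v)} = 1,\ \mathcal{Z} = \mathcal{Y}.
\end{aligned}
\tag{8}
$$
Following the idea of ADMM, we introduce one auxiliary variable
$\mathcal{Y}$ to separate $\mathcal{Z}$ in the objective function and then iteratively
update each variable by fixing the other variables [47]. The
augmented Lagrangian function is defined as the sum of the
objective function of Eq. (8) and the penalty term under
the l2-norm. The augmented Lagrangian function of model (8)
is given by:
$$
\begin{aligned}
L_{\rho}(\mathcal{Z}, \mathcal{Y}, P^{(v)}, S, \omega; \Theta, \Pi) ={}& \|\mathcal{Y}\|_{\circledast} + \sum_{v=1}^{V} \left( \alpha g^{(v)}(P^{(v)}) + \lambda \omega^{(v)} \|Z^{(v)} - S\|_F^2 \right) + \eta \|\omega\|_2^2 \\
& + \langle \Theta, \mathcal{I} - \mathcal{Z} - \mathcal{P} \rangle + \frac{\rho}{2} \|\mathcal{I} - \mathcal{Z} - \mathcal{P}\|_F^2 + \langle \Pi, \mathcal{Z} - \mathcal{Y} \rangle + \frac{\rho}{2} \|\mathcal{Z} - \mathcal{Y}\|_F^2,
\end{aligned}
\tag{9}
$$
where $\Theta$ and $\Pi$ are the Lagrange multipliers of size $n \times n \times V$;
$\rho$ is the non-negative penalty parameter; $\langle \cdot, \cdot \rangle$ is the inner
product. Under the ADMM framework, we can solve Eq. (9) by
optimizing one variable while keeping the other variables fixed
as follows: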
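Step 1 Update $\mathcal{Z}$: Fixing the other variables, we can update $\mathcal{Z}$
by solving the following subproblem:
$$\min_{\mathcal{Z}} \sum_{v=1}^{V} \lambda \omega_k^{(v)} \|Z^{(v)} - S_k\|_F^2 + \frac{\rho_k}{2} \left\| \mathcal{I} - \mathcal{Z} - \mathcal{P}_k + \frac{\Theta_k}{\rho_k} \right\|_F^2 + \frac{\rho_k}{2} \left\| \mathcal{Z} - \mathcal{Y}_k + \frac{\Pi_k}{\rho_k} \right\|_F^2. \tag{10}$$
It is easy to see that the updates of the frontal slices $Z^{(v)}$ of $\mathcal{Z}$ are
independent. This means that the $Z^{(v)}$ can be updated in parallel.
The v-th subproblem is
$$\min_{Z^{(v)}} \lambda \omega_k^{(v)} \|Z^{(v)} - S_k\|_F^2 + \frac{\rho_k}{2} \|Z^{(v)} - A_k^{(v)}\|_F^2 + \frac{\rho_k}{2} \|Z^{(v)} - B_k^{(v)}\|_F^2, \tag{11}$$
where $A_k^{(v)} = I - P_k^{(v)} + \frac{\Theta_k^{(v)}}{\rho_k}$ and $B_k^{(v)} = Y_k^{(v)} - \frac{\Pi_k^{(v)}}{\rho_k}$. By
setting the derivative of Eq. (11) with respect to $Z^{(v)}$ to zero,
the optimal solution $Z_{k+1}^{(v)}$ is
$$Z_{k+1}^{(v)} = \frac{2 \lambda \omega_k^{(v)} S_k + \rho_k A_k^{(v)} + \rho_k B_k^{(v)}}{2 \lambda \omega_k^{(v)} + 2 \rho_k}. \tag{12}$$
Fig. 3. Explanation of rotation.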
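The closed-form update in Eq. (12) can be verified numerically: the gradient of the objective in Eq. (11), $2\lambda\omega_k^{(v)}(Z^{(v)} - S_k) + \rho_k(Z^{(v)} - A_k^{(v)}) + \rho_k(Z^{(v)} - B_k^{(v)})$, should vanish at the returned point. The following is a small sketch with random inputs; the function names and parameter values are illustrative assumptions.

```python
import numpy as np

def update_Z_view(S, A, B, omega_v, lam, rho):
    """Closed-form minimizer of Eq. (11) for one view, i.e., Eq. (12)."""
    return (2.0 * lam * omega_v * S + rho * A + rho * B) / (2.0 * lam * omega_v + 2.0 * rho)

def objective_eq11(Z, S, A, B, omega_v, lam, rho):
    """Objective of the v-th subproblem in Eq. (11)."""
    return (lam * omega_v * np.sum((Z - S) ** 2)
            + 0.5 * rho * np.sum((Z - A) ** 2)
            + 0.5 * rho * np.sum((Z - B) ** 2))

rng = np.random.default_rng(0)
n = 6
S, A, B = (rng.standard_normal((n, n)) for _ in range(3))
omega_v, lam, rho = 0.4, 2.0, 0.5        # arbitrary positive values for the check

Z_star = update_Z_view(S, A, B, omega_v, lam, rho)
gradient = 2.0 * lam * omega_v * (Z_star - S) + rho * (Z_star - A) + rho * (Z_star - B)
```

Because Eq. (11) is a strictly convex quadratic in $Z^{(v)}$, the stationary point is its unique minimizer.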
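Step 2 Update $\mathcal{Y}$: When the other variables are fixed, $\mathcal{Y}$ can be
updated by
$$\min_{\mathcal{Y}} \|\mathcal{Y}\|_{\circledast} + \frac{\rho_k}{2} \|\mathcal{Y} - \mathcal{F}_k\|_F^2, \tag{13}$$
where $\mathcal{F}_k = \mathcal{Z}_{k+1} + \frac{\Pi_k}{\rho_k}$. Following [7], we rotate $\mathcal{Y}$ from size
$n \times n \times V$ to $n \times V \times n$ as shown in Fig. 3. The first reason is
that, as in Eq. (1), t-SVD-TNN performs SVD on each frontal
slice of $\hat{\mathcal{Y}}$ to capture the “spatial-shifting” correlation [45], [48].
This means that, without rotation, t-SVD-TNN preserves only the
intra-view low-rank property, whereas we hope to capture the
inter-view low-rank property. The second reason is that the rotation
operation can significantly reduce the computation cost [7]. After
the rotation operation, each frontal slice of $\hat{\mathcal{Y}}$ represents the
view-specific self-representation matrix.
The closed-form solution of Eq. (13) can be obtained by the
tensor tubal-shrinkage operator [7], [49]:
$$\mathcal{Y}_{k+1} = \mathcal{C}_{\frac{V}{\rho_k}}(\mathcal{F}_k) = \mathcal{U} * \mathcal{C}_{\frac{V}{\rho_k}}(\mathcal{G}) * \mathcal{V}^T, \tag{14}$$
where $\mathcal{F}_k = \mathcal{U} * \mathcal{G} * \mathcal{V}^T$, and $\mathcal{C}_{\frac{V}{\rho_k}}(\mathcal{G}) = \mathcal{G} * \mathcal{J}$, in which $\mathcal{J}$ is an
f-diagonal tensor whose diagonal element in the Fourier domain
is $\mathcal{J}(i, i, k) = \max\left\{1 - \frac{V/\rho_k}{\mathcal{G}(i, i, k)}, 0\right\}$.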
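The tubal-shrinkage step can be sketched with slice-wise singular value shrinkage in the Fourier domain, since multiplying a singular value $g$ by $\max\{1 - \tau/g, 0\}$ is the same as replacing it with $\max\{g - \tau, 0\}$. In this sketch the threshold `tau` is passed explicitly (Eq. (14) uses $V/\rho_k$), and the rotation is implemented as a swap of the last two axes, which is only one possible reading of Fig. 3. Function names are our own.

```python
import numpy as np

def rotate(T):
    """Swap the last two modes: n x n x V <-> n x V x n (assumed form of the rotation)."""
    return np.transpose(T, (0, 2, 1))

def tubal_shrink(F, tau):
    """Shrink the singular values of every Fourier-domain frontal slice by tau."""
    Ff = np.fft.fft(F, axis=2)
    Yf = np.zeros_like(Ff)
    for k in range(F.shape[2]):
        U, s, Vh = np.linalg.svd(Ff[:, :, k], full_matrices=False)
        Yf[:, :, k] = (U * np.maximum(s - tau, 0.0)) @ Vh
    return np.real(np.fft.ifft(Yf, axis=2))

def update_Y(Z, Pi, rho, tau):
    """Sketch of Step 2: form F_k, rotate, shrink, and rotate back."""
    F = Z + Pi / rho
    return rotate(tubal_shrink(rotate(F), tau))

rng = np.random.default_rng(0)
n, V = 5, 3
Z = rng.standard_normal((n, n, V))
Pi = np.zeros((n, n, V))
rho = 1e-1
Y_new = update_Y(Z, Pi, rho, tau=V / rho)    # threshold V / rho_k as in Eq. (14)
F_rot = rotate(Z + Pi / rho)
Y_half = tubal_shrink(F_rot, 0.5)
```

Larger thresholds remove more of the small Fourier-domain singular values, which yields a lower tubal rank for the updated tensor.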
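Step 3 Update $\mathcal{P}$: With the other variables fixed, we minimize
the augmented Lagrangian function in Eq. (9) with respect to $\mathcal{P}$:
$$\min_{\mathcal{P}} \sum_{v=1}^{V} \alpha g^{(v)}(P^{(v)}) + \frac{\rho_k}{2} \left\| \mathcal{I} - \mathcal{Z}_{k+1} - \mathcal{P} + \frac{\Theta_k}{\rho_k} \right\|_F^2. \tag{15}$$
Similar to Eq. (10), the updates of the $P^{(v)}$ are also independent:
$$\min_{P^{(v)}} \alpha g^{(v)}(P^{(v)}) + \frac{\rho_k}{2} \|P^{(v)} - D_k^{(v)}\|_F^2, \tag{16}$$
where $D_k^{(v)} = I - Z_{k+1}^{(v)} + \frac{\Theta_k^{(v)}}{\rho_k}$. Compared with the method
in [42], which uses the l2-norm to measure the reconstruction error,
it is more difficult to solve Eq. (16) since $g^{(v)}$ is convex but
non-smooth. According to [43], the i-th column $p_i^{(v)}$ of the optimal
solution of Eq. (16) is
$$
p_i^{(v)} =
\begin{cases}
\hat{p}^{(v)}, & \text{if } \left\| [1/\sigma_1^{(v)}, \ldots, 1/\sigma_r^{(v)}] \circ t_u^{(v)} \right\|_2 > 1/\tau; \\
c_i^{(v)} - V_K^{(v)} t_u^{(v)}, & \text{otherwise},
\end{cases}
\tag{17}
$$
where $\tau = \frac{\rho_k}{\alpha}$; $c_i^{(v)}$ denotes the i-th column of $D_k^{(v)}$; $\circ$ is the
element-wise multiplication operator;
$K^{(v)} = V^{(v)} \Sigma^{(v)2} V^{(v)T}$ is the singular value decomposition of
$K^{(v)}$; $\Sigma^{(v)} = \mathrm{diag}(\sigma_1^{(v)}, \ldots, \sigma_r^{(v)}, 0, \ldots, 0)$ and $r$ is the rank
of $K^{(v)}$; $V_K^{(v)}$ is constructed from the first $r$ columns of $V^{(v)}$;
$t_u^{(v)} = V_K^{(v)T} c_i^{(v)}$; $\hat{p}^{(v)}$ is defined as
$$\hat{p}^{(v)} = c_i^{(v)} - V_K^{(v)} \left( \left[ \frac{\sigma_1^{(v)2}}{\gamma^{(v)} + \sigma_1^{(v)2}}, \ldots, \frac{\sigma_r^{(v)2}}{\gamma^{(v)} + \sigma_r^{(v)2}} \right]^T \circ t_u^{(v)} \right), \tag{18}$$
where $\gamma^{(v)} > 0$ is a scalar, and it satisfies
$$t_u^{(v)T} \, \mathrm{diag}\left( \left\{ \frac{\sigma_i^{(v)2}}{(\gamma^{(v)} + \sigma_i^{(v)2})^2} \right\}_{1 \le i \le r} \right) t_u^{(v)} = \frac{1}{\tau^2}. \tag{19}$$
We can obtain a unique root $\gamma^{(v)}$ when
$\left\| [1/\sigma_1^{(v)}, \ldots, 1/\sigma_r^{(v)}] \circ t_u^{(v)} \right\|_2 > 1/\tau$.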
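The column-wise solution in Eqs. (17)-(19) can be sketched as follows. The sketch solves the scaled problem $\min_p \sqrt{p^T K p} + \frac{\tau}{2}\|p - c\|_2^2$, which is Eq. (16) for one column divided by $\alpha$. It rests on several assumptions of ours: $c$ is the corresponding column of $D_k^{(v)}$, $t_u = V_K^T c$, the condition in Eq. (17) is read as an $\ell_2$ norm, and the root $\gamma$ of Eq. (19) is found by plain bisection (the source does not say how it is computed).

```python
import numpy as np

def solve_column(c, K, tau, n_bisect=200):
    """Minimize sqrt(p^T K p) + tau/2 * ||p - c||^2 for a PSD matrix K."""
    evals, evecs = np.linalg.eigh(K)             # K = V diag(sigma^2) V^T
    keep = evals > 1e-10 * max(evals.max(), 1.0)
    sig2 = evals[keep]                           # nonzero sigma_i^2
    VK = evecs[:, keep]                          # first r singular vectors
    t = VK.T @ c                                 # t_u
    if np.sqrt(np.sum(t ** 2 / sig2)) <= 1.0 / tau:
        return c - VK @ t                        # second case of Eq. (17)

    def h(gamma):                                # Eq. (19): left side minus right side
        return np.sum(sig2 * t ** 2 / (gamma + sig2) ** 2) - 1.0 / tau ** 2

    lo, hi = 0.0, 1.0
    while h(hi) > 0.0:                           # h is decreasing in gamma
        hi *= 2.0
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if h(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    gamma = 0.5 * (lo + hi)
    return c - VK @ (sig2 / (gamma + sig2) * t)  # Eq. (18)

def column_objective(p, c, K, tau):
    return np.sqrt(max(p @ K @ p, 0.0)) + 0.5 * tau * np.sum((p - c) ** 2)

# With K = I the problem reduces to the proximal operator of the l2-norm.
p_shrunk = solve_column(np.array([3.0, 4.0, 0.0]), np.eye(3), tau=1.0)
p_zero = solve_column(np.array([0.1, 0.0, 0.0]), np.eye(3), tau=1.0)
```

For the identity kernel, the result matches the known block soft-thresholding formula $\max\{1 - 1/(\tau\|c\|_2), 0\}\, c$, which gives a quick sanity check of both branches.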
Step 4 Update $S$: When keeping the other variables fixed, we
obtain the following optimization problem:
$$S_{k+1} = \arg\min_{S} \sum_{v=1}^{V} \omega_k^{(v)} \|Z_{k+1}^{(v)} - S\|_F^2 = \sum_{v=1}^{V} \omega_k^{(v)} Z_{k+1}^{(v)}. \tag{20}$$
The last equation is based on the fact that $\sum_v \omega_k^{(v)} = 1$.
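Step 5 Update $\omega$: To obtain the adaptive weights $\omega_{k+1}$, we
minimize the augmented Lagrangian function in Eq. (9) with
respect to $\omega$:
$$\omega_{k+1} = \arg\min_{\omega} \sum_{v=1}^{V} \omega^{(v)} \|Z_{k+1}^{(v)} - S_{k+1}\|_F^2 + \eta \|\omega\|_2^2, \quad \text{s.t. } \omega \ge 0,\ \sum_v \omega^{(v)} = 1. \tag{21}$$
Actually, Eq. (21) is a quadratic programming problem
$$\omega_{k+1} = \arg\min_{\omega} \left\| \omega + \frac{g_k}{2\eta} \right\|_2^2, \quad \text{s.t. } \omega \ge 0,\ \sum_v \omega^{(v)} = 1, \tag{22}$$
where $g_k^{(v)} = \|Z_{k+1}^{(v)} - S_{k+1}\|_F^2$ forms the vector $g_k$. We adopt an
off-the-shelf quadratic programming solver to solve the above
problem.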
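Steps 4 and 5 can be sketched together. Eq. (20) is a weighted average of the view-specific matrices, and Eq. (22) is exactly the Euclidean projection of $-g_k/(2\eta)$ onto the probability simplex. The sketch below therefore swaps in a standard sort-based simplex projection in place of the generic quadratic programming solver mentioned above; this is our substitution, not the authors' implementation, and it follows Eqs. (21)-(22) as printed. Function names and toy data are assumptions.

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection of y onto {w : w >= 0, sum(w) = 1} (sort-based)."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    j = np.arange(1, len(y) + 1)
    rho = j[u + (1.0 - css) / j > 0][-1]
    theta = (css[rho - 1] - 1.0) / rho
    return np.maximum(y - theta, 0.0)

def update_S(Z_list, omega):
    """Eq. (20): weighted average of the view-specific representation matrices."""
    return sum(w * Z for w, Z in zip(omega, Z_list))

def update_omega(Z_list, S, eta):
    """Eq. (22): project -g / (2 * eta) onto the simplex."""
    g = np.array([np.sum((Z - S) ** 2) for Z in Z_list])
    return project_simplex(-g / (2.0 * eta)), g

rng = np.random.default_rng(0)
base = rng.standard_normal((4, 4))
# Toy views (assumption): the same matrix corrupted by increasing noise levels.
Z_list = [base + s * rng.standard_normal((4, 4)) for s in (0.1, 0.5, 1.0)]
omega0 = np.full(3, 1.0 / 3.0)            # uniform initial weights
S = update_S(Z_list, omega0)
omega, g = update_omega(Z_list, S, eta=5.0)
```

Views whose representation matrices are farther from the current affinity matrix receive smaller weights, and a larger `eta` pushes the weights toward the uniform distribution.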
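Step 6 Update $\Theta$, $\Pi$, and $\rho$: The Lagrangian multipliers $\Theta$, $\Pi$
and the penalty parameter $\rho$ are updated by
$$
\begin{aligned}
\Theta_{k+1} &= \Theta_k + \rho_k (\mathcal{I} - \mathcal{Z}_{k+1} - \mathcal{P}_{k+1}); \\
\Pi_{k+1} &= \Pi_k + \rho_k (\mathcal{Z}_{k+1} - \mathcal{Y}_{k+1}); \\
\rho_{k+1} &= \min\{\beta \rho_k, \rho_{\max}\},
\end{aligned}
\tag{23}
$$
where $\beta \in \left[0, \frac{\sqrt{5}+1}{2}\right]$ is a step length to update the penalty
parameter $\rho$ in each iteration [50]. $\rho_{\max}$ is the maximum value of
the penalty parameter $\rho$.
Algorithm 1: JLMVC for multi-view clustering
Input: multi-view features: $\{X^{(v)}\}$; parameters: $\alpha$, $\lambda$;
Initialize: $\mathcal{Y}_1$, $\mathcal{Z}_1$, $S_1$, $\Theta_1$, $\Pi_1$ initialized to 0; weight
$\omega_1^{(v)} = \frac{1}{V}$; $\eta = 500$, $\rho_1 = 10^{-3}$, $\beta = 1.5$,
tol $= 10^{-7}$, $k = 1$;
1: Calculate the v-th kernel matrix $K^{(v)}$ by Eq. (4)
($v = 1, \ldots, V$);
2: while not converged do
3: for $v = 1$ to $V$ do
4: Update $Z_{k+1}^{(v)}$ by Eq. (12);
5: Update $P_{k+1}^{(v)}$ by Eq. (17);
6: end for
7: Update $\mathcal{Y}_{k+1}$ by Eq. (14);
8: Update $S_{k+1}$ by Eq. (20);
9: Update $\omega_{k+1}$ by Eq. (22);
10: Update $\Theta_{k+1}$, $\Pi_{k+1}$, and $\rho_{k+1}$ by Eq. (23);
11: Check the convergence condition in Eq. (24);
12: end while
Output: Affinity matrix $S_{k+1}$.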
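The details of the proposed algorithm for solving the JLMVC
model are summarized in Algorithm 1. Algorithm 1 can be
terminated when the following convergence condition is satisfied:
$$
\max \left\{
\begin{array}{l}
\|I - Z_{k+1}^{(v)} - P_{k+1}^{(v)}\|_{\infty},\ v = 1, \ldots, V \\
\|\mathcal{Z}_{k+1} - \mathcal{Y}_{k+1}\|_{\infty}
\end{array}
\right\} \le \mathrm{tol}, \tag{24}
$$
where tol $> 0$ is a pre-defined tolerance.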
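The multiplier update of Eq. (23) and the stopping test of Eq. (24) can be sketched as below. Three points are assumptions on our part: the $\infty$-norm is read as the largest absolute entry, the identity in $\mathcal{P} = \mathcal{I} - \mathcal{Z}$ is applied to every frontal slice (as implied by $P^{(v)} = I - Z^{(v)}$), and the default `rho_max` is a placeholder because no value is given in this section.

```python
import numpy as np

def dual_update(Theta, Pi, Z, Y, P, rho, beta=1.5, rho_max=1e10):
    """One pass of Eq. (23) plus the residual used in Eq. (24).

    Z, Y, P, Theta, Pi are n x n x V arrays; rho_max is a hypothetical default.
    """
    n = Z.shape[0]
    I = np.eye(n)[:, :, None]            # identity broadcast over all V slices
    r1 = I - Z - P                       # residual of the constraint P = I - Z
    r2 = Z - Y                           # residual of the constraint Z = Y
    Theta_new = Theta + rho * r1
    Pi_new = Pi + rho * r2
    rho_new = min(beta * rho, rho_max)
    residual = max(np.abs(r1).max(), np.abs(r2).max())
    return Theta_new, Pi_new, rho_new, residual

rng = np.random.default_rng(0)
n, V = 4, 3
Z = rng.standard_normal((n, n, V))
P = np.eye(n)[:, :, None] - Z            # feasible P
Y = Z.copy()                             # feasible Y
Theta = rng.standard_normal((n, n, V))
Pi = rng.standard_normal((n, n, V))

Theta1, Pi1, rho1, res = dual_update(Theta, Pi, Z, Y, P, rho=1e-3)
converged = res <= 1e-7                  # tolerance as in Algorithm 1

# An infeasible point: Y deviates from Z by 0.5 in every entry.
_, Pi2, _, res2 = dual_update(Theta, Pi, Z, Y + 0.5, P, rho=1e-3)
```

At a feasible point both residuals vanish, so the multipliers stay unchanged and only the penalty parameter grows.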
Several notes regarding Algorithm 1 are given below to further
clarify the proposed JLMVC.
• The weights of different views are important to the
construction of the affinity matrix. An intuitive way to
initialize the weights of different views is to set each weight to
$\omega_1^{(v)} = \frac{1}{V}$. Then, the weights are updated in an adaptive
manner by Eq. (22). The other variables $\mathcal{Y}_1$, $\mathcal{Z}_1$, $S_1$, $\Theta_1$, $\Pi_1$ are
initialized to 0.
• Lines 3–6 of Algorithm 1 can be performed in parallel as
subproblems (11) and (16) are independent with respect to
$Z^{(v)}$ and $P^{(v)}$, respectively.
• After performing Algorithm 1, we can obtain the unified
affinity matrix $S$, which inherits the advantages of the
representation tensor $\mathcal{Z}$. Finally, the learned affinity matrix
$S$ serves as the input of the spectral clustering algorithm [9] to
yield the clustering results.
IV. EXPERIMENTAL RESULTS
In this section, we aim to evaluate the performance of JLMVC
on eight multimedia datasets. The model analysis is also re-
ported.
TABLE II
SUMMARY OF EIGHT MULTI-VIEW DATASETS
A. Experimental Settings
Our experiments select eight multimedia datasets for multi-
view clustering, including four face image datasets, two scene
datasets, one prokaryotic dataset, and one article dataset. A brief
description of these datasets is given in Table II. The
details of each dataset are as follows:
Dataset descriptions: Yale:¹ it consists of 165 gray-scale
images of 15 individuals with different facial expressions and
configurations. Following [6], [7], 4096d (dimension, d) Intensity,
3304d LBP, and 6750d Gabor are extracted as three
multi-view features; Extended YaleB:² it contains 2414 face
images of 38 individuals, each of which has 64 near frontal
images under different lighting conditions. Similar to [6], [7],
the first 10 classes are selected and three types of features,
including 2500d Intensity, 3304d LBP, and 6750d Gabor, are
extracted; ORL:³ it includes 400 face images with 40 clusters
under different times, lighting, facial expressions, and facial
details; Prokaryotic phyla: it contains 551 prokaryotic species
described by textual data and different genomic representations;
Wikipedia:⁴ it is an article dataset selected by Wikipedia
editors since 2009. Following [46], 693 documents with 2 views
are selected; COIL-20:⁵ it contains 1440 images of 20
object categories. Three view features including 1024d Intensity,
3304d LBP, and 6750d Gabor are employed; CMU-PIE:⁶
it consists of 5440 facial images of 68 subjects. Each image
is of size 64 × 64 with a large variance. Following [51], three
types of features including 1024d Intensity, 256d LBP, and 496d
HOG are used; Scene-15 [52]: it contains 4485 outdoor and
indoor scene images from 15 categories. Following [7], three kinds
of image features, including 1800d PHOW, 1180d PRI-CoLBP,
and 1240d CENTRIST, are extracted to represent Scene-15.
¹ http://cvc.yale.edu/projects/yalefaces/yalefaces.html
² http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html
³ http://www.uk.research.att.com/facedatabase.html
⁴ http://lig-membres.imag.fr/grimal/data.html
⁵ http://www.cs.columbia.edu/CAVE/software/softlib/
⁶ http://vasc.ri.cmu.edu/idb/html/face/
Baselines: Our proposed JLMVC is compared with twelve
state-of-the-art single-view and multi-view clustering methods.
The competing methods are listed as follows: SSCbest [21]:
single-view clustering using the sparse regularizer (l1-norm) to
construct the representation matrix; LRRbest [20]: single-view
clustering using the nuclear norm to construct the representation
matrix; MLAP [35]: multi-view clustering by concatenating
representation matrices of different views and imposing a
low-rank constraint to explore the complementarity; DiMSC [53]:
multi-view clustering with the Hilbert-Schmidt Independence
criterion; LT-MSC [6]: multi-view clustering with the low-rank
tensor constraint; MLAN [16]: multi-view clustering with adaptive
neighbors; ECMSC [24]: multi-view clustering by simultaneously
exploiting the representation exclusivity and indicator
consistency; t-SVD-MSC [7]: multi-view clustering via tensor
multi-rank minimization; HLR-M2VS [8]: multi-view clustering
via hyper-Laplacian regularized multilinear multiview
self-representations; Kt-SVD-MSC [44]: multi-view clustering
via robust kernelized multi-view self-representations;
DMF-MVC [31]: multi-view clustering via deep matrix factorization;
AWP [54]: multi-view clustering via adaptively weighted
procrustes.
Specifically, SSCbest and LRRbest are two representative
baselines for single-view clustering. Others are the multi-view
clustering baselines. LT-MSC, t-SVD-MSC, HLR-M2VS, and
Kt-SVD-MSC are low-rank tensor representation-based multi-view
clustering approaches. Kt-SVD-MSC is the kernelized version
of t-SVD-MSC. MLAN is a graph-based multi-view clustering
method. The source codes of all competing methods were
downloaded from the authors’ homepages. For single-view clustering
methods, we perform SSC and LRR on each feature matrix
independently and report the best clustering results. For the multi-view
clustering ones, LT-MSC, t-SVD-MSC, HLR-M2VS, and
Kt-SVD-MSC first learn the representation tensor Z
and then construct the affinity matrix S by averaging the
frontal slices of Z, that is,
$S = \frac{1}{V}\sum_{v=1}^{V}\left(|Z^{(v)}| + |Z^{(v)T}|\right)$.
This means that they obtain the affinity matrix in two separate
steps. After that, the spectral clustering algorithm [9]
is carried out to obtain the final clustering results. For fair com-
parison, our experiments follow the same parameter settings of
the original papers. For SSC and LRR, we select the regulariza-
tion parameter from the interval [0.01,10]; for MLAP, two free
parameters are searched from 0.001 to 1; for DiMSC, two free
parameters are chosen from [0.01, 0.03] and [20:20:180], respectively;
the trade-off parameter of LT-MSC is selected from
0.01 to 100; for MLAN, one parameter is set to a random number
between 1 and 30; three free parameters of ECMSC are set in
[0.1,1], [0.1,1], and 1.2, respectively; the trade-off parameters of
t-SVD-MSC and Kt-SVD-MSC are set within the range [0.1,2]
and [0.001,0.6], respectively; for HLR-M2VS, two parameters
are located within the ranges [0.01, 0.2] and [0.1, 0.9], respectively;
DMF-MVC adopts {[100, 50], [500, 50], [500, 200]} as
the sizes of the last layer, and other parameters use the default
settings as recommended in [31]; AWP is parameter-free.
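The two-step affinity construction used for the tensor-based baselines, in which the frontal slices |Z(v)| + |Z(v)T| are averaged over the V views, can be sketched in a few lines. This is a minimal NumPy illustration rather than the authors' MATLAB code; the function name `affinity_from_slices` and the (V, n, n) array layout are assumptions made for the example.

```python
import numpy as np

def affinity_from_slices(Z):
    """Average |Z_v| + |Z_v^T| over the V frontal slices of Z.

    Z is assumed to be stored as an array of shape (V, n, n),
    one n x n representation matrix per view.
    """
    Z = np.asarray(Z, dtype=float)
    if Z.ndim != 3 or Z.shape[1] != Z.shape[2]:
        raise ValueError("expected an array of shape (V, n, n)")
    abs_z = np.abs(Z)
    # Adding the transpose of every slice makes the result symmetric;
    # the mean over axis 0 supplies the 1/V factor.
    return (abs_z + abs_z.transpose(0, 2, 1)).mean(axis=0)

# Toy example with V = 2 views and n = 2 samples.
Z_toy = np.array([[[1.0, -2.0], [0.0, 3.0]],
                  [[-1.0, 0.0], [4.0, 1.0]]])
S_toy = affinity_from_slices(Z_toy)
```

The resulting symmetric, nonnegative matrix is the kind of input that a spectral clustering routine accepting a precomputed affinity would take as the second step.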
Evaluation metrics: Six widely used metrics are selected to
evaluate the clustering quality: accuracy (ACC), normalized
mutual information (NMI), adjusted Rand index (AR),
F-score, Precision, and Recall. For each metric, a
higher value indicates better clustering performance. For all
methods except MLAN, the spectral clustering algorithm uses the
K-means algorithm to obtain the indicator matrix,
and different initializations may yield different clustering
results. Thus, we run 10 trials for each experiment on all
datasets and report the average performance with standard
deviations. Although MLAN does not use the K-means algorithm,
it involves one random parameter. Thus, we also run the MLAN
algorithm for 10 trials.
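Four of the six metrics can be computed from a contingency table of ground-truth classes against predicted clusters. The sketch below assumes the pair-counting definitions of Precision, Recall, and F-score that are common in the subspace clustering literature, since the text does not spell out the formulas. The function name `pair_counting_scores` is hypothetical, and ACC and NMI are not covered.

```python
import numpy as np

def pair_counting_scores(y_true, y_pred):
    """Pairwise precision, recall, F-score and adjusted Rand index.

    Assumes at least two samples. Labels may be arbitrary integers.
    """
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    n = y_true.size
    # Contingency table: rows are ground-truth classes, columns are clusters.
    _, t_idx = np.unique(y_true, return_inverse=True)
    _, p_idx = np.unique(y_pred, return_inverse=True)
    table = np.zeros((t_idx.max() + 1, p_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (t_idx, p_idx), 1)

    def comb2(x):
        # Number of unordered pairs that can be drawn from x items.
        return x * (x - 1) / 2.0

    same_both = comb2(table).sum()              # together in both partitions
    same_true = comb2(table.sum(axis=1)).sum()  # together in the ground truth
    same_pred = comb2(table.sum(axis=0)).sum()  # together in the prediction
    total = comb2(n)

    precision = same_both / same_pred if same_pred > 0 else 0.0
    recall = same_both / same_true if same_true > 0 else 0.0
    denom = precision + recall
    f_score = 2.0 * precision * recall / denom if denom > 0 else 0.0

    expected = same_true * same_pred / total
    max_index = 0.5 * (same_true + same_pred)
    if max_index == expected:
        ari = 1.0  # degenerate case, e.g. both partitions are trivial
    else:
        ari = (same_both - expected) / (max_index - expected)
    return precision, recall, f_score, ari

# A relabelled but otherwise perfect clustering scores 1 on every metric.
perfect = pair_counting_scores([0, 0, 1, 1], [1, 1, 0, 0])
# A clustering that splits every class evenly.
crossed = pair_counting_scores([0, 0, 1, 1], [0, 1, 0, 1])
```

Because all four scores depend only on which pairs of samples share a label, they are invariant to how the clusters are numbered, which is why no label matching step is needed here.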
TABLE III
CLUSTERING RESULTS (MEAN ± STANDARD DEVIATION) ON THREE FACE IMAGE DATASETS
B. Clustering Performance Comparison
The clustering performance comparisons on all multimedia
datasets are reported in Tables III, IV, and V. The best results
are highlighted in bold and the second-best ones are underlined
in each table. From the results in these tables, we reach the
following conclusions:
• Generally speaking, the proposed JLMVC achieves the
best results on all datasets except ORL, where
JLMVC is the second best. These results verify the
validity of the proposed JLMVC. This is mainly because
JLMVC takes three aspects into account in one unified
model: (1) the high correlation between the representation
tensor and affinity matrix; (2) the nonlinear structures in real
applications; and (3) the different contributions of each view to
the construction of the unified affinity matrix. (More
details can be found in Section IV-C-(3).) Taking the
Extended YaleB data as an example, the proposed JLMVC
improves by around 1.4%, 0.4%, 2.1%, 1.7%, 1.6%, and 1.8%
with respect to the six measures over the second-best method,
Kt-SVD-MSC, which also exploits the kernel trick to solve
the nonlinear subspaces problem but learns the representation
tensor and affinity matrix in two separate steps;
• The low-rank tensor representation-based MVC methods
(LT-MSC, t-SVD-MSC, HLR-M2VS, Kt-SVD-MSC, and
the proposed JLMVC) show better results than all single-view
clustering methods (SSC and LRR) in most cases.
This is mostly because different features characterize
different and partly independent information of the
datasets. LRR and SSC exploit only partial information,
leading to unsatisfactory results, especially when the multi-view
features are heterogeneous. In contrast, the low-rank
tensor representation-based MVC methods can effectively explore the
high-order correlations underlying multi-view features;
TABLE IV
CLUSTERING RESULTS (MEAN ± STANDARD DEVIATION) ON WIKIPEDIA AND PROKARYOTIC
DMF-MVC crashed on these two databases.
• The graph-based multi-view clustering method, MLAN,
obtains unstable results. On the Prokaryotic data, MLAN
achieves performance similar to that of our JLMVC. However,
it performs worse than the single-view clustering
methods on other datasets. The reason may be that
graph-based clustering approaches usually construct the
affinity matrix on the raw multimedia features, which may
be corrupted by noise and outliers;
• On the ORL data, HLR-M2VS achieves better results than the
proposed JLMVC. The reason is that the manifold
regularization may preserve the local geometrical
structure of the ORL data better than the kernel trick when
handling nonlinearity. However, HLR-M2VS is less robust on the
Yale and Extended YaleB datasets. Specifically, in terms
of ACC and NMI, the leading margins of our JLMVC over
HLR-M2VS are 24.0% and 19.4% on Extended YaleB,
respectively. On Yale, the improvements of JLMVC are 24.4%
and 21.0%, respectively. Similar observations can be
made on the Scene-15 and Prokaryotic datasets. This
indicates that, compared to the manifold-based methods, the
kernel-based methods may be a better way to handle
nonlinear subspaces;
• The performance of MLAP degrades sharply on the
Extended YaleB data, where it is even worse than that
of the single-view clustering methods, i.e., LRR and SSC.
However, it performs better than them on other datasets.
As stated in [7], the LBP and Gabor features yield less
discriminative representations than the intensity feature due
to large variations of illumination, as shown in the first
group of Fig. 4. This indicates that simply concatenating
all features may fail to obtain a good affinity matrix to
describe the relationship among all samples, especially when
the features are heterogeneous. This directly motivates
our model to consider the different contributions of
different features when constructing the affinity matrix.
C. Model Analysis
In this section, we aim to give a comprehensive analysis of the
proposed JLMVC in Eq. (7), including the parameter analysis,
convergence analysis, and runtime.
1) Parameter Analysis: There are three parameters, i.e.,
α, λ, and η, in the proposed JLMVC. In all experiments, we set
η = 500. Thus, two free parameters need to
be tuned. Specifically, α and λ are used to balance the
contributions of the low-rank tensor term, noise term, and
consensus term. For example, when the noise level of the features is
high, a large value of α may be selected. α and λ are selected
from the candidate sets {0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7} and
{0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 1}, respectively.
Here, the Yale and Extended YaleB datasets are selected as two
examples. Fig. 5 shows the ACC and NMI values with respect
to different combinations of α and λ. From this figure, we can
observe that JLMVC achieves the best results when α is set to a
relatively large value. An intuitive interpretation is that there
are large variations of illumination in the Extended YaleB data.
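The parameter sweep described in this paragraph amounts to an exhaustive search over 8 × 10 = 80 combinations of α and λ with η fixed. The following is a minimal sketch of such a sweep. The callable `evaluate` is hypothetical and stands for one full clustering run that returns a score such as ACC; `toy_score` is only a stand-in so that the example runs.

```python
from itertools import product

# Candidate values for alpha and lambda as listed in the text.
ALPHAS = [0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7]
LAMBDAS = [0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 1]

def grid_search(evaluate, alphas=ALPHAS, lambdas=LAMBDAS):
    """Return (best_score, best_alpha, best_lambda) over the full grid.

    `evaluate(alpha, lam)` is a user-supplied callable (hypothetical here)
    that runs the clustering method once and returns a score to maximize.
    """
    best = None
    for alpha, lam in product(alphas, lambdas):
        score = evaluate(alpha, lam)
        if best is None or score > best[0]:
            best = (score, alpha, lam)
    return best

def toy_score(alpha, lam):
    # Stand-in objective peaking at alpha = 0.5, lambda = 0.1.
    # Replace with a real clustering run and its ACC or NMI.
    return -((alpha - 0.5) ** 2 + (lam - 0.1) ** 2)

best_score, best_alpha, best_lam = grid_search(toy_score)
```

Since each clustering run involves random K-means initializations, a real `evaluate` would typically average the score over several trials before comparing grid points.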
TABLE V
CLUSTERING RESULTS (MEAN ± STANDARD DEVIATION) ON COIL-20, CMU-PIE, AND SCENE-15
2) Computation Complexity and Empirical Convergence
Analysis: The proposed JLMVC consists of six subproblems.
The main computational cost of JLMVC lies in updating Y
and P, since updating the other variables involves only matrix
addition and scalar-matrix multiplication. The total computation
complexity of the Y subproblem is $O(2Vn^2\log(n) + V^2n^2)$,
since it needs to compute the FFT, inverse FFT, and singular
value decomposition. Updating P involves V independent
subproblems, as shown in Eq. (16). Each subproblem
takes $O(rn^2)$ for the vector-matrix multiplication, where r is
the rank of $K^{(v)}$. Thus, the computation complexity of JLMVC
is $O(2Vn^2\log(n) + V^2n^2 + Vrn^2)$.
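To see where the two terms of the Y-subproblem cost come from, the following sketch applies singular value shrinkage to the frontal slices of a tensor in the Fourier domain, which is the usual way a t-SVD-based tensor nuclear norm step is computed. It is an illustration under assumptions, not the authors' update rule: the function name `tsvd_shrink`, the n × V × n tensor layout, and the unscaled threshold `tau` are all choices made for the example, and some definitions of the tensor nuclear norm rescale the threshold by the length of the third mode.

```python
import numpy as np

def tsvd_shrink(Y, tau):
    """Shrink the singular values of each Fourier-domain frontal slice.

    Y has shape (n1, n2, n3). For an n x V x n tensor, the FFT and
    inverse FFT along the third mode cost about 2 * V * n^2 * log(n),
    and the n SVDs of n x V slices cost about V^2 * n^2 in total.
    """
    Y = np.asarray(Y, dtype=float)
    Yf = np.fft.fft(Y, axis=2)            # FFT along the third mode
    out = np.empty_like(Yf)
    for k in range(Y.shape[2]):
        U, s, Vh = np.linalg.svd(Yf[:, :, k], full_matrices=False)
        s = np.maximum(s - tau, 0.0)      # soft-threshold the singular values
        out[:, :, k] = (U * s) @ Vh       # U @ diag(s) @ Vh
    # For real input the imaginary part is rounding noise only.
    return np.real(np.fft.ifft(out, axis=2))

rng = np.random.default_rng(0)
Y_toy = rng.standard_normal((5, 3, 5))    # n = 5 samples, V = 3 views
Y_low = tsvd_shrink(Y_toy, 0.5)
```

With `tau = 0` the function returns its input unchanged, and a very large `tau` returns the zero tensor, which gives two quick sanity checks for any implementation of this step.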
The empirical convergence of JLMVC on the Extended YaleB
dataset is shown in Fig. 6. The x-axis denotes the number of
iterations, while the y-axis represents the errors defined in Eq.
(24). We can see that, after several iterations, the errors drop
quickly to a stable value. In all experiments, the proposed
JLMVC can reach the smallest residual within 50 iterations. To
further investigate the empirical convergence of JLMVC, Fig. 7
reports the ACC and NMI values with respect to iterations on the
Extended YaleB dataset. During the first 10 iterations, JLMVC
does not reach a meaningful accuracy. After that, JLMVC
achieves promising ACC and NMI values higher than those of all
competing methods except Kt-SVD-MSC. This indicates that
JLMVC reaches competitive clustering quality within a small
number of iterations.
Fig. 4. ACC and NMI values of LRR with all features on (1) Extended YaleB,
(2) Yale, and (3) ORL datasets.
TABLE VI
PERFORMANCE (ACC/NMI) OF JLMVC AND ITS VARIANTS ON DIFFERENT DATASETS
Fig. 5. ACC and NMI values of JLMVC with different combinations of α and
λ on Yale (top two figures) and Extended YaleB (bottom two figures) datasets.
Fig. 6. Empirical convergence versus iterations on Extended YaleB data.
Fig. 7. ACC and NMI values versus iterations on Extended YaleB.
3) The Effect of Z and S: The proposed JLMVC achieves
the joint learning of the representation tensor Z and affinity
matrix S. However, most existing MVC methods follow two
separate steps to construct Z and S. To investigate the effect of
learning Z and S jointly, we perform a test by setting λ = 0. In this
test, we simply obtain Z and then construct
$S = \frac{1}{V}\sum_{v=1}^{V}\left(|Z^{(v)}| + |Z^{(v)T}|\right)$.
This simple variant of JLMVC is denoted as JLMVC-Z.
Table VI reports the clustering results of JLMVC and JLMVC-Z. It is
easy to see that JLMVC achieves superior clustering results over
JLMVC-Z in all cases. The average improvement of JLMVC
over JLMVC-Z is around 17.06% and 16.23% with respect to
ACC and NMI, respectively, indicating that constructing Z
and S simultaneously can boost the clustering performance.
4) Ablation Study on the Kernel Trick: To investigate the ef-
fect of the kernel trick, we also carry out the model in Eq. (3),
denoted as JLMVC-nk. Like JLMVC, JLMVC-nk learns the
representation tensor and affinity matrix simultaneously, but without
the kernel trick. This means that the affinity matrix is constructed
from the original multimedia data (usually nonlinearly
separable). The ACC and NMI values of JLMVC-nk are reported
in the last row of Table VI. One can see that JLMVC achieves
better clustering results than JLMVC-nk in all cases. A typical
example is the Extended YaleB dataset whose multiple features
are diverse as shown in Fig. 4. This indicates that the kernel trick
can handle the nonlinearity and boost the multi-view clustering
performance.
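As a concrete picture of what the kernel trick replaces, the sketch below builds a kernel matrix for one view from its raw feature matrix. The choice of a Gaussian kernel and the bandwidth heuristic are assumptions made for illustration, since this section does not state which kernel is used; the function name `gaussian_kernel` is likewise hypothetical.

```python
import numpy as np

def gaussian_kernel(X, sigma=None):
    """Gaussian kernel matrix for samples stored as the rows of X (n x d)."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances via ||x||^2 + ||y||^2 - 2 <x, y>.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from rounding
    if sigma is None:
        # Heuristic bandwidth (an assumption): root mean squared distance.
        sigma = float(np.sqrt(sq_dists.mean()))
        if sigma == 0.0:
            sigma = 1.0
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Two samples at Euclidean distance 5, with bandwidth 5.
X_toy = np.array([[0.0, 0.0], [3.0, 4.0]])
K_toy = gaussian_kernel(X_toy, sigma=5.0)
```

Applying such a function to each view's feature matrix yields one n × n kernel matrix per view, so the later steps work with pairwise similarities in the feature space instead of the raw features.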
5) Runtime: Since the computation time of a method is also
an evaluation factor, we give a runtime comparison of the pro-
posed JLMVC and several competitors. Table VII reports the
runtime comparison results. All experiments are implemented
in MATLAB 2016a on a workstation with a 3.50 GHz CPU and 16 GB
RAM. From Table VII, the methods with the average time from
low to high are MLAN, t-SVD-MSC, HLR-M2VS, JLMVC,
LT-MSC, DiMSC, MLAP, and Kt-SVD-MSC. MLAN costs the
shortest processing time and the proposed JLMVC belongs to
the middle-ranking group. All methods except MLAN must
compute singular value decompositions and matrix inversions,
which leads to a high computational cost. Although MLAN is the
most efficient method, its performance is unstable. The reason is
that MLAN uses the raw data to learn the similarity matrix and
the raw data are easily contaminated by noise. Other methods
impose the low-rank constraint on the representation matrix (or
tensor) and use the sparse regularizer to remove noise. They can
construct a reliable similarity matrix.
TABLE VII
AVERAGE RUNNING TIME (IN SECONDS) ON ALL DATABASES
V. CONCLUSIONS
In this paper, we proposed a novel method called JLMVC to
solve the multi-view clustering problem, based on the low-rank
tensor representation and “kernel trick”. In JLMVC, instead of
capturing a low-rank representation matrix among all views, the
tensor singular value decomposition-based tensor nuclear norm
was used to learn the representation tensor so as to explore the
high order correlations among different views. Using the kernel
trick, the original multimedia data were implicitly mapped from
the input data space into a new feature space to overcome the dif-
ficulty of nonlinearity in real applications. To make full use of the
high correlation between the representation tensor and affinity
matrix, the proposed JLMVC achieved the joint learning of the
representation tensor and affinity matrix. Thus, the learned affinity
matrix has the potential to boost the clustering performance,
as demonstrated by extensive experiments on eight multimedia
datasets. Our future work will design a fast and efficient
multi-view clustering method. One possible solution is using the
Frank-Wolfe algorithm to reduce the computation complexity of
the singular value decomposition.
ACKNOWLEDGMENT
The authors would like to thank the editors and the anony-
mous reviewers for their constructive comments, which helped
to improve the quality of this article. The authors wish to grate-
fully acknowledge Prof. C. Zhang from Tianjin University and
Prof. Y. Xie from East China Normal University for sharing
multi-view datasets and codes.
REFERENCES
[1] S. Yang et al., “SkeletonNet: A hybrid network with a skeleton-embedding
process for multi-view image representation learning,” IEEE Trans. Mul-
timedia, vol. 21, no. 11, pp. 2916–2929, Nov. 2019.
[2] Z. Zhang, Y. Xie, W. Zhang, and Q. Tian, “Effective image retrieval via
multilinear multi-index fusion,” IEEE Trans. Multimedia, vol. 21, no. 11,
pp. 2878–2890, Nov. 2019.
[3] S. K. Kuanar, K. B. Ranga, and A. S. Chowdhury, “Multi-view video
summarization using bipartite matching constrained optimum-path for-
est clustering,” IEEE Trans. Multimedia, vol. 17, no. 8, pp. 1166–1173,
Aug. 2015.
[4] X. Wu, C.-W. Ngo, and A. G. Hauptmann, “Multimodal news story clus-
tering with pairwise visual near-duplicate constraint,” IEEE Trans. Multi-
media, vol. 10, no. 2, pp. 188–199, Feb. 2008.
[5] C. Tang et al., “Learning a joint affinity graph for multiview subspace clustering,”
IEEE Trans. Multimedia, vol. 21, no. 7, pp. 1724–1736, Jul. 2019.
[6] C. Zhang, H. Fu, S. Liu, G. Liu, and X. Cao, “Low-rank tensor constrained
multiview subspace clustering,” in Proc. IEEE Int. Conf. Comput. Vision,
2015, pp. 1582–1590.
[7] Y. Xie et al., “On unifying multi-view self-representations for clustering by
tensor multi-rank minimization,” Int. J. Comput. Vision, vol. 126, no. 11,
pp. 1157–1179, 2018.
[8] Y. Xie, W. Zhang, Y. Qu, L. Dai, and D. Tao, “Hyper-Laplacian regular-
ized multilinear multiview self-representations for clustering and semisu-
pervised learning,” IEEE Trans. Cybern., 2018, to be published.
[9] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis
and an algorithm,” in Proc. Neural Inf. Process. Syst., 2002, pp. 849–856.
[10] X. Guo, “Robust subspace segmentation by simultaneously learning data
representations and their affinity matrix,” in Proc. Joint Conf. Artif. Intell.,
2015, pp. 3547–3553.
[11] X. Peng, Z. Yu, Z. Yi, and H. Tang, “Constructing the l2-graph for robust
subspace learning and subspace clustering,” IEEE Trans. Cybern., vol. 47,
no. 4, pp. 1053–1066, Apr. 2017.
[12] F. Nie, X. Wang, and H. Huang, “Clustering and projected clustering with
adaptive neighbors,” in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Dis-
covery Data Mining, 2014, pp. 977–986.
[13] F. Nie et al., “New l1-norm relaxations and optimizations for graph clustering,”
in Proc. AAAI Conf. Artif. Intell., 2016, pp. 1962–1968.
[14] F. Nie, X. Wang, M. I. Jordan, and H. Huang, “The constrained Laplacian
rank algorithm for graph-based clustering,” in Proc. AAAI Conf. Artif.
Intell., 2016, pp. 1969–1976.
[15] K. Zhan, C. Zhang, J. Guan, and J. Wang, “Graph learning for multi-
view clustering,” IEEE Trans. Cybern., vol. 48, no. 10, pp. 2887–2895,
Oct. 2017.
[16] F. Nie, G. Cai, J. Li, and X. Li, “Auto-weighted multi-view learning for
image clustering and semi-supervised classification,” IEEE Trans. Image
Process., vol. 27, no. 3, pp. 1501–1511, Mar. 2018.
[17] F. Nie, G. Cai, and X. Li, “Multi-view clustering and semi-supervised
classification with adaptive neighbours,” in Proc. AAAI Conf. Artif. Intell.,
2017, pp. 2408–2414.
[18] F. Nie et al., “Self-weighted multiview clustering with multiple graphs,”
in Proc. Joint Conf. Artif. Intell., 2017, pp. 2564–2570.
[19] H. Wang, Y. Yang, and B. Liu, “GMC: Graph-based multi-view clustering,”
IEEE Trans. Knowl. Data Eng., 2019, to be published.
[20] G. Liu et al., “Robust recovery of subspace structures by low-rank rep-
resentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1,
pp. 171–184, Jan. 2013.
[21] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory,
and applications,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11,
pp. 2765–2781, Nov. 2013.
[22] C. Lu, J. Feng, Z. Lin, T. Mei, and S. Yan, “Subspace clustering by block
diagonal representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41,
no. 2, pp. 487–501, Feb. 2019.
[23] Y. Wang, L. Wu, X. Lin, and J. Gao, “Multiview spectral clustering
via structured low-rank matrix factorization,” IEEE Trans. Neural Netw.
Learn. Syst., vol. 29, no. 10, pp. 4833–4843, Oct. 2018.
[24] X. Wang, X. Guo, Z. Lei, C. Zhang, and S. Z. Li, “Exclusivity-consistency
regularized multi-view subspace clustering,” in Proc. IEEE Conf. Comput.
Vision Pattern Recognit., 2017, pp. 923–931.
[25] Z. Kang, H. Pan, S. C. H. Hoi, and Z. Xu, “Robust graph learning from
noisy data,” IEEE Trans. Cybern., 2019, to be published.
[26] M. Yin, S. Xie, Z. Wu, Y. Zhang, and J. Gao, “Subspace clustering via
learning an adaptive low-rank graph,” IEEE Trans. Image Process., vol. 27,
no. 8, pp. 3716–3728, Aug. 2018.
[27] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduc-
tion and data representation,” Neural Comput., vol. 15, no. 6, pp. 1373–
1396, 2003.
[28] L. Zhuang et al., “Constructing a nonnegative low-rank and sparse graph
with data-adaptive features,” IEEE Trans. Image Process., vol. 24, no. 11,
pp. 3717–3728, Nov. 2015.
[29] S. Yi et al., “Dual pursuit for subspace learning,” IEEE Trans. Multimedia,
vol. 21, no. 6, pp. 1399–1411, Jun. 2019.
[30] P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid, “Deep subspace clustering
networks,” in Proc. Neural Inf. Process. Syst., 2017, pp. 24–33.
[31] H. Zhao, Z. Ding, and Y. Fu, “Multi-view clustering via deep matrix fac-
torization,” in Proc. AAAI Conf. Artif. Intell., 2017, pp. 2921–2927.
[32] Z. Zhang et al., “Highly-economized multi-view binary compression for
scalable image clustering,” in Proc. Eur. Conf. Comput. Vision, 2018,
pp. 717–732.
[33] G. Chao et al., “Multi-view cluster analysis with incomplete data to un-
derstand treatment effects,” Inf. Sci., vol. 494, pp. 278–293, 2019.
[34] G. Chao, S. Sun, and J. Bi, “A survey on multi-view clustering,” 2017,
arXiv:1712.06246.
[35] B. Cheng, G. Liu, J. Wang, Z. Huang, and S. Yan, “Multi-task low-rank
affinity pursuit for image segmentation,” in Proc. IEEE Int. Conf. Comput.
Vision, 2011, pp. 2439–2446.
[36] R. Xia, Y. Pan, L. Du, and J. Yin, “Robust multi-view spectral clustering
via low-rank and sparse decomposition,” in Proc. AAAI Conf. Artif. Intell.,
2014, pp. 2149–2155.
[37] C. Zhang, Q. Hu, H. Fu, P. Zhu, and X. Cao, “Latent multi-view subspace
clustering,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2017,
pp. 4279–4287.
[38] S. Luo, C. Zhang, W. Zhang, and X. Cao, “Consistent and specific
multi-view subspace clustering,” in Proc. AAAI Conf. Artif. Intell., 2018,
pp. 3730–3713.
[39] C. Zhang et al., “Generalized latent multi-view subspace clustering,” IEEE
Trans. Pattern Anal. Mach. Intell., 2018, to be published.
[40] J. Liu, C. Wang, J. Gao, and J. Han, “Multi-view clustering via joint non-
negative matrix factorization,” in Proc. SIAM Int. Conf. Data Min., 2013,
pp. 252–260.
[41] M. Yin, J. Gao, S. Xie, and Y. Guo, “Multiview subspace clustering via
tensorial t-product representation,” IEEE Trans. Neural Netw. Learn. Syst.,
vol. 30, no. 3, pp. 851–864, Mar. 2019.
[42] V. M. Patel and R. Vidal, “Kernel sparse subspace clustering,” in Proc.
IEEE Int. Conf. Image Process., 2014, pp. 2849–2853.
[43] S. Xiao, M. Tan, D. Xu, and Z. Y. Dong, “Robust kernel low-rank rep-
resentation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 11,
pp. 2268–2281, Nov. 2016.
[44] Y. Qu, J. Liu, Y. Xie, and W. Zhang, “Robust kernelized multi-view self-
representations for clustering by tensor multi-rank minimization,” 2017,
arXiv:1709.05083.
[45] M. E. Kilmer and C. D. Martin, “Factorization strategies for third-order
tensors,” Linear Algebra Appl., vol. 435, no. 3, pp. 641–658, 2011.
[46] H. Wang, Y. Yang, and T. Li, “Multi-view clustering via concept factor-
ization with local manifold regularization,” in Proc. IEEE Int. Conf. Data
Mining, 2016, pp. 1245–1250.
[47] Y. Chen et al., “Denoising of hyperspectral images using nonconvex low
rank matrix approximation,” IEEE Trans. Geosci. Remote Sens., vol. 55,
no. 9, pp. 5366–5380, Sep. 2017.
[48] Y. Chen, S. Wang, and Y. Zhou, “Tensor nuclear norm-based low-rank
approximation with total variation regularization,” IEEE J. Sel. Topics
Signal Process., vol. 12, no. 6, pp. 1364–1377, Dec. 2018.
[49] W. Hu, D. Tao, W. Zhang, Y. Xie, and Y. Yang, “The twist tensor nu-
clear norm for video completion,” IEEE Trans. Neural Netw. Learn. Syst.,
vol. 28, no. 12, pp. 2961–2973, Dec. 2017.
[50] Y. Chen, Y. Wang, M. Li, and G. He, “Augmented Lagrangian alternating
direction method for low-rank minimization via non-convex approxima-
tion,” Signal, Image Video Process., vol. 11, no. 7, pp. 1271–1278, 2017.
[51] T. Zhou, C. Zhang, C. Gong, H. Bhaskar, and J. Yang, “Multiview la-
tent space learning with feature redundancy minimization,” IEEE Trans.
Cybern., 2018, to be published.
[52] L. Fei-Fei and P. Perona, “A Bayesian hierarchical model for learning nat-
ural scene categories,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision
Pattern Recognit., 2005, vol. 2, pp. 524–531.
[53] X. Cao, C. Zhang, H. Fu, S. Liu, and H. Zhang, “Diversity-induced multi-
view subspace clustering,” in Proc. IEEE Conf. Comput. Vision Pattern
Recognit., 2015, pp. 586–594.
[54] F. Nie, L. Tian, and X. Li, “Multiview clustering via adaptively weighted
procrustes,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data
Mining, 2018, pp. 2022–2030.
Yongyong Chen received the B.S. and M.S. degrees
in the College of Mathematics and Systems Science,
Shandong University of Science and Technology,
Qingdao, China, and visited the National Key Lab
for Novel Software Technology, Nanjing University,
Nanjing, China, as an exchange student in 2017. He is
currently working toward the Ph.D. degree with the
Department of Computer and Information Science,
University of Macau, Macau, China. His research in-
terests include (non-convex) low-rank and sparse ma-
trix/tensor decomposition models, with applications
to image processing, data mining, and computer vision.
Xiaolin Xiao received the B.E. degree from Wuhan
University, Wuhan, China, in 2013, and the Ph.D. de-
gree from the University of Macau, Macau, China, in
2019. She is currently a Postdoctoral Fellow with the
School of Computer Science and Engineering, South
China University of Technology, Guangzhou, China.
Her research interests include superpixel segmenta-
tion, saliency detection, and color image processing
and understanding.
Yicong Zhou (M’07–SM’14) received the B.S. de-
gree in electrical engineering from Hunan University,
Changsha, China, and the M.S. and Ph.D. de-
grees in electrical engineering from Tufts University,
Medford, MA, USA. He is an Associate Professor
and the Director of the Vision and Image Processing
Laboratory, Department of Computer and Informa-
tion Science, University of Macau, Macau, China. His
research interests include image processing and un-
derstanding, computer vision, machine learning, and
multimedia security. Dr. Zhou is a Senior Member
of the International Society for Optical Engineering. He was a recipient of the
Third Prize of Macau Natural Science Award in 2014. He is the Co-Chair of
Technical Committee on Cognitive Computing in the IEEE Systems, Man, and
Cybernetics Society. He is an Associate Editor for the IEEE TRANSACTIONS
ON NEURAL NETWORKS AND LEARNING SYSTEMS, IEEE TRANSACTIONS ON
CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, IEEE TRANSACTIONS ON
GEOSCIENCE AND REMOTE SENSING, and four other journals.