Jointly Learning Kernel Representation Tensor and Affinity Matrix for Multi-View Clustering

November 2019
IEEE Transactions on Multimedia PP(99):1-1

DOI:10.1109/TMM.2019.2952984

Authors:

Yongyong Chen

Harbin Institute of Technology Shenzhen Graduate School

Xiaolin Xiao

University of Macau

Yicong Zhou

University of Macau

Multi-view clustering refers to the task of partitioning numerous unlabeled data into several distinct clusters using multiple features. In this paper, we propose a novel nonlinear method called joint learning multi-view clustering (JLMVC) to jointly learn the kernel representation tensor and affinity matrix. The proposed JLMVC has three advantages: (1) unlike existing low-rank representation-based multi-view clustering methods that learn the representation tensor and affinity matrix in two separate steps, JLMVC learns both jointly in a single step such that their correlations can be well preserved; (2) using the “kernel trick”, JLMVC can handle nonlinear data structures for various real applications; and (3) different from most existing methods that treat representations of all views equally, JLMVC automatically learns a reasonable weight for each view. Based on the alternating direction method of multipliers, an effective algorithm is designed to solve the proposed model. Extensive experiments on eight multimedia datasets demonstrate the superiority of the proposed JLMVC over state-of-the-art methods.

IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 22, NO. 8, AUGUST 2020 1985

Jointly Learning Kernel Representation Tensor and

Afﬁnity Matrix for Multi-View Clustering

Yongyong Chen , Xiaolin Xiao , and Yicong Zhou , Senior Member, IEEE

Abstract—Multi-view clustering refers to the task of partitioning

numerous unlabeled multimedia data into several distinct clusters

using multiple features. In this paper, we propose a novel nonlinear

method called joint learning multi-view clustering (JLMVC) to

jointly learn kernel representation tensor and afﬁnity matrix.

The proposed JLMVC has three advantages: (1) unlike existing

low-rank representation-based multi-view clustering methods that

learn the representation tensor and afﬁnity matrix in two separate

steps, JLMVC jointly learns them both; (2) using the “kernel trick,”

JLMVC can handle nonlinear data structures for various real

applications; and (3) different from most existing methods that

treat representations of all views equally, JLMVC automatically

learns a reasonable weight for each view. Based on the alternating

direction method of multipliers, an effective algorithm is designed

to solve the proposed model. Extensive experiments on eight

multimedia datasets demonstrate the superiority of the proposed

JLMVC over state-of-the-art methods.

Index Terms—Multi-view clustering, low-rank tensor represen-

tation, kernel trick, afﬁnity matrix, adaptive weight.

I. INTRODUCTION

IN MANY real-world applications, multimedia data such as

images, videos, audio, and documents, are usually repre-

sented by different features or collected from various ﬁelds

(called multi-view data) [1]–[3]. For example, in multimedia

retrieval [2], images can be represented by color, textures, and

edges. In video surveillance [3], the same scene is monitored by

multiple cameras from different viewpoints. In natural language

processing [4], documents can be translated into multiple languages such as Chinese, English, and French. Because multi-view data can substantially improve clustering performance, multi-view clustering has attracted great research interest in many fields, including the multimedia data mining, machine learning, and pattern recognition communities [5]–[8].

Manuscript received June 5, 2019; revised September 28, 2019; accepted October 29, 2019. Date of publication November 11, 2019; date of current version July 24, 2020. This work was supported in part by the Science and Technology Development Fund, Macau SAR (File no. 189/2017/A3), and in part by the Research Committee at University of Macau under Grants MYRG2016-00123-FST and MYRG2018-00136-FST. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Marco Carli. (Corresponding author: Yicong Zhou.)

Y. Chen and Y. Zhou are with the Department of Computer and Information Science, University of Macau, Macau 999078, China (e-mail: YongyongChen.cn@gmail.com; yicongzhou@um.edu.mo).

X. Xiao is with the School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China, and also with the Department of Computer and Information Science, University of Macau, Macau 999078, China (e-mail: shellyxiaolin@gmail.com).

Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2019.2952984

Given multi-view features extracted from the original multimedia data, multi-view clustering uses them to partition all unlabeled multimedia data into several distinct clusters. Numerous clustering approaches have been proposed. Whether single-view or multi-view, they usually follow two main steps: 1) constructing a symmetric affinity matrix (also called a similarity matrix) to describe the pairwise relations between multimedia data points and 2) performing the spectral clustering algorithm [9] to obtain the clustering results. The core of these methods is the construction of the affinity matrix, so the quality of the learned affinity matrix largely determines the clustering performance. In the literature, two common sources, the raw multimedia features and computed representations [10], [11], are used to construct the affinity matrix, leading to the following three categories: 1) graph-based methods [12]–[19], 2) subspace clustering-based methods [5]–[8], [11], [20]–[24], and 3) their combinations [10], [25], [26]. For example, owing to its simplicity and effectiveness, k-nearest neighbor using cosine or heat kernel distances [27] has become an intuitive way to construct the affinity matrix. Following the idea that the local connectivity of multimedia data can be measured by the Euclidean distance, the work in [12] constructed the affinity matrix by assigning adaptive neighbors to each multimedia data point. In [13], Nie et al. adopted the l1-norm distance instead of the Euclidean distance and proposed a graph clustering relaxation. Based on the fact that the affinity matrix should obey the block diagonal property, Nie et al. [14] imposed a rank constraint on the Laplacian matrix for graph-based clustering. To better exploit the complementary information of multi-view features, the approaches in [17] and [18] extended the adaptive neighbor strategy [12] and the rank constraint [14], respectively, from the single-view setting to the multi-view one. Following this, Wang et al. [19] pursued a unified affinity matrix from the affinity matrices of all views and used the rank function to partition multimedia data points into the optimal number of clusters. However, these graph-based approaches, e.g., [16], [18], [19], usually construct the affinity matrix directly from the raw multimedia features, which are often corrupted by noise and outliers. Thus, they may obtain an unreliable and inaccurate affinity matrix [10], [26].

As the second category, subspace clustering-based methods

have become the mainstream due to their excellent interpretabil-

ity and performance. The goal of subspace clustering is to

simultaneously ﬁnd low-dimensional subspaces and partition

multimedia data points into multiple subspaces. Speciﬁcally,

sparse subspace clustering (SSC) [21] and low-rank representation (LRR) [20] are two representative works, resulting in a local representation matrix and a global one, respectively. SSC learns the representation matrix under the l1-norm, which imposes sparsity on all entries of the representation matrix, whereas LRR constructs the representation matrix with a low-rank regularizer, which imposes sparsity on the singular values. Beyond low-rankness and sparsity, some extra structures underlying the data, such as the local similarity structure and nonnegativity [28], may not be fully considered. Instead of the fixed dictionary, i.e., the original multimedia features, the work in [29] proposed to learn a locality-preserving dictionary to capture the intrinsic geometric structure of the dictionary for LRR. Yin et al. [26] proposed to integrate LRR and graph construction in a unified framework to learn an adaptive low-rank graph affinity matrix. A similar idea was adopted in [10], [25]. A major limitation is that these methods focus only on single-view features, so their performance may degrade significantly when handling multi-view features.

Recently, considerable effort has been devoted to clustering based on deep neural networks. For example, Ji et al. [30] proposed a deep neural network that introduces a self-expressive layer into the auto-encoder framework for clustering. To build a deep structure, the authors in [31] adopted semi-nonnegative matrix factorization for multi-view clustering. In [32], a highly-economized scalable image clustering method was proposed to cluster large-scale multi-view images. Besides, to deal with multi-view clustering with missing features, Chao et al. [33] presented an enhanced multi-view co-clustering method. For a comprehensive survey on clustering, please refer to [34] and the references therein.

A. Related Work

The existing low-rank-based approaches for multi-view clustering can be roughly grouped into two categories: two-dimensional matrix-based low-rank methods [5], [23], [35]–[40] and three-dimensional tensor-based low-rank ones [6]–[8]. For example, to deal with multiple multimedia features, the work in [35] proposed to concatenate all heterogeneous features and then perform LRR [20]. Xia et al. [36] exploited low-rank and sparse matrix decomposition to uncover a shared transition probability matrix under the Markov chain method. Besides the consistency among multi-view features, the work in [38] took local view-specific information into consideration for multi-view clustering. Similarly, Tang et al. [5] proposed a multi-view clustering method by learning a joint affinity graph. In [5], [38], the consistency measures the common properties among all views, while the specificity captures the inherent difference in each view. Different from these approaches, which use the nuclear norm to depict the low-rank property of the representation matrices, Wang et al. [23] proposed to factorize each representation matrix as the product of symmetric low-rank data-cluster matrices, such that the singular value decomposition can be avoided. Following this, Liu et al. [40] proposed to mine a consensus representation of all views by multi-view non-negative matrix factorization.

Fig. 1. Comparison of existing low-rank tensor representation-based MVC

methods (the red dashed rectangle) and our proposed JLMVC (the blue dashed

rectangle). Existing methods construct the representation matrix (a) and the

afﬁnity matrix (b) in two separate steps without considering their correlation.

JLMVC learns the representation tensor and the afﬁnity matrix (d) in a uniﬁed

framework. Additionally, the kernel-induced mapping is adopted to map the original multimedia data (usually not linearly separable) into a new linear space.

The most representative methods of the second category are

the tensor unfolding-based method (LT-MSC) [6] and t-singular

value decomposition (t-SVD)-based one (t-SVD-MSC) [7]. As

shown in Fig. 1(a), each representation matrix is stored as the

frontal slice of a tensor, resulting in a third-order tensor (called

representation tensor). The main difference between [6] and [7]

is the tensor rank approximation which aims to explore the

high order correlations among multi-views. By organizing all

multi-view features into a third-order tensor, the work in [41]

exploited the sparsity and tensor nuclear norm penalty with

self-expressiveness to construct the representation tensor.

Although these approaches have achieved great advances in multi-view clustering, they may suffer from the following challenges: 1) Their performance may sharply degrade in real applications when the multimedia data come from nonlinear subspaces. The intuitive reason is that they were originally designed to deal with data that lie within multiple linear subspaces [8], [42], [43]. 2) The correlation between the representation tensor and affinity matrix may not be fully exploited. They learn the representation tensor via different low-rank tensor representations and then construct the affinity matrix in two separate steps, as shown in Figs. 1(a) and (b). This means that a globally optimal affinity matrix cannot be ensured. 3) The importance of each view in the construction of the affinity matrix is not considered. For example, the methods in [6], [7], [44] simply average all representation matrices with the same weight. The approach in [44] overcomes the first limitation, but fails to address the other two

challenges. To the best of our knowledge, no existing work addresses these three challenges simultaneously.

B. Our Contributions

To address the above three challenges, we propose a unified model to jointly learn the kernel representation tensor and affinity matrix for multi-view clustering (JLMVC). JLMVC learns the representation tensor and affinity matrix jointly such that their correlations can be well exploited, handles nonlinear multimedia data using a kernel-induced mapping, and adopts an adaptive weight strategy to form a unified affinity matrix. Fig. 1 compares the proposed JLMVC with two state-of-the-art low-rank tensor representation-based MVC methods, LT-MSC [6] and t-SVD-MSC [7]. As can be observed, under the assumption that the original data lie within multiple linear subspaces, existing low-rank tensor representation-based MVC methods learn the representation tensor from the original multimedia data. However, this assumption may not hold in real applications. To achieve nonlinear multi-view clustering, JLMVC maps the original multimedia data from the input data space into a new feature space such that the mapped data points can reside in multiple linear subspaces, as shown in the middle of Fig. 1(c). JLMVC then learns the representation tensor and affinity matrix simultaneously. Finally, the learned unified affinity matrix is fed into the spectral clustering algorithm [9] to obtain the clustering results.

The contributions and novelty of this paper are summarized

as follows:

• We propose a joint learning multi-view clustering (JLMVC) model to jointly learn the kernel representation tensor and affinity matrix for multi-view clustering. JLMVC is able to well explore the correlation between the representation tensor and affinity matrix, handles the nonlinear data using a kernel-induced mapping, and adopts the adaptive weight strategy to form a unified affinity matrix.

• JLMVC uses the tensor nuclear norm to encode the low-rank property of the representation tensor and adaptively learns different weights for different views’ representation matrices. This greatly benefits the construction of the unified affinity matrix.

• An effective algorithm is designed to solve the JLMVC model via the alternating direction method of multipliers. Extensive experiments on eight popular multimedia datasets are conducted and validate the superiority of JLMVC over ten state-of-the-art approaches.

C. Organization of the Paper

The rest of this paper is structured as follows. Section II introduces some notations and preliminaries, especially the t-SVD-based tensor nuclear norm, which is used to depict the low-rank property of the representation tensor. In Section III, we introduce JLMVC and design an iterative algorithm under the alternating direction method of multipliers framework. We evaluate the performance of the proposed JLMVC on eight real-world multi-view datasets in Section IV and conclude the whole paper in Section V.

TABLE I

BASIC NOTATIONS AND THEIR DESCRIPTIONS

II. NOTATIONS AND PRELIMINARIES

In this section, we aim to introduce some notations used

throughout this paper and the t-SVD-based tensor nuclear norm

(see Deﬁnition 2.2) that will be used to depict the low-rank

property of the representation tensor. Some basic notations are

summarized in Table I.

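Before the definition of t-SVD [45], several operators are first introduced. For a tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ with frontal slices $X^{(1)}, \ldots, X^{(n_3)}$, its block circular matrix $\operatorname{bcirc}(\mathcal{X})$ and block diagonal matrix $\operatorname{bdiag}(\mathcal{X})$ are defined as

$$
\operatorname{bcirc}(\mathcal{X}) =
\begin{bmatrix}
X^{(1)} & X^{(n_3)} & \cdots & X^{(2)} \\
X^{(2)} & X^{(1)} & \cdots & X^{(3)} \\
\vdots & \vdots & \ddots & \vdots \\
X^{(n_3)} & X^{(n_3-1)} & \cdots & X^{(1)}
\end{bmatrix},
\qquad
\operatorname{bdiag}(\mathcal{X}) =
\begin{bmatrix}
X^{(1)} & & & \\
& X^{(2)} & & \\
& & \ddots & \\
& & & X^{(n_3)}
\end{bmatrix}.
$$

The block vectorization is defined as $\operatorname{bvec}(\mathcal{X}) = [X^{(1)}; \cdots; X^{(n_3)}]$. The inverse operations of bvec and bdiag are defined as $\operatorname{bvfold}(\operatorname{bvec}(\mathcal{X})) = \mathcal{X}$ and $\operatorname{bdfold}(\operatorname{bdiag}(\mathcal{X})) = \mathcal{X}$, respectively. Let $\mathcal{Y} \in \mathbb{R}^{n_2 \times n_4 \times n_3}$. The t-product $\mathcal{X} * \mathcal{Y}$ is an $n_1 \times n_4 \times n_3$ tensor, $\mathcal{X} * \mathcal{Y} = \operatorname{bvfold}(\operatorname{bcirc}(\mathcal{X})\,\operatorname{bvec}(\mathcal{Y}))$. The transpose of $\mathcal{X}$ is $\mathcal{X}^T \in \mathbb{R}^{n_2 \times n_1 \times n_3}$, obtained by transposing each of the frontal slices and then reversing the order of transposed frontal slices 2 through $n_3$. The identity tensor $\mathcal{I} \in \mathbb{R}^{n_1 \times n_1 \times n_3}$ is a tensor whose first frontal slice is an $n_1 \times n_1$ identity matrix and whose remaining frontal slices are zero. A tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times n_1 \times n_3}$ is orthogonal if it satisfies $\mathcal{X}^T * \mathcal{X} = \mathcal{X} * \mathcal{X}^T = \mathcal{I}$.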
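The operators above can be sketched in NumPy as follows. This is only an illustrative sketch: the function names (`bcirc`, `bvec`, `bvfold`, `t_product`, `t_product_fft`, `t_transpose`, `identity_tensor`) mirror the notation of this section but are otherwise assumptions, and the FFT-based variant relies on the standard fact that the t-product is a circular convolution along the third mode.

```python
import numpy as np

def bcirc(X):
    """Block circulant matrix of an n1 x n2 x n3 tensor, size (n1*n3) x (n2*n3)."""
    n1, n2, n3 = X.shape
    out = np.zeros((n1 * n3, n2 * n3))
    for i in range(n3):            # block row
        for j in range(n3):        # block column
            out[i*n1:(i+1)*n1, j*n2:(j+1)*n2] = X[:, :, (i - j) % n3]
    return out

def bvec(X):
    """Stack the frontal slices vertically: (n1*n3) x n2."""
    return np.concatenate([X[:, :, k] for k in range(X.shape[2])], axis=0)

def bvfold(M, n3):
    """Inverse of bvec: fold an (n1*n3) x n2 matrix back into a tensor."""
    n1 = M.shape[0] // n3
    return np.stack([M[k*n1:(k+1)*n1, :] for k in range(n3)], axis=2)

def t_product(X, Y):
    """t-product via the definition bvfold(bcirc(X) bvec(Y))."""
    return bvfold(bcirc(X) @ bvec(Y), X.shape[2])

def t_product_fft(X, Y):
    """Equivalent t-product: slice-wise matrix products in the Fourier domain."""
    Xh, Yh = np.fft.fft(X, axis=2), np.fft.fft(Y, axis=2)
    Zh = np.einsum('ijk,jlk->ilk', Xh, Yh)
    return np.real(np.fft.ifft(Zh, axis=2))

def t_transpose(X):
    """Transpose each frontal slice, then reverse slices 2..n3."""
    Xt = np.transpose(X, (1, 0, 2))
    return np.concatenate([Xt[:, :, :1], Xt[:, :, :0:-1]], axis=2)

def identity_tensor(n, n3):
    """First frontal slice is the identity, the rest are zero."""
    I = np.zeros((n, n, n3))
    I[:, :, 0] = np.eye(n)
    return I
```

The definition-based `t_product` is mainly useful for checking; the FFT-based variant avoids forming the large block circulant matrix.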

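Definition 2.1: (t-SVD) Given $\mathcal{X}$, its t-SVD is defined as

$$
\mathcal{X} = \mathcal{U} * \mathcal{G} * \mathcal{V}^T,
$$

where $\mathcal{U} \in \mathbb{R}^{n_1 \times n_1 \times n_3}$ and $\mathcal{V} \in \mathbb{R}^{n_2 \times n_2 \times n_3}$ are orthogonal tensors, and $\mathcal{G} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ is an f-diagonal tensor, i.e., each of its frontal slices is a diagonal matrix.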

Fig. 2 shows the t-SVD of a third-order tensor. The t-SVD-

based tensor nuclear norm (t-SVD-TNN) is given as follows.

Fig. 2. The t-SVD of a tensor of size n1×n2×n3.

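Definition 2.2: (t-SVD-TNN) The t-SVD-TNN of a tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, denoted as $\|\mathcal{X}\|_{\circledast}$, is defined as the sum of singular values of all the frontal slices of $\hat{\mathcal{X}}$, i.e.,

$$
\|\mathcal{X}\|_{\circledast} = \sum_{i=1}^{\min\{n_1, n_2\}} \sum_{k=1}^{n_3} \left|\hat{\mathcal{G}}(i, i, k)\right|. \tag{1}
$$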
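As a small numerical illustration of Eq. (1), the sketch below computes the t-SVD-TNN by taking the FFT along the third mode and summing the singular values of every frontal slice in the Fourier domain. The function name `tsvd_tnn` is an assumption made for this example. Some works additionally scale this quantity by $1/n_3$; the sketch follows Eq. (1) as written, without that factor.

```python
import numpy as np

def tsvd_tnn(X):
    """t-SVD-based tensor nuclear norm of an n1 x n2 x n3 array, following Eq. (1)."""
    Xh = np.fft.fft(X, axis=2)  # frontal slices in the Fourier domain
    return sum(
        np.linalg.svd(Xh[:, :, k], compute_uv=False).sum()
        for k in range(X.shape[2])
    )
```

For a tensor with a single frontal slice ($n_3 = 1$) the value reduces to the ordinary matrix nuclear norm, which gives a quick sanity check.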

III. JOINT LEARNING MULTI-VIEW CLUSTERING

In this section, we ﬁrst elaborate the proposed JLMVC model

in Section III-A, and then solve this model by the alternating

direction method of multipliers (ADMM) in Section III-B. Con-

sidering that, in real world applications, the multimedia data

may be drawn from multiple nonlinear subspaces, JLMVC ﬁrst

uses the kernel trick to solve the nonlinearity. Based on the

self-expression property [20], [21], JLMVC carries out joint

learning of the representation tensor and uniﬁed afﬁnity matrix.

A. Problem Formulation

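The existing multi-view clustering method t-SVD-MSC [7] learns the representation tensor $\mathcal{Z}$ by

$$
\begin{aligned}
\min_{\mathcal{Z}, E}\ & \|\mathcal{Z}\|_{\circledast} + \alpha \sum_{v=1}^{V} \|E^{(v)}\|_{2,1} \\
\text{s.t.}\ & X^{(v)} = X^{(v)} Z^{(v)} + E^{(v)},\ v = 1, \ldots, V, \\
& \mathcal{Z} = \Phi(Z^{(1)}, Z^{(2)}, \ldots, Z^{(V)}),
\end{aligned} \tag{2}
$$

where $X^{(v)} \in \mathbb{R}^{d_v \times n}$ denotes the $v$-th view feature; $\alpha > 0$ is the regularization parameter; $E$ denotes noise and outliers; $\Phi(\cdot)$ is an operator to stack all representation matrices $\{Z^{(v)}\}$ into a third-order tensor $\mathcal{Z}$ as shown in Fig. 1(a).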

Once $\mathcal{Z}$ is obtained by Eq. (2), the affinity matrix $S$ is constructed by averaging all frontal slices of $\mathcal{Z}$. This means that, in the construction of $S$, the correlation between $S$ and $\mathcal{Z}$ is fixed. This scheme, however, may not ensure the optimal affinity matrix, since different view features characterize specific and partly independent information of the dataset. Therefore, to address this issue, different weights should be assigned to different views. We then obtain the following model:

$$
\begin{aligned}
\min_{\mathcal{Z}, S, \omega}\ & \|\mathcal{Z}\|_{\circledast} + \sum_{v=1}^{V} \Big( \alpha \|X^{(v)} - X^{(v)} Z^{(v)}\|_{2,1} + \lambda \omega^{(v)} \|Z^{(v)} - S\|_F^2 \Big) + \eta \|\omega\|_2^2 \\
\text{s.t.}\ & \mathcal{Z} = \Phi(Z^{(1)}, Z^{(2)}, \ldots, Z^{(V)}),\ \omega \ge 0,\ \textstyle\sum_v \omega^{(v)} = 1,
\end{aligned} \tag{3}
$$

where $\alpha$, $\lambda$ and $\eta$ are three positive parameters to balance the contributions of all terms in the objective function; $\omega^{(v)}$ is the relative weight of the $v$-th view; the last term smooths the weight distribution and avoids the trivial solution [46]. However, in model (3), the self-expression property (i.e., the second term) is encoded in the original input data space, which usually exhibits nonlinear structure in real-world datasets. We therefore seek new feature spaces in which the data become linearly separable for multi-view clustering. Borrowing the idea of the kernel methods [42], [43], for the $v$-th feature, let $\phi^{(v)}: \mathbb{R}^{d_v} \to \mathcal{H}^{(v)}$ be a kernel mapping from the original data space to the kernel space. As stated in the following Eq. (6), $\phi^{(v)}$ does not need to be defined explicitly. Let $K^{(v)} \in \mathbb{R}^{n \times n}$ be a positive semidefinite kernel Gram matrix, i.e.,

$$
K^{(v)} = \phi^{(v)}(X^{(v)})^T \phi^{(v)}(X^{(v)}). \tag{4}
$$
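To make Eq. (4) concrete, the sketch below builds a Gram matrix for one view whose samples are stored as the columns of a $d_v \times n$ array. The text does not fix a particular kernel at this point, so the linear and Gaussian kernels used here, the bandwidth parameter `sigma`, and the function names are assumptions chosen purely for illustration.

```python
import numpy as np

def linear_gram(X):
    """Linear kernel: phi is the identity map, so K = X^T X (n x n)."""
    return X.T @ X

def gaussian_gram(X, sigma):
    """Gaussian kernel K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    The mapping phi is never formed explicitly; only pairwise
    squared distances between the columns of X are needed.
    """
    sq = np.sum(X**2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))
```

Both matrices are symmetric and positive semidefinite, which is the property the later derivation relies on.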

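Then, we encode the self-expression property on the new feature space. This is also the reason that the proposed JLMVC can handle the nonlinearity problem. Based on the above analysis, model (3) can be formulated as

$$
\begin{aligned}
\min_{\mathcal{Z}, S, \omega}\ & \|\mathcal{Z}\|_{\circledast} + \sum_{v=1}^{V} \Big( \alpha \|\phi^{(v)}(X^{(v)}) - \phi^{(v)}(X^{(v)}) Z^{(v)}\|_{2,1} + \lambda \omega^{(v)} \|Z^{(v)} - S\|_F^2 \Big) + \eta \|\omega\|_2^2 \\
\text{s.t.}\ & \mathcal{Z} = \Phi(Z^{(1)}, Z^{(2)}, \ldots, Z^{(V)}),\ \omega \ge 0,\ \textstyle\sum_v \omega^{(v)} = 1.
\end{aligned} \tag{5}
$$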

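Note that the second term of Eq. (5) can be rewritten as

$$
\|\phi^{(v)}(X^{(v)}) - \phi^{(v)}(X^{(v)}) Z^{(v)}\|_{2,1} = \sum_{i=1}^{n} \left( P_i^{(v)T} K^{(v)} P_i^{(v)} \right)^{\frac{1}{2}}, \tag{6}
$$

where $P^{(v)} = I - Z^{(v)}$ and $P_i^{(v)}$ is the $i$-th column of $P^{(v)}$. From Eq. (6), it is easy to see that the kernel mapping $\phi^{(v)}$ appears only in the form of the inner product, i.e., $\phi^{(v)}(X^{(v)})^T \phi^{(v)}(X^{(v)})$, leading to the kernel Gram matrix $K^{(v)}$. Therefore, $\phi^{(v)}$ is implicitly defined. For simplicity, we denote $g^{(v)}(P^{(v)}) = \sum_{i=1}^{n} \left( P_i^{(v)T} K^{(v)} P_i^{(v)} \right)^{\frac{1}{2}}$ to be the reconstruction error in the kernel space. Finally, the proposed JLMVC model can be formulated as

$$
\begin{aligned}
\min_{\mathcal{Z}, \mathcal{P}, S, \omega}\ & \|\mathcal{Z}\|_{\circledast} + \sum_{v=1}^{V} \Big( \alpha g^{(v)}(P^{(v)}) + \lambda \omega^{(v)} \|Z^{(v)} - S\|_F^2 \Big) + \eta \|\omega\|_2^2 \\
\text{s.t.}\ & \mathcal{Z} = \Phi(Z^{(1)}, Z^{(2)}, \ldots, Z^{(V)}), \\
& \mathcal{P} = \Phi(P^{(1)}, P^{(2)}, \ldots, P^{(V)}), \\
& \mathcal{P} = \mathcal{I} - \mathcal{Z},\ \omega \ge 0,\ \textstyle\sum_v \omega^{(v)} = 1,
\end{aligned} \tag{7}
$$

where the first term, i.e., $\|\mathcal{Z}\|_{\circledast}$ defined in Eq. (1), is used to explore the low-rankness of $\mathcal{Z}$; the second term can handle the nonlinear structures; the third term with the adaptive weight strategy aims to learn a unified affinity matrix $S$.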
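The identity in Eq. (6) can be checked numerically in the special case of a linear kernel, where the mapping is the identity and the left-hand side can be evaluated directly. The sketch below is a sanity check under that assumption; `kernel_recon_error` and `l21_norm` are hypothetical helper names, and the $\ell_{2,1}$ norm is taken as the sum of the $\ell_2$ norms of the columns.

```python
import numpy as np

def kernel_recon_error(K, Z):
    """g(P) = sum_i sqrt(P_i^T K P_i) with P = I - Z, as in Eq. (6)."""
    P = np.eye(K.shape[0]) - Z
    quad = np.einsum('ji,jk,ki->i', P, K, P)   # P_i^T K P_i for every column i
    return np.sqrt(np.maximum(quad, 0.0)).sum()

def l21_norm(E):
    """Sum of the Euclidean norms of the columns of E."""
    return np.linalg.norm(E, axis=0).sum()

# Linear-kernel check: phi(X) = X, so K = X^T X.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 5))   # d x n, samples as columns
Z = rng.standard_normal((5, 5))   # any representation matrix
lhs = l21_norm(X - X @ Z)
rhs = kernel_recon_error(X.T @ X, Z)
```

With a nonlinear kernel only `kernel_recon_error` remains computable, which is precisely why the model is written in terms of $K^{(v)}$.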

B. Optimization

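Directly solving the proposed model in Eq. (7) is difficult because the objective is not jointly convex and its terms are coupled through the variable $\mathcal{Z}$. Therefore, we solve Eq. (7) under the ADMM framework. We can reformulate Eq. (7) as:

$$
\begin{aligned}
\min_{\mathcal{Z}, \mathcal{Y}, \mathcal{P}, S, \omega}\ & \|\mathcal{Y}\|_{\circledast} + \sum_{v=1}^{V} \Big( \alpha g^{(v)}(P^{(v)}) + \lambda \omega^{(v)} \|Z^{(v)} - S\|_F^2 \Big) + \eta \|\omega\|_2^2 \\
\text{s.t.}\ & \mathcal{Z} = \Phi(Z^{(1)}, Z^{(2)}, \ldots, Z^{(V)}), \\
& \mathcal{P} = \Phi(P^{(1)}, P^{(2)}, \ldots, P^{(V)}), \\
& \mathcal{P} = \mathcal{I} - \mathcal{Z},\ \omega \ge 0,\ \textstyle\sum_v \omega^{(v)} = 1,\ \mathcal{Z} = \mathcal{Y}.
\end{aligned} \tag{8}
$$

Following the idea of ADMM, we introduce one auxiliary variable $\mathcal{Y}$ to separate $\mathcal{Z}$ in the objective function and then iteratively update each variable by fixing the other variables [47]. The augmented Lagrangian function is defined as the sum of the objective function of Eq. (8) and the penalty term under the l2-norm. The augmented Lagrangian function of model (8) is given by:

$$
\begin{aligned}
L_{\rho}(\mathcal{Z}, \mathcal{Y}, P^{(v)}, S, \omega; \Theta, \Pi) ={}& \|\mathcal{Y}\|_{\circledast} + \sum_{v=1}^{V} \Big( \alpha g^{(v)}(P^{(v)}) + \lambda \omega^{(v)} \|Z^{(v)} - S\|_F^2 \Big) + \eta \|\omega\|_2^2 \\
& + \langle \Theta, \mathcal{I} - \mathcal{Z} - \mathcal{P} \rangle + \frac{\rho}{2} \|\mathcal{I} - \mathcal{Z} - \mathcal{P}\|_F^2 + \langle \Pi, \mathcal{Z} - \mathcal{Y} \rangle + \frac{\rho}{2} \|\mathcal{Z} - \mathcal{Y}\|_F^2,
\end{aligned} \tag{9}
$$

where $\Theta$ and $\Pi$ are the Lagrange multipliers of size $n \times n \times V$; $\rho$ is the non-negative penalty parameter; $\langle \cdot, \cdot \rangle$ is the inner product. Under the ADMM framework, we can solve Eq. (9) by optimizing one variable while keeping the other variables fixed as follows: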

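Step 1 Update $\mathcal{Z}$: Fixing other variables, we can update $\mathcal{Z}$ by the following subproblem:

$$
\min_{\mathcal{Z}} \sum_{v=1}^{V} \lambda \omega_k^{(v)} \|Z^{(v)} - S_k\|_F^2 + \frac{\rho_k}{2} \left\| \mathcal{I} - \mathcal{Z} - \mathcal{P}_k + \frac{\Theta_k}{\rho_k} \right\|_F^2 + \frac{\rho_k}{2} \left\| \mathcal{Z} - \mathcal{Y}_k + \frac{\Pi_k}{\rho_k} \right\|_F^2. \tag{10}
$$

It is easy to see that updating each frontal slice $Z^{(v)}$ of $\mathcal{Z}$ is independent. This means that $Z^{(v)}$ can be updated in parallel. The $v$-th subproblem is

$$
\min_{Z^{(v)}} \lambda \omega_k^{(v)} \|Z^{(v)} - S_k\|_F^2 + \frac{\rho_k}{2} \left\| Z^{(v)} - A_k^{(v)} \right\|_F^2 + \frac{\rho_k}{2} \left\| Z^{(v)} - B_k^{(v)} \right\|_F^2, \tag{11}
$$

where $A_k^{(v)} = I - P_k^{(v)} + \frac{\Theta_k^{(v)}}{\rho_k}$ and $B_k^{(v)} = Y_k^{(v)} - \frac{\Pi_k^{(v)}}{\rho_k}$. By setting the derivative of Eq. (11) with respect to $Z^{(v)}$ to zero, the optimal solution $Z_{k+1}^{(v)}$ is

$$
Z_{k+1}^{(v)} = \frac{2 \lambda \omega_k^{(v)} S_k + \rho_k A_k^{(v)} + \rho_k B_k^{(v)}}{2 \lambda \omega_k^{(v)} + 2 \rho_k}. \tag{12}
$$

Fig. 3. Explanation of rotation.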
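The closed-form update in Eq. (12) is an entrywise weighted average of three matrices, which the following sketch implements for a single view. The function name `update_Z` and its argument order are assumptions for illustration; `w` stands for the scalar weight $\omega_k^{(v)}$ and all matrix arguments are $n \times n$ arrays belonging to the same view.

```python
import numpy as np

def update_Z(S, P, Y, Theta, Pi, w, lam, rho):
    """Closed-form minimizer of the per-view subproblem, Eq. (11)-(12)."""
    n = S.shape[0]
    A = np.eye(n) - P + Theta / rho      # A_k^{(v)}
    B = Y - Pi / rho                     # B_k^{(v)}
    return (2.0 * lam * w * S + rho * A + rho * B) / (2.0 * lam * w + 2.0 * rho)
```

Because each view only touches its own slices, the calls for different views can run in parallel, as noted in the text.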

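Step 2 Update $\mathcal{Y}$: When other variables are fixed, $\mathcal{Y}$ can be updated by

$$
\min_{\mathcal{Y}} \|\mathcal{Y}\|_{\circledast} + \frac{\rho_k}{2} \|\mathcal{Y} - \mathcal{F}_k\|_F^2, \tag{13}
$$

where $\mathcal{F}_k = \mathcal{Z}_{k+1} + \frac{\Pi_k}{\rho_k}$. Following [7], we rotate $\mathcal{Y}$ from size $n \times n \times V$ to $n \times V \times n$ as shown in Fig. 3. The first reason is that, as in Eq. (1), t-SVD-TNN performs SVD on each frontal slice of $\hat{\mathcal{Y}}$ to capture the “spatial-shifting” correlation [45], [48]. This means that, without rotation, t-SVD-TNN preserves only the intra-view low-rank property, whereas we hope to capture the inter-view low-rank property. The second reason is that the rotation operation can significantly reduce the computation cost [7]. After the rotation operation, each frontal slice of $\hat{\mathcal{Y}}$ represents the view-specific self-representation matrix.

The closed-form solution of Eq. (13) can be obtained by the tensor tubal-shrinkage operator [7], [49]:

$$
\mathcal{Y}_{k+1} = \mathcal{C}_{\frac{V}{\rho_k}}(\mathcal{F}_k) = \mathcal{U} * \mathcal{C}_{\frac{V}{\rho_k}}(\mathcal{G}) * \mathcal{V}^T, \tag{14}
$$

where $\mathcal{F}_k = \mathcal{U} * \mathcal{G} * \mathcal{V}^T$, and $\mathcal{C}_{\frac{V}{\rho_k}}(\mathcal{G}) = \mathcal{G} * \mathcal{J}$, in which $\mathcal{J}$ is an f-diagonal tensor whose diagonal element in the Fourier domain is $\hat{\mathcal{J}}(i, i, k) = \max\left\{ 1 - \frac{V / \rho_k}{\hat{\mathcal{G}}(i, i, k)}, 0 \right\}$.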
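A minimal sketch of the rotation and the tubal-shrinkage step is given below. It is written under several assumptions: the rotation is realized as a swap of the last two modes, the threshold is passed in as a generic argument `tau` (Eq. (14) uses $V/\rho_k$), and the function names are hypothetical. Multiplying a singular value $s$ by $\max\{1 - \tau/s, 0\}$ is the same as replacing it with $\max\{s - \tau, 0\}$, which is what the code does slice by slice in the Fourier domain.

```python
import numpy as np

def rotate(T):
    """(n, n, V) -> (n, V, n): the view index becomes the second mode."""
    return np.transpose(T, (0, 2, 1))

def rotate_back(T):
    """(n, V, n) -> (n, n, V): inverse of rotate."""
    return np.transpose(T, (0, 2, 1))

def tubal_shrink(F, tau):
    """Soft-threshold the singular values of every Fourier-domain frontal slice."""
    Fh = np.fft.fft(F, axis=2)
    Yh = np.empty_like(Fh)
    for k in range(F.shape[2]):
        U, s, Vh = np.linalg.svd(Fh[:, :, k], full_matrices=False)
        s = np.maximum(s - tau, 0.0)      # equals s * max(1 - tau / s, 0)
        Yh[:, :, k] = (U * s) @ Vh
    return np.real(np.fft.ifft(Yh, axis=2))

def update_Y(Z_next, Pi, rho, tau):
    """Y_{k+1} from F_k = Z_{k+1} + Pi_k / rho_k, with shrinkage on the rotated tensor."""
    F = Z_next + Pi / rho
    return rotate_back(tubal_shrink(rotate(F), tau))
```

After rotation each SVD acts on an $n \times V$ slice instead of an $n \times n$ one, which is where the computational saving mentioned in the text comes from.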

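Step 3 Update $\mathcal{P}$: With other variables fixed, we minimize the augmented Lagrangian function in Eq. (9) with respect to $\mathcal{P}$:

$$
\min_{\mathcal{P}} \sum_{v=1}^{V} \alpha g^{(v)}(P^{(v)}) + \frac{\rho_k}{2} \left\| \mathcal{I} - \mathcal{Z}_{k+1} - \mathcal{P} + \frac{\Theta_k}{\rho_k} \right\|_F^2. \tag{15}
$$

Similar to Eq. (10), updating $P^{(v)}$ is also independent:

$$
\min_{P^{(v)}} \alpha g^{(v)}(P^{(v)}) + \frac{\rho_k}{2} \left\| P^{(v)} - D_k^{(v)} \right\|_F^2, \tag{16}
$$

where $D_k^{(v)} = I - Z_{k+1}^{(v)} + \frac{\Theta_k^{(v)}}{\rho_k}$. Compared with the method in [42], which uses the l2-norm to measure the reconstruction error, it is more difficult to solve Eq. (16) since $g^{(v)}$ is convex but non-smooth. According to [43], the $i$-th column of the optimal solution of Eq. (16), $p_i^{(v)}$, is

$$
p_i^{(v)} =
\begin{cases}
\hat{p}^{(v)}, & \text{if } \left\| [1/\sigma_1^{(v)}, \ldots, 1/\sigma_r^{(v)}] \circ t_u^{(v)} \right\|_2 > 1/\tau; \\
c_i^{(v)} - V_K^{(v)} t_u^{(v)}, & \text{otherwise.}
\end{cases} \tag{17}
$$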

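where $\tau = \frac{\rho_k}{\alpha}$; $c_i^{(v)}$ is the $i$-th column of $D_k^{(v)}$; $\circ$ is the element multiplication operator; $K^{(v)} = V^{(v)} \Sigma^{(v)2} V^{(v)T}$ is the singular value decomposition of $K^{(v)}$; $\Sigma^{(v)} = \operatorname{diag}(\sigma_1^{(v)}, \ldots, \sigma_r^{(v)}, 0, \ldots, 0)$ and $r$ is the rank of $K^{(v)}$; $V_K^{(v)}$ is constructed by the first $r$ columns of $V^{(v)}$; $t_u^{(v)} = V_K^{(v)T} c_i^{(v)}$; $\hat{p}^{(v)}$ is defined as

$$
\hat{p}^{(v)} = c_i^{(v)} - V_K^{(v)} \left( \left[ \frac{\sigma_1^{(v)2}}{\gamma^{(v)} + \sigma_1^{(v)2}}, \ldots, \frac{\sigma_r^{(v)2}}{\gamma^{(v)} + \sigma_r^{(v)2}} \right]^T \circ t_u^{(v)} \right), \tag{18}
$$

where $\gamma^{(v)} > 0$ is a scalar, and it satisfies

$$
t_u^{(v)T} \operatorname{diag}\left( \left\{ \frac{\sigma_i^{(v)2}}{(\gamma^{(v)} + \sigma_i^{(v)2})^2} \right\}_{1 \le i \le r} \right) t_u^{(v)} = \frac{1}{\tau^2}. \tag{19}
$$

We can obtain a unique root $\gamma^{(v)}$ when $\left\| [1/\sigma_1^{(v)}, \ldots, 1/\sigma_r^{(v)}] \circ t_u^{(v)} \right\|_2 > 1/\tau$.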

Step 4 Update $S$: When keeping other variables fixed, we obtain the following optimization problem:

$$
S_{k+1} = \arg\min_{S} \sum_{v=1}^{V} \omega_k^{(v)} \|Z_{k+1}^{(v)} - S\|_F^2 = \sum_{v=1}^{V} \omega_k^{(v)} Z_{k+1}^{(v)}. \tag{20}
$$

The last equation is based on the fact that $\sum_v \omega_k^{(v)} = 1$.

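Step 5 Update $\omega$: To obtain the adaptive weights $\omega_{k+1}$, we minimize the augmented Lagrangian function in Eq. (9) with respect to $\omega$:

$$
\omega_{k+1} = \arg\min_{\omega} \sum_{v=1}^{V} \omega^{(v)} \|Z_{k+1}^{(v)} - S_{k+1}\|_F^2 + \eta \|\omega\|_2^2 \quad \text{s.t. } \omega \ge 0,\ \sum_v \omega^{(v)} = 1. \tag{21}
$$

Actually, Eq. (21) is a quadratic programming problem

$$
\omega_{k+1} = \arg\min_{\omega} \left\| \omega + \frac{g_k}{2\eta} \right\|_2^2 \quad \text{s.t. } \omega \ge 0,\ \sum_v \omega^{(v)} = 1, \tag{22}
$$

where $g_k^{(v)} = \|Z_{k+1}^{(v)} - S_{k+1}\|_F^2$ forms the vector $g_k$. We adopt the off-the-shelf quadratic programming solver to solve the above problem.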
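Steps 4 and 5 can be sketched together as follows. Note the substitution: the text solves Eq. (22) with an off-the-shelf quadratic programming solver, whereas this sketch uses the well-known sort-based Euclidean projection onto the probability simplex, since Eq. (22) is exactly the projection of $-g_k / (2\eta)$ onto that set. The function names are assumptions for illustration.

```python
import numpy as np

def update_S(Z_list, w):
    """Eq. (20): weighted average of the per-view representation matrices."""
    return sum(wv * Zv for wv, Zv in zip(w, Z_list))

def project_simplex(y):
    """Euclidean projection of y onto {w : w >= 0, sum(w) = 1} (sort-based method)."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    j = np.arange(1, y.size + 1)
    rho = np.nonzero(u - (css - 1.0) / j > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(y - theta, 0.0)

def update_weights(Z_list, S, eta):
    """Eq. (22): project -g / (2 * eta) onto the simplex, g_v = ||Z_v - S||_F^2."""
    g = np.array([np.linalg.norm(Zv - S, 'fro')**2 for Zv in Z_list])
    return project_simplex(-g / (2.0 * eta))
```

A view whose representation is far from the consensus $S$ receives a smaller weight, and a large $\eta$ pushes the weights toward the uniform distribution, which matches the smoothing role of the last term in model (3).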

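Step 6 Update $\Theta$, $\Pi$, and $\rho$: The Lagrangian multipliers $\Theta$, $\Pi$ and the penalty parameter $\rho$ are updated by

$$
\begin{aligned}
\Theta_{k+1} &= \Theta_k + \rho_k (\mathcal{I} - \mathcal{Z}_{k+1} - \mathcal{P}_{k+1}); \\
\Pi_{k+1} &= \Pi_k + \rho_k (\mathcal{Z}_{k+1} - \mathcal{Y}_{k+1}); \\
\rho_{k+1} &= \min\{\beta \rho_k, \rho_{\max}\},
\end{aligned} \tag{23}
$$

where $\beta \in \left[0, \frac{\sqrt{5} + 1}{2}\right]$ is a step length to update the penalty parameter $\rho$ in each iteration [50]. $\rho_{\max}$ is the maximum value of the penalty parameter $\rho$.

Algorithm 1: JLMVC for multi-view clustering
Input: multi-view features: $\{X^{(v)}\}$; parameters: $\alpha$, $\lambda$;
Initialize: $\mathcal{Y}_1, \mathcal{Z}_1, S_1, \Theta_1, \Pi_1$ initialized to 0; weight $\omega_1^{(v)} = \frac{1}{V}$; $\eta = 500$, $\rho_1 = 10^{-3}$, $\beta = 1.5$, $\epsilon = 10^{-7}$, $k = 1$;
1: Calculate the $v$-th kernel matrix $K^{(v)}$ by Eq. (4) ($v = 1, \ldots, V$);
2: while not converged do
3:   for $v = 1$ to $V$ do
4:     Update $Z_{k+1}^{(v)}$ by Eq. (12);
5:     Update $P_{k+1}^{(v)}$ by Eq. (17);
6:   end for
7:   Update $\mathcal{Y}_{k+1}$ by Eq. (14);
8:   Update $S_{k+1}$ by Eq. (20);
9:   Update $\omega_{k+1}$ by Eq. (22);
10:  Update $\Theta_{k+1}$, $\Pi_{k+1}$, and $\rho_{k+1}$ by Eq. (23);
11:  Check the convergence condition in Eq. (24);
12: end while
Output: Affinity matrix $S_{k+1}$.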

The details of the proposed algorithm for solving the JLMVC model are summarized in Algorithm 1. Algorithm 1 can be terminated when the following convergence condition is satisfied:

$$
\max\left\{ \left\| I - Z_{k+1}^{(v)} - P_{k+1}^{(v)} \right\|_{\infty},\ v = 1, \ldots, V;\ \left\| \mathcal{Z}_{k+1} - \mathcal{Y}_{k+1} \right\|_{\infty} \right\} \le \text{tol}, \tag{24}
$$

where tol > 0 is a pre-defined tolerance.
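The multiplier update of Eq. (23) and the stopping test of Eq. (24) can be combined in one helper, sketched below. The tensors are assumed to be stored as arrays of shape $(n, n, V)$, the identity is broadcast to every view (consistent with $P^{(v)} = I - Z^{(v)}$), and the default `rho_max` is a placeholder chosen for this example because the text does not state its value. The infinity norm is taken as the largest absolute entry.

```python
import numpy as np

def update_duals(Z, P, Y, Theta, Pi, rho, beta=1.5, rho_max=1e10):
    """One pass of Eq. (23); also returns the residual used in Eq. (24)."""
    I = np.eye(Z.shape[0])[:, :, None]       # identity broadcast over the V views
    r1 = I - Z - P                           # constraint P = I - Z
    r2 = Z - Y                               # constraint Z = Y
    Theta_new = Theta + rho * r1
    Pi_new = Pi + rho * r2
    rho_new = min(beta * rho, rho_max)
    residual = max(np.abs(r1).max(), np.abs(r2).max())
    return Theta_new, Pi_new, rho_new, residual
```

Inside the main loop, the iteration would stop once `residual` drops to the chosen tolerance or below.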

Several notes regarding Algorithm 1 are given below to further explain the proposed JLMVC.

• The weights of different views are important to the construction of the affinity matrix. An intuitive way to initialize the weights of different views is to set each weight to $\omega_1^{(v)} = \frac{1}{V}$. Then, the weights are updated in an adaptive manner by Eq. (22). Other variables $\mathcal{Y}_1, \mathcal{Z}_1, S_1, \Theta_1, \Pi_1$ are initialized to 0.

• Lines 3–6 of Algorithm 1 can be performed in parallel as subproblems (11) and (16) are independent with respect to $Z^{(v)}$ and $P^{(v)}$, respectively.

• After performing Algorithm 1, we can obtain the unified affinity matrix $S$, which well inherits the advantage of the representation tensor $\mathcal{Z}$. Finally, the learned affinity matrix $S$ serves as the input of the spectral clustering algorithm [9] to yield the clustering results.
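The last note can be illustrated with a short sketch of the embedding stage of spectral clustering applied to a learned affinity matrix. Several choices here are assumptions rather than details taken from the text: the affinity is symmetrized as $(|S| + |S^T|)/2$, the symmetric normalized Laplacian is used, the rows of the embedding are normalized to unit length, and the final k-means step on those rows is omitted for brevity.

```python
import numpy as np

def spectral_embedding(S, k):
    """Rows of the returned n x k matrix are the points to be clustered by k-means."""
    W = 0.5 * (np.abs(S) + np.abs(S.T))                  # symmetric, non-negative affinity
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)                       # eigenvalues in ascending order
    E = vecs[:, :k]                                      # k smallest eigenvectors
    return E / np.maximum(np.linalg.norm(E, axis=1, keepdims=True), 1e-12)
```

For an ideal block-diagonal affinity with $k$ blocks, points in the same block map to the same embedded row and points in different blocks map to orthogonal rows, so any reasonable clustering of the rows recovers the blocks.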

IV. EXPERIMENTAL RESULTS

In this section, we aim to evaluate the performance of JLMVC

on eight multimedia datasets. The model analysis is also re-

ported.

TABLE II

SUMMARY OF EIGHT MULTI-VIEW DATASETS

A. Experimental Settings

Our experiments select eight multimedia datasets for multi-view clustering, including four face image datasets, two scene datasets, one prokaryotic dataset, and one article dataset. A brief description of these datasets is given in Table II. The details of each dataset are listed as follows:

Dataset descriptions: Yale: 1it consists of 165 gray-scale

images of 15 individuals with different facial expressions and

conﬁgurations. Following [6], [7], 4096d(dimension, d)In-

tensity, 3304dLBP, and 6750dGabor are extracted as three

multi-view features; Extended YaleB:2it contains 2414 face

images of 38 individuals, each of which has 64 near frontal

images under different lighting conditions. Similar to [6], [7],

the ﬁrst 10 classes are selected and three types of features, in-

cluding 2500dIntensity, 3304dLBP, and 6750dGabor, are ex-

tracted; ORL:3it includes 400 face images with 40 clusters

under different times, lighting, facial expressions, and facial de-

tails; Prokaryotic phyla: it contains 551 prokaryotic species

described by textual data and different genomic representations.

Wikipedia:4it is an article dataset selected by Wikipedia ed-

itors since 2009. Following [46], 693 documents with 2 views

are selected; COIL-20:5COIL_20 contains 1440 images of 20

object categories. Three view features including 1024dinten-

sity, 3304dLBP, and 6750dGabor are employed; CMU-PIE:6

it consists of 5440 facial images of 68 subjects. Each image

is of size 64 ×64 with a large variance. Following [51], three

types of features including 1024dIntensity, 256dLBP, and 496d

HOG are used; Scene-15 [52]: it contains 4485 outdoor and in-

door scene images from 15 categories. Following [7], three kinds

of image features, including 1800dPHOW, 1180dPRI-CoLBP,

and 1240dCENTRIST are extracted to represent Scene-15.

Baselines: Our proposed JLMVC is compared with twelve state-of-the-art single-view and multi-view clustering methods. The competing methods are listed as follows:

• SSC_best [21]: single-view clustering using the sparse regularizer (l1-norm) to construct the representation matrix;

• LRR_best [20]: single-view clustering using the nuclear norm to construct the representation matrix;

• MLAP [35]: multi-view clustering by concatenating the representation matrices of different views and imposing a low-rank constraint to explore the complementarity;

• DiMSC [53]: multi-view clustering with the Hilbert-Schmidt independence criterion;

• LT-MSC [6]: multi-view clustering with the low-rank tensor constraint;

• MLAN [16]: multi-view clustering with adaptive neighbors;

• ECMSC [24]: multi-view clustering by simultaneously exploiting the representation exclusivity and indicator consistency;

• t-SVD-MSC [7]: multi-view clustering via tensor multi-rank minimization;

• HLR-M2VS [8]: multi-view clustering via hyper-Laplacian regularized multilinear multiview self-representations;

• Kt-SVD-MSC [44]: multi-view clustering via robust kernelized multi-view self-representations;

• DMF-MVC [31]: multi-view clustering via deep matrix factorization;

• AWP [54]: multi-view clustering via adaptively weighted Procrustes.

¹ http://cvc.yale.edu/projects/yalefaces/yalefaces.html

² http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html

³ http://www.uk.research.att.com/facedatabase.html

⁴ http://lig-membres.imag.fr/grimal/data.html

⁵ http://www.cs.columbia.edu/CAVE/software/softlib/

⁶ http://vasc.ri.cmu.edu/idb/html/face/

Specifically, SSC_best and LRR_best are two representative baselines for single-view clustering; the others are multi-view clustering baselines. LT-MSC, t-SVD-MSC, HLR-M2VS, and Kt-SVD-MSC are low-rank tensor representation-based multi-view clustering approaches, and Kt-SVD-MSC is the kernelized version of t-SVD-MSC. MLAN is a graph-based multi-view clustering method. The source codes of all competing methods are downloaded from the authors' homepages. For the single-view clustering methods, we perform SSC and LRR on each feature matrix independently and report the best clustering results. For the multi-view clustering ones, LT-MSC, t-SVD-MSC, HLR-M2VS, and Kt-SVD-MSC first learn the representation tensor $Z$ and then construct the affinity matrix $S$ by averaging the frontal slices of $Z$, that is, $S = \frac{1}{V}\sum_{v=1}^{V}\left(|Z^{(v)}| + |Z^{(v)T}|\right)$. This means that they obtain the affinity matrix in two separate steps. After that, the spectral clustering algorithm [9] is carried out to obtain the final clustering results.

For fair comparison, our experiments follow the parameter settings of the original papers. For SSC and LRR, we select the regularization parameter from the interval [0.01, 10]; for MLAP, two free parameters are searched from 0.001 to 1; for DiMSC, two free parameters are chosen from [0.01, 0.03] and [20 : 20 : 180], respectively; the trade-off parameter of LT-MSC is selected from 0.01 to 100; for MLAN, one parameter is set to a random number between 1 and 30; three free parameters of ECMSC are set in [0.1, 1], [0.1, 1], and 1.2, respectively; the trade-off parameters of t-SVD-MSC and Kt-SVD-MSC are set within the ranges [0.1, 2] and [0.001, 0.6], respectively; for HLR-M2VS, two parameters are located within the ranges [0.01, 0.2] and [0.1, 0.9], respectively; DMF-MVC adopts {[100, 50], [500, 50], [500, 200]} as the sizes of the last layer, and its other parameters use the default settings recommended in [31]; AWP is parameter-free.
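The two-step affinity construction used by the tensor-based baselines averages, over the views, the absolute value of each frontal slice plus the absolute value of its transpose. The following pure-Python sketch illustrates this step, assuming that the sum runs over both terms and that each slice is a square nested list. The function name `average_affinity` and the toy values are hypothetical.

```python
def average_affinity(slices):
    """Compute S = (1/V) * sum_v (|Z_v| + |Z_v^T|) for square slices.

    `slices` is a list of V matrices, each a nested list of size n x n.
    """
    num_views = len(slices)
    n = len(slices[0])
    affinity = [[0.0] * n for _ in range(n)]
    for z in slices:
        for i in range(n):
            for j in range(n):
                # Entry (i, j) of |Z_v| + |Z_v^T| is |z[i][j]| + |z[j][i]|.
                affinity[i][j] += abs(z[i][j]) + abs(z[j][i])
    return [[value / num_views for value in row] for row in affinity]


if __name__ == "__main__":
    # Toy slices for two views, invented for illustration only.
    z1 = [[0.0, 0.5], [-0.25, 0.0]]
    z2 = [[0.0, -1.0], [0.5, 0.0]]
    print(average_affinity([z1, z2]))
```

The resulting matrix is symmetric and nonnegative by construction, which is what spectral clustering expects of an affinity matrix.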

Evaluation metrics: Six widely used metrics are selected to evaluate the clustering quality: accuracy (ACC), normalized mutual information (NMI), adjusted Rand index (AR), F-score, Precision, and Recall. For each metric, a higher value indicates better clustering performance. For all methods except MLAN, the spectral clustering algorithm uses the K-means algorithm to obtain the indicator matrix, and different initializations may yield different clustering results. Thus, we run 10 trials for each experiment on all datasets and report the average performance with standard deviations. Although MLAN does not use the K-means algorithm, it has one random parameter. Thus, we also run MLAN for 10 trials.
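As an illustration of two of these metrics, the sketch below computes ACC and NMI with the Python standard library only. It is a minimal example under stated assumptions, not the evaluation code used in the experiments: ACC searches all label permutations by brute force (practical only for a few clusters; larger problems typically use the Hungarian algorithm, e.g., `scipy.optimize.linear_sum_assignment`), and NMI is normalized by the geometric mean of the two entropies, which is one of several common conventions.

```python
import math
from collections import Counter
from itertools import permutations


def clustering_accuracy(true_labels, pred_labels):
    """ACC under the best one-to-one mapping of predicted to true labels."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    # Pad with None so both sides have the same number of label slots.
    size = max(len(true_ids), len(pred_ids))
    true_ids += [None] * (size - len(true_ids))
    pred_ids += [None] * (size - len(pred_ids))
    pairs = Counter(zip(pred_labels, true_labels))
    best = 0
    for perm in permutations(true_ids):
        matched = sum(pairs[(p, t)] for p, t in zip(pred_ids, perm))
        best = max(best, matched)
    return best / len(true_labels)


def entropy(labels):
    """Shannon entropy (natural logarithm) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(true_labels, pred_labels):
    """NMI = I(T; P) / sqrt(H(T) * H(P)); returns 0.0 in degenerate cases."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    count_true = Counter(true_labels)
    count_pred = Counter(pred_labels)
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )
    denom = math.sqrt(entropy(true_labels) * entropy(pred_labels))
    return mutual_info / denom if denom > 0 else 0.0


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    predicted = [1, 1, 0, 0, 2, 2]  # same partition, permuted label names
    print(clustering_accuracy(truth, predicted))
    print(normalized_mutual_information(truth, predicted))
```

Both metrics are invariant to how the clusters are named, which is why a predicted partition that matches the ground truth up to relabeling scores as a perfect result.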


TABLE III

CLUSTERING RESULTS (MEAN ± STANDARD DEVIATION) ON THREE FACE IMAGE DATASETS

B. Clustering Performance Comparison

The clustering performance comparisons on all multimedia datasets are reported in Tables III, IV, and V. In each table, the best results are highlighted in bold and the second-best ones are underlined. From these results, we reach the following conclusions:

• Generally speaking, the proposed JLMVC achieves the best results on all datasets except ORL, where JLMVC is the second best. These results verify the effectiveness of the proposed JLMVC. This is mainly because JLMVC integrates three aspects into one unified model: (1) the high correlation between the representation tensor and the affinity matrix; (2) the nonlinear structures in real applications; (3) the different contributions of each view to the construction of the unified affinity matrix. (More details can be found in Section IV-C-(3).) Taking the Extended YaleB data as an example, JLMVC improves by around 1.4%, 0.4%, 2.1%, 1.7%, 1.6%, and 1.8% with respect to the six measures over the second-best method, Kt-SVD-MSC, which also exploits the kernel trick to handle nonlinear subspaces but learns the representation tensor and affinity matrix in two separate steps;

• The low-rank tensor representation-based MVC methods (LT-MSC, t-SVD-MSC, HLR-M2VS, Kt-SVD-MSC, and the proposed JLMVC) show better results than the single-view clustering methods (SSC and LRR) in most cases. This is mostly because different features characterize different and partly independent information of the datasets. LRR and SSC exploit only partial information, leading to unsatisfactory results, especially when the multi-view features are heterogeneous. By contrast, the low-rank tensor representation-based MVC methods can well explore the high-order correlations underlying multi-view features;

TABLE IV

CLUSTERING RESULTS (MEAN ± STANDARD DEVIATION) ON WIKIPEDIA AND PROKARYOTIC

DMF-MVC crashed on these two datasets.

• The graph-based multi-view clustering method, MLAN, obtains unstable results. On the Prokaryotic data, MLAN achieves performance similar to that of JLMVC. However, it performs worse than the single-view clustering methods on other datasets. The reason may be that graph-based clustering approaches usually construct the affinity matrix from the raw multimedia features, which may be corrupted by noise and outliers;

• On the ORL data, HLR-M2VS achieves better results than the proposed JLMVC. A possible reason is that manifold regularization preserves the local geometrical structure of the ORL data better than the kernel trick when handling nonlinearity. However, HLR-M2VS is less robust on the Yale and Extended YaleB datasets. Specifically, in terms of ACC and NMI, the leading margins of JLMVC over HLR-M2VS are 24.0% and 19.4% on Extended YaleB, and 24.4% and 21.0% on Yale, respectively. Similar observations can be made on the Scene-15 and Prokaryotic datasets. This indicates that, compared with manifold-based methods, kernel-based methods may be a better way to handle nonlinear subspaces;

• The performance of MLAP degrades sharply on the Extended YaleB data, where it is even worse than the single-view clustering methods LRR and SSC, although it performs better than them on other datasets. As stated in [7], the LBP and Gabor features yield less discriminative representations than the intensity feature due to large variations of illumination, as shown in the first group of Fig. 4. This indicates that simply concatenating all features may fail to produce a good affinity matrix for describing the relationships among all samples, especially when the features are heterogeneous. This directly motivates our model to consider the different contributions of different features when constructing the affinity matrix.

C. Model Analysis

In this section, we give a comprehensive analysis of the proposed JLMVC in Eq. (7), including the parameter analysis, convergence analysis, and runtime.

1) Parameter Analysis: There are three parameters in the proposed JLMVC, i.e., $\alpha$, $\lambda$, and $\eta$. In all experiments, we set $\eta = 500$. Thus, two free parameters need to be tuned. Specifically, $\alpha$ and $\lambda$ balance the contributions of the low-rank tensor term, the noise term, and the consensus term. For example, when the noise level of the features is high, $\alpha$ may be set to a large value. $\alpha$ and $\lambda$ are selected from the sets {0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7} and {0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 1}, respectively. Here, the Yale and Extended YaleB datasets are taken as two examples. Fig. 5 shows the ACC and NMI values with respect to different combinations of $\alpha$ and $\lambda$. From this figure, we can observe that JLMVC achieves the best results when $\alpha$ is set to a relatively large value. An intuitive interpretation is that there are large variations of illumination on the Extended YaleB data.
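The parameter search described above amounts to an exhaustive grid search over 8 × 10 = 80 combinations of the two free parameters. A minimal Python sketch follows. The candidate values are those listed in the text, while `evaluate` is a hypothetical callable standing in for a full run of the clustering pipeline that returns a score such as ACC. The toy scoring function is invented purely so that the sketch runs and does not reflect any real result.

```python
from itertools import product

ALPHAS = [0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7]
LAMBDAS = [0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 1]


def grid_search(evaluate, alphas=ALPHAS, lambdas=LAMBDAS):
    """Return (best_score, best_alpha, best_lambda) over the grid.

    `evaluate(alpha, lam)` is a hypothetical callable that runs the
    clustering pipeline once and returns a score to maximize.
    """
    best = None
    for alpha, lam in product(alphas, lambdas):
        score = evaluate(alpha, lam)
        if best is None or score > best[0]:
            best = (score, alpha, lam)
    return best


if __name__ == "__main__":
    # Toy surrogate score, NOT a real clustering result.
    def toy_score(alpha, lam):
        return -abs(alpha - 0.5) - abs(lam - 0.1)

    print(grid_search(toy_score))
```

Since clustering results vary across K-means initializations, `evaluate` would in practice average the score over several trials before the comparison.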


TABLE V

CLUSTERING RESULTS (MEAN ± STANDARD DEVIATION) ON COIL-20, CMU-PIE AND SCENE-15

2) Computation Complexity and Empirical Convergence Analysis: The proposed JLMVC consists of six subproblems. The main computational cost of JLMVC lies in updating $Y$ and $P$, since updating the other variables involves only matrix addition and scalar-matrix multiplication. The total computation complexity of the $Y$ subproblem is $O(2Vn^2\log(n) + V^2n^2)$, since it needs to compute the FFT, the inverse FFT, and the singular value decomposition. Updating $P$ involves $V$ independent subproblems, as shown in Eq. (16). Each subproblem takes $O(rn^2)$ for the vector-matrix multiplication, where $r$ is the rank of $K^{(v)}$. Thus, the computation complexity of JLMVC is $O(2Vn^2\log(n) + V^2n^2 + Vrn^2)$.
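To get a feel for which term dominates, the three terms of the stated complexity can be evaluated for concrete sizes. The sketch below does so up to constant factors. The split into an FFT term, an SVD term, and a $P$-update term follows the usual t-SVD accounting and is an interpretation of the text. The base-2 logarithm and the example values of $n$, $V$, and $r$ are assumptions chosen for illustration.

```python
import math


def jlmvc_cost_terms(n, num_views, rank):
    """Evaluate the three terms of the stated complexity, up to constants.

    Returns (fft_term, svd_term, p_term) for n samples, `num_views`
    views, and kernel matrices of rank `rank`. The logarithm base is
    an assumption; changing it only rescales the first term.
    """
    fft_term = 2 * num_views * n ** 2 * math.log2(n)
    svd_term = num_views ** 2 * n ** 2
    p_term = num_views * rank * n ** 2
    return fft_term, svd_term, p_term


if __name__ == "__main__":
    # Hypothetical sizes: 1024 samples, 3 views, rank-10 kernel matrices.
    fft_term, svd_term, p_term = jlmvc_cost_terms(1024, 3, 10)
    total = fft_term + svd_term + p_term
    for name, value in [("FFT", fft_term), ("SVD", svd_term), ("P update", p_term)]:
        print(f"{name}: {value:.3e} ({100 * value / total:.1f}% of total)")
```

All three terms grow quadratically in $n$, so the number of samples rather than the number of views is the main driver of the cost when $V$ is small.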

The empirical convergence of JLMVC on the Extended YaleB dataset is shown in Fig. 6. The x-axis denotes the number of iterations, while the y-axis represents the errors defined in Eq. (24). We can see that, after several iterations, the errors drop quickly to a stable value. In all experiments, the proposed JLMVC reaches the smallest residual within 50 iterations. To further investigate the empirical convergence of JLMVC, Fig. 7 reports the ACC and NMI values with respect to the iterations on the Extended YaleB dataset. During the first 10 iterations, JLMVC does not reach a meaningful accuracy. After that, JLMVC achieves ACC and NMI values higher than those of all competing methods except Kt-SVD-MSC. This indicates that JLMVC reaches competitive clustering quality after only a few iterations.

Fig. 4. ACC and NMI values of LRR with all features on (1) Extended YaleB, (2) Yale and (3) ORL datasets.

TABLE VI

PERFORMANCE (ACC/NMI) OF JLMVC AND ITS VARIANTS ON DIFFERENT DATASETS

Fig. 5. ACC and NMI values of JLMVC with different combinations of $\alpha$ and $\lambda$ on Yale (two top figures) and Extended YaleB (two bottom figures) datasets.

Fig. 6. Empirical convergence versus iterations on Extended YaleB data.

Fig. 7. ACC and NMI values versus iterations on Extended YaleB.

3) The Effect of $Z$ and $S$: The proposed JLMVC jointly learns the representation tensor $Z$ and the affinity matrix $S$, whereas most existing MVC methods construct $Z$ and $S$ in two separate steps. To investigate the effect of learning them jointly, we perform a test by setting $\lambda = 0$. In this test, we simply obtain $Z$ and then construct $S = \frac{1}{V}\sum_{v=1}^{V}\left(|Z^{(v)}| + |Z^{(v)T}|\right)$. This simple variant of JLMVC is denoted as JLMVC-Z. Table VI reports the clustering results of JLMVC and JLMVC-Z. It is easy to see that JLMVC achieves superior clustering results over JLMVC-Z in all cases. The average improvement of JLMVC over JLMVC-Z is around 17.06% and 16.23% with respect to ACC and NMI, respectively, indicating that constructing $Z$ and $S$ simultaneously can boost the clustering performance.

4) Ablation Study on the Kernel Trick: To investigate the effect of the kernel trick, we also evaluate the model in Eq. (3), denoted as JLMVC-nk. Like JLMVC, JLMVC-nk learns the representation tensor and affinity matrix simultaneously, but without the kernel trick. This means that the affinity matrix is constructed from the original multimedia data (usually nonlinearly separable). The ACC and NMI values of JLMVC-nk are reported in the last row of Table VI. One can see that JLMVC achieves better clustering results than JLMVC-nk in all cases. A typical example is the Extended YaleB dataset, whose multiple features are diverse, as shown in Fig. 4. This indicates that the kernel trick can handle the nonlinearity and boost the multi-view clustering performance.

5) Runtime: Since the computation time of a method is also an evaluation factor, we give a runtime comparison of the proposed JLMVC and several competitors in Table VII. All experiments are implemented in Matlab 2016a on a workstation with a 3.50 GHz CPU and 16 GB RAM. From Table VII, the methods ordered by average time from low to high are MLAN, t-SVD-MSC, HLR-M2VS, JLMVC, LT-MSC, DiMSC, MLAP, and Kt-SVD-MSC. MLAN has the shortest processing time, and the proposed JLMVC belongs to the middle-ranking group. All methods except MLAN need to compute the singular value decomposition and matrix inversion, which leads to a high computation cost. Although MLAN is the most efficient method, its performance is unstable. The reason is that MLAN uses the raw data to learn the similarity matrix, and the raw data are easily contaminated by noise. The other methods impose the low-rank constraint on the representation matrix (or tensor) and use the sparse regularizer to remove noise, so they can construct a reliable similarity matrix.


TABLE VII

AVERAGE RUNNING TIME (IN SECONDS) ON ALL DATABASES

V. CONCLUSIONS

In this paper, we proposed a novel method called JLMVC to solve the multi-view clustering problem, based on the low-rank tensor representation and the “kernel trick”. In JLMVC, instead of capturing a low-rank representation matrix among all views, the tensor singular value decomposition-based tensor nuclear norm was used to learn the representation tensor so as to explore the high-order correlations among different views. Using the kernel trick, the original multimedia data were implicitly mapped from the input data space into a new feature space to overcome the difficulty of nonlinearity in real applications. To make full use of the high correlation between the representation tensor and the affinity matrix, the proposed JLMVC learned the two jointly. Thus, the learned affinity matrix has the potential to boost the clustering performance, as demonstrated by extensive experiments on eight multimedia datasets. Our future work will design a fast and efficient multi-view clustering method. One possible solution is to use the Frank-Wolfe algorithm to reduce the computation complexity of the singular value decomposition.

ACKNOWLEDGMENT

The authors would like to thank the editors and the anonymous reviewers for their constructive comments, which helped to improve the quality of this article. The authors wish to gratefully acknowledge Prof. C. Zhang from Tianjin University and Prof. Y. Xie from East China Normal University for sharing multi-view datasets and codes.

REFERENCES

[1] S. Yang et al., “SkeletonNet: A hybrid network with a skeleton-embedding

process for multi-view image representation learning,” IEEE Trans. Mul-

timedia, vol. 21, no. 11, pp. 2916–2929, Nov. 2019.

[2] Z. Zhang, Y. Xie, W. Zhang, and Q. Tian, “Effective image retrieval via

multilinear multi-index fusion,” IEEE Trans. Multimedia, vol. 21, no. 11,

pp. 2878–2890, Nov. 2019.

[3] S. K. Kuanar, K. B. Ranga, and A. S. Chowdhury, “Multi-view video

summarization using bipartite matching constrained optimum-path for-

est clustering,” IEEE Trans. Multimedia, vol. 17, no. 8, pp. 1166–1173,

Aug. 2015.

[4] X. Wu, C.-W. Ngo, and A. G. Hauptmann, “Multimodal news story clus-

tering with pairwise visual near-duplicate constraint,” IEEE Trans. Multi-

media, vol. 10, no. 2, pp. 188–199, Feb. 2008.

[5] C. Tang et al., “Learning a joint affinity graph for multiview subspace clustering,” IEEE Trans. Multimedia, vol. 21, no. 7, pp. 1724–1736, Jul. 2019.

[6] C. Zhang, H. Fu, S. Liu, G. Liu, and X. Cao, “Low-rank tensor constrained

multiview subspace clustering,” in Proc. IEEE Int. Conf. Comput. Vision,

2015, pp. 1582–1590.

[7] Y. Xie et al., “On unifying multi-view self-representations for clustering by tensor multi-rank minimization,” Int. J. Comput. Vision, vol. 126, no. 11, pp. 1157–1179, 2018.

[8] Y. Xie, W. Zhang, Y. Qu, L. Dai, and D. Tao, “Hyper-Laplacian regular-

ized multilinear multiview self-representations for clustering and semisu-

pervised learning,” IEEE Trans. Cybern., 2018, to be published.

[9] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis

and an algorithm,” in Proc. Neural Inf. Process. Syst., 2002, pp. 849–856.

[10] X. Guo, “Robust subspace segmentation by simultaneously learning data

representations and their afﬁnity matrix,” in Proc. Joint Conf. Artif. Intell.,

2015, pp. 3547–3553.

[11] X. Peng, Z. Yu, Z. Yi, and H. Tang, “Constructing the l2-graph for robust

subspace learning and subspace clustering,” IEEE Trans. Cybern., vol. 47,

no. 4, pp. 1053–1066, Apr. 2017.

[12] F. Nie, X. Wang, and H. Huang, “Clustering and projected clustering with

adaptive neighbors,” in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Dis-

covery Data Mining, 2014, pp. 977–986.

[13] F. Nie et al., “New l1-norm relaxations and optimizations for graph clustering,” in Proc. AAAI Conf. Artif. Intell., 2016, pp. 1962–1968.

[14] F. Nie, X. Wang, M. I. Jordan, and H. Huang, “The constrained Laplacian

rank algorithm for graph-based clustering,” in Proc. AAAI Conf. Artif.

Intell., 2016, pp. 1969–1976.

[15] K. Zhan, C. Zhang, J. Guan, and J. Wang, “Graph learning for multi-

view clustering,” IEEE Trans. Cybern., vol. 48, no. 10, pp. 2887–2895,

Oct. 2017.

[16] F. Nie, G. Cai, J. Li, and X. Li, “Auto-weighted multi-view learning for

image clustering and semi-supervised classiﬁcation,” IEEE Trans. Image

Process., vol. 27, no. 3, pp. 1501–1511, Mar. 2018.

[17] F. Nie, G. Cai, and X. Li, “Multi-view clustering and semi-supervised

classiﬁcation with adaptive neighbours,” in Proc. AAAI Conf. Artif. Intell.,

2017, pp. 2408–2414.

[18] F. Nie et al., “Self-weighted multiview clustering with multiple graphs,”

in Proc. Joint Conf. Artif. Intell., 2017, pp. 2564–2570.

[19] H. Wang, Y. Yang, and B. Liu, “GMC: Graph-based multi-view clustering,”

IEEE Trans. Knowl. Data Eng., 2019, to be published.

[20] G. Liu et al., “Robust recovery of subspace structures by low-rank rep-

resentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1,

pp. 171–184, Jan. 2013.

[21] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory,

and applications,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11,

pp. 2765–2781, Nov. 2013.

[22] C. Lu, J. Feng, Z. Lin, T. Mei, and S. Yan, “Subspace clustering by block

diagonal representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41,

no. 2, pp. 487–501, Feb. 2019.

[23] Y. Wang, L. Wu, X. Lin, and J. Gao, “Multiview spectral clustering via structured low-rank matrix factorization,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 10, pp. 4833–4843, Oct. 2018.

[24] X. Wang, X. Guo, Z. Lei, C. Zhang, and S. Z. Li, “Exclusivity-consistency

regularized multi-view subspace clustering,” in Proc. IEEE Conf. Comput.

Vision Pattern Recognit., 2017, pp. 923–931.

[25] Z. Kang, H. Pan, S. C. H. Hoi, and Z. Xu, “Robust graph learning from

noisy data,” IEEE Trans. Cybern., 2019, to be published.

[26] M. Yin, S. Xie, Z. Wu, Y. Zhang, and J. Gao, “Subspace clustering via learning an adaptive low-rank graph,” IEEE Trans. Image Process., vol. 27, no. 8, pp. 3716–3728, Aug. 2018.

[27] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduc-

tion and data representation,” Neural Comput., vol. 15, no. 6, pp. 1373–

1396, 2003.

[28] L. Zhuang et al., “Constructing a nonnegative low-rank and sparse graph

with data-adaptive features,” IEEE Trans. Image Process., vol. 24, no. 11,

pp. 3717–3728, Nov. 2015.


[29] S. Yi et al., “Dual pursuit for subspace learning,” IEEE Trans. Multimedia,

vol. 21, no. 6, pp. 1399–1411, Jun. 2019.

[30] P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid, “Deep subspace clustering

networks,” in Proc. Neural Inf. Process. Syst., 2017, pp. 24–33.

[31] H. Zhao, Z. Ding, and Y. Fu, “Multi-view clustering via deep matrix fac-

torization,” in Proc. AAAI Conf. Artif. Intell., 2017, pp. 2921–2927.

[32] Z. Zhang et al., “Highly-economized multi-view binary compression for

scalable image clustering,” in Proc. Eur. Conf. Comput. Vision, 2018,

pp. 717–732.

[33] G. Chao et al., “Multi-view cluster analysis with incomplete data to un-

derstand treatment effects,” Inf. Sci., vol. 494, pp. 278–293, 2019.

[34] G. Chao, S. Sun, and J. Bi, “A survey on multi-view clustering,” 2017,

arXiv:1712.06246.

[35] B. Cheng, G. Liu, J. Wang, Z. Huang, and S. Yan, “Multi-task low-rank affinity pursuit for image segmentation,” in Proc. IEEE Int. Conf. Comput. Vision, 2011, pp. 2439–2446.

[36] R. Xia, Y. Pan, L. Du, and J. Yin, “Robust multi-view spectral clustering

via low-rank and sparse decomposition,” in Proc. AAAI Conf. Artif. Intell.,

2014, pp. 2149–2155.

[37] C. Zhang, Q. Hu, H. Fu, P. Zhu, and X. Cao, “Latent multi-view subspace

clustering,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2017,

pp. 4279–4287.

[38] S. Luo, C. Zhang, W. Zhang, and X. Cao, “Consistent and speciﬁc

multi-view subspace clustering,” in Proc. AAAI Conf. Artif. Intell., 2018,

pp. 3730–3713.

[39] C. Zhang et al., “Generalized latent multi-view subspace clustering,” IEEE

Trans. Pattern Anal. Mach. Intell., 2018, to be published.

[40] J. Liu, C. Wang, J. Gao, and J. Han, “Multi-view clustering via joint non-

negative matrix factorization,” in Proc. SIAM Int. Conf. Data Min., 2013,

pp. 252–260.

[41] M. Yin, J. Gao, S. Xie, and Y. Guo, “Multiview subspace clustering via

tensorial t-product representation,” IEEE Trans. Neural Netw. Learn. Syst.,

vol. 30, no. 3, pp. 851–864, Mar. 2019.

[42] V. M. Patel and R. Vidal, “Kernel sparse subspace clustering,” in Proc.

IEEE Int. Conf. Image Process., 2014, pp. 2849–2853.

[43] S. Xiao, M. Tan, D. Xu, and Z. Y. Dong, “Robust kernel low-rank rep-

resentation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 11,

pp. 2268–2281, Nov. 2016.

[44] Y. Qu, J. Liu, Y. Xie, and W. Zhang, “Robust kernelized multi-view self-

representations for clustering by tensor multi-rank minimization,” 2017,

arXiv:1709.05083.

[45] M. E. Kilmer and C. D. Martin, “Factorization strategies for third-order

tensors,” Linear Algebra Appl., vol. 435, no. 3, pp. 641–658, 2011.

[46] H. Wang, Y. Yang, and T. Li, “Multi-view clustering via concept factor-

ization with local manifold regularization,” in Proc. IEEE Int. Conf. Data

Mining, 2016, pp. 1245–1250.

[47] Y. Chen et al., “Denoising of hyperspectral images using nonconvex low

rank matrix approximation,” IEEE Trans. Geosci. Remote Sens., vol. 55,

no. 9, pp. 5366–5380, Sep. 2017.

[48] Y. Chen, S. Wang, and Y. Zhou, “Tensor nuclear norm-based low-rank

approximation with total variation regularization,” IEEE J. Sel. Topics

Signal Process., vol. 12, no. 6, pp. 1364–1377, Dec. 2018.

[49] W. Hu, D. Tao, W. Zhang, Y. Xie, and Y. Yang, “The twist tensor nu-

clear norm for video completion,” IEEE Trans. Neural Netw. Learn. Syst.,

vol. 28, no. 12, pp. 2961–2973, Dec. 2017.

[50] Y. Chen, Y. Wang, M. Li, and G. He, “Augmented Lagrangian alternating

direction method for low-rank minimization via non-convex approxima-

tion,” Signal, Image Video Process., vol. 11, no. 7, pp. 1271–1278, 2017.

[51] T. Zhou, C. Zhang, C. Gong, H. Bhaskar, and J. Yang, “Multiview la-

tent space learning with feature redundancy minimization,” IEEE Trans.

Cybern., 2018, to be published.

[52] L. Fei-Fei and P. Perona, “A Bayesian hierarchical model for learning nat-

ural scene categories,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision

Pattern Recognit., 2005, vol. 2, pp. 524–531.

[53] X. Cao, C. Zhang, H. Fu, S. Liu, and H. Zhang, “Diversity-induced multi-

view subspace clustering,” in Proc. IEEE Conf. Comput. Vision Pattern

Recognit., 2015, pp. 586–594.

[54] F. Nie, L. Tian, and X. Li, “Multiview clustering via adaptively weighted

procrustes,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data

Mining, 2018, pp. 2022–2030.

Yongyong Chen received the B.S. and M.S. degrees

in the College of Mathematics and Systems Science,

Shandong University of Science and Technology,

Qingdao, China, and visited the National Key Lab

for Novel Software Technology, Nanjing University,

Nanjing, China, as an exchange student in 2017. He is

currently working toward the Ph.D. degree with the

Department of Computer and Information Science,

University of Macau, Macau, China. His research in-

terests include (non-convex) low-rank and sparse ma-

trix/tensor decomposition models, with applications

to image processing, data mining, and computer vision.

Xiaolin Xiao received the B.E. degree from Wuhan

University, Wuhan, China, in 2013, and the Ph.D. de-

gree from the University of Macau, Macau, China, in

2019. She is currently a Postdoctoral Fellow with the

School of Computer Science and Engineering, South

China University of Technology, Guangzhou, China.

Her research interests include superpixel segmenta-

tion, saliency detection, and color image processing

and understanding.

Yicong Zhou (M’07–SM’14) received the B.S. de-

gree in electrical engineering from Hunan University,

Changsha, China, and the M.S. and Ph.D. de-

grees in electrical engineering from Tufts University,

Medford, MA, USA. He is an Associate Professor

and the Director of the Vision and Image Processing

Laboratory, Department of Computer and Informa-

tion Science, University of Macau, Macau, China. His

research interests include image processing and un-

derstanding, computer vision, machine learning, and

multimedia security. Dr. Zhou is a Senior Member

of the International Society for Optical Engineering. He was a recipient of the Third Prize of Macau Natural Science Award in 2014. He is the Co-Chair of the Technical Committee on Cognitive Computing in the IEEE Systems, Man, and Cybernetics Society. He is an Associate Editor for the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, and four other journals.

Authorized licensed use limited to: University Town Library of Shenzhen. Downloaded on September 12,2020 at 09:23:51 UTC from IEEE Xplore. Restrictions apply.

The methods for improving large-scale multi-view clustering efficiency: a survey

Article

Full-text available

May 2024
ARTIF INTELL REV

The diversity and large scale of multi-view data have brought more significant challenges to conventional clustering technology. Recently, multi-view clustering has received widespread attention because it can better use different views’ consensus and complementary information to improve clustering performance. Simultaneously, many researchers have proposed various algorithms to reduce the computational complexity to accommodate the demands of large-scale multi-view clustering. However, the current reviews do not summarize from the perspective of reducing the computational complexity of large-scale multi-view clustering. Therefore, this paper outlines various high-frequency methods used in recent years to reduce the computational complexity of large-scale multi-view clustering, i.e. third-order tensor t-SVD, anchors-based graph construction, matrix blocking, and matrix factorization, and compares the corresponding algorithms based on several open datasets. Finally, the strengths and weaknesses of the current algorithm and the point of improvement are analyzed.

Manifold-based Incomplete Multi-view Clustering via Bi-Consistency Guidance

Preprint

Full-text available

May 2024

Incomplete multi-view clustering primarily focuses on dividing unlabeled data into corresponding categories with missing instances, and has received intensive attention due to its superiority in real applications. Considering the influence of incomplete data, the existing methods mostly attempt to recover data by adding extra terms. However, for the unsupervised methods, a simple recovery strategy will cause errors and outlying value accumulations, which will affect the performance of the methods. Broadly, the previous methods have not taken the effectiveness of recovered instances into consideration, or cannot flexibly balance the discrepancies between recovered data and original data. To address these problems, we propose a novel method termed Manifold-based Incomplete Multi-view clustering via Bi-consistency guidance (MIMB), which flexibly recovers incomplete data among various views, and attempts to achieve biconsistency guidance via reverse regularization. In particular, MIMB adds reconstruction terms to representation learning by recovering missing instances, which dynamically examines the latent consensus representation. Moreover, to preserve the consistency information among multiple views, MIMB implements a biconsistency guidance strategy with reverse regularization of the consensus representation and proposes a manifold embedding measure for exploring the hidden structure of the recovered data. Notably, MIMB aims to balance the importance of different views, and introduces an adaptive weight term for each view. Finally, an optimization algorithm with an alternating iteration optimization strategy is designed for final clustering. Extensive experimental results on 6 benchmark datasets are provided to confirm that MIMB can significantly obtain superior results as compared with several state-of-the-art baselines.

Integrated Heterogeneous Graph Attention Network for Incomplete Multi-modal Clustering

Article

Apr 2024
INT J COMPUT VISION

Incomplete multi-modal clustering (IMmC) is challenging due to the unexpected missing of some modalities in data. A key to this problem is to explore complementarity information among different samples with incomplete information of unpaired data. Despite preliminary progress, existing methods suffer from (1) relying heavily on paired data, and (2) difficulty in mining complementarity on data with high missing rates. To address the problems, we propose a novel method, Integrated Heterogeneous Graph ATtention (IHGAT) network, for IMmC. To fully exploit the complementarity among different samples and modalities, we first construct a set of integrated heterogeneous graphs based on the similarity graph learned from unified latent representations and the modality-specific availability graphs formed by the existing relations of different samples. Thereafter, the attention mechanism is applied to the constructed integrated heterogeneous graph to aggregate the embedded content of heterogeneous neighbors for each node. In this way, the representations of missing modalities can be learned based on the complementarity information of other samples and their other modalities. Finally, the consistency of probability distribution is embedded into the network for clustering. Consequently, the proposed method can form a complete latent space where incomplete information can be supplemented by other related samples via the learned intrinsic structure. Extensive experiments on eight public datasets show that the proposed IHGAT outperforms existing methods under various settings and is typically more robust in cases of high missing rates.

Latent Multi-view Clustering Based Adaptive Graph Constraint

Conference Paper

Jun 2024

Aligned multi-view clustering for unmapped data via weighted tensor nuclear norm and adaptive graph learning

Article

Jun 2024
NEUROCOMPUTING

Sparse Multi-view Image Clustering with Complete Similarity Information

Article

Jun 2024
NEUROCOMPUTING

Tensorized Multi-View Low-Rank Approximation Based Robust Hand-Print Recognition

Article

May 2024
IEEE T IMAGE PROCESS

Because hand-print recognition (palmprint, finger-knuckle-print (FKP), and hand-vein) offers significant advantages in user convenience and hygiene, it has attracted growing interest from researchers. To handle long-standing interference factors in hand-print images, i.e., noise, rotation, and shadow, multi-view hand-print representation has been proposed to enhance feature expression by exploiting multiple characteristics from diverse views. However, existing methods usually ignore the high-order correlations between different views or fuse only a very limited range of feature types. To tackle these issues, in this paper, we present a novel tensorized multi-view low-rank approximation based robust hand-print recognition method (TMLA_RHR), which can dexterously manipulate the multi-view hand-print features to produce a highly compact feature representation. To achieve this goal, we formulate TMLA_RHR with two key components, i.e., aligned structure regression loss and tensorized low-rank approximation, in a joint learning model. Specifically, we treat the low-rank representation matrices of different views as a tensor, which is regularized with a low-rank constraint. This models the cross-view information and reduces the redundancy of the learned subspace representations. Experimental results on eight real-world hand-print databases demonstrate the superiority of the proposed method over other state-of-the-art related works.

Multi-modal news event detection with external knowledge

Article

May 2024
INFORM PROCESS MANAG

Joint learning of data recovering and graph contrastive denoising for incomplete multi-view clustering

Article

Apr 2024
INFORM FUSION

Complete multi-view subspace clustering via auto-weighted combination of visible and latent views

Article

Feb 2024
INFORM SCIENCES

Tensor Nuclear Norm-Based Low-Rank Approximation With Total Variation Regularization

Article

Dec 2018

Some existing low-rank approximation approaches either need to predefine the rank values (such as the matrix/tensor factorization-based methods) or fail to consider local information of data (e.g., spatial or spectral smooth structure). To overcome these drawbacks, this paper proposes a new model called the tensor nuclear norm-based low-rank approximation with total variation regularization (TLR-TV) for color and multispectral image denoising. TLR-TV uses the tensor nuclear norm to encode the global low-rank prior of tensor data and the total variation regularization to preserve the spatial-spectral continuity in a unified framework. By incorporating the hyper total variation (HTV) and the spatial-spectral total variation (SSTV), we propose two TLR-TV-based models, namely TLR-HTV and TLR-SSTV. Using the alternating direction method of multipliers, we further propose two simple algorithms to solve TLR-HTV and TLR-SSTV. Extensive experiments on simulated and real-world noisy images demonstrate that the proposed TLR-HTV and TLR-SSTV outperform the state-of-the-art methods in color and multispectral image denoising in terms of quantitative and qualitative evaluations.

Multi-View Clustering via Deep Matrix Factorization

Article

Feb 2017

Multi-View Clustering (MVC) has garnered more attention recently since many real-world data are comprised of different representations or views. The key is to explore complementary information to benefit the clustering problem. In this paper, we present a deep matrix factorization framework for MVC, where semi-nonnegative matrix factorization is adopted to learn the hierarchical semantics of multi-view data in a layer-wise fashion. To maximize the mutual information from each view, we enforce the non-negative representation of each view in the final layer to be the same. Furthermore, to respect the intrinsic geometric structure in each view data, graph regularizers are introduced to couple the output representation of deep structures. As a non-trivial contribution, we provide the solution based on alternating minimization strategy, followed by a theoretical proof of convergence. The superior experimental results on three face benchmarks show the effectiveness of the proposed deep matrix factorization model.

Multi-View Clustering and Semi-Supervised Classification with Adaptive Neighbours

Article

Feb 2017

Because they efficiently learn the relationships and complex structures hidden in data, graph-oriented methods have been widely investigated and achieve promising performance in multi-view learning. Generally, these learning algorithms construct an informative graph for each view or fuse different views into one graph, on which the subsequent procedures are based. However, in many real-world datasets, the original data contain noise and outlying entries that result in unreliable and inaccurate graphs, which previous methods cannot correct. In this paper, we propose a novel multi-view learning model which performs clustering/semi-supervised classification and local structure learning simultaneously. The obtained optimal graph can be partitioned into specific clusters directly. Moreover, our model can allocate an ideal weight to each view automatically without additional weight and penalty parameters. An efficient algorithm is proposed to optimize this model. Extensive experimental results on different real-world datasets show that the proposed model outperforms other state-of-the-art multi-view algorithms.

New l1-Norm Relaxations and Optimizations for Graph Clustering

Article

Mar 2016

In recent data mining research, graph clustering methods, such as normalized cut and ratio cut, have been well studied and applied to many unsupervised learning applications. The original graph clustering problems are NP-hard. Traditional approaches use spectral relaxation to solve them. The main disadvantage of these approaches is that the obtained spectral solutions could severely deviate from the true solution. To solve this problem, in this paper, we propose a new relaxation mechanism for graph clustering methods. Instead of minimizing the squared distances of clustering results, we use the l1-norm distance. More importantly, considering the normalized consistency, we also use the l1-norm for the normalized terms in the new graph clustering relaxations. Due to the sparse result from the l1-norm minimization, the solutions of our new relaxed graph clustering methods take discrete values with many zeros, which are close to the ideal solutions. Our new objectives are difficult to optimize, because the minimization problem involves the ratio of nonsmooth terms. The existing sparse learning optimization algorithms cannot be applied to solve this problem. In this paper, we propose a new optimization algorithm to solve this difficult non-smooth ratio minimization problem. Extensive experiments have been performed on three two-way clustering and eight multi-way clustering benchmark data sets. All empirical results show that our new relaxation methods consistently enhance the normalized cut and ratio cut clustering results.

Consistent and Specific Multi-View Subspace Clustering

Article

Apr 2018

Multi-view clustering has attracted intensive attention due to the effectiveness of exploiting multiple views of data. However, most existing multi-view clustering methods aim only to explore the consistency or to enhance the diversity of different views. In this paper, we propose a novel multi-view subspace clustering method (CSMSC), where consistency and specificity are jointly exploited for subspace representation learning. We formulate the multi-view self-representation property using a shared consistent representation and a set of specific representations, which better fits real-world datasets. Specifically, consistency models the common properties among all views, while specificity captures the inherent difference in each view. In addition, to optimize the non-convex problem, we introduce a convex relaxation and develop an alternating optimization algorithm to recover the corresponding data representations. Experimental evaluations on four benchmark datasets demonstrate that the proposed approach achieves better performance than several state-of-the-art methods.

SkeletonNet: A Hybrid Network With a Skeleton-Embedding Process for Multi-View Image Representation Learning

Article

Apr 2019

Multi-view representation learning plays a fundamental role in multimedia data analysis. Conventional models adopt specific inter-view alignment principles under the assumption that different views share a common latent subspace. However, when dealing with views on diverse semantic levels, the view-specific characteristics are neglected, and the divergent inconsistency of similarity measurements hinders sufficient information sharing. This paper proposes a hybrid deep network by introducing tensor factorization into the multi-view Deep Auto-encoder. The network adopts a Skeleton-Embedding Process for unsupervised multi-view subspace learning. It takes full consideration of view-specific characteristics, and leverages the strength of both shallow and deep architectures for modeling low- and high-level views respectively. We first formulate the high-level-view semantic distribution as the underlying skeleton structure of the learned subspace, and then infer the local tangent structures according to the affinity propagation of low-level-view geometric correlations. As a consequence, a more discriminative subspace representation can be learned from global semantic pivots to local geometric details. Experimental comparisons on three benchmark image datasets show the promising performance and flexibility of our model.

Multi-View Cluster Analysis with Incomplete Data to Understand Treatment Effects

Article

Apr 2019
INFORM SCIENCES

Multi-view cluster analysis, as a popular granular computing method, aims to partition sample subjects into consistent clusters across different views in which the subjects are characterized. Frequently, data entries can be missing from some of the views. The latest multi-view co-clustering methods cannot effectively deal with incomplete data, especially when there are mixed patterns of missing values. We propose an enhanced formulation for a family of multi-view co-clustering methods to cope with the missing data problem by introducing an indicator matrix whose elements indicate which data entries are observed and assessing cluster validity only on observed entries. In comparison with common methods that impute missing data in order to use regular multi-view analytics, our approach is less sensitive to imputation uncertainty. In comparison with other state-of-the-art multi-view incomplete clustering methods, our approach is sensible in the cases of either missing any entry in a view or missing the entire view. We first validated the proposed strategy in simulations, and then applied it to a treatment study of opioid dependence which would have been impossible with previous methods due to a number of missing-data patterns. Patients in the treatment study were naturally assessed in different feature spaces such as in the pre-, during- and post-treatment time windows. Our algorithm was able to identify subgroups where patients in each group showed similarities in all of the three time windows, thus leading to the identification of pre-treatment (baseline) features predictive of post-treatment outcomes. We found that cue-induced heroin craving predicts adherence to XR-NTX therapy. This finding is consistent with the clinical literature, serving to validate our approach.
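The indicator-matrix idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' co-clustering formulation: the function names `masked_sq_distance` and `assign_clusters`, the toy data, and the nearest-centroid assignment step are all assumptions chosen only to show how a 0/1 indicator restricts a distance to observed entries.

```python
def masked_sq_distance(x, centroid, observed):
    """Squared distance over entries whose indicator is 1; others are ignored."""
    return sum(m * (a - c) ** 2 for a, c, m in zip(x, centroid, observed))


def assign_clusters(X, M, centroids):
    """Assign each row of X to its nearest centroid, using observed entries only."""
    labels = []
    for x, m in zip(X, M):
        dists = [masked_sq_distance(x, c, m) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


# Toy data (hypothetical). The 99.0 in row 1 is a placeholder for a missing
# entry; its indicator is 0, so it never contributes to any distance.
X = [[1.0, 0.0, 9.9], [0.9, 99.0, 10.1], [5.0, 5.0, 0.0]]
M = [[1, 1, 1], [1, 0, 1], [1, 1, 0]]  # indicator matrix: 1 = observed
centroids = [[1.0, 0.0, 10.0], [5.0, 5.0, 5.0]]

labels = assign_clusters(X, M, centroids)
print(labels)
```

Because the placeholder is masked out rather than imputed, the assignment of row 1 does not depend on what value is stored in the missing position.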

GMC: Graph-based Multi-view Clustering

Article

Mar 2019

Multi-view graph-based clustering aims to provide clustering solutions to multi-view data. However, most existing methods do not give sufficient consideration to weights of different views and require an additional clustering step to produce the final clusters. They also usually optimize their objectives based on fixed graph similarity matrices of all views. In this paper, we propose a general Graph-based Multi-view Clustering (GMC) to tackle these problems. GMC takes the data graph matrices of all views and fuses them to generate a unified matrix. The unified matrix in turn improves the data graph matrix of each view, and also gives the final clusters directly. The key novelty of GMC is its learning method, which can help the learning of each view graph matrix and the learning of the unified matrix in a mutual reinforcement manner. A novel multi-view fusion technique can automatically weight each data graph matrix to derive the unified matrix. A rank constraint without introducing a tuning parameter is also imposed on the Laplacian matrix of the unified matrix, which helps partition the data points naturally into the required number of clusters. An alternating iterative optimization algorithm is presented to optimize the objective function. Experimental results demonstrate that the proposed method outperforms state-of-the-art baselines markedly.
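The rank constraint mentioned in the abstract above relies on a standard graph-theoretic fact: for a nonnegative symmetric affinity matrix over n points, the graph Laplacian has rank n − c exactly when the graph has c connected components. The following Python sketch only checks that property on a toy matrix by counting components with a breadth-first search; it does not reproduce the GMC optimization, and the function name `count_components` and the matrix `W` are hypothetical.

```python
from collections import deque


def count_components(W):
    """Count connected components of the graph whose affinity matrix is W.

    An edge (i, j) exists when W[i][j] or W[j][i] is strictly positive.
    """
    n = len(W)
    seen = [False] * n
    count = 0
    for start in range(n):
        if seen[start]:
            continue
        count += 1
        seen[start] = True
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if not seen[j] and (W[i][j] > 0 or W[j][i] > 0):
                    seen[j] = True
                    queue.append(j)
    return count


# Toy block-diagonal affinity (hypothetical): points {0, 1} and {2, 3}
# are connected within each pair but not across pairs.
W = [
    [0.0, 0.8, 0.0, 0.0],
    [0.8, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.0],
]
print(count_components(W))
```

When the learned graph already has the required number of components, the cluster labels can be read off the components directly, which is why no separate clustering step is needed in that setting.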

Learning Joint Affinity Graph for Multi-view Subspace Clustering

Article

Dec 2018

With their ability to exploit the internal structure of data, graph-based models have received great attention and achieved great success in multi-view subspace clustering. Most existing methods construct an affinity graph for each single view individually and fuse the results obtained from the single graphs. However, the shared information and the complementary diversity that exist between views are not efficiently exploited. In this paper, we propose to address this issue by learning a joint affinity graph for multi-view subspace clustering based on low-rank representation with diversity regularization and a rank constraint. Specifically, a low-rank representation model is employed to learn a shared sample representation coefficient matrix, which is used to generate the affinity graph. Meanwhile, we use a diversity regularization to learn optimal weights for each view, which can suppress the redundancy and exploit the diversity among different feature views. In addition, the cluster number is used to promote the affinity graph learning through a rank constraint. The final clustering result is obtained by using normalized cuts on the learned affinity graph. An efficient algorithm based on Augmented Lagrangian Multiplier with Alternating Direction Minimization (ALM-ADM) is carefully designed to solve the resultant optimization problem. Extensive experiments on various real-world datasets are conducted, and the results demonstrate the effectiveness of the proposed algorithm.

Robust Graph Learning From Noisy Data

Preprint

Dec 2018

Learning graphs from data automatically has shown encouraging performance on clustering and semisupervised learning tasks. However, real data are often corrupted, which may cause the learned graph to be inexact or unreliable. In this paper, we propose a novel robust graph learning scheme to learn reliable graphs from real-world noisy data by adaptively removing noise and errors in the raw data. We show that our proposed model can also be viewed as a robust version of manifold regularized robust PCA, where the quality of the graph plays a critical role. The proposed model is able to boost the performance of data clustering, semisupervised classification, and data recovery significantly, primarily due to two key factors: 1) enhanced low-rank recovery by exploiting the graph smoothness assumption, 2) improved graph construction by exploiting clean data recovered by robust PCA. Thus, it boosts the clustering, semi-supervised classification, and data recovery performance overall. Extensive experiments on image/document clustering, object recognition, image shadow removal, and video background subtraction reveal that our model outperforms the previous state-of-the-art methods.

Recommended publications

Multi-view Clustering via Simultaneously Learning Graph Regularized Low-Rank Tensor Representation a...

Multi-view subspace clustering via simultaneously learning the representation tensor and affinity ma...

Adaptive Transition Probability Matrix Learning for Multiview Spectral Clustering
