
LDICDL: LncRNA-disease association identification based on Collaborative Deep Learning

October 2020
IEEE/ACM Transactions on Computational Biology and Bioinformatics PP(99):1-1


DOI:10.1109/TCBB.2020.3034910

Authors:

Wei Lan

Guangxi University

Qingfeng Chen

University of Technology Sydney




1545-5963 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2020.3034910, IEEE/ACM Transactions on Computational Biology and Bioinformatics



Wei Lan, Dehuan Lai, Qingfeng Chen, Ximin Wu, Baoshan Chen, Jin Liu, Jianxin Wang, Yi-Ping Phoebe Chen

Abstract—Long noncoding RNAs (lncRNAs) have been shown to play critical roles in many human diseases. Therefore, inferring associations between lncRNAs and diseases can contribute to disease diagnosis, prognosis and treatment. To overcome the limitations of traditional experimental methods, which are expensive and time-consuming, several computational methods have been proposed to predict lncRNA-disease associations by fusing different biological data. However, the prediction performance of lncRNA-disease association identification still needs to be improved. In this study, we propose a computational model (named LDICDL) to identify lncRNA-disease associations based on collaborative deep learning. It uses an autoencoder to denoise multiple sources of lncRNA feature information and multiple sources of disease feature information, respectively. Then, matrix factorization is employed to predict potential lncRNA-disease associations. In addition, to overcome the limitation of matrix factorization, a hybrid model is developed to predict associations between new lncRNAs (or diseases) and diseases (or lncRNAs). Ten-fold cross validation and a de novo test are applied to evaluate the performance of the method. The experimental results show that LDICDL outperforms other state-of-the-art methods in prediction performance.

Index Terms—lncRNA-disease associations, matrix factorization, stacked denoising autoencoder.

1 INTRODUCTION

It is well known that biological genetic information is primarily stored in protein-coding genes, and that RNA is the intermediary between DNA sequences and proteins [1]. With the development of human genetic engineering, 2% of the genes have been confirmed to be protein-coding genes, while the remaining 98% have little or no protein-coding ability [2]. These genes are usually transcribed into non-coding RNAs [3]. Non-coding RNAs were long regarded as the noise of genomic transcription [4], [5]. However, recent studies have shown that they play important regulatory roles in many biological processes of organisms. In particular, long non-coding RNAs (lncRNAs), which are longer than 200 nucleotides, have been found to be related to a broad range of diseases [6]. For example, HOTAIR has been found to be overexpressed in breast cancer, colon cancer, liver cancer and gastrointestinal stromal tumors [7]. Therefore, identifying lncRNA-disease associations helps biologists not only to understand the underlying mechanisms of disease, but also to improve disease prevention, diagnosis and treatment [8], [9].

• Wei Lan is with the School of Computer, Electronic and Information, Guangxi University, Nanning, Guangxi, 530004, China. E-mail: lanwei@gxu.edu.cn

• Dehuan Lai is with the School of Computer, Electronic and Information, Guangxi University, Nanning, Guangxi, 530004, China. E-mail: laidehuan@st.gxu.edu.cn

• Qingfeng Chen is with the School of Computer, Electronic and Information and the State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, Guangxi University, Nanning, Guangxi, 530004, China. E-mail: qingfeng@gxu.edu.cn

• Ximin Wu is with the School of Computer, Electronic and Information, Guangxi University, Nanning, Guangxi, 530004, China. E-mail: wuximin@st.gxu.edu.cn

• Baoshan Chen is with the State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, Guangxi University, Nanning, Guangxi, 530004, China. E-mail: chenyaoj@gxu.edu.cn

• Jin Liu is with the Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan, 410083, China. E-mail: liujin06@mail.csu.edu.cn

• Jianxin Wang is with the Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan, 410083, China. E-mail: jxwang@mail.csu.edu.cn

• Yi-Ping Phoebe Chen is with the Department of Computer Science and Information Technology, La Trobe University, Melbourne, Victoria 3086, Australia. E-mail: phoebe.chen@latrobe.edu.au

Manuscript received April 19, 2005; revised August 26, 2015.

Many biological experimental studies have been carried out to discover potential lncRNA-disease associations [10]. Although these methods can accurately identify lncRNA-disease associations, they are time-consuming and expensive. With the development of high-throughput sequencing technology, a large amount of lncRNA-related data, such as sequence, structure, function and expression data, has been generated [11], [12]. Thus, many computational algorithms have been proposed to overcome these limitations and predict potential lncRNA-disease associations [13]. These computational methods can be classified into two categories. (1) Network-based methods use similarity networks to predict lncRNA-disease associations. For example, Sun et al. [14] proposed a computational method, RWRlncD, to identify lncRNA-disease associations based on an lncRNA functional similarity network and the random walk with restart method. Chen et al. [15] presented an algorithm, IRWRLDA, to predict lncRNA-disease associations based on an lncRNA similarity network. They used various measures to calculate lncRNA similarity, and IRWRLDA can be applied to diseases without any known lncRNA-disease association. Zhou et al. [16] developed a model, RWRHLD, for lncRNA-disease association prediction by integrating three networks into one heterogeneous network. By constructing a multi-level lncRNA-disease network, Yao et al. [17] proposed an algorithm, LncPriCNet, to prioritize candidate lncRNA-disease associations. (2) Machine learning-based methods prioritize candidate lncRNAs by training on known disease-related lncRNAs and unknown lncRNAs. Lan et al. [18] developed an online web server (LDAP) to identify new associations between lncRNAs and diseases based on positive-unlabeled (PU) learning. Chen et al. [19] proposed a method (LRLSLDA) to infer lncRNA-disease associations based on semi-supervised learning. Wu et al. [20] presented a computational method (GAMCLDA) to predict lncRNA-disease associations based on graph autoencoder matrix completion. Fu et al. [21] developed a computational model (MFLDA) to predict associations between lncRNAs and diseases based on multiple data fusion and matrix factorization (MF). Lu et al. [22] presented a computational model (SIMCLDA) to prioritize candidate lncRNAs based on inductive matrix completion. Chen et al. [23] proposed a computational framework, ILDMSF, for lncRNA-disease association identification based on multiple kernel fusion and Support Vector Machine (SVM).

Authorized licensed use limited to: Guangxi University. Downloaded on November 03,2020 at 02:51:23 UTC from IEEE Xplore. Restrictions apply.

These methods have achieved good performance in predicting the associations between lncRNAs and diseases. However, they do not make full use of the known lncRNA and disease feature data, which limits their accuracy and prediction performance [24], [25]. This paper proposes a novel computational framework (LDICDL) to predict lncRNA-disease associations. It uses an autoencoder to denoise multiple sources of lncRNA feature data and multiple sources of disease feature data. In addition, the matrix factorization algorithm is employed to predict potential lncRNA-disease associations. Further, a hybrid model based on the stacked denoising autoencoder and matrix factorization is developed to overcome the limitation of matrix factorization in de novo prediction. The experimental results demonstrate that our method performs better than other state-of-the-art methods.

2 METHODS

The task of identifying lncRNA-disease associations can be viewed as learning from implicit feedback, which serves as both the training and the test data. The lncRNA-disease associations are represented by a matrix $LD_{m \times n}$, where $m$ and $n$ denote the numbers of lncRNAs and diseases, respectively. The element $LD(i,j)$ is equal to 1 if lncRNA $i$ is associated with disease $j$, and 0 otherwise. The lncRNA information is integrated into the lncRNA feature matrix $LF_{m \times t}$, where $t$ denotes the number of lncRNA features. The disease information is merged into the disease feature matrix $DF_{n \times s}$, where $s$ denotes the number of disease features.

2.1 Stacked Denoising Autoencoder

The stacked denoising autoencoder (SDAE) is a kind of feedforward neural network that is widely used in recommender systems [26]. In LDICDL, the SDAE is employed to extract lncRNA and disease feature information, respectively. The original features of lncRNAs and diseases have $t$ and $s$ dimensions, respectively. The SDAE reduces both the lncRNA and the disease feature information to $k$ dimensions. The mini-batch gradient descent algorithm is used to train the SDAE with a batch size of 60.
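As a rough illustration of the encoder described above, the following Python sketch runs one forward pass of a denoising autoencoder with three hidden layers. The hidden layer sizes (130, 100, 130) and the tanh and sigmoid activations follow Section 3.5. The masking-noise corruption, the random weight initialisation, the toy input dimension and all function names are assumptions made for illustration, since the paper does not specify them. Training is omitted.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer (assumed init)."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    b = [0.0] * n_out
    return w, b

def dense(x, layer, activation):
    """Apply activation(x @ W + b) to a single feature vector x."""
    w, b = layer
    return [activation(sum(x[p] * w[p][q] for p in range(len(x))) + b[q])
            for q in range(len(b))]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def corrupt(x, drop_prob, rng):
    """Masking noise (an assumption): zero each input with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]

def sdae_forward(x, layers, drop_prob, rng):
    """Return (encoding, reconstruction) for one corrupted input vector."""
    h1 = dense(corrupt(x, drop_prob, rng), layers[0], math.tanh)
    encode = dense(h1, layers[1], math.tanh)   # k-dimensional code
    h3 = dense(encode, layers[2], math.tanh)
    out = dense(h3, layers[3], sigmoid)        # reconstruction of the clean input
    return encode, out

rng = random.Random(0)
t = 20                                  # toy input dimension (hypothetical)
sizes = [t, 130, 100, 130, t]           # hidden sizes taken from Section 3.5
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
x = [float(rng.random() < 0.3) for _ in range(t)]
encode, out = sdae_forward(x, layers, drop_prob=0.3, rng=rng)
```

Here `encode` plays the role of one row of the encoded feature matrix with $k = 100$, and `out` is the reconstruction that the training objective would compare against the uncorrupted input.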

2.2 Matrix Factorization for lncRNA-disease prediction

In the lncRNA-disease association matrix $LD$, the element $LD(i,j)$ is defined as follows:

$$LD(i,j) = \begin{cases} 1, & \text{if lncRNA } i \text{ is related to disease } j \\ 0, & \text{if lncRNA } i \text{ is not related to disease } j \end{cases} \tag{1}$$

Therefore, the loss function of matrix factorization for lncRNA-disease association prediction is defined as follows:

$$Loss = \sum_{i,j} \alpha_{i,j} \left( LD(i,j) - L(i,:) \cdot D(j,:)^T \right)^2 + \gamma \left( \sum_i \|L(i,:)\|_2 + \sum_j \|D(j,:)\|_2 \right) \tag{2}$$

where $\gamma$ denotes the regularization parameter. $L(i,:)$ and $D(j,:)$ denote the subspace feature of lncRNA $i$ and the subspace feature of disease $j$, respectively. $\alpha_{i,j}$ is the confidence parameter for lncRNA $i$ and disease $j$, where $\alpha_{i,j} = 1 + \theta\, LD(i,j)$. $\|\cdot\|_2$ denotes the 2-norm.
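The weighted loss of Eq. 2 can be evaluated directly on small matrices. The sketch below is illustrative only: the function name `mf_loss` and the toy matrices are hypothetical, and the penalty is implemented as the squared 2-norm, which is an assumption about the notation (it is the form for which a ridge-style closed-form update exists).

```python
def mf_loss(LD, L, D, gamma, theta):
    """Confidence-weighted squared error plus a squared 2-norm penalty (assumed)."""
    fit = 0.0
    for i, row in enumerate(LD):
        for j, y in enumerate(row):
            alpha = 1.0 + theta * y                         # confidence weight
            pred = sum(a * b for a, b in zip(L[i], D[j]))   # L(i,:) . D(j,:)^T
            fit += alpha * (y - pred) ** 2
    penalty = (sum(v * v for r in L for v in r) +
               sum(v * v for r in D for v in r))
    return fit + gamma * penalty

# Toy example: 2 lncRNAs, 2 diseases, 1 latent dimension.
LD = [[1, 0], [0, 1]]
L = [[1.0], [0.0]]
D = [[1.0], [0.0]]
loss = mf_loss(LD, L, D, gamma=0.5, theta=2.0)
```

In this toy case only the known association $LD(2,2)$ is mispredicted, so it contributes its full weight $1 + \theta = 3$ to the fit term, while unknown pairs would contribute with weight 1.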

2.3 Matrix Factorization with Implicit Feedback for LncRNA-disease association prediction

Matrix factorization predicts lncRNA-disease associations poorly for a new lncRNA or disease, which is known as the cold start problem [13], [27]. To address this, a hybrid model is proposed that predicts lncRNA-disease associations by combining matrix factorization with the stacked denoising autoencoder. In our method, predicting associations for a new lncRNA or disease relies on the biological features of lncRNAs and diseases. The structure of the hybrid model for lncRNAs, with three hidden layers, is shown in Figure 1. $X_{input\_l}$ is the input layer for lncRNA features (i.e., $LF$) and $X_{encode\_l}$ is the encoding of the lncRNA features. $X_{out\_l}$ is the output layer of lncRNA features.

[Figure 1. Labels: SDAE ($X_{input\_l}$, $X_{inputnoise\_l}$, $X_{1\_l}$, $X_{encode\_l}(i,:)$, $X_{3\_l}$, $X_{output\_l}$); MF ($D(j,:)$, $M_l(i,j) = X_{encode\_l}(i,:)\, D(j,:)^T$).]

Fig. 1. The overview of hybrid model of lncRNA.


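The loss function of the hybrid model based on lncRNA features is defined as follows:

$$\begin{aligned} Loss = {} & \sum_{i,j} \alpha_{i,j} \left( LD(i,j) - L(i,:) \cdot D(j,:)^T \right)^2 + \gamma \left( \sum_i \|L(i,:)\|_2 + \sum_j \|D(j,:)\|_2 \right) \\ & + \gamma_l \|L - X_{encode\_l}\|_2 + \gamma_n \|X_{input\_l} - X_{out\_l}\|_2 + \sum_{layers} \gamma_w \|W\|_2 \end{aligned} \tag{3}$$

where $\gamma$, $\gamma_l$, $\gamma_n$ and $\gamma_w$ denote regularization parameters and $W$ denotes the weight matrix.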

The loss function is minimized by block coordinate descent [28]. In the training step, $L(i,:)$ is updated according to Eq. 4:

$$L(i,:) \leftarrow LD(i,:)\, C^{(i)} D \left( \gamma I + D^T C^{(i)} D \right)^{-1} \tag{4}$$

where $C^{(i)}$ is a diagonal matrix with $C^{(i)}(j,j) = \alpha_{i,j}$.

In the training step, $D(j,:)$ is updated according to Eq. 5:

$$D(j,:) \leftarrow LD(:,j)^T\, \tilde{C}^{(j)} L \left( \gamma I + L^T \tilde{C}^{(j)} L \right)^{-1} \tag{5}$$

where $\tilde{C}^{(j)}$ is a diagonal matrix with $\tilde{C}^{(j)}(i,i) = \alpha_{i,j}$.
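To make the closed-form updates of Eqs. 4 and 5 concrete, the sketch below computes one such row update in plain Python. It implements the update exactly as printed, without any additional terms; the helper names `solve` and `update_row` are hypothetical, and the small Gauss-Jordan solver stands in for the matrix inverse so that the example has no external dependencies.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(A[r]) + [b[r]] for r in range(n)]     # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[r][n] / M[r][r] for r in range(n)]

def update_row(y, F, gamma, theta):
    """One update in the form of Eq. 4 (or Eq. 5).

    y is one row of LD (one column for Eq. 5); F is the factor matrix held
    fixed (D for Eq. 4, L for Eq. 5), stored as a list of rows.
    """
    k = len(F[0])
    alpha = [1.0 + theta * v for v in y]            # diagonal of C
    # A = gamma * I + F^T C F, a symmetric k x k matrix
    A = [[(gamma if p == q else 0.0) +
          sum(alpha[j] * F[j][p] * F[j][q] for j in range(len(F)))
          for q in range(k)] for p in range(k)]
    # b = y C F, a vector of length k
    b = [sum(y[j] * alpha[j] * F[j][p] for j in range(len(F))) for p in range(k)]
    # A is symmetric, so the row equation x A = b is the same as A x = b.
    return solve(A, b)

D = [[1.0], [1.0]]                                   # 2 diseases, k = 1
new_row = update_row([1, 0], D, gamma=1.0, theta=1.0)
```

Alternating this update over all rows of $L$ and then over all rows of $D$ gives one sweep of the matrix factorization block.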

For diseases, the structure with three hidden layers is shown in Figure 2. $X_{input\_d}$ is the input layer for disease features (i.e., $DF$) and $X_{encode\_d}$ represents the encoding of the disease features. $X_{out\_d}$ is the output layer of disease features.

[Figure 2. Labels: SDAE ($X_{input\_d}$, $X_{inputnoise\_d}$, $X_{1\_d}$, $X_{encode\_d}(j,:)$, $X_{3\_d}$, $X_{output\_d}$); MF ($L(i,:)$, $M_d(i,j) = L(i,:) \cdot X_{encode\_d}(j,:)^T$).]

Fig. 2. The overview of hybrid model of disease.

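The loss function of the hybrid model based on disease feature information is defined as follows:

$$\begin{aligned} Loss = {} & \sum_{i,j} \alpha_{i,j} \left( LD(i,j) - L(i,:) \cdot D(j,:)^T \right)^2 + \gamma \left( \sum_i \|L(i,:)\|_2 + \sum_j \|D(j,:)\|_2 \right) \\ & + \gamma_d \|D - X_{encode\_d}\|_2 + \gamma_n \|X_{input\_d} - X_{out\_d}\|_2 + \sum_{layers} \gamma_w \|W\|_2 \end{aligned} \tag{6}$$

where $\gamma$, $\gamma_d$, $\gamma_n$ and $\gamma_w$ denote regularization parameters and $W$ denotes the weight matrix.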

The final predicted score matrix $S$ is calculated as follows:

$$S(i,j) = \frac{M_l(i,j) + M_d(i,j)}{2} \tag{7}$$

$$M_l(i,j) = X_{encode\_l}(i,:) \cdot D(j,:)^T \tag{8}$$

$$M_d(i,j) = L(i,:) \cdot X_{encode\_d}(j,:)^T \tag{9}$$

where $S(i,j)$ denotes the score between lncRNA $i$ and disease $j$. $X_{encode\_l}$ and $X_{encode\_d}$ denote the sub-feature matrices of lncRNAs and diseases obtained by the SDAE from the lncRNA and disease feature information, respectively. $L$ and $D$ denote the sub-feature matrices of lncRNAs and diseases obtained from matrix factorization.
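Equations 7 to 9 amount to averaging two inner-product score matrices. A minimal sketch follows, with hypothetical variable names and toy values chosen only to make the arithmetic easy to follow.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def final_scores(X_enc_l, D, L, X_enc_d):
    """S = (M_l + M_d) / 2 with M_l and M_d as in Eqs. 8 and 9."""
    m, n = len(L), len(D)
    M_l = [[dot(X_enc_l[i], D[j]) for j in range(n)] for i in range(m)]   # Eq. 8
    M_d = [[dot(L[i], X_enc_d[j]) for j in range(n)] for i in range(m)]   # Eq. 9
    return [[(M_l[i][j] + M_d[i][j]) / 2.0 for j in range(n)]             # Eq. 7
            for i in range(m)]

# Toy example: 1 lncRNA, 2 diseases, k = 2.
X_enc_l = [[1.0, 0.0]]                 # SDAE encoding of the lncRNA
D = [[0.5, 0.5], [1.0, 0.0]]           # disease factors from MF
L = [[0.0, 1.0]]                       # lncRNA factors from MF
X_enc_d = [[0.2, 0.4], [0.0, 0.6]]     # SDAE encodings of the diseases
S = final_scores(X_enc_l, D, L, X_enc_d)
```

Because $M_l$ depends on the lncRNA only through its encoded features, it can still be computed for an lncRNA with no known associations; the same holds for $M_d$ and new diseases.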

The whole workflow of LDICDL is shown in Figure 3. In the first step, the lncRNA-disease association matrix is decomposed to obtain the lncRNA feature subspace, and the disease feature information is encoded by the SDAE. Meanwhile, the lncRNA-disease association matrix is decomposed to obtain the disease feature subspace, and the lncRNA feature information is encoded by the SDAE. Then, two lncRNA-disease association scores are predicted: one from the lncRNA feature matrix and the disease encoding matrix, and one from the disease feature matrix and the lncRNA encoding matrix. Finally, the lncRNA-disease association score is calculated by averaging the two scores. Block coordinate descent is used to minimize the loss function. First, $L$ and $D$ are updated by Eqs. 4 and 5, respectively. Then, the parameters of the SDAE are updated using mini-batch gradient descent. The mean squared errors of the output and of the encoding are used to adjust the gradient. These steps are repeated $t$ times.

Fig. 3. The workflow of LDICDL.

3 RESULTS

3.1 Datasets

The lncRNA-gene associations are downloaded from lncRNA2target [29] and the lncRNA-gene function associations are collected from GeneRIF [30]. They are pre-processed using the Open Biomedical Annotator [31]. The lncRNA-miRNA associations and disease-miRNA associations are downloaded from starBase v2.0 [32] and HMDD [33], respectively. The disease-gene associations are downloaded from DisGeNET [34]. In total, 2697 associations between 240 lncRNAs and 412 diseases are obtained as the gold-standard dataset. In addition, 6066-dimensional feature information for each lncRNA is collected from the lncRNA-related data, and 10621-dimensional feature information for each disease is collected from the disease-related data.

3.2 Performance evaluation

Ten-fold cross validation and a de novo test are employed to evaluate the performance of the different methods. In ten-fold cross validation, all known associations between lncRNAs and diseases are randomly divided into ten folds. In each round, one fold is selected as the test samples and the other nine folds are treated as training samples. The known associations in the test fold are removed in turn, and the known associations in the training folds are used to train the model. Then, the prediction algorithm is run to score the test samples and the candidate samples. In the de novo test, for disease $i$, all of its known associations are removed as test samples, while all known associations between lncRNAs and other diseases are used as training samples. Then, the scores of the associations between lncRNAs and disease $i$ are calculated by the prediction method. After that, the test and candidate samples are ranked by score in descending order, and each sample is compared against a specific rank threshold. If a test sample is ranked above the threshold, it is considered a true positive; otherwise it is a false negative. If a candidate sample is ranked above the threshold, it is considered a false positive; otherwise it is a true negative. Further, the true positive rate (TPR) and false positive rate (FPR) are calculated as follows:

$$TPR = \frac{TP}{TP + FN} \tag{10}$$

$$FPR = \frac{FP}{FP + TN} \tag{11}$$

where $TP$ denotes the number of true positive samples, $TN$ denotes the number of true negative samples, $FP$ denotes the number of false positive samples, and $FN$ denotes the number of false negative samples. The receiver operating characteristic (ROC) curve is drawn based on the TPR and FPR at different thresholds, and the area under the ROC curve (AUC) is calculated to evaluate the performance of the method. An AUC of 1 indicates perfect performance, while an AUC of 0.5 indicates performance no better than random guessing.
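The ROC construction described above can be sketched as follows: samples are ranked by score, each rank cutoff yields one (FPR, TPR) point from Eqs. 10 and 11, and the AUC is the trapezoidal area under those points. The function name `roc_auc` and the toy scores are hypothetical; label 1 marks a test sample (held-out known association) and label 0 a candidate sample.

```python
def roc_auc(scores, labels):
    """Trapezoidal area under the ROC curve built from rank thresholds."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    auc = 0.0
    prev_tpr = prev_fpr = 0.0
    for idx, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a ROC point only after the last sample of a tied score group.
        if idx + 1 < len(ranked) and ranked[idx + 1][0] == score:
            continue
        tpr, fpr = tp / n_pos, fp / n_neg            # Eqs. 10 and 11
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
    return auc

auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In the toy call, three of the four (test, candidate) pairs are ordered correctly, which is why the area is 0.75.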

In addition, precision and recall are calculated as follows:

$$Precision = \frac{TP}{TP + FP} \tag{12}$$

$$Recall = \frac{TP}{TP + FN} \tag{13}$$

where precision denotes the proportion of true positive samples among all samples ranked above the specific threshold (the predicted positive samples), and recall denotes the proportion of positive samples ranked above the specific threshold among all positive samples. Then, the precision-recall (PR) curve is plotted based on precision and recall. Finally, the area under the PR curve (AUPR) is computed to evaluate the performance of the method.
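A small sketch of Eqs. 12 and 13 at every rank cutoff, followed by a trapezoidal AUPR, is given below. The function names are hypothetical, ties are not merged, and extending the curve to recall 0 at the precision of the first cutoff is an assumed convention; the paper does not state which interpolation it uses.

```python
def pr_points(scores, labels):
    """(recall, precision) at each rank cutoff k = 1..N."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    tp = 0
    points = []
    for k, (_, label) in enumerate(ranked, start=1):
        tp += label
        points.append((tp / n_pos, tp / k))      # Eq. 13 and Eq. 12 at top k
    return points

def aupr(points):
    """Trapezoidal area under the PR curve (assumed interpolation)."""
    area = 0.0
    prev_r, prev_p = 0.0, points[0][1]           # start at recall 0
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2.0
        prev_r, prev_p = r, p
    return area

points = pr_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
area = aupr(points)
```

When known associations are a small fraction of all lncRNA-disease pairs, precision stays low at most cutoffs, so AUPR values are typically far smaller than AUC values on the same ranking.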

3.3 Ten-fold cross validation

To evaluate the performance of LDICDL, ten-fold cross validation is applied in the experiment. We compare LDICDL with two state-of-the-art methods based on matrix completion (SIMCLDA [22] and MFLDA [21]). The performance of the different methods is evaluated in terms of AUC. As shown in Figure 4, LDICDL achieves an AUC of 0.8651, which is significantly higher than those of the other methods (SIMCLDA 0.8259 and MFLDA 0.6430). This demonstrates that our method performs better than the other methods. In addition, the AUPR is used to compare the methods, as shown in Figure 5. The AUPR of LDICDL is 0.0306, compared with 0.0227 and 0.0051 for SIMCLDA and MFLDA, respectively. This indicates that our method is more effective than the other two methods. Figure 6 shows the number of correctly retrieved known lncRNA-disease associations. LDICDL outperforms the other methods from the top 10 to the top 50 associations.
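The ten-fold protocol of Section 3.2 can be sketched as a generator that hides one fold of known associations at a time. This is only an illustration under simple assumptions: the function name, the fixed seed and the toy association matrix are hypothetical, and folds are formed by striding over a shuffled list.

```python
import random

def ten_fold_splits(LD, n_folds=10, seed=0):
    """Yield (train_matrix, test_pairs) for each fold of known associations."""
    known = [(i, j) for i, row in enumerate(LD)
             for j, v in enumerate(row) if v == 1]
    random.Random(seed).shuffle(known)
    folds = [known[f::n_folds] for f in range(n_folds)]
    for test_pairs in folds:
        train = [list(row) for row in LD]    # copy, then hide the test fold
        for i, j in test_pairs:
            train[i][j] = 0
        yield train, test_pairs

# Toy association matrix: 10 lncRNAs x 6 diseases with 20 known associations.
LD = [[1 if (i + j) % 3 == 0 else 0 for j in range(6)] for i in range(10)]
splits = list(ten_fold_splits(LD))
```

Each `train` matrix would be passed to the model, and the held-out `test_pairs` would then be ranked against the candidate (zero) entries.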

To show that our model can obtain a deep latent representation of the features, we compare LDICDL with three classical feature extraction methods: Nonnegative Matrix Factorization (NMF), Principal Component Analysis (PCA) and Latent Dirichlet Allocation (LDA). The comparison is shown in Figure 7. LDICDL, which is based on the stacked denoising autoencoder, outperforms the other methods. Moreover, to show the effect of combining MF and SDAE, we compare the combination with MF alone and with SDAE alone. The result, shown in Figure 8, demonstrates that the combination of MF and SDAE outperforms either single method. We also compare different regularization methods (L1, L21 and L2) for matrix factorization. The results are shown in Figure 9; the L2 norm obtains the best performance.

Fig. 4. The AUROC of LDICDL, SIMCLDA and MFLDA by using ten-fold cross validation.


Fig. 5. The AUPR of LDICDL, SIMCLDA and MFLDA by using ten-fold cross validation.

Fig. 6. Number of correctly retrieved known lncRNA-disease associations for specified rank thresholds based on ten-fold cross validation.

Fig. 7. The AUROC of LDICDL, PCA, LDA and NMF by using ten-fold cross validation.

Fig. 8. The AUROC of MF, SDAE and SDAE+MF by using ten-fold cross validation.

Fig. 9. The AUROC of L1, L21 and L2 in MF by using ten-fold cross validation.

3.4 De novo test

To validate the performance of LDICDL in identifying potential associations for new diseases, the de novo test is conducted. In each round, the de novo test removes all known lncRNA associations of one disease $i$ and uses them as the test set. The potential associations between lncRNAs and disease $i$ are then predicted based on feature information. The AUROC and AUPR results are shown in Figures 10 and 11, respectively. LDICDL achieves the highest AUC and AUPR (0.8917 and 0.1666). Its AUC is at least 0.09 higher than those of the other methods (SIMCLDA 0.7923 and MFLDA 0.5952), and its AUPR is about 0.04 higher than that of the next best method (SIMCLDA 0.1270 and MFLDA 0.0398). This demonstrates that our method is superior to the other methods in the de novo test. Figure 12 shows the number of correctly retrieved known lncRNA-disease associations. LDICDL outperforms the other methods from the top 10 to the top 50.
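The de novo protocol hides an entire disease column at a time, so the model must rely on disease features alone for that disease. A minimal sketch follows, assuming diseases without any known association are simply skipped; the function name and the toy matrix are hypothetical.

```python
def de_novo_splits(LD):
    """For each disease j with known associations, hide its whole column."""
    n = len(LD[0])
    for j in range(n):
        test_lncrnas = [i for i, row in enumerate(LD) if row[j] == 1]
        if not test_lncrnas:
            continue                         # nothing to evaluate for this disease
        train = [row[:j] + [0] + row[j + 1:] for row in LD]
        yield j, train, test_lncrnas

# Toy example: 3 lncRNAs x 3 diseases; disease 2 has no known association.
LD = [[1, 0, 0],
      [1, 1, 0],
      [0, 1, 0]]
splits = list(de_novo_splits(LD))
```

For each yielded disease, the model is trained on `train` and the lncRNAs in `test_lncrnas` are ranked against all other lncRNAs for that disease.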

Fig. 10. The AUROC of LDICDL, SIMCLDA and MFLDA by using de

novo cross validation.

Fig. 11. The AUPR of LDICDL, SIMCLDA and MFLDA by using de novo

cross validation.

3.5 The effects of parameters

In the SDAE, the feature information of lncRNAs and diseases is reduced to a subspace. To test the effect of the feature dimension $k$, we conduct ten-fold cross validation while varying the feature dimension from 50 to 250 in steps of 50. The result is shown in Figure 13. LDICDL achieves the best performance when the feature dimension is equal to 100. Therefore, $k = 100$ is used in the experiments. All three hidden layers use the non-linear activation function tanh, and the output layer uses the sigmoid function. The numbers of neurons in the hidden layers of the autoencoder are set to 130, 100 and 130, respectively.

Fig. 12. The number of correctly retrieved known lncRNA-disease associations for specified rank thresholds based on de novo validation.

The hyperparameters are selected by the random search proposed in [35]. $\gamma$ and $\theta$ are chosen from [0.1, 1, 10, 100, 200, 300, 500, 1000]; the ratios $\gamma_l : \gamma_n$ and $\gamma_d : \gamma_n$ are both chosen from [1:1, 100:1, 200:1, 300:1, 400:1, 500:1, 600:1, 700:1, 800:1, 900:1, 1000:1] [36]; and $\gamma_w$ is chosen from [0.1, 0.3, 0.5, 0.7, 0.9]. All hyperparameters are sampled from a uniform distribution over the set of possible values. In our experiment, we repeat the process 20 times to find the optimum parameters. The parameters are set as follows: $\theta = 100$, $\gamma = 300$, $\gamma_l : \gamma_n = \gamma_d : \gamma_n = 100 : 1$, $\gamma_w = 0.3$.
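The random search described above can be sketched with the candidate grids listed in the text. Everything else here is an assumption made for illustration: the function names, the fixed seed, sampling the two ratios independently, and in particular `toy_evaluate`, which is a stand-in for a real cross-validated score and has no connection to the reported results.

```python
import random

GAMMA_THETA = [0.1, 1, 10, 100, 200, 300, 500, 1000]
RATIOS = [1, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]  # x in "x:1"
GAMMA_W = [0.1, 0.3, 0.5, 0.7, 0.9]

def sample_config(rng):
    """Draw each hyperparameter uniformly from its candidate set."""
    return {
        "gamma": rng.choice(GAMMA_THETA),
        "theta": rng.choice(GAMMA_THETA),
        "ratio_l": rng.choice(RATIOS),     # gamma_l : gamma_n
        "ratio_d": rng.choice(RATIOS),     # gamma_d : gamma_n
        "gamma_w": rng.choice(GAMMA_W),
    }

def random_search(evaluate, n_trials=20, seed=0):
    """Keep the best of n_trials uniformly sampled configurations."""
    rng = random.Random(seed)
    best_score, best_config = None, None
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if best_score is None or score > best_score:
            best_score, best_config = score, config
    return best_config, best_score

def toy_evaluate(config):
    """Hypothetical objective; a real run would return a validation AUC."""
    return -abs(config["gamma"] - 300) - abs(config["theta"] - 100)

best_config, best_score = random_search(toy_evaluate)
```

With only 20 trials over this many combinations, the search samples a small part of the space, so the selected values should be read as a good configuration found within that budget.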

Fig. 13. The effect of feature dimension k.

3.6 Case study

To demonstrate the capability of LDICDL in identifying potential lncRNA-disease associations, osteosarcoma is selected as a case study. In the case study, all known associations between lncRNAs and osteosarcoma are treated as positive samples. Then, the potential associations are predicted by LDICDL. The predicted lncRNAs for osteosarcoma are analyzed by consulting recent publications.

Osteosarcoma (osteogenic sarcoma) is a primary bone malignancy that often affects children and young adults (approximately 3.4% of all childhood cancers) [37], [38]. This cancer is rare (less than 1% of all cancers diagnosed) and its pathogenesis is unknown. With the development of multi-agent chemotherapy regimens, the long-term survival rate has improved from 65% to 70% [39]. Unfortunately, prognosis and treatment have not improved in several decades. Table 1 shows the top 10 lncRNAs predicted by LDICDL for osteosarcoma. As shown in Table 1, 9 out of 10 lncRNAs are confirmed to be related to osteosarcoma by recent literature. H19, ranked 1st, has been shown to be related to osteosarcoma [40]. The rs217727 in H19 can increase the IGF2 cord blood level, which is significantly associated with osteosarcoma. The long noncoding RNA PVT1, ranked 2nd, has been reported to promote cell apoptosis and inhibit cell proliferation, migration and invasion in osteosarcoma cells by regulating the expression of miR-195 [41]. GAS5, ranked 3rd, can promote the expression of aplasia Ras homologue member I (ARHI), which suppresses cell growth and epithelial-mesenchymal transition in osteosarcoma, by acting as a molecular sponge to regulate the expression of miR-221 [42]. Recent research shows that NEAT1, ranked 4th, is significantly upregulated in osteosarcoma cell lines, which is closely associated with higher clinical stage, distant metastasis and poorer prognosis. In addition, it can inhibit E-cadherin expression and promote the metastasis of osteosarcoma by associating with the G9a-DNMT1-Snail complex [43]. The long noncoding RNA KCNQ1OT1, ranked 5th, has been found to be associated with cell invasion, migration, growth, proliferation and apoptosis through enhancing WNT/beta-catenin signaling pathway activity in osteosarcoma tissue [44]. AFAP1-AS1, ranked 7th, has been found to be significantly over-expressed, and knockdown of AFAP1-AS1 strikingly inhibits cell proliferation in osteosarcoma tissue. This demonstrates that AFAP1-AS1 can promote cell proliferation in osteosarcoma via regulating miR-4695-5p/TCF4-β-catenin signaling [45]. The long noncoding RNA XIST, ranked 8th, has been shown to bind to miR-320b and inhibit its expression in osteosarcoma cells. miR-320b can target the Ras-related protein RAP2B and inhibit the expression of RAP2B, which is involved in cell proliferation and invasion of osteosarcoma [46]. CCAT1, ranked 9th, has been found to be upregulated in osteosarcoma tissues and cells, and is related to the cell proliferation and migration of osteosarcoma by binding to miR-148a and regulating the signaling pathway of phosphatidylinositol 3-kinase interacting protein 1 (PIK3IP1) [47]. Recent evidence shows that the long noncoding RNA SPRY4-IT1, ranked 10th, is over-expressed in osteosarcoma tissues and that SPRY4-IT1 knockdown strikingly inhibits cell proliferation through inhibiting the expression of G1 [48]. In addition, some interesting lncRNAs such as MIR155HG are found by our method. The biological functions of these lncRNAs are still unknown. They deserve validation by biologists through biological experiments.

TABLE 1
Top 10 lncRNA of osteosarcoma predicted by LDICDL

Rank   LncRNA       Evidence
1      H19          [40]
2      PVT1         [41]
3      GAS5         [42]
4      NEAT1        [43]
5      KCNQ1OT1     [44]
6      MIR155HG     Unknown
7      AFAP1-AS1    [45]
8      XIST         [46]
9      CCAT1        [47]
10     SPRY4-IT1    [48]

4 DISCUSSION

It is well known that lncRNAs are an important kind of non-coding RNA longer than 200 nucleotides [49]. Accumulating evidence shows that lncRNAs play critical roles in various biological processes such as chromosome dosage compensation, genomic imprinting, epigenetic regulation, nuclear and cytoplasmic trafficking, cell proliferation, cell differentiation, cell growth, cell metabolism and cell apoptosis [50], [51]. In addition, a growing number of studies demonstrate that lncRNAs are closely related to various diseases, including cancer [28]. Therefore, identifying lncRNA-disease associations helps to understand the pathogenesis of diseases and further supports disease treatment and drug discovery.

In this study, we have proposed a computational method, called LDICDL, to predict lncRNA-disease associations based on collaborative deep learning. In this approach, the lncRNA-disease association matrix is decomposed into an lncRNA feature subspace while the disease feature information is encoded by an SDAE. In parallel, the lncRNA-disease association matrix is decomposed into a disease feature subspace while the lncRNA feature information is encoded by an SDAE. The lncRNA-disease association score is then predicted twice: once from the lncRNA feature matrix and the disease encoding matrix, and once from the disease feature matrix and the lncRNA encoding matrix. The final lncRNA-disease association score is the average of these two scores. The results demonstrate that LDICDL is competitive with, and often performs better than, other state-of-the-art methods. In addition, our method may also be applied to other biological association prediction tasks, such as miRNA-disease association prediction [52], [53], [54], drug-target interaction prediction [55] and disease gene prediction [56].
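The two-branch scoring summarized above can be illustrated with a minimal sketch. It is not the LDICDL implementation: it assumes that each branch scores an lncRNA-disease pair as the inner product of the two corresponding vectors (as is usual in collaborative deep learning), it fills all four matrices with random numbers in place of trained factors and SDAE codes, and every function and variable name is hypothetical.

```python
import random


def inner_product_scores(row_vectors, col_vectors):
    """Score each (row, column) pair as the inner product of its two vectors."""
    return [[sum(a * b for a, b in zip(r, c)) for c in col_vectors]
            for r in row_vectors]


def fuse_scores(lnc_latent, dis_encoded, lnc_encoded, dis_latent):
    """Average the two branch predictions element-wise."""
    # Branch 1: lncRNA factors from matrix decomposition x disease codes.
    s1 = inner_product_scores(lnc_latent, dis_encoded)
    # Branch 2: lncRNA codes x disease factors from matrix decomposition.
    s2 = inner_product_scores(lnc_encoded, dis_latent)
    return [[(a + b) / 2.0 for a, b in zip(r1, r2)] for r1, r2 in zip(s1, s2)]


def random_matrix(rows, cols, rng):
    """Placeholder for trained factors or encoder outputs (assumption)."""
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_lnc, n_dis, k = 5, 4, 3  # toy sizes, chosen only for illustration

scores = fuse_scores(
    random_matrix(n_lnc, k, rng),  # lncRNA latent factors
    random_matrix(n_dis, k, rng),  # disease encodings
    random_matrix(n_lnc, k, rng),  # lncRNA encodings
    random_matrix(n_dis, k, rng),  # disease latent factors
)

# Rank candidate lncRNAs for one disease by descending fused score.
disease = 0
ranking = sorted(range(n_lnc), key=lambda i: scores[i][disease], reverse=True)
print(ranking)
```

Sorting one column of the fused score matrix in this way is how a ranked candidate list such as Table 1 could be read off, again under the assumptions stated above.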

FUNDING

This work was partially supported by the National Natural Science Foundation of China (Nos. 61702122, 61963004 and 61972185), the Natural Science Foundation of Guangxi (Nos. 2017GXNSFDA198033 and 2018GXNSFBA281193), the Key Research and Development Plan of Guangxi (No. AB17195055), the Foundation of Guangxi University (Nos. 20190240 and XBZ180479), the Innovation Project of Guangxi Graduate Education (No. YCSW2020020), the Natural Science Foundation of Yunnan Province of China (No. 2019FA024), the Hunan Provincial Science and Technology Program (No. 2018WK4001), and the Scientific Research Foundation of Hunan Provincial Education Department (No. 18B469).

Authorized licensed use limited to: Guangxi University. Downloaded on November 03,2020 at 02:51:23 UTC from IEEE Xplore. Restrictions apply.

Transactions on Computational Biology and Bioinformatics


REFERENCES

[1] G. L. Maor, A. Yearim, and G. Ast, “The alternative role of DNA methylation in splicing regulation,” Trends in Genetics, vol. 31, no. 5, pp. 274–280, 2015.

[2] W. Lan, J. Wang, M. Li, J. Liu, F.-X. Wu, and Y. Pan, “Predicting microRNA-disease associations based on improved microRNA and disease similarities,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 15, no. 6, pp. 1774–1782, 2018.

[3] E. Anastasiadou, L. S. Jacob, and F. J. Slack, “Non-coding RNA networks in cancer,” Nature Reviews Cancer, vol. 18, no. 1, p. 5, 2018.

[4] J. Ponjavic, C. P. Ponting, and G. Lunter, “Functionality or transcriptional noise? Evidence for selection within long noncoding RNAs,” Genome Research, vol. 17, no. 5, pp. 556–565, 2007.

[5] Q. Chen, W. Lan, and J. Wang, “Mining featured patterns of miRNA interaction based on sequence and structure similarity,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 10, no. 2, pp. 415–422, 2013.

[6] K. C. Wang and H. Y. Chang, “Molecular mechanisms of long noncoding RNAs,” Molecular Cell, vol. 43, no. 6, pp. 904–914, 2011.

[7] X. Xue, Y. A. Yang, A. Zhang, K. Fong, J. Kim, B. Song, S. Li, J. C. Zhao, and J. Yu, “LncRNA HOTAIR enhances ER signaling and confers tamoxifen resistance in breast cancer,” Oncogene, vol. 35, no. 21, p. 2746, 2016.

[8] L. Yang, C. Lin, C. Jin, J. C. Yang, B. Tanasa, W. Li, D. Merkurjev, K. A. Ohgi, D. Meng, J. Zhang et al., “lncRNA-dependent mechanisms of androgen-receptor-regulated gene activation programs,” Nature, vol. 500, no. 7464, p. 598, 2013.

[9] W. Lan, J. Wang, M. Li, W. Peng, and F. Wu, “Computational approaches for prioritizing candidate disease genes based on PPI networks,” Tsinghua Science and Technology, vol. 20, no. 5, pp. 500–512, 2015.

[10] G. Yang, X. Lu, and L. Yuan, “LncRNA: a link between RNA and cancer,” Biochimica et Biophysica Acta (BBA)-Gene Regulatory Mechanisms, vol. 1839, no. 11, pp. 1097–1109, 2014.

[11] P.-J. Volders, K. Verheggen, G. Menschaert, K. Vandepoele, L. Martens, J. Vandesompele, and P. Mestdagh, “An update on LNCipedia: a database for annotated human lncRNA sequences,” Nucleic Acids Research, vol. 43, no. D1, pp. D174–D180, 2014.

[12] Q. Jiang, J. Wang, X. Wu, R. Ma, T. Zhang, S. Jin, Z. Han, R. Tan, J. Peng, G. Liu et al., “LncRNA2Target: a database for differentially expressed genes after lncRNA knockdown or overexpression,” Nucleic Acids Research, vol. 43, no. D1, pp. D193–D196, 2014.

[13] W. Lan, L. Huang, D. Lai, and Q. Chen, “Identifying interactions between long noncoding RNAs and diseases based on computational methods,” in Computational Systems Biology. Springer, 2018, pp. 205–221.

[14] J. Sun, H. Shi, Z. Wang, C. Zhang, L. Liu, L. Wang, W. He, D. Hao, S. Liu, and M. Zhou, “Inferring novel lncRNA–disease associations based on a random walk model of a lncRNA functional similarity network,” Molecular BioSystems, vol. 10, no. 8, pp. 2074–2081, 2014.

[15] X. Chen, Z.-H. You, G.-Y. Yan, and D.-W. Gong, “IRWRLDA: improved random walk with restart for lncRNA-disease association prediction,” Oncotarget, vol. 7, no. 36, p. 57919, 2016.

[16] M. Zhou, X. Wang, J. Li, D. Hao, Z. Wang, H. Shi, L. Han, H. Zhou, and J. Sun, “Prioritizing candidate disease-related long non-coding RNAs by walking on the heterogeneous lncRNA and disease network,” Molecular BioSystems, vol. 11, no. 3, pp. 760–769, 2015.

[17] Q. Yao, L. Wu, J. Li, L. G. Yang, Y. Sun, Z. Li, S. He, F. Feng, H. Li, and Y. Li, “Global prioritizing disease candidate lncRNAs via a multi-level composite network,” Scientific Reports, vol. 7, p. 39516, 2017.

[18] W. Lan, M. Li, K. Zhao, J. Liu, F.-X. Wu, Y. Pan, and J. Wang, “LDAP: a web server for lncRNA-disease association prediction,” Bioinformatics, vol. 33, no. 3, pp. 458–460, 2016.

[19] X. Chen and G. Yan, “Novel human lncRNA–disease association inference based on lncRNA expression profiles,” Bioinformatics, vol. 29, no. 20, pp. 2617–2624, 2013.

[20] X. Wu, W. Lan, Q. Chen, Y. Dong, J. Liu, and W. Peng, “Inferring lncRNA-disease associations based on graph autoencoder matrix completion,” Computational Biology and Chemistry, p. 107282, 2020.

[21] G. Fu, J. Wang, C. Domeniconi, and G. Yu, “Matrix factorization-based data fusion for the prediction of lncRNA–disease associations,” Bioinformatics, vol. 34, no. 9, pp. 1529–1537, 2017.

[22] C. Lu, M. Yang, F. Luo, F.-X. Wu, M. Li, Y. Pan, Y. Li, and J. Wang, “Prediction of lncRNA–disease associations based on inductive matrix completion,” Bioinformatics, vol. 34, no. 19, pp. 3357–3364, 2018.

[23] Q. Chen, D. Lai, W. Lan, X. Wu, B. Chen, Y.-P. P. Chen, and J. Wang, “ILDMSF: Inferring associations between long non-coding RNA and disease based on multi-similarity fusion,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2019.

[24] J. Han, L. Zheng, Y. Xu, B. Zhang, F. Zhuang, P. S. Yu, and W. Zuo, “Adaptive deep modeling of users and items using side information for recommendation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 3, pp. 737–748, 2020.

[25] H. Park, J. Jung, and U. Kang, “A comparative study of matrix factorization and random walk with restart in recommender systems,” in 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017, pp. 756–765.

[26] H. Wang, N. Wang, and D.-Y. Yeung, “Collaborative deep learning for recommender systems,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015, pp. 1235–1244.

[27] A. Ramlatchan, M. Yang, Q. Liu, M. Li, J. Wang, and Y. Li, “A survey of matrix completion methods for recommendation systems,” Big Data Mining and Analytics, vol. 1, no. 4, pp. 308–323, 2018.

[28] Y. Hu, Y. Koren, and C. Volinsky, “Collaborative filtering for implicit feedback datasets,” in 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008, pp. 263–272.

[29] Q. Jiang, J. Wang, X. Wu, R. Ma, T. Zhang, S. Jin, Z. Han, R. Tan, J. Peng, G. Liu et al., “LncRNA2Target: a database for differentially expressed genes after lncRNA knockdown or overexpression,” Nucleic Acids Research, vol. 43, no. D1, pp. D193–D196, 2014.

[30] Z. Lu, K. Bretonnel Cohen, and L. Hunter, “GeneRIF quality assurance as summary revision,” in Biocomputing 2007. World Scientific, 2007, pp. 269–280.

[31] C. Jonquet, N. H. Shah, and M. A. Musen, “The open biomedical annotator,” Summit on Translational Bioinformatics, vol. 2009, p. 56, 2009.

[32] J.-H. Li, S. Liu, H. Zhou, L.-H. Qu, and J.-H. Yang, “starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein–RNA interaction networks from large-scale CLIP-Seq data,” Nucleic Acids Research, vol. 42, no. D1, pp. D92–D97, 2013.

[33] Y. Li, C. Qiu, J. Tu, B. Geng, J. Yang, T. Jiang, and Q. Cui, “HMDD v2.0: a database for experimentally supported human microRNA and disease associations,” Nucleic Acids Research, vol. 42, no. D1, pp. D1070–D1074, 2013.

[34] J. Piñero, N. Queralt-Rosinach, A. Bravo, J. Deu-Pons, A. Bauer-Mehren, M. Baron, F. Sanz, and L. I. Furlong, “DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes,” Database, vol. 2015, 2015.

[35] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” Journal of Machine Learning Research, vol. 13, no. 1, pp. 281–305, 2012.

[36] H. Wang, N. Wang, and D. Y. Yeung, “Collaborative deep learning for recommender systems,” 2014.

[37] P. A. Meyers and R. Gorlick, “Osteosarcoma,” Pediatric Clinics of North America, vol. 44, no. 4, pp. 973–989, 1997.

[38] B. A. Lindsey, J. E. Markel, and E. S. Kleinerman, “Osteosarcoma overview,” Rheumatology and Therapy, vol. 4, no. 1, pp. 25–43, 2017.

[39] M. S. Isakoff, S. S. Bielack, P. Meltzer, and R. Gorlick, “Osteosarcoma: current treatment and a collaborative pathway to success,” Journal of Clinical Oncology, vol. 33, no. 27, p. 3029, 2015.

[40] T. He, D. Xu, T. Sui, J. Zhu, Z. Wei, and Y. Wang, “Association between H19 polymorphisms and osteosarcoma risk,” Eur Rev Med Pharmacol Sci, vol. 21, no. 17, pp. 3775–3780, 2017.

[41] Q. Zhou, F. Chen, J. Zhao, B. Li, Y. Liang, W. Pan, S. Zhang, X. Wang, and D. Zheng, “Long non-coding RNA PVT1 promotes osteosarcoma development by acting as a molecular sponge to regulate miR-195,” Oncotarget, vol. 7, no. 50, p. 82620, 2016.

[42] K. Ye, S. Wang, H. Zhang, H. Han, B. Ma, and W. Nan, “Long noncoding RNA GAS5 suppresses cell growth and epithelial–mesenchymal transition in osteosarcoma by regulating the miR-221/ARHI pathway,” Journal of Cellular Biochemistry, vol. 118, no. 12, pp. 4772–4781, 2017.

[43] Y. Li and C. Cheng, “Long noncoding RNA NEAT1 promotes the metastasis of osteosarcoma via interaction with the G9a-DNMT1-Snail complex,” American Journal of Cancer Research, vol. 8, no. 1, p. 81, 2018.


[44] C. Zhang, S. Du, and L. Cao, “Long non-coding RNA KCNQ1OT1 promotes osteosarcoma progression by increasing β-catenin activity,” RSC Advances, vol. 8, no. 66, pp. 37581–37589, 2018.

[45] R. Li, S. Liu, Y. Li, Q. Tang, Y. Xie, and R. Zhai, “Long noncoding RNA AFAP1-AS1 enhances cell proliferation and invasion in osteosarcoma through regulating miR-4695-5p/TCF4-β-catenin signaling,” Molecular Medicine Reports, vol. 18, no. 2, pp. 1616–1622, 2018.

[46] G.-Y. Lv, J. Miao, and X.-L. Zhang, “Long noncoding RNA XIST promotes osteosarcoma progression by targeting Ras-related protein RAP2B via miR-320b,” Oncology Research Featuring Preclinical and Clinical Cancer Therapeutics, vol. 26, no. 6, pp. 837–846, 2018.

[47] J. Zhao and L. Cheng, “Long non-coding RNA CCAT1/miR-148a axis promotes osteosarcoma proliferation and migration through regulating PIK3IP1,” Acta Biochimica et Biophysica Sinica, vol. 49, no. 6, pp. 503–512, 2017.

[48] J. Xu, R. Ding, and Y. Xu, “Effects of long non-coding RNA SPRY4-IT1 on osteosarcoma cell biological behavior,” American Journal of Translational Research, vol. 8, no. 12, p. 5330, 2016.

[49] C. P. Ponting, P. L. Oliver, and W. Reik, “Evolution and functions of long noncoding RNAs,” Cell, vol. 136, no. 4, pp. 629–641, 2009.

[50] R. Zheng, M. Li, X. Chen, S. Zhao, F. Wu, Y. Pan, and J. Wang, “An ensemble method to reconstruct gene regulatory networks based on multivariate adaptive regression splines,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2019.

[51] A. Necsulea, M. Soumillon, M. Warnefors, A. Liechti, T. Daish, U. Zeller, J. C. Baker, F. Grützner, and H. Kaessmann, “The evolution of lncRNA repertoires and expression patterns in tetrapods,” Nature, vol. 505, no. 7485, p. 635, 2014.

[52] W. Peng, W. Lan, Z. Yu, J. Wang, and Y. Pan, “A framework for integrating multiple biological networks to predict microRNA-disease associations,” IEEE Transactions on Nanobioscience, vol. 16, no. 2, pp. 100–107, 2016.

[53] W. Lan, Q. Chen, T. Li, C. Yuan, S. Mann, and B. Chen, “Identification of important positions within miRNAs by integrating sequential and structural features,” Current Protein and Peptide Science, vol. 15, no. 6, pp. 591–597, 2014.

[54] W. Peng, W. Lan, J. Zhong, J. Wang, and Y. Pan, “A novel method of predicting microRNA-disease associations based on microRNA, disease, gene and environment factor networks,” Methods, vol. 124, pp. 69–77, 2017.

[55] W. Lan, J. Wang, M. Li, J. Liu, Y. Li, F.-X. Wu, and Y. Pan, “Predicting drug–target interaction using positive-unlabeled learning,” Neurocomputing, vol. 206, pp. 50–57, 2016.

[56] J. Liu, M. Li, W. Lan, F.-X. Wu, Y. Pan, and J. Wang, “Classification of Alzheimer’s disease using whole brain hierarchical network,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 15, no. 2, pp. 624–632, 2016.


Finding potential lncRNA–disease associations using a boosting-based ensemble learning model

Article

Full-text available

Mar 2024

Introduction: Long non-coding RNAs (lncRNAs) have been in the clinical use as potential prognostic biomarkers of various types of cancer. Identifying associations between lncRNAs and diseases helps capture the potential biomarkers and design efficient therapeutic options for diseases. Wet experiments for identifying these associations are costly and laborious. Methods: We developed LDA-SABC, a novel boosting-based framework for lncRNA–disease association (LDA) prediction. LDA-SABC extracts LDA features based on singular value decomposition (SVD) and classifies lncRNA–disease pairs (LDPs) by incorporating LightGBM and AdaBoost into the convolutional neural network. Results: The LDA-SABC performance was evaluated under five-fold cross validations (CVs) on lncRNAs, diseases, and LDPs. It obviously outperformed four other classical LDA inference methods (SDLDA, LDNFSGB, LDASR, and IPCAF) through precision, recall, accuracy, F1 score, AUC, and AUPR. Based on the accurate LDA prediction performance of LDA-SABC, we used it to find potential lncRNA biomarkers for lung cancer. The results elucidated that 7SK and HULC could have a relationship with non-small-cell lung cancer (NSCLC) and lung adenocarcinoma (LUAD), respectively. Conclusion: We hope that our proposed LDA-SABC method can help improve the LDA identification.

DeepKEGG: a multi-omics data integration framework with biological insights for cancer recurrence prediction and biomarker discovery

Article

Apr 2024

Deep learning-based multi-omics data integration methods have the capability to reveal the mechanisms of cancer development, discover cancer biomarkers and identify pathogenic targets. However, current methods ignore the potential correlations between samples in integrating multi-omics data. In addition, providing accurate biological explanations still poses significant challenges due to the complexity of deep learning models. Therefore, there is an urgent need for a deep learning-based multi-omics integration method to explore the potential correlations between samples and provide model interpretability. Herein, we propose a novel interpretable multi-omics data integration method (DeepKEGG) for cancer recurrence prediction and biomarker discovery. In DeepKEGG, a biological hierarchical module is designed for local connections of neuron nodes and model interpretability based on the biological relationship between genes/miRNAs and pathways. In addition, a pathway self-attention module is constructed to explore the correlation between different samples and generate the potential pathway feature representation for enhancing the prediction performance of the model. Lastly, an attribution-based feature importance calculation method is utilized to discover biomarkers related to cancer recurrence and provide a biological interpretation of the model. Experimental results demonstrate that DeepKEGG outperforms other state-of-the-art methods in 5-fold cross validation. Furthermore, case studies also indicate that DeepKEGG serves as an effective tool for biomarker discovery. The code is available at https://github.com/lanbiolab/DeepKEGG.

AMPFLDAP: Adaptive Message Passing and Feature Fusion on Heterogeneous Network for LncRNA-Disease Associations Prediction

Article

Apr 2024
Interdiscipl Sci Comput Life Sci

Exploration of the intricate connections between long noncoding RNA (lncRNA) and diseases, referred to as lncRNA-disease associations (LDAs), plays a pivotal and indispensable role in unraveling the underlying molecular mechanisms of diseases and devising practical treatment approaches. It is imperative to employ computational methods for predicting lncRNA-disease associations to circumvent the need for superfluous experimental endeavors. Graph-based learning models have gained substantial popularity in predicting these associations, primarily because of their capacity to leverage node attributes and relationships within the network. Nevertheless, there remains much room for enhancing the performance of these techniques by incorporating and harmonizing the node attributes more effectively. In this context, we introduce a novel model, i.e., Adaptive Message Passing and Feature Fusion (AMPFLDAP), for forecasting lncRNA-disease associations within a heterogeneous network. Firstly, we constructed a heterogeneous network involving lncRNA, microRNA (miRNA), and diseases based on established associations and employing Gaussian interaction profile kernel similarity as a measure. Then, an adaptive topological message passing mechanism is suggested to address the information aggregation for heterogeneous networks. The topological features of nodes in the heterogeneous network were extracted based on the adaptive topological message passing mechanism. Moreover, an attention mechanism is applied to integrate both topological and semantic information to achieve the multimodal features of biomolecules, which are further used to predict potential LDAs. The experimental results demonstrated that the performance of the proposed AMPFLDAP is superior to seven state-of-the-art methods. 
Furthermore, to validate its efficacy in practical scenarios, we conducted detailed case studies involving three distinct diseases, which conclusively demonstrated AMPFLDAP’s effectiveness in the prediction of LDAs.

A comprehensive survey on deep learning-based identification and predicting the interaction mechanism of long non-coding RNAs

Article

Apr 2024

Long noncoding RNAs (lncRNAs) have been discovered to be extensively involved in eukaryotic epigenetic, transcriptional, and post-transcriptional regulatory processes with the advancements in sequencing technology and genomics research. Therefore, they play crucial roles in the body’s normal physiology and various disease outcomes. Presently, numerous unknown lncRNA sequencing data require exploration. Establishing deep learning-based prediction models for lncRNAs provides valuable insights for researchers, substantially reducing time and costs associated with trial and error and facilitating the disease-relevant lncRNA identification for prognosis analysis and targeted drug development as the era of artificial intelligence progresses. However, most lncRNA-related researchers lack awareness of the latest advancements in deep learning models and model selection and application in functional research on lncRNAs. Thus, we elucidate the concept of deep learning models, explore several prevalent deep learning algorithms and their data preferences, conduct a comprehensive review of recent literature studies with exemplary predictive performance over the past 5 years in conjunction with diverse prediction functions, critically analyze and discuss the merits and limitations of current deep learning models and solutions, while also proposing prospects based on cutting-edge advancements in lncRNA research.

Multilayer grid XG Boost architecture based automatic osteosarcoma classification

Article

Apr 2024
BIOMED SIGNAL PROCES

NGCN : Drug‐target interaction prediction by integrating information and feature learning from heterogeneous network

Article

Mar 2024

Drug‐target interaction (DTI) prediction is essential for new drug design and development. Constructing heterogeneous network based on diverse information about drugs, proteins and diseases provides new opportunities for DTI prediction. However, the inherent complexity, high dimensionality and noise of such a network prevent us from taking full advantage of these network characteristics. This article proposes a novel method, NGCN, to predict drug‐target interactions from an integrated heterogeneous network, from which to extract relevant biological properties and association information while maintaining the topology information. It focuses on learning the topology representation of drugs and targets to improve the performance of DTI prediction. Unlike traditional methods, it focuses on learning the low‐dimensional topology representation of drugs and targets via graph‐based convolutional neural network. NGCN achieves substantial performance improvements over other state‐of‐the‐art methods, such as a nearly 1.0% increase in AUPR value. Moreover, we verify the robustness of NGCN through benchmark tests, and the experimental results demonstrate it is an extensible framework capable of combining heterogeneous information for DTI prediction.

Prediction of lncRNA and disease associations based on residual graph convolutional networks with attention mechanism

Article

Full-text available

Mar 2024

LncRNAs are non-coding RNAs with a length of more than 200 nucleotides. More and more evidence shows that lncRNAs are inextricably linked with diseases. To make up for the shortcomings of traditional methods, researchers began to collect relevant biological data in the database and used bioinformatics prediction tools to predict the associations between lncRNAs and diseases, which greatly improved the efficiency of the study. To improve the prediction accuracy of current methods, we propose a new lncRNA-disease associations prediction method with attention mechanism, called ResGCN-A. Firstly, we integrated lncRNA functional similarity, lncRNA Gaussian interaction profile kernel similarity, disease semantic similarity, and disease Gaussian interaction profile kernel similarity to obtain lncRNA comprehensive similarity and disease comprehensive similarity. Secondly, the residual graph convolutional network was used to extract the local features of lncRNAs and diseases. Thirdly, the new attention mechanism was used to assign the weight of the above features to further obtain the potential features of lncRNAs and diseases. Finally, the training set required by the Extra-Trees classifier was obtained by concatenating potential features, and the potential associations between lncRNAs and diseases were obtained by the trained Extra-Trees classifier. ResGCN-A combines the residual graph convolutional network with the attention mechanism to realize the local and global features fusion of lncRNA and diseases, which is beneficial to obtain more accurate features and improve the prediction accuracy. In the experiment, ResGCN-A was compared with five other methods through 5-fold cross-validation. The results show that the AUC value and AUPR value obtained by ResGCN-A are 0.9916 and 0.9951, which are superior to the other five methods. In addition, case studies and robustness evaluation have shown that ResGCN-A is an effective method for predicting lncRNA-disease associations. 
The source code for ResGCN-A will be available at https://github.com/Wangxiuxiun/ResGCN-A.

MGDHGS: Gene-bridged metabolite-disease relationships prediction via GraphSAGE and self-attention mechanism

Article

Feb 2024

idenLD-AREL: identifying lncRNA-disease associations by random forests based on an ensemble learning framework

Conference Paper

Dec 2023

PTDA-SWGCL: Predicting tRNA-Disease Associations using Supplementarily Weighted Graph Contrastive Learning

Conference Paper

Dec 2023

A Survey of Matrix Completion Methods for Recommendation Systems

Article

Full-text available

Dec 2018

In recent years, the recommendation systems have become increasingly popular and have been used in a broad variety of applications. Here, we investigate the matrix completion techniques for the recommendation systems that are based on collaborative filtering. The collaborative filtering problem can be viewed as predicting the favorability of a user with respect to new items of commodities. When a rating matrix is constructed with users as rows, items as columns, and entries as ratings, the collaborative filtering problem can then be modeled as a matrix completion problem by filling out the unknown elements in the rating matrix. This article presents a comprehensive survey of the matrix completion methods used in recommendation systems. We focus on the mathematical models for matrix completion and the corresponding computational algorithms as well as their characteristics and potential issues. Several applications other than the traditional user-item association prediction are also discussed.

Long non-coding RNA KCNQ1OT1 promotes osteosarcoma progression by increasing β-catenin activity

Article

Full-text available

Nov 2018

Objective: Long non-coding RNA KCNQ1OT1 has been associated with the development of different types of cancers. The present research investigated the role of KCNQ1OT1 in osteosarcoma. Methods: Expression level of KCNQ1OT1 in osteosarcoma and paired non-cancerous tissue specimens from 56 osteosarcoma patients and its association with patients' clinicopathological features was investigated. KCNQ1OT1 overexpression and knockdown in primary-cultured osteosarcoma cells was constructed by lentiviral transduction. Influence of KCNQ1OT1 overexpression or knockdown on osteosarcoma cell growth, apoptosis, migration, invasion, epithelial-to-mesenchymal transition and beta-catenin activation was investigated. Results: Expression of KCNQ1OT1 in osteosarcoma tissue specimens was significantly increased in comparison to that in adjacent counterparts. High expression of KCNQ1OT1 significantly associated with osteosarcoma progression and patients' decreased survival. Overexpression of KCNQ1OT1 significantly increased osteosarcoma cell growth, proliferation, migration, invasion, epithelial-to-mesenchymal transition and beta-catenin activation while reducing cell apoptosis in vitro, and KCNQ1OT1 knockdown showed opposite effects. Inhibition of beta-catenin/TCF activity by ICG-001 treatment significantly attenuated the promoting effect of KCNQ1OT1 overexpression on osteosarcoma cell malignancy described above. Conclusion: KCNQ1OT1 might be a potential prognostic factor in osteosarcoma. High expression of KCNQ1OT1 might promote osteosarcoma development by increasing the activation of WNT/beta-catenin signaling pathway.

Long noncoding RNA AFAP1‑AS1 enhances cell proliferation and invasion in osteosarcoma through regulating miR‑4695‑5p/TCF4‑β‑catenin signaling

Article

Full-text available

Jun 2018

Long noncoding RNA AFAP1‑AS1 has been shown to promote tumor progression in several human cancer types, such as thyroid cancer, tongue squamous cell carcinoma and lung cancer. However, the role of AFAP1‑AS1 in osteosarcoma (OS) has not been investigated. In the present study, the expression of AFAP1‑AS1 was significantly upregulated in OS tissues and cell lines. Moreover, AFAP1‑AS1 expression was negatively correlated with OS patient prognosis. Besides, AFAP1‑AS1 knockdown significantly inhibited the proliferation and invasion of OS cells in vitro. Furthermore, in vivo xenograft experiments indicated that AFAP1‑AS1 depletion delayed tumor growth. Regarding the underlying mechanism, AFAP1‑AS1 served as a sponge to repress the level of microRNA (miR)‑4695‑5p, which targeted transcription factor (TCF)4, a pivot effector of Wnt/β‑catenin signaling pathway. It was demonstrated that overexpression of AFAP1‑AS1 inhibited the expression of miR‑4695‑5p, while miR‑4695‑5p overexpression decreased TCF4 expression and reduced activation of Wnt/β‑catenin pathway. Through rescue assays, it was demonstrated that restoration of TCF4 expression reversed the effects of AFAP1‑AS1 knockdown or miR‑4695‑5p overexpression on OS cells. Taken together, these findings demonstrated that the AFAP1‑AS1/miR‑4695‑5p/TCF4‑β‑catenin axis played an important role in OS progression.

Prediction of lncRNA-disease associations based on inductive matrix completion

Article

Full-text available

Apr 2018
BIOINFORMATICS

Motivation: Accumulating evidences indicate that long non-coding RNAs (lncRNAs) play pivotal roles in various biological processes. Mutations and dysregulations of lncRNAs are implicated in miscellaneous human diseases. Predicting lncRNA-disease associations is beneficial to disease diagnosis as well as treatment. Although many computational methods have been developed, precisely identifying lncRNA-disease associations, especially for novel lncRNAs, remains challenging. Results: In this study, we propose a method (named SIMCLDA) for predicting potential lncRNA-disease associations based on inductive matrix completion. We compute Gaussian interaction profile kernel of lncRNAs from known lncRNA-disease interactions and functional similarity of diseases based on disease-gene and gene-gene onotology associations. Then, we extract primary feature vectors from Gaussian interaction profile kernel of lncRNAs and functional similarity of diseases by principal component analysis, respectively. For a new lncRNA, we calculate the interaction profile according to the interaction profiles of its neighbors. At last, we complete the association matrix based on the inductive matrix completion framework using the primary feature vectors from the constructed feature matrices. Computational results show that SIMCLDA can effectively predict lncRNA-disease associations with higher accuracy compared with previous methods. Furthermore, case studies show that SIMCLDA can effectively predict candidate lncRNAs for renal cancer, gastric cancer and prostate cancer. Availability: https://github.com//bioinfomaticsCSU/SIMCLDA. Contact: jxwang@mail.csu.edu.cn. Supplementary information: Supplementary data are available at Bioinformatics online.

Identifying Interactions Between Long Noncoding RNAs and Diseases Based on Computational Methods

Chapter

Full-text available

Mar 2018

With the development and improvement of next-generation sequencing technology, a great number of noncoding RNAs have been discovered. Long noncoding RNAs (lncRNAs) are the biggest kind of noncoding RNAs with more than 200 nt nucleotides in length. There are increasing evidences showing that lncRNAs play key roles in many biological processes. Therefore, the mutation and dysregulation of lncRNAs have close association with a number of complex human diseases. Identifying the most likely interaction between lncRNAs and diseases becomes a fundamental challenge in human health. A common view is that lncRNAs with similar function tend to be related to phenotypic similar diseases. In this chapter, we firstly introduce the concept of lncRNA, their biological features, and available data resources. Further, the recent computational approaches are explored to identify interactions between long noncoding RNAs and diseases, including their advantages and disadvantages. The key issues and potential future works of predicting interactions between long noncoding RNAs and diseases are also discussed.

Inferring LncRNA-disease Associations Based On Graph Autoencoder Matrix Completion

Article

May 2020
COMPUT BIOL CHEM

Accumulating studies have indicated that long non-coding RNAs (lncRNAs) play crucial roles in a large number of biological processes. Predicting lncRNA-disease associations can help biologists understand the molecular mechanisms of human disease and can benefit disease diagnosis, treatment and prevention. In this paper, we introduce a computational framework based on graph autoencoder matrix completion (GAMCLDA) to identify lncRNA-disease associations. In our method, a graph convolutional network is used to encode the local graph structure and node features in order to learn latent factor vectors of lncRNAs and diseases. Further, the inner product of the lncRNA factor vector and the disease factor vector is used as the decoder to reconstruct the lncRNA-disease association matrix. In addition, a cost-sensitive neural network is used to deal with the imbalance between positive and negative samples. The experimental results show that GAMCLDA outperforms other state-of-the-art methods in prediction performance, as evaluated by AUC value, AUPR value, PPV and F1-score. Moreover, the case study shows that our method is an effective tool for predicting potential lncRNA-disease associations.
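
The inner-product decoder described in this abstract can be illustrated with a small sketch. It covers only the decoding step; the graph convolutional encoder and the cost-sensitive loss are omitted. The function name `decode_associations` and the latent vectors are hypothetical, and any output activation (such as a sigmoid) that the actual model may apply is left out because the abstract does not specify one.

```python
def decode_associations(lnc_factors, disease_factors):
    """Reconstruct an lncRNA x disease score matrix.

    Each score is the inner product of one lncRNA latent vector
    and one disease latent vector (assumed to share a dimension).
    """
    return [
        [sum(a * b for a, b in zip(lnc_vec, dis_vec)) for dis_vec in disease_factors]
        for lnc_vec in lnc_factors
    ]


# Illustrative latent factors: 2 lncRNAs and 2 diseases, dimension 2.
lnc_factors = [[1.0, 2.0], [0.0, 1.0]]
disease_factors = [[0.5, 0.5], [1.0, -1.0]]

scores = decode_associations(lnc_factors, disease_factors)
```

A higher score for a given pair would be read as a stronger predicted association between that lncRNA and that disease.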

ILDMSF: Inferring Associations between Long non-coding RNA and Disease Based on Multi-similarity Fusion

Article

Aug 2019

The dysregulation and mutation of long non-coding RNAs (lncRNAs) have been shown to result in a variety of human diseases. Identifying potential disease-related lncRNAs may benefit disease diagnosis, treatment and prognosis. A number of methods have been proposed to predict potential lncRNA-disease relationships. However, most of them may give incorrect results because they rely on a single similarity measure. This article proposes a novel framework (ILDMSF) that fuses lncRNA similarities, measured from lncRNA-related genes and known lncRNA-disease interactions, with disease similarities, measured from disease semantic information and known lncRNA-disease interactions. Further, a support vector machine is employed to identify potential lncRNA-disease associations based on the integrated similarity. Leave-one-out cross validation is performed to compare ILDMSF with other state-of-the-art methods. The experimental results demonstrate that our method is promising for exploring potential associations between lncRNAs and diseases.

Adaptive Deep Modeling of Users and Items Using Side Information for Recommendation

Article

Jun 2019

In existing recommender systems, matrix factorization (MF) is widely applied to model user preferences and item features by mapping the user-item ratings into a low-dimensional latent vector space. However, MF ignores individual diversity: a user's preference usually differs across different unrated items. A fixed representation of the user preference factor extracted by MF cannot model this diversity well, which leads to repeated and inaccurate recommendations. To this end, we propose a novel latent factor model called the adaptive deep latent factor model (ADLFM), which learns the preference factor of users adaptively in accordance with the specific items under consideration. We propose a novel user representation method that is derived from the descriptions of the items users have rated instead of the original user-item ratings. Based on this, we further propose a deep neural network framework with an attention factor to learn the adaptive representations of users. Extensive experiments on Amazon data sets demonstrate that ADLFM substantially outperforms state-of-the-art baselines. Further experiments show that the attention factor contributes greatly to our method.

An Ensemble Method to Reconstruct Gene Regulatory Networks Based on Multivariate Adaptive Regression Splines

Article

Feb 2019

Gene regulatory networks (GRNs) play a key role in biological processes. However, GRNs differ under different biological conditions. Reconstructing GRNs from gene expression data has been an important opportunity and challenge over the past decades. Although there are many existing methods to infer the topology of GRNs, such as mutual information, random forest and partial least squares, the accuracy is still low due to the noise and high dimensionality of the expression data. In this paper, we introduce an ensemble method based on Multivariate Adaptive Regression Splines (MARS), called PBMarsNet, to reconstruct directed GRNs from multifactorial gene expression data. PBMarsNet incorporates part mutual information (PMI) to pre-weight the candidate regulatory genes and then uses MARS to detect the nonlinear regulatory links. Moreover, we apply bootstrapping to run MARS multiple times and average the outputs of the runs as the final score of each regulatory link. The results on the DREAM4 and DREAM5 challenge datasets show that PBMarsNet has superior performance and generalization compared with other state-of-the-art methods.

Association between H19 polymorphisms and osteosarcoma risk

Article

Oct 2017

Objective: The long non-coding RNA (lncRNA) H19, a maternally expressed imprinted gene, is involved in cancer susceptibility and disease progression. However, the association between H19 polymorphisms and osteosarcoma susceptibility has remained elusive. We designed this case-control study to explore the association between H19 polymorphisms and osteosarcoma risk. Patients and methods: We genotyped 4 tagger SNPs of the H19 gene in a case-control study including 193 osteosarcoma cases and 393 cancer-free controls. Results: For the main effect analysis, rs217727 (G>A) was associated with osteosarcoma risk (GA/GG: adjusted OR = 1.51, 95% CI: 1.06-2.17, p = 0.024; AA/GG: adjusted OR = 1.89, 95% CI: 1.23-2.91, p = 0.004; additive model: adjusted OR = 1.35, 95% CI: 1.01-1.80, p = 0.043). Conclusions: This finding indicates that the rs217727 polymorphism may play a role in genetic susceptibility to osteosarcoma, which may improve our understanding of the potential contribution of H19 SNPs to cancer pathogenesis.
