LDICDL: LncRNA-disease association identification based on Collaborative Deep Learning

1545-5963 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2020.3034910, IEEE/ACM
Transactions on Computational Biology and Bioinformatics
Wei Lan, Dehuan Lai, Qingfeng Chen, Ximin Wu, Baoshan Chen, Jin Liu, Jianxin Wang, Yi-Ping
Phoebe Chen
Abstract—Long noncoding RNAs (lncRNAs) have been shown to play critical roles in many human diseases. Inferring associations between lncRNAs and diseases can therefore contribute to disease diagnosis, prognosis and treatment. Because traditional experimental methods are expensive and time-consuming, several computational methods have been proposed to predict lncRNA-disease associations by fusing different biological data. However, the prediction performance of these methods still needs to be improved. In this study, we propose a computational model, named LDICDL, to identify lncRNA-disease associations based on collaborative deep learning. It uses an autoencoder to denoise multiple sources of lncRNA feature information and multiple sources of disease feature information, respectively. A matrix factorization algorithm is then employed to predict potential lncRNA-disease associations. In addition, to overcome the limitation of matrix factorization, a hybrid model is developed to predict associations between a new lncRNA (or disease) and diseases (or lncRNAs). Ten-fold cross validation and a de novo test are applied to evaluate the performance of the method. The experimental results show that LDICDL outperforms other state-of-the-art methods in prediction performance.
Index Terms—lncRNA-disease associations, matrix factorization, stacked denoising autoencoder.
1 INTRODUCTION
It is well known that biological genetic information is primarily stored in protein-coding genes, and that RNA is the intermediary between DNA sequences and proteins [1]. With the development of human genetic engineering, about 2% of the genes have been confirmed to be protein-coding genes, while the remaining 98% have little or no protein-coding ability [2]. These genes are usually transcribed into non-coding RNAs [3]. Non-coding RNAs were long regarded as the noise of genomic transcription [4], [5]. However, recent studies have shown that they play important regulatory roles in many biological processes of organisms. In particular, long non-coding RNAs (lncRNAs), which are longer than 200 nucleotides, have been found to be related to a broad range of diseases [6]. For example, HOTAIR is overexpressed in breast cancer, colon cancer, liver cancer and gastrointestinal stromal tumors [7]. Therefore, identifying lncRNA-disease associations helps biologists not only to understand the underlying mechanisms of disease, but also to improve disease prevention, diagnosis and treatment [8], [9].

Wei Lan is with the School of Computer, Electronic and Information, Guangxi University, Nanning, Guangxi, 530004, China. E-mail: lanwei@gxu.edu.cn
Dehuan Lai is with the School of Computer, Electronic and Information, Guangxi University, Nanning, Guangxi, 530004, China. E-mail: laidehuan@st.gxu.edu.cn
Qingfeng Chen is with the School of Computer, Electronic and Information and the State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, Guangxi University, Nanning, Guangxi, 530004, China. E-mail: qingfeng@gxu.edu.cn
Ximin Wu is with the School of Computer, Electronic and Information, Guangxi University, Nanning, Guangxi, 530004, China. E-mail: wuximin@st.gxu.edu.cn
Baoshan Chen is with the State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, Guangxi University, Nanning, Guangxi, 530004, China. E-mail: chenyaoj@gxu.edu.cn
Jin Liu is with the Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan, 410083, China. E-mail: liujin06@mail.csu.edu.cn
Jianxin Wang is with the Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan, 410083, China. E-mail: jxwang@mail.csu.edu.cn
Yi-Ping Phoebe Chen is with the Department of Computer Science and Information Technology, La Trobe University, Melbourne, Victoria 3086, Australia. E-mail: phoebe.chen@latrobe.edu.au
Manuscript received April 19, 2005; revised August 26, 2015.
Many biological experiments have been carried out to discover potential lncRNA-disease associations [10]. Although these methods can identify lncRNA-disease associations accurately, they are time-consuming and expensive. With the development of high-throughput sequencing technology, a large amount of lncRNA-related data, such as sequence, structure, function and expression data, has been generated [11], [12]. Thus, many computational algorithms have been proposed to overcome these limitations and predict potential lncRNA-disease associations [13]. These computational methods can be classified into two categories.
(1) Network-based methods use similarity networks to predict lncRNA-disease associations. For example, Sun et al. [14] proposed a computational method, RWRlncD, to identify lncRNA-disease associations based on the lncRNA functional similarity network and the random walk with restart method. Chen et al. [15] presented an algorithm, IRWRLDA, to predict lncRNA-disease associations in terms of the lncRNA similarity network. They used various measures to calculate lncRNA similarity, and IRWRLDA can be applied to diseases without any known lncRNA-disease association. Zhou et al. [16] developed a model, RWRHLD, for lncRNA-disease association prediction by integrating three networks into one heterogeneous network. By constructing a multi-level lncRNA-disease network, Yao et al. [17] proposed an algorithm, LncPriCNet, to prioritize candidate lncRNA-disease associations.
(2) Machine learning-based methods prioritize candidate lncRNAs by training on known disease-related lncRNAs and unknown lncRNAs. Lan et al. [18] developed an online web server (LDAP) to identify new associations between lncRNAs and diseases based on positive-unlabeled (PU) learning. Chen et al. [19] proposed a method (LRLSLDA) to infer lncRNA-disease associations based on semi-supervised learning. Wu et al. [20] presented a computational method (GAMCLDA) to predict lncRNA-disease associations based on graph autoencoder matrix completion. Fu et al. [21] developed a computational model (MFLDA) to predict associations between lncRNAs and diseases based on multiple data fusion and matrix factorization (MF). Lu et al. [22] presented a computational model (SIMCLDA) to prioritize candidate lncRNAs based on inductive matrix completion. Chen et al. [23] proposed a computational framework, ILDMSF, for lncRNA-disease association identification based on multiple kernel fusion and Support Vector Machine (SVM).
These methods have achieved good performance in predicting associations between lncRNAs and diseases. However, they do not make full use of the known lncRNA and disease feature data, which limits their prediction accuracy [24], [25]. This paper proposes a novel computational framework (LDICDL) to predict lncRNA-disease associations. It uses an autoencoder to denoise multiple sources of lncRNA feature data and multiple sources of disease feature data. In addition, a matrix factorization algorithm is employed to predict potential lncRNA-disease associations. Further, a hybrid model based on the stacked denoising autoencoder and matrix factorization is developed to overcome the limitation of matrix factorization in de novo prediction. The experimental results demonstrate that our method performs better than other state-of-the-art methods.
2 METHODS
The task of identifying lncRNA-disease associations can be viewed as learning from implicit feedback, which serves as the training and test data. The lncRNA-disease associations are represented by a matrix $LD_{m \times n}$, where $m$ and $n$ denote the numbers of lncRNAs and diseases, respectively. The element $LD(i,j)$ is equal to 1 if lncRNA $i$ is associated with disease $j$, and 0 otherwise. The lncRNA information is integrated into the lncRNA feature matrix $LF_{m \times t}$, where $t$ denotes the number of lncRNA features. The disease information is merged into the disease feature matrix $DF_{n \times s}$, where $s$ denotes the number of disease features.
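As a small illustration of this setup, the binary association matrix can be assembled from a list of known lncRNA-disease pairs. The sketch below is only a toy example: the function name `build_association_matrix` and the placeholder identifiers are assumptions made for illustration and do not come from the paper's dataset.

```python
def build_association_matrix(pairs, lncrnas, diseases):
    """Return an m x n 0/1 matrix (list of lists) from known (lncRNA, disease) pairs."""
    row = {name: i for i, name in enumerate(lncrnas)}
    col = {name: j for j, name in enumerate(diseases)}
    LD = [[0] * len(diseases) for _ in lncrnas]
    for lnc, dis in pairs:
        LD[row[lnc]][col[dis]] = 1  # LD(i, j) = 1 for a known association
    return LD


# Placeholder names for illustration only; not taken from the paper's data.
lncrnas = ["lnc_a", "lnc_b", "lnc_c"]
diseases = ["dis_x", "dis_y"]
pairs = [("lnc_a", "dis_x"), ("lnc_c", "dis_y")]
LD = build_association_matrix(pairs, lncrnas, diseases)
```

Entries left at 0 are unknown rather than confirmed negatives, which is why the problem is treated as implicit feedback.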
2.1 Stacked Denoising Autoencoder
The stacked denoising autoencoder (SDAE) is a kind of feedforward neural network that is widely used in recommender systems [26]. In LDICDL, the SDAE is employed to extract lncRNA and disease feature information, respectively. The original lncRNA and disease features have $t$ and $s$ dimensions, respectively. Finally, the lncRNA and disease feature information is reduced to $k$ dimensions by the SDAE. The mini-batch gradient descent algorithm is used to train the SDAE with a batch size of 60.
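The idea of a denoising autoencoder trained by mini-batch gradient descent can be sketched in plain Python. This is a deliberately reduced illustration, not the authors' implementation: it uses a single tanh hidden layer with a sigmoid output (Section 3.5 describes three hidden layers), and the masking noise, learning rate, squared-error loss and the name `train_dae` are all assumptions. Only the batch size of 60 is taken from the text.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_dae(X, k, epochs=50, batch_size=60, lr=0.1, noise=0.3, seed=0):
    """Train a one-hidden-layer denoising autoencoder; return (encode, decode)."""
    rng = random.Random(seed)
    t = len(X[0])
    W1 = [[rng.uniform(-0.1, 0.1) for _ in range(t)] for _ in range(k)]  # encoder
    b1 = [0.0] * k
    W2 = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(t)]  # decoder
    b2 = [0.0] * t

    def encode(x):
        return [math.tanh(sum(w * v for w, v in zip(W1[j], x)) + b1[j]) for j in range(k)]

    def decode(h):
        return [sigmoid(sum(w * v for w, v in zip(W2[o], h)) + b2[o]) for o in range(t)]

    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            gW1 = [[0.0] * t for _ in range(k)]
            gb1 = [0.0] * k
            gW2 = [[0.0] * k for _ in range(t)]
            gb2 = [0.0] * t
            for idx in batch:
                x = X[idx]
                # Masking noise: randomly zero a fraction of the inputs.
                x_noisy = [0.0 if rng.random() < noise else v for v in x]
                h = encode(x_noisy)
                y = decode(h)
                # Gradient of 0.5 * ||y - x||^2 with respect to the clean input x.
                d_out = [(y[o] - x[o]) * y[o] * (1.0 - y[o]) for o in range(t)]
                d_hid = [(1.0 - h[j] ** 2) * sum(d_out[o] * W2[o][j] for o in range(t))
                         for j in range(k)]
                for o in range(t):
                    gb2[o] += d_out[o]
                    for j in range(k):
                        gW2[o][j] += d_out[o] * h[j]
                for j in range(k):
                    gb1[j] += d_hid[j]
                    for i in range(t):
                        gW1[j][i] += d_hid[j] * x_noisy[i]
            step = lr / len(batch)  # average the gradient over the mini-batch
            for o in range(t):
                b2[o] -= step * gb2[o]
                for j in range(k):
                    W2[o][j] -= step * gW2[o][j]
            for j in range(k):
                b1[j] -= step * gb1[j]
                for i in range(t):
                    W1[j][i] -= step * gW1[j][i]
    return encode, decode


# Toy binary features for illustration only.
data_rng = random.Random(1)
X = [[float(data_rng.random() < 0.5) for _ in range(8)] for _ in range(120)]
encode, decode = train_dae(X, k=3, epochs=5)
code = encode(X[0])
```

Here `encode` plays the role of the mapping from the original $t$-dimensional features to the $k$-dimensional encoding; stacking further layers would follow the same pattern.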
2.2 Matrix Factorization for lncRNA-disease prediction
In the lncRNA-disease association matrix $LD$, the element $LD(i,j)$ is defined as follows:
$$LD(i,j) = \begin{cases} 1, & \text{if lncRNA } i \text{ is related to disease } j \\ 0, & \text{if lncRNA } i \text{ is not related to disease } j \end{cases} \tag{1}$$
Therefore, the loss function of matrix factorization for lncRNA-disease association prediction is defined as follows:
$$Loss = \sum_{i,j} \alpha_{i,j} \left( LD(i,j) - L(i,:) \cdot D(j,:)^{T} \right)^{2} + \gamma \left( \sum_{i} \| L(i,:) \|_{2} + \sum_{j} \| D(j,:) \|_{2} \right) \tag{2}$$
where $\gamma$ denotes the regularization parameter, and $L(i,:)$ and $D(j,:)$ denote the subspace features of lncRNA $i$ and disease $j$, respectively. $\alpha_{i,j}$ is a parameter expressing the confidence between lncRNA $i$ and disease $j$, with $\alpha_{i,j} = 1 + \theta \, LD(i,j)$. $\| \cdot \|_{2}$ denotes the 2-norm.
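To make the role of the confidence weight concrete, the loss can be evaluated on a toy example. The sketch below is an illustration under one stated assumption: the regularizer is taken as the squared 2-norm, which is the form implied by the closed-form updates given later in Eqs. 4 and 5. The function name `mf_loss` and the toy values are hypothetical.

```python
def mf_loss(LD, L, D, gamma, theta):
    """Confidence-weighted squared error plus a squared L2 penalty (assumption)."""
    loss = 0.0
    for i, l_row in enumerate(L):
        for j, d_row in enumerate(D):
            alpha = 1.0 + theta * LD[i][j]  # confidence weight alpha_{i,j}
            pred = sum(a * b for a, b in zip(l_row, d_row))  # L(i,:) . D(j,:)^T
            loss += alpha * (LD[i][j] - pred) ** 2
    reg = sum(v * v for row in L for v in row) + sum(v * v for row in D for v in row)
    return loss + gamma * reg


# One lncRNA, two diseases, one latent dimension (toy values).
LD = [[1, 0]]
L = [[1.0]]
D = [[0.5], [1.0]]
value = mf_loss(LD, L, D, gamma=2.0, theta=3.0)
```

With $\theta = 3$, the error on the known association is weighted four times as heavily as the error on the unknown pair, so a larger $\theta$ pushes the factorization to fit observed associations first.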
2.3 Matrix Factorization with Implicit Feedback for
LncRNA-disease association prediction
Matrix factorization performs poorly when predicting associations for a new lncRNA or disease, which is known as the cold start problem [13], [27]. A hybrid model is therefore proposed that predicts lncRNA-disease associations by combining matrix factorization with the stacked denoising autoencoder. In our method, predicting associations for a new lncRNA or disease relies on the biological features of the lncRNA and the disease. The structure of the hybrid model for lncRNA, with three hidden layers, is shown in Figure 1. $X_{input\_l}$ is the input layer for the lncRNA features (i.e., $LF$), $X_{encode\_l}$ is the encoding of the lncRNA features, and $X_{out\_l}$ is the output layer of the lncRNA features.
[Figure 1 labels: SDAE: X_input_l, X_input_noise_l, X_1_l, X_encode_l(i,:), X_3_l, X_output_l; MF: D(j,:), M(i,j) = X_encode_l(i,:) · D(j,:)]
Fig. 1. The overview of hybrid model of lncRNA.
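The loss function of the hybrid model based on lncRNA features is defined as follows:
$$Loss = \sum_{i,j} \alpha_{i,j} \left( LD(i,j) - L(i,:) \cdot D(j,:)^{T} \right)^{2} + \gamma \left( \sum_{i} \| L(i,:) \|_{2} + \sum_{j} \| D(j,:) \|_{2} \right) + \gamma_{l} \| L - X_{encode\_l} \|_{2} + \gamma_{n} \| X_{input\_l} - X_{out\_l} \|_{2} + \sum_{layers} \gamma_{w} \| W \|_{2} \tag{3}$$
where $\gamma$, $\gamma_{l}$, $\gamma_{n}$ and $\gamma_{w}$ denote regularization parameters, and $W$ denotes the weight matrix.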
The loss function is minimized by block coordinate descent [28]. In the training step, $L(i,:)$ is updated according to Eq. 4:
$$L(i,:) \leftarrow LD(i,:) \, C^{(i)} D \left( \gamma I + D^{T} C^{(i)} D \right)^{-1} \tag{4}$$
where $C^{(i)}$ is a diagonal matrix with $C^{(i)}(j,j) = \alpha_{i,j}$.
In the training step, $D(j,:)$ is updated according to Eq. 5:
$$D(j,:) \leftarrow LD(:,j)^{T} \, \tilde{C}^{(j)} L \left( \gamma I + L^{T} \tilde{C}^{(j)} L \right)^{-1} \tag{5}$$
where $\tilde{C}^{(j)}$ is a diagonal matrix with $\tilde{C}^{(j)}(i,i) = \alpha_{i,j}$.
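The row update of Eq. 4 amounts to solving a small $k \times k$ linear system per lncRNA. The following sketch shows one way to compute it in plain Python under the assumption that $D$ is held fixed; the helper names `solve` and `update_lncrna_row` are hypothetical, and the Gaussian elimination routine stands in for whatever linear algebra library an actual implementation would use.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[r]] for r, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for q in range(c, n + 1):
                M[r][q] -= f * M[c][q]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][q] * x[q] for q in range(r + 1, n))) / M[r][r]
    return x


def update_lncrna_row(ld_row, D, gamma, theta):
    """Update L(i,:) with D fixed, following the form of Eq. 4."""
    n, k = len(D), len(D[0])
    alpha = [1.0 + theta * v for v in ld_row]  # diagonal of C^(i)
    # A = gamma * I + D^T C^(i) D, a symmetric k x k matrix.
    A = [[(gamma if p == q else 0.0)
          + sum(alpha[j] * D[j][p] * D[j][q] for j in range(n))
          for q in range(k)] for p in range(k)]
    # rhs = D^T C^(i) LD(i,:)^T
    rhs = [sum(alpha[j] * D[j][p] * ld_row[j] for j in range(n)) for p in range(k)]
    # Because A is symmetric, solving A x = rhs gives LD(i,:) C^(i) D A^{-1}.
    return solve(A, rhs)


row_1d = update_lncrna_row([1, 0], [[1.0], [2.0]], gamma=1.0, theta=1.0)
row_2d = update_lncrna_row([1, 0], [[1.0, 0.0], [0.0, 1.0]], gamma=1.0, theta=1.0)
```

The update of $D(j,:)$ in Eq. 5 has the same shape with the roles of $L$ and $D$ exchanged and the column $LD(:,j)$ in place of the row $LD(i,:)$.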
For disease, the structure with three hidden layers is shown in Figure 2. $X_{input\_d}$ is the input layer for the disease features (i.e., $DF$), $X_{encode\_d}$ represents the encoding of the disease features, and $X_{out\_d}$ is the output layer of the disease features.
[Figure 2 labels: SDAE: X_input_d, X_input_noise_d, X_1_d, X_encode_d(j,:), X_3_d, X_output_d; MF: L(i,:), M(i,j) = L(i,:) · X_encode_d(j,:)]
Fig. 2. The overview of hybrid model of disease.
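The loss function of the hybrid model based on disease feature information is defined as follows:
$$Loss = \sum_{i,j} \alpha_{i,j} \left( LD(i,j) - L(i,:) \cdot D(j,:)^{T} \right)^{2} + \gamma \left( \sum_{i} \| L(i,:) \|_{2} + \sum_{j} \| D(j,:) \|_{2} \right) + \gamma_{d} \| D - X_{encode\_d} \|_{2} + \gamma_{n} \| X_{input\_d} - X_{out\_d} \|_{2} + \sum_{layers} \gamma_{w} \| W \|_{2} \tag{6}$$
where $\gamma$, $\gamma_{d}$, $\gamma_{n}$ and $\gamma_{w}$ denote regularization parameters, and $W$ denotes the weight matrix.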
The final predicted score matrix $S$ is calculated as follows:
$$S(i,j) = \frac{M_{l}(i,j) + M_{d}(i,j)}{2} \tag{7}$$
$$M_{l}(i,j) = X_{encode\_l}(i,:) \cdot D(j,:)^{T} \tag{8}$$
$$M_{d}(i,j) = L(i,:) \cdot X_{encode\_d}(j,:)^{T} \tag{9}$$
where $S(i,j)$ denotes the score between lncRNA $i$ and disease $j$. $X_{encode\_l}$ and $X_{encode\_d}$ denote the sub-feature matrices of lncRNA and disease, which are obtained by the SDAE from the lncRNA and disease feature information, respectively. $L$ and $D$ denote the sub-feature matrices of lncRNA and disease obtained from matrix factorization.
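Eqs. 7 to 9 can be read as two dot-product predictions that are then averaged. A minimal sketch follows, assuming all four matrices are already available as lists of rows with a shared latent dimension; the function names and toy values are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def final_scores(X_enc_l, D, L, X_enc_d):
    """Average the lncRNA-side and disease-side predictions (Eqs. 7-9)."""
    m, n = len(L), len(D)
    S = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            m_l = dot(X_enc_l[i], D[j])  # Eq. 8: encoded lncRNA features with D
            m_d = dot(L[i], X_enc_d[j])  # Eq. 9: L with encoded disease features
            S[i][j] = (m_l + m_d) / 2.0  # Eq. 7
    return S


# One lncRNA, two diseases, two latent dimensions (toy values).
X_enc_l = [[1.0, 0.0]]
D = [[0.5, 0.5], [1.0, 0.0]]
L = [[0.0, 1.0]]
X_enc_d = [[0.2, 0.4], [0.0, 0.0]]
S = final_scores(X_enc_l, D, L, X_enc_d)
```

Because $M_{l}$ depends only on the encoded lncRNA features and $D$, it remains computable for an lncRNA with no known associations, which is the cold start case the hybrid model targets.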
The whole workflow of LDICDL is shown in Figure 3. In the first step, the lncRNA-disease association matrix is decomposed to obtain the lncRNA feature subspace, and the disease feature information is encoded by the SDAE. Meanwhile, the lncRNA-disease association matrix is decomposed to obtain the disease feature subspace, and the lncRNA feature information is encoded by the SDAE. Then, lncRNA-disease association scores are predicted from the lncRNA feature matrix and the disease encoding matrix, and from the disease feature matrix and the lncRNA encoding matrix, respectively. Finally, the score of each lncRNA-disease association is calculated by averaging the two scores.
Block coordinate descent is used to minimize the loss function. First, $L$ and $D$ are updated by Eq. 4 and Eq. 5, respectively. Then, the parameters of the SDAE are updated using mini-batch gradient descent. The mean squared errors of the output and the encoding are used to adjust the gradient. These steps are repeated $t$ times.
Fig. 3. The workflow of LDICDL.
3 RESULTS
3.1 Datasets
The lncRNA-gene associations are downloaded from LncRNA2Target [29], and the lncRNA-gene function associations are collected from GeneRIF [30]. They are pre-processed using the Open Biomedical Annotator [31]. The lncRNA-miRNA associations and disease-miRNA associations are downloaded from starBase v2.0 [32] and HMDD [33], respectively. The disease-gene associations are downloaded from DisGeNET [34]. Finally, 2697 associations between 240 lncRNAs and 412 diseases are obtained as the gold-standard dataset. In addition, a 6066-dimensional feature vector for each lncRNA and a 10621-dimensional feature vector for each disease are collected from the lncRNA-related and disease-related data, respectively.
3.2 Performance evaluation
Ten-fold cross validation and a de novo test are employed to evaluate the performance of the different methods. In ten-fold cross validation, all known associations between lncRNAs and diseases are randomly divided into ten folds. In each test, one fold is selected as the test samples and the other nine folds are treated as training samples. The known associations in the test samples are removed in turn, and the remaining known associations in the training samples are used to train the model. Then, the prediction algorithm is run to score the test samples and the candidate samples. In the de novo test, for disease $i$, all of its known associations are removed and used as test samples, while all known associations between lncRNAs and other diseases are used as training samples. Then, the scores of the associations between lncRNAs and disease $i$ are calculated by the prediction method. After that, the test and candidate samples are ranked by score in descending order, and each sample's rank is compared with a given threshold. A test sample ranked above the threshold is counted as a true positive, otherwise as a false negative. A candidate sample ranked above the threshold is counted as a false positive, otherwise as a true negative. The true positive rate (TPR) and false positive rate (FPR) are then calculated as follows:
$$TPR = \frac{TP}{TP + FN} \tag{10}$$
$$FPR = \frac{FP}{FP + TN} \tag{11}$$
where $TP$ denotes the number of true positive samples, $TN$ denotes the number of true negative samples, $FP$ denotes the number of false positive samples, and $FN$ denotes the number of false negative samples. The receiver operating characteristic (ROC) curve is drawn from the TPR and FPR at different thresholds, and the area under the ROC curve (AUC) is calculated to evaluate the performance of the method. An AUC of 1 denotes perfect performance, while an AUC of 0.5 denotes predictions that are no better than random.
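The ranking-based definitions above can be checked on a handful of scores. The sketch below is an illustration with hypothetical function names: it computes TPR and FPR for a given rank threshold, and it computes the AUC through the standard equivalence with the probability that a randomly chosen test sample is scored above a randomly chosen candidate sample, rather than by integrating the curve.

```python
def tpr_fpr_at(test_scores, candidate_scores, threshold_rank):
    """TPR and FPR when the top `threshold_rank` ranked samples are called positive."""
    labeled = [(s, 1) for s in test_scores] + [(s, 0) for s in candidate_scores]
    labeled.sort(key=lambda item: item[0], reverse=True)  # descending by score
    top = labeled[:threshold_rank]
    tp = sum(label for _, label in top)
    fp = len(top) - tp
    fn = len(test_scores) - tp
    tn = len(candidate_scores) - fp
    return tp / (tp + fn), fp / (fp + tn)


def roc_auc(test_scores, candidate_scores):
    """AUC as the fraction of (test, candidate) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in test_scores:
        for c in candidate_scores:
            if p > c:
                wins += 1.0
            elif p == c:
                wins += 0.5
    return wins / (len(test_scores) * len(candidate_scores))


# Toy scores for illustration only.
test_scores = [0.9, 0.4]
candidate_scores = [0.8, 0.3, 0.1]
tpr, fpr = tpr_fpr_at(test_scores, candidate_scores, threshold_rank=2)
auc = roc_auc(test_scores, candidate_scores)
```

In this toy case five of the six (test, candidate) pairs are ordered correctly, so the AUC is 5/6.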
In addition, the precision and recall are calculated as follows:
$$Precision = \frac{TP}{TP + FP} \tag{12}$$
$$Recall = \frac{TP}{TP + FN} \tag{13}$$
where precision denotes the proportion of true positive samples among the samples predicted as positive, that is, those ranked above the given threshold, and recall denotes the proportion of all positive samples that are ranked above the given threshold. Then, the Precision-Recall (PR) curve is plotted from the precision and recall values. Finally, the area under the PR curve (AUPR) is computed to evaluate the performance of the method.
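Precision and recall at a rank threshold follow directly from a ranked list of labels. The sketch below also includes average precision, which is one common estimate of the area under the PR curve; the paper does not state how its AUPR is integrated, so that choice, like the function names, is an assumption.

```python
def precision_recall_at(ranked_labels, k):
    """Precision and recall when the top-k ranked samples are predicted positive."""
    tp = sum(ranked_labels[:k])
    total_pos = sum(ranked_labels)
    return tp / k, tp / total_pos


def average_precision(ranked_labels):
    """Mean of the precision values at the rank of each positive sample."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits


# Labels sorted by descending score: 1 = test (positive), 0 = candidate.
ranked_labels = [1, 0, 1, 0, 0]
p_at_2, r_at_2 = precision_recall_at(ranked_labels, 2)
ap = average_precision(ranked_labels)
```

When positives are a very small fraction of all scored pairs, as with 2697 known associations among 240 x 412 pairs, the precision at most thresholds is small, which is consistent with AUPR values far below the AUC values.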
3.3 Ten-fold cross validation
To evaluate the performance of LDICDL, ten-fold cross validation is applied in the experiment. We compare LDICDL with two state-of-the-art methods based on matrix completion, SIMCLDA [22] and MFLDA [21]. The performance of the different methods is evaluated in terms of AUC. As shown in Figure 4, LDICDL achieves an AUC of 0.8651, which is higher than those of the other methods (SIMCLDA 0.8259 and MFLDA 0.6430). In addition, the AUPR is used to compare the methods, as shown in Figure 5. The AUPR of LDICDL is 0.0306, in contrast to 0.0227 and 0.0051 for SIMCLDA and MFLDA, respectively. This indicates that our method is more effective than the other two methods. Figure 6 shows the number of correctly retrieved known lncRNA-disease associations. LDICDL outperforms the other methods from the top 10 to the top 50 associations.
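The ten-fold protocol behind these numbers, as described in Section 3.2, could be set up roughly as follows. This is a sketch under assumptions: the known associations are represented as index pairs, the seed and the striding scheme used to form folds are arbitrary choices, and the function name `ten_fold_splits` is hypothetical.

```python
import random


def ten_fold_splits(known_pairs, n_folds=10, seed=0):
    """Yield (train_pairs, test_pairs) with known associations split at random."""
    pairs = list(known_pairs)
    random.Random(seed).shuffle(pairs)
    folds = [pairs[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        train = [p for g, fold in enumerate(folds) if g != f for p in fold]
        yield train, test


# 25 toy (lncRNA index, disease index) pairs for illustration only.
known = [(i, j) for i in range(5) for j in range(5)]
splits = list(ten_fold_splits(known))
```

For each split, the test pairs would be set to 0 in the association matrix before training, and then scored together with the unknown candidate pairs.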
To show that our model can obtain a deep latent representation of the features, we compare LDICDL with three classical feature extraction methods: Nonnegative Matrix Factorization (NMF), Principal Component Analysis (PCA) and Latent Dirichlet Allocation (LDA). The comparison is shown in Figure 7. LDICDL, which is based on the stacked denoising autoencoder, outperforms the other methods. Moreover, to show the effect of combining MF and SDAE, we compare the combination with MF alone and with SDAE alone. The result, shown in Figure 8, demonstrates that the combination of MF and SDAE outperforms either single method. We also compare different regularization methods (L1, L21 and L2) in matrix factorization. The results are shown in Figure 9; the L2 norm obtains the best performance.
Fig. 4. The AUROC of LDICDL, SIMCLDA and MFLDA by using ten-fold
cross validation.
Fig. 5. The AUPR of LDICDL, SIMCLDA and MFLDA by using ten-fold
cross validation.
Fig. 6. Number of correctly retrieved known lncRNA-disease associations for specified rank thresholds based on ten-fold cross validation.
3.4 De novo test
To validate the performance of LDICDL in identifying potential associations for new diseases, a de novo test is conducted. In each round, the de novo test removes all known lncRNA associations of a disease $i$ and uses them as the test set. The potential associations between lncRNAs and disease $i$ are then predicted based on feature information. The AUROC and AUPR results are shown in Figures 10 and 11, respectively. LDICDL achieves the highest AUC and AUPR (0.8917 and 0.1666). Its AUC is at least 0.09 higher than those of the other methods (SIMCLDA 0.7923 and MFLDA 0.5952), and its AUPR is about 0.04 higher (SIMCLDA 0.1270 and MFLDA 0.0398). This demonstrates that our method is superior to the other methods in the de novo test. Figure 12 shows the number of correctly retrieved known lncRNA-disease associations. LDICDL outperforms the other methods from the top 10 to the top 50.
Fig. 7. The AUROC of LDICDL, PCA, LDA and NMF by using ten-fold cross validation.
Fig. 8. The AUROC of MF, SDAE and SDAE+MF by using ten-fold cross validation.
Fig. 9. The AUROC of L1, L21 and L2 in MF by using ten-fold cross validation.
Fig. 10. The AUROC of LDICDL, SIMCLDA and MFLDA by using de
novo cross validation.
Fig. 11. The AUPR of LDICDL, SIMCLDA and MFLDA by using de novo
cross validation.
3.5 The effects of parameters
In the SDAE, the feature information of lncRNAs and diseases is reduced to a subspace. To test the effect of the feature dimension $k$, we conduct ten-fold cross validation while varying the feature dimension from 50 to 250 in steps of 50. The result is shown in Figure 13. LDICDL achieves the best performance when the feature dimension is equal to 100. Therefore, $k = 100$ is used in the experiments. All three hidden layers use the non-linear activation function tanh, and the output layer uses the sigmoid function. The numbers of neurons in the three hidden layers of the autoencoder are set to 130, 100 and 130, respectively.
Fig. 12. The number of correctly retrieved known lncRNA-disease associations for specified rank thresholds based on de novo validation.
The hyperparameters are selected by the random search proposed in [35]. $\gamma$ and $\theta$ are chosen from [0.1, 1, 10, 100, 200, 300, 500, 1000]; the ratios $\gamma_{l} : \gamma_{n}$ and $\gamma_{d} : \gamma_{n}$ are both chosen from [1:1, 100:1, 200:1, 300:1, 400:1, 500:1, 600:1, 700:1, 800:1, 900:1, 1000:1] [36]; and $\gamma_{w}$ is chosen from [0.1, 0.3, 0.5, 0.7, 0.9]. All hyperparameters are sampled from a uniform distribution over the set of possible values. In our experiment, we repeat the process 20 times to find the optimum parameters. The parameters are set as follows: $\theta = 100$, $\gamma = 300$, $\gamma_{l} : \gamma_{n} = \gamma_{d} : \gamma_{n} = 100 : 1$ and $\gamma_{w} = 0.3$.
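A random search over these grids can be sketched as below. The candidate values are those listed in the text; everything else is an assumption made for illustration, including the dictionary keys, the independent sampling of the two ratios, and the placeholder objective, which stands in for a full training and validation run.

```python
import random

RATIOS = [1, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
SEARCH_SPACE = {
    "gamma": [0.1, 1, 10, 100, 200, 300, 500, 1000],
    "theta": [0.1, 1, 10, 100, 200, 300, 500, 1000],
    "ratio_l_n": RATIOS,  # gamma_l : gamma_n, written as x for x:1
    "ratio_d_n": RATIOS,  # gamma_d : gamma_n, written as x for x:1
    "gamma_w": [0.1, 0.3, 0.5, 0.7, 0.9],
}


def random_search(evaluate, n_trials=20, seed=0):
    """Sample each hyperparameter uniformly from its grid; keep the best trial."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_objective(cfg):
    # Placeholder only; a real run would train the model and return, e.g., its AUC.
    return -abs(cfg["gamma_w"] - 0.3)


best_cfg, best_score = random_search(toy_objective)
```

With 20 trials, only a small part of the 8 x 8 x 11 x 11 x 5 grid is visited, which is the trade-off random search makes against an exhaustive grid search.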
Fig. 13. The effect of feature dimension k.
3.6 Case study
To demonstrate the capability of LDICDL in identifying potential lncRNA-disease associations, osteosarcoma is selected as a case study. In the case study, all known associations between lncRNAs and osteosarcoma are treated as positive samples. The potential associations are then predicted by LDICDL, and the predicted lncRNAs of osteosarcoma are checked against recent publications.
Osteosarcoma (osteogenic sarcoma) is a primary bone malignancy that often affects children and young adults (approximately 3.4% of all childhood cancers) [37], [38]. This cancer is rare (less than 1% of all cancers diagnosed) and its pathogenesis is unknown. With the development of multi-agent chemotherapy regimens, the long-term survival rate has improved from 65% to 70% [39]. Unfortunately, prognosis and treatment have not improved in several decades.
Table 1 shows the top 10 lncRNAs of osteosarcoma predicted by LDICDL. As shown in Table 1, 9 out of 10 lncRNAs are reported to be related to osteosarcoma in recent literature. H19, ranked first, has been reported to be related to osteosarcoma [40]. The rs217727 in H19 can increase the IGF2 cord blood level, which is significantly associated with osteosarcoma. The long non-coding RNA PVT1, ranked second, can promote cell apoptosis and inhibit cell proliferation, migration and invasion in osteosarcoma cells by regulating the expression of miR-195 [41]. GAS5, ranked third, can promote the expression of aplasia Ras homologue member I (ARHI), which suppresses cell growth and epithelial-mesenchymal transition in osteosarcoma, by acting as a molecular sponge to regulate the expression of miR-221 [42]. Recent research shows that NEAT1, ranked fourth, is significantly upregulated in osteosarcoma cell lines, which is closely associated with higher clinical stage, distant metastasis and poorer prognosis. In addition, it can inhibit E-cadherin expression and promote the metastasis of osteosarcoma by associating with the G9a-DNMT1-Snail complex [43]. The long non-coding RNA KCNQ1OT1, ranked fifth, has been found to be associated with cell invasion, migration, growth, proliferation and apoptosis through enhancing WNT/beta-catenin signaling pathway activity in osteosarcoma tissue [44]. AFAP1-AS1, ranked seventh, is significantly over-expressed, and knockdown of AFAP1-AS1 strikingly inhibits cell proliferation in osteosarcoma tissue. This demonstrates that AFAP1-AS1 can promote cell proliferation in osteosarcoma via regulating miR-4695-5p/TCF4-β-catenin signaling [45]. The long non-coding RNA XIST, ranked eighth, has been shown to bind to miR-320b and inhibit its expression in osteosarcoma cells. miR-320b can target the Ras-related protein RAP2B and inhibit the expression of RAP2B, which is involved in cell proliferation and invasion of osteosarcoma [46]. CCAT1, ranked ninth, is upregulated in osteosarcoma tissues and cells, and is related to the cell proliferation and migration of osteosarcoma by binding to miR-148a and regulating the signaling pathway of phosphatidylinositol 3-kinase interacting protein 1 (PIK3IP1) [47]. Recent evidence shows that the long non-coding RNA SPRY4-IT1, ranked tenth, is over-expressed in osteosarcoma tissues and that SPRY4-IT1 knockdown strikingly inhibits cell proliferation through inhibiting the expression of G1 [48]. In addition, some interesting lncRNAs, such as MIR155HG, are found by our method. The biological functions of these lncRNAs are still unknown, and they deserve validation by biological experiments.
4 DISCUSSION
It is well known that lncRNA is an important kind of non-coding RNA with a length of more than 200 nucleotides [49]. Accumulating evidence shows that lncRNAs play critical roles in various biological processes such as chromosome dosage compensation, genomic imprinting, epigenetic regulation, nuclear and cytoplasmic trafficking, cell proliferation, cell differentiation, cell growth, cell metabolism and cell apoptosis [50], [51]. In addition, a growing number of studies demonstrate that lncRNAs are closely related to various diseases, including cancer [28]. Therefore, identifying lncRNA-disease associations helps in understanding the pathogenesis of disease and, further, in disease treatment and drug discovery.

TABLE 1
Top 10 lncRNAs of osteosarcoma predicted by LDICDL
Rank  LncRNA      Evidence
1     H19         [40]
2     PVT1        [41]
3     GAS5        [42]
4     NEAT1       [43]
5     KCNQ1OT1    [44]
6     MIR155HG    Unknown
7     AFAP1-AS1   [45]
8     XIST        [46]
9     CCAT1       [47]
10    SPRY4-IT1   [48]
In this study, we have proposed a computational method,
called LDICDL, to predict lncRNA-disease associations based
on collaborative deep learning. In this approach, the
lncRNA-disease association matrix is decomposed into an
lncRNA feature subspace, while disease feature information is
encoded using SDAE. Meanwhile, the lncRNA-disease association
matrix is decomposed into a disease feature subspace, while
lncRNA feature information is encoded using SDAE. Then, two
lncRNA-disease association scores are predicted: one from the
lncRNA feature matrix and the disease encoded matrix, and the
other from the disease feature matrix and the lncRNA encoded
matrix. The final lncRNA-disease association score is the
average of these two scores. The results demonstrate that
LDICDL is competitive and often performs better than other
state-of-the-art methods. In addition, our method may also be
applied to other biological association prediction tasks such
as miRNA-disease association prediction [52], [53], [54],
drug-target interaction prediction [55] and disease gene
prediction [56].
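
The score fusion summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name fuse_scores, the matrix names, the toy sizes and the random inputs are all assumptions made for illustration, and the sketch assumes that each branch scores a pair by the inner product of a latent factor vector and an SDAE encoding of equal dimension.

```python
import numpy as np


def fuse_scores(lnc_latent, dis_encoded, lnc_encoded, dis_latent):
    """Average the two branch score matrices (hypothetical helper)."""
    # Branch 1: lncRNA latent factors (matrix decomposition)
    # combined with disease encodings (SDAE output).
    score_a = lnc_latent @ dis_encoded.T
    # Branch 2: lncRNA encodings (SDAE output) combined with
    # disease latent factors (matrix decomposition).
    score_b = lnc_encoded @ dis_latent.T
    # Final score: element-wise mean of the two branches.
    return (score_a + score_b) / 2.0


# Toy data only; sizes and values are not taken from the paper.
rng = np.random.default_rng(0)
n_lnc, n_dis, k = 5, 4, 3
lnc_latent = rng.random((n_lnc, k))
dis_encoded = rng.random((n_dis, k))
lnc_encoded = rng.random((n_lnc, k))
dis_latent = rng.random((n_dis, k))

final_scores = fuse_scores(lnc_latent, dis_encoded, lnc_encoded, dis_latent)

# Rank candidate lncRNAs for one disease (column 0), highest score first.
ranking = np.argsort(-final_scores[:, 0])
print(ranking)
```

Sorting one column of the fused matrix in descending order gives a candidate list of the kind reported for osteosarcoma in Table 1.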
FUNDING
This work was partially supported by the National Natural
Science Foundation of China (Nos. 61702122, 61963004 and
61972185), the Natural Science Foundation of Guangxi
(Nos. 2017GXNSFDA198033 and 2018GXNSFBA281193), the Key
Research and Development Plan of Guangxi (No. AB17195055),
the foundation of Guangxi University (Nos. 20190240 and
XBZ180479), the Innovation Project of Guangxi Graduate
Education (No. YCSW2020020), the Natural Science Foundation
of Yunnan Province of China (No. 2019FA024), the Hunan
Provincial Science and Technology Program (No. 2018WK4001),
and the Scientific Research Foundation of Hunan Provincial
Education Department (No. 18B469).
REFERENCES
[1] G. L. Maor, A. Yearim, and G. Ast, “The alternative role of dna
methylation in splicing regulation,” Trends in Genetics, vol. 31,
no. 5, pp. 274–280, 2015.
[2] W. Lan, J. Wang, M. Li, J. Liu, F.-X. Wu, and Y. Pan, “Predicting
microrna-disease associations based on improved microrna and
disease similarities,” IEEE/ACM Transactions on Computational Bi-
ology and Bioinformatics (TCBB), vol. 15, no. 6, pp. 1774–1782, 2018.
[3] E. Anastasiadou, L. S. Jacob, and F. J. Slack, “Non-coding rna
networks in cancer,” Nature Reviews Cancer, vol. 18, no. 1, p. 5,
2018.
[4] J. Ponjavic, C. P. Ponting, and G. Lunter, “Functionality or tran-
scriptional noise? evidence for selection within long noncoding
rnas,” Genome research, vol. 17, no. 5, pp. 556–565, 2007.
[5] Q. Chen, W. Lan, and J. Wang, “Mining featured patterns of
mirna interaction based on sequence and structure similarity,”
IEEE/ACM Transactions on Computational Biology and Bioinformatics
(TCBB), vol. 10, no. 2, pp. 415–422, 2013.
[6] K. C. Wang and H. Y. Chang, “Molecular mechanisms of long
noncoding rnas,” Molecular cell, vol. 43, no. 6, pp. 904–914, 2011.
[7] X. Xue, Y. A. Yang, A. Zhang, K. Fong, J. Kim, B. Song, S. Li, J. C.
Zhao, and J. Yu, “Lncrna hotair enhances er signaling and confers
tamoxifen resistance in breast cancer,” Oncogene, vol. 35, no. 21, p.
2746, 2016.
[8] L. Yang, C. Lin, C. Jin, J. C. Yang, B. Tanasa, W. Li, D. Merkurjev,
K. A. Ohgi, D. Meng, J. Zhang et al., “lncrna-dependent mecha-
nisms of androgen-receptor-regulated gene activation programs,”
Nature, vol. 500, no. 7464, p. 598, 2013.
[9] W. Lan, J. Wang, M. Li, W. Peng, and F. Wu, “Computational
approaches for prioritizing candidate disease genes based on ppi
networks,” Tsinghua Science and Technology, vol. 20, no. 5, pp. 500–
512, 2015.
[10] G. Yang, X. Lu, and L. Yuan, “Lncrna: a link between rna and
cancer,” Biochimica et Biophysica Acta (BBA)-Gene Regulatory Mech-
anisms, vol. 1839, no. 11, pp. 1097–1109, 2014.
[11] P.-J. Volders, K. Verheggen, G. Menschaert, K. Vandepoele,
L. Martens, J. Vandesompele, and P. Mestdagh, “An update on
lncipedia: a database for annotated human lncrna sequences,”
Nucleic acids research, vol. 43, no. D1, pp. D174–D180, 2014.
[12] Q. Jiang, J. Wang, X. Wu, R. Ma, T. Zhang, S. Jin, Z. Han, R. Tan,
J. Peng, G. Liu et al., “Lncrna2target: a database for differential-
ly expressed genes after lncrna knockdown or overexpression,”
Nucleic acids research, vol. 43, no. D1, pp. D193–D196, 2014.
[13] W. Lan, L. Huang, D. Lai, and Q. Chen, “Identifying interactions
between long noncoding rnas and diseases based on computation-
al methods,” in Computational Systems Biology. Springer, 2018, pp.
205–221.
[14] J. Sun, H. Shi, Z. Wang, C. Zhang, L. Liu, L. Wang, W. He, D. Hao,
S. Liu, and M. Zhou, “Inferring novel lncrna–disease associations
based on a random walk model of a lncrna functional similarity
network,” Molecular BioSystems, vol. 10, no. 8, pp. 2074–2081, 2014.
[15] X. Chen, Z.-H. You, G.-Y. Yan, and D.-W. Gong, “Irwrlda: im-
proved random walk with restart for lncrna-disease association
prediction,” Oncotarget, vol. 7, no. 36, p. 57919, 2016.
[16] M. Zhou, X. Wang, J. Li, D. Hao, Z. Wang, H. Shi, L. Han,
H. Zhou, and J. Sun, “Prioritizing candidate disease-related long
non-coding rnas by walking on the heterogeneous lncrna and
disease network,” Molecular BioSystems, vol. 11, no. 3, pp. 760–769,
2015.
[17] Q. Yao, L. Wu, J. Li, L. guang Yang, Y. Sun, Z. Li, S. He, F. Feng,
H. Li, and Y. Li, “Global prioritizing disease candidate lncrnas via
a multi-level composite network,” Scientific reports, vol. 7, p. 39516,
2017.
[18] W. Lan, M. Li, K. Zhao, J. Liu, F.-X. Wu, Y. Pan, and J. Wang,
“Ldap: a web server for lncrna-disease association prediction,”
Bioinformatics, vol. 33, no. 3, pp. 458–460, 2016.
[19] X. Chen and G. Yan, “Novel human lncrna–disease association in-
ference based on lncrna expression profiles,” Bioinformatics, vol. 29,
no. 20, pp. 2617–2624, 2013.
[20] X. Wu, W. Lan, Q. Chen, Y. Dong, J. Liu, and W. Peng, “Inferring
lncrna-disease associations based on graph autoencoder matrix
completion,” Computational Biology and Chemistry, p. 107282, 2020.
[21] G. Fu, J. Wang, C. Domeniconi, and G. Yu, “Matrix factorization-
based data fusion for the prediction of lncrna–disease association-
s,” Bioinformatics, vol. 34, no. 9, pp. 1529–1537, 2017.
[22] C. Lu, M. Yang, F. Luo, F.-X. Wu, M. Li, Y. Pan, Y. Li, and J. Wang,
“Prediction of lncrna–disease associations based on inductive
matrix completion,” Bioinformatics, vol. 34, no. 19, pp. 3357–3364,
2018.
[23] Q. Chen, D. Lai, W. Lan, X. Wu, B. Chen, Y.-P. P. Chen, and J. Wang,
“Ildmsf: Inferring associations between long non-coding rna and
disease based on multi-similarity fusion,” IEEE/ACM transactions
on computational biology and bioinformatics, 2019.
[24] J. Han, L. Zheng, Y. Xu, B. Zhang, F. Zhuang, P. S. Yu, and
W. Zuo, “Adaptive deep modeling of users and items using side
information for recommendation,” IEEE Transactions on Neural
Networks and Learning Systems, vol. 31, no. 3, pp. 737–748, 2020.
[25] H. Park, J. Jung, and U. Kang, “A comparative study of matrix
factorization and random walk with restart in recommender sys-
tems,” in 2017 IEEE International Conference on Big Data (Big Data).
IEEE, 2017, pp. 756–765.
[26] H. Wang, N. Wang, and D.-Y. Yeung, “Collaborative deep learning
for recommender systems,” in Proceedings of the 21th ACM SIGKD-
D international conference on knowledge discovery and data mining.
ACM, 2015, pp. 1235–1244.
[27] A. Ramlatchan, M. Yang, Q. Liu, M. Li, J. Wang, and Y. Li,
“A survey of matrix completion methods for recommendation
systems,” Big Data Mining and Analytics, vol. 1, no. 4, pp. 308–323,
2018.
[28] Y. Hu, Y. Koren, and C. Volinsky, “Collaborative filtering for
implicit feedback datasets,” in 2008 Eighth IEEE International
Conference on Data Mining. IEEE, 2008, pp. 263–272.
[29] Q. Jiang, J. Wang, X. Wu, R. Ma, T. Zhang, S. Jin, Z. Han, R. Tan,
J. Peng, G. Liu et al., “Lncrna2target: a database for differential-
ly expressed genes after lncrna knockdown or overexpression,”
Nucleic acids research, vol. 43, no. D1, pp. D193–D196, 2014.
[30] Z. Lu, K. Bretonnel Cohen, and L. Hunter, “Generif quality
assurance as summary revision,” in Biocomputing 2007. World
Scientific, 2007, pp. 269–280.
[31] C. Jonquet, N. H. Shah, and M. A. Musen, “The open biomedical
annotator,” Summit on translational bioinformatics, vol. 2009, p. 56,
2009.
[32] J.-H. Li, S. Liu, H. Zhou, L.-H. Qu, and J.-H. Yang, “starbase
v2.0: decoding mirna-cerna, mirna-ncrna and protein–rna interaction
networks from large-scale clip-seq data,” Nucleic acids research,
vol. 42, no. D1, pp. D92–D97, 2013.
[33] Y. Li, C. Qiu, J. Tu, B. Geng, J. Yang, T. Jiang, and Q. Cui, “Hmdd
v2.0: a database for experimentally supported human microrna
and disease associations,” Nucleic acids research, vol. 42, no. D1,
pp. D1070–D1074, 2013.
[34] J. Pinero, N. Queralt-Rosinach, A. Bravo, J. Deu-Pons, A. Bauer-
Mehren, M. Baron, F. Sanz, and L. I. Furlong, “Disgenet: a discov-
ery platform for the dynamical exploration of human diseases and
their genes,” Database, vol. 2015, 2015.
[35] J. Bergstra and Y. Bengio, “Random search for hyper-parameter
optimization,” Journal of Machine Learning Research, vol. 13, no. 1,
pp. 281–305, 2012.
[36] H. Wang, N. Wang, and D. Y. Yeung, “Collaborative deep learning
for recommender systems,” 2014.
[37] P. A. Meyers and R. Gorlick, “Osteosarcoma,” Pediatric Clinics of
North America, vol. 44, no. 4, pp. 973–989, 1997.
[38] B. A. Lindsey, J. E. Markel, and E. S. Kleinerman, “Osteosarcoma
overview,” Rheumatology and therapy, vol. 4, no. 1, pp. 25–43, 2017.
[39] M. S. Isakoff, S. S. Bielack, P. Meltzer, and R. Gorlick, “Osteosar-
coma: current treatment and a collaborative pathway to success,”
Journal of clinical oncology, vol. 33, no. 27, p. 3029, 2015.
[40] T. He, D. Xu, T. Sui, J. Zhu, Z. Wei, and Y. Wang, “Association
between h19 polymorphisms and osteosarcoma risk,” Eur Rev Med
Pharmacol Sci, vol. 21, no. 17, pp. 3775–3780, 2017.
[41] Q. Zhou, F. Chen, J. Zhao, B. Li, Y. Liang, W. Pan, S. Zhang,
X. Wang, and D. Zheng, “Long non-coding rna pvt1 promotes
osteosarcoma development by acting as a molecular sponge to
regulate mir-195,” Oncotarget, vol. 7, no. 50, p. 82620, 2016.
[42] K. Ye, S. Wang, H. Zhang, H. Han, B. Ma, and W. Nan,
“Long noncoding rna gas5 suppresses cell growth and epithelial–
mesenchymal transition in osteosarcoma by regulating the mir-
221/arhi pathway,” Journal of cellular biochemistry, vol. 118, no. 12,
pp. 4772–4781, 2017.
[43] Y. Li and C. Cheng, “Long noncoding rna neat1 promotes the
metastasis of osteosarcoma via interaction with the g9a-dnmt1-
snail complex,” American journal of cancer research, vol. 8, no. 1,
p. 81, 2018.
[44] C. Zhang, S. Du, and L. Cao, “Long non-coding rna kcnq1ot1 pro-
motes osteosarcoma progression by increasing β-catenin activity,”
RSC advances, vol. 8, no. 66, pp. 37581–37589, 2018.
[45] R. Li, S. Liu, Y. Li, Q. Tang, Y. Xie, and R. Zhai, “Long noncoding
rna afap1-as1 enhances cell proliferation and invasion in osteosar-
coma through regulating mir-4695-5p/tcf4-β-catenin signaling,”
Molecular medicine reports, vol. 18, no. 2, pp. 1616–1622, 2018.
[46] G.-Y. Lv, J. Miao, and X.-L. Zhang, “Long noncoding rna xist pro-
motes osteosarcoma progression by targeting ras-related protein
rap2b via mir-320b,” Oncology Research Featuring Preclinical and
Clinical Cancer Therapeutics, vol. 26, no. 6, pp. 837–846, 2018.
[47] J. Zhao and L. Cheng, “Long non-coding rna ccat1/mir-148a
axis promotes osteosarcoma proliferation and migration through
regulating pik3ip1,” Acta biochimica et biophysica Sinica, vol. 49,
no. 6, pp. 503–512, 2017.
[48] J. Xu, R. Ding, and Y. Xu, “Effects of long non-coding rna spry4-
it1 on osteosarcoma cell biological behavior,” American journal of
translational research, vol. 8, no. 12, p. 5330, 2016.
[49] C. P. Ponting, P. L. Oliver, and W. Reik, “Evolution and functions
of long noncoding rnas,” Cell, vol. 136, no. 4, pp. 629–641, 2009.
[50] R. Zheng, M. Li, X. Chen, S. Zhao, F. Wu, Y. Pan, and J. Wang,
“An ensemble method to reconstruct gene regulatory networks
based on multivariate adaptive regression splines,” IEEE/ACM
transactions on computational biology and bioinformatics, 2019.
[51] A. Necsulea, M. Soumillon, M. Warnefors, A. Liechti, T. Daish,
U. Zeller, J. C. Baker, F. Grützner, and H. Kaessmann, “The
evolution of lncrna repertoires and expression patterns in tetrapods,”
Nature, vol. 505, no. 7485, p. 635, 2014.
[52] W. Peng, W. Lan, Z. Yu, J. Wang, and Y. Pan, “A framework
for integrating multiple biological networks to predict microrna-
disease associations,” IEEE transactions on nanobioscience, vol. 16,
no. 2, pp. 100–107, 2016.
[53] W. Lan, Q. Chen, T. Li, C. Yuan, S. Mann, and B. Chen, “I-
dentification of important positions within mirnas by integrating
sequential and structural features,” Current Protein and Peptide
Science, vol. 15, no. 6, pp. 591–597, 2014.
[54] W. Peng, W. Lan, J. Zhong, J. Wang, and Y. Pan, “A novel method
of predicting microrna-disease associations based on microrna,
disease, gene and environment factor networks,” Methods, vol. 124,
pp. 69–77, 2017.
[55] W. Lan, J. Wang, M. Li, J. Liu, Y. Li, F.-X. Wu, and Y. Pan, “Predict-
ing drug–target interaction using positive-unlabeled learning,”
Neurocomputing, vol. 206, pp. 50–57, 2016.
[56] J. Liu, M. Li, W. Lan, F.-X. Wu, Y. Pan, and J. Wang, “Classification
of alzheimer’s disease using whole brain hierarchical network,”
IEEE/ACM transactions on computational biology and bioinformatics,
vol. 15, no. 2, pp. 624–632, 2016.
Authorized licensed use limited to: Guangxi University. Downloaded on November 03,2020 at 02:51:23 UTC from IEEE Xplore. Restrictions apply.
... Consequently, machine learning algorithms have been broadly applied in LDA prediction, for example, collaborative filtering , graph regularization , matrix factorization (Fu et al., 2018;Wang et al., 2020;Xi et al., 2022), heterogeneous graph learning framework, (Cao et al., 2023), and ensemble learning models (Peng et al., 2022a). Notably, deep learning has been broadly applied due to its powerful classification performance (Sun et al., 2022;Wang et al., 2023b;Hu et al., 2023;Jiang et al., 2023;Zhou et al., 2024a), such as in the graph convolution network (Wang W. et al., 2022), node2vec , collaborative deep learning (Lan et al., 2020), deep neural network (Wei et al., 2020), deep multi-network embedding (Ma, 2022), graph autoencoder Zhou et al., 2024b), and a capsule network with the attention mechanism . In particular, to identify new LDAs, a few models first extracted LDA features and classified unknown lncRNA-disease pairs (LDPs) by combining machine leaning models. ...
Article
Full-text available
Introduction: Long non-coding RNAs (lncRNAs) have been in the clinical use as potential prognostic biomarkers of various types of cancer. Identifying associations between lncRNAs and diseases helps capture the potential biomarkers and design efficient therapeutic options for diseases. Wet experiments for identifying these associations are costly and laborious. Methods: We developed LDA-SABC, a novel boosting-based framework for lncRNA–disease association (LDA) prediction. LDA-SABC extracts LDA features based on singular value decomposition (SVD) and classifies lncRNA–disease pairs (LDPs) by incorporating LightGBM and AdaBoost into the convolutional neural network. Results: The LDA-SABC performance was evaluated under five-fold cross validations (CVs) on lncRNAs, diseases, and LDPs. It obviously outperformed four other classical LDA inference methods (SDLDA, LDNFSGB, LDASR, and IPCAF) through precision, recall, accuracy, F1 score, AUC, and AUPR. Based on the accurate LDA prediction performance of LDA-SABC, we used it to find potential lncRNA biomarkers for lung cancer. The results elucidated that 7SK and HULC could have a relationship with non-small-cell lung cancer (NSCLC) and lung adenocarcinoma (LUAD), respectively. Conclusion: We hope that our proposed LDA-SABC method can help improve the LDA identification.
Article
Deep learning-based multi-omics data integration methods have the capability to reveal the mechanisms of cancer development, discover cancer biomarkers and identify pathogenic targets. However, current methods ignore the potential correlations between samples in integrating multi-omics data. In addition, providing accurate biological explanations still poses significant challenges due to the complexity of deep learning models. Therefore, there is an urgent need for a deep learning-based multi-omics integration method to explore the potential correlations between samples and provide model interpretability. Herein, we propose a novel interpretable multi-omics data integration method (DeepKEGG) for cancer recurrence prediction and biomarker discovery. In DeepKEGG, a biological hierarchical module is designed for local connections of neuron nodes and model interpretability based on the biological relationship between genes/miRNAs and pathways. In addition, a pathway self-attention module is constructed to explore the correlation between different samples and generate the potential pathway feature representation for enhancing the prediction performance of the model. Lastly, an attribution-based feature importance calculation method is utilized to discover biomarkers related to cancer recurrence and provide a biological interpretation of the model. Experimental results demonstrate that DeepKEGG outperforms other state-of-the-art methods in 5-fold cross validation. Furthermore, case studies also indicate that DeepKEGG serves as an effective tool for biomarker discovery. The code is available at https://github.com/lanbiolab/DeepKEGG.
Article
Exploration of the intricate connections between long noncoding RNA (lncRNA) and diseases, referred to as lncRNA-disease associations (LDAs), plays a pivotal and indispensable role in unraveling the underlying molecular mechanisms of diseases and devising practical treatment approaches. It is imperative to employ computational methods for predicting lncRNA-disease associations to circumvent the need for superfluous experimental endeavors. Graph-based learning models have gained substantial popularity in predicting these associations, primarily because of their capacity to leverage node attributes and relationships within the network. Nevertheless, there remains much room for enhancing the performance of these techniques by incorporating and harmonizing the node attributes more effectively. In this context, we introduce a novel model, i.e., Adaptive Message Passing and Feature Fusion (AMPFLDAP), for forecasting lncRNA-disease associations within a heterogeneous network. Firstly, we constructed a heterogeneous network involving lncRNA, microRNA (miRNA), and diseases based on established associations and employing Gaussian interaction profile kernel similarity as a measure. Then, an adaptive topological message passing mechanism is suggested to address the information aggregation for heterogeneous networks. The topological features of nodes in the heterogeneous network were extracted based on the adaptive topological message passing mechanism. Moreover, an attention mechanism is applied to integrate both topological and semantic information to achieve the multimodal features of biomolecules, which are further used to predict potential LDAs. The experimental results demonstrated that the performance of the proposed AMPFLDAP is superior to seven state-of-the-art methods. 
Furthermore, to validate its efficacy in practical scenarios, we conducted detailed case studies involving three distinct diseases, which conclusively demonstrated AMPFLDAP’s effectiveness in the prediction of LDAs.
Article
Long noncoding RNAs (lncRNAs) have been discovered to be extensively involved in eukaryotic epigenetic, transcriptional, and post-transcriptional regulatory processes with the advancements in sequencing technology and genomics research. Therefore, they play crucial roles in the body’s normal physiology and various disease outcomes. Presently, numerous unknown lncRNA sequencing data require exploration. Establishing deep learning-based prediction models for lncRNAs provides valuable insights for researchers, substantially reducing time and costs associated with trial and error and facilitating the disease-relevant lncRNA identification for prognosis analysis and targeted drug development as the era of artificial intelligence progresses. However, most lncRNA-related researchers lack awareness of the latest advancements in deep learning models and model selection and application in functional research on lncRNAs. Thus, we elucidate the concept of deep learning models, explore several prevalent deep learning algorithms and their data preferences, conduct a comprehensive review of recent literature studies with exemplary predictive performance over the past 5 years in conjunction with diverse prediction functions, critically analyze and discuss the merits and limitations of current deep learning models and solutions, while also proposing prospects based on cutting-edge advancements in lncRNA research.
Article
Drug‐target interaction (DTI) prediction is essential for new drug design and development. Constructing heterogeneous network based on diverse information about drugs, proteins and diseases provides new opportunities for DTI prediction. However, the inherent complexity, high dimensionality and noise of such a network prevent us from taking full advantage of these network characteristics. This article proposes a novel method, NGCN, to predict drug‐target interactions from an integrated heterogeneous network, from which to extract relevant biological properties and association information while maintaining the topology information. It focuses on learning the topology representation of drugs and targets to improve the performance of DTI prediction. Unlike traditional methods, it focuses on learning the low‐dimensional topology representation of drugs and targets via graph‐based convolutional neural network. NGCN achieves substantial performance improvements over other state‐of‐the‐art methods, such as a nearly 1.0% increase in AUPR value. Moreover, we verify the robustness of NGCN through benchmark tests, and the experimental results demonstrate it is an extensible framework capable of combining heterogeneous information for DTI prediction.
Article
Full-text available
LncRNAs are non-coding RNAs with a length of more than 200 nucleotides. More and more evidence shows that lncRNAs are inextricably linked with diseases. To make up for the shortcomings of traditional methods, researchers began to collect relevant biological data in the database and used bioinformatics prediction tools to predict the associations between lncRNAs and diseases, which greatly improved the efficiency of the study. To improve the prediction accuracy of current methods, we propose a new lncRNA-disease associations prediction method with attention mechanism, called ResGCN-A. Firstly, we integrated lncRNA functional similarity, lncRNA Gaussian interaction profile kernel similarity, disease semantic similarity, and disease Gaussian interaction profile kernel similarity to obtain lncRNA comprehensive similarity and disease comprehensive similarity. Secondly, the residual graph convolutional network was used to extract the local features of lncRNAs and diseases. Thirdly, the new attention mechanism was used to assign the weight of the above features to further obtain the potential features of lncRNAs and diseases. Finally, the training set required by the Extra-Trees classifier was obtained by concatenating potential features, and the potential associations between lncRNAs and diseases were obtained by the trained Extra-Trees classifier. ResGCN-A combines the residual graph convolutional network with the attention mechanism to realize the local and global features fusion of lncRNA and diseases, which is beneficial to obtain more accurate features and improve the prediction accuracy. In the experiment, ResGCN-A was compared with five other methods through 5-fold cross-validation. The results show that the AUC value and AUPR value obtained by ResGCN-A are 0.9916 and 0.9951, which are superior to the other five methods. In addition, case studies and robustness evaluation have shown that ResGCN-A is an effective method for predicting lncRNA-disease associations. 
The source code for ResGCN-A will be available at https://github.com/Wangxiuxiun/ResGCN-A.
Article
Full-text available
In recent years, the recommendation systems have become increasingly popular and have been used in a broad variety of applications. Here, we investigate the matrix completion techniques for the recommendation systems that are based on collaborative filtering. The collaborative filtering problem can be viewed as predicting the favorability of a user with respect to new items of commodities. When a rating matrix is constructed with users as rows, items as columns, and entries as ratings, the collaborative filtering problem can then be modeled as a matrix completion problem by filling out the unknown elements in the rating matrix. This article presents a comprehensive survey of the matrix completion methods used in recommendation systems. We focus on the mathematical models for matrix completion and the corresponding computational algorithms as well as their characteristics and potential issues. Several applications other than the traditional user-item association prediction are also discussed.
Article
Full-text available
Objective: Long non-coding RNA KCNQ1OT1 has been associated with the development of different types of cancers. The present research investigated the role of KCNQ1OT1 in osteosarcoma. Methods: Expression level of KCNQ1OT1 in osteosarcoma and paired non-cancerous tissue specimens from 56 osteosarcoma patients and its association with patients' clinicopathological features was investigated. KCNQ1OT1 overexpression and knockdown in primary-cultured osteosarcoma cells was constructed by lentiviral transduction. Influence of KCNQ1OT1 overexpression or knockdown on osteosarcoma cell growth, apoptosis, migration, invasion, epithelial-to-mesenchymal transition and beta-catenin activation was investigated. Results: Expression of KCNQ1OT1 in osteosarcoma tissue specimens was significantly increased in comparison to that in adjacent counterparts. High expression of KCNQ1OT1 significantly associated with osteosarcoma progression and patients' decreased survival. Overexpression of KCNQ1OT1 significantly increased osteosarcoma cell growth, proliferation, migration, invasion, epithelial-to-mesenchymal transition and beta-catenin activation while reducing cell apoptosis in vitro, and KCNQ1OT1 knockdown showed opposite effects. Inhibition of beta-catenin/TCF activity by ICG-001 treatment significantly attenuated the promoting effect of KCNQ1OT1 overexpression on osteosarcoma cell malignancy described above. Conclusion: KCNQ1OT1 might be a potential prognostic factor in osteosarcoma. High expression of KCNQ1OT1 might promote osteosarcoma development by increasing the activation of WNT/beta-catenin signaling pathway.
Article
Full-text available
Long noncoding RNA AFAP1‑AS1 has been shown to promote tumor progression in several human cancer types, such as thyroid cancer, tongue squamous cell carcinoma and lung cancer. However, the role of AFAP1‑AS1 in osteosarcoma (OS) has not been investigated. In the present study, the expression of AFAP1‑AS1 was significantly upregulated in OS tissues and cell lines. Moreover, AFAP1‑AS1 expression was negatively correlated with OS patient prognosis. Besides, AFAP1‑AS1 knockdown significantly inhibited the proliferation and invasion of OS cells in vitro. Furthermore, in vivo xenograft experiments indicated that AFAP1‑AS1 depletion delayed tumor growth. Regarding the underlying mechanism, AFAP1‑AS1 served as a sponge to repress the level of microRNA (miR)‑4695‑5p, which targeted transcription factor (TCF)4, a pivot effector of Wnt/β‑catenin signaling pathway. It was demonstrated that overexpression of AFAP1‑AS1 inhibited the expression of miR‑4695‑5p, while miR‑4695‑5p overexpression decreased TCF4 expression and reduced activation of Wnt/β‑catenin pathway. Through rescue assays, it was demonstrated that restoration of TCF4 expression reversed the effects of AFAP1‑AS1 knockdown or miR‑4695‑5p overexpression on OS cells. Taken together, these findings demonstrated that the AFAP1‑AS1/miR‑4695‑5p/TCF4‑β‑catenin axis played an important role in OS progression.
Article
Full-text available
Motivation: Accumulating evidences indicate that long non-coding RNAs (lncRNAs) play pivotal roles in various biological processes. Mutations and dysregulations of lncRNAs are implicated in miscellaneous human diseases. Predicting lncRNA-disease associations is beneficial to disease diagnosis as well as treatment. Although many computational methods have been developed, precisely identifying lncRNA-disease associations, especially for novel lncRNAs, remains challenging. Results: In this study, we propose a method (named SIMCLDA) for predicting potential lncRNA-disease associations based on inductive matrix completion. We compute Gaussian interaction profile kernel of lncRNAs from known lncRNA-disease interactions and functional similarity of diseases based on disease-gene and gene-gene onotology associations. Then, we extract primary feature vectors from Gaussian interaction profile kernel of lncRNAs and functional similarity of diseases by principal component analysis, respectively. For a new lncRNA, we calculate the interaction profile according to the interaction profiles of its neighbors. At last, we complete the association matrix based on the inductive matrix completion framework using the primary feature vectors from the constructed feature matrices. Computational results show that SIMCLDA can effectively predict lncRNA-disease associations with higher accuracy compared with previous methods. Furthermore, case studies show that SIMCLDA can effectively predict candidate lncRNAs for renal cancer, gastric cancer and prostate cancer. Availability: https://github.com//bioinfomaticsCSU/SIMCLDA. Contact: jxwang@mail.csu.edu.cn. Supplementary information: Supplementary data are available at Bioinformatics online.
Chapter
With the development and improvement of next-generation sequencing technology, a great number of noncoding RNAs have been discovered. Long noncoding RNAs (lncRNAs) are the largest class of noncoding RNAs and are more than 200 nucleotides in length. There is increasing evidence that lncRNAs play key roles in many biological processes. Therefore, the mutation and dysregulation of lncRNAs are closely associated with a number of complex human diseases. Identifying the most likely interactions between lncRNAs and diseases has become a fundamental challenge in human health. A common view is that lncRNAs with similar functions tend to be related to phenotypically similar diseases. In this chapter, we first introduce the concept of lncRNAs, their biological features, and available data resources. Further, recent computational approaches for identifying interactions between long noncoding RNAs and diseases are explored, including their advantages and disadvantages. The key issues and potential future work in predicting interactions between long noncoding RNAs and diseases are also discussed.
Article
Accumulating studies have indicated that long non-coding RNAs (lncRNAs) play crucial roles in a large number of biological processes. Predicting lncRNA-disease associations can help biologists understand the molecular mechanisms of human disease and can benefit disease diagnosis, treatment and prevention. In this paper, we introduce a computational framework based on graph autoencoder matrix completion (GAMCLDA) to identify lncRNA-disease associations. In our method, a graph convolutional network is used to encode the local graph structure and node features in order to learn latent factor vectors of lncRNAs and diseases. Further, the inner product of the lncRNA factor vector and the disease factor vector is used as the decoder to reconstruct the lncRNA-disease association matrix. In addition, a cost-sensitive neural network is used to deal with the imbalance between positive and negative samples. The experimental results show that GAMCLDA outperforms other state-of-the-art methods in prediction performance, as evaluated by AUC, AUPR, PPV and F1-score. Moreover, the case study shows that our method is an effective tool for predicting potential lncRNA-disease associations.
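The inner-product decoder described in this abstract can be sketched as follows. This is a minimal illustration of the decoding step only: it assumes the latent factor vectors have already been learned (the graph convolutional encoder is not shown), and the function name `decode` and the toy factor values are hypothetical.

```python
def decode(lnc_factors, dis_factors):
    """Reconstruct association scores as inner products of latent factors.

    lnc_factors: one latent vector per lncRNA.
    dis_factors: one latent vector per disease, of the same dimension.
    Returns a lncRNA-by-disease matrix of raw scores.
    """
    return [
        [sum(a * b for a, b in zip(u, v)) for v in dis_factors]
        for u in lnc_factors
    ]

# Hypothetical latent factors: 2 lncRNAs and 2 diseases in a 2-dimensional space.
lnc = [[1.0, 0.0], [0.5, 0.5]]
dis = [[1.0, 0.0], [0.0, 2.0]]
scores = decode(lnc, dis)
```

A higher score marks a lncRNA-disease pair whose latent vectors point in similar directions. Whether a squashing function such as a sigmoid is applied afterwards is not stated in the abstract and is left out here.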
Article
The dysregulation and mutation of long non-coding RNAs (lncRNAs) have been shown to result in a variety of human diseases. Identifying potential disease-related lncRNAs may benefit disease diagnosis, treatment and prognosis. A number of methods have been proposed to predict potential lncRNA-disease relationships. However, most of them may produce incorrect results because they rely on a single similarity measure. This article proposes a novel framework (ILDMSF) that fuses lncRNA similarities, measured from lncRNA-related genes and known lncRNA-disease interactions, with disease similarities, measured from disease semantic interactions and known lncRNA-disease interactions. Further, a support vector machine is employed to identify potential lncRNA-disease associations based on the integrated similarity. Leave-one-out cross validation is performed to compare ILDMSF with other state-of-the-art methods. The experimental results demonstrate that our method is promising for exploring potential correlations between lncRNAs and diseases.
Article
In existing recommender systems, matrix factorization (MF) is widely applied to model user preferences and item features by mapping the user-item ratings into a low-dimensional latent vector space. However, MF ignores individual diversity: a user's preference for different unrated items is usually different. A fixed representation of the user preference factor extracted by MF cannot model this individual diversity well, which leads to repeated and inaccurate recommendations. To this end, we propose a novel latent factor model called the adaptive deep latent factor model (ADLFM), which learns the preference factor of users adaptively in accordance with the specific items under consideration. We propose a novel user representation method that is derived from users' rated item descriptions instead of the original user-item ratings. Based on this, we further propose a deep neural network framework with an attention factor to learn the adaptive representations of users. Extensive experiments on Amazon data sets demonstrate that ADLFM substantially outperforms the state-of-the-art baselines. Further experiments show that the attention factor makes a significant contribution to our method.
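The plain matrix factorization that this abstract takes as its starting point can be illustrated with a short sketch. It is not ADLFM itself: it assumes a standard regularized factorization trained by stochastic gradient descent, and the function names, hyperparameters and toy ratings are all hypothetical.

```python
import random

def factorize(ratings, rank=2, steps=1000, lr=0.02, reg=0.01, seed=0):
    """Fit user and item latent factors to (user, item, rating) triples by SGD."""
    rng = random.Random(seed)
    n_users = 1 + max(u for u, _, _ in ratings)
    n_items = 1 + max(i for _, i, _ in ratings)
    # Small random initial factors; each row is one latent vector.
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - sum(p * q for p, q in zip(P[u], Q[i]))
            for k in range(rank):
                p_uk, q_ik = P[u][k], Q[i][k]
                # Gradient step on the squared error with L2 regularization.
                P[u][k] += lr * (err * q_ik - reg * p_uk)
                Q[i][k] += lr * (err * p_uk - reg * q_ik)
    return P, Q

def squared_error(ratings, P, Q):
    """Sum of squared differences between observed and predicted ratings."""
    return sum(
        (r - sum(p * q for p, q in zip(P[u], Q[i]))) ** 2 for u, i, r in ratings
    )

# Hypothetical toy ratings: (user index, item index, rating).
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P0, Q0 = factorize(toy_ratings, steps=0)  # untrained factors
P, Q = factorize(toy_ratings)             # trained factors
```

Each user ends up with a single fixed latent vector regardless of the item being scored, which is the limitation the abstract describes.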
Article
Gene regulatory networks (GRNs) play a key role in biological processes. However, GRNs differ under different biological conditions. Reconstructing GRNs from gene expression data has become an important opportunity and challenge in the past decades. Although there are many existing methods to infer the topology of GRNs, such as mutual information, random forest and partial least squares, their accuracy is still low due to the noise and high dimensionality of the expression data. In this paper, we introduce an ensemble method based on Multivariate Adaptive Regression Splines (MARS), called PBMarsNet, to reconstruct directed GRNs from multifactorial gene expression data. PBMarsNet incorporates part mutual information (PMI) to pre-weight the candidate regulatory genes and then uses MARS to detect the nonlinear regulatory links. Moreover, we apply bootstrapping to run MARS multiple times and average the outputs of the runs as the final score of each regulatory link. The results on the DREAM4 and DREAM5 challenge datasets show that PBMarsNet has superior performance and generalization compared with other state-of-the-art methods.
Article
Objective: The long non-coding RNA (lncRNA) H19, a maternally expressed imprinted gene, is involved in cancer susceptibility and disease progression. However, the association between H19 polymorphisms and osteosarcoma susceptibility has remained elusive. We designed this case-control study to explore the association between H19 polymorphisms and osteosarcoma risk. Patients and methods: In this study, we genotyped 4 tagger SNPs of the H19 gene in a case-control study including 193 osteosarcoma cases and 393 cancer-free controls. Results: For the main effect analysis, rs217727 (G>A) was associated with osteosarcoma risk (GA/GG: adjusted OR = 1.51, 95% CI: 1.06-2.17, p = 0.024; AA/GG: adjusted OR = 1.89, 95% CI: 1.23-2.91, p = 0.004; additive model: adjusted OR = 1.35, 95% CI: 1.01-1.80, p = 0.043). Conclusions: This finding indicates that the rs217727 polymorphism may play a role in genetic susceptibility to osteosarcoma, which may improve our understanding of the potential contribution of H19 SNPs to cancer pathogenesis.