Single-cell RNA-seq Imputation using Generative Adversarial Networks

Yungang Xu1,†, Zhigang Zhang2,6,†, Lei You1, Jiajia Liu1,4, Zhiwei Fan1,5, Xiaobo Zhou1,3,*
1 Center for Computational Systems Medicine, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX 77030, USA
2 School of Information Management and Statistics, Hubei University of Economics, Wuhan, Hubei 430205, China
3 Department of Pediatric Surgery, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
4 School of Electronics and Information, Tongji University, Shanghai, Shanghai 201804, China
5 West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Chengdu 610040, China
6 Hubei Center for Data and Analysis, Hubei University of Economics, Wuhan, Hubei, 430205, China
† These authors contributed equally to this work.
*Correspondence: xiaobo.zhou@uth.tmc.edu
Email addresses:
YX: yungangx.xu@uth.tmc.edu
ZZ: zzg@hbue.edu.cn
LY: lei.you@uth.tmc.edu
JL: jiajia.liu@uth.tmc.edu
ZF: zhiwei.fan@uth.tmc.edu
XZ: xiaobo.zhou@uth.tmc.edu
bioRxiv preprint first posted online Jan. 21, 2020; doi: http://dx.doi.org/10.1101/2020.01.20.913384. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Abstract
Single-cell RNA-seq (scRNA-seq) enables the characterization of transcriptomic profiles at single-cell resolution with increasingly high throughput. However, it suffers from many sources of technical noise, including insufficient mRNA molecules that lead to excess false zero values, often termed dropouts. Computational approaches have been proposed to recover the biologically meaningful expression by borrowing information from similar cells in the observed dataset. However, these methods suffer from oversmoothing and remove natural cell-to-cell stochasticity in gene expression. Here, we propose generative adversarial networks for scRNA-seq imputation (scIGANs), which uses generated realistic cells rather than observed cells to avoid these limitations and the lack of power for rare cells. Evaluations based on a variety of simulated and real scRNA-seq datasets demonstrate that scIGANs is effective for dropout imputation and enhances various downstream analyses. ScIGANs is also scalable and robust to small datasets that have few genes with low expression and/or cell-to-cell variance.
Introduction
Single-cell RNA-seq (scRNA-seq) has revolutionized the traditional profiling of gene expression, making it possible to fully characterize the transcriptomes of individual cells at unprecedented throughput. A major problem for scRNA-seq is the sparsity of the expression matrix, which contains a tremendous number of zero values. Most of these zero or near-zero values are artifacts caused by technical defects, including but not limited to low capture rate, insufficient sequencing depth, or other technological factors, such that the observed zero does not reflect the underlying true expression level; this is called dropout [1]. A pressing need in scRNA-seq data analysis remains identifying and handling the dropout events that otherwise severely hinder downstream analysis and attenuate the power of scRNA-seq in a wide range of biological and biomedical applications. Therefore, applying computational approaches to address missingness and noise is important and timely, particularly considering the increasing popularity and volume of scRNA-seq data.
Several methods have recently been proposed to address the challenges resulting from excess zero values in scRNA-seq. MAGIC [2] imputes missing expression values by sharing information across similar cells, based on the idea of heat diffusion. ScImpute [3] learns each gene’s dropout probability in each cell and then imputes the dropout values by borrowing information from other similar cells, selected based on the genes unlikely to be affected by dropout events. SAVER [4] borrows information across genes using a Bayesian approach to estimate unobserved true expression levels of genes. DrImpute [5] imputes dropouts by simply averaging the expression values of similar cells defined by clustering. VIPER [6] borrows information from a sparse set of local neighborhood cells with similar expression patterns to impute the expression measurements in the cells of interest, based on nonnegative sparse regression models. Meanwhile, some other methods pursue the same goal by denoising the scRNA-seq data. DCA [7] uses a deep count autoencoder network to denoise scRNA-seq datasets by learning the count distribution, overdispersion, and sparsity of the data. ENHANCE [8] recovers denoised expression values based on principal component analysis of raw scRNA-seq data. During the preparation of this manuscript, we also noticed another imputation method, DeepImpute [9], which uses a deep neural network with dropout layers and loss functions to learn patterns in the data, allowing for scRNA-seq imputation.
While existing studies have adopted varying approaches for dropout imputation and yielded promising results, they either borrow information from similar cells or aggregate (co-expressed or similar) genes of the observed data, which leads to oversmoothing (e.g. MAGIC) and removes natural cell-to-cell stochasticity in gene expression (e.g. scImpute). Moreover, the imputation performance is significantly reduced for rare cells, which carry limited information and are common in many scRNA-seq studies. Alternatively, SCRABBLE [10] attempts to leverage bulk data as a constraint on matrix regularization to impute dropout events. However, most scRNA-seq studies lack matched bulk RNA-seq data, which limits its practicality. Additionally, because distinguishing true from false zero counts is non-trivial, imputation and denoising need to account for both intra-cell-type dependence and inter-cell-type specificity. In view of the above concerns, a deep generative model would be a better choice: it learns the true data distribution and then generates new data points with some variation, which are then independently used to impute the missing values and avoid overfitting.
Deep generative models have been widely used for missing value imputation in fields other than scRNA-seq [11-13]. Although a deep generative model has been used for scRNA-seq analysis [14], it is not explicitly designed for dropout imputation. Among deep generative models, generative adversarial networks (GANs) have evoked increasing interest in the computer vision community since their first introduction in 2014 [15]. GANs have become an active area of research with multiple variants developed [16-20] and hold promise for data imputation [21] because of their capability of learning and mimicking any distribution of data. Given the great success of GANs in inpainting, we hypothesize that similar deep neural net architectures could be used to impute dropouts in scRNA-seq data.
In this study, we propose a GANs framework for scRNA-seq imputation (scIGANs). Inspired by the established applications of GANs in inpainting, we convert the expression profile of each individual cell to an image, wherein the pixels are represented by the normalized gene expression. Dropout imputation then becomes the process of inpainting an image by recovering the missing pieces that represent the dropout events. Because of the inherent advantages of GANs, scIGANs does not impose an assumption of specific statistical distributions for gene expression levels and dropout probabilities. It also does not force the imputation of genes that are not affected by dropout events. Moreover, scIGANs generates a set of realistic single cells instead of directly borrowing information from observed cells to impute the dropout events, which can avoid overfitting to cell types with large populations while retaining enough imputation power for rare cells. Using a variety of simulated and real datasets, we extensively compare scIGANs with nine other state-of-the-art, representative methods and demonstrate its superior performance in recovering biologically meaningful expression, identifying subcellular states of the same cell types, and improving differential expression and temporal dynamics analysis. ScIGANs is also robust and scalable to datasets that have a small number of genes with low expression and cell-to-cell variance.
Results
1. The scIGANs approach
Generative adversarial networks (GANs), first introduced in 2014 [15], have evoked much interest in the computer vision community and have become an active area of research with multiple variants developed [16-20]. Inspired by their excellent performance in generating realistic images [22-26] and their recent application to generating realistic scRNA-seq data [27, 28], we propose scIGANs, the generative adversarial networks for scRNA-seq imputation (Figure 1, Methods). The basic idea is that scIGANs can learn the non-linear gene-gene dependencies from complex, multi-cell type samples and train a generative model to generate realistic expression profiles of defined cell types [27, 28]. To train scIGANs, the real single-cell expression profiles are first reshaped to images and fed to GANs, wherein each cell corresponds to an image with the normalized gene expression representing the pixels (Figures 1 and S1A, Methods). The generator generates fake images by transforming a 100-dimensional latent variable into single-cell gene expression profiles (Figure S1A). The discriminator evaluates whether the images are authentic or generated. These two networks are trained concurrently whilst competing against one another to improve the performance of both (Figure 1).
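The reshaping step described above can be illustrated with a minimal Python sketch. It is not the scIGANs implementation: the function names `cells_to_images` and `sample_latent` are hypothetical, and the per-cell maximum scaling, the zero-padding to the next perfect square, and the Gaussian latent variable are assumptions made only for illustration. The 100-dimensional latent size is the one stated in the text.

```python
import math
import numpy as np

def cells_to_images(expr):
    """Reshape a cells x genes matrix into one square image per cell.

    Assumptions (not taken from the paper): each cell is scaled by its own
    maximum so that pixels lie in [0, 1], and the gene vector is zero-padded
    up to the next perfect square before reshaping.
    """
    expr = np.asarray(expr, dtype=float)
    n_cells, n_genes = expr.shape
    side = math.isqrt(n_genes - 1) + 1          # ceil(sqrt(n_genes)) for n_genes >= 1
    cell_max = expr.max(axis=1, keepdims=True)
    cell_max[cell_max == 0] = 1.0               # keep all-zero cells at zero
    padded = np.zeros((n_cells, side * side))
    padded[:, :n_genes] = expr / cell_max       # one pixel per gene
    return padded.reshape(n_cells, side, side)

def sample_latent(n_cells, dim=100, seed=0):
    """Draw latent vectors for a generator (Gaussian noise is an assumption)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_cells, dim))

counts = np.array([[0, 2, 4, 0, 8],
                   [1, 0, 0, 0, 0]])
images = cells_to_images(counts)   # two 3 x 3 images, 5 genes + 4 padding pixels
latent = sample_latent(4)          # four 100-dimensional latent vectors
```

With 1024 genes, as in the small gene sets used later in the paper, this layout would give a 32 x 32 image with no padding.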
Once trained, the generative model is used to generate scRNA-seq data of defined cell types. We then propose to infer the true expression of dropouts from the generated realistic cells. The most important benefit of using generated cells instead of real cells for scRNA-seq imputation is that it avoids both overfitting to cell types with large populations and insufficient power for rare cells. The generator can produce any number of cells whose expression profiles faithfully characterize the desired cell type; the k-nearest neighbors (KNN) approach is then used to impute the dropouts of the same cell type in the real scRNA-seq data (Figure S1B, Methods). scIGANs is implemented in Python and R and compiled as a command-line tool compatible with both CPU and GPU platforms. The core model is built on the PyTorch framework and adapted to accommodate scRNA-seq data as input. It is publicly available at https://github.com/xuyungang/scIGANs.
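A minimal sketch of KNN-based imputation from generated cells is given below. The details are assumptions rather than the published procedure (which is described in the Methods): the hypothetical function `knn_impute` uses Euclidean distance, treats every zero as a candidate dropout, and fills it with the mean of the k nearest generated cells of the same cell type.

```python
import numpy as np

def knn_impute(real, generated, real_labels, gen_labels, k=5):
    """Fill zero entries of real cells from their k nearest generated cells.

    Assumptions for illustration only: Euclidean distance, all zeros are
    treated as dropouts, and the neighbour mean is used as the imputed value.
    """
    real = np.asarray(real, dtype=float)
    generated = np.asarray(generated, dtype=float)
    real_labels = np.asarray(real_labels)
    gen_labels = np.asarray(gen_labels)
    imputed = real.copy()
    for i, cell in enumerate(real):
        pool = generated[gen_labels == real_labels[i]]   # same cell type only
        if len(pool) == 0:
            continue                                     # nothing to borrow from
        dist = np.linalg.norm(pool - cell, axis=1)
        nearest = pool[np.argsort(dist)[:k]]
        zeros = cell == 0
        imputed[i, zeros] = nearest[:, zeros].mean(axis=0)
    return imputed

real = np.array([[0.0, 4.0, 2.0]])
generated = np.array([[3.0, 4.0, 2.0],
                      [5.0, 4.0, 2.0],
                      [100.0, 100.0, 100.0],
                      [9.0, 9.0, 9.0]])
result = knn_impute(real, generated, real_labels=[0], gen_labels=[0, 0, 0, 1], k=2)
```

In this toy example only the first gene is zero, so only that entry changes; observed non-zero values are left untouched.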
2. ScIGANs recovers single-cell gene expression from dropouts without introducing extra noise
Recovery of biologically meaningful expression from dropout events is the major goal of scRNA-seq imputation, to benefit downstream analyses and biological discoveries. We use both simulated and real scRNA-seq datasets to illustrate the performance and robustness of scIGANs in rescuing dropouts and avoiding additional noise from imputation.
First, simulated datasets are used to evaluate the imputation performance since they have a known 'truth' and can thus benchmark different methods. On a single dataset with a 52.8% zero rate, simulated following the independent single-cell clustering method CIDR [29] (Methods), scIGANs outperformed all nine other methods in recovering the gene expression and cell population clusters (Figures 2A and S2A; Table S1). Although GANs is a supervised model that requires pre-defined cell labels, we implemented scIGANs to accommodate scRNA-seq data without prior labels by instead learning the labels through spectral clustering [30] on the input data. On the same simulated data, scIGANs trained without labels (scIGANs w/o) performed slightly worse but remained superior to eight of the nine compared methods, the exception being scImpute [3] (Figures 2A and S2A; Table S1).
Second, we test the performance of scIGANs and other peer methods on datasets with different dropout rates simulated by Splatter [31] (Methods). scIGANs ranks at the top in rescuing the population clusters (Figures S2B-D) and has the highest resistance to increasing dropout rates (Figure S2E; Table S2). Moreover, to evaluate the robustness of imputation methods, we used the same simulation strategy described for SCRABBLE [32] to repeat the above Splatter simulation 100 times for each dropout rate. We evaluated the performance by multiple quantitative clustering metrics (Table S3). The second-ranked SCRABBLE outperformed all other peer methods; however, its concordance among simulated replicates worsens at higher dropout rates (Figure 2B). In contrast, scIGANs ranks at the top among all methods and has the most robust performance among the replicates across increasing dropout rates (Figures 2B and S3A-F; Table S3).
Third, we evaluate the imputation methods using real scRNA-seq data from the human brain, which contains 420 cells in eight well-defined cell types after we excluded uncertain hybrid cells [33] (Methods). The raw data does not show clear clustering of all cell types because of dropouts and technical noise. After imputation, scIGANs enhanced the cell type clusters to the maximum extent so that all eight cell types could be separated and identified (Figure 2C). Quantitative evaluations of the clustering following different imputation methods highlighted the superiority of scIGANs over the others, even when trained without the prior cell labels (Figures 2D and S3G; Table S4).
Last, we test another important yet difficult-to-quantify aspect of robustness, i.e. the extent to which an imputation method avoids introducing extra noise by, for example, mistakenly imputing biological “zeros” or over-imputing. None of the existing imputation methods evaluated their robustness in avoiding extra noise using real scRNA-seq data. Spike-in RNA (e.g. the ERCC spike-in developed by the External RNA Controls Consortium) is a common set of external RNA controls added in equal amounts to an RNA analysis experiment after sample isolation. It is widely used in scRNA-seq experiments to separate confounding technical noise from biological variance. Because the spike-in RNAs are added to samples in identical amounts to capture the technical noise, the readout for the spike-in RNAs should be free of cell-to-cell variability, and any detected variance in expression should come only from technical confounders rather than biological contexts (e.g. cell types). Therefore, the expression of spike-in RNAs added to individual cells should not be able to cluster these cells into different subgroups by cell type or other biological state. Here we use the ERCC spike-in read counts from a real scRNA-seq study [34] to evaluate how well the imputation methods remove technical variance without introducing extra noise (Methods). The 92 ERCC RNAs were added to 288 single-cell libraries comprising three sets of 96 cells in different cell-cycle states. The raw counts failed to group these cells into one cluster because of dropouts in the spike-in RNAs (Figure 2E). We expected that imputation would recover the artificial zeros without exposing the cell states to the spike-in profiles, so that all cells would have the same spike-in profile and be clustered into a single group. ScIGANs successfully recovers the spike-in profiles with minimum cell-to-cell variability and clusters all cells closely into one group, even though it was trained with supervisory cell labels (Figures 2E and S3H-I). In contrast, other imputation methods introduce extra noise and thus make clustering even worse (Figures S3H-I; Table S5). Altogether, scIGANs performs superiorly in imputing dropouts and avoiding extra noise.
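The expectation that spike-in readouts should show little cell-to-cell variability can be checked numerically. The sketch below uses the per-spike-in coefficient of variation across cells as a simple summary; this is an assumed metric chosen for illustration (the evaluation reported here is clustering-based), the function name `spikein_cv` is hypothetical, and the matrices are toy values rather than ERCC data.

```python
import numpy as np

def spikein_cv(spike):
    """Coefficient of variation of each spike-in across cells.

    `spike` is a cells x spike-ins matrix. Spike-ins with zero mean get NaN.
    This summary is an illustrative assumption, not the paper's metric.
    """
    spike = np.asarray(spike, dtype=float)
    mean = spike.mean(axis=0)
    std = spike.std(axis=0)
    cv = np.full(mean.shape, np.nan)
    np.divide(std, mean, out=cv, where=mean > 0)   # leave NaN where mean is 0
    return cv

# Toy example: dropouts in the raw counts inflate variability across cells.
raw = np.array([[10, 0], [0, 4], [10, 4], [0, 0]])
imputed = np.array([[10, 4], [9, 4], [10, 4], [11, 4]])
cv_raw = spikein_cv(raw)
cv_imputed = spikein_cv(imputed)
```

A lower coefficient of variation after imputation would be consistent with recovered dropouts, whereas a higher one would suggest that the method added noise.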
3. ScIGANs enables the identification of cellular states of the same cell type
Single-cell RNA-seq is typically used to identify different cell types from heterogeneous tissues or cell populations. However, cell populations that seem homogeneous in terms of expression of cell surface markers comprise many different cellular states, reflecting differences in cellular functions, developmental stages, cell-cycle phases, and adjacent microenvironments, and hide cell-to-cell variability that can have significant effects on cell function [35, 36]. Therefore, many biological questions require deeper investigation beyond cell types toward the underlying cellular states, such as cell-cycle phases of the same cell type. It has been reported that the cell cycle contributes to phenotypic and functional cell heterogeneity even in monoclonal cell lines [37-39]. However, identifying the different cell-cycle phases of the same cell type from scRNA-seq data is more challenging because of the prevalence of dropouts and high technical variance, which was recently reported to contribute more than the cell cycle to single-cell transcriptomic variability [38]. We therefore test how imputation could benefit the identification of cell-cycle variability in scRNA-seq studies.
First, we reanalyze scRNA-seq data from mouse embryonic stem cells (mESC) that were sorted for the G1, S and G2M phases of the cell cycle (Methods) [34]. Because of dropouts and other technical noise, the raw data does not show cluster structures corresponding to the three cell-cycle phases (Figure 3A) and has the poorest clustering measurements (Figure S4A). All other imputation methods fail to recover the cluster structure of the cell-cycle states (Figures 3A and S4A). Only scIGANs shows significant improvement in detecting cell-cycle states, with the best performance (Figures 3A and S4A). Using a collection of independently predefined cell-cycle marker genes (Methods), scIGANs improves the identification of the cell-cycle states more than all other methods, with most of the sorted cells correctly assigned in the cell-cycle phase space (Figures 3B and S4B).
Second, we assess the performance of different imputation methods in pinpointing cell-cycle dynamics using a large scRNA-seq dataset of about 6.8k mouse ESCs (Methods) [40]. The previous work confirmed that ES cells lack strong cell-cycle oscillations in mRNA abundance but do show evidence of limited G2/M phase-specific transcription [40]. Imputation by scIGANs significantly enhanced the cell-cycle oscillations, in particular revealing a more obvious G2/M phase-specific transcription (Figures 3C and S4C-L). Together, these results demonstrate that scIGANs performs better than all other methods in recovering and capturing the cellular states and very subtle cell-cycle oscillations among single cells.
4. ScIGANs improves the differential expression analysis
Differential expression analysis refers broadly to the task of identifying genes whose expression levels depend on some variable, such as cell type or state. Ultimately, most single-cell studies start with identifying cell populations and characterizing the genes that determine cell types and distinguish them from one another. Using scRNA-seq data [41] that have matched bulk RNA-seq data, we compare the performance of different imputation methods in improving the identification of differentially expressed genes (DEGs). This dataset has six samples of bulk RNA-seq (four for H1 ESC and two for definitive endoderm cells, DEC) and 350 samples of scRNA-seq (212 for H1 ESC and 138 for DEC) (Methods). DESeq2 [42] is used to identify DEGs between the H1 and DEC cells for both bulk and single-cell RNA-seq data (Methods). The raw scRNA-seq data has a much higher zero expression rate than bulk RNA-seq (49.1% vs 14.8%) and shares the fewest DEGs with the bulk samples (Figure 4A). After imputation, the number of DEGs increases toward the number of DEGs in the bulk samples (except for the two other neural network-based methods, DCA [7] and DeepImpute [9], which give fewer DEGs than the raw data). scIGANs imputation identifies the highest number of dataset-specific DEGs and shares a significant number of DEGs with bulk RNA-seq (Figure 4A). Using the top 1000 DEGs from the bulk samples (500 up-regulated and 500 down-regulated genes) as a benchmark, scIGANs-imputed scRNA-seq data show the highest correspondence with bulk RNA-seq (Figures 4B and S5).
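Two of the quantities used in this comparison, the zero expression rate and the agreement with a bulk benchmark gene set, are simple to compute. The sketch below is a hedged illustration: `zero_rate` and `benchmark_overlap` are hypothetical helper names, the overlap fraction is only one possible reading of "correspondence" (the exact measure is not specified in this section), and the gene lists are toy inputs whose up/down assignment is arbitrary.

```python
import numpy as np

def zero_rate(matrix):
    """Fraction of entries in an expression matrix that are exactly zero."""
    matrix = np.asarray(matrix)
    return float((matrix == 0).mean())

def benchmark_overlap(sc_degs, bulk_up, bulk_down):
    """Fraction of bulk benchmark DEGs that are also called in single-cell data.

    An assumed, simplified measure of agreement with the bulk benchmark.
    """
    benchmark = set(bulk_up) | set(bulk_down)
    if not benchmark:
        return 0.0
    return len(benchmark & set(sc_degs)) / len(benchmark)

toy_counts = [[0, 1], [2, 0]]
rate = zero_rate(toy_counts)

# Toy gene lists; the assignment to up/down sets here is arbitrary.
overlap = benchmark_overlap(
    sc_degs=["NANOG", "CER1", "GATA6"],
    bulk_up=["CER1", "HNF1B"],
    bulk_down=["NANOG", "POU5F1"],
)
```

In practice the DEG lists would come from a differential expression tool such as DESeq2, applied separately to the bulk data and to each imputed single-cell matrix.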
Moreover, the expression of five marker genes each for H1 and DEC was examined to compare the extent to which imputation could recover the expression patterns of signature genes. The results show that scIGANs best reflects the expression signatures of both H1 and DEC cells by removing undesirable variation resulting from dropouts (Figures 3C and S6). Projection of cells to the UMAP space overlaid with the expression of signature genes further highlights the performance of scIGANs in recovering the expression patterns of signature genes (Figures 3D-E and S7). In summary, scIGANs improves the identification of DEGs from scRNA-seq data and outperforms the other methods in doing so.
5. ScIGANs enhances the inference of cellular trajectory
Beyond characterizing cells by type, scRNA-seq also greatly benefits organizing cells by temporal or developmental stage, i.e. cellular trajectory. In general, trajectory analysis starts by reducing the dimensionality of the expression data, then reconstructs a trajectory along which the cells are presumed to travel, and finally projects each cell onto this trajectory at the proper position. Although single-cell experiments can illuminate trajectories in a wide variety of biological settings [43-46], none of the single-cell trajectory inference methods account for dropout events. We hypothesized that inferring the cellular trajectory from scRNA-seq data after imputation could improve the accuracy of pseudotime ordering. We use time-course scRNA-seq data derived from the differentiation of H1 ESCs to definitive endoderm cells (DEC) [41]. A total of 158 cells were profiled at 0, 12, 24, 36, 72, and 96 hours after inducing differentiation from H1 ESCs (Figure 5A). We apply scIGANs and all nine other imputation methods to the raw scRNA-seq data with known time points and then reconstruct the trajectories. Imputation by scIGANs produces the highest correspondence between the inferred pseudotime and the real time course (Figures 5B-C and S8), suggesting that scIGANs recovers more accurate transcriptome dynamics along the time course. We also study the signature genes of pluripotency (e.g. NANOG and POU5F1) and DECs (e.g. CER1 and HNF1B) and find that scIGANs improves the temporal dynamics of gene expression after imputation (Figures 5D-E) and performs better than other imputation methods (Figure S8). These results demonstrate that scIGANs can help improve single-cell trajectory analysis and recover the temporal dynamics of gene expression.
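The correspondence between inferred pseudotime and the real time course can be quantified with a rank correlation. The sketch below uses Spearman's correlation with average ranks for ties; this choice is an assumption, since the measure behind the reported correspondence is not named in this section. The helper names `average_ranks` and `spearman` and the toy values are hypothetical.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tie = values == v
        ranks[tie] = ranks[tie].mean()   # cells from one time point share a rank
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

# Toy example: two cells per collection time and their inferred pseudotime.
hours = [0, 0, 12, 12, 24, 24]
pseudotime = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]
rho = spearman(hours, pseudotime)
```

Average ranks matter here because many cells share the same collection time, so the real time course contains many ties.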
6. ScIGANs is robust to small datasets of few genes with low expression or cell-to-cell variance
In general, other imputation methods (e.g. SAVER [4] and scImpute [3]) rely heavily on a set of pre-selected informative genes that are highly expressed and unlikely to suffer from dropout. Imputation is then performed from the most similar cells defined by these informative genes. In contrast, scIGANs automatically learns the gene-gene and cell-cell dependencies from the whole dataset. More importantly, scIGANs converts each single-cell expression profile to an image, so that a 1-dimensional “feature” vector is reshaped to a 2-dimensional matrix with each element representing the expression of a single gene (Figure S1A). As in image processing, scIGANs is then trained by convolution on the matrix so that the 2-dimensional gene-gene relations within each individual cell are captured. Therefore, we hypothesize that scIGANs is more robust to genes with low expression or low cell-to-cell variance.
From the aforementioned scRNA-seq data with 350 cells (212 H1 ESC and 138 DEC) [41], we randomly sample small sets of genes (n=1024 each) from the 5000-gene sets with, respectively, the highest or lowest means or variances, as well as a set of 1024 genes sampled randomly from all expressed genes (refer to Methods for details). When visualized using only the 1024 genes with very low expression or variance, the two cell types are almost completely mixed, with no cluster structure in the raw expression profiles (Figures 6A and 4D). Imputation by scIGANs successfully recovered the two cell clusters for both datasets, with only 1024 genes of low expression and low variance, respectively (Figure 6B). In contrast, all other methods failed to identify the two cell types from these datasets (Figure S9). Moreover, scIGANs significantly changes the mean and variance of expression after imputation, which is not always the case for other methods (Figures 6C-D and S9). All these results show that scIGANs is robust to a small dataset of genes with very low expression or cell-to-cell variance, which are less informative for other imputation methods. This strongly supports the expectation that scIGANs can learn very limited gene-gene and cell-cell dependencies from a small set of lowly or close-to-uniformly expressed genes.
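The gene subsampling described above can be sketched as follows. This is an illustrative reconstruction, not the procedure from the Methods: the function name `sample_gene_subset`, its parameters, and the fixed random seed are assumptions. Only the pool size of 5000 and the subset size of 1024 come from the text.

```python
import numpy as np

def sample_gene_subset(expr, by="mean", end="low", pool_size=5000,
                       n_genes=1024, seed=0):
    """Sample gene indices from the genes with the lowest or highest mean/variance.

    `expr` is a cells x genes matrix. The pool holds the `pool_size` genes at
    the chosen end of the ranking; `n_genes` of them are drawn without
    replacement. Interface and defaults are illustrative assumptions.
    """
    if by not in ("mean", "variance") or end not in ("low", "high"):
        raise ValueError("by must be 'mean' or 'variance'; end must be 'low' or 'high'")
    expr = np.asarray(expr, dtype=float)
    stat = expr.mean(axis=0) if by == "mean" else expr.var(axis=0)
    order = np.argsort(stat)                      # ascending by the statistic
    pool = order[:pool_size] if end == "low" else order[-pool_size:]
    rng = np.random.default_rng(seed)
    size = min(n_genes, len(pool))
    return np.sort(rng.choice(pool, size=size, replace=False))

# Toy matrix: 10 cells x 50 genes, where gene j has constant expression j.
toy = np.tile(np.arange(50.0), (10, 1))
low_genes = sample_gene_subset(toy, by="mean", end="low", pool_size=10, n_genes=4)
high_genes = sample_gene_subset(toy, by="mean", end="high", pool_size=10, n_genes=4)
```

A subset of 1024 genes is convenient for the image representation because it fills a 32 x 32 matrix exactly.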
Discussion
Here we propose the generative adversarial networks for scRNA-seq imputation (scIGANs). ScIGANs converts the expression profiles of individual cells to images and feeds them to generative adversarial networks. The trained generative network produces expression profiles representing realistic cells of defined types. The generated cells, rather than the observed cells, are then used to impute the dropouts of the real cells. We assess the performance of scIGANs on the recovery of gene expression and on various downstream applications using simulated and real scRNA-seq datasets. We provide compelling evidence that scIGANs outperforms the nine other peer imputation methods. Most importantly, by using generated rather than observed cells, scIGANs avoids overfitting to cell types with large populations while retaining enough imputation power for rare cells.
While there are many methods for scRNA-seq imputation, we specifically show how GANs can improve imputation and downstream applications, representing one of three pioneering applications of GANs to genomic data. Two other recent manuscripts used GANs to simulate (generate) realistic scRNA-seq data, with the applications of either integrating multiple scRNA-seq datasets [19] or augmenting sparse and underrepresented cell populations in scRNA-seq data [27, 28]. We, for the first time, advance the application of GANs to scRNA-seq dropout imputation. Inspired by the great success of GANs in inpainting and by a highly relevant work that applied GANs to the ‘realistic’ generation of scRNA-seq data [27, 28], we speculated that the generated realistic cells can not only augment the observed dataset but also benefit dropout imputation, since the generated data were shown to mimic the distribution of the real data in their original space with stable fidelity [27, 28]. Our multiple downstream assessments and applications on simulated and real scRNA-seq datasets demonstrated the advantage of scIGANs in dropout imputation over other peer methods. Especially for cells from very small populations, generated data were shown to faithfully augment sparse cell populations [27, 28], which reduces the sampling bias and improves the imputation power; all other imputation methods suffer from these limitations. Additionally, GANs are able to learn dependencies between genes beyond pairwise correlations [27, 28], which makes scIGANs more sensitive and robust to small datasets with very low or uniform expression. We demonstrated these advantages using ERCC spike-in RNAs (Figures 2E and S3H-I) and downsampled real scRNA-seq data (Figures 6 and S9).
The underlying basis of scIGANs is that real scRNA-seq data are derived from sampling, which does not provide enough cells to characterize the true expression profile of each cell type, even for the major cell types of a population; generated realistic cells can augment the observations, especially for sparse and underrepresented cell populations, and thus improve the dropout imputation of scRNA-seq data. There are several benefits of using generated realistic cells rather than the observed cells for imputation. First, the generated cells characterize the expression profiles of real cells and faithfully represent the cell heterogeneity. Therefore, the realistic cells are ideal to serve as extra samples and independently impute the observed dropouts, avoiding the “circular logic” issue (overfitting) suffered by other methods (e.g. scImpute) that borrow information from the observed data per se. Second, the realistic cells augment the rare cell types and thus overcome potential sampling biases in downstream analyses. Additionally, benefitting from the power of GANs in adversarially discriminating between real and realistic data, and from the augmentation by generated data, scIGANs is more sensitive to subcellular states such as the cell-cycle phases investigated in this study. Imputation by scIGANs enables the investigation of scRNA-seq data beyond the identification and characterization of cell types, going deeper into subcellular states and capturing the cell-to-cell variability of homogeneous cell populations. This is critical for applications of scRNA-seq that pinpoint state transitions along a cellular trajectory or identify and remove subcellular confounding factors (e.g. cell-cycle phases) [38]. Our evaluations on cell-cycle phase detection and trajectory construction show the superiority of scIGANs over all other nine tested methods.
In summary, scIGANs is a method that takes advantage of both gene-to-gene and cell-to-cell relationships to recover the true expression level of each gene in each cell, removing technical variation without compromising biological variability across cells. ScIGANs is also compatible with other single-cell analysis methods since it does not change the dimensions (i.e., the number of genes and cells) of the input data and it effectively recovers the dropouts without affecting the non-dropout expression values. Additionally, scIGANs is scalable and robust to small datasets that have few genes with low expression and/or cell-to-cell variance.
Methods
Generative adversarial networks and improved Wasserstein GANs
We here show that generative adversarial networks (GANs) can be applied to scRNA-seq imputation. The GANs training strategy is to define a game between two competing networks. The generator network maps a source of noise to the input space. The discriminator network receives either a generated sample or a true data sample and must distinguish between the two. The generator is trained to fool the discriminator. Formally, the game between the generator $G$ and the discriminator $D$ is the minimax objective

$$\min_G \max_D \; \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim P_g}[\log(1 - D(\tilde{x}))],$$

where $D$ is the discriminator, which can be any network; $P_r$ is the real data distribution and $P_g$ is the model distribution implicitly defined by $\tilde{x} = G(z)$, $z \sim p(z)$; $G$ is the generator, which can be any network; and $z$ can be sampled from any noise distribution $p$, such as the uniform distribution or a spherical Gaussian distribution.
It is difficult to train the original GANs model, since minimizing the objective function corresponds to minimizing the Jensen-Shannon divergence between $P_r$ and $P_g$, which is not continuous with respect to the generator’s parameters. The Earth-Mover (Wasserstein-1) distance $W(P_r, P_g)$ is used to deal with this difficulty [47]. Such a model is called Wasserstein GANs (WGANs), for which the objective function is constructed as

$$\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})],$$

where $\mathcal{D}$ is the set of 1-Lipschitz functions and the other symbols are defined as in the original GANs model.
To enforce the Lipschitz constraint on the critic, one can clip the weights of the critic to lie within a compact space $[-c, c]$. The set of functions satisfying this constraint is a subset of the $k$-Lipschitz functions for some $k$ which depends on $c$ and the critic architecture. An alternative way to enforce the Lipschitz constraint, usually called improved WGANs (IWGANs), is widely used in training GANs models [48]. The objective is

$$\min_G \max_D \; \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\big[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2\big],$$

where $\hat{x}$ is sampled from the straight lines between pairs of points sampled from the real data distribution and the generator distribution.
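$\lambda$ is a predefined parameter. BEGAN [49] is an equilibrium-enforcing method paired with a loss derived from the Wasserstein distance [20] for training auto-encoder based generative adversarial networks. The BEGAN objective is:

$$\begin{cases} \mathcal{L}_D = \mathcal{L}(x) - k_t \, \mathcal{L}(G(z_D)) & \text{for } \theta_D \\ \mathcal{L}_G = \mathcal{L}(G(z_G)) & \text{for } \theta_G \\ k_{t+1} = k_t + \lambda_k \big(\gamma \mathcal{L}(x) - \mathcal{L}(G(z_G))\big) & \text{for each training step } t \end{cases}$$

where $\mathcal{L}(v) = |v - D(v)|^{\eta}$, $D: \mathbb{R}^{N_x} \to \mathbb{R}^{N_x}$ is the autoencoder function, $\eta \in \{1, 2\}$ is the target norm, and $v \in \mathbb{R}^{N_x}$ is a sample of dimension $N_x$. In this paper, we use this method to train our scIGANs.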
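As an illustration of the BEGAN update rules, the following minimal Python sketch computes the two losses and the equilibrium term $k_t$ from scalar reconstruction losses. It is not the scIGANs training code: the function names, the default values of `gamma` and `lambda_k`, and the clamping of $k_t$ to [0, 1] are assumptions made for this example.

```python
def recon_loss(v, v_rec, eta=1):
    """L(v) = |v - D(v)|^eta, summed over the elements of one sample."""
    return sum(abs(a - b) ** eta for a, b in zip(v, v_rec))


def began_losses(loss_real, loss_fake, k_t):
    """Return (L_D, L_G) given the reconstruction losses L(x) and L(G(z))."""
    loss_d = loss_real - k_t * loss_fake  # discriminator (autoencoder) loss
    loss_g = loss_fake                    # generator loss
    return loss_d, loss_g


def update_k(k_t, loss_real, loss_fake, gamma=0.5, lambda_k=0.001):
    """k_{t+1} = k_t + lambda_k * (gamma * L(x) - L(G(z))).

    gamma, lambda_k and the clamp to [0, 1] are assumed settings.
    """
    k_next = k_t + lambda_k * (gamma * loss_real - loss_fake)
    return min(max(k_next, 0.0), 1.0)
```

In a training loop, `update_k` would be called once per step after both losses have been evaluated, so that the balance between real and generated reconstruction losses drifts toward the ratio set by `gamma`.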
The scIGANs
Although scIGANs is designed to scale to datasets with any number of cell types and genes, we here take a dataset with 9 cell types and 32*32=1024 genes as an example to elucidate how it works. The generator network of scIGANs is defined as $G(z, L_y; \theta_G)$. The inputs of the generator are $z \sim \mathrm{norm}(0,1)$ and the label $L_y \in [1,9]$ (Supplementary Figure S1A). Denote $\theta_G$ as the parameters that need to be learned. The generator is defined by the following steps:
1. Do transposed convolution on $z$ by GConv1_1 and get a tensor of dimension (32,32,32).
2. Do transposed convolution on $L_y$ by GConv1_2 and get a tensor of dimension (8,32,32).
3. Concatenate the two tensors to get GConcat1.
4. Do convolution on GConcat1 by GConv2_1 and GConv2_2 to get a tensor of dimension (1,32,32), which is the output of the generator.
The discriminator network is defined as $D(x, L_y; \theta_D)$. The inputs of the discriminator are samples of real data $x$ or generated data $\tilde{x}$, each representing the expression profile of an individual cell, and the label of $x$ or $\tilde{x}$, denoted by $L_y$, representing the cell type or subpopulation. Denote $\theta_D$ as the parameters that need to be learned. The discriminator is defined by the following steps (Supplementary Figure S1A):
1. Do convolution on $x$ or $\tilde{x}$ by DConv1_1 and get a tensor of dimension (16,32,32).
2. Do convolution on $L_y$ by DConv1_2 and get a tensor of dimension (16,32,32).
3. Concatenate the results of steps (1) and (2) as Dconcat1, which is a tensor of dimension (32,32,32).
4. Convert Dconcat1 to a vector of length 16 using a fully connected network (FCN).
5. Do convolution on the result of step (4) by GConv2_1 and GConv2_2 to get a tensor of dimension (1,32,32), which is the output of the discriminator.
With a well-trained GANs model, for a given cell $x_i$ which belongs to the subpopulation $L_{y_i}$, we generate a candidate set $A_{x_i}$ with $n$ expression profiles. Denote $A_k$ as the $k$ nearest neighbors of $x_i$ in the set $A_{x_i}$ using Euclidean distance. We then use the following equation to impute the $j$-th gene in the cell $x_i$ (Supplementary Figure S1B):

$$x_{i,j} = \begin{cases} x_{i,j}, & \text{if } x_{i,j} > 0 \\ \hat{x}_{i,j}, & \text{else} \end{cases}$$

where $\hat{x}_{i,j}$ is the expression of gene $j$ estimated from the $k$ nearest neighbors $A_k$.
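The imputation rule can be sketched in a few lines of Python. This is a hedged illustration rather than the scIGANs implementation: the function names are hypothetical, and averaging the $k$ nearest generated neighbors to fill a zero entry is an assumption, since the aggregate used for the neighbor estimate is not specified here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def impute_cell(cell, candidates, k=3):
    """Fill zero entries of `cell` from its k nearest generated candidates.

    Non-zero entries are returned unchanged. Zero entries are replaced by
    the mean of the same gene over the k nearest candidates (assumed rule).
    """
    if not 1 <= k <= len(candidates):
        raise ValueError("k must be between 1 and the number of candidates")
    neighbors = sorted(candidates, key=lambda c: euclidean(cell, c))[:k]
    imputed = []
    for j, value in enumerate(cell):
        if value > 0:
            imputed.append(value)
        else:
            imputed.append(sum(nb[j] for nb in neighbors) / k)
    return imputed
```

For a cell `[0.0, 0.5, 0.0]` and generated candidates drawn for the same cell type, only the first and third genes would be replaced; the observed value 0.5 is kept as is.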
Data processing and normalization
The data of a scRNA-seq study are usually organized as a read count matrix with $N$ rows representing genes and $M$ columns representing cells, which is the input of scIGANs. Since scIGANs is trained in a way similar to image processing, we need to transform the expression profile of each cell into a grayscale image (Supplementary Figure S1A). To this end, scIGANs first normalizes the raw count matrix by the maximum read count of each sample (cell) so that all genes of each sample have expression values in the range [0,1]. scIGANs then reshapes the expression profile of each cell into a square image in a column-wise manner, with the normalized gene expression values representing the pixels of the image. The image size is $n \times n$, where $n$ is the minimum integer such that $n \times n \ge N$. If the gene number is less than $n \times n$, the remaining pixels are filled with zeroes. A scRNA-seq matrix with $M$ cells is then represented as $M$ grayscale images and used to train a conditional GANs with the cell labels.
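The normalization and reshaping described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the function name `cell_to_image` is hypothetical, and "column-wise" is interpreted as filling the first column top to bottom, then the second, and so on.

```python
import math


def cell_to_image(counts):
    """Convert one cell's raw counts (length N) into an n x n grayscale image.

    n is the smallest integer with n * n >= N. Values are scaled by the
    cell's maximum count and missing pixels are padded with zeros.
    """
    n_genes = len(counts)
    if n_genes == 0:
        raise ValueError("counts must not be empty")
    side = math.isqrt(n_genes)
    if side * side < n_genes:
        side += 1
    max_count = max(counts)
    scaled = [c / max_count if max_count > 0 else 0.0 for c in counts]
    padded = scaled + [0.0] * (side * side - n_genes)
    # Column-wise fill: gene i lands in row i % side, column i // side.
    return [[padded[col * side + row] for col in range(side)]
            for row in range(side)]
```

With 1024 genes this yields a 32 x 32 image without padding; with 1025 genes the image grows to 33 x 33 and the unused pixels stay at zero.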
Simulated scRNA-seq data
We first simulated a simple scRNA-seq dataset with 150 cells and 20180 genes using the default CIDR simulation function scSimulator(N=3, k=50) [29]. Three cell types were generated with 50 cells each. The dropout data has a dropout rate of 52.8%. Figures 2A, S2A and Table S1 were derived from this data. We then tested the performance of different imputation methods on different dropout rates simulated by Splatter [31]. We used the same simulation strategy and parameters as the Splatter simulation used by SCRABBLE [10]. Specifically, three scRNA-seq datasets with three different dropout rates (71%, 83%, and 87%) were simulated; each dataset has 800 genes and 1000 cells grouped into three clusters (cell types). Figures S2B-E and Table S2 were derived from these datasets. To test the robustness of imputation methods, we repeated the above Splatter simulations 100 times and generated 100 datasets for each of the three dropout rates. Figures 2B, S3A-F, and Table S4 (EXCEL) were derived from these datasets.
Real scRNA-seq datasets
Human brain scRNA-seq data. We used scRNA-seq data of 466 cells capturing the cellular complexity of the adult and fetal human brain at a whole transcriptome level [33]. Tag tables were downloaded from the NCBI Gene Expression Omnibus (GEO accession number: GSE67835) and combined into one table with columns representing cells and rows representing genes. We excluded the uncertain hybrid cells, which left 420 cells in eight cell types with the expression of 22085 genes. This dataset was used to generate Figures 2C-D and S3G, and Table S4.
Cell-cycle phase scRNA-seq data. To evaluate the performance of different imputation methods on identifying different cellular states of the same cell type, we analyzed single-cell RNA-seq data from mESCs [34]. A set of 96 asynchronously dividing cells for each of the cell-cycle phases G1, S, and G2M was captured using the Fluidigm C1 system, and sequencing libraries were prepared and processed. In this dataset, 288 mESCs were profiled and characterized by 38293 transcripts with a dropout rate of 74.4%. This dataset was used to generate Figures 3A-B and S3, and Table S6.
ERCC spike-in RNAs scRNA-seq data. In the above scRNA-seq dataset for mESCs, ERCC spike-in RNAs were added to each cell and sequenced. ERCC spike-in RNAs consist of 92 RNA transcripts with lengths of 250 to 2,000 nt, which are widely used in scRNA-seq experiments to remove confounding technical noise from biological variability. Since the RNA spike-in is added to every sample in an identical amount to capture the technical noise, the readout for the spike-in RNAs should be free of cell-to-cell variability, and any detected variance of expression should come only from technical confounders rather than biological contexts (e.g. cell types). Therefore, the expression profiles of the spike-in RNAs added to individual cells should not be able to cluster these cells into different subgroups regarding cell types or other biological states. We therefore used the ERCC spike-in read counts from the real scRNA-seq data for mESCs [34] to evaluate the imputation methods on denoising the technical variation without introducing extra noise. This data was used to generate Figures 2E and S3H-I, and Table S5.
Mouse ESCs scRNA-seq dataset for cell-cycle dynamics. 6885 mouse embryonic stem cells (mESCs) were profiled using the droplet-microfluidic scRNA-seq approach with 1 biological replicate (933 cells) and 2 technical replicates (2509 and 3443 cells, respectively). The processed count matrix was downloaded from the Gene Expression Omnibus (GEO) under the accession ID GSE65525. scIGANs and the nine other imputation methods were used to impute the raw matrix, except that SCRABBLE and DrImpute failed to impute this data because they would take longer than a month to finish the imputation. This data was used to generate Figures 3C and S4C-L. Cell-cycle dynamics assessment was performed according to Figure 6E-F of [14]. Briefly, Pearson’s correlation was computed among a list of 44 previously categorized cell-cycle genes based on their expression across these 6.8k cells. Genes were ordered by hierarchical clustering on the correlation matrix, and their previously categorized cell-cycle phases were indicated as linked dots representing cell-cycle oscillations (Figures 3C and S4C-L). Clustering measurements were also applied to the gene clusters against their pre-assigned cell-cycle phases (bar plots in Figures 3C and S4C-L), which represent the performance of imputation methods on clustering the genes across cells.
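The gene-gene correlation step of this assessment can be illustrated with a small pure-Python sketch. The analysis itself was done in R; the helper names below are hypothetical, and the hierarchical clustering used to order the genes is deliberately left out.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # undefined for a constant gene
    return cov / (sd_x * sd_y)


def correlation_matrix(expr, genes):
    """Gene-by-gene Pearson matrix; expr maps gene -> values across cells."""
    return [[pearson(expr[g], expr[h]) for h in genes] for g in genes]
```

Genes that peak in the same cell-cycle phase are expected to show positive entries in this matrix, which is what the subsequent ordering by hierarchical clustering makes visible.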
Human ESC scRNA-seq dataset for differential expression analysis. To compare the performance of different imputation methods on the identification of differentially expressed genes (DEGs), we utilized a real dataset with both bulk and single-cell RNA-seq experiments on human embryonic stem cells (ESC) and the differentiated definitive endoderm cells (DEC) [41]. This dataset includes six samples of bulk RNA-seq (four for H1 ESC and two for DEC) and scRNA-seq of 350 single cells (212 for H1 ESC and 138 for DEC). The percentage of zero expression is 14.8% in the bulk data and 49.1% in the single-cell data. This dataset was used to generate Figures 4 and S5-S7.
We used scIGANs and nine other imputation methods to impute the gene expression for single cells and then used DESeq2 [42] to perform differential expression analysis on the raw and the 10 imputed datasets, respectively. DEGs are genes with absolute log fold change (H1/DEC) ≥ 1.5, adjusted p-value ≤ 0.05, and base mean ≥ 10 (Figure 4A). A set of the top 1000 DEGs (the 500 best up-regulated and 500 best down-regulated genes based on their adjusted p-values) from the bulk RNA-seq data was used to evaluate the correspondence between scRNA-seq and bulk RNA-seq data (Figures 4B and S5). To further evaluate the improvement of imputation on DEG identification, five signature genes highlighted in Figure 1c of the source paper [42] for H1 and DEC, respectively, were plotted (Figures 4C and S6). The expression of two marker genes (SOX2 for H1 cells and CXCR4 for DEC cells) was overlaid on the UMAP space of single cells to show the expression signature of these two types of cells (Figures 4D-E and S7).
Time-course scRNA-seq data for cellular trajectory analysis. We utilized time-course scRNA-seq data derived from the differentiation of H1 ESCs into definitive endoderm cells (DEC) [41]. A total of 758 cells were profiled at 0 (n=92), 12 (n=102), 24 (n=66), 36 (n=172), 72 (n=138), and 96 (n=188) hours after inducing the differentiation from H1 ESCs to DECs (Figure 5A). We applied scIGANs and all other nine imputation methods to the raw scRNA-seq data with known time points and then reconstructed the trajectories.
Subsampling for robustness analysis. We subsampled the scRNA-seq data derived from human embryonic stem cells (ESC) and the differentiated definitive endoderm cells (DEC) [40]. This dataset has expression profiles of 350 single cells (212 for H1 ESC and 138 for DEC) across 19097 genes. Three different sampling strategies were used to generate different sub-datasets for robustness tests. These datasets were used to generate Figures 6 and S9.
1) Datasets with a subset of genes that have the highest and lowest mean expression across all 350 cells, denoted as mean.top and mean.low. Specifically, the expression matrix (genes in rows and cells in columns) was sorted by the row means (descending) and the first and last 5000 genes were selected, representing two subsets with high and low expression, respectively. Then 1024 (32*32) genes were randomly picked from each set of 5000 genes to generate the two test datasets, mean.top and mean.low (Figures 6 and S9). These two datasets have zero-rates of 6.34% (mean.top) and 97.25% (mean.low).
2) Datasets with a subset of genes that have the highest and lowest standard deviation (sd) of expression across all 350 cells, denoted as sd.top and sd.low. Specifically, the expression matrix (genes in rows and cells in columns) was sorted by the row sd (descending) and the first and last 5000 genes were selected, representing two subsets with high and low expression standard deviations, respectively. Then 1024 (32*32) genes were randomly picked from each set of 5000 genes to generate the two test datasets, sd.top and sd.low (Figures 6 and S9). These two datasets have zero-rates of 8.72% (sd.top) and 92.42% (sd.low).
3) A dataset with a subset of 1024 genes randomly selected from all 19097 genes, denoted as global.random. It has a zero-rate of 49.51%.
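The first two sampling strategies follow the same pattern: rank genes by a per-gene score, take the top and bottom pools, and draw a random subset from each. The sketch below illustrates this in pure Python; the function names, the dictionary layout of `expr`, and the fixed seed are assumptions for this example and do not come from the reproducibility scripts.

```python
import random
from statistics import mean, stdev


def subsample_genes(expr, score, pool_size, n_pick, seed=0):
    """Pick n_pick genes at random from the top and bottom pools.

    expr maps gene -> expression values across cells; score is a function
    such as statistics.mean or statistics.stdev applied to each gene.
    """
    ranked = sorted(expr, key=lambda gene: score(expr[gene]), reverse=True)
    rng = random.Random(seed)
    top = rng.sample(ranked[:pool_size], n_pick)
    low = rng.sample(ranked[-pool_size:], n_pick)
    return top, low


def zero_rate(expr, genes):
    """Fraction of zero entries among the selected genes."""
    values = [v for gene in genes for v in expr[gene]]
    return sum(1 for v in values if v == 0) / len(values)
```

With the settings described above, one would call `subsample_genes(expr, mean, 5000, 1024)` for mean.top and mean.low and `subsample_genes(expr, stdev, 5000, 1024)` for sd.top and sd.low, then report `zero_rate` for each subset.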
Implementation and availability
scIGANs is implemented in Python 3.6 and R 3.6.1 with an interface wrapper script. An expression matrix of the single cells is the only required input file. Optionally, a file including the cell labels (cell type or subpopulation information) can be provided to direct scIGANs for cell type-specific imputation. If no prior cell labels are provided, scIGANs will pre-cluster the cells using a spectral clustering method. ScIGANs can run on either CPUs or GPUs. The output is the imputed expression matrix of the same dimensions, in which only the zero values are imputed and all other expression values are left unchanged. The whole package with a usage tutorial is available at GitHub (https://github.com/xuyungang/scIGANs).
Availability of data and codes for reproducibility
The original sources and preprocessing of all data are described in Methods. The processed datasets and codes used to reproduce the Figures and Tables are available at GitHub (https://github.com/xuyungang/scIGANs_Reproducibility).
Statistical information
All statistical tests were implemented in R (version 3.6.1). Specifically, the Pearson correlation tests (Figures 4B and S5) were done with cor.test() using default parameters; the Student’s t-tests (Figures 6C-D and S9) were done with t.test() using default parameters; the differentially expressed genes (DEGs) were identified by DESeq2 with p-adjust <= 0.05, log2FoldChange >= 1.5, and baseMean >= 10 (Figures 4A-B and S5).
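The DEG thresholds can be expressed as a simple filter. The sketch below is illustrative Python rather than the R code used for the analysis; it assumes each result row is a dictionary keyed by the DESeq2 result columns `log2FoldChange`, `padj` and `baseMean`, applies the fold-change cutoff to the absolute value as described in Methods, and treats a missing `padj` as not significant.

```python
def is_deg(row, lfc_cut=1.5, padj_cut=0.05, base_mean_cut=10.0):
    """Apply the three DEG thresholds to one result row."""
    padj = row["padj"]
    if padj is None:  # assumed handling of missing adjusted p-values
        return False
    return (abs(row["log2FoldChange"]) >= lfc_cut
            and padj <= padj_cut
            and row["baseMean"] >= base_mean_cut)


def top_degs(rows, n_each=500):
    """Best up- and down-regulated DEGs, ranked by adjusted p-value."""
    degs = [r for r in rows if is_deg(r)]
    up = sorted((r for r in degs if r["log2FoldChange"] > 0),
                key=lambda r: r["padj"])[:n_each]
    down = sorted((r for r in degs if r["log2FoldChange"] < 0),
                  key=lambda r: r["padj"])[:n_each]
    return up, down
```

Calling `top_degs(rows)` with the default `n_each=500` would correspond to the set of top 1000 bulk DEGs used for the correspondence analysis.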
Quantitative measurements of single-cell clusters
We use 11 numeric metrics to quantify the clustering of single cells. RI, the Rand index, is a measure of the similarity between two data clusterings. ARI, the adjusted Rand index, is adjusted for the chance grouping of elements. MI, mutual information, is used to determine the similarity of two different clusterings of a dataset; as such, it provides some advantages over the traditional Rand index. AMI, adjusted mutual information, is a variation of mutual information used for comparing clusterings. VI, variation of information, is a measure of the distance between two clusterings and a simple linear expression involving the mutual information. NVI is the normalized VI. ID and NID refer to the information distance and normalized information distance. All these metrics are computed using clustComp() from the R package ‘aricode’ (https://cran.r-project.org/web/packages/aricode/). F score (also F1-score or F-measure) is the harmonic mean of precision and recall. AUC, the area under the receiver operating characteristic (ROC) curve, is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. ACC is accuracy. The above three classification metrics are defined by comparing the independent clustering of cells to the true cell labels. Clustering was done using prediction() from the R package SC3 [50]. The in-house R scripts for these metrics are provided in the codes for reproducibility.
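To make the simplest of these metrics concrete, the Rand index can be computed by counting the pairs of cells on which two clusterings agree. The following pure-Python sketch is only an illustration of the definition (the metrics in this study were computed in R); the function name is hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of cell pairs on which two clusterings agree.

    A pair agrees if both clusterings put the two cells together,
    or both put them apart.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    if len(labels_a) < 2:
        raise ValueError("at least two cells are required")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total
```

Because only pair co-membership matters, the index is unaffected by how clusters are named, so a clustering compared against a relabeled copy of itself scores 1.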
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Author’s Contributions
YX, ZZ, and XZ conceived the study. ZZ developed the scIGANs model and YX wrapped it into a package. YX analyzed all scRNA-seq datasets, interpreted the results and assembled the reproducibility codes on GitHub (https://github.com/xuyungang/scIGANs_Reproducibility). ZZ, LY, JL, and ZF helped to test the method and reproduce the analyses. YX wrote the manuscript and all authors revised it. All authors read and approved the final version of the manuscript.
Materials and Correspondence
Correspondence and material requests should be addressed to Yungang Xu (yungang.xu@uth.tmc.edu).
Acknowledgments
This work was funded by the National Institutes of Health (NIH) [R01CA241930, R01GM123037 and AR069395].
Additional files
Additional file 1 (PDF): Supplementary Figures S1-S9.
Additional file 2 (PDF): Supplementary Tables S1, S2, and Tables S4-S6.
Additional file 3 (XLSX): Supplementary Table S3.
References
1. van Dijk, D., et al., Recovering Gene Interactions from Single-Cell Data Using Data Diffusion. Cell, 2018. 174(3): p. 716-729 e27.
2. van Dijk, D., et al., Recovering Gene Interactions from Single-Cell Data Using Data Diffusion. Cell, 2018. 174(3).
3. Li, W. and J. Li, An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nature Communications, 2018. 9(1).
4. Huang, M., et al., SAVER: gene expression recovery for single-cell RNA sequencing. Nature Methods, 2018. 15(7): p. 539-542.
5. Gong, W., et al., DrImpute: imputing dropout events in single cell RNA sequencing data. BMC Bioinformatics, 2018. 19(1): p. 220.
6. Chen, M. and X. Zhou, VIPER: variability-preserving imputation for accurate gene expression recovery in single-cell RNA sequencing studies. Genome Biology, 2018. 19(1): p. 196.
7. Eraslan, G., et al., Single-cell RNA-seq denoising using a deep count autoencoder. Nature Communications, 2019. 10(1): p. 390.
8. Wagner, F., D. Barkley, and I. Yanai, Accurate denoising of single-cell RNA-Seq data using unbiased principal component analysis. bioRxiv, 2019: p. 655365.
9. Arisdakessian, C., et al., DeepImpute: an accurate, fast, and scalable deep neural network method to impute single-cell RNA-seq data. Genome Biol, 2019. 20(1): p. 211.
10. Peng, T., et al., SCRABBLE: single-cell RNA-seq imputation constrained by bulk RNA-seq data. Genome Biology, 2019. 20(1): p. 88.
11. Mattei, P.A. and J. Frellsen, MIWAE: Deep Generative Modelling and Imputation of Incomplete Data Sets. International Conference on Machine Learning, 2019.
12. Zhang, H., P. Xie, and E. Xing, Missing value imputation based on deep generative models. arXiv preprint arXiv:1808.01684, 2018.
13. Mattei, P.A. and J. Frellsen, missiwae: Deep generative modelling and imputation of incomplete data. arXiv preprint arXiv:1812.02633, 2018.
14. Lopez, R., et al., Deep generative modeling for single-cell transcriptomics. Nature Methods, 2018. 15(12): p. 1053-1058.
15. Goodfellow, I., et al., Generative adversarial nets. Advances in neural …, 2014.
16. Radford, A., L. Metz, and S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint, 2015.
17. Chen, X., et al., Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural …, 2016.
18. Miyato, T., et al., Spectral normalization for generative adversarial networks. arXiv preprint arXiv …, 2018.
19. Ghahramani, A., F.M. Watt, and N.M. Luscombe, Generative adversarial networks simulate gene expression and predict perturbations in single cells. bioRxiv, 2018: p. 262501.
20. Gulrajani, I., et al., Improved training of wasserstein gans. Advances in neural …, 2017.
21. Yoon, J., J. Jordon, and M. van der Schaar, GAIN: Missing Data Imputation using Generative Adversarial Nets. 2018.
22. Ledig, C., et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. in CVPR. 2017.
23. Brock, A., et al., Neural photo editing with introspective adversarial networks. 2016.
24. Wolterink, J.M., et al., Generative adversarial networks for noise reduction in low-dose CT. 2017. 36(12): p. 2536-2545.
25. Zhang, H., V. Sindagi, and V.M. Patel, Image de-raining using a conditional generative adversarial network. 2017.
26. Chen, Q. and V. Koltun. Photographic image synthesis with cascaded refinement networks. in IEEE International Conference on Computer Vision (ICCV). 2017.
27. Marouf, M., et al., Realistic in silico generation and augmentation of single cell RNA-seq data using Generative Adversarial Neural Networks. bioRxiv, 2018: p. 390153.
28. Marouf, M., et al., Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nature Communications, 2020. 11(1): p. 166.
29. Lin, P., M. Troup, and J.W.K. Ho, CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biology, 2017. 18(1).
30. Zare, H., et al., Data reduction for spectral clustering to analyze high throughput flow cytometry data. BMC Bioinformatics, 2010. 11: p. 403.
31. Zappia, L., B. Phipson, and A. Oshlack, Splatter: simulation of single-cell RNA sequencing data. Genome Biol, 2017. 18(1): p. 174.
32. Peng, T., et al., SCRABBLE: single-cell RNA-seq imputation constrained by bulk RNA-seq data. Genome Biol, 2019. 20(1): p. 88.
33. Darmanis, S., et al., A survey of human brain transcriptome diversity at the single cell level. Proceedings of the National Academy of Sciences, 2015. 112(23): p. 7285-7290.
34. Buettner, F., et al., Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nature Biotechnology, 2015. 33(2): p. 155-160.
35. Paul, F., et al., Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors. Cell, 2015. 163(7): p. 1663-1677.
36. Wilson, N.K., et al., Combined Single-Cell Functional and Gene Expression Analysis Resolves Heterogeneity within Stem Cell Populations. Cell stem cell, 2015. 16(6): p. 712-724.
37. Buettner, F., et al., Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nature Biotechnology, 2015. 33(2): p. 155-160.
38. McDavid, A., G. Finak, and R. Gottardo, The contribution of cell cycle to heterogeneity in single-cell RNA-seq data. Nature Biotechnology, 2016. 34(6): p. 591.
39. Rapsomaniki, M., et al., CellCycleTRACER accounts for cell cycle and volume in mass cytometry data. Nature Communications, 2018. 9(1): p. 632.
40. Klein, A.M., et al., Droplet Barcoding for Single-Cell Transcriptomics Applied to Embryonic Stem Cells. Cell, 2015. 161(5): p. 1187-1201.
41. Chu, L.F., et al., Single-cell RNA-seq reveals novel regulators of human embryonic stem cell
38
differentiation to definitive endoderm.
Genome Biol, 2016.
17
(1): p. 173. 39 42. Love, M.I., W. Huber, and S. Anders,
Moderated estimation of fold change and dispersion for
40
RNA-seq data with DESeq2.
Genome Biology, 2014.
15
(12): p. 550. 41 43. Chen, H., et al.,
Single-cell trajectories reconstruction, exploration and mapping of omics data
42
with STREAM.
Nature Communications, 2019.
10
(1): p. 1903. 43 44. Bendall, S.C., et al.,
Single-Cell Trajectory Detection Uncovers Progression and Regulatory
44
Coordination in Human B Cell Development.
Cell, 2014.
157
(3): p. 714-725. 45 45. Qiu, X., et al.,
Reversed graph embedding resolves complex single-cell trajectories.
Nature 46 Methods, 2017.
14
(10): p. 979-982. 47
.CC-BY-NC-ND 4.0 International licenseIt is made available under a perpetuity.preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in
The copyright holder for this. http://dx.doi.org/10.1101/2020.01.20.913384doi: bioRxiv preprint first posted online Jan. 21, 2020;
27
46. Schiebinger, G., et al.,
Optimal-Transport Analysis of Single-Cell Gene Expression Identifies
1
Developmental Trajectories in Reprogramming.
Cell, 2019.
176
(Cell Tissue Res. 331 2008): p. 928. 2 47. Yang, Q., et al.,
Low-Dose CT Image Denoising Using a Generative Adversarial Network With
3
Wasserstein Distance and Perceptual Loss.
IEEE Trans Med Imaging, 2018.
37
(6): p. 1348-1357. 4 48. Gulrajani, I., et al.,
Improved Training of Wasserstein GANs.
arXiv, 2017. 5 49. Berthelot, D., T. Schumm, and L. Metz,
BEGAN: Boundary Equilibrium Generative Adversarial
6
Networks.
BEGAN: Boundary Equilibrium Generative Adversarial Networks, 2017. 7 50. Kiselev, V.Y., et al.,
SC3: consensus clustering of single-cell RNA-seq data.
Nat Methods, 2017. 8
14
(5): p. 483-486. 9
10
.CC-BY-NC-ND 4.0 International licenseIt is made available under a perpetuity.preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in
The copyright holder for this. http://dx.doi.org/10.1101/2020.01.20.913384doi: bioRxiv preprint first posted online Jan. 21, 2020;
Figure 1. Overview of generative adversarial networks for single-cell RNA-seq imputation (scIGANs).
The expression profile of each cell is reshaped to a square image, which is fed to the GANs (Supplementary
Figure S1A). The trained generator is used to generate a set of realistic cells, of which the k-nearest
neighbors (KNN) are used to impute the raw scRNA-seq expression matrix (Supplementary Figure S1B).
MSE, mean squared error.
Also see Figure S1.
[Figure 1 diagram labels: scRNA-seq expression matrix (cells n × genes m); reshape (cell 1, cell 2, ..., cell n); Generative Adversarial Networks (GANs) with latent space, generator (G), and discriminator (D); generated fake samples; MSE of input and output; fine-tune training; KNN; impute dropouts.]
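The two steps named in the Figure 1 caption (reshaping a cell's expression profile to a square image, then imputing from the nearest generated cells) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the zero-padding up to the next square size, the Euclidean distance, the rule that only zero entries are filled, and the helper names `reshape_to_square` and `knn_impute` are all hypothetical, and the "generated" cells are hard-coded stand-ins for generator output.

```python
import math

def reshape_to_square(profile):
    """Zero-pad a 1-D expression profile and reshape it to a square grid.

    Assumption: padding with zeros up to the next perfect square.
    """
    side = math.ceil(math.sqrt(len(profile)))
    padded = list(profile) + [0.0] * (side * side - len(profile))
    return [padded[r * side:(r + 1) * side] for r in range(side)]

def knn_impute(cell, generated_cells, k=2):
    """Fill zero entries of `cell` with the mean of its k nearest generated cells.

    Assumptions: Euclidean distance over all genes; non-zero entries are
    kept exactly as observed.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    neighbors = sorted(generated_cells, key=lambda g: dist(cell, g))[:k]
    imputed = []
    for j, value in enumerate(cell):
        if value == 0:
            imputed.append(sum(g[j] for g in neighbors) / len(neighbors))
        else:
            imputed.append(value)
    return imputed

# Toy data: one observed cell with 5 genes, gene 1 is a zero (possible dropout).
observed = [5.0, 0.0, 3.0, 1.0, 2.0]
image = reshape_to_square(observed)          # 5 genes -> 3 x 3 grid

# Hard-coded stand-ins for cells produced by a trained generator.
generated = [
    [5.0, 2.0, 3.0, 1.0, 2.0],
    [4.0, 4.0, 3.0, 1.0, 2.0],
    [0.0, 9.0, 0.0, 9.0, 9.0],
]
imputed_cell = knn_impute(observed, generated, k=2)
```

Filling only the zero entries is one way to leave observed counts untouched; whether scIGANs does exactly this is specified in the paper's Methods, not here.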
Figure 2. ScIGANs recovers single-cell gene expression from dropouts without extra noise. A. The UMAP plots of the CIDR simulated scRNA-seq data for the Full, Dropout, and imputed matrices from 10 methods. Multiple clustering measurements are provided in Supplementary Figure S2A and Table S1. B. The adjusted Rand index (ARI), a representative clustering measurement to indicate the performance and robustness of all methods on the Splatter simulated data with three different dropout rates (71%, 83%, 87%) and 100 replicates for each. The plots of other selected measurements are provided in Supplementary Figures S3A-F and the full list of clustering measurements is provided in Supplementary Table S3. C. The selected UMAP plots of real scRNA-seq data for human brain; the plots of all other imputation methods are provided in Supplementary Figure S3G. D. The selected clustering measurements for scRNA-seq data of human brain. AUC, area under the ROC curve; ARI, adjusted Rand index; F score, the harmonic mean of precision and recall; NMI, normalized mutual information. The full list of all considered clustering measurements is provided in Supplementary Table S4. E. The evaluation of robustness in avoiding extra noise using scRNA-seq data of spike-in RNAs. All UMAP plots are provided in Supplementary Figure S3H.
Also see Figures S2-S3.
[Figure 2 graphic omitted. Recoverable labels: panel titles Full, Dropout(52.8%), Dropout(81.4%), DCA, DeepImpute, DrImpute, ENHANCE, MAGIC, SAVER, scImpute, SCRABBLE, VIPER, scIGANs(w/), scIGANs(w/o); legends Cell type 1-3; Oligodendrocytes, Astrocytes, OPC, Microglia, Neurons, Endothelial, Fetal_quiescent, Fetal_replicating; G1, S, G2M; axes UMAP_1 and UMAP_2; ARI (0.00-1.00) at 71%, 83%, and 87% dropout; metrics ARI, F score, AUC, ACC.]
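Because the adjusted Rand index (ARI) is the headline clustering measurement in Figure 2, a short worked sketch of the standard pair-counting formula may help. The function name `adjusted_rand_index` and the toy labels are illustrative assumptions; the paper does not state which software computed its ARI values.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index from the contingency table of two partitions."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs to compare; treated as trivial agreement
    # Pairs of cells that are grouped together in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    # Pairs grouped together within each partition separately.
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (sum_cells - expected) / (max_index - expected)

# Identical partitions up to a relabeling of cluster names.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Predicted clusters that cut across the true cell types.
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The chance correction is what separates ARI from the plain Rand index: a clustering that cuts across the true cell types can score below zero, while relabeling clusters does not change the score.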
Figure 3. ScIGANs enables the identification of subcellular states. A. The UMAP plots of the real scRNA-seq data with known cell-cycle states. The full list of all considered clustering metrics is provided in Supplementary Figure S4A and Table S6. B. Cells are projected to the cell-cycle phase spaces based on collections of cell-cycle genes. The plots for all other methods are provided in Supplementary Figure S4B. C. Cell cycle dynamics shown as the hierarchical clustering of 44 cell-cycle-regulated genes across 6.8k mouse ESCs. Full dynamic cell-cycle profiles from before and after imputation by different methods are provided in Supplementary Figure S4C-L. The bar charts show the quantitative concordance between the assigned cell-cycle phases by hierarchical clustering and the true phases for which these genes serve as markers. F score, the harmonic mean of precision and recall; AUC, area under the ROC curve; ACC, accuracy.
Also see Figure S4.
[Figure 3 graphic omitted. Recoverable labels: panel A titles Dropout, DCA, DeepImpute, DrImpute, ENHANCE, MAGIC, SAVER, scImpute, SCRABBLE, VIPER, scIGANs(w/), with legend G1, S, G2M and axes UMAP_1 and UMAP_2; panel B titles Raw, scIGANs (w/), scImpute, with axes cell cycle score (S) and cell cycle score (G2M); panel C titles Raw, scImpute, scIGANs(w/), with phases G1/S, S, G2, G2/M, M/G1 and bar charts of F score, AUC, and ACC. Bar-chart values as extracted (assignment to methods and metrics not recoverable): 0.423, 0.575, 0.541; 0.372, 0.551, 0.595; 0.443, 0.612, 0.65.]
[The 44 cell-cycle-regulated genes in panel C: Mcm6, Nasp, Mcm2, Pcna, Rrm1, Birc5, Brca2, E2f5, Msh2, Npat, Rrm2, Ccne1, Cdc25a, Slbp, Cdc6, Dhfr, Cdk1, Cenpf, Top2a, Cdc20, Cenpa, Plk1, Cks2, Ccna2, Bub1b, Kif20a, Racgap1, Ccnb1, Aurka, Ccnf, Ccnb2, E2f1, Brca1, Bub1, Cdc25c, Ccne2, Tyms, Cdc45, Ccng2, Cdkn2d, Cdkn2c, Psrc1, Cdkn1a, Cdkn3.]
Figure 4. ScIGANs increases the correspondence of differential expression between single-cell and bulk RNA-seq. A. The correspondence of differentially expressed genes (DEGs) between bulk and single-cell RNA-seq with different imputation approaches. B. The correlations between log fold changes of differentially expressed genes from bulk and single-cell RNA-seq. Detailed legends and the plots of all considered imputation methods are provided in Supplementary Figure S5. C. The expression for one of five selected signature genes of H1 and DEC cells, respectively. All plots of other genes with different imputation methods are provided in Supplementary Figure S6. D-E. The UMAP plots of the single cells overlaid by the expression of SOX2 and CXCR4, which are the markers of H1 and DEC, respectively. Raw (D) and scIGANs imputed (E) matrices are shown and all other methods are provided in Supplementary Figure S7.
Also see Figures S5-S7.
[Figure 4 graphic omitted. Recoverable labels: panel A, intersection size and number of DEGs for Bulk, VIPER, scImpute, scIGANs, ENHANCE, SAVER, DrImpute, Raw, DCA, DeepImpute (per-set counts not reliably recoverable); panel B titles Raw and scIGANs, with axes log fold change of bulk RNA-seq (H1/DEC) and log fold change of scRNA-seq (H1/DEC), and annotations as extracted: n = 466 + 496, r = 0.92, p = 0; n = 480 + 499, r = 0.95, p = 0; panel C titles Bulk, Raw, scIGANs, genes NANOG and GATA6, groups DEC and H1, y-axis normalized count; panels D-E titles Raw and scIGANs, genes SOX2 and CXCR4, color scale z-score, cell types DEC and H1, axes UMAP_1 and UMAP_2.]
Figure 5. ScIGANs improves time course analysis and reconstruction of cellular trajectory from scRNA-seq data. A. The time points of scRNA-seq sampling along the differentiation from the pluripotent state (H1 cells) through mesendoderm to definitive endoderm cells (DEC). B-C. The trajectories reconstructed by monocle3 from the raw (B) and scIGANs imputed (C) scRNA-seq data. D-E. The expression of two pluripotent (left) and DEC (right) signature genes is shown in the order of the pseudotime. The plots of all other imputation methods are provided in Supplementary Figure S8.
Also see Figure S8.
[Figure 5 graphic omitted. Recoverable labels: panel A time points 00h, 12h, 24h, 36h, 72h, 96h along ESC (H1), mesendoderm, endoderm (DEC); trajectory panels titled Raw and scIGANs, colored by time point, with axes UMAP 1 and UMAP 2; expression panels for POU5F1, NANOG, HNF1B, and CER1, with axes pseudotime (0-30) and expression (log10).]
Figure 6. ScIGANs is robust to small sets of genes with very low expression or cell-to-cell variance. A-B. The UMAP visualizations of H1 and DEC cells using only 1024 genes from the raw (A) or scIGANs imputed (B) expression matrix based on three different sampling strategies. The sampling strategies are described in Methods. C-D. The boxplots show the mean (C) or standard deviation (sd, D) of the 1024 sampled genes before and after scIGANs imputation; p, the p-value of the Student’s t-test (two-sided). The same series of plots for all other imputation methods is provided in Supplementary Figure S9.
Also see Figure S9.
[Figure 6 graphic omitted. Recoverable labels: panel titles random, mean.low, mean.top, sd.low, sd.top; UMAP panels with axes UMAP_1 and UMAP_2; boxplots comparing Raw and scIGANs with y-axes Mean and SD. p-values as extracted, in page order (assignment to panels not recoverable): Mean: <2e−16, <2e−16, 1.2e−06, <2e−16, 2.1e−07; SD: <2e−16, <2e−16, <2e−16, <2e−16, 2.4e−12.]
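The Figure 6 panel titles (random, mean.low, mean.top, sd.low, sd.top) suggest that genes were subsampled at random or ranked by mean expression or by standard deviation. The sketch below shows one way such subsets could be drawn; it is an assumption-laden illustration, since the actual procedure is described in the paper's Methods and is not reproduced here. The function name `sample_genes`, the use of the population standard deviation, and the toy matrix are hypothetical.

```python
import random
import statistics

STRATEGIES = ("random", "mean.low", "mean.top", "sd.low", "sd.top")

def sample_genes(matrix, n_genes, strategy, seed=0):
    """Return sorted indices of `n_genes` genes chosen by the named strategy.

    `matrix` is a list of cells, each a list of per-gene expression values.
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    n_total = len(matrix[0])
    if strategy == "random":
        return sorted(random.Random(seed).sample(range(n_total), n_genes))
    # Collect each gene's values across cells, then rank genes by the statistic.
    columns = [[cell[j] for cell in matrix] for j in range(n_total)]
    stat = statistics.mean if strategy.startswith("mean") else statistics.pstdev
    ranked = sorted(range(n_total), key=lambda j: stat(columns[j]))
    chosen = ranked[:n_genes] if strategy.endswith("low") else ranked[-n_genes:]
    return sorted(chosen)

# Toy matrix: 3 cells x 4 genes (the figure uses 1024 genes; 2 are picked here).
cells = [
    [0, 10, 5, 1],
    [0, 12, 5, 3],
    [0, 14, 5, 2],
]
lowest_mean = sample_genes(cells, 2, "mean.low")
highest_mean = sample_genes(cells, 2, "mean.top")
highest_sd = sample_genes(cells, 2, "sd.top")
random_pick = sample_genes(cells, 2, "random")
```

Restricting the input to the lowest-mean or lowest-variance genes is a stress test: those genes carry the least signal for separating H1 from DEC cells, so any separation recovered after imputation is harder to attribute to a few dominant marker genes.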