Article

Highly Regional Genes: graph-based gene selection for single cell RNA-seq data


Abstract

Gene selection is an indispensable step in analyzing noisy, high-dimensional single-cell RNA-seq (scRNA-seq) data. In contrast to the commonly used variance-based methods, we propose a new feature selection method, HRG (Highly Regional Genes), which mimics how humans select markers in a 2D visualization of cells: it finds informative genes that show regional expression patterns in the cell-cell similarity network. We mathematically determine the optimal expression patterns that maximize the proposed scoring function. Compared with several unsupervised methods, HRG shows high accuracy and robustness and improves the performance of downstream cell clustering and gene correlation analysis. It is also applicable to selecting informative genes from sequencing-based spatial transcriptomic data.
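The contrast between variance-based ranking and a graph-based "regional" score can be illustrated with a small sketch. This is not the HRG scoring function, which the abstract does not spell out; the score below is a generic Moran's-I-style autocorrelation on a toy cell-cell graph, and the function names, the path graph, and the two example genes are all assumptions made for illustration.

```python
from statistics import pvariance


def regional_score(values, edges):
    """Moran's-I-style score (illustrative assumption, not HRG's formula).

    values: expression of one gene, one entry per cell.
    edges:  list of (i, j, weight) for an undirected cell-cell graph.
    """
    n = len(values)
    mean = sum(values) / n
    centred = [v - mean for v in values]
    denom = sum(c * c for c in centred)
    if denom == 0:
        return 0.0  # constant gene: no pattern to score
    total_w = sum(w for _, _, w in edges)
    num = sum(w * centred[i] * centred[j] for i, j, w in edges)
    return (n / total_w) * num / denom


# Toy graph: six cells connected in a path 0-1-2-3-4-5, unit weights.
edges = [(i, i + 1, 1.0) for i in range(5)]

# Two hypothetical genes with identical variance but different layouts.
regional = [5, 5, 5, 0, 0, 0]   # expressed in one region of the graph
scattered = [5, 0, 5, 0, 5, 0]  # expressed in alternating cells

# A variance-based ranking cannot separate the two genes ...
same_variance = pvariance(regional) == pvariance(scattered)
# ... whereas the graph-based score does (about 0.6 versus -1.0).
score_regional = regional_score(regional, edges)
score_scattered = regional_score(scattered, edges)
```

In this toy case both genes would tie under a variance criterion, while the graph-based score favors the gene whose expression is concentrated in connected cells.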

... Gene Selection Approaches derived from the concept of feature selection [23,24], tailored specifically for genomics research, including scRNA-seq studies. Those approaches, whether they depend heavily on well-trained embedded machine learning models [25] to identify the importance of each gene or utilize heuristic metrics to determine key genes [26,27], are always unstable and not optimization-directed. ...
... (1) K-Best [48] selects the top K features based on their scores, providing a straightforward approach to feature prioritization; (2) mRMR [49] chooses features that maximize relevance to the target variable while minimizing redundancy among the features; (3) LASSO [50] employs regularization to shrink the coefficients of less useful features to zero, effectively performing feature selection during model fitting; (4) RFE [51] is a recursive feature elimination method that systematically removes the weakest features based on model performance until a specified number of features remains; (5) HRG [27] utilizes a graph-based approach to identify genes that exhibit regional expression patterns within a cell-cell similarity network; (6) geneBasis [26] aims to select a small, targeted panel of genes from scRNA-seq datasets that can effectively capture the transcriptional variability present across different cells and cell types; (7) CellBRF [25] selects the most significant gene subset evaluated using Random Forest. HRG, CellBRF, and geneBasis are commonly used gene panel selection methods. ...
Preprint
Recent advancements in single-cell genomics necessitate precision in gene panel selection to interpret complex biological data effectively. Those methods aim to streamline the analysis of scRNA-seq data by focusing on the most informative genes that contribute significantly to the specific analysis task. Traditional selection methods, which often rely on expert domain knowledge, embedded machine learning models, or heuristic-based iterative optimization, are prone to biases and inefficiencies that may obscure critical genomic signals. Recognizing the limitations of traditional methods, we aim to transcend these constraints with a refined strategy. In this study, we introduce an iterative gene panel selection strategy that is applicable to clustering tasks in single-cell genomics. Our method uniquely integrates results from other gene selection algorithms, providing valuable preliminary boundaries or prior knowledge as initial guides in the search space to enhance the efficiency of our framework. Furthermore, we incorporate the stochastic nature of the exploration process in reinforcement learning (RL) and its capability for continuous optimization through reward-based feedback. This combination mitigates the biases inherent in the initial boundaries and harnesses RL's adaptability to refine and target gene panel selection dynamically. To illustrate the effectiveness of our method, we conducted detailed comparative experiments, case studies, and visualization analysis.
... DUBStepR (Ranjan et al. 2021) uses gene-gene correlation to identify an initial core set of genes and defines a graph-based measure of cell aggregation to optimize the number of genes. Highly Regional Genes (HRG) (Wu et al. 2022) chooses genes with expression levels regionally distributed on the cell-cell network or graph constructed based on cell similarity. ...
... We compare CellBRF with six state-of-the-art feature selection methods: a commonly used method [HVG (Satija et al. 2015)], two clustering-based feature selection methods [Feats (Vans et al. 2021), FEAST (Su et al. 2021)], a k-NN graph-based feature selection method geneBasisR (Missarova et al. 2021), a gene correlation-based feature selection method [DUBStepR (Ranjan et al. 2021)] and a cell-cell similarity-based method HRG (Wu et al. 2022). For Feast, geneBasis, and HVG, the number of selected genes is set to their default values of 2000, 50, and 2000, respectively. ...
Article
Full-text available
Motivation: Single-cell RNA sequencing (scRNA-seq) offers a powerful tool to dissect the complexity of biological tissues through cell sub-population identification in combination with clustering approaches. Feature selection is a critical step for improving the accuracy and interpretability of single-cell clustering. Existing feature selection methods underutilize the discriminatory potential of genes across distinct cell types. We hypothesize that incorporating such information could further boost the performance of single cell clustering. Results: We develop CellBRF, a feature selection method that considers genes' relevance to cell types for single-cell clustering. The key idea is to identify genes that are most important for discriminating cell types through random forests guided by predicted cell labels. Moreover, it proposes a class balancing strategy to mitigate the impact of unbalanced cell type distributions on feature importance evaluation. We benchmark CellBRF on 33 scRNA-seq datasets representing diverse biological scenarios and demonstrate that it substantially outperforms state-of-the-art feature selection methods in terms of clustering accuracy and cell neighborhood consistency. Furthermore, we demonstrate the outstanding performance of our selected features through three case studies on cell differentiation stage identification, non-malignant cell subtype identification, and rare cell identification. CellBRF provides a new and effective tool to boost single-cell clustering accuracy. Availability and implementation: All source codes of CellBRF are freely available at https://github.com/xuyp-csu/CellBRF.
... Gaussian distribution under the null hypothesis, and a one-sided p-value is computed. HRG [41] differs from Hotspot in three key aspects. First, it replaces the KNN graph with a shared-nearest-neighbors (SNN) graph, where the edge weight w_ij between spots i and j is defined as the Jaccard index of their respective K-nearest neighbors. ...
Article
Full-text available
In the analysis of spatially resolved transcriptomics data, detecting spatially variable genes (SVGs) is crucial. Numerous computational methods exist, but varying SVG definitions and methodologies lead to incomparable results. We review 31 state-of-the-art methods, categorizing SVGs into three types: overall, cell-type-specific, and spatial-domain-marker SVGs. Our review explains the intuitions underlying these methods, summarizes their applications, and categorizes the hypothesis tests they use in the trade-off between generality and specificity for SVG detection. We discuss challenges in SVG detection and propose future directions for improvement. Our review offers insights for method developers and users, advocating for category-specific benchmarking.
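The shared-nearest-neighbors weighting mentioned in the excerpt for this review (edge weight w_ij equal to the Jaccard index of the K-nearest-neighbor sets of i and j) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used by HRG or by the review: the helper names, the Euclidean distance, the exclusion of each point from its own neighbor set, and the toy coordinates are all choices made here.

```python
import math


def knn_sets(points, k):
    """For each point, return the index set of its k nearest neighbours.

    Assumptions: Euclidean distance; a point is not its own neighbour.
    """
    neighbours = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbours.append(set(others[:k]))
    return neighbours


def snn_weights(neighbours):
    """Jaccard index of neighbour sets for every pair that shares a neighbour."""
    weights = {}
    n = len(neighbours)
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(neighbours[i] & neighbours[j])
            if shared:
                union = len(neighbours[i] | neighbours[j])
                weights[(i, j)] = shared / union
    return weights


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
weights = snn_weights(knn_sets(points, k=2))
```

With these toy points, only pairs inside the same group share neighbors, so the resulting graph has no edges between the two groups.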
... Gaussian distribution under the null hypothesis, and a one-sided p-value is computed. HRG [41] differs from Hotspot in three key aspects. First, it replaces the KNN graph with a shared-nearest-neighbors (SNN) graph, where the edge weight w_ij between spots i and j is defined as the Jaccard index of their respective K-nearest neighbors. ...
Preprint
Full-text available
In the analysis of spatially resolved transcriptomics data, detecting spatially variable genes (SVGs) is crucial. Numerous computational methods exist, but varying SVG definitions and methodologies lead to incomparable results. We review 31 state-of-the-art methods, categorizing SVGs into three types: overall, cell-type-specific, and spatial-domain-marker SVGs. Our review explains the intuitions underlying these methods, summarizes their applications, and categorizes the hypothesis tests they use in the trade-off between generality and specificity for SVG detection. We discuss challenges in SVG detection and propose future directions for improvement. Our review offers insights for method developers and users, advocating for category-specific benchmarking.
... The "Aggregate Marker Score" function evaluated the expression of important features within cells, and we selected samples C that express specific cell type features as the core training dataset. Then, the Entropy test or Highly Regional Genes (Wu et al. 2022a) was used for feature selection. The genes G used for training were the union of input markers G_prior and genes selected from the scRNA-seq expression matrix. ...
Article
Full-text available
Single-cell RNA-seq (scRNA-seq) is a powerful technique for decoding the complex cellular compositions in the tumor microenvironment (TME). As previous studies have defined many meaningful cell subtypes in several tumor types, there is a great need to computationally transfer these labels to new datasets. Also, different studies used different approaches or criteria to define the cell subtypes for the same major cell lineages. The relationships between the cell subtypes defined in different studies should be carefully evaluated. In this updated package scCancer2, designed for integrative tumor scRNA-seq data analysis, we developed a supervised machine learning framework to annotate TME cells with annotated cell subtypes from 15 scRNA-seq datasets with 594 samples in total. Based on the trained classifiers, we quantitatively constructed the similarity maps between the cell subtypes defined in different references by testing on all the 15 datasets. Secondly, to improve the identification of malignant cells, we designed a classifier by integrating large-scale pan-cancer TCGA bulk gene expression datasets and scRNA-seq datasets (10 cancer types, 175 samples, 663,857 cells). This classifier shows robust performances when no internal confidential reference cells are available. Thirdly, scCancer2 integrated a module to process the spatial transcriptomic data and analyze the spatial features of TME. Availability: The package and user documentation are available at http://lifeome.net/software/sccancer2/ and https://doi.org/10.5281/zenodo.10477296. Supplementary information: Supplementary data are available at Bioinformatics online.
... After QC, the R packages harmony (v1.0) [46], Seurat (v3.9.9) [21], and HighlyRegionalGenes (v0.1.0) [47] were used for data integration of the expression data from SynOV and NS treated samples, basic downstream analysis, and feature selection, respectively. Specifically, we first directly combined the two expression matrices and performed SCTransform and scaling using Seurat pipelines. ...
... Also, several papers reporting new methods/tools and resources are attractive to our readership. As an example, a graph-based gene selection method, Highly Regional Genes (HRG), is developed for the analysis of single-cell RNA-seq data (Wu et al., 2022). ...
... Typically, it is possible to select features/genes using either a supervised or an unsupervised approach, according to the dataset availability and the method's constraints [43][44][45][46]. For instance, supervised feature selection can be used when training samples' labels are readily available, while unsupervised feature selection should be applied when training samples' labels are unavailable [47,48]. ...
Article
Nowadays, microarray data processing is one of the most important applications in molecular biology for cancer diagnosis. A major task in microarray data processing is gene selection, which aims to find a subset of genes with the least inner similarity and most relevant to the target class. Removing unnecessary, redundant, or noisy data reduces the data dimensionality. This research advocates a graph theoretic-based gene selection method for cancer diagnosis. Both unsupervised and supervised modes use well-known and successful social network approaches such as the maximum weighted clique criterion and edge centrality to rank genes. The suggested technique has two goals: (i) to maximize the relevancy of the chosen genes with the target class and (ii) to reduce their inner redundancy. A maximum weighted clique is chosen in a repetitive way in each iteration of this procedure. The appropriate genes are then chosen from among the existing features in this maximum clique using edge centrality and gene relevance. In the experiment, several datasets consisting of Colon, Leukemia, SRBCT, Prostate Tumor, and Lung Cancer, with different properties, are used to demonstrate the efficacy of the developed model. Our performance is compared to that of renowned filter-based gene selection approaches for cancer diagnosis whose results demonstrate a clear superiority.
... Owing to their ability to encode similarity relationships among data, graph-based models such as graph embedding [20], graph clustering [50], and semisupervised learning [51] have played an important role in machine learning tasks. Through the use of graph-based models for cancer prediction, a universal and versatile framework can be created that reflects the complex relationships and structure of the gene space. ...
Article
Full-text available
Several Artificial Intelligence-based models have been developed for cancer prediction. In spite of the promise of artificial intelligence, there are very few models which bridge the gap between traditional human-centered prediction and the potential future of machine-centered cancer prediction. In this study, an efficient and effective model is developed for gene selection and cancer prediction. Moreover, this study proposes an artificial intelligence decision system to provide physicians with a simple and human-interpretable set of rules for cancer prediction. In contrast to previous deep learning-based cancer prediction models, which are difficult to explain to physicians due to their black-box nature, the proposed prediction model is based on a transparent and explainable decision forest model. The performance of the developed approach is compared to three state-of-the-art cancer prediction methods, including TAGA, HPSO and LL. The reported results on five cancer datasets indicate that the developed model can improve the accuracy of cancer prediction and reduce the execution time.
Preprint
Full-text available
Recent advances in spatial sequencing technologies enable simultaneous capture of spatial location and chromatin accessibility of cells within intact tissue slices. Identifying peaks that display spatial variation and cellular heterogeneity is the first and key analytic task for characterizing the spatial chromatin accessibility landscape of complex tissues. Here we propose an efficient and iterative model, Descartes, for spatially variable peaks identification based on the graph of inter-cellular correlations. Through the comprehensive benchmarking for spatially variable peaks identification, we demonstrate the superiority of Descartes in revealing cellular heterogeneity and capturing tissue structure. In terms of computational efficiency, Descartes also outperforms existing methods with spatial assumptions. Utilizing the graph of inter-cellular correlations, Descartes denoises and imputes data via the neighboring relationships, enhancing the precision of downstream analysis. We further demonstrate the ability of Descartes for peak module identification by using peak-peak correlations within the graph. When applied to spatial multi-omics data, Descartes shows its potential to detect gene-peak interactions, offering valuable insights into the construction of gene regulatory networks.
Article
Full-text available
Large-scale transcriptomic data are crucial for understanding the molecular features of hepatocellular carcinoma (HCC). Integrating 15 transcriptomic datasets of HCC clinical samples, the first version of HCCDB (HCC database) was released in 2018. Through the meta-analysis of differentially expressed genes and prognosis-related genes across multiple datasets, it provides a systematic view of the altered biological processes and the inter-patient heterogeneities of HCC with high reproducibility and robustness. With four years having passed, the database now needs integration of recently published datasets. Furthermore, the latest single-cell and spatial transcriptomics have provided a great opportunity to decipher complex gene expression variations at the cellular level with spatial architecture. Here, we present HCCDB v2.0, an updated version that combines bulk, single-cell, and spatial transcriptomic data of HCC clinical samples. It dramatically expands the bulk sample size by adding 1656 new samples from 11 datasets to the existing 3917 samples, thereby enhancing the reliability of transcriptomic meta-analysis. A total of 182,832 cells and 69,352 spatial spots were added to the single-cell and spatial transcriptomics sections, respectively. A novel single-cell level and 2-dimension (sc-2D) metric was proposed as well to summarize cell type-specific and dysregulated gene expression patterns. Results are all graphically visualized in our online portal, allowing users to easily retrieve data through a user-friendly interface and navigate between different views. With extensive clinical phenotypes and transcriptomic data in the database, we show two applications for identifying prognosis-associated cells and tumor microenvironment. HCCDB v2.0 is available at http://lifeome.net/database/hccdb2.
Article
Single-cell clustering is a critical step in biological downstream analysis. The clustering performance could be effectively improved by extracting cell-type-specific genes. The state-of-the-art feature selection methods usually calculate the importance of a single gene without considering the information contained in the gene expression distribution. Moreover, these methods ignore the intrinsic expression patterns of genes and heterogeneity within groups of different mean expression levels. In this work, we present a Feature sElection method based on gene Expression Decomposition (FEED) of scRNA-seq data, which selects informative genes to enhance clustering performance. First, the expression levels of genes are decomposed into multiple Gaussian components. Then, a novel gene correlation calculation method is proposed to measure the relationship between genes from the perspective of distribution. Finally, a permutation-based approach is proposed to determine the threshold of gene importance to obtain marker gene subsets. Compared with state-of-the-art feature selection methods, applying FEED on various scRNA-seq datasets including large datasets followed by different common clustering algorithms results in significant improvements in the accuracy of cell-type identification. The source codes for FEED are freely available at https://github.com/genemine/FEED.
Preprint
Single-cell RNA-seq (scRNA-seq) is a powerful technique for decoding the complex cellular compositions in the tumor microenvironment (TME). As previous studies have defined many meaningful cell subtypes in several tumor types, there is a great need to computationally transfer these labels to new datasets. Also, different studies used different approaches or criteria to define the cell subtypes for the same major cell lineages. The relationships between the cell subtypes defined in different studies should be carefully evaluated. In this updated package scCancer2, designed for integrative tumor scRNA-seq data analysis, we developed a supervised machine learning framework to annotate TME cells with annotated cell subtypes from 15 scRNA datasets with 594 samples in total. Based on the trained classifiers, we quantitatively constructed the similarity maps between the cell subtypes defined in different references by testing on all the 15 datasets. Secondly, to improve the identification of malignant cells, we designed a classifier by integrating large-scale pan-cancer TCGA bulk gene expression datasets and scRNA-seq datasets (10 cancer types, 159 samples, 652,160 cells). This classifier shows robust performances when no internal confidential reference cells are available. Thirdly, this package also integrated a module to process the seq-based spatial transcriptomic data and analyze the spatial features of TME. Software availability: http://lifeome.net/software/sccancer2/.
Article
Spatially resolved transcriptomics (SRT) has unlocked new dimensions in our understanding of intricate tissue architectures. However, this rapidly expanding field produces a wealth of diverse and voluminous data, necessitating the evolution of sophisticated computational strategies to unravel inherent patterns. Two distinct methodologies, gene spatial pattern recognition (GSPR) and tissue spatial pattern recognition (TSPR), have emerged as vital tools in this process. GSPR methodologies are designed to identify and classify genes exhibiting noteworthy spatial patterns, while TSPR strategies aim to understand intercellular interactions and recognize tissue domains with molecular and spatial coherence. In this review, we provide a comprehensive exploration of SRT, highlighting crucial data modalities and resources that are instrumental for the development of methods and biological insights. We address the complexities and challenges posed by the use of heterogeneous data in developing GSPR and TSPR methodologies and propose an optimal workflow for both. We delve into the latest advancements in GSPR and TSPR, examining their interrelationships. Lastly, we peer into the future, envisaging the potential directions and perspectives in this dynamic field.
Article
The development of spatial transcriptomics technologies has transformed genetic research from a single-cell data level to a two-dimensional spatial coordinate system and facilitated the study of the composition and function of various cell subsets in different environments and organs. The large-scale data generated by these spatial transcriptomics technologies, which contains spatial gene expression information, have elicited the need for spatially resolved approaches to meet the requirements of computational and biological data interpretation. These requirements include dealing with the explosive growth of data to determine the cell-level and gene-level expression, correcting the inner batch effect and loss of expression to improve the data quality, conducting efficient interpretation and in-depth knowledge mining both at the single-cell and tissue-wide levels, and conducting multi-omics integration analysis to provide an extensible framework towards the in-depth understanding of biological processes. However, algorithms designed specifically for spatial transcriptomics technologies to meet these requirements are still in their infancy. Here, we review computational approaches to these problems in light of corresponding issues and challenges, and present forward-looking insights into algorithm development.
Article
Full-text available
Two fundamental aims that emerge when analyzing single-cell RNA-seq data are identifying which genes vary in an informative manner and determining how these genes organize into modules. Here, we propose a general approach to these problems, called “Hotspot,” that operates directly on a given metric of cell-cell similarity, allowing for its integration with any method (linear or non-linear) for identifying the primary axes of transcriptional variation between cells. In addition, we show that when using multimodal data, Hotspot can be used to identify genes whose expression reflects alternative notions of similarity between cells, such as physical proximity in a tissue or clonal relatedness in a cell lineage tree. In this manner, we demonstrate that while Hotspot is capable of identifying genes that reflect nuanced transcriptional variability between T helper cells, it can also identify spatially dependent patterns of gene expression in the cerebellum as well as developmentally heritable expression programs during embryogenesis. Hotspot is implemented as an open-source Python package and is available for use at http://www.github.com/yoseflab/hotspot. A record of this paper’s transparent peer review process is included in the supplemental information.
Article
Full-text available
Identifying genes that display spatial expression patterns in spatially resolved transcriptomic studies is an important first step toward characterizing the spatial transcriptomic landscape of complex tissues. Here we present a statistical method, SPARK, for identifying spatial expression patterns of genes in data generated from various spatially resolved transcriptomic techniques. SPARK directly models spatial count data through generalized linear spatial models. It relies on recently developed statistical formulas for hypothesis testing, providing effective control of type I errors and yielding high statistical power. With a computationally efficient algorithm, which is based on penalized quasi-likelihood, SPARK is also scalable to datasets with tens of thousands of genes measured on tens of thousands of samples. Analyzing four published spatially resolved transcriptomic datasets using SPARK, we show it can be up to ten times more powerful than existing methods and disclose biological discoveries that otherwise cannot be revealed by existing approaches.
Article
Full-text available
We present Vision, a tool for annotating the sources of variation in single cell RNA-seq data in an automated and scalable manner. Vision operates directly on the manifold of cell-cell similarity and employs a flexible annotation approach that can operate either with or without preconceived stratification of the cells into groups or along a continuum. We demonstrate the utility of Vision in several case studies and show that it can derive important sources of cellular variation and link them to experimental meta-data even with relatively homogeneous sets of cells. Vision produces an interactive, low latency and feature rich web-based report that can be easily shared among researchers, thus facilitating data dissemination and collaboration.
Article
Full-text available
Motivation: Most genomes contain thousands of genes, but for most functional responses, only a subset of those genes are relevant. To facilitate many single-cell RNASeq (scRNASeq) analyses the set of genes is often reduced through feature selection, i.e. by removing genes only subject to technical noise. Results: We present M3Drop, an R package that implements popular existing feature selection methods and two novel methods which take advantage of the prevalence of zeros (dropouts) in scRNASeq data to identify features. We show these new methods outperform existing methods on simulated and real datasets. Availability and implementation: M3Drop is freely available on github as an R package and is compatible with other popular scRNASeq tools: https://github.com/tallulandrews/M3Drop. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
Advances in single-cell technologies have enabled high-resolution dissection of tissue composition. Several tools for dimensionality reduction are available to analyze the large number of parameters generated in single-cell studies. Recently, a nonlinear dimensionality-reduction technique, uniform manifold approximation and projection (UMAP), was developed for the analysis of any type of high-dimensional data. Here we apply it to biological data, using three well-characterized mass cytometry and single-cell RNA sequencing datasets. Comparing the performance of UMAP with five other tools, we find that UMAP provides the fastest run times, highest reproducibility and the most meaningful organization of cell clusters. The work highlights the use of UMAP for improved visualization and interpretation of single-cell data.
Article
Full-text available
The cellular complexity of human brain development has been intensively investigated, although a regional characterization of the entire human cerebral cortex based on single-cell transcriptome analysis has not been reported. Here, we performed RNA-seq on over 4,000 individual cells from 22 brain regions of human mid-gestation embryos. We identified 29 cell sub-clusters, which showed different proportions in each region, and the pons showed an especially high percentage of astrocytes. Embryonic neurons were not as diverse as adult neurons, although they possessed important features of their destinies in adults. Neuron development was unsynchronized in the cerebral cortex, as dorsal regions appeared to be more mature than ventral regions at this stage. Region-specific genes were comprehensively identified in each neuronal sub-cluster, and a large proportion of these genes were neural disease related. Our results present a systematic landscape of the regionalized gene expression and neuron maturation of the human cerebral cortex.
Article
Full-text available
UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance. Furthermore, UMAP as described has no computational restrictions on embedding dimension, making it viable as a general purpose dimension reduction technique for machine learning.
Article
Full-text available
As single-cell RNA sequencing (scRNA-seq) technologies have rapidly developed, so have analysis methods. Many methods have been tested, developed, and validated using simulated datasets. Unfortunately, current simulations are often poorly documented, their similarity to real data is not demonstrated, or reproducible code is not available. Here, we present the Splatter Bioconductor package for simple, reproducible, and well-documented simulation of scRNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, our own simulation, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types, or differentiation paths.
Article
Full-text available
Characterizing the transcriptome of individual cells is fundamental to understanding complex biological systems. We describe a droplet-based system that enables 3′ mRNA counting of tens of thousands of single cells per sample. Cell encapsulation, of up to 8 samples at a time, takes place in ∼6 min, with ∼50% cell capture efficiency. To demonstrate the system's technical performance, we collected transcriptome data from ∼250k single cells across 29 samples. We validated the sensitivity of the system and its ability to detect rare populations using cell lines and synthetic RNAs. We profiled 68k peripheral blood mononuclear cells to demonstrate the system's ability to characterize large immune populations. Finally, we used sequence variation in the transcriptome data to determine host and donor chimerism at single-cell resolution from bone marrow mononuclear cells isolated from transplant patients.
Article
Hormone-secreting cells within pancreatic islets of Langerhans play important roles in metabolic homeostasis and disease. However, their transcriptional characterization is still incomplete. Here, we sequenced the transcriptomes of thousands of human islet cells from healthy and type 2 diabetic donors. We could define specific genetic programs for each individual endocrine and exocrine cell type, even for rare δ, γ, ε, and stellate cells, and revealed subpopulations of α, β, and acinar cells. Intriguingly, δ cells expressed several important receptors, indicating an unrecognized importance of these cells in integrating paracrine and systemic metabolic signals. Genes previously associated with obesity or diabetes were found to correlate with BMI. Finally, comparing healthy and T2D transcriptomes in a cell-type resolved manner uncovered candidates for future functional studies. Altogether, our analyses demonstrate the utility of the generated single-cell gene expression resource.
Article
High-throughput single-cell technologies have great potential to discover new cell types; however, it remains challenging to detect rare cell types that are distinct from a large population. We present a novel computational method, called GiniClust, to overcome this challenge. Validation against a benchmark dataset indicates that GiniClust achieves high sensitivity and specificity. Application of GiniClust to public single-cell RNA-seq datasets uncovers previously unrecognized rare cell types, including Zscan4-expressing cells within mouse embryonic stem cells and hemoglobin-expressing cells in the mouse cortex and hippocampus. GiniClust also correctly detects a small number of normal cells that are mixed in a cancer cell population. (doi:10.1186/s13059-016-1010-4)
Article
Spatial structure of RNA expression. RNA-seq and similar methods can record gene expression within and among cells. Current methods typically lose positional information and many require arduous single-cell isolation and sequencing. Ståhl et al. have developed a way of measuring the spatial distribution of transcripts by annealing fixed brain or cancer tissue samples directly to bar-coded reverse transcriptase primers, performing reverse transcription followed by sequencing and computational reconstruction, and they can do so for multiple genes. Science, this issue p. 78
Article
One size does not fit all. Oligodendrocytes are best known for their ability to myelinate brain neurons, thus increasing the speed of signal transmission. Marques et al. surveyed oligodendrocytes of developing mice and found unexpected heterogeneity. Transcriptional analysis identified 12 populations, ranging from precursors to mature oligodendrocytes. Transcriptional profiles diverged as the oligodendrocytes matured, building distinct populations. One population was responsive to motor learning, and another, with a different transcriptome, traveled along blood vessels. Science, this issue p. 1326
Article
The major and essential objective of pre-implantation development is to establish embryonic and extra-embryonic cell fates. To address when and how this fundamental process is initiated in mammals, we characterize transcriptomes of all individual cells throughout mouse pre-implantation development. This identifies targets of master pluripotency regulators Oct4 and Sox2 as being highly heterogeneously expressed between blastomeres of the 4-cell embryo, with Sox21 showing one of the most heterogeneous expression profiles. Live-cell tracking demonstrates that cells with decreased Sox21 yield more extra-embryonic than pluripotent progeny. Consistently, decreasing Sox21 results in premature upregulation of the differentiation regulator Cdx2, suggesting that Sox21 helps safeguard pluripotency. Furthermore, Sox21 is elevated following increased expression of the histone H3R26-methylase CARM1 and is lowered following CARM1 inhibition, indicating the importance of epigenetic regulation. Therefore, our results indicate that heterogeneous gene expression, as early as the 4-cell stage, initiates cell-fate decisions by modulating the balance of pluripotency and differentiation.
Article
Significance: The brain comprises an immense number of cells and cellular connections. We describe the first, to our knowledge, single cell whole transcriptome analysis of human adult cortical samples. We have established an experimental and analytical framework with which the complexity of the human brain can be dissected on the single cell level. Using this approach, we were able to identify all major cell types of the brain and characterize subtypes of neuronal cells. We observed changes in neurons from early developmental to late differentiated stages in the adult. We found a subset of adult neurons which express major histocompatibility complex class I genes and thus are not immune privileged.
Article
The recent advance of single-cell technologies has brought new insights into complex biological phenomena. In particular, genome-wide single-cell measurements such as transcriptome sequencing enable the characterization of cellular composition as well as functional variation in homogenic cell populations. An important step in the single-cell transcriptome analysis is to group cells that belong to the same cell types based on gene expression patterns. The corresponding computational problem is to cluster a noisy high dimensional dataset with substantially fewer objects (cells) than the number of variables (genes). In this article, we describe a novel algorithm named shared nearest neighbor (SNN)-Cliq that clusters single-cell transcriptomes. SNN-Cliq utilizes the concept of shared nearest neighbor that shows advantages in handling high-dimensional data. When evaluated on a variety of synthetic and real experimental datasets, SNN-Cliq outperformed the state-of-the-art methods tested. More importantly, the clustering results of SNN-Cliq reflect the cell types or origins with high accuracy. The algorithm is implemented in MATLAB and Python. The source code can be downloaded at http://bioinfo.uncc.edu/SNNCliq. Contact: zcsu@uncc.edu.
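The shared nearest neighbor concept can be illustrated with a small sketch: two cells are considered similar when their k-nearest-neighbor lists overlap. The code below uses a plain overlap count on toy 2D points; it is not the SNN-Cliq implementation, which, as far as we understand, uses a rank-weighted similarity followed by a clique-finding step. The function names and toy data are assumptions made for illustration.

```python
import math


def knn_sets(points, k):
    """For each point, return the index set of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def snn_similarity(points, k):
    """Return a symmetric matrix counting shared k-nearest neighbors per pair."""
    nn = knn_sets(points, k)
    n = len(points)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = len(nn[i] & nn[j])
    return sim


# Toy data: two well-separated groups of three points each (hypothetical).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
sim = snn_similarity(points, k=2)
for row in sim:
    print(row)
```

Pairs within a group share neighbors, while pairs across groups share none, which is the property a graph-based clustering step can then exploit.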
Article
The mammalian cerebral cortex supports cognitive functions such as sensorimotor integration, memory, and social behaviors. Normal brain function relies on a diverse set of differentiated cell types, including neurons, glia, and vasculature. Here, we have used large-scale single-cell RNA-seq to classify cells in the mouse somatosensory cortex and hippocampal CA1 region. We found 47 molecularly distinct subclasses, comprising all known major cell types in the cortex. We identified numerous marker genes, which allowed alignment with known cell types, morphology, and location. We found a layer I interneuron expressing Pax6 and a distinct postmitotic oligodendrocyte subclass marked by Itpr2. Across the diversity of cortical cell types, transcription factors formed a complex, layered regulatory code, suggesting a mechanism for the maintenance of adult cell type identity.
Article
Considerable progress in sequencing technologies makes it now possible to study the genomic and transcriptomic landscape of single cells. However, to better understand the complexity of multicellular organisms, we must devise ways to perform high-throughput measurements while preserving spatial information about the tissue context or subcellular localization of analysed nucleic acids. In this Innovation article, we summarize pioneering technologies that enable spatially resolved transcriptomics and discuss how these methods have the potential to extend beyond transcriptomics to encompass spatially resolved genomics, proteomics and possibly other omic disciplines.
Article
Tissue gene expression profiling is performed on homogenates or on populations of isolated single cells to resolve molecular states of different cell types. In both approaches, histological context is lost. We have developed an in situ sequencing method for parallel targeted analysis of short RNA fragments in morphologically preserved cells and tissue. We demonstrate in situ sequencing of point mutations and multiplexed gene expression profiling in human breast cancer tissue sections.
Article
High-throughput sequencing has allowed for unprecedented detail in gene expression analyses, yet its efficient application to single cells is challenged by the small starting amounts of RNA. We have developed CEL-Seq, a method for overcoming this limitation by barcoding and pooling samples before linearly amplifying mRNA with the use of one round of in vitro transcription. We show that CEL-Seq gives more reproducible, linear, and sensitive results than a PCR-based amplification method. We demonstrate the power of this method by studying early C. elegans embryonic development at single-cell resolution. Differential distribution of transcripts between sister cells is seen as early as the two-cell stage embryo, and zygotic expression in the somatic cell lineages is enriched for transcription factors. The robust transcriptome quantifications enabled by CEL-Seq will be useful for transcriptomic analyses of complex tissues containing populations of diverse cell types.
Article
Actin-myosin interactions provide the driving force underlying each heartbeat. The current view is that actin-bound regulatory proteins play a dominant role in the activation of calcium-dependent cardiac muscle contraction. In contrast, the relevance and nature of regulation by myosin regulatory proteins (for example, myosin light chain-2 [MLC2]) in cardiac muscle remain poorly understood. By integrating gene-targeted mouse and computational models, we have identified an indispensable role for ventricular Mlc2 (Mlc2v) phosphorylation in regulating cardiac muscle contraction. Cardiac myosin cycling kinetics, which directly control actin-myosin interactions, were directly affected, but surprisingly, Mlc2v phosphorylation also fed back to cooperatively influence calcium-dependent activation of the thin filament. Loss of these mechanisms produced early defects in the rate of cardiac muscle twitch relaxation and ventricular torsion. Strikingly, these defects preceded the left ventricular dysfunction of heart disease and failure in a mouse model with nonphosphorylatable Mlc2v. Thus, there is a direct and early role for Mlc2 phosphorylation in regulating actin-myosin interactions in striated muscle contraction, and dephosphorylation of Mlc2 or loss of these mechanisms can play a critical role in heart failure.
Article
Next-generation sequencing technology is a powerful tool for transcriptome analysis. However, under certain conditions, only a small amount of material is available, which requires more sensitive techniques that can preferably be used at the single-cell level. Here we describe a single-cell digital gene expression profiling assay. Using our mRNA-Seq assay with only a single mouse blastomere, we detected the expression of 75% (5,270) more genes than microarray techniques and identified 1,753 previously unknown splice junctions called by at least 5 reads. Moreover, 8-19% of the genes with multiple known transcript isoforms expressed at least two isoforms in the same blastomere or oocyte, which unambiguously demonstrated the complexity of the transcript variants at whole-genome scale in individual cells. Finally, for Dicer1(-/-) and Ago2(-/-) (Eif2c2(-/-)) oocytes, we found that 1,696 and 1,553 genes, respectively, were abnormally upregulated compared to wild-type controls, with 619 genes in common.
Article
Single-cell transcriptomics has transformed our ability to characterize cell states, but deep biological understanding requires more than a taxonomic listing of clusters. As new methods arise to measure distinct cellular modalities, a key analytical challenge is to integrate these datasets to better understand cellular identity and function. Here, we develop a strategy to "anchor" diverse datasets together, enabling us to integrate single-cell measurements not only across scRNA-seq technologies, but also across different modalities. After demonstrating improvement over existing methods for integrating scRNA-seq data, we anchor scRNA-seq experiments with scATAC-seq to explore chromatin differences in closely related interneuron subsets and project protein expression measurements onto a bone marrow atlas to characterize lymphocyte populations. Lastly, we harmonize in situ gene expression and scRNA-seq datasets, allowing transcriptome-wide imputation of spatial gene expression patterns. Our work presents a strategy for the assembly of harmonized references and transfer of information across datasets.
Article
Intratumoral heterogeneity is a major obstacle to cancer treatment and a significant confounding factor in bulk-tumor profiling. We performed an unbiased analysis of transcriptional heterogeneity in colorectal tumors and their microenvironments using single-cell RNA–seq from 11 primary colorectal tumors and matched normal mucosa. To robustly cluster single-cell transcriptomes, we developed reference component analysis (RCA), an algorithm that substantially improves clustering accuracy. Using RCA, we identified two distinct subtypes of cancer-associated fibroblasts (CAFs). Additionally, epithelial–mesenchymal transition (EMT)-related genes were found to be upregulated only in the CAF subpopulation of tumor samples. Notably, colorectal tumors previously assigned to a single subtype on the basis of bulk transcriptomics could be divided into subgroups with divergent survival probability by using single-cell signatures, thus underscoring the prognostic value of our approach. Overall, our results demonstrate that unbiased single-cell RNA–seq profiling of tumor and matched normal samples provides a unique opportunity to characterize aberrant cell states within a tumor.
Article
The hypothalamus contains the highest diversity of neurons in the brain. Many of these neurons can co-release neurotransmitters and neuropeptides in a use-dependent manner. Investigators have hitherto relied on candidate protein-based tools to correlate behavioral, endocrine and gender traits with hypothalamic neuron identity. Here we map neuronal identities in the hypothalamus by single-cell RNA sequencing. We distinguished 62 neuronal subtypes producing glutamatergic, dopaminergic or GABAergic markers for synaptic neurotransmission and harboring the ability to engage in task-dependent neurotransmitter switching. We identified dopamine neurons that uniquely coexpress the Onecut3 and Nmur2 genes, and placed these in the periventricular nucleus with many synaptic afferents arising from neuromedin S+ neurons of the suprachiasmatic nucleus. These neuroendocrine dopamine cells may contribute to the dopaminergic inhibition of prolactin secretion diurnally, as their neuromedin S+ inputs originate from neuron
Article
Although the function of the mammalian pancreas hinges on complex interactions of distinct cell types, gene expression profiles have primarily been described with bulk mixtures. Here we implemented a droplet-based, single-cell RNA-seq method to determine the transcriptomes of over 12,000 individual pancreatic cells from four human donors and two mouse strains. Cells could be divided into 15 clusters that matched previously characterized cell types: all endocrine cell types, including rare epsilon-cells; exocrine cell types; vascular cells; Schwann cells; quiescent and activated stellate cells; and four types of immune cells. We detected subpopulations of ductal cells with distinct expression profiles and validated their existence with immuno-histochemistry stains. Moreover, among human beta-cells, we detected heterogeneity in the regulation of genes relating to functional maturation and levels of ER stress. Finally, we deconvolved bulk gene expression samples using the single-cell data to detect disease-associated differential expression. Our dataset provides a resource for the discovery of novel cell type-specific transcription factors, signaling receptors, and medically relevant genes.
Article
Cells, the basic units of biological structure and function, vary broadly in type and state. Single-cell genomics can characterize cell identity and function, but limitations of ease and scale have prevented its broad application. Here we describe Drop-seq, a strategy for quickly profiling thousands of individual cells by separating them into nanoliter-sized aqueous droplets, associating a different barcode with each cell's RNAs, and sequencing them all together. Drop-seq analyzes mRNA transcripts from thousands of individual cells simultaneously while remembering transcripts' cell of origin. We analyzed transcriptomes from 44,808 mouse retinal cells and identified 39 transcriptionally distinct cell populations, creating a molecular atlas of gene expression for known retinal cell classes and novel candidate cell subtypes. Drop-seq will accelerate biological discovery by enabling routine transcriptional profiling at single-cell resolution.
Article
It has long been the dream of biologists to map gene expression at the single-cell level. With such data one might track heterogeneous cell sub-populations, and infer regulatory relationships between genes and pathways. Recently, RNA sequencing has achieved single-cell resolution. What is limiting is an effective way to routinely isolate and process large numbers of individual cells for quantitative in-depth sequencing. We have developed a high-throughput droplet-microfluidic approach for barcoding the RNA from thousands of individual cells for subsequent analysis by next-generation sequencing. The method shows a surprisingly low noise profile and is readily adaptable to other sequencing-based assays. We analyzed mouse embryonic stem cells, revealing in detail the population structure and the heterogeneous onset of differentiation after leukemia inhibitory factor (LIF) withdrawal. The reproducibility of these high-throughput single-cell data allowed us to deconstruct cell populations and infer gene expression relationships.
Article
Advancing our understanding of embryonic development is heavily dependent on identification of novel pathways or regulators. Although genome-wide techniques such as RNA sequencing are ideally suited for discovering novel candidate genes, they are unable to yield spatially resolved information in embryos or tissues. Microscopy-based approaches, using in situ hybridization, for example, can provide spatial information about gene expression, but are limited to analyzing one or a few genes at a time. Here, we present a method where we combine traditional histological techniques with low-input RNA sequencing and mathematical image reconstruction to generate a high-resolution genome-wide 3D atlas of gene expression in the zebrafish embryo at three developmental stages. Importantly, our technique enables searching for genes that are expressed in specific spatial patterns without manual image annotation. We envision broad applicability of RNA tomography as an accurate and sensitive approach for spatially resolved transcriptomics in whole embryos and dissected organs.
Visualizing data using t-SNE
  • Van der Maaten
KneeArrower: Finds Cutoff Points on Knee Curves
  • A Tseng