Doublet prediction for a mixture of human and mouse cells. (A) Schematic overview of species mixing experiment. (B) Identification of mixed-species doublets based on fraction of reads mapping to human or mouse transcriptome. (C) Principal component (PC) analysis of single-cell transcriptomes, restricting to human-mouse gene orthologs. (D) Histogram of doublet scores for simulated doublets. The bimodal distribution reflects the two types of doublets: undetectable intra-species Type A doublets (left peak) and inter-species Type B doublets (right peak). (E) Histograms of doublet scores for observed singlets (gray) and doublets (red). (F) Receiver-operator characteristic (ROC) curve for Scrublet and total transcript counts as predictors of inter-species doublets. AUC, area under the curve. 

Source publication
Preprint
Single-cell RNA-sequencing has become a widely used, powerful approach for studying cell populations. However, these methods often generate multiplet artifacts, where two or more cells receive the same barcode, resulting in a hybrid transcriptome. In most experiments, multiplets account for several percent of transcriptomes and can confound downstr...

Contexts in source publication

Context 1
... hiding species labels and restricting to orthologous genes ( Fig. 3C), Scrublet estimated the detectable (Type B) doublet fraction at ϕD = 54%, close to the 50% expected for cross-species doublets given equal input of mouse and human cells (Fig. 3D). Furthermore, the detector accurately identified human-mouse doublets with a receiver-operator characteristic (ROC) area under the curve (AUC) of 0.99 (recall of 98% of human- . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under ...
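The 50% expectation quoted in this excerpt can be checked with a short calculation. The sketch below is illustrative only: it assumes that the two cells in a doublet are drawn independently of species, and the function name is hypothetical rather than part of Scrublet.

```python
def expected_cross_species_fraction(human_fraction: float) -> float:
    """Fraction of two-cell doublets expected to hold one human and one mouse cell.

    Assumes the two cells pair at random, independently of species
    (an assumption of this sketch, not a statement from the source).
    """
    if not 0.0 <= human_fraction <= 1.0:
        raise ValueError("human_fraction must be between 0 and 1")
    mouse_fraction = 1.0 - human_fraction
    # Two orderings (human+mouse, mouse+human) give the factor of 2.
    return 2.0 * human_fraction * mouse_fraction


print(expected_cross_species_fraction(0.5))  # 0.5
```

With equal input the result is 0.5, which is the value the estimated detectable fraction of 54% is compared against; any unequal mix gives a smaller value under the same assumption.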
Context 3
... copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/357368 doi: bioRxiv preprint first posted online Jul. 9, 2018; identical mouse and human names (n = 15,642 genes) were added together, and all other genes were excluded. For PCA, we used the top 20% most highly variable genes with ≥3 counts in ≥5 cells (n=2,372 genes) and kept the first two PCs. Scrublet was run using k = 50, r = 10, and d = 0.12 (twice the observed rate of human-mouse doublets). To classify cells as singlets or doublets, a threshold was set by eye using the histogram of doublet scores for simulated doublets (Fig. ...
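The gene-level preprocessing described in this excerpt (summing genes with identical names across species, then keeping genes with at least 3 counts in at least 5 cells) can be sketched as follows. This is a minimal illustration rather than the authors' code: the dictionary layout, the function names, and the case-insensitive name matching are all assumptions, and the variable-gene selection and PCA steps are not shown.

```python
def merge_shared_genes(human_counts, mouse_counts):
    """Sum per-cell counts for genes whose names match in both species.

    Each argument maps gene name -> list of per-cell counts (same cell order).
    Names are compared case-insensitively, which is an assumption of this sketch.
    Genes found in only one species are dropped.
    """
    mouse_by_key = {name.upper(): counts for name, counts in mouse_counts.items()}
    merged = {}
    for name, human in human_counts.items():
        mouse = mouse_by_key.get(name.upper())
        if mouse is not None:
            merged[name.upper()] = [h + m for h, m in zip(human, mouse)]
    return merged


def passes_expression_filter(counts, min_counts=3, min_cells=5):
    """True if at least `min_cells` cells have at least `min_counts` counts."""
    return sum(1 for c in counts if c >= min_counts) >= min_cells


# Toy data: six cells, two shared genes, one gene per species with no match.
human = {"ACTB": [5, 4, 6, 0, 0, 0], "GAPDH": [1, 0, 2, 0, 0, 0], "XIST": [3, 3, 3, 0, 0, 0]}
mouse = {"Actb": [0, 0, 0, 4, 5, 3], "Gapdh": [0, 0, 0, 1, 1, 0], "Tsix": [0, 0, 0, 2, 2, 2]}

merged = merge_shared_genes(human, mouse)
kept = sorted(g for g, c in merged.items() if passes_expression_filter(c))
print(kept)  # ['ACTB']
```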
Context 4
... tested the Scrublet on a publicly available dataset consisting of a mixture of human (HEK293T) and mouse (NIH3T3) cells (Fig. 3A). This dataset, though not representative of most single-cell experiments, provides a useful test case because the differences between human and mouse genomic sequence provide an independent way to detect doublets [2,3]. We defined a partial "ground truth" on doublet identity according to whether a cell barcode associates with transcripts from both species (a doublet), or just one species (Fig. 3B). Because doublets arising from the encapsulation of two human or two mouse cells cannot be identified as such, we expected our doublet detector to correctly predict all "ground truth" labeled doublets, since they arise from distinct human and mouse cell ...
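The partial "ground truth" described in this excerpt labels a barcode by whether its transcripts come from one species or both. A minimal sketch of such a rule is shown below; the `purity` cutoff of 0.9 and the function name are hypothetical choices made for illustration, since the excerpt does not state the threshold used.

```python
def classify_barcode(human_counts: int, mouse_counts: int, purity: float = 0.9) -> str:
    """Label a barcode from its human and mouse transcript counts.

    `purity` is a hypothetical cutoff: a barcode is called single-species
    when at least that fraction of its transcripts maps to one species.
    """
    total = human_counts + mouse_counts
    if total == 0:
        return "empty"
    human_fraction = human_counts / total
    if human_fraction >= purity:
        return "human"
    if human_fraction <= 1.0 - purity:
        return "mouse"
    return "doublet"  # substantial signal from both species


for h, m in [(950, 50), (40, 960), (500, 500)]:
    print(h, m, classify_barcode(h, m))
```

Note that this rule can only flag human-mouse doublets; a barcode holding two human or two mouse cells is labeled as a single species, which is why the ground truth is described as partial.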
Context 6
... mouse doublets with precision of 96%) (Fig. 3E,F). In contrast, predicting doublets on the basis of total transcript counts was less effective (AUC=0.88), since the average human cell contained nearly twice as many transcripts as the average mouse cell; to achieve a recall of 90%, the precision dropped to just 15% (Fig. ...
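The metrics compared in this excerpt (ROC AUC, precision, and recall) can be computed from any per-cell score and a set of ground-truth labels. The sketch below uses plain Python and toy values that are not taken from the paper; the function names are illustrative. The AUC is computed as the probability that a randomly chosen doublet outscores a randomly chosen singlet, with ties counted as one half.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_recall(scores, labels, threshold):
    """Precision and recall when cells with score >= threshold are called doublets."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example: True marks a ground-truth doublet.
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [True, True, False, True, False, False]
auc = roc_auc(scores, labels)
precision, recall = precision_recall(scores, labels, threshold=0.5)
print(round(auc, 3), round(precision, 3), round(recall, 3))
```

Lowering the threshold raises recall at the cost of precision, which is the trade-off the excerpt reports for total transcript counts (15% precision at 90% recall).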
Context 8
... tested the Scrublet on a publicly available dataset consisting of a mixture of human (HEK293T) and mouse (NIH3T3) cells (Fig. 3A). This dataset, though not representative of most single-cell experiments, provides a useful test case because the differences between human and mouse genomic sequence provide an independent way to detect doublets [2,3]. We defined a partial "ground truth" on doublet identity according to whether a cell barcode associates with ...
Context 9
... most single-cell experiments, provides a useful test case because the differences between human and mouse genomic sequence provide an independent way to detect doublets [2,3]. We defined a partial "ground truth" on doublet identity according to whether a cell barcode associates with transcripts from both species (a doublet), or just one species (Fig. 3B). Because doublets arising from the encapsulation of two human or two mouse cells cannot be identified as such, we expected our doublet detector to correctly predict all "ground truth" labeled doublets, since they arise from distinct human and mouse cell ...
Context 10
... hiding species labels and restricting to orthologous genes ( Fig. 3C), Scrublet estimated the detectable (Type B) doublet fraction at ϕD = 54%, close to the 50% expected for cross-species doublets given equal input of mouse and human cells (Fig. 3D). Furthermore, the detector accurately identified human-mouse doublets with a receiver-operator characteristic (ROC) area under the curve (AUC) of 0.99 ...
Context 13
... mouse doublets with precision of 96%) (Fig. 3E,F). In contrast, predicting doublets on the basis of total transcript counts was less effective (AUC=0.88), since the average human cell contained nearly twice as many transcripts as the average mouse cell; to achieve a recall of 90%, the precision dropped to just 15% (Fig. ...
Context 14
... most highly variable genes with ≥3 counts in ≥5 cells (n=2,372 genes) and kept the first two PCs. Scrublet was run using k = 50, r = 10, and d = 0.12 (twice the observed rate of human-mouse doublets). To classify cells as singlets or doublets, a threshold was set by eye using the histogram of doublet scores for simulated doublets (Fig. ...

Similar publications

Article
Background: The merging of two divergent genomes during hybridization can result in the remodeling of parental gene expression in hybrids. A molecular basis underlying expression change in hybrids is regulatory divergence, which may change with the parental genetic divergence. However, there is still no unanimous conclusion for this hypothesis. Result...

Citations

... In Figures 1E, H, we presented the CNV frequency statistics for the 32 genes, indicating their chromosomal locations. Genes were located on chromosomes 1, 2, 3,4,6,7,9,10,11,12,14,15,17,19,20,22, and X, with most genes experiencing deletions as the primary CNV change. Additionally, we used the CGGA database to analyze the expression differences of the 32 disulfidptosis-associated genes in WHO II, III, and IV populations, revealing significant expression differences (p<0.001). ...
Article
Introduction: Glioblastoma (GBM) presents significant challenges due to its malignancy and limited treatment options. Precision treatment requires subtyping patients based on prognosis. Disulfidptosis, a novel cell death mechanism, is linked to aberrant glucose metabolism and disulfide stress, particularly in tumors expressing high levels of SLC7A11. The exploration of disulfidptosis may provide a new perspective for precise diagnosis and treatment of glioblastoma. Methods: Transcriptome sequencing was conducted on samples from GBM patients treated at Tiantan Hospital (January 2022 - December 2023). Data from CGGA and TCGA databases were collected. Consensus clustering based on disulfidptosis features categorized GBM patients into two subtypes (DRGclusters). Tumor immune microenvironment, response to immunotherapy, and drug sensitivity were analyzed. An 8-gene disulfidptosis-based subtype predictor was developed using LASSO machine learning algorithm and validated on CGGA dataset. Results: Patients in DRGcluster A exhibited improved overall survival (OS) compared to DRGcluster B. DRGcluster subtypes showed differences in tumor immune microenvironment and response to immunotherapy. The predictor effectively stratified patients into high and low-risk groups. Significant differences in IC50 values for chemotherapy and targeted therapy were observed between risk groups. Discussion: Disulfidptosis-based classification offers promise as a prognostic predictor for GBM. It provides insights into tumor immune microenvironment and response to therapy. The predictor aids in patient stratification and personalized treatment selection, potentially improving outcomes for GBM patients.
... Cells that satisfied any one of the following criteria were removed: (1) <300 detected genes; (2) outlier number of unique molecular identifiers (UMIs), ranging from 7,500 to 15,000; (3) outlier proportion of mitochondrial gene expression were excluded ranging from 2.5% to 15%. Outlier cutoffs for each batch of samples were determined empirically based on the distribution of UMI and proportion of mitochondrial gene expression per cell; or (4) doublets identified by the python package Scrublet [119]. Overall, this led to removal of 4.7% of cells, retaining 35,072 cells for downstream analyses. ...
Article
Chronic inflammation is often associated with the development of tissue fibrosis, but how mesenchymal cell responses dictate pathological fibrosis versus resolution and healing remains unclear. Defining stromal heterogeneity and identifying molecular circuits driving extracellular matrix deposition and remodeling stands to illuminate the relationship between inflammation, fibrosis, and healing. We performed single-cell RNA-sequencing of colon-derived stromal cells and identified distinct classes of fibroblasts with gene signatures that are differentially regulated by chronic inflammation, including IL-11–producing inflammatory fibroblasts. We further identify a transcriptional program associated with trans-differentiation of mucosa-associated fibroblasts and define a functional gene signature associated with matrix deposition and remodeling in the inflamed colon. Our analysis supports a critical role for the metalloprotease Adamdec1 at the interface between tissue remodeling and healing during colitis, demonstrating its requirement for colon epithelial integrity. These findings provide mechanistic insight into how inflammation perturbs stromal cell behaviors to drive fibroblastic responses controlling mucosal matrix remodeling and healing.
... snATAC_bmat). Using the binary accessibility matrix as input, doublet scores were computed using the scrublet module v0.2 (Python) 52 . Nuclei with a score above 0.4 were considered a doublet and were removed. ...
Article
During mouse embryonic development, pluripotent cells rapidly divide and diversify, yet the regulatory programs that define the cell repertoire for each organ remain ill-defined. To delineate comprehensive chromatin landscapes during early organogenesis, we mapped chromatin accessibility in 19,453 single nuclei from mouse embryos at 8.25 days post-fertilization. Identification of cell-type-specific regions of open chromatin pinpointed two TAL1-bound endothelial enhancers, which we validated using transgenic mouse assays. Integrated gene expression and transcription factor motif enrichment analyses highlighted cell-type-specific transcriptional regulators. Subsequent in vivo experiments in zebrafish revealed a role for the ETS factor FEV in endothelial identity downstream of ETV2 (Etsrp in zebrafish). Concerted in vivo validation experiments in mouse and zebrafish thus illustrate how single-cell open chromatin maps, representative of a mammalian embryo, provide access to the regulatory blueprint for mammalian organogenesis.
... Most multiplets were expected to be doublets, so we computationally identified doublet GEMs using an algorithm similar to the recently reported methods DoubletDetection (https://github.com/JonathanShor/ DoubletDetection), DoubletFinder (40) and Scrublet (41). We assumed that heterotypic doublets should cluster separately in gene space from the cell types that comprise them and that these clusters should be detectable if doublets appear in sufficient numbers. ...
Article
Understanding the physiology and pathology of an organ composed of a variety of cell populations depends critically on genome-wide information on each cell type. Here, we report single-cell transcriptome profiling of over 6,800 freshly dispersed anterior pituitary cells from postpubertal male and female rats. Six pituitary-specific cell types were identified based on known marker genes and characterized: folliculostellate cells and hormone-producing corticotrophs, gonadotrophs, thyrotrophs, somatotrophs, and lactotrophs. Also identified were endothelial and blood cells from the pituitary capillary network. The expression of numerous developmental and neuroendocrine marker genes in both folliculostellate and hormone-producing cells supports that they have a common origin. For several genes, the validity of transcriptome analysis was confirmed by qRT-PCR and single cell immunocytochemistry. Folliculostellate cells exhibit impressive transcriptome diversity, indicating their major roles in production of endogenous ligands and detoxification enzymes, and organization of extracellular matrix. Transcriptome profiles of hormone-producing cells also indicate contributions toward those functions, while also clearly demonstrating their endocrine function. This survey highlights many novel genetic markers contributing to pituitary cell type identity, sexual dimorphism, and function, and points to relationships between hormone-producing and folliculostellate cells.
... We applied Scrublet ( Wolock et al., 2018) to remove putative hybrid transcriptomes occurring when two or more cells enter the same microfluidic droplet and receive the same barcode. Scrublet assigns each measured transcriptome a 'doublet score', which indicates the likelihood of being a hybrid transcriptome. ...
Article
Tumor-infiltrating myeloid cells (TIMs) comprise monocytes, macrophages, dendritic cells, and neutrophils, and have emerged as key regulators of cancer growth. These cells can diversify into a spectrum of states, which might promote or limit tumor outgrowth but remain poorly understood. Here, we used single-cell RNA sequencing (scRNA-seq) to map TIMs in non-small-cell lung cancer patients. We uncovered 25 TIM states, most of which were reproducibly found across patients. To facilitate translational research of these populations, we also profiled TIMs in mice. In comparing TIMs across species, we identified a near-complete congruence of population structures among dendritic cells and monocytes; conserved neutrophil subsets; and species differences among macrophages. By contrast, myeloid cell population structures in patients’ blood showed limited overlap with those of TIMs. This study determines the lung TIM landscape and sets the stage for future investigations into the potential of TIMs as immunotherapy targets.
... For the detection of potential doublet cells, we first split the dataset of ~2 million cells into four equally sized subsets, and then applied the scrublet v.0.1 pipeline 58 to each subset with parameters (min_count = 3, min_cells = 3, vscore_percentile = 85, n_pc = 30, expected_doublet_rate = 0.06, sim_doublet_ratio = 2, n_neighbours = 30, scaling_method = 'log') for doublet score calculation. Cells with doublet score over 0.25 are annotated as detected doublets. ...
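The workflow in this excerpt (split the cells into four equally sized subsets, score each subset separately, then call cells with a score over 0.25 doublets) can be sketched as below. This is a hedged illustration, not the cited pipeline: `score_subset` is a hypothetical stand-in for whatever scoring tool is applied to each subset, and the contiguous split is an assumption, since the excerpt does not say how cells were assigned to subsets.

```python
def split_into_subsets(n_cells, n_subsets=4):
    """Split cell indices 0..n_cells-1 into contiguous, near-equal subsets."""
    base, extra = divmod(n_cells, n_subsets)
    subsets, start = [], 0
    for i in range(n_subsets):
        size = base + (1 if i < extra else 0)  # spread any remainder over the first subsets
        subsets.append(range(start, start + size))
        start += size
    return subsets


def annotate_doublets(n_cells, score_subset, threshold=0.25, n_subsets=4):
    """Score each subset independently and flag cells whose score exceeds the threshold."""
    is_doublet = [False] * n_cells
    for subset in split_into_subsets(n_cells, n_subsets):
        scores = score_subset(subset)  # hypothetical scorer: one score per cell in the subset
        for cell, score in zip(subset, scores):
            is_doublet[cell] = score > threshold
    return is_doublet


def toy_scores(subset):
    """Toy scorer used only for this example: every fifth cell gets a high score."""
    return [0.3 if cell % 5 == 0 else 0.1 for cell in subset]


flags = annotate_doublets(10, toy_scores)
print([i for i, flag in enumerate(flags) if flag])  # [0, 5]
```

Scoring subsets separately keeps each run small, but a fixed threshold such as 0.25 then applies to scores that were computed against different simulated-doublet backgrounds, so the score histograms of the subsets are worth comparing before pooling the calls.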
Article
Mammalian organogenesis is a remarkable process. Within a short timeframe, the cells of the three germ layers transform into an embryo that includes most of the major internal and external organs. Here we investigate the transcriptional dynamics of mouse organogenesis at single-cell resolution. Using single-cell combinatorial indexing, we profiled the transcriptomes of around 2 million cells derived from 61 embryos staged between 9.5 and 13.5 days of gestation, in a single experiment. The resulting ‘mouse organogenesis cell atlas’ (MOCA) provides a global view of developmental processes during this critical window. We use Monocle 3 to identify hundreds of cell types and 56 trajectories, many of which are detected only because of the depth of cellular coverage, and collectively define thousands of corresponding marker genes. We explore the dynamics of gene expression within cell types and trajectories over time, including focused analyses of the apical ectodermal ridge, limb mesenchyme and skeletal muscle.
... Finally, the approach in this paper only calculates the multiplet frequency; it does not actually identify the multiplets so that they can be removed from downstream analyses. For that purpose, other more sophisticated approaches have been developed (Ilicic et al., 2016; Stoeckius et al., 2017; Kang et al., 2018; Wolock, Lopez & Klein, 2018; DePasquale et al., 2018). Nonetheless, simply calculating the multiplet frequency from the data returned by standard pipelines such as the 10x cellranger is important for many purposes, and the results here enable that to be done regardless of the proportions at which the cell types are mixed. ...
Article
In single-cell RNA-sequencing, it is important to know the frequency at which the sequenced transcriptomes actually derive from multiple cells. A common method to estimate this multiplet frequency is to mix two different types of cells (e.g., human and mouse), and then determine how often the transcriptomes contain transcripts from both cell types. When the two cell types are mixed in equal proportion, the calculation of the multiplet frequency from the frequency of mixed transcriptomes is straightforward. But surprisingly, there are no published descriptions of how to calculate the multiplet frequency in the general case when the cell types are mixed unequally. Here, I derive equations to analytically calculate the multiplet frequency from the numbers of observed pure and mixed transcriptomes when two cell types are mixed in arbitrary proportions, under the assumption that the loading of cells into droplets or wells is Poisson.
Preprint
Motivation: Single-cell RNA sequencing (scRNA-seq) is a recent technology that has provided many valuable biological insights. Notable uses include identifying novel cell-types, measuring the cellular response to treatment, and tracking trajectories of distinct cell lineages in time. The raw data generated in this process typically amounts to hundreds of millions of sequencing reads and requires substantial computational infrastructure for downstream analysis, a major hurdle for a biological research lab. Fortunately, the preprocessing step that converts this huge sequence data into manageable cell-specific expression profiles is standardized and can be performed in the cloud. We demonstrate how a cloud-based computational framework can be used to transform the raw data into biologically interpretable cell-type-specific information, using either 3’ or 5’ transcriptome libraries from 10x Genomics. The processed data, which is an order of magnitude smaller in size, can be easily downloaded to a laptop for customized analysis to gain deeper biological insights. Results: We produced an automated and easily extensible pipeline in the cloud for the analysis of single-cell RNA-seq data which provides a convenient method to handle post-processing of scRNA sequencing using next generation sequencing platforms. The basic step provides the transformation of the scRNA-seq data to cell-type-specific expression profiles and computes the quality control metrics for the dataset. The extensibility of the platform is demonstrated by adding a doublet-removal algorithm and recomputing the clustering of the cells. Any additional computational steps that take a cell-type expression counts matrix as input can be easily added to this framework with minimal effort.
Availability: The framework and its installation documentation are available at the GitHub repository http://github.com/nj3252/CB-Source/ Contact: kyun@houstonmethodist.org Supplementary information: Supplementary data available at Bioinformatics online.
Article
Single cell RNA-seq has revolutionized transcriptomics by providing cell type resolution for differential gene expression and expression quantitative trait loci (eQTL) analyses. However, efficient power analysis methods for single cell data and inter-individual comparisons are lacking. Here, we present scPower; a statistical framework for the design and power analysis of multi-sample single cell transcriptomic experiments. We modelled the relationship between sample size, the number of cells per individual, sequencing depth, and the power of detecting differentially expressed genes within cell types. We systematically evaluated these optimal parameter combinations for several single cell profiling platforms, and generated broad recommendations. In general, shallow sequencing of high numbers of cells leads to higher overall power than deep sequencing of fewer cells. The model, including priors, is implemented as an R package and is accessible as a web tool. scPower is a highly customizable tool that experimentalists can use to quickly compare a multitude of experimental designs and optimize for a limited budget. scRNASeq data is revolutionizing our understanding of biological systems, but is still expensive to generate. Here, the authors present a statistical framework that facilitates informed multi-sample experimental design to reduce unnecessary costs and maximize the utility of the generated data.
Article
Compelling evidence supports vascular contributions to cognitive impairment and dementia (VCID) including Alzheimer’s disease (AD), but the underlying pathogenic mechanisms and treatments are not fully understood. Cis P-tau is an early driver of neurodegeneration resulting from traumatic brain injury, but its role in VCID remains unclear. Here, we found robust cis P-tau despite no tau tangles in patients with VCID and in mice modeling key aspects of clinical VCID, likely because of the inhibition of its isomerase Pin1 by DAPK1. Elimination of cis P-tau in VCID mice using cis-targeted immunotherapy, brain-specific Pin1 overexpression, or DAPK1 knockout effectively rescues VCID-like neurodegeneration and cognitive impairment in executive function. Cis mAb also prevents and ameliorates progression of AD-like neurodegeneration and memory loss in mice. Furthermore, single-cell RNA sequencing revealed that young VCID mice display diverse cortical cell type–specific transcriptomic changes resembling old patients with AD, and the vast majority of these global changes were recovered by cis-targeted immunotherapy. Moreover, purified soluble cis P-tau was sufficient to induce progressive neurodegeneration and brain dysfunction by causing axonopathy and conserved transcriptomic signature found in VCID mice and patients with AD with early pathology. Thus, cis P-tau might play a major role in mediating VCID and AD, and antibody targeting it may be useful for early diagnosis, prevention, and treatment of cognitive impairment and dementia after neurovascular insults and in AD.