Figure - available from: Nature Genetics
eQTL and colocalisation in relation to GTEx
a, Fraction of GTEx brain eGenes that could be assessed in each of the considered contexts (cell type-conditions; Methods). b, Fraction of GTEx brain eQTL that were replicated in this study (nominal p < 0.5; fraction relative to the set of assessed genes from a). c, Figure analogous to main text Fig. 4c, additionally including eQTL counts from a pseudobulk eQTL analysis (top red dot on the left, red square on the right; calculated using all untreated day 52 cells, pooled). d, Figure analogous to main text Fig. 5a, additionally including colocalisation results from a pseudobulk eQTL analysis (using all untreated day 52 cells, pooled). In the box plots, the middle line is the median and the lower and upper edges of the box denote the first and third quartiles, while the violin plots show the distribution. Legend: Astro: Astrocyte-like; DA: Dopaminergic neurons; Epen1: Ependymal-like1; FPP: Floor Plate Progenitors; P_FPP: Proliferating Floor Plate Progenitors; Sert: Serotonergic-like neurons.


Source publication
Article
Studying the function of common genetic variants in primary human tissues and during development is challenging. To address this, we use an efficient multiplexing strategy to differentiate 215 human induced pluripotent stem cell (iPSC) lines toward a midbrain neural fate, including dopaminergic neurons, and use single-cell RNA sequencing (scRNA-seq...

Citations

... Because scRNA-seq data has a cell-level resolution, it provides an opportunity to powerfully partition expression variation within and between cell types. This has recently become possible with the proliferation of population-scale scRNA-seq studies that contain hundreds of individuals 13,19-22. ...
... However, most studies infer cell types directly from the scRNA-seq data, such as the cell types in our PBMC data analysis, inducing some circularity; this is typically ignored 2,7,35 yet will deflate estimates of cell type-specific variance by construction. Third, CTMM assumes discrete cell types, whereas continuous cell types are more appropriate in some cases, e.g., when defined by pseudo-time 13,20,36 or degree of IFN stimulation 21. While incorporating continuous cell types is straightforward with overall pseudobulk data, it can only be expressed in cell type-specific pseudobulk data by discretizing the continuous cell types. ...
Article
Single-cell RNA sequencing (scRNA-seq) has been widely used to characterize cell types based on their average gene expression profiles. However, most studies do not consider cell type-specific variation across donors. Modelling this cell type-specific inter-individual variation could help elucidate cell type-specific biology and inform genes and cell types underlying complex traits. We therefore develop a new model to detect and quantify cell type-specific variation across individuals called CTMM (Cell Type-specific linear Mixed Model). We use extensive simulations to show that CTMM is powerful and unbiased in realistic settings. We also derive calibrated tests for cell type-specific interindividual variation, which is challenging given the modest sample sizes in scRNA-seq. We apply CTMM to scRNA-seq data from human induced pluripotent stem cells to characterize the transcriptomic variation across donors as cells differentiate into endoderm. We find that almost 100% of transcriptome-wide variability between donors is differentiation stage-specific. CTMM also identifies individual genes with statistically significant stage-specific variability across samples, including 85 genes that do not have significant stage-specific mean expression. Finally, we extend CTMM to partition interindividual covariance between stages, which recapitulates the overall differentiation trajectory. Overall, CTMM is a powerful tool to illuminate cell type-specific biology in scRNA-seq.
... Instead of culturing and characterizing each EpiSC line individually, pooled cell culture ("cell villages") approaches have emerged as a scalable approach for characterizing large panels of genetically distinct cell lines that can reduce technical variation (batch-to-batch variability), significantly lower reagent and labor costs, and enable population-scale perturbation experiments 28,55,56. For example, this approach has now been used in several studies to analyze cellular phenotypes in genetically diverse human PSC panels 27,45,57. This approach relies on the fact that each genetically distinct cell line is effectively "barcoded" by its unique genome sequence, allowing for assignment of cells assayed in single cell genomics experiments to specific individuals (genetic demultiplexing) 58. ...
... This could enable large-scale characterization of our DO EpiSC panel (i.e., for quality control purposes) as well as evaluation of molecular phenotypes across DO EpiSCs for quantitative trait locus mapping. However, it has been observed that a few cell lines may rapidly proliferate and overtake a pooled culture due to somatic mutations in tumor suppressors/oncogenes, epigenetic aberrations, aneuploidy, or via active mechanisms such as cell competition 55,59,60, thereby reducing the utility of this system to screen large numbers of donors, although this has not been observed in all studies employing pooled culture 56,57. Thus, we also needed to evaluate how well the representation of input cell lines in villages is maintained over time in culture. ...
... Systematic studies in mouse and human have revealed pervasive genetic background effects on the phenotypic penetrance of many gene mutations in vivo 15,66-68 and on the outcomes of in vitro directed differentiation 27,45,57,69. These data highlight a critical challenge in modern developmental and stem cell biology: understanding the contribution of genetic variation to quantitative phenotypic variation in developmental processes. ...
Preprint
The directed differentiation of pluripotent stem cells (PSCs) from panels of genetically diverse individuals is emerging as a powerful experimental system for characterizing the impact of natural genetic variation on developing cell types and tissues. Here, we establish new PSC lines and experimental approaches for modeling embryonic development in a genetically diverse, outbred mouse stock (Diversity Outbred mice). We show that a range of inbred and outbred PSC lines can be stably maintained in the primed pluripotent state (epiblast stem cells -- EpiSCs) and establish the contribution of genetic variation to phenotypic differences in gene regulation and directed differentiation. Using pooled in vitro fertilization, we generate and characterize a genetic reference panel of Diversity Outbred PSCs (n = 230). Finally, we demonstrate the feasibility of pooled culture of Diversity Outbred EpiSCs as "cell villages", which can facilitate the differentiation of large numbers of EpiSC lines for forward genetic screens. These data can complement and inform similar efforts within the stem cell biology and human genetics communities to model the impact of natural genetic variation on phenotypic variation and disease-risk.
... The first generation of single-cell eQTLs considered aggregated (typically, mean) gene expression from multiple cells per individual, using a so-called "pseudobulk" approach. This approach has provided important insights into the genetic basis of cell type-specific gene expression across several tissues 7-14, but has limitations. In particular, the pseudobulk approach does not appropriately model the intra-individual cell-to-cell variability. ...
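As a rough illustration of the aggregation step described in this excerpt, the following minimal Python sketch computes a mean "pseudobulk" profile per donor. The function name `pseudobulk_mean`, the donor labels, and the toy counts are hypothetical and are not taken from any of the cited studies; real pipelines typically also normalize counts before aggregating.

```python
from collections import defaultdict


def pseudobulk_mean(cells):
    """Average per-cell expression into one profile per donor.

    cells: iterable of (donor_id, {gene: count}) pairs, one pair per cell.
    A gene missing from a cell's dict is treated as a count of zero.
    Returns {donor_id: {gene: mean count across that donor's cells}}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    n_cells = defaultdict(int)
    for donor, counts in cells:
        n_cells[donor] += 1
        for gene, value in counts.items():
            sums[donor][gene] += value
    return {
        donor: {gene: total / n_cells[donor] for gene, total in genes.items()}
        for donor, genes in sums.items()
    }


# Hypothetical toy data: two cells from donor_A, one cell from donor_B.
toy_cells = [
    ("donor_A", {"TH": 4, "GAPDH": 10}),
    ("donor_A", {"TH": 0, "GAPDH": 14}),
    ("donor_B", {"TH": 2, "GAPDH": 8}),
]
profiles = pseudobulk_mean(toy_cells)
```

In this toy input the two donor_A cells differ in TH (4 and 0), yet the averaged profile keeps only a single value per donor and gene. This is the loss of intra-individual cell-to-cell variability that the excerpt points to.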
Preprint
Understanding the genetic basis of gene expression can help us understand the molecular underpinnings of human traits and disease. Expression quantitative trait locus (eQTL) mapping can help in studying this relationship, but eQTLs have been shown to be very cell-type specific, motivating the use of single-cell RNA sequencing and single-cell eQTLs to obtain a more granular view of genetic regulation. Current methods for single-cell eQTL mapping either rely on the “pseudobulk” approach and traditional pipelines for bulk transcriptomics or do not scale well to large datasets. Here, we propose SAIGE-QTL, a robust and scalable tool that can directly map eQTLs using single-cell profiles without needing aggregation at the pseudobulk level. Additionally, SAIGE-QTL allows for testing the effects of less frequent/rare genetic variation through set-based tests, which is traditionally excluded from eQTL mapping studies. We evaluate the performance of SAIGE-QTL on both real and simulated data and demonstrate the improved power for eQTL mapping over existing pipelines.
... thereby adding 'noise' to cellular phenotypes and reducing the ability to identify genetic associations 22,23. However, the availability of large stem cell banks (for example, HipSci 24 and iPSCORE 25), robotic handling of cell cultures and single-cell technologies 26 has helped curtail some of these limitations and provide a platform for innovative population genomics studies using hPS cells 14-16. ...
... are located in intergenic regions and probably contribute to disease via changes in genome regulation 11,12. To complicate matters further, there is growing evidence that many of these loci exert their effects in a cell-type- or context-dependent manner (for example, varying physiological or environmental stimuli) 13-16. ...
... Many single-cell computational methods exist for classifying cell types, such as placing cells in a defined manifold (distribution and organization of cells based on their gene expression profiles or other molecular features) and inferring cell states 40. Therefore, single-cell methods are uniquely suited to identify context-dependent effects, where the relationship between genotype and environment can be investigated for individual cells 14-16. ...
Article
Human pluripotent stem (hPS) cells can, in theory, be differentiated into any cell type, making them a powerful in vitro model for human biology. Recent technological advances have facilitated large-scale hPS cell studies that allow investigation of the genetic regulation of molecular phenotypes and their contribution to high-order phenotypes such as human disease. Integrating hPS cells with single-cell sequencing makes identifying context-dependent genetic effects during cell development or upon experimental manipulation possible. Here we discuss how the intersection of stem cell biology, population genetics and cellular genomics can help resolve the functional consequences of human genetic variation. We examine the critical challenges of integrating these fields and approaches to scaling them cost-effectively and practically. We highlight two areas of human biology that can particularly benefit from population-scale hPS cell studies, elucidating mechanisms underlying complex disease risk loci and evaluating relationships between common genetic variation and pharmacotherapeutic phenotypes.
... We found that the donor abundance estimated by Vireo-bulk is perfectly matched with the cell numbers obtained at single-cell resolution (R² = 0.997, Fig. 2d). Similar high consistency was also observed when performing the same analysis on an even larger donor pool (n = 18) where iPS cells were differentiated toward neurons (Fig. 2e) 11. To test whether the accuracy of Vireo-bulk demultiplexing will be influenced by cell types and their composition, we further performed synthetic analyses by keeping or removing a certain cell type from the PBMC scRNA-seq data that was used in Fig. 2b-c. ...
Article
Disease modeling with isogenic Induced Pluripotent Stem Cell (iPSC)-differentiated organoids serves as a powerful technique for studying disease mechanisms. Multiplexed coculture is crucial to mitigate batch effects when studying the genetic effects of disease-causing variants in differentiated iPSCs or organoids, and demultiplexing at the single-cell level can be conveniently achieved by assessing natural genetic barcodes. Here, to enable cost-efficient time-series experimental designs via multiplexed bulk and single-cell RNA-seq of hybrids, we introduce a computational method in our Vireo Suite, Vireo-bulk, to effectively deconvolve pooled bulk RNA-seq data by genotype reference, and thereby quantify donor abundance over the course of differentiation and identify differentially expressed genes among donors. Furthermore, with multiplexed scRNA-seq and bulk RNA-seq, we demonstrate the usefulness and necessity of a pooled design to reveal donor iPSC line heterogeneity during macrophage cell differentiation and to model rare WT1 mutation-driven kidney disease with chimeric organoids. Our work provides an experimental and analytic pipeline for dissecting disease mechanisms with chimeric organoids.
... Single-cell RNA-sequencing (scRNA-seq) has become a powerful approach to simultaneously quantify the transcription of hundreds or even thousands of features (genes, transcripts, exons) at an unprecedented resolution. These high-throughput transcriptomic profiling assays have helped scientists to study important biological questions, for example, cellular heterogeneity, dynamics of cellular processes and pathways, novel cell type discovery, and cell fate decisions and differentiation [1-4]. ...
Article
Background: Normalization is a critical step in the analysis of single-cell RNA-sequencing (scRNA-seq) datasets. Its main goal is to make gene counts comparable within and between cells. To do so, normalization methods must account for technical and biological variability. Numerous normalization methods have been developed addressing different sources of dispersion and making specific assumptions about the count data. Main body: The selection of a normalization method has a direct impact on downstream analysis, for example differential gene expression and cluster identification. Thus, the objective of this review is to guide the reader in making an informed decision on the most appropriate normalization method to use. To this aim, we first give an overview of the different single cell sequencing platforms and methods commonly used including isolation and library preparation protocols. Next, we discuss the inherent sources of variability of scRNA-seq datasets. We describe the categories of normalization methods and include examples of each. We also delineate imputation and batch-effect correction methods. Furthermore, we describe data-driven metrics commonly used to evaluate the performance of normalization methods. We also discuss common scRNA-seq methods and toolkits used for integrated data analysis. Conclusions: According to the correction performed, normalization methods can be broadly classified as within and between-sample algorithms. Moreover, with respect to the mathematical model used, normalization methods can further be classified into: global scaling methods, generalized linear models, mixed methods, and machine learning-based methods. Each of these methods has pros and cons and makes different statistical assumptions. However, no single normalization method performs best. Instead, metrics such as silhouette width, K-nearest neighbor batch-effect test, or Highly Variable Genes are recommended to assess the performance of normalization methods.
... Studies using postmortem human tissues, while delivering important insight, often fall short of capturing the full spectrum of dynamic regulatory effects because they reflect predefined adult tissue contexts, and because most studies have utilized bulk sequencing. Recent advances in single-cell technologies have started to shift this paradigm by enabling researchers to collect a heterogeneous biological sample and disentangle context-specific regulatory variation through downstream analysis of single-cell molecular phenotypes 15-20. Still, many contexts are difficult to sample from healthy human donors, particularly dynamic contexts where we would like to capture multiple time points from the same individual. ...
... These in vitro systems are each imperfect reflections of human cell biology, but the activation of a range of relevant cis-regulatory elements can reveal the effects of variants within them even without completely recapitulating in vivo cellular state. Indeed, studies in these systems have captured regulatory effects of numerous disease-associated loci and variants near genes involved in developmental processes 15,16,21. However, most protocols would require separate experimental setups for each cell-type or perturbation of interest, making it difficult to efficiently explore the space of disease-relevant contexts. ...
... First, we compiled manually curated gene lists containing marker genes from each stage of cardiomyocyte differentiation (Methods, Supplementary Table S7) 17. We used a cell-scoring tool (scDRS) 36 to identify subpopulations of cells with enriched expression for gene sets specific to the cardiomyocyte trajectory, and applied principal components analysis to infer pseudotime (Methods, Fig. 3A) 16,37. The first expression principal component using data from these cardiomyocyte trajectory cells offers a reasonable pseudotime metric (Fig. 3B), as it captures sequential expression of marker genes for each stage of cardiomyocyte differentiation (NANOG, an iPSC marker gene; MIXL1, mesendoderm; MESP1, mesoderm; GATA4, cardiac progenitor; and TNNT2, cardiomyocyte) 38. ...
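The idea of using the first principal component as a pseudotime axis can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the cited study's pipeline: the expression matrix is invented, the helper `first_pc` is hypothetical, power iteration stands in for a full PCA routine, and orienting the axis so that the late marker TNNT2 has a positive loading is an assumption made here for readability.

```python
import math


def first_pc(matrix, n_iter=100):
    """Return (scores, loadings) for the first principal component.

    matrix: list of cells, each a list of expression values (same gene order).
    Uses power iteration on the covariance structure of the centered data.
    """
    n_cells, n_genes = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_cells for j in range(n_genes)]
    centered = [[row[j] - means[j] for j in range(n_genes)] for row in matrix]
    # Non-uniform start vector, to avoid starting orthogonal to the first PC.
    v = [float(j + 1) for j in range(n_genes)]
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(scores[i] * centered[i][j] for i in range(n_cells))
             for j in range(n_genes)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
    return scores, v


# Hypothetical toy cells, listed from early to late stage.
genes = ["NANOG", "MESP1", "TNNT2"]
toy_cells = [
    [9, 0, 0],
    [6, 3, 0],
    [2, 6, 1],
    [0, 3, 6],
    [0, 0, 9],
]
pseudotime, loadings = first_pc(toy_cells)
# The sign of a principal component is arbitrary; flip it so that the
# late marker (TNNT2) increases along the axis.
if loadings[genes.index("TNNT2")] < 0:
    pseudotime = [-s for s in pseudotime]
    loadings = [-w for w in loadings]
```

On this toy matrix the resulting scores order the cells from the NANOG-high state to the TNNT2-high state. Real data would need normalization and a check that the first component actually tracks the marker sequence, as the excerpt describes.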
Preprint
Identifying the molecular effects of human genetic variation across cellular contexts is crucial for understanding the mechanisms underlying disease-associated loci, yet many cell-types and developmental stages remain underexplored. Here we harnessed the potential of heterogeneous differentiating cultures (HDCs), an in vitro system in which pluripotent cells asynchronously differentiate into a broad spectrum of cell-types. We generated HDCs for 53 human donors and collected single-cell RNA-sequencing data from over 900,000 cells. We identified expression quantitative trait loci in 29 cell-types and characterized regulatory dynamics across diverse differentiation trajectories. This revealed novel regulatory variants for genes involved in key developmental and disease-related processes while replicating known effects from primary tissues, and dynamic regulatory effects associated with a range of complex traits.
... To investigate the relationships between pleiotropic genes and phenotypes, we utilized summary data-based Mendelian randomization (SMR) analyses. This involved employing MAMT alongside eQTL data, which encompassed brain tissues from BrainMeta, blood samples from the eQTLGen consortium, and data on nine specific brain cell types derived from both sources 54-57. The SMR methodology integrates information from GWAS and QTL research to assess pleiotropic connections between gene expression and complex traits of interest 58. ...
Preprint
This investigation elucidates the genetic connection between major depressive disorder (MD) and metabolic syndrome (MetS), uncovering bidirectional interactions and shared pleiotropic genes. Leveraging a comprehensive genome-wide association study (GWAS) dataset from European and East Asian populations, we discovered new genetic markers linked to MD and enhanced the robustness of genetic associations via cross-trait analysis. Moreover, the study harnessed computational strategies for drug repurposing, highlighting the potential of Cytochrome P450 and HDAC inhibitors as novel treatments for MD and MetS. Employing BLISS technology, we pinpointed proteins significantly linked to both conditions, advancing our comprehension of their molecular underpinnings. Through Mendelian randomization, we investigated how diverse dietary patterns across populations influence MD and MetS, shedding light on the relationship between diet and disease susceptibility. This research not only enriches our understanding of the intersecting biological pathways of MD and MetS but also opens avenues for innovative preventive and therapeutic measures.
... To identify cell types within each cluster, a comprehensive manual annotation was performed. A list of marker genes for different cell types was collected from the PanglaoDB database 70 and from a literature-curated set of relevant marker genes 71-75. Then it was compared with cluster-specific markers identified by the "FindAllMarkers" function (iteratively comparing one cluster against all the others) from the Seurat package. ...
Preprint
Adeno-associated viral (AAV) vector-based gene therapy is gaining foothold as a treatment option for a variety of genetic neurological diseases with encouraging clinical results. Nonetheless, dose-dependent toxicities and severe adverse events have emerged in recent clinical trials through mechanisms that remain unclear. We have modelled here the impact of AAV transduction in the context of cell models of the human central nervous system (CNS), taking advantage of induced pluripotent stem cell-based technologies. Our work uncovers vector-induced cell-intrinsic innate immune mechanisms that contribute to apoptosis in 2D and 3D models. While empty AAV capsids were well tolerated, the AAV genome triggered p53-dependent DNA damage responses across CNS cell types followed by induction of IL-1R- and STING-dependent inflammatory responses. In addition, transgene expression led to MAVS-dependent signaling and activation of type I interferon (IFN) responses. Cell-intrinsic and paracrine apoptosis onset could be prevented by inhibiting p53 or acting downstream of STING- and IL-1R-mediated responses. Activation of DNA damage, type I IFN and CNS inflammation were confirmed in vivo , in a mouse model. Together, our work identifies the cell-autonomous innate immune mechanisms of vector DNA sensing that can potentially contribute to AAV-associated neurotoxicity.
... We conducted robustness testing of the CDSKNN parameters using these datasets and evaluated the adaptability of the different methods to imbalanced data. The second group of datasets included multiple single-cell datasets from various studies [13,31-38]; employed diverse library preparation methods; and involved different tissues from humans or mice, such as the hypothalamus, peripheral blood, and heart, with cell numbers ranging from 8,000 to 1.46 million. We assessed the universality and operational efficiency of CDSKNN using these datasets. ...
Article
Background: Accurate and efficient cell grouping is essential for analyzing single-cell transcriptome sequencing (scRNA-seq) data. However, the existing clustering techniques often struggle to provide timely and accurate cell type groupings when dealing with datasets with large-scale or imbalanced cell types. Therefore, there is a need for improved methods that can handle the increasing size of scRNA-seq datasets while maintaining high accuracy and efficiency. Methods: We propose CDSKNNXMBD (Community Detection based on a Stable K-Nearest Neighbor Graph Structure), a novel single-cell clustering framework integrating partition clustering algorithm and community detection algorithm, which achieves accurate and fast cell type grouping by finding a stable graph structure. Results: We evaluated the effectiveness of our approach by analyzing 15 tissues from the human fetal atlas. Compared to existing methods, CDSKNN effectively counteracts the high imbalance in single-cell data, enabling effective clustering. Furthermore, we conducted comparisons across multiple single-cell datasets from different studies and sequencing techniques. CDSKNN is of high applicability and robustness, and capable of balancing the complexities across diverse types of data. Most importantly, CDSKNN exhibits higher operational efficiency on datasets at the million-cell scale, requiring an average of only 6.33 min for clustering 1.46 million single cells, saving 33.3% to 99% of running time compared to those of existing methods. Conclusions: The CDSKNN is a flexible, resilient, and promising clustering tool that is particularly suitable for clustering imbalanced data and demonstrates high efficiency on large-scale scRNA-seq datasets.