Figure - available from: Nature Communications
Results on mouse brain cortex dataset
a Layout of input data matrices in the mouse brain cortex dataset. UMAP visualization of cell factors learned by scMoMaT, where cells are colored by b Leiden clusters (with scMoMaT-annotated cell types) and c cell batches. d The scores of marker genes for non-neuronal cell types, where the x-axis corresponds to Leiden clusters. e The scores of neuronal cell type marker genes in different clusters. The top-scoring clusters are colored red. Source data for b–e are provided in the Source Data file.


Source publication
Article
Single cell data integration methods aim to integrate cells across data batches and modalities, and data integration tasks can be categorized into horizontal, vertical, diagonal, and mosaic integration, where mosaic integration is the most general and challenging case with few methods developed. We propose scMoMaT, a method that is able to integrat...

Citations

... Cobolt [10] is a hierarchical Bayesian generative model that handles missing modalities by estimating posteriors for each unique modality availability setting. scMoMaT [27] pre-computes missing features before using matrix factorization in the integration step. moETM [28] deals with missing modalities by training a model to reconstruct one modality from another. ...
Preprint
Joint analysis of multi-omic single-cell data across cohorts has significantly enhanced the comprehensive analysis of cellular processes. However, most existing approaches for this purpose require access to samples with complete modality availability, which is impractical in many real-world scenarios. In this paper, we propose Single-Cell Cross-Cohort Cross-Category integration, a novel framework that learns unified cell representations under domain shift without requiring full-modality reference samples. Our generative approach learns rich cross-modal and cross-domain relationships that enable imputation of these missing modalities. Through experiments on real-world multi-omic datasets, we demonstrate that it offers a robust solution to single-cell tasks such as cell type clustering, cell type classification, and feature imputation.
... Recent advancements in sequencing technology promote the diversity of data modalities and extend our understanding beyond genomics to epigenetics, transcriptomics and proteomics, thus providing multi-modal insights 8,9. These breakthroughs have also raised new research questions such as reference mapping, perturbation prediction and multi-omic integration [10][11][12][13][14]. It is critical to develop, in parallel, methodologies capable of effectively harnessing, enhancing and adapting to the rapid expansion of sequencing data. ...
... scMultiomic integration. We benchmarked scGPT in two integration settings, paired and mosaic, against the recent scMultiomic integration methods Seurat (v.4) 44, scGLUE 13 and scMoMaT 14. In the paired data-integration experiment, we benchmarked scGPT with scGLUE 13 and Seurat (v.4) 44 on the 10x Multiome PBMC 43 dataset with RNA and ATAC-seq data as the first example. ...
Article
Generative pretrained models have achieved remarkable success in various domains such as language and computer vision. Specifically, the combination of large-scale diverse datasets and pretrained transformers has emerged as a promising approach for developing foundation models. Drawing parallels between language and cellular biology (in which texts comprise words; similarly, cells are defined by genes), our study probes the applicability of foundation models to advance cellular biology and genetic research. Using burgeoning single-cell sequencing data, we have constructed a foundation model for single-cell biology, scGPT, based on a generative pretrained transformer across a repository of over 33 million cells. Our findings illustrate that scGPT effectively distills critical biological insights concerning genes and cells. Through further adaptation of transfer learning, scGPT can be optimized to achieve superior performance across diverse downstream applications. This includes tasks such as cell type annotation, multi-batch integration, multi-omic integration, perturbation response prediction and gene network inference.
... Such an integration is necessary to overcome the current limitations in modality scalability and cost associated with currently accessible scMulti-omics sequencing technologies. Nevertheless, multimodal mosaic integration is quite challenging [56,65]. A key challenge involves addressing the diversity of modalities and handling technical variations across different batches. ...
Preprint
Obtaining positive and negative samples for examining multifaceted brain diseases in clinical trials faces significant challenges. We propose an innovative approach known as the Adaptive Conditional Graph Diffusion Convolution (ACGDC) model. This model is tailored for the fusion of single cell multi-omics data and the creation of novel samples. ACGDC customizes a new array of edge relationship categories to merge single cell sequencing data and pertinent meta-information gleaned from annotations. Afterward, it employs network node properties and neighborhood topological connections to reconstruct the relationship between edges and their properties among nodes. Ultimately, it generates novel single-cell samples via inverse sampling within the framework of a conditional diffusion model. To evaluate the credibility of the single cell samples generated through the new sampling approach, we conducted a comprehensive assessment. This assessment included comparisons between the generated samples and real samples across several criteria, including sample distribution space, enrichment analyses (GO term, KEGG term), clustering, and cell subtype classification, thereby allowing us to rigorously validate the quality and reliability of the single-cell samples produced by our novel sampling method. The outcomes of our study demonstrated the effectiveness of the proposed method in seamlessly integrating single-cell multi-omics data and generating innovative samples that closely mirrored both the spatial distribution and bioinformatic significance observed in real samples. Thus, we suggest that the generation of these reliable control samples by ACGDC holds substantial promise in advancing precision research on brain diseases. Additionally, it offers a valuable tool for classifying and identifying astrocyte subtypes.
... Mosaic integration methods are urgently needed to markedly expand the scale and modalities of integration, breaking through the modality scalability and cost limitations of existing scMulti-omics sequencing technologies. Most recently, scVAEIT [26], scMoMaT [27], StabMap [28], and Multigrate [29] have been proposed to tackle this problem. However, these methods are not capable of aligning modalities or correcting batches, which results in limited functions and performances. ...
... scMoMaT. scMoMaT [27] is designed to integrate multimodal mosaic data. The code is downloaded from https://github.com/PeterZZQ/scMoMaT. ...
Preprint
Abstract Rapidly developing single-cell multi-omics sequencing technologies generate increasingly large bodies of multimodal data. Integrating multimodal data from different sequencing technologies, i.e. mosaic data, permits larger-scale investigation with more modalities and can help to better reveal cellular heterogeneity. However, mosaic integration involves major challenges, particularly regarding modality alignment and batch effect removal. Here we present a deep probabilistic framework for the mosaic integration and knowledge transfer (MIDAS) of single-cell multimodal data. MIDAS simultaneously achieves dimensionality reduction, imputation, and batch correction of mosaic data by employing self-supervised modality alignment and information-theoretic latent disentanglement. We demonstrate its superiority to other methods and reliability by evaluating its performance in full trimodal integration and various mosaic tasks. We also constructed a single-cell trimodal atlas of human peripheral blood mononuclear cells (PBMCs), and tailored transfer learning and reciprocal reference mapping schemes to enable flexible and accurate knowledge transfer from the atlas to new data.
Article
Advances in molecular profiling have facilitated generation of large multi-modal datasets that can potentially reveal critical axes of biological variation underlying complex diseases. Distilling biological meaning, however, requires computational strategies that can perform mosaic integration across diverse cohorts and datatypes. Here, we present mosaicMPI, a framework for discovery of low to high-resolution molecular programs representing both cell types and states, and integration within and across datasets into a network representing biological themes. Using existing datasets in glioblastoma, we demonstrate that this approach robustly integrates single cell and bulk programs across multiple platforms. Clinical and molecular annotations from cohorts are statistically propagated onto this network of programs, yielding a richly characterized landscape of biological themes. This enables deep understanding of individual tumor samples, systematic exploration of relationships between modalities, and generation of a reference map onto which new datasets can rapidly be mapped. mosaicMPI is available at https://github.com/MorrissyLab/mosaicMPI.
Article
Background With the development of single-cell technology, many cell traits can be measured. Furthermore, multi-omics profiling technology can jointly measure two or more traits in a single cell. To process the rapidly accumulating data, computational methods for multimodal data integration are needed. Results Here, we present inClust+, a deep generative framework for multi-omics data. It is built on the previous inClust, which is specific to transcriptome data, and is augmented with two mask modules designed for multimodal data processing: an input-mask module in front of the encoder and an output-mask module behind the decoder. inClust+ was first used to integrate scRNA-seq and MERFISH data from similar cell populations, and to impute MERFISH data based on scRNA-seq data. Then, inClust+ was shown to be capable of integrating multimodal data (e.g. tri-modal data with gene expression, chromatin accessibility and protein abundance) with batch effects. Finally, inClust+ was used to integrate an unlabeled monomodal scRNA-seq dataset and two labeled multimodal CITE-seq datasets, transfer labels from the CITE-seq datasets to the scRNA-seq dataset, and generate the missing modality of protein abundance in the monomodal scRNA-seq data. In the above examples, the performance of inClust+ is better than or comparable to the most recent tools in the corresponding task. Conclusions inClust+ is a suitable framework for handling multimodal data. Meanwhile, the successful implementation of masks in inClust+ means that the same approach can be applied to other deep learning methods with a similar encoder-decoder architecture to broaden the application scope of these models.
Article
Integrating single-cell datasets produced by multiple omics technologies is essential for defining cellular heterogeneity. Mosaic integration, in which different datasets share only some of the measured modalities, poses major challenges, particularly regarding modality alignment and batch effect removal. Here, we present a deep probabilistic framework for the mosaic integration and knowledge transfer (MIDAS) of single-cell multimodal data. MIDAS simultaneously achieves dimensionality reduction, imputation and batch correction of mosaic data by using self-supervised modality alignment and information-theoretic latent disentanglement. We demonstrate its superiority to 19 other methods and reliability by evaluating its performance in trimodal and mosaic integration tasks. We also constructed a single-cell trimodal atlas of human peripheral blood mononuclear cells and tailored transfer learning and reciprocal reference mapping schemes to enable flexible and accurate knowledge transfer from the atlas to new data. Applications in mosaic integration, pseudotime analysis and cross-tissue knowledge transfer on bone marrow mosaic datasets demonstrate the versatility and superiority of MIDAS. MIDAS is available at https://github.com/labomics/midas.
Article
Background Single-cell RNA-sequencing (scRNA-seq) measures gene expression in single cells, while single-nucleus ATAC-sequencing (snATAC-seq) quantifies chromatin accessibility in single nuclei. These two data types provide complementary information for deciphering cell types and states. However, when analyzed individually, they sometimes produce conflicting results regarding cell type/state assignment. The power is compromised since the two modalities reflect the same underlying biology. Recently, it has become possible to measure both gene expression and chromatin accessibility from the same nucleus. Such paired data enable the direct modeling of the relationships between the two modalities. Given the availability of the vast amount of single-modality data, it is desirable to integrate the paired and unpaired single-modality datasets to gain a comprehensive view of the cellular complexity. Results We benchmark nine existing single-cell multi-omic data integration methods. Specifically, we evaluate to what extent the multiome data provide additional guidance for analyzing the existing single-modality data, and whether these methods uncover peak-gene associations from single-modality data. Our results indicate that multiome data are helpful for annotating single-modality data. However, we emphasize that the availability of an adequate number of nuclei in the multiome dataset is crucial for achieving accurate cell type annotation. Insufficient representation of nuclei may compromise the reliability of the annotations. Additionally, when generating a multiome dataset, the number of cells is more important than sequencing depth for cell type annotation. Conclusions Seurat v4 is the best currently available platform for integrating scRNA-seq, snATAC-seq, and multiome data even in the presence of complex batch effects.