Fengzhu Sun

University of Southern California | USC · Department of Biological Sciences


363 Publications
39,907 Reads
11,905 Citations


Publications (363)
Chapter
Full-text available
Metagenomic Hi-C (metaHi-C) enables the recognition of relationships between contigs in terms of their physical proximity within the same cell, facilitating the reconstruction of high-quality metagenome-assembled genomes (MAGs) from complex microbial communities. However, current Hi-C-based contig binning methods solely depend on Hi-C interactions...
Article
Full-text available
Sequence classification facilitates a fundamental understanding of the structure of microbial communities. Binary metagenomic sequence classifiers are insufficient because environmental metagenomes are typically derived from multiple sequence sources. Here we introduce a deep-learning based sequence classifier, DeepMicroClass, that classifies metag...
Article
Full-text available
The human microbiome, comprising microorganisms residing within and on the human body, plays a crucial role in various physiological processes and has been linked to numerous diseases. To analyze microbiome data, it is essential to account for inherent heterogeneity and variability across samples. Normalization methods have been proposed to mitigat...
Article
Full-text available
Contig binning plays a crucial role in metagenomic data analysis by grouping contigs from the same or closely related genomes. However, existing binning methods face challenges in practical applications due to the diversity of data types and the difficulties in efficiently integrating heterogeneous information. Here, we introduce COMEBin, a binning...
Preprint
Full-text available
The human microbiome, comprising microorganisms residing within and on the human body, plays a crucial role in various physiological processes and has been linked to numerous diseases. To analyze microbiome data, it is essential to account for inherent heterogeneity and variability across samples. Normalization methods have been proposed to mitigat...
Article
Full-text available
Metagenomic Hi-C (metaHi-C) can identify contig-to-contig relationships with respect to their proximity within the same physical cell. Shotgun libraries in metaHi-C experiments can be constructed by next-generation sequencing (short-read metaHi-C) or more recent third-generation sequencing (long-read metaHi-C). However, all existing metaHi-C analys...
Article
Full-text available
Ulcerative colitis (UC) is an immune-mediated inflammation of the colonic mucosa. Gut microbiota dysbiosis may play a significant role in disease pathogenesis by causing shifts in metabolomic profiles within the gut. To identify differences and trends in the metabolomic profile of paediatric UC patients pre- and post-faecal microbiota transplants (...
Article
Local associations refer to spatial–temporal correlations that emerge from the biological realm, such as time-dependent gene co-expression or seasonal interactions between microbes. One can reveal the intricate dynamics and inherent interactions of biological systems by examining the biological time series data for these associations. To accomplish...
Article
Full-text available
The introduction of high-throughput chromosome conformation capture (Hi-C) into metagenomics enables reconstructing high-quality metagenome-assembled genomes (MAGs) from microbial communities. Despite recent advances in recovering eukaryotic, bacterial, and archaeal genomes using Hi-C contact maps, few Hi-C-based methods are designed to retrieve...
Article
Full-text available
Binning aims to recover microbial genomes from metagenomic data. For complex metagenomic communities, the available binning methods are far from satisfactory: they usually do not fully use different types of features and important biological knowledge. We developed a novel ensemble binner, MetaBinner, which generates component results with multipl...
Article
Full-text available
Early cancer detection by cell-free DNA faces multiple challenges: low fraction of tumor cell-free DNA, molecular heterogeneity of cancer, and sample sizes that are not sufficient to reflect diverse patient populations. Here, we develop a cancer detection approach to address these challenges. It consists of an assay, cfMethyl-Seq, for cost-effectiv...
Article
Full-text available
Background: Chronic infection with hepatitis B virus (HBV) has been shown to be highly associated with the development of hepatocellular carcinoma (HCC). Aims: The purpose of the study is to investigate the association between HBV preS region quasispecies and HCC development, as well as to develop an HCC diagnosis model using HBV preS region quasispecies....
Article
Full-text available
Motivation: Phage-host associations play important roles in microbial communities. But in natural communities, where phages are discovered and characterized metagenomically (as opposed to culture-based lab studies), their hosts are generally not known. Several programs have been developed for predicting which phage infects which host based on variou...
Article
Full-text available
The association of colorectal cancer (CRC) and the human gut microbiome dysbiosis has been the focus of several studies in the past. Many bacterial taxa have been shown to have differential abundance among CRC patients compared to healthy controls. However, the relationship between CRC and non-bacterial gut microbiome such as the gut virome is unde...
Article
Full-text available
Motivation: Metagenomic binning aims to retrieve microbial genomes directly from ecosystems by clustering metagenomic contigs assembled from short reads into draft genomic bins. Traditional shotgun-based binning methods depend on the contigs' composition and abundance profiles and are impaired by the paucity of samples to construct reliable...
Article
Full-text available
Dysbiosis of human gut microbiota has been reported in association with ulcerative colitis (UC) in both children and adults using either 16S rRNA gene or shotgun sequencing data. However, these studies used either 16S rRNA or metagenomic shotgun sequencing but not both. We sequenced feces samples from 19 pediatric UC and 23 healthy children ages be...
Article
Full-text available
Evaluating metagenomic software is key for optimizing metagenome interpretation and is the focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 ne...
Article
Full-text available
Recovering high-quality metagenome-assembled genomes (MAGs) from complex microbial ecosystems remains challenging. Recently, high-throughput chromosome conformation capture (Hi-C) has been applied to simultaneously study multiple genomes in natural microbial communities. We develop HiCBin, a novel open-source pipeline, to resolve high-quality MAGs...
Article
Full-text available
High-throughput chromosome conformation capture (Hi-C) has recently been applied to natural microbial communities and revealed great potential to study multiple genomes simultaneously. Several extraneous factors may influence chromosomal contacts rendering the normalization of Hi-C contact maps essential for downstream analyses. However, the curren...
Preprint
Full-text available
Sequence classification reduces the complexity of metagenomes and facilitates a fundamental understanding of the structure and function of microbial communities. Binary metagenomic classifiers offer an insufficient solution because environmental metagenomes are typically derived from multiple sequence sources, including prokaryotes, eukaryotes and...
Preprint
Full-text available
Binning is an essential procedure during metagenomic data analysis. However, the available individual binning methods usually do not simultaneously fully use different features or biological information. Furthermore, it is challenging to integrate multiple binning results efficiently and effectively. Therefore, we developed an ensemble binner, Meta...
Article
Full-text available
Background: It is known that patients with ulcerative colitis (UC) have reduced numbers of short-chain fatty acid (SCFA) producing bacteria and reduced SCFA concentration in feces. There is also evidence that Hispanic patients have increased incidence of UC and increased likelihood of developing disease at a younger age. To understand why this mig...
Article
Full-text available
Background: Patients with ulcerative colitis (UC) have an increased risk of Clostridioides difficile infection (CDI). There is a well-documented relationship between bile acids and CDI. Aims: To evaluate faecal bile acid profiles and gut microbial changes associated with CDI in children with UC. Methods: This study was conducted at Children's Hospit...
Article
Full-text available
Antibiotic resistance in bacteria limits the effect of corresponding antibiotics, and the classification of antibiotic resistance genes (ARGs) is important for the treatment of bacterial infections and for understanding the dynamics of microbial communities. Although several methods have been developed to classify ARGs, none of them work well when...
Chapter
Next generation sequencing (NGS) technologies make it possible to sequence a large number of metagenomes economically and efficiently using either 16S rRNA gene or whole metagenome shotgun sequencing. Metagenome comparison plays essential roles in understanding the contributions of environmental factors on the composition and functions of different...
Preprint
Full-text available
High-throughput chromosome conformation capture (Hi-C) has recently been applied to natural microbial communities and revealed great potential to study multiple genomes simultaneously. Several extraneous factors may influence chromosomal contacts rendering the normalization of Hi-C contact maps essential for downstream analyses. However, the curren...
Article
Motivation: The rapid development of sequencing technologies has enabled us to generate a large number of metagenomic reads from genetic materials in microbial communities, making it possible to gain deep insights into understanding the differences between the genetic materials of different groups of microorganisms, such as bacteria, viruses, plasmi...
Article
Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for larg...
Article
Full-text available
Metagenomic sequencing has greatly enhanced the discovery of viral genomic sequences; however, it remains challenging to identify the host(s) of these new viruses. We developed VirHostMatcher-Net, a flexible, network-based, Markov random field framework for predicting virus–prokaryote interactions using multiple, integrated features: CRISPR sequenc...
Article
Full-text available
Background: The recent development of metagenomic sequencing makes it possible to massively sequence microbial genomes including viral genomes without the need for laboratory culture. Existing reference-based and gene homology-based methods are not efficient in identifying unknown viruses or short viral sequences from metagenomic data. Methods: Here...
Article
Full-text available
Alignment-free methods, more time and memory efficient than alignment-based methods, have been widely used for comparing genome sequences or raw sequencing samples without assembly. However, in this study, we show that alignment-free dissimilarity calculated based on sequencing samples can be overestimated compared with the dissimilarity calculated...
Article
Motivation: Metagenomic contig binning is an important computational problem in metagenomic research, which aims to cluster contigs from the same genome into the same group. Unlike the classical clustering problem, contig binning can utilize known relationships among some of the contigs or the taxonomic identity of some contigs. However, the current s...
Article
Motivation: Detecting sequences containing repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for various types of repeat detection tasks. An efficient generic method for detecting most types of repetitive sequences is still desirable. Inspired by the excellent properties and successful ap...
Article
Full-text available
Comparing metagenomic samples is a critical step in understanding the relationships among microbial communities. Recently, next-generation sequencing (NGS) technologies have produced a massive amount of short reads data for microbial communities from different environments. The assembly of these short reads can, however, be time-consuming and chall...
Article
Full-text available
Following publication of the original paper [1], Dr. Nayfach kindly pointed out an error and the authors would like to report the following correction.
Article
Full-text available
We develop a metagenomic data analysis pipeline, MicroPro, that takes into account all reads from known and unknown microbial organisms and associates viruses with complex diseases. We utilize MicroPro to analyze four metagenomic datasets relating to colorectal cancer, type 2 diabetes, and liver cirrhosis and show that including reads from...
Article
Full-text available
Background: Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. Results: Here, we present a community resource (http://afproject.or...
Preprint
Full-text available
Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. Here, we present a community resource (http://afproject.org) to establish standard...
Article
In metagenomic studies of microbial communities, the short reads come from mixtures of genomes. Read assembly is usually an essential first step for the follow-up studies in metagenomic research. Understanding the power and limitations of various read assembly programs in practice is important for researchers to choose which programs to use in thei...
Preprint
Full-text available
Metagenomic sequencing has greatly enhanced the discovery of viral genomic sequences; however it remains challenging to identify the host(s) of these new viruses. We developed VirHostMatcher-Net, a flexible, network-based, Markov random field framework for predicting virus-host interactions using multiple, integrated features: CRISPR sequences, seq...
Article
Full-text available
Background: The application of genomic data and bioinformatics for the identification of restricted or illegally-sourced natural products is urgently needed. The taxonomic identity and geographic provenance of raw and processed materials have implications in sustainable-use commercial practices, and relevance to the enforcement of laws that regulate...
Preprint
The recent development of metagenomic sequencing makes it possible to sequence microbial genomes including viruses in an environmental sample. Identifying viral sequences from metagenomic data is critical for downstream virus analyses. The existing reference-based and gene homology-based methods are not efficient in identifying unknown viruses or s...
Article
Full-text available
Comparing metagenomic samples is crucial for understanding microbial communities. For different groups of microbial communities, such as human gut metagenomic samples from patients with a certain disease and healthy controls, identifying group-specific sequences offers essential information for potential biomarker discovery. A sequence that is pres...
Data
LC-specific 40-mers and sequences.
Data
Detailed descriptions of method and results.
Article
Full-text available
Horizontal gene transfer (HGT) plays an important role in the evolution of microbial organisms including bacteria. Alignment-free methods based on single genome compositional information have been used to detect HGT. Currently, Manhattan and Euclidean distances based on tetranucleotide frequencies are the most commonly used alignment-free dissimila...
Article
Full-text available
Genome and metagenome comparisons based on large amounts of next-generation sequencing (NGS) data pose significant challenges for alignment-based approaches due to the huge data size and the relatively short length of the reads. Alignment-free approaches based on the counts of word patterns in NGS data do not depend on the complete genome and are g...
Article
Full-text available
Background: Metagenomics sequencing provides deep insights into microbial communities. To investigate their taxonomic structure, binning assembled contigs into discrete clusters is critical. Many binning algorithms have been developed, but their performance is not always satisfactory, especially for complex microbial communities, calling for furth...
Article
Full-text available
High-throughput technologies have led to large collections of different types of biological data that provide unprecedented opportunities to unravel molecular heterogeneity of biological processes. Nevertheless, how to jointly explore data from multiple sources into a holistic, biologically meaningful interpretation remains challenging. In this wor...
Article
The postsynaptic density (PSD) contains a collection of scaffold proteins used for assembling synaptic signaling complexes. However, it is not known how the core-scaffold machinery associates in protein-interaction networks or how proteins encoded by genes involved in complex brain disorders are distributed through spatiotemporal protein complexes....
Article
Full-text available
Disruption of healthy microbial communities has been linked to numerous diseases, yet microbial interactions are little understood. This is due in part to the large number of bacteria, and the much larger number of interactions (easily in the millions), making experimental investigation very difficult at best and necessitating the nascent field of...
Article
Full-text available
Protein domains can be viewed as portable units of biological function that define the functional properties of proteins. Therefore, if a protein is associated with a disease, protein domains might also be associated and define disease endophenotypes. However, knowledge about such domain-disease relationships is rarely available. Thus, identificat...
Article
Full-text available
Background: Local trend (i.e. shape) analysis of time series data reveals co-changing patterns in dynamics of biological systems. However, slow permutation procedures to evaluate the statistical significance of local trend scores have limited its applications to high-throughput time series data analysis, e.g., data from the next generation sequenc...
Article
Full-text available
Interactions among microbes and stratification across depths are both believed to be important drivers of microbial communities, though little is known about how microbial associations differ between and across depths. We have monitored the free-living microbial community at the San Pedro Ocean Time-series station, monthly, for a decade, at five di...
Article
Full-text available
Next Generation Sequencing (NGS) technologies generate large amounts of short read data for many different organisms. The fact that NGS reads are generally short makes it challenging to assemble the reads and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is...
Article
Full-text available
Motivation: Biological network comparison software largely relies on the concept of alignment where close matches between the nodes of two or more networks are sought. These node matches are based on sequence similarity and/or interaction patterns. However, because of the incomplete and error-prone datasets currently available, such methods have ha...
Article
Full-text available
The DiseaseConnect (http://disease-connect.org) is a web server for analysis and visualization of a comprehensive knowledge on mechanism-based disease connectivity. The traditional disease classification system groups diseases with similar clinical symptoms and phenotypic traits. Thus, diseases with entirely different pathologies could be grouped t...
Article
Full-text available
Background: The comparison of samples, or beta diversity, is one of the essential problems in ecological studies. Next generation sequencing (NGS) technologies make it possible to obtain large amounts of metagenomic and metatranscriptomic short read sequences across many microbial communities. De novo assembly of the short reads can be especially...
Article
Full-text available
With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from differen...
Article
Full-text available
Recently a range of new statistics have become available for the alignment-free comparison of two sequences based on k-tuple word content. Here we extend these statistics to the simultaneous comparison of more than two sequences. Our suite of statistics contains, firstly, $C_l^{*}$ and $C_l^{S}$, extensions of statistics for pairwise co...
Article
Full-text available
With the rapid development of biotechnologies, many types of biological data including molecular networks are now available. However, to obtain a more complete understanding of a biological system, the integration of molecular networks with other data, such as molecular sequences, protein domains and gene expression profiles, is needed. A key to th...
Article
Full-text available
Next-generation sequencing (NGS) technologies have generated enormous amounts of shotgun read data, and assembly of the reads can be challenging, especially for organisms without template sequences. We study the power of genome comparison based on shotgun read data without assembly using three alignment-free sequence comparison statistics,...
Article
Full-text available
Background: Genome-wide association studies (GWAS) have identified many common polymorphisms associated with complex traits. However, these associated common variants explain only a small fraction of the phenotypic variances, leaving a substantial portion of genetic heritability unexplained. As a result, searches for "missing" heritability are drawi...
Data
Supplementary materials. Supplementary methods and results.
Article
Full-text available
Background: Sequence signatures, as defined by the frequencies of k-tuples (or k-mers, k-grams), have been used extensively to compare genomic sequences of individual organisms, to identify cis-regulatory modules, and to study the evolution of regulatory sequences. Recently many next-generation sequencing (NGS) read data sets of metagenomic samples...
Data
Supplementary Materials for “Comparison of Metagenomic Samples Using Sequence Signatures”.
Article
A class of one-way isothermal mass transfer processes is investigated in this paper. Based on the definition of mass entransy, the entransy dissipation function, which reflects the irreversibility of the mass transfer ability loss, is derived. The optimality condition for the minimum entransy dissipation of the mass transfer process with a generali...
Article
A thermodynamic model for an open inverse Brayton cycle (refrigeration or heat pump cycle) with pressure drop irreversibilities is established. There are seven flow resistances (or pressure drops) encountered by the working fluid stream for the inverse Brayton cycle. Two of these, the friction through the blades and vanes of the compressor and the...
Article
Based on the optimal ecological performance parameters of a heat pump with linear phenomenological heat transfer law between working fluid and heat reservoirs, the local stability analysis of the endoreversible heat pump working in an ecological regime is studied. The steady state of the heat pump working at the maximum ecological function is stead...
Article
Full-text available
Motivation: Local similarity analysis of biological time series data helps elucidate the varying dynamics of biological systems. However, its applications to large scale high-throughput data are limited by slow permutation procedures for statistical significance evaluation. Results: We developed a theoretical approach to approximate the statisti...
Article
The optimal configuration of the expansion process of a heated ideal gas inside a cylinder for maximum work output with a movable piston and time-dependent heat conductance is determined in this paper. The heat conductance of cylinder walls is not a constant, but depends on the time-dependent heat transfer surface area of the walls in contact with...
Article
A re-analysis of the ‘tree-shaped network’ constructal method for triangular-shaped electronics is presented. The high effective conduction channel distribution has been re-optimized by using a triangular elemental area, without the premise that the new-order assembly construct must be assembled by the optimized last-order construct. A more optimal...
