Article

CAP3: A DNA sequence assembly program

Abstract

We describe the third generation of the CAP sequence assembly program. The CAP3 program includes a number of improvements and new features. The program has a capability to clip 5' and 3' low-quality regions of reads. It uses base quality values in computation of overlaps between reads, construction of multiple sequence alignments of reads, and generation of consensus sequences. The program also uses forward-reverse constraints to correct assembly errors and link contigs. Results of CAP3 on four BAC data sets are presented. The performance of CAP3 was compared with that of PHRAP on a number of BAC data sets. PHRAP often produces longer contigs than CAP3 whereas CAP3 often produces fewer errors in consensus sequences than PHRAP. It is easier to construct scaffolds with CAP3 than with PHRAP on low-pass data with forward-reverse constraints.
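The abstract says CAP3 clips low-quality 5' and 3' regions of reads but does not give the rule it uses. As a minimal sketch of the general idea only, assuming a simple per-base threshold (the function name `clip_low_quality` and the cutoff `min_q=20` are hypothetical and not taken from CAP3), end clipping can look like this:

```python
def clip_low_quality(quals, min_q=20):
    """Return (start, end) of the retained region, with end exclusive.

    Hypothetical rule, not CAP3's actual algorithm: drop bases from the
    5' end and from the 3' end while their quality is below min_q.
    """
    start = 0
    end = len(quals)
    # Walk inward from the 5' end past low-quality bases.
    while start < end and quals[start] < min_q:
        start += 1
    # Walk inward from the 3' end past low-quality bases.
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return start, end


read = "NNACGN"
quals = [5, 5, 30, 30, 30, 5]
start, end = clip_low_quality(quals)
clipped = read[start:end]  # "ACG"
```

A read whose bases all fall below the cutoff yields an empty region (`start == end`), which a caller would treat as a discarded read.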

... This leads, however, to computationally prohibitive inference for moderately large values of N and we shall describe in Section II a novel approach that tractably approximates the exact noninformative permutation prior in (4). The UCS-MMV problem studied in this paper is encountered in many applications as discussed in the sequel: ...
... Similarly, the multi-target tracking (MTT) problem consists in estimating the target states (e.g., locations, velocities) from randomly shuffled measurements such as in radar applications [3], wherein knowing the order of the measurements is critical to correctly assign them to targets. • Genomics: under the finite chemical bases alphabet $\{A, T, G, C\}$ constraint, genome assembly [4] is a reconstruction problem of DNA sequences $\mathbf{X} = (x_1, \ldots, x_M)$ from their assembled and permuted sub-vector measurements $\mathbf{Y} = (y_1, \ldots$ ...
... Indeed, the columns/rows of any $(N \times N)$ permutation matrix form a canonical basis for $\mathbb{R}^N$ and hence must be captured by a non-separable prior. While the exact inference of permutation matrices from bilinear models as in (1) is known to be NP-hard [9], one of the key ideas embodied by this paper is to design two coupled denoisers for the columns and rows of $\mathbf{U}$ so as to overcome the intractability associated with the noninformative permutation prior in (4). To that end, we split the variable $\mathbf{U}$ into two auxiliary variables $\mathbf{U}^+$ and $\mathbf{U}^-$ with equality constraint in between. ...
Preprint
This paper introduces an algorithmic solution to a broader class of unlabeled sensing problems with multiple measurement vectors (MMV). The goal is to recover an unknown structured signal matrix, $\mathbf{X}$, from its noisy linear observation matrix, $\mathbf{Y}$, whose rows are further randomly shuffled by an unknown permutation matrix $\mathbf{U}$. A new Bayes-optimal unlabeled compressed sensing (UCS) recovery algorithm is developed from the bilinear approximate message passing (Bi-VAMP) framework using non-separable and coupled priors on the rows and columns of the permutation matrix $\mathbf{U}$. In particular, standard unlabeled sensing is a special case of the proposed framework, and UCS further generalizes it by neither assuming a partially shuffled signal matrix $\mathbf{X}$ nor a small-sized permutation matrix $\mathbf{U}$. For the sake of theoretical performance prediction, we also conduct a state evolution (SE) analysis of the proposed algorithm and show its consistency with the asymptotic empirical mean-squared error (MSE). Numerical results demonstrate the effectiveness of the proposed UCS algorithm and its advantage over state-of-the-art baseline approaches in various applications. We also numerically examine the phase transition diagrams of UCS, thereby characterizing the detectability region as a function of the signal-to-noise ratio (SNR).
... At each step of the inner iteration, consensus sequences are generated using the soft-clipped bases from the reads that map near the ends of the contigs to extend contig sequences (see the "Extending incomplete supertranscripts" Section below). This is done until the number of iterations reaches a user-defined threshold (default value: 30) or no further improvement is observed. Once the inner iteration is completed, ROAST performs BLAST [28] searches using the blastn algorithm within the assembly to identify potential overlap between different contigs. ...
... In addition, ROAST extracts unmapped mates only for those reads that have less than 25% (user-defined parameter; default value: 25) soft-clipped bases. The unmapped mates are then re-assembled using CAP3 [30] and the contig is extended by stitching it to the newly assembled sequence based on the overlapping edges (Additional file 1: Figs. S3 and S4). ...
... To distinguish chimeric positions from exon-exon split boundaries, ROAST uses Cufflinks v2.2.1 [42] on the alignment file generated by HISAT2. The CAP3 assembler [30] is used to generate consensus sequences from partially mapped reads and to construct assemblies from unmapped reads for the extension of partial supertranscripts. ...
Article
Background Transcriptomic studies involving organisms for which reference genomes are not available typically start by generating de novo transcriptome or supertranscriptome assembly from the raw RNA-seq reads. Assembling a supertranscriptome is, however, a challenging task due to significantly varying abundance of mRNA transcripts, alternative splicing, and sequencing errors. As a result, popular de novo supertranscriptome assembly tools generate assemblies containing contigs that are partially-assembled, fragmented, false chimeras or have local mis-assemblies leading to decreased assembly accuracy. Commonly available tools for assembly improvement rely primarily on running BLAST using closely related species making their accuracy and reliability conditioned on the availability of the data for closely related organisms. Results We present ROAST, a tool for optimization of supertranscriptome assemblies that uses paired-end RNA-seq data from Illumina sequencing platform to iteratively identify and fix assembly errors solely using the error signatures generated by RNA-seq alignment tools including soft-clips, unexpected expression coverage, and reads with mates unmapped or mapped on a different contig to identify and fix various supertranscriptome assembly errors without performing BLAST searches against other organisms. Evaluation results using simulated as well as real datasets show that ROAST significantly improves assembly quality by identifying and fixing various assembly errors. Conclusion ROAST provides a reference-free approach to optimizing supertranscriptome assemblies highlighting its utility in refining de novo supertranscriptome assemblies of non-model organisms.
... Given a full-rank matrix $A^* \in \mathbb{R}^{m \times n}$ and a vector $y^* \in \mathbb{R}^s$ satisfying $m \ge s > n > 0$, the unlabeled sensing problem [63,64] asks, for an unknown vector $\xi^* \in \mathbb{R}^n$, whether $\xi^*$ is unique and how to recover it efficiently when one knows only the vector $y^* \in \mathbb{R}^s$ consisting of $s$ shuffled entries of $A^* \xi^*$. This problem emerges from various fields of natural science and engineering, such as biology [24,55,1,38], neuroscience [45], computer vision [17,41,61,26,35] and communication networks [44,27,57]. ...
... Lasserre, Laurent, and Rostalski gave explicit rank conditions that guarantee finding all real solutions of the polynomial system (24) [30,31,32] by the semidefinite relaxation method. Let $d = \max_{j=1,\ldots,m} d_j$; for $t \ge d$, define the convex set ...
Preprint
We study the unlabeled sensing problem that aims to solve a linear system of equations $A x =\pi(y) $ for an unknown permutation $\pi$. For a generic matrix $A$ and a generic vector $y$, we construct a system of polynomial equations whose unique solution satisfies $ A\xi^*=\pi(y)$. In particular, $\xi^*$ can be recovered by solving the rank-one moment matrix completion problem. We propose symbolic and numeric algorithms to compute the unique solution. Some numerical experiments are conducted to show the efficiency and robustness of the proposed algorithms.
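To make the problem statement $A x = \pi(y)$ concrete, the sketch below solves a tiny instance by exhaustive search over permutations. This is not the symbolic or moment-matrix method of the abstract; it is a swapped-in brute-force baseline that costs $m!$ least-squares solves and is only usable for very small $m$. The helper names and the example data are hypothetical, and the solver is restricted to two unknowns to stay short.

```python
from itertools import permutations


def least_squares_two_unknowns(A, b):
    """Least-squares solution of A x = b for a matrix A with two columns."""
    # Entries of the normal equations (A^T A) x = A^T b.
    s11 = sum(r[0] * r[0] for r in A)
    s12 = sum(r[0] * r[1] for r in A)
    s22 = sum(r[1] * r[1] for r in A)
    t1 = sum(r[0] * v for r, v in zip(A, b))
    t2 = sum(r[1] * v for r, v in zip(A, b))
    det = s11 * s22 - s12 * s12  # nonzero when A has full column rank
    x = ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)
    residual = sum((r[0] * x[0] + r[1] * x[1] - v) ** 2 for r, v in zip(A, b))
    return x, residual


def brute_force_unlabeled(A, y):
    """Try every ordering of y and keep the one with the smallest residual."""
    best = None
    for perm in permutations(y):
        x, residual = least_squares_two_unknowns(A, perm)
        if best is None or residual < best[1]:
            best = (x, residual, perm)
    return best


# Hypothetical instance: the true x is (2, -1), so A x = (2, -1, 1, 0),
# and y is a shuffled copy of that vector.
A = [(1, 0), (0, 1), (1, 1), (1, 2)]
y = (0, 2, 1, -1)
x_hat, residual, ordering = brute_force_unlabeled(A, y)
```

For this instance only one ordering of `y` lies in the column space of `A`, so the search recovers the original vector and its ordering with zero residual.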
... The newly found reads are then used to extend the contig if they overlap with one of the contig ends by at least 20 nucleotides and with at least 85% sequence identity. Genseed-HMM was run with three different assemblers-CAP3, Newbler and SPAdes [89][90][91]. The resulting contigs formed the input for a super-assembly using CAP3. ...
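The extension criterion quoted above (an overlap of at least 20 nucleotides with at least 85% identity at a contig end) can be sketched as follows. This is an assumption-laden illustration, not Genseed-HMM's implementation: it compares the contig's 3' end with the read's 5' end without gaps, whereas a real pipeline would typically rely on an alignment tool. The function names are hypothetical.

```python
def end_overlap(contig, read, min_len=20, min_identity=0.85):
    """Length of the longest ungapped overlap between the end of `contig`
    and the start of `read` that meets both thresholds, or 0 if none."""
    max_len = min(len(contig), len(read))
    for length in range(max_len, min_len - 1, -1):
        tail = contig[-length:]
        head = read[:length]
        matches = sum(a == b for a, b in zip(tail, head))
        if matches / length >= min_identity:
            return length
    return 0


def extend_contig(contig, read, **thresholds):
    """Append the non-overlapping part of `read` if it passes the criterion."""
    n = end_overlap(contig, read, **thresholds)
    return contig + read[n:] if n else contig


shared = "ACGT" * 5                   # 20-nucleotide overlap region
contig = "GG" + shared
read = shared + "TT"
extended = extend_contig(contig, read)   # contig grows by "TT"
unchanged = extend_contig(contig, "T" * 22)  # no qualifying overlap
```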
... 3.11.1) and CAP3 (version data: 12/21/07) were used [90,91,95]. The resulting assemblies were handled with SeqKit [96]. ...
Article
Virus discovery by genomics and metagenomics empowered studies of viromes, facilitated characterization of pathogen epidemiology, and redefined our understanding of the natural genetic diversity of viruses with profound functional and structural implications. Here we employed a data-driven virus discovery approach that directly queries unprocessed sequencing data in a highly parallelized way and involves a targeted viral genome assembly strategy in a wide range of sequence similarity. By screening more than 269,000 datasets of numerous authors from the Sequence Read Archive and using two metrics that quantitatively assess assembly quality, we discovered 40 nidoviruses from six virus families whose members infect vertebrate hosts. They form 13 and 32 putative viral subfamilies and genera, respectively, and include 11 coronaviruses with bisegmented genomes from fishes and amphibians, a giant 36.1 kilobase coronavirus genome with a duplicated spike glycoprotein (S) gene, 11 tobaniviruses and 17 additional corona-, arteri-, cremega-, nanhypo- and nangoshaviruses. Genome segmentation emerged in a single evolutionary event in the monophyletic lineage encompassing the subfamily Pitovirinae . We recovered the bisegmented genome sequences of two coronaviruses from RNA samples of 69 infected fishes and validated the presence of poly(A) tails at both segments using 3’RACE PCR and subsequent Sanger sequencing. We report a genetic linkage between accessory and structural proteins whose phylogenetic relationships and evolutionary distances are incongruent with the phylogeny of replicase proteins. We rationalize these observations in a model of inter-family S recombination involving at least five ancestral corona- and tobaniviruses of aquatic hosts. In support of this model, we describe an individual fish co-infected with members from the families Coronaviridae and Tobaniviridae . 
Our results expand the scale of the known extraordinary evolutionary plasticity in nidoviral genome architecture and call for revisiting fundamentals of genome expression, virus particle biology, host range and ecology of vertebrate nidoviruses.
... All of the contigs generated are combined and 28 short contigs are filtered. Finally, an overlap graph based assembler, either CAP3 (Huang and Madan, 1999) or Minimo from the AMOS package (Treangen et al., 2011), is used to generate the final contigs. ...
... Currently, long reads are also starting to be included in viral reconstruction pipelines, and the integration of these assemblers is of significant importance. Additionally, some assembly tools are included directly or as an option in some of the described pipelines, such as MEGAHIT, Velvet (Zerbino and Birney, 2008), MetaVelvet (Namiki et al., 2012), SOAP-denovo2 (Luo et al., 2012), ABySS (Simpson et al., 2009), CAP3 (Huang and Madan, 1999), Minimo (Treangen et al., 2011) and SGA (Simpson and Durbin, 2012). ...
... These PCR products with overlapping regions were sequenced by 1st BASE DNA Sequencing Division, Malaysia. The sequenced DNA fragments were analysed and assembled into a whole circular genome using CAP3 (de novo assembly) (Huang and Madan, 1999). The draft DNA assembly was compared with a reference sequence in order to validate and annotate the newly assembled viral genome organization. ...
Article
This study investigated the genomic characteristics of Acheta domesticus volvovirus (AdVVV) in a commercial cricket farming operation. BLAST analysis of the Acheta domesticus genome assembly identified sequences with high similarity to the AdVVV-Japan genome, suggesting AdVVV presence. PCR confirmed AdVVV infection in the A. domesticus breeding population from Nakhon Ratchasima farm, Thailand. The complete 2,516 nucleotide AdVVV-Thailand genome was reconstructed through targeted primer amplification and sequencing. It contained four open reading frames encoding hypothetical proteins, with a characteristic hairpin structure at the termini, consistent with other AdVVV isolates. Phylogenetic analysis revealed AdVVV-Thailand’s closer genetic affiliation with AdVVV-Japan compared to other isolates. Comparative analysis of coding sequences across five AdVVV isolates showed the highest variability in the hypothetical protein/putative capsid protein ORF1, with 64 variable sites out of 1086 bases, suggesting its significance in genetic diversity. In contrast, ORF2, ORF3, and ORF4 exhibited minimal variability. The majority of variations were singletons, with 85.33% confined to ORF1. This study confirmed AdVVV presence in a commercial cricket farm, reconstructed the AdVVV-Thailand genome, provided insights into its phylogeny and genetic diversity across isolates, highlighting the putative capsid protein’s role in driving variability. These findings enhance understanding of AdVVV genomics and evolutionary dynamics within cricket populations.
... The trimmed reads that passed quality control filters were mapped to the Amel_Hav3.1 honeybee genome, resulting in a total of 7350 mapped reads corresponding to the HVR region (Hav3.1_CM009933.2:11771976-11772119). The CAP3 Sequence Assembly Program (-o 40, -p 90) [23] was used to cluster the reads and generate the contigs. CAP3 output was qualitatively filtered. ...
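The options quoted above map onto a CAP3 command line in which `-o` sets the overlap length cutoff and `-p` the overlap percent identity cutoff. A minimal sketch of such a call, assuming a `cap3` executable on the `PATH` and a hypothetical input file name `hvr_reads.fa` (neither is stated in the snippet), might be:

```python
import shutil
import subprocess


def cap3_command(fasta_path, overlap_len=40, percent_identity=90):
    """Build the argument list for a CAP3 run with -o and -p set."""
    return [
        "cap3", str(fasta_path),
        "-o", str(overlap_len),        # overlap length cutoff
        "-p", str(percent_identity),   # overlap percent identity cutoff
    ]


def run_cap3(fasta_path, **options):
    """Run CAP3 and return its stdout report (requires cap3 to be installed)."""
    if shutil.which("cap3") is None:
        raise FileNotFoundError("cap3 executable not found on PATH")
    result = subprocess.run(
        cap3_command(fasta_path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


cmd = cap3_command("hvr_reads.fa")  # hypothetical file name
```

To my understanding, CAP3 writes its results next to the input file (for example `hvr_reads.fa.cap.contigs` and `hvr_reads.fa.cap.singlets`); the filtering step mentioned in the snippet is not specified and is therefore not sketched here.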
Article
In Apis mellifera, csd is the primary gene involved in sex determination: haploid hemizygous eggs develop as drones, while females develop from eggs heterozygous for the csd gene. If diploid eggs are homozygous for the csd gene, diploid drones will develop, but will be eaten by worker bees before they are born. Therefore, high csd allelic diversity is a priority for colony survival and breeding. This study aims to investigate the variability of the hypervariable region (HVR) of the csd gene in bees sampled in an apiary under a selection scheme. To this end, an existing dataset of 100 whole-genome sequences was analyzed with a validated pipeline based on de novo assembly of sequences within the HVR region. In total, 102 allelic sequences were reconstructed and translated into amino acid sequences. Among these, 47 different alleles were identified, 44 of which had previously been observed, while 3 are novel alleles. The results show a high variability in the csd region in this breeding population of honeybees.
... For the purpose of de novo transcriptome assembly, Trinity (version 2.11.0) was employed with default settings [17,18]. In order to reduce redundancy within the assembly, CAP3 [19] was utilized. Furthermore, to achieve a reduction in transcript redundancy and generate distinct genes (referred to as "unigenes"), the CD-HIT program (version 4.8.1) was employed. ...
Article
In the realm of food nutritional security, the development of mineral-rich grains assumes a pivotal role in combating malnutrition. Within the scope of the current investigation, we endeavoured to discern the transcripts accountable for the improved accumulation of grain-Fe within Indian barnyard millet. This pursuit entailed transcriptome sequencing of genotypes BAR-1433 (with high Fe content) and BAR-1423 (with low Fe content) during two distinct stages of spike development—spike emergence and milking stage. In the context of spike emergence, we identified a cohort of 895 up-regulated transcripts and 126 down-regulated transcripts that delineated the difference between the high and low grain-Fe genotypes. In contrast, during the milking stage, the tally of up-regulated transcripts reached 436, while down-regulated transcripts numbered 285. The transcripts that consistently ascended in both developmental stages underwent functional annotation, aligning their roles with nucleolar proteins, metal-nicotianamine transporters, ribonucleoprotein complexes, vinorine synthases, cellulose synthases, auxin response factors, embryogenesis abundant proteins, cytochrome c oxidases, and zinc finger BED domain-containing proteins. Meanwhile, a heterogeneous spectrum of transcripts exhibited differential expression and upregulation throughout the distinct stages. These transcripts encompassed various facets, such as ABC Transporter family proteins, Calcium-dependent kinase family, Ferritin, Metal ion binding, Iron-sulfur cluster binding, Cytochrome family, Zinc finger transcription factor family, Ferredoxin–NADP reductase type 1 family, Putative laccase, Multicopper oxidase family, and Terpene synthase family. To authenticate the reliability of these transcripts, six contigs representing probable functions, including metal transporters, iron sulfur coordination, metal ion binding, auxin-responsive GH3-like protein 2, and cytochrome P450 71B16, were harnessed for primer design. 
Subsequently, these primers were utilized in the validation process through qRT-PCR, with the outcomes aligning harmoniously with the transcriptome results. This study chronicles a constellation of genes linked to elevated iron content within barnyard millet, showcasing a proof of concept for leveraging transcriptome insights in marker-assisted selection to fortify barnyard millet with iron. This marks the inaugural comprehensive transcriptome analysis delineating transcripts associated with varying levels of grain-iron content during the panicle developmental stages within the barnyard millet paradigm.
... (Bankevich et al., 2012). The assembly produced by SPAdes was subjected to re-assembly using the overlap-layout-consensus-based assembler CAP3, version year 2007 (Huang & Madan, 1999). CAP3 produced the best assembly based on assembly size, largest contig size, N50 value and least number of contigs. ...
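The N50 value used above as an assembly metric is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly size. A minimal sketch, with a hypothetical function name and toy contig lengths rather than data from the cited study:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the sum to half the total.
        if 2 * running >= total:
            return length


toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # total 54, half is 27
toy_n50 = n50(toy_lengths)  # 10 + 9 + 8 = 27, so N50 is 8
```

The other criteria named in the snippet (assembly size, largest contig, number of contigs) are `sum(lengths)`, `max(lengths)` and `len(lengths)` on the same list.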
Article
Spilosoma obliqua nucleopolyhedrovirus (SpobNPV) is known as a biocontrol agent against S. obliqua, a polyphagous insect. The genome of an isolate designated as SpobMNPV was sequenced and found to have 136,141 bp, 139 putative open reading frames (ORFs) on both sense (48%) and anti-sense (52%) strands and 97.91% nucleotide similarity with Hyphantria cunea nucleopolyhedrovirus (HycuNPV). All the 38 core genes of baculoviruses were identified and validated in the SpobMNPV genome, which differed from the SpobNPV–Manipur isolate in several aspects. In the SpobMNPV genome, 7 hrs were found, comprising 2 to 16 repeated units of 67 bp at each site with an imperfect 30-bp palindrome near the centre in both orientations. Comparison of consensus palindrome sequences (hrcons) present in hrs with that of selected alphabaculovirus group I NPVs revealed them to be completely conserved at each side of the hrcons, that is 1-GxTTTxC-7 and 22-TxGxAAAxC-30. Based on phylogenetic analysis of 38 core genes, SpobMNPV was found closest to HycuNPV in the group I alphabaculoviruses. The complete genome of this isolate is being reported for the first time from North India. The information on genome analysis of SpobMNPV will be an addition to the available database on alphabaculoviruses and will also accelerate research on SpobNPV as a component of integrated management of S. obliqua in many economically important crops.
... Briefly, adapter trimming and selection of high-quality paired reads was performed with Trimmomatic (Bolger et al., 2014); paired reads were merged with FLASH (Magoc & Salzberg, 2011) and demultiplexing was performed with a custom script. Assembly into contigs was performed with CAP3 (Huang & Madan, 1999) and contigs were aligned with Lastz (Harris, 2007) to the set of ca 1400 reference UCEs. Alignment of individual UCEs was performed with MAFFT (linsi option; Katoh & Standley, 2013). ...
Article
The circumscription of the family Ormyridae (Hymenoptera: Chalcidoidea) is revised after phylogenetic analysis based on ultra‐conserved elements (UCEs) and comparative morphological assessment of the chalcid ‘Gall Clade’. Six genera are treated in the family, including two new genera, Halleriaphagus van Noort and Burks, gen. nov ., and Ouma Mitroiu, gen. nov. One genus, Eubeckerella Narendran, is re‐assigned to the family, and Ormyrulus Bouček is synonymised with Ormyrus Westwood, syn. nov ., resulting in the new combination Ormyrus gibbus (Bouček), comb. nov. The six genera are classified in three subfamilies, two of which are newly described, Asparagobiinae van Noort, Burks, Mitroiu and Rasplus, subfam. nov., and Hemadinae van Noort, Burks, Mitroiu and Rasplus, subfam. nov. Halleriaphagus is established for the newly described type species Halleriaphagus phagolucida van Noort and Burks, sp. nov ., and Ouma is erected for O. daleskeyae Mitroiu, sp. nov. , and O. emazantsi Mitroiu, sp. nov. Asparagobius is revised with description of Asparagobius bouceki van Noort, sp. nov. , and Asparagobius copelandi Rasplus and van Noort, sp. nov. Asparagobius and Halleriaphagus are classified in Asparagobiinae, Hemadas in Hemadinae and Eubeckerella , Ormyrus and Ouma in Ormyrinae. The molecular support defining the ormyrid clade is corroborated by the proposed morphological synapomorphy of a foliaceous prepectus overlying the tegula base. Identification keys to the genera of Ormyridae and to the species of Asparagobius and Ouma are provided. Online Lucid identification keys and images of all the species treated herein are available at: http://www.waspweb.org . Zoobank Registration: LSID urn:lsid:zoobank.org:pub:8811695B-EE57-4C18-A6B6-E63D267E2373 .
... Apply the scaffolding contig algorithm of CLC Genomics Workbench (version: 6.0.4) to perform de novo splicing (word size=45, minimum contig length>=200) to obtain primary UniGene. Use CAP3 [18] splicing software to splice the primary UniGene for the second time to obtain the final UniGene. ...
... Subsequently, automated DNA sequencing was performed by the National Center for Genomic Sequencing-CNSG (Medellin-Colombia) using PF and 1492R primers. The sequence reads obtained were edited and assembled using the CAP3 software (Huang and Madan 1999). The sequence was submitted to GenBank to search for similar sequences with the EzTaxon-e server (http:// www. ...
Article
Three extremophile bacterial strains (BBCOL-009, BBCOL-014 and BBCOL-015), capable of degrading high concentrations of perchlorate at a range of pH (6.5 to 10.0), were isolated from Colombian Caribbean Coast sediments. Morphologically, the strains are Gram negative bacilli with average sizes of 1.75 × 0.95, 2.32 × 0.65 and 3.08 × 0.70 μm, respectively. The reported strains tolerate a wide range of pH (6.5 to 10.0) and of NaCl (3.5 to 7.5% w/v) and KClO4 (250 to 10000 mg/L) concentrations, and reduce perchlorate by 10 to 25%. LB broth with NaCl (3.5–30% w/v) and KClO4 (250–10000 mg/L) was used in independent trials to evaluate susceptibility to salinity and perchlorate, respectively. Isolates increased their biomass at 7.5% (w/v) NaCl with optimal development at 3.5% NaCl. Subsequently, ClO4⁻ reduction was assessed using LB medium with 3.5% NaCl and 10000 mg/L ClO4⁻. BBCOL-009, BBCOL-014 and BBCOL-015 achieved 10%, 17%, and 25% reduction of ClO4⁻, respectively. The 16S rRNA gene sequence grouped them as Bacillus flexus T6186-2, Bacillus marisflavi TF-11 (T), and Bacillus vietnamensis 15-1 (T), respectively, with < 97.5% homology. In addition, antimicrobial resistance to ertapenem, vancomycin, amoxicillin clavulanate, penicillin, and erythromycin was present in all the isolates, indicating their high adaptability to stressful environments. The isolated strains from marine sediments in Cartagena Bay, Colombia are suitable candidates to reduce perchlorate contamination in different environments. Although the primary focus of the study of perchlorate-reducing and resistant bacteria is in the ecological and agricultural realms, from an astrobiological perspective, perchlorate-resistant bacteria serve as models for astrobiological investigations.
... Sanger sequencing with primers 577F 5´-GCCAGCACCCGCGGT-3′, 577R 5´-ACCGCGGGTGCTGGC-3′, 1055F 5´-GGTGGTGCATGGCCG-3′ and 1055R 5´-GGTGGTGCATGGCCG-3′ (Elwood et al., 1985) was done at the Laboratory for DNA Sequencing (Faculty of Science, Charles University, Prague, Czech Republic) on an ABI PRISM 3100 sequencer (Applied Biosystems, San Francisco, CA, USA). Chromatograms were checked manually, using 4Peaks (Griekspoor & Groothuis, 2006) and assembled into contiguous sequences using CAP3 (Huang & Madan, 1999). ...
Article
The phylogenetic and taxonomic affinities of lineages currently assigned to the non‐monophyletic ciliate order Loxocephalida Jankowski (1980) within subclass Scuticociliatia Small (1967) remain unresolved. In the current study, we redescribe the morphology of the type species, Loxocephalus luridus Eberhard (1862) based on two Czech populations and include the first scanning and transmission electron microscopy images of the species. We provide the first 18S rRNA gene sequences for L. luridus and consider its phylogenetic position. Our results support the separation of Dexiotricha from Loxocephalus ; however, the former genus is recovered as non‐monophyletic. The monophyly of genus Dexiotricha and that of Loxocephalus + Dexiotricha is rejected. Loxocephalus luridus , together with Dexiotricha species, nests within a fully supported clade with Conchophthirus species, long presumed to belong to the Pleuronematida. Haptophrya is recovered as sister to this clade. The monophyly of the Astomatia Schewiakoff (1896) including Haptophrya is rejected. No clear morphologic synapomorphy is identified for the fully supported clade consisting of Haptophrya , Dexiotricha , Loxocephalus , and Conchophthirus .
... The resulting transcriptomes were merged using the Linux cat command before redundancy removal using vsearch (Rognes et al., 2016) at a 95% identity cutoff. CAP3 (Huang and Madan, 1999) was then used (with o = 16, k = 0 and p = 95 as options) to further elongate the contigs. BWA separately aligned the reads from solitarious and gregarious libraries against the assembled reference transcriptome. ...
Preprint
Background Locust outbreaks cause devastation and are an important matter for fundamental research. They associate with a striking case of phenotypic plasticity; i.e., a gregarious phase versus solitarious phase polyphenism that affects most aspects of the locusts’ biology. However, changes in behaviour are the most notorious. Changes in gene expression dictate the phenotypic changes, behaviour is key to the locusts’ phase change, and the Central Nervous System (CNS) is essential to behaviour. Therefore, understanding and tackling the phenomenon requires studying the gene expression changes that the locusts’ CNS undergoes between phases. The genes that change expression the same way in different locusts would be ancestrally relevant for the phenomenon in general and those that change expression in a species-specific way would be relevant for species-specific understanding and tackling of the phenomenon. Methods Here, we use available raw sequencing reads to build transcriptomes using the same RNAseq pipeline and to compare the gene expression changes that the CNS of the two main pest locusts (Schistocerca gregaria and Locusta migratoria) undergo when they turn gregarious. Our aim is to find out about the species-specificity of the phenomenon, highlight the genes that respond in species-specific manner and those that respond the same way in both species. Results The locust phase change phenomenon seems highly species-specific, very likely due to the inter-specific differences in the biology and life conditions of the locusts. Research on locust outbreaks, gregariousness and swarming should therefore consider each locust species apart—as none seems representative of all locust species. Still, the 109 genes and 39 non-annotated sequences that change expression level the same way in the two main pest locusts provide sufficient material for functional testing in search for important genes, to better understand, or to fight against locust outbreaks. 
The genes that respond in a species-specific way provide material for understanding the differences between locust species and for looking for potential species-specific weapons against each of them. The still uncharacterized transcripts that change expression either in a species-specific or the same way between the two species provide material for functional testing and gene discovery.
... A partial alfalfa MsPROPEP1 homolog, identified by our previous RNA sequencing analysis, was used as a BlastN query sequence against the set of alfalfa and Medicago truncatula ESTs represented in GenBank. The resulting hits were then assembled using CAP3 software (Huang et al. 1999) and the assembled sequence was used to design a pair of PCR primers able to amplify the full-length cDNA of MsPROPEP1. Polypeptide sequences homologous to MsPROPEP1 were obtained from GenBank, and were aligned using the MegAlign program implemented within DNAStar (www.dnastar.com). ...
Preprint
Plant peptide hormones have various important roles in plant development, defense against pathogens, and tolerance to abiotic stress. However, only a limited number of hormone-like peptides have been proven to contribute to salt and drought stress tolerance in plants other than Arabidopsis . In this study, we present the isolation and characterization of MsPROPEP1 , a propeptide precursor gene obtained from the legume pasture Medicago sativa . The transcription of the MsPROPEP1 was found to be inducible by NaCl, polyethylene glycol (PEG), and abscisic acid (ABA). The constitutive expression of MsPROPEP1 in alfalfa seedlings mitigated the restriction on plant growth imposed by either salinity or osmotic stress and raised their sensitivity to ABA in promoting stomatal closure. In addition, we synthesized MsPep1 peptide and found that the application of MsPep1 enhanced tolerance to stress induced by NaCl and PEG. In transgenic plants, many ABA-dependent stress-responsive genes are activated; this is known to promote the expression of peroxidase which plays a role in reactive oxygen scavenging. Our findings suggest that MsPROPEP1 is a candidate for the genetic manipulation of salinity and drought tolerance in legume species.
... The two assembly drafts were merged by quickmerge version 0.3 [65], producing the primary contigs. These contigs were further merged by CAP3 (version date: 02/10/15) [66]. The putative contaminant (bacteria [mainly Klebsiella pneumoniae] and mitochondria) sequences were identified and removed using BLASTN version 2.10.1+ ...
Article
Background Encystment is an important survival strategy extensively employed by microbial organisms to survive unfavorable conditions. Single-celled ciliated protists (ciliates) are popular model eukaryotes for studying encystment, whereby these cells degenerate their ciliary structures and develop cyst walls, then reverse the process under more favorable conditions. However, to date, the evolutionary basis and mechanism for encystment in ciliates is largely unknown. With the rapid development of high-throughput sequencing technologies, genome sequencing and comparative genomics of ciliates have become effective methods to provide insights into above questions. Results Here, we profiled the MAC genome of Pseudourostyla cristata, a model hypotrich ciliate for encystment studies. Like other hypotrich MAC genomes, the P. cristata MAC genome is extremely fragmented with a single gene on most chromosomes, and encodes introns that are generally small and lack a conserved branch point for pre-mRNA splicing. Gene family expansion analyses indicate that multiple gene families involved in the encystment are expanded during the evolution of P. cristata. Furthermore, genomic comparisons with other five representative hypotrichs indicate that gene families of phosphorelay sensor kinase, which play a role in the two-component signal transduction system that is related to encystment, show significant expansion among all six hypotrichs. Additionally, cyst wall-related chitin synthase genes have experienced structural changes that increase them from single-exon to multi-exon genes during evolution. These genomic features potentially promote the encystment in hypotrichs and enhance their ability to survive in adverse environments during evolution. Conclusions We systematically investigated the genomic structure of hypotrichs and key evolutionary phenomenon, gene family expansion, for encystment promotion in ciliates. 
In summary, our results provided insights into the evolutionary mechanism of encystment in ciliates.
... Yes Assembly of reads (i.e., align the reads to create longer sequences called contigs) CAP3 A DNA sequence assembly program (Huang and Madan 1999) Yes MEGAHIT An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph (Li et al. 2015a, b) SPAdes* SPAdes aims to build continuous and accurate sequences (often referred to as contigs and scaffolds) from short reads (Prjibelski et al. 2020) Velvet A tool for de novo assembly based on de Bruijn graphs and it is suitable for short read data with high coverage (Zerbino and Birney 2008) Mapping of reads to assembled contigs (to obtain abundance/coverage data) Bowtie2* Ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes (Langmead et al. 2009) - BWA Efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps (Li and Durbin 2009) ...
Article
Full-text available
Due to their vectorial capacity, mosquitoes (Diptera: Culicidae) receive special attention from health authorities and entomologists. These cosmopolitan insects are responsible for the transmission of many viral diseases, such as dengue and yellow fever, causing huge impacts on human health and justifying the intensification of research focused on mosquito-borne diseases. In this context, the study of the virome of mosquitoes can help anticipate the emergence and/or the reemergence of infectious diseases. The assessment of mosquito viromes also contributes to the surveillance of a wide variety of viruses found in these insects, allowing the early detection of pathogens with public health importance. However, the study of mosquito viromes can be challenging due to the number and complexity of the steps involved in this type of research. Therefore, this article aims to describe, in a straightforward and simplified way, the steps necessary for obtaining and assessing mosquito viromes. In brief, this article explores: the capture and preservation of specimens; sampling strategies; treatment of samples before DNA/RNA extraction; extraction methodologies; enrichment and purification processes; sequencing choices; and bioinformatics analysis.
... nih.gov/BLAST/) [12]. Afterward, these sequences were submitted to the CAP3 Sequence Assembly Program to form the consensus sequence comprising the whole region encoding E1, E3, E2, 6K, and Capsid (C). ...
Article
Full-text available
Background Chikungunya virus (CHIKV) is an arbovirus from the Togaviridae family which has four genotypes: West African (WA), East/Central/South African (ECSA), Asian/Caribbean lineage (AL), and Indian Ocean Lineage (IOL). The ECSA genotype was first registered in Brazil in Feira de Santana and spread to all Brazilian regions. This study reports the characterization of CHIKV isolates recovered from serum samples of fifty patients from seventeen cities in Maranhão, a state in the northeast region of Brazil and part of the Legal Amazon area. Methods and results Primers were developed to amplify the partial regions coding structural proteins (E1, E3, E2, 6K, and Capsid C). The consensus sequences have 2871 bp, covering approximately 24% of the genome. The isolates were highly similar (> 99%) to the ECSA isolate from Feira de Santana (BHI3734/H804698), presenting 30 non-synonymous mutations in E1 (5.95%), 18 in E2 (4.46%), and 1 in E3 (3.03%), taking the BHI3734/H804698 isolate as standard. Although the mutations described have not previously been related to increased infectivity or transmissibility of CHIKV, in silico analysis showed changes in physicochemical characteristics, antigenicity, and B cell epitopes of E1 and E2. Conclusions These findings demonstrate the importance of molecular approaches for monitoring the viral adaptations undergone by CHIKV and its geographic distribution.
... For the MIC assemblies, three replicates of H. grandinella and six of S. cf. sulcatum were assembled separately and reassembled using CAP3 with strict overlap parameters (-o 50 -p 99) (Huang and Madan 1999) to obtain more genomic data (Lyu et al. 2023). Potential MAC contamination in the MIC assemblies, which is unavoidable during amplification, was identified using BLAST (Camacho et al. 2009) (v2.10.0+, cutoff: E-value < 1e-10, identity > 97%, qcov > 95%) and removed by a custom Perl script (Code S2). ...
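The reassembly and filtering steps quoted in this snippet can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' pipeline (they used a custom Perl script, Code S2, which is not reproduced here). The function names, the file name in the test, and the tabular BLAST column order (qseqid, sseqid, pident, evalue, qcovs) are assumptions made for the example. Only the CAP3 flags `-o` (overlap length cutoff) and `-p` (overlap percent identity cutoff) and the three cutoffs come from the text.

```python
def cap3_command(fasta_path, min_overlap=50, min_identity=99):
    """Build the CAP3 argument list with the strict overlap cutoffs from the text.

    -o sets the overlap length cutoff, -p the overlap percent identity cutoff.
    The list can be handed to subprocess.run() where CAP3 is installed.
    """
    return ["cap3", fasta_path, "-o", str(min_overlap), "-p", str(min_identity)]


def is_mac_contaminant(evalue, pident, qcov):
    """Apply the quoted cutoffs: E-value < 1e-10, identity > 97%, qcov > 95%."""
    return evalue < 1e-10 and pident > 97.0 and qcov > 95.0


def contaminant_ids(blast_lines):
    """Collect query IDs whose BLAST hits pass all three cutoffs.

    Assumes tab-separated lines with the hypothetical column order
    qseqid, sseqid, pident, evalue, qcovs.
    """
    ids = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, _sseqid, pident, evalue, qcov = line.rstrip("\n").split("\t")[:5]
        if is_mac_contaminant(float(evalue), float(pident), float(qcov)):
            ids.add(qseqid)
    return ids
```

Contigs whose IDs are returned by `contaminant_ids` would then be dropped from the assembly FASTA; that removal step is left out to keep the sketch short.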
Article
Full-text available
Genomes are incredibly dynamic within diverse eukaryotes and programmed genome rearrangements (PGR) play important roles in generating genomic diversity. However, genomes and chromosomes in metazoans are usually large in size which prevents our understanding of the origin and evolution of PGR. To expand our knowledge of genomic diversity and the evolutionary origin of complex genome rearrangements, we focus on ciliated protists (ciliates). Ciliates are single-celled eukaryotes with highly fragmented somatic chromosomes and massively scrambled germline genomes. PGR in ciliates occurs extensively by removing massive amounts of repetitive and selfish DNA elements found in the silent germline genome during development of the somatic genome. We report the partial germline genomes of two spirotrich ciliate species, namely Strombidium cf. sulcatum and Halteria grandinella, along with the most compact and highly fragmented somatic genome for S. cf. sulcatum. We provide the first insights into the genome rearrangements of these two species and compare these features with those of other ciliates. Our analyses reveal: (1) DNA sequence loss through evolution and during PGR in S. cf. sulcatum has combined to produce the most compact and efficient nanochromosomes observed to date; (2) the compact, transcriptome-like somatic genome in both species results from extensive removal of a relatively large number of shorter germline-specific DNA sequences; (3) long chromosome breakage site motifs are duplicated and retained in the somatic genome, revealing a complex model of chromosome fragmentation in spirotrichs; (4) gene scrambling and alternative processing are found throughout the core spirotrichs, offering unique opportunities to increase genetic diversity and regulation in this group. Supplementary Information The online version contains supplementary material available at 10.1007/s42995-023-00213-x.
... The transcriptomes were de novo assembled using Trinity v2.8.5 (default parameters) (Grabherr et al., 2011). Cap3 (version date: 02/10/15) was used to merge transcriptome assemblies from different libraries of the same species (Huang and Madan, 1999). Redundant transcripts were removed by CD-HIT v4.6.2 (Fu et al., 2012) with 90 % identity. ...
... The obtained reads were assembled with the CAP3 tool (https://doua.prabi.fr/software/cap3/) [18]. The 1396 bp-long sequence was deposited in GenBank/EMBL/DDBJ under accession number OP099850. ...
Article
Full-text available
The taxonomic status of strain P5891T, isolated from an Adélie penguin beak swab, was investigated. Based on the 16S rRNA gene sequence, the strain was identified as a potentially novel Corynebacterium species, with the highest sequence similarities to Corynebacterium rouxii FRC0190T (96.7 %) and Corynebacterium epidermidicanis DSM 45586T (96.6 %). The average nucleotide identity values between strain P5891T and C. rouxii FRC0190T and C. epidermidicanis DSM 45586T were 68.2 and 69.2 %, respectively. The digital DNA–DNA hybridization values between strain P5891T and C. rouxii FRC0190T and C. epidermidicanis DSM 45586T were 23.7 and 21.4 %, respectively. Phylogenetic trees based on the 16S rRNA sequence placed strain P5891T in a separate branch with Corynebacterium canis 1170T and Corynebacterium freiburgense 1045T, while a phylogenomic tree based on the Corynebacterium species core genome placed the strain next to Corynebacterium choanae 200CHT. Extensive phenotyping and genomic analyses clearly confirmed that strain P5891T represents a novel species of the genus Corynebacterium, for which the name Corynebacterium mendelii sp. nov. is proposed, with the type strain P5891T (=CCM 8862T=LMG 31627T).
... After resolving ambiguous nucleotides, the online CAP3 Sequence Assembly Program (Huang and Madan 1999) was used to construct a single contiguous sequence for each recombinant insert. The final consensus sequence for each rust isolate was deposited in GenBank (Table 1). ...
Article
Full-text available
Allium crops are commonly grown in South Africa and harvested as either fresh produce for the domestic and export markets or as seed. Apart from occasional outbreaks on garlic, rust is problematic as a cosmetic disease with unappealing uredinia regularly observed on freshly packed produce of bunching onion and leek in supermarkets. Spore morphology and phylogenetic analysis of five rust samples collected from A. fistulosum (bunching onion) confirmed the causal organism as Puccinia porri . Garlic and bunching onion varieties were mostly susceptible to P. porri , whereas leek varieties were either susceptible or segregating in their response, with bulb onions being resistant. Microscopy of early infection structures showed appressorium formation, stomatal penetration, and a substomatal structure which differentiated into infection hyphae and haustorium mother cells. At microscopy level differences in host response became visible from 48 h post-inoculation onwards with prehaustorial and early hypersensitivity observed as resistance mechanisms in onions.
... Pure DNA was shipped to SolGent Company South Korea for polymerase chain reaction (PCR) and sequencing of the internal transcribed spacer region (ITS) of rRNA gene using the primer pair ITS1 (forward) and ITS4 (reverse)[22]. The obtained sequences were analyzed using Basic Local Alignment Search Tool (BLAST) from the National Center of Biotechnology Information (NCBI) website[23]. Sequence alignments and phylogeny were performed using MegAlign (DNA Star) software version 5.05. ...
... Putative viral sequences that did not match the length or ORF structure in comparison to their closest relative in the NCBI nucleic acid databases were further investigated, and libraries from which the virus genome originated were submitted to a new round of assembly using an integrative strategy composed of multiple assemblers (SPAdes [42], Trinity [47], metaSPAdes [48], rnaviralSPAdes, metaviralSPAdes v. 3.15.5 [49], Oases v. 0.1.2 [50], and MEGAHIT v. 1.2.9 [51]) followed by transcript consolidation with Cap3 [52] as described by Espinal et al., 2023 [53]. ...
Article
Full-text available
Parasitoid wasps are fundamental insects for the biological control of agricultural pests. Despite the importance of wasps as natural enemies for more sustainable and healthy agriculture, the factors that could impact their species richness, abundance, and fitness, such as viral diseases, remain almost unexplored. Parasitoid wasps have been studied with regard to the endogenization of viral elements and the transmission of endogenous viral proteins that facilitate parasitism. However, circulating viruses are poorly characterized. Here, RNA viromes of six parasitoid wasp species are studied using public libraries of next-generation sequencing through an integrative bioinformatics pipeline. Our analyses led to the identification of 18 viruses classified into 10 families (Iflaviridae, Endornaviridae, Mitoviridae, Partitiviridae, Virgaviridae, Rhabdoviridae, Chuviridae, Orthomyxoviridae, Xinmoviridae, and Narnaviridae) and into the Bunyavirales order. Of these, 16 elements were described for the first time. We also found a known virus previously identified on a wasp prey which suggests viral transmission between the insects. Altogether, our results highlight the importance of virus surveillance in wasps as its service disruption can affect ecology, agriculture and pest management, impacting the economy and threatening human food security.
... Selected isolate cultures were shipped to Macrogen, Inc. (Singapore) for internal transcribed spacer (ITS) isolation and sequencing. Cap3 Contig Assembly (Stothard, 2000) and Reverse Complement (Huang and Madan, 1999) were used for merging and reversing sequences. Subsequently, these sequences were aligned using NCBI BLASTn (Zhang et al., 2000) and phylogeny trees were built using MEGA ver. ...
Article
Full-text available
Root-knot nematodes (RKNs) are groups of nematodes that cause significant diseases in horticultural and field crops. Chemical pesticides used to control RKNs could pollute environmental resources and ultimately affect human health. Therefore, eco-friendly efforts are needed. Previous research revealed that nematode-trapping fungi (NTFs), the biological enemies of nematodes, have been observed to suppress the nematode population. This study aimed to isolate NTF species from municipal waste-contaminated soil in Medan City, Indonesia, and identify them using morphological and molecular analysis. Furthermore, their biocontrol potential against Meloidogyne hapla Chitwood (Nematoda: Meloidogynidae) was assessed. Soil samples covered seven districts with seven replicates for isolation; in vitro assessment against M. hapla was done on CMA and observed after 12-72 hours. Three isolates were successfully obtained and proven effective in suppressing M. hapla by 97.7% (isolates sH51 and sH52) and 89.27% (isolate sH53). Morphological identification on PDA and genetic analysis of ITS concluded that sH51 is Drechslerella brochopaga Drechsler (Ascomycota: Orbiliaceae) and sH53 is Arthrobotrys thaumasius Drechsler (Ascomycota: Orbiliaceae). Morphological analysis identified isolate sH52 as Arthrobotrys sinensis, but phylogenetic analysis resolved it only to Arthrobotrys sp.; additional genes therefore need to be sequenced for confirmation.
... Reads of rust on leaves 1 and 2 and stems 1 and 2 were sequenced using the Illumina HiSeq4000 platform (Shabrina et al. 2019). The sequences were realigned using CAP3 software (Huang and Madan 1999) to reduce clusters of nearly identical transcripts and obtain more complete contigs. Then, they were clustered to remove redundant or ambiguous contigs using CD-HIT-EST software (Li and Godzik 2006), applying an identity threshold of 95%. ...
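As a rough illustration of the redundancy-removal step in this snippet, the sketch below builds a CD-HIT-EST command line with the 95% identity threshold. It assumes the standard CD-HIT options `-i` (input FASTA), `-o` (output) and `-c` (sequence identity threshold). The function name and file names are hypothetical, and the citing paper does not state which other options it used.

```python
def cd_hit_est_command(in_fasta, out_fasta, identity=0.95):
    """Build a CD-HIT-EST argument list for clustering contigs at a given identity.

    The identity is passed to -c as a fraction (0.95 for the 95% in the text).
    """
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be a fraction in (0, 1]")
    return ["cd-hit-est", "-i", in_fasta, "-o", out_fasta, "-c", f"{identity:.2f}"]
```

The resulting list can be passed to `subprocess.run()` on a machine where CD-HIT is installed; the output FASTA then holds one representative sequence per cluster.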
Article
Full-text available
Sengon (Falcataria falcata (L.) Greuter & R. Rankin) plantations in Indonesia are threatened by attacks from Boktor stem borers and gall rust disease. Controlling pests and diseases is difficult; therefore, planting resistant trees obtained from tree selection programs is necessary. Currently, genomic breeding often incorporates GWAS, which uses thousands of SNP markers to identify markers with significant associations with the traits studied. This study aimed to bypass such expensive studies by identifying and developing SNP markers from sequences of putative resistance genes to Boktor stem borer and gall rust disease, identified from sengon transcriptomic data analysis. A total of 496,194 putative SNP sites were identified from transcriptomic sequences using the SAMtools and BCFtools programs, of which 119 SNP sites were associated with resistance genes. Of the 101 non-synonymous SNPs selected, only 12 were located in the conserved domain of each gene and were used for primer design. Of the 13 primers designed, only 10 were successfully amplified. Validation of 10 developed SNP markers on 100 sengon accessions using the HRM method confirmed a significant association between SNP markers and resistance traits, with a -log10(P-value) between 10.49 and 16.63. A few SNP markers developed from putative resistance gene sequences are associated with resistance traits in sengon. Therefore, the SNP markers could be applied in selection programs for sengon trees resistant to Boktor stem borers and gall rust disease.
Article
Coagulase-negative Staphylococcus (CoNS) species inhibiting Staphylococcus aureus has been described in the skin of atopic dermatitis (AD) patients. This study evaluated whether Staphylococcus spp. from the skin and nares of AD and non-AD children produced antimicrobial substances (AMS). AMS production was screened by an overlay method and tested against NaOH, proteases and 30 indicator strains. Clonality was assessed by pulsed-field gel electrophoresis. Proteinaceous AMS-producers were investigated for autoimmunity by the overlay method and presence of bacteriocin genes by polymerase chain reaction. Two AMS-producers had their genome screened for AMS genes. A methicillin-resistant S. aureus (MRSA) produced proteinaceous AMS that inhibited 51.7% of the staphylococcal indicator strains, and it was active against 60% of the colonies selected from the AD child where it was isolated. On the other hand, 57 (8.8%) CoNS from the nares and skin of AD and non-AD children, most of them S. epidermidis (45.6%), reduced the growth of S. aureus and other CoNS species. Bacteriocin-related genes were detected in the genomes of AMS-producers. AMS production by CoNS inhibited S. aureus and other skin microbiota species from children with AD. Furthermore, an MRSA colonizing a child with AD produced AMS, reinforcing its contribution to dysbiosis and disease severity.
Article
Full-text available
Background The InBIO Barcoding Initiative (IBI) Dataset - DS-IBILP08 contains records of 2350 specimens of moths (Lepidoptera species that do not belong to the superfamily Papilionoidea). All specimens have been morphologically identified to species or subspecies level and represent 1158 species in total. The species of this dataset correspond to about 42% of mainland Portuguese Lepidoptera species. All specimens were collected in mainland Portugal between 2001 and 2022. All DNA extracts and over 96% of the specimens are deposited in the IBI collection at CIBIO, Research Center in Biodiversity and Genetic Resources. New information The authors enabled "The InBIO Barcoding Initiative Database: DNA barcodes of Portuguese moths" in order to release the majority of data of DNA barcodes of Portuguese moths within the InBIO Barcoding Initiative. This dataset increases the knowledge on the DNA barcodes of 1158 species from Portugal belonging to 51 families. There is an increase in DNA barcodes of 205% in Portuguese specimens publicly available. The dataset includes 61 new Barcode Index Numbers. All specimens have their DNA barcodes publicly accessible through BOLD online database and the distribution data can be accessed through the Global Biodiversity Information Facility (GBIF).
Article
Bacterial species referred to as magnetotactic bacteria (MTB) biomineralize iron oxides and iron sulphides inside the cell. Bacteria can arrange themselves passively along geomagnetic field lines with the aid of these iron components known as magnetosomes. In this study, magnetosome nanoparticles, which were obtained from the taxonomically identified MTB isolate Providencia sp. PRB-1, were characterized and their antibacterial activity was evaluated. An in vitro test showed that magnetosome nanoparticles significantly inhibited the growth of Staphylococcus sp., Pseudomonas aeruginosa, and Klebsiella pneumoniae. Magnetosomes were found to contain cuboidal iron crystals with an average size of 42 nm measured by particle size analysis and scanning electron microscope analysis. The energy dispersive X-ray examination revealed that Fe and O were present in the extracted magnetosomes. The extracted magnetosome nanoparticles displayed maximum absorption at 260 nm in the UV-Vis spectrum. The distinct magnetite peak in the Fourier transform infrared (FTIR) spectroscopy spectra was observed at 574.75 cm−1. More research is needed into the intriguing prospect of biogenic magnetosome nanoparticles for antibacterial applications.
Article
Groundwater (GW) quality monitoring is vital for sustainable water resource management. The present study introduced a metagenome-derived machine learning (ML) model aimed at enhancing the predictive understanding and diagnostic interpretation of GW pollution associated with petroleum. In this framework, taxonomic and metabolic profiles derived from GW metagenomes were combined for use as the input dataset. By employing strategies that optimized data integration, model selection, and parameter tuning, we achieved a significant increase in diagnostic accuracy for petroleum-polluted GW. Explanatory artificial intelligence techniques identified petroleum degradation pathways and Rhodocyclaceae as strong predictors of a pollution diagnosis. Metagenomic analysis corroborated the presence of gene operons encoding aminobenzoate and xylene biodegradation within the de novo assembled genome of Rhodocyclaceae. Our genome-centric metagenomic analysis thus clarified the ecological interactions associated with microbiomes in breaking down petroleum contaminants, validating the ML-based diagnostic results. This metagenome-derived ML framework not only enhances the predictive diagnosis of petroleum pollution but also offers interpretable insights into the interaction between microbiomes and petroleum. The proposed ML framework demonstrates great promise for use as a science-based strategy for the on-site monitoring and remediation of GW pollution
Article
Full-text available
Mosquitoes harbor a large diversity of eukaryotic viruses. Those viromes probably influence mosquito physiology and the transmission of human pathogens. Nevertheless, their ecology remains largely unstudied. Here, we address two key questions in virome ecology. First, we assessed the influence of mosquito species on virome taxonomic diversity and relative abundance. Contrary to most previous studies, the potential effect of the habitat was explicitly included. Thousands of individuals of Culex poicilipes and Culex tritaeniorhynchus , two vectors of viral diseases, were concomitantly sampled in three habitats over two years. A total of 95 viral taxa from 25 families were identified with meta-transcriptomics, with 75% of taxa shared by both mosquitoes. Viromes significantly differed by mosquito species but not by habitat. Differences were largely due to changes in relative abundance of shared taxa. Then, we studied the diversity of viruses with a broad host range. We searched for viral taxa shared by the two Culex species and Aedes vexans , another disease vector, present in one of the habitats. Twenty-six out of the 163 viral taxa were found in the three mosquitoes. These taxa encompassed 14 families. A database analysis supported broad host ranges for many of those viruses, as well as a widespread geographical distribution. Thus, the viromes of mosquitoes from the same genera mainly differed in the relative abundance of shared taxa, whereas differences in viral diversity dominated between mosquito genera. Whether this new model of virome diversity and structure applies to other mosquito communities remains to be determined.
Article
This study investigated the potential of using steam-exploded oil palm empty fruit bunches (EFB) as a renewable feedstock for producing fumaric acid (FA), a food additive widely used for flavor and preservation, through a separate hydrolysis and fermentation process using the fungal isolate K20. The efficiency of FA production by free and immobilized cells was compared. The maximum FA concentration (3.25 g/L), with 0.034 g/L/h productivity, was observed after incubation with the free cells for 96 h. Furthermore, the production was scaled up in a 3-L air-lift fermenter using oil palm EFB-derived glucose as the substrate. The FA concentration, yield, and productivity from 100 g/L initial oil palm EFB-derived glucose were 44 g/L, 0.39 g/g, and 0.41 g/L/h, respectively. The potential for scaling up the fermentation process indicates favorable results, which could have significant implications for industrial applications.
Article
Full-text available
Background Mitochondria play essential roles in tumorigenesis; however, little is known about the contribution of mitochondrial DNA (mtDNA) to esophageal squamous cell carcinoma (ESCC). Whole-genome sequencing (WGS) is by far the most efficient technology to fully characterize the molecular features of mtDNA; however, due to the high redundancy and heterogeneity of mtDNA in regular WGS data, methods for mtDNA analysis are far from satisfactory. Methods Here, we developed a likelihood-based method dMTLV to identify low-heteroplasmic mtDNA variants. In addition, we described fNUMT, which can simultaneously detect non-reference nuclear sequences of mitochondrial origin (non-ref NUMTs) and their derived artifacts. Using these new methods, we explored the contribution of mtDNA to ESCC utilizing the multi-omics data of 663 paired tumor-normal samples. Results dMTLV outperformed the existing methods in sensitivity without sacrificing specificity. The verification using Nanopore long-read sequencing data showed that fNUMT has superior specificity and more accurate breakpoint identification than the current methods. Leveraging the new method, we identified a significant association between the ESCC overall survival and the ratio of mtDNA copy number of paired tumor-normal samples, which could be potentially explained by the differential expression of genes enriched in pathways related to metabolism, DNA damage repair, and cell cycle checkpoint. Additionally, we observed that the expression of CBWD1 was downregulated by the non-ref NUMTs inserted into its intron region, which might provide precursor conditions for the tumor cells to adapt to a hypoxic environment. Moreover, we identified a strong positive relationship between the number of mtDNA truncating mutations and the contribution of signatures linked to tumorigenesis and treatment response. 
Conclusions Our new frameworks promote the characterization of mtDNA features, which enables the elucidation of the landscapes and roles of mtDNA in ESCC essential for extending the current understanding of ESCC etiology. dMTLV and fNUMT are freely available from https://github.com/sunnyzxh/dMTLV and https://github.com/sunnyzxh/fNUMT, respectively.
Article
Termite hindguts are inhabited by symbionts that help with numerous processes, but changes in the gut microbiome due to season can potentially impact the physiology of termites. This study investigated the impact of seasonal changes on the composition of bacteria and protozoa in the termite gut. Termites were obtained monthly from May to October 2020 at a location in the central United States that typically experiences seasonal air temperatures ranging from < 0 to > 30 °C. The guts of 10 termites per biological replication were dissected and frozen within 1 day after collections. DNA was extracted from the frozen gut tissues and used for termite 16S rRNA mitochondrial gene analysis and bacterial 16S rRNA gene sequence surveys. Phylogenetic analysis of termite 16S rRNA gene sequences verified that the same colony was sampled across all time points. On processing bacterial 16S sequences, we observed alpha (observed features, Pielou’s evenness, and Shannon diversity) and beta diversity (unweighted Unifrac, Bray-Curtis, and Jaccard) metrics to vary significantly across months. Based on the analysis of the composition of microbiomes with bias correction (ANCOM-BC) at the genus level, we found several significant bacterial taxa over collection months. In addition, Spearman correlation analysis demonstrated that 41 bacterial taxa were significantly correlated (positively and negatively) with average soil temperature. These results from a single termite colony suggest termite microbial communities go through seasonal changes in relative abundance related to temperature, although other seasonal effects cannot be excluded. Further investigations are required to conclusively define the consistency of microbial variation among different colonies with season.
Article
The motility of Vibrio species plays a pivotal role in their survival and adaptation to diverse environments and is intricately associated with pathogenicity in both humans and aquatic animals. Numerous mutant strains of Vibrio alginolyticus have been generated using UV or EMS mutagenesis to probe flagellar motility using molecular genetic approaches. Identifying these mutations promises to yield valuable insights into motility at the protein structural physiology level. In this study, we determined the complete genomic structure of 4 reference specimens of laboratory V . alginolyticus strains: a precursor strain, V . alginolyticus 138-2, two strains showing defects in the lateral flagellum (VIO5 and YM4), and one strain showing defects in the polar flagellum (YM19). Subsequently, we meticulously ascertained the specific mutation sites within the 18 motility-deficient strains related to the polar flagellum (they fall into three categories: flagellar-deficient, multi-flagellar, and chemotaxis-deficient strains) by whole genome sequencing and mapping to the complete genome of parental strains VIO5 or YM4. The mutant strains had an average of 20.6 (±12.7) mutations, most of which were randomly distributed throughout the genome. However, at least two or more different mutations in six flagellar-related genes were detected in 18 mutants specifically selected as chemotaxis-deficient mutants. Genomic analysis using a large number of mutant strains is a very effective tool to comprehensively identify genes associated with specific phenotypes using forward genetics.
Article
Full-text available
Pecan leaf dieback caused by Neofusicoccum caryigenum is an emerging disease in southeastern United States pecan orchards. In this study, a first draft N. caryigenum genome was sequenced and assembled. Genome size was estimated as 42.5 Mbp, and genome completeness was estimated as 97.4%.
Article
Full-text available
Bees are important actors in terrestrial ecosystems and are recognised for their prominent role as pollinators. In the Iberian Peninsula, approximately 1,100 bee species are known, with nearly 100 of these species being endemic to the Peninsula. A reference collection of DNA barcodes, based on morphologically identified bee specimens, representing 514 Iberian species, was constructed. The "InBIO Barcoding Initiative Database: DNA Barcodes of Iberian bees" dataset contains records of 1,059 sequenced specimens. The species of this dataset correspond to about 47% of Iberian bee species diversity and 21% of endemic species diversity. For peninsular Portugal only, the corresponding coverage is 71% and 50%. Specimens were collected between 2014 and 2022 and are deposited in the research collection of Thomas Wood (Naturalis Biodiversity Center, The Netherlands), in the FLOWer Lab collection at the University of Coimbra (Portugal), in the Andreia Penado collection at the Natural History and Science Museum of the University of Porto (MHNC-UP) (Portugal) and in the InBIO Barcoding Initiative (IBI) reference collection (Vairão, Portugal). Of the 514 species sequenced, 75 species from five different families are new additions to the Barcode of Life Data System (BOLD) and 112 new BINs were added. Whilst the majority of species were assigned to a single BIN (94.9%), 27 nominal species were assigned to multiple BINs. Although the placement into multiple BINs may simply reflect genetic diversity and variation, it likely also represents currently unrecognised species-level diversity across diverse taxa, such as Amegilla albigena Lepeletier, 1841, Andrena russula Lepeletier, 1841, Lasioglossum leucozonium (Schrank, 1781), Nomada femoralis Morawitz, 1869 and Sphecodes alternatus Smith, 1853. Further species pairs of Colletes, Hylaeus and Nomada were placed into the same BINs, emphasising the need for integrative taxonomy within Iberia and across the Mediterranean Basin more broadly. These data substantially contribute to our understanding of bee genetic diversity and DNA barcodes in Iberia and provide an important baseline for ongoing taxonomic revisions in the West Palaearctic biogeographical region.
Article
Full-text available
The evolution of gene expression programs underlying the development of vertebrates remains poorly characterized. Here, we present a comprehensive proteome atlas of the model chordate Ciona, covering eight developmental stages and ∼7,000 translated genes, accompanied by a multi-omics analysis of co-evolution with the vertebrate Xenopus. Quantitative proteome comparisons argue against the widely held hourglass model, based solely on transcriptomic profiles, whereby peak conservation is observed during mid-developmental stages. Our analysis reveals maximal divergence at these stages, particularly gastrulation and neurulation. Together, our work provides a valuable resource for evaluating conservation and divergence of multi-omics profiles underlying the diversification of vertebrates.
Article
Tolerance mechanisms to single abiotic stress events are being investigated in different plant species, but how plants deal with multiple stress factors occurring simultaneously is still poorly understood. Here, we introduce Salicornia europaea as a species with an extraordinary tolerance level to both flooding and high salt concentrations. Plants exposed to 0.5 M NaCl (mimicking sea water concentrations) grew larger than plants not exposed to salt. Adding more salt reduced growth, but concentrations up to 2.5 M NaCl were not lethal. Regular tidal flooding with salt water (0.5 M NaCl) did not affect growth or chlorophyll fluorescence, whereas continuous flooding stopped growth while plants survived. Quantitative polymerase chain reaction (qPCR) analysis of plants exposed to 1% oxygen in air revealed induction of selected hypoxia responsive genes, but these genes were not induced during tidal flooding, suggesting that S. europaea did not experience hypoxic stress. Indeed, plants were able to transport oxygen into waterlogged soil. Interestingly, sequential exposure to salt and hypoxic air changed the expression of several but not all genes as compared to their expression upon hypoxia only, demonstrating the potential to use S. europaea to investigate signalling-crosstalk between tolerance reactions to multiple environmental perturbations.
Preprint
Full-text available
Mosquito-borne viruses represent a threat to human health worldwide. This taxonomically diverse group includes numerous viruses that recurrently spread into new regions. Thus, periodic surveillance of the arbovirus diversity in a given region can help optimize the diagnosis of arboviral infections. Nevertheless, such screenings are rarely carried out, especially in low-income countries. Consequently, case investigation is often limited to a fraction of the arbovirus diversity. This situation probably results in undiagnosed cases. Here, we have explored the diversity of mosquito-borne viruses in two regions of Burkina Faso. To this end, we have screened mosquitoes collected over three years in six urban and rural areas using untargeted metagenomics. The analysis focused on two mosquito species, Aedes aegypti and Culex quinquefasciatus, considered among the main vectors of arboviruses worldwide. The screening detected Sindbis virus (SINV, Togaviridae) for the first time in Burkina Faso. This zoonotic arbovirus has spread from Africa into Europe. SINV causes periodic outbreaks in Europe but its distribution and epidemiology in Africa remain largely unstudied. SINV was detected in one of the six areas of the study, and in a single year. Detection was validated with isolation in cell culture. SINV was only detected in Cx. quinquefasciatus, thus extending the list of potential vectors of SINV in nature. The SINV infection rate in mosquitoes was similar to those observed in European regions that experience SINV outbreaks. A phylogenetic analysis placed the nearly full genome within a cluster of Central African strains of lineage I. This cluster is supposedly at the origin of the SINV strains introduced into Europe. Thus, West Africa should also be considered as a potential source of the European SINV strains. Our results call for studies on the prevalence of SINV infections in the region to estimate disease burden and the value of SINV diagnostics in case investigation.
Article
Full-text available
The pathogenesis of severe Plasmodium falciparum malaria involves cytoadhesive microvascular sequestration of infected erythrocytes, mediated by P. falciparum erythrocyte membrane protein 1 (PfEMP1). PfEMP1 variants are encoded by the highly polymorphic family of var genes, the sequences of which are largely unknown in clinical samples. Previously, we published new approaches for var gene profiling and classification of predicted binding phenotypes in clinical P. falciparum isolates (Wichers et al., 2021), which represented a major technical advance. Building on this, we report here a novel method for var gene assembly and multidimensional quantification from RNA-sequencing that outperforms the earlier approach of Wichers et al., 2021, on both laboratory and clinical isolates across a combination of metrics. Importantly, the tool can interrogate the var transcriptome in context with the rest of the transcriptome and can be applied to enhance our understanding of the role of var genes in malaria pathogenesis. We applied this new method to investigate changes in var gene expression through early transition of parasite isolates to in vitro culture, using paired sets of ex vivo samples from our previous study, cultured for up to three generations. In parallel, changes in non-polymorphic core gene expression were investigated. Modest but unpredictable var gene switching and convergence towards var2csa were observed in culture, along with differential expression of 19% of the core transcriptome between paired ex vivo and generation 1 samples. Our results cast doubt on the validity of the common practice of using short-term cultured parasites to make inferences about in vivo phenotype and behaviour.
Preprint
Full-text available
Advances in assembling microbial genomes have led to growth of reference genome databases, which have been transformative for applied and basic microbiome research. Here we show that published microbial genome databases from humans, mice, cows, pigs, fish, honeybees, and marine environments contain significant levels of sequencing adapter contamination that systematically reduces assembly quality. By removing the adapter-contaminated ends of contiguous sequences and reassembling, we improve the accuracy and contiguousness of genome assemblies in these databases.
Article
Rose flowers are known for their diverse and attractive colors and scents, which are influenced by the expression of phyto-miRNAs. These are small non-coding RNAs that regulate various aspects of plant metabolism. However, the identification and functional analysis of miRNAs and their target genes involved in rose floral traits remain elusive. This study performed a comprehensive analysis of miRNAs in the genus Rosa using bioinformatics and qRT-PCR approaches. It detected 36 miRNAs belonging to 20 families and found that miR5021 was the most abundant and had multiple targets in the R. rugosa transcriptome. It also demonstrated that miR5021 was associated with the terpenoid biosynthesis pathway, which is responsible for producing floral volatile compounds. The negative regulation by miR5021 of five key enzymes in the terpenoid pathway (GGPS, CCD1, LIS, DXR, and DXS) in white and dark-pink flowers of R. canina was validated. The findings reveal that miR5021 plays a crucial role in modulating rose flower color and scent by regulating the terpenoid pathway and provide new insights into the molecular mechanism of rose floral traits.
Chapter
The advancement of sequencing technologies and molecular techniques has generated a huge amount of biological data, which needs to be analyzed and interpreted with the help of machine learning and artificial intelligence-based methods. Bioinformatics tools and databases such as NCBI-BLAST, Ensembl Plants, the Galaxy platform, RAP-DB, etc., play a major role in understanding the functional genomics and molecular systems of many crops. Bioinformatics tools and databases have emerged as a crucial platform for dealing with the huge data sets generated by OMICS technologies such as genomics, transcriptomics, proteomics, and metabolomics, and are used to draw logical conclusions about the problem at hand. Bioinformatics in agriculture, also known as agro-informatics, plays an important and increasing role in deciphering genomic and related information for crop improvement. This chapter also discusses some of the important genomic tools and databases.
Article
Full-text available
The aim of this work is the full characterization of all the nocturnin (noc) paralogues expressed in a teleost, the goldfish. An in silico analysis of the evolutionary origin of noc in Osteichthyes is performed, including the splicing variants and new paralogues appearing after the teleostean 3R genomic duplication and the cyprinine 4Rc. After sequencing the full-length mRNA of goldfish, we obtained two isoforms for noc-a (noc-aa and noc-ab) with two splice variants (I and II), and only one for noc-b (noc-bb) with two transcripts (II and III). Using the splicing variant II, the prediction of the secondary and tertiary structures renders a well-conserved 3D distribution of four α-helices and nine β-sheets in the three noc isoforms. A synteny analysis based on the localization of noc genes in the patrilineal or matrilineal subgenomes and a phylogenetic tree of protein sequences were carried out to establish a classification and a long-lasting nomenclature of noc in goldfish that can also be extrapolated to allotetraploid Cyprininae. Finally, both goldfish and zebrafish showed a broad tissue expression of all the noc paralogues. Moreover, the enriched expression of specific paralogues in some tissues argues in favour of neo- or subfunctionalization.
Preprint
Full-text available
The genus Potexvirus includes many species known to infect terrestrial plants and, more recently, a novel species, Turtlegrass virus X (TVX), infecting the seagrass Thalassia testudinum. We designed a degenerate primer pair, Potex7F and Potex7-RC, to amplify a 584 nt RNA-dependent RNA polymerase (RdRp) fragment from potexviruses infecting seagrasses. This primer pair was modified from Potex5/Potex2-RC primers published in 2002 that target terrestrial plant potexviruses, by including nucleotide bases found in TVX and related potexvirus sequences identified in seagrass viromes from Tampa Bay, Florida, USA. Using this newly developed primer set, we performed a reverse transcriptase polymerase chain reaction (RT-PCR) survey to screen for potexviruses in 63 opportunistically collected, apparently healthy seagrass samples. The survey examined the host species T. testudinum, Halodule wrightii, Halophila stipulacea, Syringodium filiforme, Ruppia maritima, and Zostera marina. PCR products were successfully amplified and sequenced from T. testudinum samples collected around Florida, USA, indicating prevailing potexvirus infection in the region. Impact statement: Potexviruses are widespread in terrestrial plants; however, the recent discovery of a potexvirus in the seagrass Thalassia testudinum extends their host range to marine flowering plants. Here we present degenerate potexvirus PCR primers, Potex7F and Potex7-RC, to enable the identification of potexvirus infections in seagrass habitats. Discovery of potexviruses in T. testudinum highlights the utility of this new primer set in uncovering the diversity, host range, and geographic range of potexviruses infecting seagrasses. Data summary: All sequence data are available in NCBI GenBank under the accession numbers OR827689-OR827705, OR854648, OR863396, and OR879052-OR879056. The authors confirm all supporting data, code and protocols have been provided within the article or through supplementary data files.
Article
Full-text available
The trend toward very large DNA sequencing projects, such as those being undertaken as part of the Human Genome Program, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates, and list a series of alternate solutions in the event that several appear equally good. Moreover, it uses a limited form of multiple sequence alignment to detect, and often correct, errors in the data. Our combined algorithm has successfully reconstructed nonrepetitive sequences of length 50,000 sampled at error rates of as high as 10%.
Article
Full-text available
The availability of massive amounts of DNA sequence information has begun to revolutionize the practice of biology. As a result, current large-scale sequencing output, while impressive, is not adequate to keep pace with growing demand and, in particular, is far short of what will be required to obtain the 3-billion-base human genome sequence by the target date of 2005. To reach this goal, improved automation will be essential, and it is particularly important that human involvement in sequence data processing be significantly reduced or eliminated. Progress in this respect will require both improved accuracy of the data processing software and reliable accuracy measures to reduce the need for human involvement in error correction and make human review more efficient. Here, we describe one step toward that goal: a base-calling program for automated sequencer traces, phred, with improved accuracy. phred appears to be the first base-calling program to achieve a lower error rate than the ABI software, averaging 40%-50% fewer errors in the data sets examined independent of position in read, machine running conditions, or sequencing chemistry.
Article
Full-text available
We describe an algorithm for aligning two sequences within a diagonal band that requires only O(NW) computation time and O(N) space, where N is the length of the shorter of the two sequences and W is the width of the band. The basic algorithm can be used to calculate either local or global alignment scores. Local alignments are produced by finding the beginning and end of a best local alignment in the band, and then applying the global alignment algorithm between those points. This algorithm has been incorporated into the FASTA program package, where it has decreased the amount of memory required to calculate local alignments from O(NW) to O(N) and decreased the time required to calculate optimized scores for every sequence in a protein sequence database by 40%. On computers with limited memory, such as the IBM-PC, this improvement both allows longer sequences to be aligned and allows optimization within wider bands, which can include longer gaps.
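The band restriction described above can be illustrated with a minimal Python sketch. It computes only a global alignment score, it is not the FASTA implementation, and the scoring values and the function name `banded_global_score` are assumptions chosen for illustration. For readability each row is allocated at full length, so only the scoring loop itself visits O(NW) cells.

```python
def banded_global_score(a, b, width, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only cells with |i - j| <= width.

    Scoring values are illustrative assumptions, not FASTA defaults.
    """
    n, m = len(a), len(b)
    if abs(n - m) > width:
        raise ValueError("band too narrow to reach the final cell")
    neg = float("-inf")
    # Row 0: only columns inside the band are reachable.
    prev = [j * gap if j <= width else neg for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [neg] * (m + 1)
        lo, hi = max(0, i - width), min(m, i + width)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i * gap  # leading gaps in b
                continue
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                         prev[j] + gap,      # gap in b
                         cur[j - 1] + gap)   # gap in a
        prev = cur
    return prev[m]
```

Cells outside the band stay at negative infinity, so no path can leave the band. A wider band admits longer gaps at the cost of more cells per row.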
Article
Full-text available
A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.
Article
Full-text available
We have developed three computer programs for comparisons of protein and DNA sequences. They can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. The FASTA program is a more sensitive derivative of the FASTP program, which can be used to search protein or DNA sequence data bases and can compare a protein sequence to a DNA sequence data base by translating the DNA data base as it is searched. FASTA includes an additional step in the calculation of the initial pairwise similarity score that allows multiple regions of similarity to be joined to increase the score of related sequences. The RDF2 program can be used to evaluate the significance of similarity scores using a shuffling method that preserves local sequence composition. The LFASTA program can display all the regions of local similarity between two sequences with scores greater than a threshold, using the same scoring parameters and a similar alignment algorithm; these local similarities can be displayed as a "graphic matrix" plot or as individual alignments. In addition, these programs have been generalized to allow comparison of DNA or protein sequences based on a variety of alternative scoring matrices.
Article
Full-text available
A program for assembling sequences by using a global approach has been developed. By successive steps, a more and more precise classification of DNA fragments permits the positioning of the sequences on the contig; after having detected the pairs of overlapping sequences, groups are formed such that all sequences in a group overlap. Sequences common to several groups enable the groups to be ordered in a series. Ambiguities in the order of groups can arise at this stage, due to the presence of repeated fragments; different solutions are then proposed. Putting the groups into order leads to a preclassification of sequences. The fragments are then aligned by group, by searching for words common to all sequences in the group, and using an algorithm of dynamic programming. A detailed example on a set of nine sequences accompanies the description of the method.
Article
Full-text available
A human infant brain cDNA library, made specifically for production of expressed sequence tags (ESTs) was evaluated by partial sequencing of over 1,600 clones. Advantages of this library, constructed for EST sequencing, include the use of directional cloning, size selection, very low numbers of mitochondrial and ribosomal transcripts, short polyA tails, few non-recombinants and a broad representation of transcripts. 37% of the clones were identified, based on matches to over 320 different genes in the public databases. Of these, two proteins similar to the Alzheimer's disease amyloid precursor protein were identified.
Article
A new approach to assembling large, random shotgun sequencing projects has been developed. The TIGR Assembler overcomes several major obstacles to assembling such projects: the large number of pairwise comparisons required, the presence of repeat regions, chimeras introduced in the cloning process, and sequencing errors. A fast initial comparison of fragments based on oligonucleotide content is used to eliminate the need for a more sensitive comparison between most fragment pairs, thus greatly reducing computer search time. Potential repeat regions are recognized by determining which fragments have more potential overlaps than expected given a random distribution of fragments. Repeat regions are dealt with by increasing the match criteria stringency and by assembling these regions last so that maximum information from nonrepeat regions can be used. The algorithm also incorporates a number of constraints, such as clone length and the placement of sequences from the opposite ends of a clone. TIGR Assembler has been used to assemble the complete 1.8 Mbp Haemophilus influenzae (Fleischmann et al., 1995) and 0.58 Mbp Mycoplasma genitalium (Fraser et al., 1995) genomes.
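The idea of a fast initial comparison based on oligonucleotide content can be sketched as a shared k-mer filter. The Python below is a simplified illustration, not the TIGR Assembler code: the function names, the word length `k`, and the threshold `min_shared` are assumptions, and reverse complements are ignored.

```python
def kmers(seq, k):
    """Set of all length-k words occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs(frags, k=8, min_shared=2):
    """Return index pairs (i, j), i < j, sharing at least min_shared k-mers.

    Only these pairs would be passed on to a slower, more sensitive
    alignment step. k and min_shared are illustrative assumptions.
    """
    index = {}
    for idx, frag in enumerate(frags):
        for word in kmers(frag, k):
            index.setdefault(word, set()).add(idx)
    shared = {}
    for ids in index.values():
        ordered = sorted(ids)
        for x in range(len(ordered)):
            for y in range(x + 1, len(ordered)):
                pair = (ordered[x], ordered[y])
                shared[pair] = shared.get(pair, 0) + 1
    return {pair for pair, count in shared.items() if count >= min_shared}
```

For example, `candidate_pairs(["ACGTACGTGG", "GTACGTGGCC", "TTTTTTTTTT"], k=4)` keeps only the pair `(0, 1)`, so the third fragment is never aligned against the other two.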
Article
The problem of finding a longest common subsequence of two strings has been solved in quadratic time and space. An algorithm is presented which will solve this problem in quadratic time and in linear space.
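A divide-and-conquer scheme of this kind can be sketched in Python as follows. It is a minimal illustration rather than the paper's exact formulation: it keeps only two rows of the score table at a time, but it uses string slices for brevity, so its auxiliary space is close to linear rather than strictly linear. The function names are assumptions.

```python
def lcs_row(a, b):
    """Last row of the LCS-length table for a against every prefix of b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev


def lcs(a, b):
    """A longest common subsequence of a and b by divide and conquer."""
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    # Scores for the first half of a (forward) and the second half (reversed).
    left = lcs_row(a[:mid], b)
    right = lcs_row(a[mid:][::-1], b[::-1])
    # Split b where the two halves together score highest.
    k = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    return lcs(a[:mid], b[:k]) + lcs(a[mid:], b[k:])
```

Each level of recursion recomputes scores over a table half the size of the previous one, so the total work remains quadratic.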
Article
Sequencing of large clones or small genomes is generally done by the shotgun approach (Anderson et al. 1982). This has two phases: (1) a shotgun phase in which a number of reads are generated from random subclones and assembled into contigs, followed by (2) a directed, or finishing phase in which the assembly is inspected for correctness and for various kinds of data anomalies (such as contaminant reads, unremoved vector sequence, and chimeric or deleted reads), additional data are collected to close gaps and resolve low quality regions, and editing is performed to correct assembly or base-calling errors. Finishing is currently a bottleneck in large-scale sequencing efforts, and throughput gains will depend both on reducing the need for human intervention and making it as efficient as possible. We have developed a finishing tool, consed, which attempts to implement these principles. A distinguishing feature relative to other programs is the use of error probabilities from our programs phred and phrap as an objective criterion to guide the entire finishing process. More information is available at http://www.genome.washington.edu/consed/consed.html.
Article
Elimination of the data processing bottleneck in high-throughput sequencing will require both improved accuracy of data processing software and reliable measures of that accuracy. We have developed and implemented in our base-calling program phred the ability to estimate a probability of error for each base-call, as a function of certain parameters computed from the trace data. These error probabilities are shown here to be valid (correspond to actual error rates) and to have high power to discriminate correct base-calls from incorrect ones, for read data collected under several different chemistries and electrophoretic conditions. They play a critical role in our assembly program phrap and our finishing program consed.
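The per-base error probabilities described above are usually reported as phred quality values, Q = -10 log10(P). The short Python sketch below shows the conversion in both directions; the function names are assumptions, and the code is not taken from phred.

```python
import math


def error_prob_to_quality(p):
    """Phred-scaled quality for an error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


def quality_to_error_prob(q):
    """Error probability implied by a phred-scaled quality q."""
    return 10.0 ** (-q / 10.0)
```

Under this scale, a quality of 20 corresponds to an estimated 1 error in 100 base-calls, and a quality of 30 to 1 in 1,000.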
Article
An effective computer program for assembling DNA fragments, the contig assembly program (CAP), has been developed. In the CAP program, a filter is used to eliminate quickly fragment pairs that could not possibly overlap, a dynamic programming algorithm is applied to compute the maximal-scoring overlapping alignment between each remaining pair of fragments, and a simple greedy approach is employed to assemble fragments in order of alignment scores. To identify the true fragment overlaps, the dynamic programming algorithm uses specially chosen sets of alignment parameters to tolerate sequencing errors and to penalize "mutational" changes between different copies of a repetitive sequence. The performance tests of the program on fragment data from genomic sequencing projects produced satisfactory results. The CAP program is efficient in computer time and memory; it took about 4 h to assemble a set of 1015 fragments into long contigs on a Sun workstation.
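The overlap-scoring and ranking steps described above can be sketched in Python. This is a simplified illustration, not the CAP source: the scoring parameters and the names `overlap_score` and `rank_overlaps` are assumptions, the pre-filter is omitted, and reverse complements are ignored.

```python
def overlap_score(a, b, match=2, mismatch=-5, gap=-6):
    """Best score for aligning a suffix of a with a prefix of b.

    Scoring values are illustrative assumptions, not CAP's parameters.
    """
    # Row 0: a prefix of b aligned against nothing costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [0]  # skipping a prefix of a is free
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    # The alignment must reach the end of a; the unused tail of b is free.
    return max(prev)


def rank_overlaps(frags):
    """Ordered pairs (score, i, j) with positive overlap, best first."""
    pairs = [(overlap_score(x, y), i, j)
             for i, x in enumerate(frags)
             for j, y in enumerate(frags) if i != j]
    return sorted((p for p in pairs if p[0] > 0), reverse=True)
```

A greedy assembler would then walk this list from the top, merging a pair whenever neither fragment end has already been used. For `["TTACGT", "ACGTGG", "GGCC"]` the list is `[(8, 0, 1), (4, 1, 2)]`, which chains the three fragments in order.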
Article
Space, not time, is often the limiting factor when computing optimal sequence alignments, and a number of recent papers in the biology literature have proposed space-saving strategies. However, a 1975 computer science paper by Hirschberg presented a method that is superior to the new proposals, both in theory and in practice. The goal of this paper is to give Hirschberg's idea the visibility it deserves by developing a linear-space version of Gotoh's algorithm, which accommodates affine gap penalties. A portable C-software package implementing this algorithm is available on the BIONET free of charge.
Article
A program package, called SEQAID, to support DNA sequencing is presented. The program automatically assembles long DNA sequences from short fragments with minimal user interaction. Various tools for controlling the assembly process are also available. The main novel feature of the system is that SEQAID implements several new well-behaved algorithms based on a mathematical model of the problem. It also utilizes available information on restriction fragments to detect illegitimate overlaps and to find relationships between separately assembled sequence blocks. Experiences with the system are reported, including an extremely pathological real sequence which offers an interesting benchmark for this kind of program.
Article
Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman Hall, 1995. [WF74] R.A. Wagner and M.J. Fischer. The String to String Correction Problem. Journal of the ACM, 21(1):168-173, 1974. [WM92] S. Wu and U. Manber. Fast Text Searching Allowing Errors. Communications of the ACM, 10(35):83-91, 1992. [KOS+00] S. Kurtz, E. Ohlebusch, J. Stoye, C. Schleiermacher, and R. Giegerich. Computation and Visualization of Degenerate Repeats in Complete Genomes. In ...
Article
This paper describes a new way of storing DNA gel reading data and an accompanying set of computer programs. These programs will perform all the manipulations that are required on data gained by the so-called ‘shotgun’ method of DNA sequencing. This system simplifies the computer processing involved with this sequencing method and also has the capability of being able at any time during a project to display, lined up in register, all the gel readings covering any section of the sequence.
Article
A new software system designed for use in high-throughput DNA sequencing laboratories is described. The Genome Reconstruction Manager (GRM) was developed from requirements derived from ongoing large-scale DNA sequencing projects. Object-oriented principles were followed in designing the system, and tools supporting object-oriented system development were employed for its implementation. GRM provides several advances in software support for high-throughput DNA sequencing: support for random, directed, and mixed sequencing strategies; a novel system for fragment assembly; a commercial object data-base management system for data storage; a client/server architecture for using network computational servers; and an underlying data model that can evolve to support fully automatic sequence reconstruction. GRM is currently being deployed for production use in high-throughput DNA sequencing projects.
Article
We present a dynamic programming algorithm for computing a best global alignment of two sequences. The proposed algorithm is robust in identifying any of several global relationships between two sequences. The algorithm delivers a best alignment of two sequences in linear space and quadratic time. We also describe a multiple alignment algorithm based on the pairwise algorithm. Both algorithms have been implemented as portable C programs. Experimental results indicate that for a commonly used set of gap penalties, the new programs produce more satisfactory alignments on sequences of various lengths than some existing pairwise and multiple programs based on the dynamic programming algorithm of Needleman and Wunsch.
Article
The combination of high-resolution scanning, image-databasing technology and sequence assembly software makes it possible to assemble contiguous overlapping sequences of DNA in a fraction of the time required by manual methods. This paper describes and evaluates an assembly program that rapidly generates contigs from scanned images and provides a unique ability to verify disagreements between overlapping strands or fragments of complementary DNA sequences on-screen, further increasing sequencing throughput.
Article
We have developed a set of tools, genfrag, to fragment and optionally mutate a DNA sequence to generate benchmark data sets for testing DNA sequence assembly algorithms. Data parameters can be systematically and independently varied to explore the range of data--and corresponding performance of assembly tools--encountered on large-scale random, or "shot-gun," sequencing projects.
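The general idea of generating benchmark fragments can be sketched in Python. This is a hypothetical illustration and does not reproduce genfrag's interface or its error model: the function name, the parameters, and the restriction to substitution errors are all assumptions.

```python
import random


def sample_fragments(seq, n, length, sub_rate=0.0, seed=0):
    """Draw n fragments of a fixed length from seq at random positions.

    Each base is replaced by a different base with probability sub_rate.
    Returns (start, fragment) pairs so the true layout is known.
    """
    rng = random.Random(seed)  # fixed seed makes the data set reproducible
    frags = []
    for _ in range(n):
        start = rng.randrange(len(seq) - length + 1)
        frag = list(seq[start:start + length])
        for k, base in enumerate(frag):
            if rng.random() < sub_rate:
                frag[k] = rng.choice([c for c in "ACGT" if c != base])
        frags.append((start, "".join(frag)))
    return frags
```

Because the start positions are kept, an assembly produced from the fragments can be checked against the known layout while `length`, `n`, and `sub_rate` are varied independently.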
Article
We describe the Genome Assembly Program (GAP), a new program for DNA sequence assembly. The program is suitable for large and small projects, a variety of strategies and can handle data from a range of sequencing instruments. It retains the useful components of our previous work, but includes many novel ideas and methods. Many of these methods have been made possible by the program's completely new, and highly interactive, graphical user interface. The program provides many visual clues to the current state of a sequencing project and allows users to interact in intuitive and graphical ways with their data. The program has tools to display and manipulate the various types of data that help to solve and check difficult assemblies, particularly those in repetitive genomes. We have introduced the following new displays: the Contig Selector, the Contig Comparator, the Template Display, the Restriction Enzyme Map and the Stop Codon Map. We have also made it possible to have any number of Contig Editors and Contig Joining Editors running simultaneously even on the same contig. The program also includes a new ‘Directed Assembly’ algorithm and routines for automatically detecting unfinished segments of sequence, to which it suggests experimental solutions.
Article
We describe a number of improvements to the CAP sequence assembly program. These improvements include the development of methods for solving the problem caused by simple repetitive sequences, for automatically editing fragment alignments and consensus sequences, and for identifying chimeric fragments. The improved program (CAP2) assembled each of seven data sets, six of which contain repetitive sequences of very strong similarity, into a single sequence. As an example, CAP2 assembled a set of 1467 fragments into a single sequence of 73,328 bp that has only eight differences from the original sequence. The effects of fragment length, coverage, and error rate on the performance of CAP2 were evaluated using artificial data sets.
Article
Diverse biochemical and computational procedures and facilities have been developed to hybridize thousands of DNA clones with short oligonucleotide probes and subsequently to extract valuable genetic information. This technology has been applied to 73,536 cDNA clones from infant brain libraries. By a mutual comparison of 57,419 samples that were successfully scored by 200-320 probes, 19,726 genes have been identified and sorted by their expression levels. The data indicate that an additional 20,000 or more genes may be expressed in the infant brain. Representative clones of the found genes create a valuable resource for complete sequencing and functional studies of many novel genes. These results demonstrate the unique capacity of hybridization technology to identify weakly transcribed genes and to study gene networks involved in organismal development, aging, or tumorigenesis by monitoring the expression of every gene in related tissues, whether known or still undiscovered.
Article
We have developed a fluorescent DNA probe, an oxazole yellow (YO)-linked oligonucleotide complementary to a target DNA/RNA, whose fluorescence is enhanced on hybridizing with the target nucleotide. We demonstrated the applicability of the YO-linked oligonucleotide probe to real-time monitoring of the in vitro transcription of a plasmid DNA constructed to contain the 5′-terminus non-coded region of hepatitis C virus RNA. During in vitro transcription in the presence of the YO-linked complementary oligonucleotide, the fluorescence of the reaction mixture showed a time-dependent linear increase corresponding to the generated target RNA product.
Article
We describe a computer program, named DNA-Protein Search (DPS), for comparing a megabase DNA sequence with a protein sequence database. The DPS program addresses the problems of frameshifts and introns in the DNA sequence. The DPS program was used to compare each of the following sequences with the Swiss-Prot database: the 1.8-megabase sequence of the Haemophilus influenzae Rd genome, the 0.58-megabase sequence of the Mycoplasma genitalium genome, and the 0.56-megabase sequence of Saccharomyces cerevisiae chromosome VIII. The comparisons found new regions that are similar to protein sequences. The sensitivity of DPS was evaluated using as test data the known coding regions of the three DNA sequences. The results demonstrate that the DPS program is a useful tool for finding the coding regions of the DNA sequence. The DPS program uses an order of magnitude less computer memory and is several times faster than the BLASTX program.
Article
DNA microarray technologies, together with rapidly increasing genomic sequence information, are leading to an explosion in available gene expression data. Currently there is a great need for efficient methods to analyze and visualize these massive data sets. A self-organizing map (SOM) is an unsupervised neural network learning algorithm which has been successfully used for the analysis and organization of large data files. Here we have applied the SOM algorithm to analyze published data on yeast gene expression and show that SOM is an excellent tool for the analysis and visualization of gene expression profiles.
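To make the SOM idea concrete, here is a minimal sketch of a one-dimensional self-organizing map trained on expression-like profiles. It is not the implementation used in the work above. The function names, the linear decay schedules, the Gaussian neighborhood, and the one-dimensional grid are all assumptions chosen for brevity.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to x (squared Euclidean)."""
    return min(
        range(len(weights)),
        key=lambda u: sum((w - v) ** 2 for w, v in zip(weights[u], x)),
    )


def train_som(data, n_units, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a 1-D SOM; returns one weight vector per unit.

    Assumptions: units sit on a line (grid distance = index difference),
    learning rate and neighborhood width decay linearly over training.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    if sigma0 is None:
        sigma0 = max(n_units / 2.0, 1.0)
    # Initialise each unit from a randomly chosen data profile.
    weights = [list(rng.choice(data)) for _ in range(n_units)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            x = data[i]
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            sigma = sigma0 * (1.0 - frac) + 1e-3
            bmu = best_matching_unit(weights, x)
            for u in range(n_units):
                # Units near the winner on the grid are pulled more strongly.
                h = math.exp(-((u - bmu) ** 2) / (2.0 * sigma ** 2))
                for d in range(dim):
                    weights[u][d] += lr * h * (x[d] - weights[u][d])
            step += 1
    return weights
```

After training, each profile can be assigned to a unit with `best_matching_unit(weights, profile)`. Profiles that map to the same or neighboring units have similar shapes, which is what makes the map useful for organizing and visualizing expression data.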
Article
This document describes the C programming language interface to our Fragment Assembly Kernel. Inputs to the Fragment Assembly Kernel are (1) DNA fragment sequences from potentially inaccurate sequencing experiments, and (2) optional constraints on fragment assembly such as known fragment overlaps or relative fragment orientation. Fragment sequence version control is supported. The Fragment Assembly Kernel produces the most probable reconstructions of the original DNA sequence from the fragments, subject to any specified constraints. Each fragment assembly includes multiple sequence alignment and consensus sequences. Multiple sequence alignment editing capabilities are provided to allow manual correction of sequence errors. September 15, 1994. Department of Computer Science, The University of Arizona, Tucson, Arizona 85721. *This work was supported in part by Baylor College of Medicine under DOE Grant DE-FG05-91ER61132. An Interface for a Fragment Assembly Kernel. 1. Overview. At a concept...