Basic Local Alignment Search Tool

Abstract and Figures

A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.
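To make the maximal segment pair (MSP) score in the abstract concrete, here is a minimal sketch. It takes a segment pair to be two equal-length, ungapped substrings, one from each sequence, and the MSP score to be the highest score over all such pairs. The function name `msp_score` and the +5/-4 match/mismatch values are illustrative assumptions. The code scans every diagonal exhaustively, so it shows the quantity that BLAST approximates, not the BLAST heuristic itself.

```python
def msp_score(a: str, b: str, match: int = 5, mismatch: int = -4) -> int:
    """Return the best score of any ungapped, equal-length segment pair.

    Exhaustive O(len(a) * len(b)) scan; the scoring values are
    illustrative assumptions, not parameters taken from the abstract.
    """
    best = 0
    # A diagonal fixes the offset i - j between positions in a and b.
    for offset in range(-(len(b) - 1), len(a)):
        i = max(offset, 0)
        j = i - offset
        running = 0
        while i < len(a) and j < len(b):
            step = match if a[i] == b[j] else mismatch
            # Restart the segment whenever the running score drops below zero.
            running = max(0, running + step)
            best = max(best, running)
            i += 1
            j += 1
    return best


if __name__ == "__main__":
    print(msp_score("ACGTACGT", "TTACGTAA"))
```

Resetting the running score at zero on each diagonal finds the best-scoring contiguous run without enumerating every start and end position separately.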
... Based on conserved domain analysis by alignment of homologous proteins and protein structural modeling by Phyre2 (Kelley, Mezulis et al. 2015, Liu, Lee et al. 2020), we concluded that ZMO1055 contains an N-terminal PAS (Per-Arnt-Sim) domain (Xing, Gumerov et al. 2023). PAS domains display a high sequence diversity, which explains why the PAS domain of ZMO1055 has not been recognized by standard two-dimensional BLAST search (Altschul, Gish et al. 1990). A PAS domain not only functions as a signal receiver and transducer domain, but also dimerizes GGDEF/EAL domain-containing proteins, enabling or enhancing the catalytic activity upon signal perception (Schirmer 2016). ...
... equivalent of the position 526 of the EAL domain in ZMO1055ZM4 in the first 5039 nonredundant most closely related proteins as derived from the NCBI database (Altschul, Gish et al. 1990). After automatic and subsequently manual curation of the alignment of the EAL domains of these proteins, a statistical summary of the conservation at this site was performed. ...
Preprint
Full-text available
The ubiquitous second messenger cyclic di-GMP is the most abundant diffusible nucleotide signalling system in bacteria deciding the life style transition between sessility and motility. GGDEF diguanylate cyclases and EAL phosphodiesterases conventionally direct the turnover of this signaling molecule. Thereby, those domains are subject to micro- and macroevolution with the evolutionary forces that promote alterations in these proteins currently mostly unknown. While the highly conserved signature amino acids involved in divalent ion binding and catalysis equally as signal transduction modules have been readily identified, more subtle amino acid substitutions that modulate the catalytic activity have been rarely recognized and their molecular mechanism characterized. Our previous work identified the A526V substitution to be involved in downregulation of the apparent catalytic activity of the Zymomonas mobilis ZM4 PAS-GGDEF-EAL ZMO1055 phosphodiesterase and leading to a self-flocculation phenotype mediated by elevated production of the exopolysaccharide cellulose in Z. mobilis ZM401. As A526 is located at a position that has previously not been recognized to affect the catalytic activity of the EAL domain, we further investigated the molecular mechanisms and the functional conservation of this substitution. Using a number of model systems, our results indicate that the alanine at position 526 is highly conserved in ZMO1055 homologs and beyond with the A526V mutation to alter the apparent phosphodiesterase activity in subgroups of EAL domains. Thus we hypothesize that single amino acid substitutions that lead to alterations in the catalytic activity of cyclic di-GMP turnover domains amplify the signaling output and thus significantly contribute to the flexibility and adaptability of the cyclic di-GMP signaling network. In this context, ZMO1055 seems to be a current evolutionary target.
... The Rapid Annotation Subsystem Technology (RAST) server (https://rast.nmpdr.org/) [17] classified the predicted genes into the subsystems of the SEED database [18] and the RPSBLAST tool of BLAST version 2.13.0 [19] classified the proteins according to the functional categories of the Clusters of Orthologous Genes (COG) database [20], considering an E-value threshold of 1e-10 to select the significant alignments. In addition, the KEGG Automatic Annotation Server (KASS) [21] classified the proteins according to the KEGG orthology (KO) system to allow the metabolic pathways mapping. ...
... First, we aligned its genome sequence with the prokaryotic representative genomes of NCBI Refseq database (https://ftp.ncbi.nlm.nih.gov/blast/db/; version of 23/12/2023), using the BLASTn tool of BLAST version 2.13.0 [19] and considering an E-value threshold of 1e-10 to select the significant alignments. Then, the JSpeciesWS server (https://jspecies.ribohost.com/jspeciesws/) ...
Preprint
Full-text available
The Bacillus sp. isolate EB-40 was characterized in 'Prata Anã' banana (Musa sp.) plants as an endophyte capable of colonizing both inter- and intracellular spaces of roots, nitrogen fixation, phosphate solubilization, in vitro synthesis of indole-3-acetic acid, and promotion of enhancements in the development of micropropagated banana seedlings. Here, we report the whole-genome sequence of Bacillus sp. isolate EB-40 and its taxonomic assignment. Its genome is composed of one chromosome and three plasmids. The chromosome is a circular double-stranded DNA (5,613,235 base pairs (bp)) with a GC content of 35.3% and 5,462 genes. The three plasmids have a total length of 237,685 bp with 201 genes. Comparative genomics highlighted significant conservation of the isolate EB-40 genome with other B. cereus isolates, leading to its assignment as a novel isolate within this species.
... The nucleotide and whole genome sequencing (WGS) data were then analyzed using QIAGEN CLC Genomics Workbench, Biocyc Pathway Tools v27 (Karp et al., 2019), and PATRIC v3.6.12 (Davis et al., 2020). BLAST (basic local alignment search tool) search was performed using NCBI services and databases (Altschul et al., 1990). Classification of protein family (pfam) search was performed via InterPro server (Paysan-Lafosse et al., 2023). ...
... Taxonomy was assigned using UCLUST (Edgar, 2010) against the Greengenes database version 13_850 for 16S rDNA OTUs (DeSantis et al., 2006; McDonald et al., 2012). For fungal ITS sequences, taxonomy was assigned using BLAST (Altschul et al., 1990) against the UNITE database (Kõljalg et al., 2013) V6.9.7 (E < 1e-5). The resultant OTU abundance tables for both primer sets were filtered to remove singletons and rarefied to an even number of sequences per sample to ensure an equal sampling depth. ...
Article
Full-text available
1. Dryland vegetation is characterized by discrete plant patches that accumulate and capture soil resources under their canopies. These "fertile islands" are major drivers of dryland ecosystem structure and functioning, yet we lack an integrated understanding of the factors controlling their magnitude and variability at the global scale. 2. We conducted a standardized field survey across 236 drylands from five continents. At each site, we measured the composition, diversity and cover of perennial plants. Fertile island effects were estimated at each site by comparing composite soil
... To elucidate the potential functional implications of the identified mutations, we associated all SNPs identified in correlation with Brix or POL measures with potential gene sequences retrieved from the assembled transcriptome. Specifically, we conducted an alignment of all assembled transcripts with the sugarcane genome sequence of the cultivar SP70-1143 using the BLASTn v2.11.0+ tool (Altschul et al. 1990). For each SNP-associated scaffold, we considered a maximum of 5 alignments, applying an E-value cutoff of 1e-6. ...
Preprint
Full-text available
Sugarcane (Saccharum spp.) holds significant economic importance in sugar and biofuel production. Despite extensive research, understanding highly quantitative traits, such as sucrose content, remains challenging due to the complex genomic landscape of the crop. In this study, we conducted a multiomic investigation to elucidate the genetic architecture and molecular mechanisms governing sucrose accumulation in sugarcane. Using a biparental cross (IACSP95-3018 × IACSP93-3046) and a genetically diverse collection of sugarcane genotypes, we evaluated the soluble solids (Brix) and sucrose content (POL) across various years and environments. Both populations were genotyped using a genotyping-by-sequencing (GBS) approach, with single nucleotide polymorphisms (SNPs) identified via bioinformatics pipelines. Genotype‒phenotype associations were established using a combination of traditional linear mixed-effect models and machine learning algorithms. Furthermore, we conducted an RNA sequencing (RNA-Seq) experiment on genotypes exhibiting distinct Brix and POL profiles across different developmental stages. Differentially expressed genes (DEGs) potentially associated with variations in sucrose accumulation were identified. All findings were integrated through a comprehensive gene coexpression network analysis. Strong correlations among the evaluated characteristics were observed, with estimates of modest to high heritabilities. By leveraging a broad set of SNPs identified for both populations, we identified several SNPs potentially linked to phenotypic variance. Our examination of genes close to these markers facilitated the association of such SNPs with DEGs in genotypes with contrasting sucrose levels. Through the integration of these results with a gene coexpression network, we delineated a set of genes potentially involved in the regulatory mechanisms of sucrose accumulation in sugarcane, collectively contributing to the definition of this critical phenotype.
Our findings constitute a significant resource for biotechnology and plant breeding initiatives. Furthermore, our genotype‒phenotype association models hold promise for application in genomic selection, offering valuable insights into the molecular underpinnings governing sucrose accumulation in sugarcane.
... For DNA sequence verification, the most commonly used database is the Basic Local Alignment Search Tool (BLAST), a bioinformatics tool established by researchers at the National Institutes of Health (Altschul et al., 1990). The BLAST algorithm allows comparison of newly generated sequences to a library of sequences (https://blast.ncbi.nlm.nih.gov/Blast.cgi). ...
Article
Full-text available
The Padina genus has 75 species, of which 54 are taxonomically recognized. Humans have long used Padina as a nutritional food that supplies vitamins, proteins, and carotenoids. Recently, several drugs and dietary supplements containing active components extracted from Padina have been commercially developed. Species of the Padina genus are morphologically quite similar. Almost all previous studies of the Padina genus used the morphological identification method. This study presents the results of correlation analysis between morphological characteristics and the rbcL marker in the identification of Padina seaweed samples collected at Hon Thom, Phu Quoc, Kien Giang province (HTO), Vietnam. The phylogenetic tree and the genetic distance between HTO and the references showed that the samples belong to Padina australis. Thus, identification methods based on genetic markers and morphology on HTO seaweed samples were consistent.
... All raw reads for each gene were assembled and checked for ambiguities and low-quality data in Geneious R10 (Kearse et al., 2012). Edited sequences were verified for contamination using the BLAST-n algorithm run over the GenBank nr/nt database (Altschul et al., 1990). For the phylogenetic reconstruction, sequences of the genus Philinopsis available in the public database (GenBank) were added to the dataset (Table 1). ...
Article
Full-text available
Philinopsis gigliolii (Tapparone Canefri, 1874) was described under the name Aglaja gigliolii based on preserved material from the Pacific coast of Japan, collected during an expedition of the Italian warship Magenta in 1864-1868. Currently, this species is considered a subjective synonym of P. speciosa Pease, 1860, described from Hawaii, despite their morphological differences. To clarify the species status of P. gigliolii we have conducted a molecular phylogenetic analysis of the genus Philinopsis using COI, 16S, and histone H3 molecular markers, which included a specimen of P. gigliolii from Peter the Great Bay, the Sea of Japan. Our results confirm that P. gigliolii represents a distinct valid species, which shows both morphological and molecular differences with P. speciosa. The latter species is recovered paraphyletic and clearly needs further taxonomical revision. At the same time, the molecular analysis indicates that Australian species P. taronga (Allan, 1933) is conspecific to P. gigliolii (only two molecular substitutions were identified in 16S), and these species show many similarities in both external and internal morphology. We consider P. taronga a junior subjective synonym of P. gigliolii. Formally Chelidonura aureopunctata Rudman, 1968, described from New Zealand, is considered a junior subjective synonym of P. gigliolii as well. Philinopsis gigliolii has an antitropical distribution, its range includes subtropical and temperate areas of the Pacific Ocean in both hemispheres (the Sea of Japan, the Yellow Sea, the Pacific coast of Japan; SouthEast Australia and the northern coast of New Zealand). Three hypotheses may explain this distribution pattern. (1) The antitropical distribution results from the historical disjunction across tropical latitudes following the abiotic or biotic factors. 
(2) Philinopsis gigliolii may be widely distributed in temperate and tropical waters of the Pacific Ocean but be overlooked in the central part of its geographic range due to external similarities to other species of the genus. (3) The last hypothesis suggests the anthropogenic transportation of P. gigliolii. Further sampling activity and comparative genetic analyses may contribute to a better understanding of this very interesting biogeographic pattern. How to cite this article: Chaban E.M., Ekimova I.A., Chernyshev A.V. 2024. Philinopsis gigliolii (Gastropoda: Heterobranchia: Aglajidae) from the Sea of Japan: validity, synonymy and biogeography // Invert.
... For analysis of DNA sequencing data, BLAST 2.0 and Chromas 1.45 (http://www.technelysium.com.au) were utilized [14]. Single-nucleotide polymorphisms (SNPs) have been detected in PCR products of the b- ...
... We used Cutadapt (v3.3; Martin (2011)) to demultiplex the obtained sequencing reads and DADA2 (Callahan et al. 2016) to infer Amplicon Sequencing Variants (ASVs). Taxonomy was assigned against the Martin7 database (Vondrák et al. 2023) using a local BLAST (Altschul et al. 1990) search. Assignments were kept if the "percent identity" was higher than 97%. ...
Article
Full-text available
Lichens are an important part of forest ecosystems, contributing to forest biodiversity, the formation of micro-niches and nutrient cycling. Assessing the diversity of lichenised fungi in complex ecosystems, such as forests, requires time and substantial skills in collecting and identifying lichens. The completeness of inventories thus largely depends on the expertise of the collector, time available for the survey and size of the studied area. Molecular methods of surveying biodiversity hold the promise to overcome these challenges. DNA barcoding of individual lichen specimens and bulk collections is already being applied; however, eDNA methods have not yet been evaluated as a tool for lichen surveys. Here, we assess which species of lichenised fungi can be detected in eDNA swabbed from bark surfaces of living trees in central European forests. We compare our findings to an expert floristic survey carried out in the same plots about a decade earlier. In total, we studied 150 plots located in three study regions across Germany. In each plot, we took one composite sample based on six trees, belonging to the species Fagus sylvatica, Picea abies and Pinus sylvestris. The eDNA method yielded 123 species, the floristic survey 87. The total number of species found with both methods was 167, of which 48% were detected only in eDNA, 26% only in the floristic survey and 26% in both methods. The eDNA contained a higher diversity of inconspicuous species. Many prevalent taxa reported in the floristic survey could not be found in the eDNA due to gaps in molecular reference databases. We conclude that, currently, eDNA has merit as a complementary tool to monitor lichen biodiversity at large scales, but cannot be used on its own. We advocate for the further development of specialised and more complete databases.
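The citing excerpt for this article keeps a taxonomic assignment only when the percent identity is higher than 97%, and other excerpts on this page apply E-value cutoffs such as 1e-10. A minimal sketch of that kind of post-filter follows. It assumes the hits were saved in BLAST+ default tabular layout (`-outfmt 6`), in which column 3 is percent identity and column 11 is the E-value. The names `Hit`, `parse_tabular` and `keep_hits` and the sample rows are hypothetical and do not come from any of the cited studies.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple, Optional


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    evalue: float


def parse_tabular(handle: Iterable[str]) -> Iterator[Hit]:
    """Yield hits from tab-separated BLAST output (default 12-column layout)."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        yield Hit(row[0], row[1], float(row[2]), float(row[10]))


def keep_hits(hits: Iterable[Hit],
              min_identity: float = 97.0,
              max_evalue: Optional[float] = None) -> list:
    """Keep hits whose identity is strictly above min_identity and,
    if max_evalue is given, whose E-value does not exceed it."""
    kept = []
    for hit in hits:
        if hit.percent_identity <= min_identity:
            continue
        if max_evalue is not None and hit.evalue > max_evalue:
            continue
        kept.append(hit)
    return kept


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    sample = (
        "asv1\tref_A\t99.2\t250\t2\t0\t1\t250\t1\t250\t1e-120\t450\n"
        "asv2\tref_B\t96.8\t250\t8\t0\t1\t250\t1\t250\t1e-100\t400\n"
    )
    for hit in keep_hits(parse_tabular(io.StringIO(sample))):
        print(hit.query, hit.subject, hit.percent_identity)
```

Keeping the two thresholds as separate parameters lets the same filter reproduce either an identity-only rule or an E-value-only rule.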
... This model was aligned with the sequence of PDB model-5KGF Chain C using PyMOL, with red-highlighted segments indicating mismatches, damage, or deletions [113][114][115][116]. Ramachandran plots and model-template sequence alignment data were retrieved from the SWISS-Model database server [117]. Sequence alignment was also confirmed using BLAST [118] [119] and the corresponding protein was modeled and aligned using PyMOL [103]. ...
Preprint
Full-text available
Glioblastoma multiforme (GBM) is one of the most common and aggressive forms of malignant brain cancer in adults and is classified based on its isocitrate dehydrogenase (IDH) mutation. Surgery, radiotherapy, and Temozolomide (TMZ) are the standard treatment methods for GBM. Here we present a combination therapy of cold atmospheric plasma (CAP) and TMZ as a key treatment for GBM. CAP works by increasing reactive oxygen and nitrogen species (RONS) and targets the spread of the tumor. In this study, we performed the transcriptomic analysis of U-87MG cells by high throughput deep RNA-Seq analysis to quantify differential gene expression across the genome. Furthermore, we studied various signaling pathways and predicted structural changes of consequential proteins to elucidate the functional changes caused by up or down-regulation of the most altered genes. Our results demonstrate that combination treatment downregulated key genes like p53, histones, DNA damage markers, cyclins, in the following pathways: MAPK, P53, DNA damage and cell cycle. Moreover, in silico studies were conducted for further investigation to verify these results, and the combination of CAP & TMZ showed a significant antitumor effect in the GBM cells leading to apoptosis and damaged key proteins. Further studies of the impact of TMZ on gene expression, biochemical pathways, and protein structure will lead to improved treatment approaches for GBM.
Article
Full-text available
Sequence analysis of protein and nucleic acid databases by exhaustive string-matching algorithms is effectively implemented on large processor-array machines, such as the I.C.L. DAP. An improved method of assessing the significance of the best alignments for proteins is described. Examples involving the cystic fibrosis antigen and Drosophila vitellogenins illustrate the usefulness of this approach.
Article
Full-text available
We have developed a computer algorithm that can extract the pattern of conserved primary sequence elements common to all members of a homologous protein family. The method involves clustering the pairwise similarity scores among a set of related sequences to generate a binary dendrogram (tree). The tree is then reduced in a stepwise manner by progressively replacing the node connecting the two most similar termini by one common pattern until only a single common "root" pattern remains. A pattern is generated at a node by (i) performing a local optimal alignment on the sequence/pattern pair connected by the node with the use of an extended dynamic programming algorithm and then (ii) constructing a single common pattern from this alignment with a nested hierarchy of amino acid classes to identify the minimal inclusive amino acid class covering each paired set of elements in the alignment. Gaps within an alignment are created and/or extended using a "pay once" gap penalty rule, and gapped positions are converted into gap characters that function as 0 or 1 amino acid of any type during subsequent alignment. This method has been used to generate a library of covering patterns for homologous families in the National Biomedical Research Foundation/Protein Identification Resource protein sequence data base. We show that a covering pattern can be more diagnostic for sequence family membership than any of the individual sequences used to construct the pattern.
Article
This paper gives a formal definition of the biological concept of evolutionary distance and an algorithm to compute it. For any set S of finite sequences of varying lengths this distance is a real-valued function on $S \times S$, and it is shown to be a metric under conditions which are wide enough to include the biological application. The algorithm, introduced here, lends itself to computer programming and provides a method to compute evolutionary distance which is shorter than the other methods currently in use.
Article
Dynamic Monte Carlo studies have been performed on various diamond lattice models of β-proteins. Unlike previous work, no bias toward the native state is introduced; instead, the protein is allowed to freely hunt through all of phase space to find the equilibrium conformation. Thus, these systems may aid in the elucidation of the rules governing protein folding from a given primary sequence; in particular, the interplay of short- vs long-range interaction can be explored. Three distinct models (A-C) were examined. In model A, in addition to the preference for trans (t) over gauche states (g+ and g−) (thereby perhaps favoring β-sheet formation), attractive interactions are allowed between all nonbonded, nearest neighbor pairs of segments. If the molecules possess a relatively large fraction of t states in the denatured form, on cooling spontaneous collapse to a well-defined β-barrel is observed. Unfortunately, in model A the denatured state exhibits too much secondary structure to correctly model the globular protein collapse transition. Thus in models B and C, the local stiffness is reduced. In model B, in the absence of long-range interactions, t and g states are equally weighted, and cooperativity is introduced by favoring formation of adjacent pairs of nonbonded (but not necessarily parallel) t states. While the denatured state of these systems behaves like a random coil, their native globular structure is poorly defined. Model C retains the cooperativity of model B but allows for a slight preference of t over g states in the short-range interactions. Here, the denatured state is indistinguishable from a random coil, and the globular state is a well-defined β-barrel. Over a range of chain lengths, the collapse is well represented by an all-or-none model. Hence, model C possesses the essential qualitative features observed in real globular proteins.
These studies strongly suggest that the uniqueness of the globular conformation requires some residual secondary structure to be present in the denatured state.
Article
The theoretical basis of sequential circuit synthesis is developed, with particular reference to the work of D. A. Huffman and E. F. Moore. A new method of synthesis is developed which emphasizes formal procedures rather than the more familiar intuitive ones. Familiarity is assumed with the use of switching algebra in the synthesis of combinational circuits.
Article
Mathematical methods for comparison of nucleic acid sequences are reviewed. There are two major methods of sequence comparison: dynamic programming and a method referred to here as the regions method. The problem types discussed are comparison of two sequences, location of long matching segments, efficient database searches and comparison of several sequences.
Article
A new development is introduced here in the use of dynamic programming in finding pattern similarities in genetic sequences, as was first done by Needleman and Wunsch (1969). A condition of pattern similarity is defined and an algorithm is given which scans any set of similarities and screens out those which fail to meet the condition. When the set to be scanned contains every pair of segments, one from each of two given sequences of lengths m and n (i.e. every possible location for a pattern similarity), then it completes the scan in a number of computational steps proportional to m·n, leaving those pairs of segments which satisfy the similarity condition. The algorithm is based on the concept of match density, as suggested by Goad and Kanehisa (1982).
Article
Homology and distance measures have been routinely used to compare two biological sequences, such as proteins or nucleic acids. The homology measure of Needleman and Wunsch is shown, under general conditions, to be equivalent to the distance measure of Sellers. A new algorithm is given to find similar pairs of segments, one segment from each sequence. The new algorithm, based on homology measures, is compared to an earlier one due to Sellers.
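Several of the abstracts above describe finding similar pairs of segments by dynamic programming in time proportional to the product of the sequence lengths. A minimal sketch of a generic local-alignment recurrence of that kind follows. It is not presented as the exact algorithm of any abstract listed here. The function name `local_alignment_score`, the linear gap penalty and the default scores are illustrative assumptions.

```python
def local_alignment_score(a: str, b: str, match: int = 1,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best score of any gapped local alignment between a and b.

    Uses O(len(a) * len(b)) time and two rows of memory. Each cell holds
    the best score of an alignment ending at that pair of positions,
    floored at zero so that an alignment may start anywhere.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            cell = max(0, diag, up, left)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("AAAAACCCCC", "AAAAAGGCCCCC",
                                match=2, mismatch=-1, gap=-1))
```

Unlike an ungapped segment-pair score, this recurrence lets an alignment bridge inserted or deleted residues at the cost of the gap penalty.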