Accession numbers of DNA sequences used.

Source publication
Article
In genome analysis, k-mer-based comparison methods have become standard tools. However, even though they are able to deliver reliable results, other algorithms seem to work better in some cases. To improve k-mer-based DNA sequence analysis and comparison, we successfully checked whether adding positional resolution is beneficial for finding and/or...

Contexts in source publication

Context 1
... our analysis, we used publicly available unmasked nucleotide sequences from the NCBI website in FASTA format (Table 1) [13]. In some cases, data in the GenBank format were also used, if a comparison with gene positions was required. ...
Context 2
... a large region (~40% in size) with a slight increase in the C content, but otherwise without clear monomer preferences located at the bottom, is associated with the late genes (L1, L2). For higher word lengths (1 ≤ k ≤ 4), the stated regions are also visible and show similar word contents for all HPV types (Figures S1 and S2). [Genes 2017, 8, 122, p. 6 of 18] ...
Context 3
... contrast to k = 1, the features on the relative maps do not share a very uniform structure considering relative k-mer contents, except for very close relatives (HHV6A, HHV6B, and HHV4 type 1 and 2). Table S1). (A) Good correlation within the first region between all HPV types; (B) Bad correlation (values around zero or lower) between the first and second region; (C) Relatively good correlation in the second region for most values; (D) Good correlation amongst all of the third regions of HPV. ...
Context 4
... were represented by colored bars at the right side of the linear representation of the circular HPV genomes (E1 red, E2 blue, E4 green, E5 yellow, E6 orange, E7 purple, L1 magenta, L2 grey). The box indicates region 2, for borders of all regions see Table S1. Figure S2: Map of HPV viruses for k = 3. ...
Context 5
... order of words is alphabetically from left in each genome starting with an AAA. The box indicates region 2, for borders of all regions see Table S1. Figure S3: Map of HHV genomes for k = 2, bin width of 500 bp. ...

Citations

... The approach we use in this study to find the conserved sequence patterns is the so-called k-mer analysis [11]. As the name suggests, this technique focuses on frequencies of DNA words of length k. ...
... The analyzed DNA sequences are converted into the so-called k-mer spectra by counting the number of appearances of each possible k-mer word inside the respective sequences. The k-mer analysis has already been used to identify interesting patterns, potentially related to biological functions like gene expression and chromatin organization [11][12][13][14][15][16]. While analyzing the k-mer spectra [13], it is commonplace for the most frequent and impactful k-mer words to show repetitive internal patterns. ...
... These local densities were correlated using the Pearson correlation coefficient to determine similarity [19]. In comparison to former studies, this correlation of local densities reincorporates part of the information dropped in the k-mer analysis approach [11]. ...
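The correlation of local densities described in the excerpt above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the function names `local_density` and `pearson`, the fixed-width binning, and the choice to assign each occurrence to the bin containing its start position are all assumptions made for the example.

```python
from math import sqrt


def local_density(seq, word, bin_width):
    """Count overlapping occurrences of `word` per bin along `seq`.

    Assumption: an occurrence is assigned to the bin that contains
    its start position.
    """
    n_bins = (len(seq) + bin_width - 1) // bin_width
    counts = [0] * n_bins
    k = len(word)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] == word:
            counts[i // bin_width] += 1
    return counts


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant density profile
    return cov / sqrt(var_x * var_y)


# Toy sequence: "AC" occurs only in the first bin, "GT" only in the second.
seq = "ACACGTGT"
density_ac = local_density(seq, "AC", bin_width=4)
density_gt = local_density(seq, "GT", bin_width=4)
r = pearson(density_ac, density_gt)
```

On real data the bin width would be far larger than in this toy case, and the two density profiles could equally be gene or retrotransposon densities binned the same way.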
Article
The specific characteristics of k-mer words (2 ≤ k ≤ 11) regarding genomic distribution and evolutionary conservation were recently found. Among them are, in high abundance, words with a tandem repeat structure (repeat unit length of 1 bp to 3 bp). Furthermore, there seems to be a class of extremely short tandem repeats (≤12 bp), so far overlooked, that are non-random-distributed and, therefore, may play a crucial role in the functioning of the genome. In the following article, the positional distributions of these motifs we call super-short tandem repeats (SSTRs) were compared to other functional elements, like genes and retrotransposons. We found length- and sequence-dependent correlations between the local SSTR density and G+C content, and also between the density of SSTRs and genes, as well as correlations with retrotransposon density. In addition to many general interesting relations, we found that SINE Alu has a strong influence on the local SSTR density. Moreover, the observed connection of SSTR patterns to pseudogenes and -exons might imply a special role of SSTRs in gene expression. In summary, our findings support the idea of a special role and the functional relevance of SSTRs in the genome.
... Given that positional information had been shown to improve classification when using k-mer features [54], [55], we were curious to see if positional information could also improve the performance of secondary structure fingerprints. In a previous study [26], when using secondary structure fingerprints that do not take positions into account, the best accuracy, precision, and recall were 64.06%, 73.10%, and 63.90%, respectively. ...
Article
RNA elements that are transcribed but not translated into proteins are called non-coding RNAs (ncRNAs). They play wide-ranging roles in biological processes and disorders. Just like proteins, their structure is often intimately linked to their function. Many examples have been documented where structure is conserved across taxa despite sequence divergence. Thus, structure is often used to identify function. Specifically, the secondary structure is predicted and ncRNAs with similar structures are assumed to have same or similar functions. However, a strand of RNA can fold into multiple possible structures, and some strands even fold differently in vivo and in vitro. Furthermore, ncRNAs often function as RNA-protein complexes, which can affect structure. Because of these, we hypothesized using one structure per sequence may discard information, possibly resulting in poorer classification accuracy. Therefore, we propose using secondary structure fingerprints, comprising two categories: a higher-level representation derived from RNA-As-Graphs (RAG), and free energy fingerprints based on a curated repertoire of small structural motifs. The fingerprints take into account the difference between global and local structural matches. We also evaluated our deep learning architecture with k-mers. By combining our global-local fingerprints with 6-mer, we achieved an accuracy, precision, and recall of 91.04%, 91.10%, and 91.00%.
... The combination of position information of the SSTRs with their abundance as in principle shown in [212] by mapping of SSTRs reveals specific patterns in centromeric and other regions and are also depending on the length of the SSTRs and may show also correlations with specific sites for mutations [manuscript in preparation]. ...
Article
Complex functioning of the genome in the cell nucleus is controlled at different levels: (a) the DNA base sequence containing all relevant inherited information; (b) epigenetic pathways consisting of protein interactions and feedback loops; (c) the genome architecture and organization activating or suppressing genetic interactions between different parts of the genome. Most research so far has shed light on the puzzle pieces at these levels. This article, however, attempts an integrative approach to genome expression regulation incorporating these different layers. Under environmental stress or during cell development, differentiation towards specialized cell types, or to dysfunctional tumor, the cell nucleus seems to react as a whole through coordinated changes at all levels of control. This implies the need for a framework in which biological, chemical, and physical manifestations can serve as a basis for a coherent theory of gene self-organization. An international symposium held at the Biomedical Research and Study Center in Riga, Latvia, on 25 July 2022 addressed novel aspects of the abovementioned topic. The present article reviews the most recent results and conclusions of the state-of-the-art research in this multidisciplinary field of science, which were delivered and discussed by scholars at the Riga symposium.
... Considering the probe length of six trinucleotide units together with one dye molecule at one end, the results of SMLM indicated small chromatin loops for the expansion region rearranging chromatin on the nanoscale so that a deactivation could be explained by geometric reasons in the genome architecture [36]. Although established programs for the design of COMBO-FISH probes and probe sets were available [20,26], novel so-called alignment-free investigations of k-mers, their frequencies and their positioning along the nucleotide sequence of a chromosome [49,50] have found oligonucleotide probes that uniquely bind in a given repetition rate to chromatin sequences repetitively occurring as interspersed motives [29]. New generations of specific COMBO-FISH probes were elucidated against SINEs (Short Interspersed Nuclear Elements, e.g., ALU elements [32,48,49], Figure 4), LINEs (Long Interspersed Nuclear Elements, e.g., L1 [32]), or centromeres [44]. ...
Chapter
Genome sequence databases of many species have been completed so that it is possible to apply an established technique of FISH (Fluorescence In Situ Hybridization) called COMBO-FISH (COMBinatorial Oligonucleotide FISH). It makes use of bioinformatic sequence database search for probe design. Oligonucleotides of typical lengths of 15–30 nucleotides are selected in such a way that they only co-localize at the given genome target. Typical probe sets of 20–40 stretches label about 50–250 kb specifically. The probes are either solely composed of purines or pyrimidines, respectively, for Hoogsteen-type binding, or of purines and pyrimidines together for Watson-Crick type binding. We present probe sets for tumor cell analysis. With an improved sequence database analysis and sequence search according to uniqueness, a novel family of probes repetitively binding to characteristic genome features like SINEs (Short Interspersed Nuclear Elements, e.g., ALU elements), LINEs (Long Interspersed Nuclear Elements, e.g., L1), or centromeres has been developed. All types of probes can be synthesized commercially as DNA or PNA probes, labelled by dye molecules, and specifically attached to the targets for microscopy research. With appropriate dyes labelled, cell nuclei can be subjected to super-resolution localization microscopy.
... Hence, despite the limited differences of codon usages between the ISGs and background human genes, these features were useful for discriminating the ISGs from non-ISGs. In the last category, we calculated the occurrence frequency of 256 nucleotide 4-mers to add some positional resolution for finding and comparing interesting organisational structures [41]. The difference between ISGs and non-ISGs ...
... The second category also focused on the combination of 2 nucleotides but added the impact of phosphodiester bonds along the 5 to 3 direction (e.g., CpG content) [95]. The third category calculated the occurrence frequency of 4-mers (e.g., "CGCG" composition to involve some positional resolution) [41]. The last category considered the co-occurrence of SLNPs. ...
Article
Background: A virus-infected cell triggers a signalling cascade, resulting in the secretion of interferons (IFNs), which in turn induces the upregulation of the IFN-stimulated genes (ISGs) that play a role in antipathogen host defence. Here, we conducted analyses on large-scale data relating to evolutionary gene expression, sequence composition, and network properties to elucidate factors associated with the stimulation of human genes in response to IFN-α. Results: We find that ISGs are less evolutionary conserved than genes that are not significantly stimulated in IFN experiments (non-ISGs). ISGs show obvious depletion of GC content in the coding region. This influences the representation of some compositions following the translation process. IFN-repressed human genes (IRGs), downregulated genes in IFN experiments, can have similar properties to the ISGs. Additionally, we design a machine learning framework integrating the support vector machine and novel feature selection algorithm that achieves an area under the receiver operating characteristic curve (AUC) of 0.7455 for ISG prediction. Its application in other IFN systems suggests the similarity between the ISGs triggered by type I and III IFNs. Conclusions: ISGs have some unique properties that make them different from the non-ISGs. The representation of some properties has a strong correlation with gene expression following IFN-α stimulation, which can be used as a predictive feature in machine learning. Our model predicts several genes as putative ISGs that so far have shown no significant differential expression when stimulated with IFN-α in the cell/tissue types in the available databases. A web server implementing our method is accessible at http://isgpre.cvr.gla.ac.uk/. The docker image at https://hub.docker.com/r/hchai01/isgpre can be downloaded to reproduce the prediction.
... Among different methodologies that rely on DNA composition to identify horizontally transferred genomic regions [126], k-mer spectrum analysis is a standard tool for this purpose [127,128]. Normalized k-mer spectra for DNA sequences of arbitrary length were generated by counting occurrences of all k-mers and normalizing by the total amount of words counted. ...
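The normalization described in the excerpt above (count all k-mers, then divide by the total number of words counted) can be sketched as below. This is an illustrative sketch only: the function name `normalized_spectrum` is hypothetical, and skipping words that contain characters other than A, C, G, or T is an assumption that the excerpt does not state.

```python
from collections import Counter
from itertools import product


def normalized_spectrum(seq, k):
    """Return the relative frequency of every possible k-mer in `seq`.

    Overlapping windows are counted. Assumption: words containing
    characters other than A, C, G, T (e.g. N) are ignored.
    """
    seq = seq.upper()
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    if total == 0:
        return {w: 0.0 for w in words}
    return {w: counts[w] / total for w in words}


# Toy example: the four 2-mers AA, AC, CG, GT each occur once.
spectrum = normalized_spectrum("AACGT", 2)
```

Because the spectrum is normalized, sequences of different lengths yield vectors on the same scale, which is what allows sequences "of arbitrary length" to be compared.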
Article
Sexual reproduction consists of genome reduction by meiosis and subsequent gamete fusion. The presence of genes homologous to eukaryotic meiotic genes in archaea and bacteria suggests that DNA repair mechanisms evolved towards meiotic recombination. However, fusogenic proteins resembling those found in gamete fusion in eukaryotes have so far not been found in prokaryotes. Here, we identify archaeal proteins that are homologs of fusexins, a superfamily of fusogens that mediate eukaryotic gamete and somatic cell fusion, as well as virus entry. The crystal structure of a trimeric archaeal fusexin (Fusexin1 or Fsx1) reveals an archetypical fusexin architecture with unique features such as a six-helix bundle and an additional globular domain. Ectopically expressed Fusexin1 can fuse mammalian cells, and this process involves the additional globular domain and a conserved fusion loop. Furthermore, archaeal fusexin genes are found within integrated mobile elements, suggesting potential roles in cell-cell fusion and gene exchange in archaea, as well as different scenarios for the evolutionary history of fusexins. Sexual reproduction in eukaryotes involves gamete fusion, mediated by fusogenic proteins. Here, the authors identify fusogenic protein homologs encoded within mobile genetic elements in archaeal genomes, solve the crystal structure of one of the proteins, and show that its ectopic expression can fuse mammalian cells, suggesting potential roles in cell-cell fusion and gene exchange.
... All aforementioned advantages for sequences visualization motivate researchers to visualize bio sequences in the form of 2D or 3D images [15,30–32]. Theoretically, all kinds of K-mer histogram of sequences [6,8,30], one hot representation, representation methods based on physicochemical properties [33], representation methods based on a combination vector of several types of information [34], and DNA-walk representation can be adopted to encode biological sequences as 2D images. Of course, it should be noted that methods like [33] that emphasize physicochemical properties are often developed for protein sequences, since the physicochemical properties of amino acids are much more diverse, and therefore, create a rich set of information. ...
... To overcome the aforementioned issues, various methods were proposed, such as Spatial_K-mer [30], to encode locations as well as frequencies of the K-mers. For this purpose, the sequences are first split into chunks of constant length, and then, the K-mer frequencies for each substring are determined to preserve the local information of the string. ...
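The chunking idea in the excerpt above (split the sequence into constant-length chunks, then compute k-mer frequencies per chunk) can be sketched as follows. This is not the Spatial_K-mer implementation; the function name `chunked_kmer_frequencies`, the non-overlapping chunks, and the decision to drop k-mers that span a chunk boundary are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def chunked_kmer_frequencies(seq, k, chunk_len):
    """Return one k-mer frequency vector per fixed-length chunk of `seq`.

    Each vector lists relative frequencies in alphabetical word order
    (AA..., AC..., ...). Assumptions: chunks do not overlap, k-mers that
    span a chunk boundary are not counted, and a chunk shorter than k
    yields an all-zero vector.
    """
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    features = []
    for start in range(0, len(seq), chunk_len):
        chunk = seq[start:start + chunk_len]
        counts = Counter(chunk[i:i + k] for i in range(len(chunk) - k + 1))
        total = sum(counts[w] for w in words)
        features.append([counts[w] / total if total else 0.0 for w in words])
    return features


# Toy example: the first chunk is all A, the second all C.
vectors = chunked_kmer_frequencies("AAAACCCC", k=1, chunk_len=4)
```

A single global spectrum of this toy sequence would report A and C at 50% each and lose the fact that they occupy different halves; the per-chunk vectors keep that positional information.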
Article
The classification of biological sequences is an open issue for a variety of data sets, such as viral and metagenomics sequences. Therefore, many studies utilize neural network tools, as the well-known methods in this field, and focus on designing customized network structures. However, a few works focus on more effective factors, such as input encoding method or implementation technology, to address accuracy and efficiency issues in this area. Therefore, in this work, we propose an image-based encoding method, called WalkIm, whose adoption, even in a simple neural network, provides competitive accuracy and superior efficiency, compared to the existing classification methods (e.g. VGDC, CASTOR, and DLM-CNN) for a variety of biological sequences. Using WalkIm for classifying various data sets (i.e. viruses whole-genome data, metagenomics read data, and metabarcoding data), it achieves the same performance as the existing methods, with no enforcement of parameter initialization or network architecture adjustment for each data set. It is worth noting that even in the case of classifying high-mutant data sets, such as Coronaviruses, it achieves almost 100% accuracy for classifying its various types. In addition, WalkIm achieves high-speed convergence during network training, as well as reduction of network complexity. Therefore, the WalkIm method enables us to execute the classifying neural networks on a normal desktop system in a short time interval. Moreover, we addressed the compatibility of the WalkIm encoding method with free-space optical processing technology. Taking advantage of the optical implementation of convolutional layers, we illustrated that the training time can be reduced by up to 500 times. In addition to all aforementioned advantages, this encoding method preserves the structure of generated images in various modes of sequence transformation, such as reverse complement, complement, and reverse modes.
... The plasmid genomes were subsampled using overlapping k-length nucleotides, and the occurrence of every k-mer was counted. k-mer counting is a well-studied method for analyzing sequence data (30). The subsequence size k is a critical parameter as the subsequences yield various pieces of information, depending on the size. ...
Article
Plasmids play a major role facilitating the spread of antimicrobial resistance between bacteria. Understanding the host range and dissemination trajectories of plasmids is critical for surveillance and prevention of antimicrobial resistance. Identification of plasmid host ranges could be improved using automated pattern detection methods compared to homology-based methods due to the diversity and genetic plasticity of plasmids. In this study, we developed a method for predicting the host range of plasmids using machine learning, specifically random forests. We trained the models with 8,519 plasmids from 359 different bacterial species per taxonomic level; the models achieved Matthews correlation coefficients of 0.662 and 0.867 at the species and order levels, respectively. Our results suggest that despite the diverse nature and genetic plasticity of plasmids, our random forest model can accurately distinguish between plasmid hosts. This tool is available online through the Center for Genomic Epidemiology (https://cge.cbs.dtu.dk/services/PlasmidHostFinder/). Importance: Antimicrobial resistance is a global health threat to humans and animals, causing high mortality and morbidity while effectively ending decades of success in fighting against bacterial infections. Plasmids confer extra genetic capabilities to the host organisms through accessory genes that can encode antimicrobial resistance and virulence. In addition to lateral inheritance, plasmids can be transferred horizontally between bacterial taxa. Therefore, detection of the host range of plasmids is crucial for understanding and predicting the dissemination trajectories of extrachromosomal genes and bacterial evolution as well as taking effective countermeasures against antimicrobial resistance.
... Therefore, alignment-free alternatives are promising to avoid this algorithmic complexity [7,8,83,84,87,91]. Commonly used approaches are Bag-of-Words (BoW [67]), information theoretic methods based on the Kolmogorov-Smirnov complexity [35] and the related Normalized Compression Distance [13,40]. Recently, similarities based on Natural Vectors gained attraction [17,42,92]. ...
Article
We present an approach to discriminate SARS-CoV-2 virus types based on their RNA sequence descriptions avoiding a sequence alignment. For that purpose, sequences are preprocessed by feature extraction and the resulting feature vectors are analyzed by prototype-based classification to remain interpretable. In particular, we propose to use variants of learning vector quantization (LVQ) based on dissimilarity measures for RNA sequence data. The respective matrix LVQ provides additional knowledge about the classification decisions like discriminant feature correlations and, additionally, can be equipped with easy to realize reject options for uncertain data. Those options provide self-controlled evidence, i.e., the model refuses to make a classification decision if the model evidence for the presented data is not sufficient. This model is first trained using a GISAID dataset with given virus types detected according to the molecular differences in coronavirus populations by phylogenetic tree clustering. In a second step, we apply the trained model to another but unlabeled SARS-CoV-2 virus dataset. For these data, we can either assign a virus type to the sequences or reject atypical samples. Those rejected sequences allow to speculate about new virus types with respect to nucleotide base mutations in the viral sequences. Moreover, this rejection analysis improves model robustness. Last but not least, the presented approach has lower computational complexity compared to methods based on (multiple) sequence alignment. Supplementary information: The online version contains supplementary material available at 10.1007/s00521-021-06018-2.
... The k-mer frequency is also significant for assembled genome analysis. In genomic feature analysis, k-mers of length 7-9 bp are primary for motif analysis (Cserhati et al., 2019), and specific k-mer patterns help identify DNA features (Sievers et al., 2017). In genome-wide association studies, k-mers provide a versatile descriptor to capture genetic variants (Jaillard et al., 2018) and overcome the limitations of single nucleotide polymorphism-based association (Lees et al., 2016). ...
Article
Motivation: The k-mer frequency in whole genome sequences provides researchers with an insightful perspective on genomic complexity, comparative genomics, metagenomics, and phylogeny. The current k-mer counting tools are typically slow, and they require large memory and hard disk for assembled genome analysis. Results: We propose a novel and ultra-fast k-mer counting algorithm, KCOSS, to fulfill k-mer counting mainly for assembled genomes with segmented Bloom filter, lock-free queue, lock-free thread pool, and cuckoo hash table. We optimize running time and memory consumption by recycling memory blocks, merging multiple consecutive first-occurrence k-mers into C-read, and writing a set of C-reads to disk asynchronously. KCOSS was comparatively tested with Jellyfish2, CHTKC, and KMC3 on seven assembled genomes and three sequencing datasets in running time, memory consumption, and hard disk occupation. The experimental results show that KCOSS counts k-mer with less memory and disk while having a shorter running time on assembled genomes. KCOSS can be used to calculate the k-mer frequency not only for assembled genomes but also for sequencing data. Availability: The KCOSS software is implemented in C++. It is freely available on GitHub: https://github.com/kcoss-2021/KCOSS. Supplementary information: Supplementary data are available at Bioinformatics online.