Table 2 - uploaded by Brian R King
Avian influenza A subtype frequency in FLU60

Source publication
Article
Digital signal processing (DSP) techniques for biological sequence analysis continue to grow in popularity due to the inherent digital nature of these sequences. DSP methods have demonstrated early success in detecting coding regions in a gene. More recently, these methods have been used to establish DNA gene similarity. We present the inter-coeffic...

Similar publications

Article
We previously introduced a numerical quantity called the stability (Ps) of an inferred tree and showed that for the tree to be reliable this stability as well as the reliability of the tree, which is usually computed as the bootstrap probability (Pb), must be high. However, if genome duplication occurs in a species, a gene family of the genome also...

Citations

... At this stage proposed method adds additional zeros if required to make all the sequences of equal length. This is called zero-padding [39,40]. The 2D FFT is computed to the numerical values as explained in Sect. ...
Article
Protein sequence comparison remains challenging for researchers owing to the computational complexity that comes with 20 amino acids, compared with only four nucleotides in genome sequences. Further, protein sequences of different species have different lengths, which poses additional challenges for researchers developing methods, especially alignment-free methods, to compare protein sequences. In this work, an efficient technique to compare protein sequences is developed using a graphical representation. First, the 20 amino acids are classified into 4 groups based on polar class, narrowing the representational range from 20 to 4. Then a unit vector technique based on a two-quadrant Cartesian system is proposed to provide a new two-dimensional graphical representation of the protein sequence. Two approaches are then proposed to cope with the varying lengths of protein sequences from various species: one uses Dynamic Time Warping (DTW), while the other uses a two-dimensional Fast Fourier Transform (2D FFT). Next, the effectiveness of these two techniques is analyzed using two evaluation criteria: quantitative measures based on symmetric distance (SD) and computational speed. An analysis is performed on five data sets of 9 ND4, 9 ND5, 9 ND6, 12 Baculovirus, and 24 TF proteins under the two methods. It is found that the FFT-based method produces the same results as DTW but in less computational time, and that the result of the proposed method agrees with the known biological reference. Further, the present method produces better clustering than the existing ones.
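To illustrate how DTW copes with sequences of different lengths, here is a minimal sketch of the classic dynamic-programming recurrence. It works on plain one-dimensional numeric series with an absolute-difference local cost; both choices are assumptions made for illustration, since the abstract does not specify how the two-dimensional graphical representation is fed to DTW. The name `dtw_distance` is hypothetical.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two numeric series of possibly different lengths.

    Assumption: 1D series and |x - y| as the local cost; the cited work
    operates on a 2D graphical representation that is not reproduced here.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Series of different lengths can still align perfectly by warping.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
```

The quadratic table is what makes DTW slower than a padded FFT on long sequences, which is consistent with the speed comparison reported in the abstract.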
... This is the accepted routine, since identical DNA sequences tend to reflect identical biological functionality, and it forms the basis for determining whether the DNA sequences are homologous, i.e., whether there is shared ancestry between them. DSP based methods have been used to locate reading frames in DNA, including different gene regions (exons) [7], to detect splice sites within the gene [8], to identify active sites in a protein, to identify acceptor splicing sites and motif patterns in DNA [9,10]. DFT is the most popular DSP technique [7]. ...
... DSP based methods have been used to locate reading frames in DNA, including different gene regions (exons) [7], to detect splice sites within the gene [8], to identify active sites in a protein, to identify acceptor splicing sites and motif patterns in DNA [9,10]. DFT is the most popular DSP technique [7]. In general, the fast Fourier transform was developed to compute the DFT. ...
Article
Within the realm of human thought on nature, Fourier analysis is considered one of the greatest ideas put forward. The Fourier transform shows that any periodic function can be rewritten as a sum of sinusoidal functions. Taking a Fourier transform view of real-world problems, such as the DNA sequences of genes, can make them intuitively simpler to understand than in their original domain. In this study we used the discrete Fourier transform (DFT) on DNA sequences of a set of genes in the bovine genome known to govern milk production, in order to develop a new gene clustering algorithm. The implementation of this algorithm is very user-friendly and requires only simple routine mathematical operations. By transforming the configuration of gene sequences into the frequency domain, we sought to elucidate important features and reveal hidden gene properties. This is biologically appealing since no information is lost via this transformation and we are therefore not reducing the number of degrees of freedom. The results from different clustering methods were integrated using evidence accumulation algorithms to provide in silico validation of our results. We propose using candidate gene sequences accompanied by other genes of biologically unknown function. These will then be assigned some degree of relevant annotation by using our proposed algorithm. Current knowledge in biological gene clustering investigation is also lacking, and so DFT-based methods will help shed light on the use of these algorithms for biological insight.
... The inter coefficient difference (ICD) method is applied to the spectrum of the binary representation of DNA sequences to compare genome sequences. 35 Authors in ref 36 use a different method based on the Fourier power spectrum to compare DNA sequences. The usage of FFT in the Voss kind of representation-based similarity analysis of protein sequences is recently identified by authors in ref 37. ...
... This is called zero padding. 35,38 Then, FFT is applied to these numerical values using eq 1. ...
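Zero padding as described in this excerpt can be sketched as follows. The helper name `zero_pad` and the sample values are hypothetical, and the sketch assumes the sequences have already been converted to numeric series; the eq 1 mentioned in the excerpt is not reproduced here.

```python
def zero_pad(series_list):
    """Append zeros so every numeric series matches the longest one."""
    target = max(len(s) for s in series_list)
    return [list(s) + [0.0] * (target - len(s)) for s in series_list]


# Hypothetical numeric encodings of three sequences of different lengths.
encoded = [[1.0, 0.5], [0.2, 0.4, 0.6, 0.8], [0.9]]
padded = zero_pad(encoded)
```

After padding, every series has the same number of samples, so a transform applied to each yields spectra with the same number of coefficients that can be compared position by position.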
Article
The difficult aspect of developing new protein sequence comparison techniques is coming up with a method that can quickly and effectively handle huge data sets of sequences of various lengths. In this work, we first obtain two numerical representations of protein sequences separately, based on one physical property and one chemical property of amino acids. The lengths of all the sequences under comparison are made equal by appending the required number of zeroes. Then, the fast Fourier transform is applied to this numerical time series to obtain the corresponding spectrum. Next, the spectrum values are reduced by the standard inter coefficient difference method. Finally, the corresponding normalized values of the reduced spectrum are selected as the descriptors for protein sequence comparison. Using these descriptors, the distance matrices are obtained using Euclidean distance. They are subsequently used to draw the phylogenetic trees using the UPGMA algorithm. Phylogenetic trees are first constructed for 9 ND4, 9 ND5, and 9 ND6 proteins using the polarity value as the chemical property and the molecular weight as the physical property. They are compared, and it is seen that polarity is a better choice than molecular weight in protein sequence comparison. Next, using the polarity property, phylogenetic trees are obtained for 12 baculovirus and 24 transferrin proteins. The results are compared with those obtained earlier on the identical sequences by other methods. Three assessment criteria are considered for comparison of the results: quality based on rationalized perception, quantitative measures based on symmetric distance, and computational speed. In all the cases, the results are found to be more satisfactory.
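The last two steps of this pipeline (Euclidean distance matrix, then UPGMA) can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the descriptor vectors below are made-up placeholders standing in for the normalized reduced spectra, and the function names are hypothetical. The tree is returned as nested tuples without branch lengths.

```python
import math


def euclidean_matrix(descriptors):
    """Pairwise Euclidean distances between equal-length descriptor vectors."""
    n = len(descriptors)
    return [[math.dist(descriptors[i], descriptors[j]) for j in range(n)]
            for i in range(n)]


def upgma(dist, labels):
    """Build a nested-tuple tree by size-weighted average linkage (UPGMA)."""
    clusters = {i: (labels[i], 1) for i in range(len(labels))}  # id -> (tree, size)
    d = {(i, j): dist[i][j]
         for i in range(len(labels)) for j in range(i + 1, len(labels))}
    next_id = len(labels)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        (a, b), _ = min(d.items(), key=lambda kv: kv[1])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Distance from the merged cluster to each remaining cluster.
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        del d[(a, b)]
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Placeholder descriptors (assumption): real ones would come from the spectra.
descriptors = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
tree = upgma(euclidean_matrix(descriptors), ["a", "b", "c"])
```

Here `a` and `b` are merged first because they are closest, and `c` joins last. For real data sets a library routine for average-linkage clustering would normally be preferred over this hand-rolled loop.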
... Similarity between biological sequences forms the basis for determining whether there is sequence homology, as defined in terms of shared ancestry between them in the evolutionary history of life [7]. Although alignment methods represent the standard for sequence analysis, comparison and similarity, it is difficult to determine the best parameters to achieve optimal alignments. ...
Article
A signal analysis of the complete genomes sequenced for the coronavirus variants of concern B.1.1.7 (Alpha), B.1.351 (Beta) and P1 (Gamma), and for the coronavirus variants of interest B.1.429–B.1.427 (Epsilon) and B.1.525 (Eta), is presented using open GISAID data. We deal with a new type of finite alternating sum series having independently distributed terms associated with binary (0,1) indicators for the nucleotide bases. Our method provides additional information to conventional similarity comparisons via alignment methods and Fourier Power Spectrum approaches. It uncovers distinctive patterns in the intrinsic data organization of complete genome sequences as it progresses along the nucleotide base positions. The new method could be useful for bioinformatics surveillance and for studying the dynamics of coronavirus genome variants.
... Given two coding sequences, A and B, the triplet-repeat model set for a sequence G is $G = \langle T_1, \delta_1 \rangle, \langle T_2, \delta_2 \rangle, \ldots, \langle T_{64}, \delta_{64} \rangle$, where T is the triplet of 64 codons. A weight deviation between the two sequences is shown in Eq. (12). It can be used to measure the similarity between A and B. ...
Article
Data-driven machine learning, especially deep learning technology, is becoming an important tool for handling big data issues in bioinformatics. In machine learning, DNA sequences are often converted to numerical values for data representation and feature learning in various applications. A similar conversion occurs in Genomic Signal Processing (GSP), where genome sequences are transformed into numerical sequences for signal extraction and recognition. This kind of conversion is also called an encoding scheme. The choice of encoding scheme can greatly affect the performance of GSP applications and machine learning models. This paper aims to collect, analyze, discuss, and summarize the existing encoding schemes for genome sequences, particularly in GSP as well as other genome analysis applications, to provide a comprehensive reference for genomic data representation and feature learning in machine learning.
... Binary representation of genome sequences refers to the Voss type of representation [1], where the nucleotides A, T, C, G are represented by the 4-component vectors (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0) and (0, 0, 0, 1), respectively. Based on this representation of nucleotides, DFT-based analyses of genome sequences were made in [11] and [12]. A natural question is whether such a Voss-type representation is possible for amino acid sequences and, if so, whether DFT-based analysis could also be used to compare protein sequences based on such binary representations. ...
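The Voss-type encoding described in this excerpt can be sketched as four binary indicator sequences, one per nucleotide in the order A, T, C, G given above. The naive DFT and the summed power spectrum that follow are a common way to analyze such indicators and are shown here as an illustrative assumption, not as the exact procedure of refs [11] and [12]. All function names are hypothetical.

```python
import cmath

NUCLEOTIDES = "ATCG"  # component order used in the excerpt: A, T, C, G


def voss_encode(seq):
    """Return one 0/1 indicator sequence per nucleotide."""
    seq = seq.upper()
    return {n: [1 if ch == n else 0 for ch in seq] for n in NUCLEOTIDES}


def dft(x):
    """Naive O(N^2) discrete Fourier transform of a numeric sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def power_spectrum(seq):
    """Sum of squared DFT magnitudes over the four indicator sequences."""
    indicators = voss_encode(seq)
    spectra = [dft(indicators[n]) for n in NUCLEOTIDES]
    return [sum(abs(s[k]) ** 2 for s in spectra) for k in range(len(seq))]


indicators = voss_encode("ATCG")
spectrum = power_spectrum("ATCG")
```

Each position of the sequence sets exactly one of the four indicators to 1, which is the 4-component vector form quoted above read column by column. The naive transform is only meant for short examples; an FFT routine would be used in practice.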
Article
The paper first considers a new complex representation of amino acids in which the real and imaginary parts are taken from the hydrophilic properties and residue volumes of amino acids, respectively. It then applies the complex Fourier transform to the represented sequence of complex numbers to obtain the spectrum in the frequency domain. By using the method of 'Inter coefficient distances' on the spectrum obtained, it constructs phylogenetic trees of different protein sequences. Finally, on the basis of these phylogenetic trees, a pairwise comparison is made of the protein sequences. The paper also obtains a pairwise comparison of the same protein sequences following the same method but based on a known complex representation of amino acids, where the real and imaginary parts refer to hydrophobicity properties and residue volumes of the amino acids, respectively. The results of the two methods are then compared with those obtained earlier for the same sequences by other methods. It is found that both methods are workable and, further, that the new complex representation is better than the earlier one. This shows that the hydrophilic property (polarity) is a better choice than the hydrophobic property of amino acids, especially in protein sequence comparison.
... It is effectively used in the identification of protein coding regions, because the DFT spectrum of a DNA sequence reflects the distribution and periodic pattern of the sequence [41]. The use of DFT on binary sequences is found in [42], where the binary sequence is generated from genome sequences by a Voss type of representation. Naturally, to find a similar use of DFT in protein sequence analysis, the corresponding Voss-type representation of amino acids must be known a priori. ...
Article
The paper considers Voss type representation of amino acids and uses FFT on the represented binary sequences to get the spectrum in the frequency domain. Based on the analysis of this spectrum by using the method of inter coefficient difference (ICD), it compares protein sequences of ND5 and ND6 category. Results obtained agree with the standard ones. The purpose of the paper is to extend the ICD method of comparison of DNA sequences to comparison of protein sequences. The topic of discussion is to develop a novel method of comparing protein sequences. The main achievements of the work are that the method applied is completely new of its kind, so far as protein sequence comparison is concerned and moreover the results of comparison agree with the previous results obtained by other methods for the same category of protein sequences.
Article
At present, adequate mathematical tools are not used to analyze the arrangement of components in arrays of naturally ordered data of different kinds, including words or letters in texts, notes in musical compositions, symbols in sign sequences, monitoring data, numbers representing ordered measurement results, and components in genetic texts. Therefore, it is difficult or impossible to measure and compare the order of messages located in long information chains. The main approaches for comparing symbol sequences use probabilistic models and statistical tools, or pairwise and multiple alignment, which makes it possible to determine the degree of similarity of sequences using edit distance measures. The application of pseudospectral and fractal representations of symbolic sequences is somewhat exotic. "The curse of a priori unconscious knowledge" of the obvious orderliness of the sequence should be especially noted, as it is widespread in mathematical linguistics, bioinformatics (mathematical biology), and other similar fields of science. These approaches pay almost no attention to studying and detecting the patterns in the specific arrangement of all symbols, words, and components of the data sets that constitute a separate sequence. The object of study in our works is a specifically organized numerical tuple: the arrangement of components (order) in a symbolic or numerical sequence. The intervals between the closest identical components of the order are used as the basis for the quantitative representation of the chain arrangement. Multiplying all the intervals or summing their logarithms yields numbers that uniquely reflect the arrangement of components in a particular sequence. These numbers allow us to obtain a whole set of normalized characteristics of the order, among which are the geometric mean interval and its logarithm. Such characteristics surprisingly accurately reflect the arrangement of the components in the symbolic sequences.
In this paper, we present an approach for quantitatively comparing the arrangement of arrays of naturally ordered data (information chains) of an arbitrary nature. We propose measures of similarity/distinction and a procedure for comparing chain order, based on selecting a list of subsequences (components) that are equal or similar in their order characteristics. Rank distributions are used for faster selection of a list of matching components. The paper presents a toolkit for comparing the order of information chains and demonstrates some of its applications for studying the structure of nucleotide sequences.
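The interval-based characteristics mentioned in this abstract can be illustrated with a small sketch. It computes the distances between consecutive occurrences of the same symbol, then the geometric mean of those intervals via the sum of their logarithms. How the first and last occurrence of each symbol are treated is not stated in the abstract; this sketch assumes they contribute no interval, and the function names are hypothetical.

```python
import math


def intervals(seq):
    """Distances between each symbol and its previous identical occurrence.

    Assumption: the first occurrence of a symbol contributes no interval.
    """
    last_seen = {}
    result = []
    for pos, sym in enumerate(seq):
        if sym in last_seen:
            result.append(pos - last_seen[sym])
        last_seen[sym] = pos
    return result


def geometric_mean_interval(seq):
    """Geometric mean of all intervals, computed from the sum of logarithms."""
    values = intervals(seq)
    if not values:
        raise ValueError("no symbol occurs more than once")
    log_sum = sum(math.log(v) for v in values)
    return math.exp(log_sum / len(values))


gaps = intervals("AACA")
mean_gap = geometric_mean_interval("AACA")
```

Summing logarithms rather than multiplying the intervals directly gives the same quantity while avoiding very large products on long sequences.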