Fig 3 - uploaded by Lu Cai
Classification performance of the SVM algorithm combining nucleotide correlation of dinucleotides across five organisms. The figure plots sensitivity (true positive rate) as a function of 1-specificity (false positive rate). Values in parentheses denote the area under the receiver operating characteristic curve (auROC).
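As a minimal sketch of how a curve of this kind and its auROC can be computed from classifier scores, the Python below sweeps a decision threshold from high to low, records (1-specificity, sensitivity) points, and integrates them with the trapezoidal rule. The function names `roc_points` and `auroc` and the toy scores are illustrative assumptions, not taken from the source publication.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points of the ROC curve.

    scores: classifier outputs, higher means "more likely nucleosomal".
    labels: 1 for positive (nucleosome), 0 for negative (linker).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    # Sort by decreasing score so that lowering the threshold admits samples in order.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        # Samples with tied scores cross the threshold together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auroc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: three positives and three negatives, one of them misordered.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = auroc(roc_points(toy_scores, toy_labels))
```

An auROC of 1.0 corresponds to perfect separation of the two classes and 0.5 to random ranking.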

Source publication
Article
Full-text available
Nucleosome positioning plays a key role in the regulation of many biological processes. In this study, the statistical difference of information content was investigated in nucleosome and linker DNA regions across eukaryotic organisms. By analyzing the information redundancy, D_k, in Saccharomyces cerevisiae, Drosophila melanogaster, and Caenorhab...

Context in source publication

Context 1
... analyses have uncovered that nucleosome positioning exhibits intrinsic DNA sequence preferences. Short-range dominance of nucleotide correlation in the nucleosome and linker DNAs was found by calculating the information redundancy. However, the information redundancy represents the accumulated nucleotide correlation over all 16 dinucleotides. To further clarify the nucleotide correlation affecting nucleosome positioning, the parameter F_k was introduced to examine the particular base correlation corresponding to each dinucleotide (see “Materials and methods” for details). The calculated results indicated that the distribution of particular base correlation differs significantly between the nucleosome and linker DNA regions throughout the S. cerevisiae, D. melanogaster, and C. elegans genomes (see ESM Figs. S7, S8, S9, S10, and S11). In addition, the base correlation profiles of the 16 dinucleotides differed slightly between the normal nucleosome and the nucleosome containing a histone variant. More details can be found in the ESM. To confirm the relationship between base correlation and nucleosome positioning, an SVM classifier based on the base correlation of each of the 16 dinucleotides was employed to discriminate nucleosome from linker DNA sequences in eukaryotic genomes. To compare the performance of our model with other methods, the nucleosome positioning datasets of H. sapiens, O. latipes, C. elegans, C. albicans, and S. cerevisiae were retrieved from data published by Tanaka and Nakai (2009). For each organism, the dataset comprised 1,000 nucleosomal and 1,000 linker DNA sequences, extracted at random. The parameter F_k (k = 0 ... 98), describing the particular base correlation corresponding to the 16 dinucleotides, was calculated for each DNA sequence. Using the resulting 1,584-element (99 × 16) vector as input, the SVM classifier discriminating nucleosome from linker DNA sequences was constructed.
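To illustrate where the 99 × 16 = 1,584 dimensions come from, the sketch below builds one feature per gap k (0 to 98) and per ordered base pair. The exact definition of F_k is given in the paper's “Materials and methods” and is not reproduced in this excerpt, so the per-pair quantity used here, p_ab(k) · log2(p_ab(k) / (p_a · p_b)), is only an assumed stand-in (the per-dinucleotide term of a mutual-information sum). The function name `correlation_features` is hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"
DINUCS = ["".join(p) for p in product(BASES, repeat=2)]  # 16 ordered base pairs


def correlation_features(seq, max_gap=98):
    """Return a (max_gap + 1) * 16 feature list for one DNA sequence.

    ASSUMPTION: the value stored per (gap, base pair) is a stand-in for the
    paper's F_k, whose definition is not available in this excerpt.
    """
    seq = seq.upper()
    n = len(seq)
    if n < max_gap + 2:
        raise ValueError("sequence too short for the requested maximum gap")

    base_freq = {b: seq.count(b) / n for b in BASES}
    features = []
    for k in range(max_gap + 1):
        step = k + 1  # k bases lie between the two correlated positions
        counts = dict.fromkeys(DINUCS, 0)
        total = 0
        for i in range(n - step):
            pair = seq[i] + seq[i + step]
            if pair in counts:  # skip pairs containing N or other symbols
                counts[pair] += 1
                total += 1
        for d in DINUCS:
            p_ab = counts[d] / total if total else 0.0
            expected = base_freq[d[0]] * base_freq[d[1]]
            if p_ab > 0 and expected > 0:
                features.append(p_ab * math.log2(p_ab / expected))
            else:
                features.append(0.0)
    return features


# A 160-bp toy sequence yields one 1,584-element vector.
toy_vector = correlation_features("ACGT" * 40)
```

One such vector per sequence would then be passed to an SVM implementation of choice.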
The performance of the classifier was measured using an independent validation procedure, in which the dataset is randomly divided into two subsets: one is used to train the SVM and the other to test it. The performance of the SVM classifier is shown in Table 2. The model performed well, with an average total accuracy of 76.05% across the five organisms. To compare the performance of our model with other methods, the auROC was also calculated. Our SVM model obtained a mean auROC of 0.8353 (see Fig. 3). This result indicated that the prediction accuracy of our model is significantly higher than that of the method of Tanaka and Nakai (2009) and approximately equal to that of Zhang et al. (2012). This finding again suggested that nucleotide correlation information is an important signature of nucleosome positioning.
The information-theoretic method was applied to detect information on nucleotide correlation stored in the nucleosome and linker DNA sequences across the three organisms mentioned above. The results showed that the information content of nucleotide correlation differs significantly between the nucleosome and linker DNA regions and that short-range nucleotide correlation is dominant in both. This short-range dominance is probably the major reason why many successful prediction models of nucleosome positioning in eukaryotic organisms were constructed in terms of the sequence information of oligonucleotides or k-mers. In addition, the difference between the normal nucleosome and the H2A.Z-containing nucleosome was elucidated from the perspective of information theory. Next, the periodicity profiles of the nucleosome and linker DNA for the S. cerevisiae, D. melanogaster, and C. elegans genomes indicated that periodicity is clearly present and species-specific. In D. melanogaster, the profile of the power spectrum differed between the H34-containing and the H2A.Z-containing nucleosomes. Furthermore, the SVM model combining the particular nucleotide correlation corresponding to the 16 dinucleotides was used successfully to discriminate nucleosome from linker DNA sequences in H. sapiens, O. latipes, C. elegans, C. albicans, and S. cerevisiae. This application confirmed the importance of nucleotide correlation. Although the question of whether nucleosome positioning in vivo is determined by a chromatin code is hotly debated, the fact that information stored in the primary DNA sequence is an important determinant of nucleosome formation has been ...

Citations

... The metrics for measuring the prediction performance are mathematically expressed by Formula (8), where N^+ is the total number of positive samples (nucleosomal sequences) investigated, while N^+_- is the number of nucleosomal sequences incorrectly predicted to be linker sequences; N^- is the total number of negative samples (linker sequences) investigated, while N^-_+ is the number of linker sequences incorrectly predicted to be nucleosomal sequences [50]. Formula (8) is widely utilized to evaluate the prediction performance of classifiers. ...
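The body of Formula (8) is not reproduced in this excerpt. As an assumption about its content, the four metrics commonly written in this N^+ / N^- notation (sensitivity, specificity, accuracy, and Matthews correlation coefficient) take the following form; the citing article's actual formula may differ in which metrics it includes.

```latex
\begin{aligned}
\mathrm{Sn}  &= 1 - \frac{N^{+}_{-}}{N^{+}}, \\
\mathrm{Sp}  &= 1 - \frac{N^{-}_{+}}{N^{-}}, \\
\mathrm{Acc} &= 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}, \\
\mathrm{MCC} &= \frac{1 - \left( \dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}} \right)}
                     {\sqrt{\left( 1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}} \right)
                            \left( 1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}} \right)}}
\end{aligned}
```

With no misclassified sequences (N^+_- = N^-_+ = 0), all four expressions evaluate to 1.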
Article
Full-text available
Nucleosomes are the basic units of chromatin in eukaryotes. The accurate positioning of nucleosomes plays a significant role in understanding many biological processes such as transcriptional regulation mechanisms and DNA replication and repair. Here, we describe the development of a novel method, termed ZCMM, based on Z-curve theory and position weight matrix (PWM). The ZCMM was trained and tested using the nucleosomal and linker sequences determined by support vector machine (SVM) in Saccharomyces cerevisiae (S. cerevisiae), and experimental results showed that the sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews correlation coefficient (MCC) values for ZCMM were 91.40%, 96.56%, 96.75%, and 0.88, respectively, and the average area under the receiver operating characteristic curve (AUC) value was 0.972. A ZCMM predictor was developed to predict nucleosome positioning in Homo sapiens (H. sapiens), Caenorhabditis elegans (C. elegans), and Drosophila melanogaster (D. melanogaster) genomes, and the accuracy (Acc) values were 77.72%, 85.34%, and 93.62%, respectively. The maximum AUC values of the four species were 0.982, 0.861, 0.912 and 0.911, respectively. Another independent dataset for S. cerevisiae was used to predict nucleosome positioning. Compared with the results of Wu's method, the Sn, Sp, Acc, and MCC of the ZCMM results for S. cerevisiae were all higher, reaching 96.72%, 96.54%, 94.10%, and 0.88. Compared with Guo's method, iNuc-PseKNC, the results of ZCMM for D. melanogaster were better. Meanwhile, the ZCMM was compared with some in vitro and in vivo experimental data for S. cerevisiae, and the results showed that the nucleosomes predicted by ZCMM were highly consistent with those confirmed by these experiments. Therefore, it was further confirmed that the ZCMM method has good accuracy and reliability in predicting nucleosome positioning.
... Methods of the nucleosome positioning prediction could be conditionally divided into two groups: statistical and biophysical methods [10]. Methods of the first group are based on the statistical properties of a DNA sequence, and consider it as a sequence of symbols; primarily such methods are based on a periodicity of 10 symbols [11,12]. Methods of the second group are based on calculating the flexibility of the double helix section composed of different nucleotides and of the corresponding nucleosome formation energies [13]. ...
Article
Full-text available
It is well known that a major part of a eukaryotic genome is wrapped around histone proteins, forming nucleosomes. It has also been demonstrated that the DNA sequence itself plays an important role in the nucleosome positioning process. In this work, a cluster analysis of 67,517 nucleosome binding sites from the S. cerevisiae genome was carried out. The classification method is based on the self-adjusting dinucleotide position weight matrix. As a result, 135 significant clusters were discovered that contain 43,225 sequences (64% of the initial set). The meaning of the classes found is discussed, as well as possible further uses.
... In recent years, informational entropy was widely applied in the recognition and evolution research of DNA sequences (Grosse et al., 2000; Yu and Jiang, 2001; Otu and Sayood, 2003; Xing et al., 2013). The average mutual information profile is an excellent candidate for a species signature (Bauer et al., 2008). ...
Article
Full-text available
DNA replication is a highly precise process that is initiated from origins of replication (ORIs) and is regulated by a set of regulatory proteins. The mining of DNA sequence information will be beneficial not only for understanding the regulatory mechanism of replication initiation but also for accurately identifying ORIs. In this study, the GC profile and GC skew were calculated to analyze the compositional bias in the Saccharomyces cerevisiae genome. We found that the GC profile in the region of ORIs is significantly lower than that in the flanking regions. By calculating the information redundancy, an estimate of the correlation of nucleotides, we found that the intensity of adjoining correlation in ORIs is dramatically higher than that in flanking regions. Furthermore, the relationships between ORIs and nucleosomes as well as transcription start sites were investigated. Results showed that ORIs are usually not occupied by nucleosomes. Finally, we calculated the distribution of ORIs in yeast chromosomes and found that most ORIs are in transcription terminal regions. We hope that these results will contribute to the identification of ORIs and the study of DNA replication mechanisms.
... The performance of the algorithm was measured by five parameters: the sensitivity (Sn), specificity (Sp), positive predictive value (PPV), total accuracy (TA), and Matthews correlation coefficient (MCC). These evaluation measures are defined as follows (Xing et al., 2011; Xing et al., 2013): ...
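The definitions themselves are cut off in this excerpt. A minimal sketch of the five measures, assuming their standard confusion-matrix definitions (which may differ in detail from those in the cited papers), could look like this; `classification_metrics` and the example counts are illustrative only.

```python
import math


def _ratio(numerator, denominator):
    """Divide safely; return 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fn, tn, fp):
    """Compute Sn, Sp, PPV, TA and MCC from confusion-matrix counts.

    tp: nucleosomal sequences predicted as nucleosomal
    fn: nucleosomal sequences predicted as linker
    tn: linker sequences predicted as linker
    fp: linker sequences predicted as nucleosomal
    ASSUMPTION: standard textbook definitions are used here.
    """
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": _ratio(tp, tp + fn),    # sensitivity (true positive rate)
        "Sp": _ratio(tn, tn + fp),    # specificity (true negative rate)
        "PPV": _ratio(tp, tp + fp),   # positive predictive value
        "TA": _ratio(tp + tn, tp + fn + tn + fp),  # total accuracy
        "MCC": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Made-up counts for a test set of 100 nucleosomal and 100 linker sequences.
example = classification_metrics(tp=80, fn=20, tn=70, fp=30)
```

Unlike total accuracy, MCC ranges from -1 to 1 and stays informative when the two classes are unbalanced.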
... Based on the characteristics of nucleosome positioning sequences (or nucleosomal sequences), various computational methods (Chen, et al., 2012b; Chen, et al., 2010; Gupta, et al., 2008; Peckham, et al., 2007; Xing, et al., 2011, 2013; Zhang, et al., 2012a,b; Zhao, et al., 2010) were proposed for predicting nucleosome positioning in different genomes. All these methods could yield quite encouraging results, and each of them did play a role in stimulating the development of this area. ...
Article
Full-text available
Nucleosome positioning participates in many cellular activities and plays significant roles in regulating cellular processes. With the avalanche of genome sequences generated in the postgenomic age, it is highly desirable to develop automated methods for rapidly and effectively identifying nucleosome positioning. Although some computational methods were proposed, most of them were species specific and neglected the intrinsic local structural properties that might play important roles in determining the nucleosome positioning on a DNA sequence. Here a predictor called "iNuc-PseKNC" was developed for predicting nucleosome positioning in Homo sapiens, Caenorhabditis elegans, and Drosophila melanogaster genomes, respectively. In the new predictor, the samples of DNA sequences were formulated by a novel feature vector called "pseudo k-tuple nucleotide composition", into which six DNA local structural properties were incorporated. Rigorous cross-validation tests on the three stringent benchmark datasets showed that the overall success rates achieved by iNuc-PseKNC in predicting the nucleosome positioning of the aforementioned three genomes were 86.27%, 86.90% and 79.97%, respectively. Meanwhile, the results obtained by iNuc-PseKNC on various benchmark datasets used by previous investigators for different genomes also indicated that the current predictor remarkably outperformed its counterparts. A user-friendly web server for iNuc-PseKNC is freely accessible at http://lin.uestc.edu.cn/server/iNuc-PseKNC. Contact: hlin@gordonlifescience.org, hlin@uestc.edu.cn (H.L.); greatchen@heuu.edu.cn, wchen@gordonlifescience.org (W.C.); kcchou@gordonlifescience.org (K.C.C.).
Article
The nucleosome is the basic structure of chromatin in eukaryotic cells, with essential roles in the regulation of many biological processes, such as DNA transcription, replication and repair, and RNA splicing. Because of the importance of nucleosomes, the factors that determine their positioning within genomes should be investigated. High-resolution nucleosome-positioning maps are now available for organisms including Saccharomyces cerevisiae, Drosophila melanogaster and Caenorhabditis elegans, enabling the identification of nucleosome positioning by application of computational tools. Here, we describe a novel predictor called NucPosPred, which was specifically designed for large-scale identification of nucleosome positioning in C. elegans and D. melanogaster genomes. NucPosPred was separately optimized for each species for four types of DNA sequence feature extraction, with consideration of two classification algorithms (gradient-boosting decision tree and support vector machine). The overall accuracy obtained with NucPosPred was 92.29% for C. elegans and 88.26% for D. melanogaster, outperforming previous methods and demonstrating the potential for species-specific prediction of nucleosome positioning. For the convenience of most experimental scientists, a web-server for the predictor NucPosPred is available at http://121.42.167.206/NucPosPred/index.jsp.