
Finding genomic features from enriched regions in ChIP-Seq data

October 2012

DOI:10.1109/BIBM.2012.6392731

Conference: Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on

Authors:

Iman Rezaeian

University of Windsor

Luis Rueda

University of Windsor

Finding Genomic Features from Enriched Regions in ChIP-Seq Data

Iman Rezaeian

School of Computer Science

University of Windsor

Windsor, Canada

Email: rezaeia@uwindsor.ca

Luis Rueda

School of Computer Science

University of Windsor

Windsor, Canada

Email: lrueda@uwindsor.ca

Abstract: Finding genomic features in ChIP-Seq data has become an attractive research topic lately because of the power, resolution and low noise of next generation sequencing, which make it a much better alternative to traditional microarray-based methods such as ChIP-chip. However, handling ChIP-Seq data is not straightforward, mainly because of the large amounts of data produced by next generation sequencing. ChIP-Seq has found widespread application in finding biomarkers, especially those associated with important genomic features in epigenomics and transcriptomics, including binding sites, promoters, exons/introns and transcription sites, among others. Efficient algorithms for finding relevant regions in ChIP-Seq data have been proposed, which capture the most significant peaks from the sequence reads. Among these, multilevel thresholding algorithms have been applied successfully to transcriptomics and genomics data analysis, in particular for detecting significant regions based on next generation sequencing data.

We show that the Optimal Multilevel Thresholding algorithm (OMT) achieves higher accuracy in detecting enriched regions and genomic features of detected regions on FoxA1 data. OMT finds more gene-related regions (gene, exon, promoter) in comparison with other methods. Using a small number of parameters is another advantage of the proposed method.

Keywords: multi-level thresholding; transcriptomics; ChIP-Seq data analysis

I. INTRODUCTION

Genome-wide mapping of protein-DNA interactions is essential for understanding transcriptional regulation. Mapping the binding sites of transcription factors and other DNA-binding proteins is essential for decoding the gene regulatory networks that underlie different biological processes. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq) is a technique that provides quantitative, genome-wide mapping of target protein binding events [1], [2].

In ChIP-Seq, a protein is first cross-linked to DNA and the DNA is subsequently sheared into fragments. Following a size selection step that enriches for fragments of specified lengths, the fragment ends are sequenced, and the resulting reads are aligned to the reference genome. Detecting protein binding sites from massive sequence-based datasets with millions of short reads represents a true bioinformatics challenge that requires considerable computational resources, in spite of the availability of programs for ChIP-chip analysis [3].

With the increasing popularity of ChIP-Seq technology, the demand for peak finding methods has grown, and with it the need to develop new algorithms. However, due to mapping challenges and biases in various aspects of existing protocols, identifying peaks is not a straightforward task.

Different approaches have been proposed for detecting peaks in mapped ChIP-Seq/RNA-Seq reads. Zhang et al. presented a model-based analysis of ChIP-Seq data (MACS), which analyzes data generated by short read sequencers [4]. It models the length of the sequenced ChIP fragments and uses it to improve the spatial resolution of predicted binding sites. A two-pass strategy called PeakSeq has been presented in [5]. This strategy compensates for signals caused by open chromatin, as revealed by the inclusion of the controls. The first pass identifies putative binding sites and compensates for genomic variation in mapping the sequences. The second pass filters out sites not significantly enriched compared to the normalized control, computing precise enrichments and significance. Tree shape Peak Identification for ChIP-Seq (T-PIC) is a statistical approach for calling peaks that has been recently proposed in [6]. This approach is based on evaluating the significance of a robust statistical test that measures the extent of pile-up reads. Another algorithm is Site Identification from Paired-end Sequencing (SIPeS) [7], which identifies binding sites from short reads generated by paired-end Illumina ChIP-Seq technology.

One possible problem of the existing methods is that the location of detected peaks could be non-optimal. Moreover, to detect these peaks, all of the methods use a set of parameters that may cause the results to vary across datasets. In [8], we recently proposed a method for finding significant peaks using optimal multi-level thresholding (OMT). Here, we show that OMT can be efficiently used to find genomic features when used in conjunction with a model for finding the best number of peaks. The results of our experiments show that our method can achieve a higher degree of accuracy than two recently developed methods, MACS and T-PIC, while providing flexibility when applied to different datasets.

[Figure 1 flowchart: Input reads -> Extend size of reads to fragment -> Create histogram based on fragments for each chromosome -> Use OMT algorithm to find significant peaks in each window -> Finding genomic features of detected peaks -> Relevant peaks and genomic features]

Figure 1. Schematic representation of the process for finding genomic features by using OMT.

II. THE PEAK DETECTION METHOD

In ChIP-Seq, a protein is first cross-linked to DNA and the DNA is subsequently sheared into fragments. Then, the fragment ends are sequenced, and the resulting reads are aligned to the genome. Reading the alignments produces a histogram with genome coordinates on the x-axis and the frequency of aligned reads at each genome coordinate on the y-axis. The aim is to find significant peaks corresponding to enriched regions. Each peak can be seen as a homogeneous group (cluster) that is well separated from the others by “valleys”. In that sense, the problem can be formulated as one-dimensional clustering. Figure 1 depicts the process of finding peaks and genomic features corresponding to the regions of interest for the specified protein. Each module is explained in detail below.

The first step of the model consists of converting the input BED file into a histogram. Each read is extended to the fragment length according to its direction (forward or backward) and placed on the reference genome based on its coordinates. Afterwards, separate histograms for the experiment and control data are created for each chromosome for further processing.
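The histogram construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_histograms`, the tuple layout of a read and the use of 0-based, half-open BED coordinates are assumptions made for the example.

```python
from collections import defaultdict

def build_histograms(reads, fragment_length):
    """Build one coverage histogram per chromosome from BED-like reads.

    reads: iterable of (chrom, start, end, strand) with 0-based, half-open
    coordinates (assumed BED convention); strand is "+" or "-".
    Returns {chrom: {position: number of fragments covering it}}.
    """
    hist = defaultdict(lambda: defaultdict(int))
    for chrom, start, end, strand in reads:
        if strand == "+":
            # Forward read: the fragment extends downstream from the read start.
            frag_start, frag_end = start, start + fragment_length
        else:
            # Reverse read: the fragment extends upstream from the read end.
            frag_start, frag_end = max(0, end - fragment_length), end
        for pos in range(frag_start, frag_end):
            hist[chrom][pos] += 1
    return {chrom: dict(cov) for chrom, cov in hist.items()}
```

Under these assumptions, the function would be called once on the experiment reads and once on the control reads, so that the two histograms stay separate for later steps.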

Starting from the beginning of the chromosome, a sliding window of minimum size t is applied to the histogram and each window is analyzed separately. The windows are not necessarily of equal size, which prevents truncating a peak before its end. Thus, each window starts at the end of the previous one with a minimum of t bins, and its size is increased until a zero value in the histogram is reached. We use a minimum of t = 3,000 to ensure that a window covers at least one peak of typical size.
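A minimal sketch of this windowing rule is shown below. It assumes the histogram of one chromosome is stored as a dense list of counts; the function name `split_into_windows` is hypothetical.

```python
def split_into_windows(hist, t=3000):
    """Split a dense histogram (list of counts) into consecutive windows.

    Each window spans at least t bins and is then extended until the next
    bin is zero (or the chromosome ends), so no peak is cut at a boundary.
    Returns a list of (start, end) pairs with end exclusive.
    """
    windows = []
    n = len(hist)
    start = 0
    while start < n:
        end = min(start + t, n)
        # Grow the window while the signal has not yet dropped to zero.
        while end < n and hist[end] != 0:
            end += 1
        windows.append((start, end))
        start = end
    return windows
```

For example, with `t=4` the toy histogram `[0, 1, 2, 0, 0, 3, 3, 3, 3, 0, 1, 0]` is split into `(0, 4)`, `(4, 9)` and `(9, 12)`: the second window is extended by one bin so that the run of 3s is not truncated.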

After creating the histogram based on fragments (reads), the histogram is processed to obtain the optimal thresholding, which determines the locations of the peaks. More details about this procedure can be found in [8].

A dynamic programming algorithm for optimal multi-level thresholding was proposed in our previous work [9], which extends the approach to irregularly sampled histograms. The optimal thresholding is the one that maximizes the between-class variance. The algorithm runs in O(kn²) for a histogram of n bins and k thresholds, and has been further improved to achieve complexity linear in n, i.e. O(kn), by following the approach of [10].
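To illustrate the idea, the following sketch implements a plain O(kn²) dynamic program that maximizes the between-class variance on a regularly sampled histogram. It is not the algorithm of [9] or [10]: the function name `optimal_thresholds`, the use of bin indices as positions and the assumption of non-negative counts are choices made for this example. It relies on the fact that maximizing the between-class variance is equivalent to maximizing the sum over classes of s²/w, where w is the total count of a class and s is its position-weighted count.

```python
from itertools import accumulate

def optimal_thresholds(hist, k):
    """Return k bin indices that split hist into k+1 contiguous classes
    maximizing the between-class variance (Otsu-style criterion)."""
    n = len(hist)
    if not 0 <= k < n:
        raise ValueError("need 0 <= k < len(hist)")
    # Prefix sums of counts (p0) and of position-weighted counts (p1).
    p0 = [0] + list(accumulate(hist))
    p1 = [0] + list(accumulate(i * h for i, h in enumerate(hist)))

    def score(i, j):
        # Contribution s^2 / w of the class covering bins [i, j).
        w = p0[j] - p0[i]
        if w == 0:
            return 0.0
        s = p1[j] - p1[i]
        return s * s / w

    classes = k + 1
    neg = float("-inf")
    # best[c][j]: best total score when bins [0, j) are split into c classes.
    best = [[neg] * (n + 1) for _ in range(classes + 1)]
    back = [[0] * (n + 1) for _ in range(classes + 1)]
    best[0][0] = 0.0
    for c in range(1, classes + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                if best[c - 1][i] == neg:
                    continue
                cand = best[c - 1][i] + score(i, j)
                if cand > best[c][j]:
                    best[c][j] = cand
                    back[c][j] = i
    # Walk back from the full histogram to recover the class boundaries.
    thresholds = []
    j = n
    for c in range(classes, 1, -1):
        j = back[c][j]
        thresholds.append(j)
    return sorted(thresholds)
```

On a histogram with two well-separated modes, such as `[5, 9, 5, 0, 0, 0, 4, 8, 4]`, a single threshold falls in the empty valley between them. Several positions in the valley give the same score, so the exact index returned depends on tie-breaking.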

III. FINDING RELEVANT PEAKS AND GENOMIC FEATURES

Finding the correct number of peaks (the number of regions in each window) is crucial in order to fully automate the whole process. For this, we need to determine the correct number of peaks prior to applying the multi-level thresholding method. This number is found by using an index of validity derived from clustering techniques. We have recently proposed the α(K) index [11], which combines a simple index, A(K), with the well-known I index [12]. The value of α(K) is computed for all possible numbers of clusters, and the number with the maximum value of α(K) is taken as the best number of clusters.
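The selection step can be sketched with the I index of [12] alone, since A(K) and α(K) are defined in [11] and not reproduced here. The code below is therefore a stand-in, not the α(K) index: it evaluates I(K) = ((1/K) · (E₁/E_K) · D_K)^p with p = 2 on a one-dimensional histogram and keeps the candidate partition with the largest value. The function names and the representation of a partition as a sorted list of threshold indices are assumptions.

```python
def i_index(hist, thresholds, p=2):
    """I index of Maulik and Bandyopadhyay for a 1-D histogram partition.

    hist: list of non-negative counts; thresholds: sorted bin indices that
    split the bins into K = len(thresholds) + 1 contiguous clusters.
    """
    def mean(lo, hi):
        w = sum(hist[lo:hi])
        return sum(i * hist[i] for i in range(lo, hi)) / w if w else None

    def spread(lo, hi, mu):
        # Sum of count-weighted distances to the cluster mean.
        return sum(hist[i] * abs(i - mu) for i in range(lo, hi))

    n = len(hist)
    bounds = [0] + list(thresholds) + [n]
    k = len(bounds) - 1
    total_mean = mean(0, n)
    if total_mean is None:
        return 0.0
    e1 = spread(0, n, total_mean)
    means, ek = [], 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        mu = mean(lo, hi)
        if mu is not None:          # skip clusters without any reads
            means.append(mu)
            ek += spread(lo, hi, mu)
    dk = max(means) - min(means)    # largest distance between cluster means
    if ek == 0:
        return float("inf") if dk > 0 else 0.0
    return ((1.0 / k) * (e1 / ek) * dk) ** p

def best_partition(hist, candidates):
    """Pick the candidate threshold list with the largest I index."""
    return max(candidates, key=lambda th: i_index(hist, th))
```

In a full pipeline, each candidate would be the optimal thresholding for one value of K, and the winning K would then be used for that window.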

After finding the locations of the detected peaks, significant peaks are selected in a two-step process. In the first step, the effective area of each peak is found by shrinking the peak. In the second step, the two-sample Cramér-von Mises nonparametric hypothesis test [13], with α = 0.01, is used to accept or reject peaks based on a comparison between the experiment and control histograms corresponding to each peak. Finally, the peaks are ranked and returned as the final relevant peaks.
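The acceptance test can be sketched as follows. The statistic follows the two-sample form T = [NM/(N+M)] ∫ (F_N − G_M)² dH_{N+M} given in [13]. Two things are assumptions of this example rather than details from the text: the two samples are taken to be read positions inside the peak from the experiment and from the control, and the p-value is obtained by a permutation test instead of the tabulated distribution. Function names are hypothetical.

```python
from bisect import bisect_right
import random

def cvm_statistic(x, y):
    """Two-sample Cramer-von Mises statistic T for samples x and y."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    total = 0.0
    for z in xs + ys:
        f = bisect_right(xs, z) / n   # empirical CDF of the first sample
        g = bisect_right(ys, z) / m   # empirical CDF of the second sample
        total += (f - g) ** 2
    # dH puts mass 1/(n+m) on every pooled observation.
    return n * m / (n + m) ** 2 * total

def cvm_permutation_pvalue(x, y, n_perm=999, seed=0):
    """Permutation p-value for T (a stand-in for the tabulated distribution)."""
    rng = random.Random(seed)
    observed = cvm_statistic(x, y)
    pooled = list(x) + list(y)
    n = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if cvm_statistic(pooled[:n], pooled[n:]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Under these assumptions, a peak would be kept when the p-value falls below the significance level of 0.01 used in the text.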

In the next step, using the information gathered from the UCSC Genome Browser on the NCBI36/hg18 assembly, the genomic features of each detected peak are investigated. We assign a genomic feature to a peak if that peak overlaps with the region containing that genomic feature. Since a detected peak can be located in a genomic region with several genomic features, it can also be assigned several genomic features. For example, if a specific peak overlaps with an exon and an intron simultaneously, we count that peak as both an intron and an exon.
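The overlap rule can be sketched as below. The function name `annotate_peaks`, the tuple layouts and the half-open coordinate convention are assumptions for the example; the quadratic scan is chosen for clarity, and an interval index would be preferable for genome-scale annotation tables.

```python
def annotate_peaks(peaks, features):
    """Assign to each peak every feature type it overlaps.

    peaks: list of (chrom, start, end); features: list of
    (chrom, start, end, feature_type). Coordinates are half-open.
    Returns one set of feature types per peak, in input order.
    """
    annotations = []
    for p_chrom, p_start, p_end in peaks:
        found = set()
        for f_chrom, f_start, f_end, f_type in features:
            # Two half-open intervals overlap when each starts before the other ends.
            if f_chrom == p_chrom and f_start < p_end and p_start < f_end:
                found.add(f_type)
        annotations.append(found)
    return annotations
```

A peak that spans an exon-intron boundary ends up with both labels, which matches the counting rule described above.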

IV. EXPERIMENTAL RESULTS

We have used the FoxA1 dataset [4], which contains experiment and control samples of 24 chromosomes. As in [6], the experiment and control histograms were generated separately by extending each mapped position (read) into an appropriately oriented fragment, and then joining the fragments based on their genome coordinates. The final histogram was generated by subtracting the control from the experiment histogram. To find significant peaks, we used non-overlapping windows whose initial size is 3,000 bp. To avoid truncating peaks at window boundaries, each window is extended until the value of the histogram at the end of the window becomes zero.

Figure 2. Two true positive regions in chromosomes 3 and 13 of the FoxA1 dataset. The x-axis corresponds to the genome position in bp and the y-axis corresponds to the number of reads. Both peaks are detected by OMT, but only the bottom one is detected by T-PIC and neither is detected by MACS.

The enrichment score for each method is computed as follows. Random intervals from the genome are created by selecting, from each chromosome, the same number of intervals with the same lengths as the called peaks but with random starting locations. Then, the number of occurrences of the binding motif in the called peaks and in the random intervals is counted. The enrichment score is the number of occurrences in the called peaks divided by the number of occurrences in the random intervals.
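The computation can be sketched as follows. This is a simplified illustration under several assumptions: the motif is matched as an exact string on the forward strand only, the genome is given as a dictionary of sequence strings, and the names `count_motif` and `enrichment_score` are hypothetical. How the binding motif was actually matched is not specified in the text.

```python
import random

def count_motif(seq, motif):
    """Count (possibly overlapping) exact occurrences of motif in seq."""
    return sum(1 for i in range(len(seq) - len(motif) + 1)
               if seq[i:i + len(motif)] == motif)

def enrichment_score(genome, peaks, motif, seed=0):
    """Ratio of motif occurrences in called peaks to random intervals.

    genome: {chrom: sequence string}; peaks: list of (chrom, start, end)
    with half-open coordinates.
    """
    rng = random.Random(seed)
    in_peaks = in_random = 0
    for chrom, start, end in peaks:
        seq = genome[chrom]
        length = end - start
        in_peaks += count_motif(seq[start:end], motif)
        # Same chromosome and length as the peak, random starting location.
        r_start = rng.randint(0, len(seq) - length)
        in_random += count_motif(seq[r_start:r_start + length], motif)
    if in_random == 0:
        raise ValueError("no motif occurrence in the random intervals")
    return in_peaks / in_random
```

Drawing one random interval per called peak keeps the number of intervals and their lengths identical on each chromosome, as the procedure requires. A score near 1 indicates no enrichment over background.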

A. Comparison with Other Methods for ChIP-Seq Analysis

Table I shows a comparison between OMT and two recently proposed methods, MACS [4] and T-PIC [6]. As shown in the table, the number of significant peaks detected by OMT is higher than that of the other two methods. This implies that OMT is able to find some significant peaks that are not detected by the other two methods. Also, the enrichment ratio for OMT (2.62) is considerably higher than that of MACS (1.68) and slightly higher than that of T-PIC (2.54). Moreover, the mean length of the peaks detected by OMT is smaller than that of the other two methods, which suggests that OMT is able to locate significant peaks more precisely.

Table I
COMPARISON BETWEEN OMT AND TWO RECENTLY PROPOSED METHODS, MACS AND T-PIC, BASED ON NUMBER OF DETECTED PEAKS, MEAN LENGTH OF DETECTED PEAKS AND ENRICHMENT SCORE.

Dataset   Method of comparison    OMT      T-PIC    MACS
FoxA1     Detected peaks          20,032   17,619   13,639
FoxA1     Mean length of peaks    306      510      394
FoxA1     Enrichment ratio        2.62     2.54     1.68

A conceptual comparison of OMT with MACS and T-PIC based on their features is shown in Table II. As shown in the table, the other algorithms require some parameters to be set by the user based on the particular data to be processed, including p-values, m-fold and window length, among others. OMT is the algorithm that requires the smallest number of parameters: only the average fragment length is needed.

B. Analysis of Genomic Features

We have also biologically validated the peaks detected by OMT against the results of independent qPCR experiments for the FoxA1 protein. For this, we considered 25 true positive and 7 true negative regions reported in [14]. The results of two other well-known methods, T-PIC and MACS, are included in the comparison. Table IV shows the result of this biological validation for each method. Like the other two methods, OMT rejects all true negatives, even though it finds a larger number of regions overall. OMT also shows a higher sensitivity, finding more true positives (15 of 25) than T-PIC (13) and MACS (12). As an example, two true positive regions in chromosomes 3 and 13 of the FoxA1 dataset are shown in Figure 2. Both peaks are detected by OMT, but only the bottom one is detected by T-PIC and neither is detected by MACS.

In another experiment, we compared the type and corresponding number of regions found by these three methods in the FoxA1 dataset. Table III shows the number and percentage of regions located in gene, exon, intron and promoter areas, as well as in intergenic regions. OMT detects more regions corresponding to genes, promoters and exons, while the percentage of regions detected within intergenic areas by our proposed method (45.70%) is lower than that of the other two methods. In contrast, the percentage of detected regions corresponding to introns (86.15%) is not the highest: it lies between those of MACS (85.70%) and T-PIC (86.98%).

V. DISCUSSION AND CONCLUSION

We have presented a multi-level thresholding algorithm that can be applied to the efficient analysis of ChIP-Seq data to find significant peaks and genomic features. OMT can be applied to high-throughput next generation sequencing data with different characteristics, and allows us to detect significant regions in ChIP-Seq data. OMT has been shown to be sound and robust in experiments. Finding more genomic features than two other state-of-the-art methods, MACS and T-PIC, and using fewer parameters are other interesting features of OMT.

Table II
CONCEPTUAL COMPARISON OF RECENTLY PROPOSED METHODS FOR ChIP-Seq DATA.

Method   Peak selection criteria                              Peak ranking   Parameters
MACS     local region Poisson p-value                         p-value        p-value threshold, tag length, m-fold for shift estimate
T-PIC    local height threshold                               p-value        average fragment length, significance p-value, minimum length of interval
OMT      number of ChIP reads minus control reads in window   p-value        average fragment length

Table III
COMPARISON BETWEEN OUR PROPOSED METHOD, MACS AND T-PIC, BASED ON THE PERCENTAGE OF DETECTED REGIONS WHICH BELONG TO DIFFERENT GENOMIC FEATURES.

Method   Number of   Genes              Exons              Introns            Promoters          Intergenic regions
         regions     Regions   %        Regions   %        Regions   %        Regions   %        Regions   %
MACS     13,639      12,125    88.90    976       7.16     11,689    85.70    688       5.05     7,533     55.23
T-PIC    17,619      15,529    88.14    1,336     7.58     15,325    86.98    793       4.50     8,794     49.91
OMT      20,032      19,557    97.63    1,941     9.69     17,258    86.15    1,296     6.47     9,155     45.70

Table IV
COMPARISON OF OMT, MACS AND T-PIC, BASED ON THE NUMBER OF TRUE POSITIVE (TP) AND TRUE NEGATIVE (TN) DETECTED PEAKS.

      OMT   T-PIC   MACS
TP    15    13      12
TN    0     0       0

REFERENCES

[1] A. Barski and K. Zhao, “Genomic location analysis by ChIP-Seq,” Journal of Cellular Biochemistry, no. 107, pp. 11–18, 2009.

[2] P. Park, “ChIP-Seq: advantages and challenges of a maturing technology,” Nat Rev Genetics, vol. 10, no. 10, pp. 669–680, 2009.

[3] D. Reiss, M. Facciotti, and N. Baliga, “Model-based deconvolution of genome-wide DNA binding,” Bioinformatics, vol. 24, no. 3, pp. 396–403, 2008.

[4] Y. Zhang, T. Liu, C. Meyer, J. Eeckhoute, D. Johnson, B. Bernstein, C. Nusbaum, R. Myers, M. Brown, W. Li, and X. Liu, “Model-based analysis of ChIP-Seq (MACS),” Genome Biology, vol. 9, no. 9, p. R137, 2008.

[5] J. Rozowsky, G. Euskirchen, R. Auerbach, Z. Zhang, T. Gibson, R. Bjornson, N. Carriero, M. Snyder, and M. Gerstein, “PeakSeq enables systematic scoring of ChIP-Seq experiments relative to controls,” Nature Biotechnology, vol. 27, no. 1, pp. 66–75, 2009.

[6] V. Hower, S. Evans, and L. Pachter, “Shape-based peak identification for ChIP-Seq,” BMC Bioinformatics, vol. 11, no. 81, 2010.

[7] C. Wang, J. Xu, D. Zhang, Z. Wilson, and D. Zhang, “An effective approach for identification of in vivo protein-DNA binding sites from paired-end ChIP-Seq data,” BMC Bioinformatics, vol. 41, no. 1, pp. 117–129, 2008.

[8] I. Rezaeian and L. Rueda, “A new algorithm for finding enriched regions in ChIP-Seq data,” ACM Conference on Bioinformatics, Computational Biology and Biomedicine, to appear, 2012.

[9] L. Rueda, “An Efficient Algorithm for Optimal Multilevel Thresholding of Irregularly Sampled Histograms,” Proceedings of the 7th International Workshop on Statistical Pattern Recognition, pp. 612–621, 2008.

[10] M. Luessi, M. Eichmann, G. Schuster, and A. Katsaggelos, “Framework for efficient optimal multilevel image thresholding,” Journal of Electronic Imaging, vol. 18, 2009.

[11] L. Rueda and I. Rezaeian, “A fully automatic gridding method for cDNA microarray images,” BMC Bioinformatics, vol. 12, p. 113, 2011.

[12] U. Maulik and S. Bandyopadhyay, “Performance Evaluation of Some Clustering Algorithms and Validity Indices,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650–1655, 2002.

[13] T. W. Anderson, “On the Distribution of the Two-Sample Cramer-von Mises Criterion,” Ann. Math. Statist., vol. 33, pp. 1148–1159, 1962.

[14] M. Lupien, J. Eeckhoute, C. A. Meyer, Q. Wang, Y. Zhang, W. Li, J. S. Carroll, X. S. Liu, and M. Brown, “FoxA1 Translates Epigenetic Signatures into Enhancer-driven Lineage-specific Transcription,” Cell, vol. 132, no. 6, pp. 958–970, 2008.
