Finding Genomic Features from Enriched Regions in ChIP-Seq Data
Iman Rezaeian
School of Computer Science
University of Windsor
Windsor, Canada
Email: rezaeia@uwindsor.ca
Luis Rueda
School of Computer Science
University of Windsor
Windsor, Canada
Email: lrueda@uwindsor.ca
Abstract—Finding genomic features in ChIP-Seq data has
become an attractive research topic lately because of the power,
resolution and low noise of next generation sequencing, which make
it a much better alternative to traditional microarray-based methods
such as ChIP-chip. However, handling
ChIP-Seq data is not straightforward, mainly because of the
large amounts of data produced by next generation sequencing.
ChIP-Seq has been widely applied to
finding biomarkers, especially those associated with important
genomic features in epigenomics and transcriptomics, including
binding sites, promoters, exons/introns and transcription sites,
among others. Efficient algorithms for finding relevant regions
in ChIP-Seq data have been proposed, which capture the most
significant peaks from the sequence reads. Among these, multi-
level thresholding algorithms have been applied successfully
for transcriptomics and genomics data analysis, in particular
for detecting significant regions based on next generation
sequencing data.
We show that the Optimal Multilevel Thresholding algorithm
(OMT) achieves higher accuracy in detecting enriched
regions and genomic features of detected regions on FoxA1
data. OMT finds more gene-related regions (gene, exon, promoter)
in comparison with other methods. Using a small
number of parameters is another advantage of the proposed
method.
Keywords: multi-level thresholding; transcriptomics; ChIP-Seq
data analysis
I. INTRODUCTION
Genome-wide mapping of protein-DNA interactions is
essential for understanding transcriptional regulation.
Mapping the binding sites of transcription factors and
other DNA-binding proteins is essential for decoding the gene
regulatory networks that underlie different biological processes.
Chromatin immunoprecipitation followed by high-throughput
sequencing (ChIP-Seq) is a technique
that provides quantitative, genome-wide mapping of target
protein binding events [1], [2].
In ChIP-Seq, a protein is first cross-linked to DNA and the
DNA is subsequently sheared into fragments. Following a size selection
step that enriches for fragments of specified lengths, the
fragment ends are sequenced, and the resulting reads are
aligned to the reference genome. Detecting protein binding
sites from massive sequence-based datasets with millions of
short reads represents a true bioinformatics challenge that
requires considerable computational resources, in spite of
the availability of programs for ChIP-chip analysis [3].
With the increasing popularity of ChIP-Seq technology,
the demand for peak-finding methods has grown, and with it the
need to develop new algorithms. However, due to mapping challenges
and biases in various aspects of existing protocols,
identifying peaks is not a straightforward task.
Different approaches have been proposed for detecting
peaks on ChIP-Seq/RNA-Seq mapped reads. Zhang
et al. presented a model-based analysis of ChIP-Seq data
(MACS), which analyzes data generated by short read
sequencers [4]. It models the length of the sequenced ChIP
fragments and uses it to improve the spatial resolution of
predicted binding sites. A two-pass strategy called PeakSeq
has been presented in [5]. This strategy compensates
for signals caused by open chromatin, as revealed by the
inclusion of the controls. The first pass identifies putative
binding sites and compensates for genomic variation in
mapping the sequences. The second pass filters out sites not
significantly enriched compared to the normalized control,
computing precise enrichments and significance. Tree shape
Peak Identification for ChIP-Seq (T-PIC) is a statistical
approach for calling peaks that has been recently proposed
in [6]. This approach is based on evaluating the significance
of a robust statistical test that measures the extent of pile-up
reads. Another algorithm for identification of binding sites
is site identification from paired-end sequencing (SIPeS) [7],
which can be used for identification of binding sites from
short reads generated from paired-end Illumina ChIP-Seq
technology.
One of the possible problems of the existing methods is
that the location of detected peaks could be non-optimal.
Moreover, for detecting these peaks all of the methods use
a set of parameters that may cause variations of the results
for different datasets. In [8], we have recently proposed a
method for finding significant peaks using optimal multi-
level thresholding (OMT). Here, we show that OMT can
be efficiently used to find genomic features when used in
conjunction with a model for finding the best number of
peaks. The results of our experiments show that our method
can achieve a higher degree of accuracy than two recently
developed methods, MACS and T-PIC, while providing
flexibility when applying it to different datasets.
Figure 1. Schematic representation of the process for finding genomic
features by using OMT: input reads; extend the size of reads to the
fragment length; create a histogram based on fragments for each
chromosome; use the OMT algorithm to find significant peaks in each
window; find genomic features of detected peaks; output relevant peaks
and genomic features.
II. THE PEAK DETECTION METHOD
In ChIP-Seq, a protein is first cross-linked to DNA and
the DNA is subsequently sheared into fragments. Then, the fragment
ends are sequenced, and the resulting reads are aligned to
the genome. Reading the alignments produces
a histogram with genome coordinates on the x-axis and the
frequency of the aligned reads at each genome coordinate
on the y-axis. The aim is to find significant peaks corresponding
to enriched regions. Each peak can be seen as a
homogeneous group (cluster) that is well separated from
the others by “valleys”. In that sense, the problem
can be formulated as one-dimensional clustering. Figure 1
depicts the process of finding peaks and genomic features
corresponding to the regions of interest for the specified
protein. Each module is explained in detail in the next few
sections.
The first step of the model consists of converting the
input BED file into a histogram. After extending each read
to a fragment length based on the direction of each read
(forward or backward), each of them is aligned to the
reference genome based on its coordinates. Afterwards, for
each chromosome, separate histograms for experiment and
control data are created for further processing.
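The histogram construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name build_histogram, the (start, end, strand) tuple layout, the half-open BED-style coordinates and the default fragment length of 200 bp are all hypothetical choices made for the example.

```python
from collections import defaultdict


def build_histogram(reads, fragment_length=200):
    """Return {position: coverage} for the reads of one chromosome.

    reads: iterable of (start, end, strand) tuples with half-open,
    BED-style coordinates; strand is '+' or '-'. (Assumed layout.)
    """
    # Difference array: +1 where a fragment starts, -1 just past its end.
    delta = defaultdict(int)
    for start, end, strand in reads:
        if strand == '+':
            # Forward read: extend downstream from the read start.
            frag_start = start
            frag_end = start + fragment_length
        else:
            # Reverse read: extend upstream from the read end.
            frag_end = end
            frag_start = max(0, end - fragment_length)
        delta[frag_start] += 1
        delta[frag_end] -= 1

    # Sweep the sorted breakpoints and accumulate the running coverage.
    histogram = {}
    coverage = 0
    positions = sorted(delta)
    for here, nxt in zip(positions, positions[1:]):
        coverage += delta[here]
        if coverage > 0:
            for pos in range(here, nxt):
                histogram[pos] = coverage
    return histogram
```

Under these assumptions the function would be called once per chromosome for the experiment reads and once for the control reads, giving the two separate histograms mentioned in the text. A per-base dictionary is used only for readability; a real genome would need a more compact representation.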
Starting from the beginning of the chromosome, a sliding
window of minimum size t is applied to the histogram
and each window is analyzed separately. The sizes of the
windows are not necessarily equal, which prevents truncating a
peak before its end. Thus, for each window, a minimum
number of t bins is used and, starting from the end of
the previous window, the size of the window is increased
until a zero value in the histogram is reached. We consider
a minimum of t = 3,000 in order to ensure that a window
covers at least one peak of typical size.
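The windowing rule can be illustrated with a short sketch. It assumes the histogram is available as a plain list of counts, one per bin, and that a window is closed as soon as its last bin has a zero count; the function name split_into_windows and the half-open (start, end) output format are hypothetical.

```python
def split_into_windows(hist, t=3000):
    """Split a histogram into consecutive, non-overlapping windows.

    hist: list of counts, one per bin (assumed representation).
    Returns a list of half-open (start, end) bin ranges. Each window
    has at least t bins, except possibly the last one, which simply
    takes whatever remains of the chromosome.
    """
    windows = []
    n = len(hist)
    start = 0
    while start < n:
        # Begin with the minimum size t, clipped to the histogram length.
        end = min(start + t, n)
        # Grow the window until its last bin is zero, so that no peak
        # is cut in the middle.
        while end < n and hist[end - 1] != 0:
            end += 1
        windows.append((start, end))
        start = end
    return windows
```

For example, with a toy minimum size of t = 3 the histogram [0, 5, 5, 0, 0, 3, 3, 3, 0, 0] is split into the ranges (0, 4), (4, 9) and (9, 10): the first two windows are extended past three bins because their third bin is still inside a peak.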
After creating the histogram based on fragments (reads),
the histogram is processed to obtain the optimal thresholding
that determines the locations of peaks. More details
about this procedure can be found in [8].
A dynamic programming algorithm for optimal multi-level
thresholding was proposed in our previous work [9];
it also handles irregularly sampled histograms.
The optimal thresholding is the one that maximizes the
between-class variance. The algorithm runs in O(kn^2) time for
k thresholds and a histogram of n bins, and has been further
improved to achieve complexity linear in n, i.e. O(kn), by following the
approach of [10].
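To make the dynamic programming idea concrete, the following is a minimal sketch of the O(kn^2) formulation for a regularly sampled histogram. It is not the algorithm of [9] or the linear-time variant of [10]; the function name optimal_thresholds and the convention that a threshold is the index of the last bin of a class are assumptions. The sketch relies on the fact that, for class weights w_j and class means m_j, the between-class variance equals the sum of w_j m_j^2 minus a constant, so maximizing that sum is equivalent.

```python
def optimal_thresholds(hist, k):
    """Return k thresholds that split the bins into k + 1 contiguous
    classes with maximum between-class variance.

    hist: list of non-negative counts. A threshold t means that a
    class ends at bin t (inclusive). Runs in O(k * n^2) time.
    """
    n = len(hist)
    if not 0 < k < n:
        raise ValueError("need 0 < k < number of bins")

    # Prefix sums of the mass and of the first moment.
    mass = [0.0] * (n + 1)
    moment = [0.0] * (n + 1)
    for i, h in enumerate(hist):
        mass[i + 1] = mass[i] + h
        moment[i + 1] = moment[i] + h * i

    def score(a, b):
        # Contribution w * m^2 of the class made of bins a .. b-1.
        w = mass[b] - mass[a]
        return (moment[b] - moment[a]) ** 2 / w if w > 0 else 0.0

    classes = k + 1
    neg_inf = float("-inf")
    # best[j][i]: best objective for the first i bins split into j classes.
    best = [[neg_inf] * (n + 1) for _ in range(classes + 1)]
    back = [[0] * (n + 1) for _ in range(classes + 1)]
    best[0][0] = 0.0
    for j in range(1, classes + 1):
        for i in range(j, n + 1):
            for m in range(j - 1, i):
                if best[j - 1][m] == neg_inf:
                    continue
                candidate = best[j - 1][m] + score(m, i)
                if candidate > best[j][i]:
                    best[j][i] = candidate
                    back[j][i] = m

    # Follow the back-pointers to recover the class boundaries.
    cuts = []
    i = n
    for j in range(classes, 0, -1):
        i = back[j][i]
        cuts.append(i)
    cuts.reverse()
    # cuts[0] is always 0 (start of the first class); drop it.
    return [c - 1 for c in cuts[1:]]
```

On the small histogram [8, 9, 1, 1, 9, 8] with k = 1, the single threshold falls at bin 2, that is, in the valley between the two groups of high counts.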
III. FINDING RELEVANT PEAKS AND GENOMIC
FEATURES
Finding the correct number of peaks (the number of
regions in each window) is crucial in order to fully automate
the whole process. For this, we need to determine the correct
number of peaks prior to applying the multi-level thresholding
method. This is found by using an index of validity derived
from clustering techniques. We have recently proposed the
α(K) index [11], which is the result of a combination of a
simple index, A(K), and the well-known I index [12]. The
values of α(K) are computed and compared over all possible
numbers of clusters K, and the K with the maximum value of
α(K) is taken as the best number of clusters.
After finding the locations of the detected peaks,
significant peaks are selected in a two-step process. In the first
step, the effective area of each peak is found by shrinking
the peak. In the second step, the two-sample Cramér-von
Mises nonparametric hypothesis test [13], with significance level α = 0.01, is
used to accept/reject peaks based on the comparison between
experiment and control histograms corresponding to each
peak. Finally, the peaks are ranked and returned as the final
relevant peaks.
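The second step can be sketched as follows. The statistic is the two-sample Cramér-von Mises criterion T of [13], evaluated at every pooled observation. Two points are assumptions of this sketch rather than part of the method described here: the samples are taken to be lists of positions (for example, read positions within the peak for experiment and control), and the p-value is obtained with a permutation test instead of the tabulated or limiting distribution. The function names are hypothetical.

```python
import random
from bisect import bisect_right


def cvm_statistic(x, y):
    """Two-sample Cramer-von Mises criterion T.

    T = N*M/(N+M)^2 * sum over pooled observations z of
    (F_N(z) - G_M(z))^2, where F_N and G_M are the empirical
    distribution functions of x and y.
    """
    n, m = len(x), len(y)
    xs, ys = sorted(x), sorted(y)
    total = 0.0
    for z in xs + ys:
        f = bisect_right(xs, z) / n  # empirical CDF of x at z
        g = bisect_right(ys, z) / m  # empirical CDF of y at z
        total += (f - g) ** 2
    return n * m / (n + m) ** 2 * total


def cvm_permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Permutation p-value for T (a stand-in for the tables of [13])."""
    rng = random.Random(seed)
    observed = cvm_statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Re-split the shuffled pool into two groups of the original sizes.
        if cvm_statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Under these assumptions, a peak would be kept when cvm_permutation_pvalue(experiment, control) is below 0.01, meaning that the experiment and control distributions within the peak differ significantly.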
In the next step, using the information gathered from
the UCSC Genome Browser for the NCBI36/hg18 assembly,
the genomic features of each detected peak are
investigated. We assign a genomic feature to a peak if
that peak overlaps with the region containing that genomic
feature. Since a detected peak can be located in a genomic
region with several genomic features, it can also be assigned
several genomic features. For example, if a specific peak
overlaps with an exon and an intron simultaneously, we count
that peak as both an intron and an exon.
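The overlap rule amounts to a simple interval intersection test, sketched below. The data layout is an assumption: a peak is a half-open (start, end) pair and the annotation for one chromosome is a list of (start, end, feature) tuples; the function name assign_features is hypothetical.

```python
def assign_features(peak, annotations):
    """Return the set of feature labels whose regions overlap the peak.

    peak: (start, end), half-open coordinates on one chromosome.
    annotations: iterable of (start, end, feature) on the same
    chromosome (assumed layout, e.g. parsed from an annotation table).
    """
    p_start, p_end = peak
    found = set()
    for a_start, a_end, feature in annotations:
        # Two half-open intervals overlap when each one starts
        # before the other one ends.
        if a_start < p_end and p_start < a_end:
            found.add(feature)
    return found
```

With the annotations (100, 200, "exon") and (200, 500, "intron"), a peak at (180, 220) receives both labels, which mirrors the counting rule in the text; a peak that receives no label at all lies outside every annotated region.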
IV. EXPERIMENTAL RESULTS
We have used the FoxA1 dataset [4], which contains
experiment and control samples of 24 chromosomes. As in
[6], the experiment and control histograms were generated
separately by extending each mapped position (read) into
an appropriately oriented fragment, and then joining the
fragments based on their genome coordinates. The final
histogram was generated by subtracting the control from the
experiment histogram. To find significant peaks, we used
a non-overlapping window whose initial size is 3,000 bp.
To avoid truncating peaks at window boundaries, each window is
extended until the value of the histogram at the end of the
window becomes zero.
Figure 2. Two true positive regions in chromosomes 3 and 13 of the FoxA1
dataset. The x-axis corresponds to the genome position in bp and the y-axis
corresponds to the number of reads. Both peaks are detected by OMT, but
only the bottom one is detected by T-PIC and neither is detected by
MACS.
The enrichment score for each method is computed
as follows. Random intervals from the genome are
created by selecting, from each chromosome, the same number
of intervals with the same lengths as the called peaks
but with random starting locations. Then, the occurrences
of the binding motif in the called peaks and in
the random intervals are counted. The enrichment score
is the ratio of the number of occurrences in the called
peaks to the number of occurrences in the random
intervals.
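A minimal sketch of this computation is given below. It makes several simplifying assumptions that are not part of the description above: the genome is held in memory as a dictionary of sequence strings, the motif is an exact string searched on one strand only, and the random intervals are drawn with a seeded generator. The names count_motif and enrichment_score are hypothetical.

```python
import random


def count_motif(sequence, motif):
    """Count occurrences of motif in sequence, overlaps included."""
    count = 0
    pos = sequence.find(motif)
    while pos != -1:
        count += 1
        pos = sequence.find(motif, pos + 1)
    return count


def enrichment_score(genome, peaks, motif, seed=0):
    """Ratio of motif occurrences in peaks to those in random intervals.

    genome: {chromosome: sequence string} (assumed representation).
    peaks: list of (chromosome, start, end), half-open coordinates.
    """
    rng = random.Random(seed)
    in_peaks = 0
    in_random = 0
    for chrom, start, end in peaks:
        seq = genome[chrom]
        length = end - start
        in_peaks += count_motif(seq[start:end], motif)
        # One random interval per peak: same chromosome, same length,
        # random starting location.
        r_start = rng.randint(0, len(seq) - length)
        in_random += count_motif(seq[r_start:r_start + length], motif)
    if in_random == 0:
        # The ratio is undefined when the random set has no hit.
        return float("inf") if in_peaks else float("nan")
    return in_peaks / in_random
```

Because the denominator depends on the random draw, averaging the score over several seeds would give a more stable estimate; whether this was done in the experiments is not stated.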
A. Comparison with Other Methods for ChIP-Seq Analysis
Table I shows a comparison between OMT and two
recently proposed methods, MACS [4] and T-PIC [6]. As
shown in the table, the number of significant peaks detected
by OMT is higher than those of the other two methods. This
suggests that OMT is able to find some significant peaks
that are not detected by the other two methods. Also, the
enrichment ratio for OMT (2.62) is far higher than that of MACS
(1.68) and slightly higher than that of T-PIC (2.54). Moreover, the
average size of the peaks is smaller than for the other two
methods, which suggests that
OMT is able to locate significant peaks more precisely.
Table I
COMPARISON BETWEEN OMT AND TWO RECENTLY PROPOSED
METHODS, MACS AND T-PIC, BASED ON NUMBER OF DETECTED
PEAKS, MEAN LENGTH OF DETECTED PEAKS AND ENRICHMENT SCORE.
Dataset  Method of comparison  OMT     T-PIC   MACS
FoxA1    Detected peaks        20,032  17,619  13,639
FoxA1    Mean length of peaks  306     510     394
FoxA1    Enrichment ratio      2.62    2.54    1.68
A conceptual comparison of OMT with MACS and T-PIC
based on their features is shown in Table II. As shown in
the table, the other algorithms require some parameters to be
set by the user based on the particular data to be processed,
including p-values, m-fold, window length, among others.
OMT is the algorithm that requires the smallest number of
parameters. Only the average fragment length is needed.
B. Analysis of Genomic Features
We have also biologically validated the peaks detected by
OMT against the results of independent qPCR experiments for
the FoxA1 protein. For this, we considered 25 true positive
and 7 true negative regions reported in [14]. The results
of two other well-known methods, T-PIC and MACS, are
included in the comparison. Table IV shows the result of
this biological validation for each method. Like the other two
methods, OMT has been able to reject all true negatives.
OMT, which finds a larger number of regions overall, also
shows a higher sensitivity, finding more true positives (15) than
T-PIC (13) and MACS (12). As an example, two true positive regions
in chromosomes 3 and 13 of FoxA1 are shown in Figure 2.
Both peaks are detected by OMT, but only the bottom one is
detected by T-PIC and neither is detected by MACS.
In another experiment, we compared the type and corresponding
number of regions found by these three methods
in the FoxA1 dataset. Table III shows the number and percentage of
regions that are located in gene, promoter, intron and exon
areas as well as in intergenic regions. OMT was able to
detect more regions corresponding to genes, promoters and
exons, while the percentage of detected regions within intergenic
areas is lower for our proposed method (45.70%) than for
the other two methods (49.91% for T-PIC and 55.23% for MACS).
In contrast, the percentage of detected regions corresponding to
introns found by OMT (86.15%) is not higher than that of T-PIC
(86.98%) and only slightly above that of MACS (85.70%).
V. DISCUSSION AND CONCLUSION
We have presented a multi-level thresholding algorithm
that can be applied to the efficient analysis of ChIP-Seq data
to find significant peaks and genomic features. OMT can
be applied to high-throughput next generation sequencing
data with different characteristics, and allows us to detect
significant regions in ChIP-Seq data. OMT has been shown
to be sound and robust in experiments. Finding more genomic
features in comparison with two other state-of-the-art
methods, MACS and T-PIC, and using fewer parameters are
other interesting features of OMT.
Table II
CONCEPTUAL COMPARISON OF RECENTLY PROPOSED METHODS FOR ChIP-Seq DATA.
Method  Peak selection criteria                             Peak ranking  Parameters
MACS    local region Poisson p-value                        p-value       p-value threshold, tag length, m-fold for shift estimate
T-PIC   local height threshold                              p-value       average fragment length, significance p-value, minimum length of interval
OMT     number of ChIP reads minus control reads in window  p-value       average fragment length
Table III
COMPARISON BETWEEN OUR PROPOSED METHOD, MACS AND T-PIC, BASED ON THE PERCENTAGE OF DETECTED REGIONS WHICH BELONG TO
DIFFERENT GENOMIC FEATURES.
Method  Number of regions  Genes            Exons            Introns          Promoters        Intergenic regions
                           Regions  %       Regions  %       Regions  %       Regions  %       Regions  %
MACS    13,639             12,125   88.90   976      7.16    11,689   85.70   688      5.05    7,533    55.23
T-PIC   17,619             15,529   88.14   1,336    7.58    15,325   86.98   793      4.50    8,794    49.91
OMT     20,032             19,557   97.63   1,941    9.69    17,258   86.15   1,296    6.47    9,155    45.70
Table IV
COMPARISON OF OMT, MACS AND T-PIC, BASED ON THE NUMBER OF
TRUE POSITIVE (TP) AND TRUE NEGATIVE (TN) DETECTED PEAKS.
    OMT  T-PIC  MACS
TP  15   13     12
TN  0    0      0
REFERENCES
[1] A. Barski and K. Zhao, “Genomic location analysis by ChIP-Seq,”
Journal of Cellular Biochemistry, no. 107, pp. 11–18,
2009.
[2] P. Park, “ChIP-Seq: advantages and challenges of a maturing
technology,” Nat Rev Genetics, vol. 10, no. 10, pp. 669–680,
2009.
[3] D. Reiss, M. Facciotti, and N. Baliga, “Model-based deconvolution
of genome-wide DNA binding,” Bioinformatics, vol. 24,
no. 3, pp. 396–403, 2008.
[4] Y. Zhang, T. Liu, C. Meyer, J. Eeckhoute, D. Johnson,
B. Bernstein, C. Nusbaum, R. Myers, M. Brown, W. Li, and
X. Liu, “Model-based analysis of ChIP-Seq (MACS),” Genome
Biology, vol. 9, no. 9, p. R137, 2008.
[5] J. Rozowsky, G. Euskirchen, R. Auerbach, Z. Zhang, T. Gibson,
R. Bjornson, N. Carriero, M. Snyder, and M. Gerstein,
“PeakSeq enables systematic scoring of ChIP-Seq experiments
relative to controls,” Nature Biotechnology, vol. 27, no. 1, pp.
66–75, 2009.
[6] V. Hower, S. Evans, and L. Pachter, “Shape-based peak
identification for ChIP-Seq,” BMC Bioinformatics, vol. 11,
no. 81, 2010.
[7] C. Wang, J. Xu, D. Zhang, Z. Wilson, and D. Zhang,
“An effective approach for identification of in vivo protein-DNA
binding sites from paired-end ChIP-Seq data,” BMC
Bioinformatics, vol. 41, no. 1, pp. 117–129, 2008.
[8] I. Rezaeian and L. Rueda, “A new algorithm for finding
enriched regions in ChIP-Seq data,” ACM Conference on
Bioinformatics, Computational Biology and Biomedicine, to
appear, 2012.
[9] L. Rueda, “An Efficient Algorithm for Optimal Multilevel
Thresholding of Irregularly Sampled Histograms,” Proceedings
of the 7th International Workshop on Statistical Pattern
Recognition, pp. 612–621, 2008.
[10] M. Luessi, M. Eichmann, G. Schuster, and A. Katsaggelos,
“Framework for efficient optimal multilevel image thresholding,”
Journal of Electronic Imaging, vol. 18, 2009.
[11] L. Rueda and I. Rezaeian, “A fully automatic gridding method
for cDNA microarray images,” BMC Bioinformatics, vol. 12,
p. 113, 2011.
[12] U. Maulik and S. Bandyopadhyay, “Performance Evaluation
of Some Clustering Algorithms and Validity Indices,” IEEE
Trans. on Pattern Analysis and Machine Intelligence, vol. 24,
no. 12, pp. 1650–1655, 2002.
[13] T. W. Anderson, “On the Distribution of the Two-Sample
Cramer-von Mises Criterion,” Ann. Math. Statist., vol. 33, pp.
1148–1159, 1962.
[14] M. Lupien, J. Eeckhoute, C. A. Meyer, Q. Wang, Y. Zhang,
W. Li, J. S. Carroll, X. S. Liu, and M. Brown, “FoxA1 Translates
Epigenetic Signatures into Enhancer-driven Lineage-specific
Transcription,” Cell, vol. 132, no. 6, pp. 958–970,
2008.