A HMM Approach to Identifying Distinct DNA Methylation Patterns for Subtypes of Breast Cancers

Thesis
Presented in Partial Fulfillment of the Requirements for the Degree Master of Science
in the Graduate School of The Ohio State University
By
Maoxiong Xu, B.S.
Graduate Program in Computer Science and Engineering
The Ohio State University
2011
Thesis Committee:
Victor X. Jin, Advisor
Raghu Machiraju
Copyright by
Maoxiong Xu
2011
Abstract
The United States has the highest annual incidence rates of breast cancer in the
world: 128.6 per 100,000 among whites and 112.6 per 100,000 among African Americans [1, 2].
Breast cancer is the second most common cancer (after skin cancer) and the second most
common cause of cancer death (after lung cancer) [1]. Recent studies have demonstrated that
hyper-methylation of CpG islands may be implicated in tumorigenesis, acting as a mechanism
that inactivates the expression of a diverse array of genes (Baylin et al., 2001).
Genes reported to be regulated by CpG hyper-methylation include tumor
suppressor genes, cell cycle related genes, DNA mismatch repair genes, hormone
receptors, and tissue or cell adhesion molecules (Yan et al., 2001). Breast cancer
cells may or may not express three important receptors: estrogen receptor (ER),
progesterone receptor (PR), and HER2. We therefore take ER, PR, and HER2 status into
account when handling the data. In this thesis, we first train a Hidden Markov Model (HMM)
on methylation data from both breast cancer cells and other cancer cells. We then apply
hierarchical clustering to the gene expression data for the breast cancer cells and, based on
the clustering results, obtain the methylation distribution in each cluster. Finally, we
correlate the HMM training results with the methylation distribution to assign biological
meanings to the states in the HMM results.
Dedicated to my father, mother, and wife,
for all of their love and support.
Acknowledgments
I have many people to thank for my making it this far: my advisor, Dr. Victor Jin,
for everything he's done; Dr. Raghu Machiraju, for his help and support; all of my lab
mates, for their knowledge, assistance, and encouragement; and the incredible
Biomedical Informatics Department staff for everything they do.
Vita
2005……………………………...Mudu Central High School
2009……………………………...B.S. Computer Science, Southeast University
2009 to present……….……..……M.S. Computer Science & Engineering, The
Ohio State University
Sep. 2010 to present……………...Graduate Teaching Associate, Department
of Bioinformatics, The Ohio State University
Publications
Cao AR, Rabinovich R, Xu M, Xu X, Jin VX, Farnham PJ: Genome-wide analysis of
transcription factor E2F1 mutant proteins reveals that N- and C-terminal protein
interaction domains do not participate in targeting E2F1 to the human genome.
J Biol Chem. 2011 Apr 8; 286(14):11985-96. Epub 2011 Feb 10.
Fields of Study
Major: Computer Science & Engineering
Machine Learning applied in Bioinformatics
Table of Contents
Abstract
Dedication
Acknowledgments
Vita
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
1.1 Methylation
1.1.1 What Is Methylation?
1.1.2 DNA Methylation
1.1.3 DNA Methylation Mechanism
1.1.4 DNA Methylation in Cancer
1.2 Gene Expression
1.2.1 Gene Expression Measurement
1.2.2 mRNA Quantification
1.2.3 Regulation of Gene Expression
1.3 Hidden Markov Model
1.3.1 Introduction to Hidden Markov Model
1.3.2 Hidden Markov Model
1.3.3 Model Architecture
1.3.4 HMM Training and Decoding
1.3.5 HMMs in Computational Biology
1.3.6 Application of HMMs to Specific Problems
Chapter 2: Methods and Algorithms
2.1 The Probabilistic Model
2.2 Baum-Welch Algorithm
2.3 Work Flow
Chapter 3: Data Process
3.1 Data Sets
3.2 MBD-seq Protocol
3.3 Data Preprocess
3.4 Input for HMM
3.5 Methylation Distribution Overview
3.6 Gene Expression Data
Chapter 4: Results and Discussion
4.1 Results from HMM
4.2 Biology Meanings
4.2.1 Gene Expression Results for 33 Breast Cancer Cell Lines
4.2.2 Results Based on Different Clusters
4.2.3 States Meanings and Group Patterns
Chapter 5: Data Visualization
Chapter 6: Conclusions and Suggestions for Further Work
6.1 Conclusion
6.2 Future Work
References
Appendix: Formats
A. BAM format
B. SAM format
C. Export format
D. BED format
E. Fastq format
F. Bowtie output format
List of Tables
Table 3.1 Data summary for 36 cell lines
Table 3.2 12 Groups for 36 cell lines
Table 4.1 BIC results for HMM results
Table 4.2 Transition Matrix
Table 4.3 Emission probabilities for each mark in each state
Table 4.4 Ordered emission probabilities for each mark in each state - marks
Table 4.5 Ordered emission probabilities for each mark in each state - probabilities
Table 4.6 Filtered ordered emission probabilities for each mark in each state - marks
Table 4.7 Number of genes in each cluster
Table 4.8 First 3 marks for each state
Table 4.9 States and interval correlation results
Table 4.10 States meanings
Table 4.11 Patterns for subtypes of Breast cancers
List of Figures
Fig 1.1 Methylation
Fig 1.2 DNA methylation
Fig 1.3 DNA methylation mechanism
Fig 1.4 DNA methylation in cancer
Fig 1.5 Gene Expression
Fig 1.6 A simple HMM λ = (A, B, π), where N = 3, M = 3, a12, a23, a32 are non-zero,
b1(a), b2(t), b3(g) = 1 and π = (1, 0, 0)
Fig 2.1 A broad overview of the HMM work-flow, highlighting the most significant
inputs, transformations, and outputs at each step from start to end
Fig 3.1 Bar figure for 36 cell lines
Fig 3.2 Methylation distribution for 33 breast cancer cell lines
Fig 4.1 Heatmap for transition matrix
Fig 4.2 33 Breast Cancer Cell Gene Expression One-Way Hierarchy Clustering
Fig 4.3 Grouped 33 Breast Cancer Cell Gene Expression One-Way Hierarchy Clustering
Fig 4.4 Methylation distribution based on cluster 1 genes
Fig 4.5 Methylation distribution based on cluster 2 genes
Fig 4.6 Methylation distribution based on cluster 3 genes
Fig 4.7 Methylation distribution based on cluster 4 genes
Fig 4.8 Methylation distribution based on cluster 5 genes
Fig 4.9 Methylation distribution based on cluster 6 genes
Fig 4.10 Methylation distribution based on cluster 7 genes
Fig 4.11 Methylation distribution based on cluster 8 genes
Fig 4.12 Methylation distribution based on cluster 9 genes
Fig 5.1 Database Web Tool
Chapter 1: Introduction
1.1 Methylation
1.1.1 What Is Methylation?
In the chemical sciences, methylation means the addition of a methyl
group to a substrate or the substitution of an atom or group by a methyl group.
Methylation is a form of alkylation in which, specifically, a methyl group, rather than a
larger carbon chain, replaces a hydrogen atom.
In biological systems, methylation is catalyzed by enzymes; such methylation can be
involved in the modification of heavy metals, regulation of gene expression, regulation
of protein function, and RNA metabolism. Methylation of heavy metals can also occur
outside of biological systems. Chemical methylation of tissue samples is also one
method for reducing certain histological staining artifacts.
Fig 1.1 Methylation
The term methylation in organic chemistry refers to the alkylation process used to
describe the delivery of a CH3 group [3]. This is commonly performed using electrophilic
methyl sources - iodomethane, dimethyl sulfate, dimethyl carbonate, or, less commonly,
the more powerful (and more dangerous) methylating reagents methyl triflate or
methyl fluorosulfonate (magic methyl), which all react via SN2 nucleophilic substitution.
For example a carboxylate may be methylated on oxygen to give a methyl ester, an
alkoxide salt RO− may be likewise methylated to give an ether, ROCH3, or a ketone
enolate may be methylated on carbon to produce a new ketone.
1.1.2 DNA Methylation
After every cycle of DNA replication, several modifications occur in the DNA.
DNA methylation is one such post-synthesis modification. It is an epigenetic
modification involved in both normal developmental processes and disease states
through the modulation of gene expression and the maintenance of genomic
organization [4]. DNA methylation has been shown to play a role in a
number of biological processes such as regulation of imprinted genes, X chromosome
inactivation, and tumor suppressor gene silencing in cancerous cells. It also acts as a
protection mechanism adopted by pathogen DNA (mainly bacterial) against the
endonuclease activity that destroys any foreign DNA [5, 6].
Fig 1.2 DNA methylation
DNA cytosine methylation is the covalent addition of a methyl group to the 5
position of cytosine. In humans, DNA methylation occurs predominantly in a CpG
dinucleotide context and is catalyzed by DNA methyltransferases [7, 8, 9]. Dense clusters of
CpG dinucleotides, termed CpG islands, are present in roughly 40% of gene promoters,
and methylation of these regions is associated with transcriptional silencing [10, 11]. DNA
methylation is essential for normal developmental processes, such as imprinting [12] and
X chromosome inactivation [13]. Dysregulation of DNA methylation occurs in disease
states such as cancer, where promoter CpG island hyper-methylation leads to inactivation
of tumor suppressor genes [14, 15]. Thus, many tumor suppressors classically identified
through mutation analyses, such as APC [16, 17], BRCA1 [18, 19], and CDKN2A [20, 21], have
also been found to be transcriptionally silenced by promoter hyper-methylation.
1.1.3 DNA Methylation Mechanism
In DNA, methylation usually occurs in the CpG islands, a CG rich region,
upstream of the promoter region. The letter “p” here signifies that the C and G are
connected by a phosphodiester bond. In humans, DNA methylation is carried out by a
group of enzymes called DNA methyltransferases. These enzymes not only determine the
DNA methylation patterns during the early development, but are also responsible for
copying these patterns to the strands generated from DNA replication [6].
DNA methylation involves the addition of a methyl group to the 5 position of the
cytosine pyrimidine ring or the number 6 nitrogen of the adenine purine ring (cytosine
and adenine are two of the four bases of DNA). This modification can be inherited
through cell division. DNA methylation is typically removed during zygote formation and
re-established through successive cell divisions during development, although recent
research shows that hydroxylation of the methyl group occurs rather than complete
removal of methyl groups in the zygote [22]. DNA methylation is a crucial part of normal
organismal development and cellular differentiation in higher organisms. DNA
methylation stably alters the gene expression pattern in cells such that cells can
"remember where they have been" or decrease gene expression; for example, cells
programmed to be pancreatic islets during embryonic development remain pancreatic
islets throughout the life of the organism without continuing signals telling
them that they need to remain islets. In addition, DNA methylation suppresses
the expression of viral genes and other deleterious elements that
have been incorporated into the genome of the host over time. DNA methylation also
forms the basis of chromatin structure, which enables cells to form the myriad
characteristics necessary for multicellular life from a single immutable sequence of DNA.
DNA methylation also plays a crucial role in the development of nearly all types of
cancer [23].
Fig 1.3 DNA methylation mechanism
1.1.4 DNA Methylation in Cancer
DNA methylation is an important regulator of gene transcription, and a large body
of evidence has demonstrated that aberrant DNA methylation is associated with
unscheduled gene silencing; genes with high levels of 5-methylcytosine in their
promoter region are transcriptionally silent. DNA methylation is essential during
embryonic development, and in somatic cells, patterns of DNA methylation are generally
transmitted to daughter cells with high fidelity. Aberrant DNA methylation patterns
have been associated with a large number of human malignancies and are found in two
distinct forms: hyper-methylation and hypo-methylation compared to normal tissue.
Hyper-methylation is one of the major epigenetic modifications that repress transcription
via the promoter regions of tumor suppressor genes. Hyper-methylation typically occurs at
CpG islands in the promoter region and is associated with gene inactivation. Global
hypo-methylation has also been implicated in the development and progression of cancer
through different mechanisms [24].
Fig 1.4 DNA methylation in cancer
1.2 Gene Expression
Fig 1.5 Gene Expression
Gene expression is the process by which information from a gene is used in
the synthesis of a functional gene product. These products are often proteins, but in
non-protein coding genes such as rRNA genes or tRNA genes, the product is a
functional RNA. The process of gene expression is used by all known life - eukaryotes
(including multicellular organisms), prokaryotes (bacteria and archaea) and viruses - to generate the
macromolecular machinery for life. Several steps in the gene expression process may be
modulated, including the transcription, RNA splicing, translation, and post-translational
modification of a protein. Gene regulation gives the cell control over structure and
function, and is the basis for cellular differentiation, morphogenesis and the versatility
and adaptability of any organism. Gene regulation may also serve as a substrate for
evolutionary change, since control of the timing, location, and amount of gene expression
can have a profound effect on the functions (actions) of the gene in a cell or in a
multicellular organism.
In genetics gene expression is the most fundamental level at which genotype gives
rise to the phenotype. The genetic code is "interpreted" by gene expression, and the
properties of the expression products give rise to the organism's phenotype.
1.2.1 Gene Expression Measurement
Measuring gene expression is an important part of many life sciences - the ability
to quantify the level at which a particular gene is expressed within a cell, tissue or
organism can give a huge amount of information. For example measuring gene
expression can:
Identify viral infection of a cell (viral protein expression)
Determine an individual's susceptibility to cancer (oncogene expression)
Find if a bacterium is resistant to penicillin (beta-lactamase expression)
Similarly, analyzing the location of protein expression is a powerful tool, and
this can be done on an organismal or cellular scale. Investigation of localization is
particularly important for the study of development in multicellular organisms and as an
indicator of protein function in single cells. Ideally, expression is measured by
detecting the final gene product (for many genes this is the protein); however, it is often
easier to detect one of the precursors, typically mRNA, and infer the gene expression level.
1.2.2 mRNA Quantification
Levels of mRNA can be quantitatively measured by Northern blotting which
gives size and sequence information about the mRNA molecules. A sample of RNA is
separated on an agarose gel and hybridized to a radio-labeled RNA probe that is
complementary to the target sequence. The radio-labeled RNA is then detected by an
autoradiograph. The main problems with Northern blotting stem from the use of
radioactive reagents (which make the procedure time consuming and potentially
dangerous) and lower quality quantification than more modern methods (due to the fact
that quantification is done by measuring band strength in an image of a gel). Northern
blotting is, however, still widely used as the additional mRNA size information allows
the discrimination of alternately spliced transcripts.
A more modern low-throughput approach for measuring mRNA abundance is
reverse transcription quantitative polymerase chain reaction (RT-PCR followed by
qPCR). RT-PCR first generates a DNA template from the mRNA by reverse transcription.
The DNA template is then used for qPCR, where the fluorescence of a probe
changes as the DNA amplification process progresses. With a carefully constructed
standard curve qPCR can produce an absolute measurement such as number of copies of
mRNA, typically in units of copies per nanolitre of homogenized tissue or copies per cell.
qPCR is very sensitive (detection of a single mRNA molecule is possible), but can be
expensive due to the fluorescent probes required.
Northern blots and RT-qPCR are good for detecting whether a single gene is
being expressed, but they quickly become impractical if many genes within the sample are
being studied. Using DNA microarrays, transcript levels for many genes at once
(expression profiling) can be measured. Recent advances in microarray technology allow
for the quantification, on a single array, of transcript levels for every known gene in
several organisms' genomes, including humans.
Alternatively "tag based" technologies like Serial analysis of gene expression
(SAGE), which can provide a relative measure of the cellular concentration of different
messenger RNAs, can be used. The great advantage of tag-based methods is the "open
architecture", allowing for the exact measurement of any transcript, with a known or
unknown sequence.
1.2.3 Regulation of Gene Expression
Regulation of gene expression refers to the control of the amount and timing of
appearance of the functional product of a gene. Control of expression is vital to allow a
cell to produce the gene products it needs when it needs them; in turn this gives cells the
flexibility to adapt to a variable environment, external signals, damage to the cell, etc.
Some simple examples of where gene expression is important are:
Control of Insulin expression so it gives a signal for blood glucose regulation
X chromosome inactivation in female mammals to prevent an "overdose" of the genes it
contains.
Cycling expression levels control progression through the eukaryotic cell cycle
More generally gene regulation gives the cell control over all structure and
function, and is the basis for cellular differentiation, morphogenesis and the versatility
and adaptability of any organism.
Any step of gene expression may be modulated, from the DNA-RNA
transcription step to post-translational modification of a protein. The stability of the final
gene product, whether it is RNA or protein, also contributes to the expression level of the
gene - an unstable product results in a low expression level. In general gene expression is
regulated through changes in the number and type of interactions between molecules that
collectively influence transcription of DNA and translation of RNA.
DNA methylation is a widespread mechanism for epigenetic influence on gene
expression and is seen in bacteria and eukaryotes and has roles in heritable transcription
silencing and transcription regulation. In eukaryotes the structure of chromatin, controlled
by the histone code, regulates access to DNA with significant impacts on the expression
of genes in euchromatin and heterochromatin areas.
1.3 Hidden Markov Model (HMM)
1.3.1 Introduction to HMM
A Hidden Markov Model (HMM) is a stochastic model that captures the statistical
properties of observed real world data. A good HMM accurately models the real world
source of the observed data and has the ability to simulate the source. Machine learning
techniques based on HMMs have been successfully applied to problems including speech
recognition, optical character recognition, and, as we will examine, problems in
computational biology.
Methylation finding or prediction has become one of the foremost computational
biology problems for two reasons. First, completely sequenced genomes have become
readily available. Second, and most important, there is a need to extract actual biological
knowledge from this data to explain the molecular interactions that occur in cells and to
define important cellular pathways. Discovering the location of hyper-methylation on the
genome is a very important step towards building such a body of knowledge. This thesis
will introduce several different statistical and algorithmic methods for hyper-methylation
finding, with a focus on the statistical model-based approach using HMMs.
1.3.2 Hidden Markov Models
A basic Markov model of a process is a model where each state corresponds to an
observable event and the state transition probabilities depend only on the current and
predecessor state. This model is extended to a Hidden Markov model for application to
more complex processes, including speech recognition and computational gene finding.
A generalized Hidden Markov Model (HMM) consists of a finite set of states, an
alphabet of output symbols, a set of state transition probabilities and a set of emission
probabilities. The emission probabilities specify the distribution of output symbols that
may be emitted from each state. Therefore in a hidden model, there are two stochastic
processes; the process of moving between states and the process of emitting an output
sequence. The sequence of state transitions is a hidden process and is observed through
the sequence of emitted symbols.
Let us formalize the definition of an HMM in the following way, taken from an
HMM tutorial by Lawrence Rabiner [25]. An HMM is defined by the following elements:
1. Set S of N states, S = {S1, S2, …, SN}. The state at step t is written Qt.
2. Set O of M observation symbols, the output alphabet, O = {o1, o2, …, oM}.
3. Set A of state transition probabilities, A = {aij}, where aij is the probability of moving
from state i to state j:
$$a_{ij} = P(Q_{t+1} = S_j \mid Q_t = S_i), \quad 1 \le i, j \le N \tag{E1.1}$$
4. Set B of observation symbol probabilities at state j, B = {bj(k)}, where bj(k) is the
probability of emitting symbol k at state j:
$$b_j(k) = P(o_k \text{ at } t \mid Q_t = S_j), \quad 1 \le j \le N,\; 1 \le k \le M \tag{E1.2}$$
5. Set π, the initial state distribution, π = {πi}, where πi is the probability that the start state is
state i:
$$\pi_i = P(Q_1 = S_i), \quad 1 \le i \le N \tag{E1.3}$$
Given the definitions above, the notation of a model is λ = (A, B, π).
Figure 1.6: A simple HMM λ= (A,B, π),where N = 3, M = 3, a12,a23,a32 are non-zero,
b1(a), b2(t),b3(g) = 1 and π = 1, 0, 0. Note that states can be 'null' states that do not emit
any symbol.
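As a minimal sketch, the toy model of Figure 1.6 can be written down directly as the triple λ = (A, B, π). The figure only states that a12, a23 and a32 are non-zero; setting each of them to 1.0 is an assumption made here so that every row of A sums to one, and the names `is_valid_model` and `sample` are hypothetical helpers, not part of the thesis.

```python
import random

# Toy model from Figure 1.6: N = 3 states, M = 3 output symbols.
# The figure only says a12, a23 and a32 are non-zero; setting each to 1.0
# is an assumption made here so that every row of A sums to one.
SYMBOLS = ["a", "t", "g"]
A = [[0.0, 1.0, 0.0],   # state 1 -> state 2
     [0.0, 0.0, 1.0],   # state 2 -> state 3
     [0.0, 1.0, 0.0]]   # state 3 -> state 2
B = [[1.0, 0.0, 0.0],   # b1(a) = 1
     [0.0, 1.0, 0.0],   # b2(t) = 1
     [0.0, 0.0, 1.0]]   # b3(g) = 1
PI = [1.0, 0.0, 0.0]    # the chain always starts in state 1


def is_valid_model(A, B, pi, tol=1e-9):
    """Check that pi and every row of A and B sum to one."""
    rows = A + B + [pi]
    return all(abs(sum(row) - 1.0) < tol for row in rows)


def sample(A, B, pi, length, rng):
    """Draw a hidden state path and the symbol string it emits."""
    n = len(pi)
    states, emitted = [], []
    state = rng.choices(range(n), weights=pi)[0]
    for _ in range(length):
        states.append(state)
        k = rng.choices(range(len(B[state])), weights=B[state])[0]
        emitted.append(SYMBOLS[k])
        state = rng.choices(range(n), weights=A[state])[0]
    return states, "".join(emitted)


states, emitted = sample(A, B, PI, 5, random.Random(0))
print(states, emitted)  # states are 0-indexed: [0, 1, 2, 1, 2] atgtg
```

Because every probability in this toy model is 0 or 1, the sampled path is the same for any seed; with less extreme values the state path would vary while only the emitted string is observed.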
1.3.3 Model Architecture
The set of states S, the output symbol alphabet O and the connections between the
states constitute the architecture of a model. The architecture of an HMM is problem
dependent. The model is constructed to correspond to the properties and constraints of the
observed sequences and of the process itself. HMM architecture can also be learned from
the data, but in most computational biology problems, it is advantageous to use known
constraints that characterize the processes.
1.3.4 HMM Training and Decoding
Once the architecture of an HMM has been decided, an HMM must be trained to
closely fit the process it models. Training involves adjusting the transition and output
probabilities until the model sufficiently fits the process. These adjustments are
performed using standard machine learning techniques to optimize P(O|λ), the probability
of the observed sequence O = O1O2…OT given the model λ, over a set of training sequences
(here T is the length of the observation sequence, i.e. the number of 1000bp intervals). The most
common and straightforward algorithm for HMM training is expectation maximization
(EM) which adapts the transition and output parameters by continually re-estimating
these parameters until P(O|λ) has been locally maximized.
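The quantity P(O|λ) that training tries to increase can be evaluated with the forward algorithm. The sketch below is illustrative only: the two-state, two-symbol parameter values are hypothetical, the function name `forward_likelihood` is not from the thesis, and no scaling is applied, so long sequences would underflow in practice.

```python
def forward_likelihood(A, B, pi, obs):
    """Return P(O | lambda) for a discrete HMM via the forward algorithm.

    A[i][j] is the transition probability from state i to state j,
    B[j][k] the probability of emitting symbol k in state j,
    pi[i] the initial probability of state i, and obs a list of symbol indices.
    """
    n = len(pi)
    # alpha[j] = P(O_1..O_t, Q_t = j | lambda), initialised for t = 1
    alpha = [pi[j] * B[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)


# Hypothetical two-state, two-symbol model used only for illustration.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
PI = [0.5, 0.5]

print(forward_likelihood(A, B, PI, [0, 0, 1]))
```

An EM iteration would re-estimate A, B and π and then re-evaluate this likelihood, stopping once it no longer improves.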
HMM decoding involves the prediction of hidden states given an observed
sequence. The problem is to discover the best sequence of states Q = Q1Q2…QT visited
that accounts for an emitted sequence O = O1O2…OT and a model λ. There may be
several different ways to define a best sequence of states. A common decoding algorithm
is the Viterbi algorithm. The Viterbi algorithm uses a dynamic programming approach to
find the most likely sequence of states Q given an observed sequence O and model λ.
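The dynamic programming idea behind Viterbi decoding can be sketched in a few lines. This is a plain-probability version for short sequences, using the same hypothetical two-state model as an assumption; a practical implementation would typically work in log space to avoid underflow.

```python
def viterbi(A, B, pi, obs):
    """Return the most likely state path and its joint probability."""
    n = len(pi)
    # delta[j] = probability of the best path that ends in state j
    delta = [pi[j] * B[j][obs[0]] for j in range(n)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for o in obs[1:]:
        prev = delta
        pointers, delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] * A[i][j])
            pointers.append(best_i)
            delta.append(prev[best_i] * A[best_i][j] * B[j][o])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda j: delta[j])
    best_prob = delta[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_prob


# Hypothetical two-state, two-symbol model used only for illustration.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
PI = [0.5, 0.5]

path, prob = viterbi(A, B, PI, [0, 0, 1, 1])
print(path, prob)  # path is [0, 0, 1, 1]
```

Here state 0 mostly emits symbol 0 and state 1 mostly emits symbol 1, so the decoded path follows the observations.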
1.3.5 HMMs in Computational Biology
The field of computational biology involves the application of computer science
theories and approaches to biological and medical problems. Computational biology is
motivated by newly available and abundant raw molecular datasets gathered from a
variety of organisms. Though the availability of this data marks a new era in biological
research, it alone does not provide any biologically significant knowledge. The goal of
computational biology is then to elucidate additional information regarding protein
coding, protein function and many other cellular mechanisms from the raw datasets. This
new information is required for drug design, medical diagnosis, medical treatment and
countless fields of research.
The majority of raw molecular data used in computational biology corresponds to
sequences of nucleotides corresponding to the primary structure of DNA and RNA or
sequences of amino acids corresponding to the primary structure of proteins. Therefore
the problem of inferring knowledge from this data belongs to the broader class of
sequence analysis problems.
Two of the most studied sequence analysis problems are speech recognition and
language processing. Biological sequences have the same left-to-right linear aspect as
sequences of sounds corresponding to speech and sequences of words representing
language. Consequently, the major computational biology sequence analysis problems
can be mapped to linguistic problems. A common linguistic metaphor in computational
biology is that of protein family classification as speech recognition. The metaphor
suggests interpreting different proteins belonging to the same family as different
vocalizations of the same word. Another metaphor is gene finding in DNA sequences as
the parsing of language into words and semantically meaningful sentences. It follows that
biological sequences can be treated linguistically with the same techniques used for
speech recognition and language processing.
Since the theory of HMMs was formalized in the late 1960s, several scientists
have applied the theory to speech recognition and language processing. Just as HMMs
were first introduced as mathematical models of language, HMMs can be used as
mathematical models of molecular processes and biological sequences. In addition,
HMMs have been applied to linguistics because they are suited for problems where the
exact theory may be unknown but where there exist large amounts of data and
knowledge derived from observation. As this is also the situation in biology, HMM-based
approaches have been successfully applied to problems in computational biology. The
main benefit is that an HMM provides a good method for learning the theory from the
data and can provide a structured model of sequence data and molecular processes.
1.3.6 Application of HMMs to Specific Problems
It is clear that an HMM-based approach is a logical idea for tackling problems in
computational biology. Much work has been performed and applications have been built
using such an approach.
Baldi and Brunak [26] define three main groups of problems in computational biology for
which HMMs have proven especially useful.
First, HMMs can be used for multiple alignments of DNA sequences, which is a
difficult task to perform using a naive dynamic programming approach. Second, the
structure of trained HMMs can uncover patterns in biological data. Such patterns have
been used to discover periodicities within specific regions of the data and to help predict
regions in sequences prone to forming specific structures. Third is the large set of
classification problems. HMM-based approaches have been applied to structure
prediction - the problem of classifying each nucleotide according to which structure it
belongs to. HMMs have also been used in protein profiling to discriminate between
different protein families and predict a new protein's family or subfamily. HMM-based
approaches have also been successful when applied to the problem of methylation
function finding, that is, predicting the function of methylation patterns according to
the role they perform. The remainder of this thesis is concerned with this last
problem of computational methylation function prediction.
Chapter 2: Methods and Algorithms
2.1 The Probabilistic Model
The probabilistic model is based on a multivariate instance of a Hidden Markov
Model (HMM). The model assumes a fixed number of hidden states N. In each hidden
state, the emission distribution, that is, the probability distribution over each
combination of marks, is modeled as a product of independent Bernoulli random variables.
Formally, for each of the N states and M = 4096 (i.e. $2^{12}$) input marks, there is an
emission parameter $b_{k,m}$ denoting the probability in state k (k=1,…,N) that input
mark m (m=1,…,M) has a present call. Let $c \in C$ denote a chromosome, where C is the
set of all chromosomes. Let $c_t$ denote an interval on the genome, where t = 1,…,T
indexes the 1000bp intervals and T is the number of intervals that the genome was
divided into. So $v_{c_t,m}$ is assigned '1' if mark m is detected in the t-th 1000bp
interval on chromosome c, and '0' otherwise. Let $a_{ij}$ denote the probability of
transitioning from state i to state j, where i=1,…,N and j=1,…,N.
We also have parameters $\pi_i$ (i=1,…,N), which denote the probability that the
state of the first interval on the chromosome is i. Let $s_c \in S_c$ be an unobserved
state sequence through chromosome c, and $S_c$ be the set of all possible state
sequences. Let $s_{c_t}$ denote the unobserved state on chromosome c at location t for
state sequence $s_c$. The full likelihood of all of the observed data D for the
parameters a, b, and π can then be expressed as:
$$P(D \mid a, b, \pi) = \prod_{c \in C} \sum_{s_c \in S_c} \pi_{s_{c_1}} \left( \prod_{t=2}^{T} a_{s_{c_{t-1}},\, s_{c_t}} \right) \prod_{t=1}^{T} \prod_{m=1}^{M} b_{s_{c_t},m}^{\,v_{c_t,m}} \left(1 - b_{s_{c_t},m}\right)^{1 - v_{c_t,m}}$$   E2.1
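As an illustration of the emission term in the likelihood, the following minimal sketch evaluates the product of independent Bernoulli variables for one state and one interval, in log space. The function name `log_emission` and the toy parameter values are assumptions made for this example; they do not come from the thesis program.

```python
import math

def log_emission(b_k, v_t):
    """Log-probability of a binary mark vector v_t in a state whose
    per-mark presence probabilities are b_k (independent Bernoulli).
    Assumes every b is strictly between 0 and 1."""
    total = 0.0
    for b, v in zip(b_k, v_t):
        # mark present contributes b, mark absent contributes (1 - b)
        total += math.log(b if v == 1 else 1.0 - b)
    return total

# Toy state with two marks: mark 1 usually present, mark 2 usually absent.
example = math.exp(log_emission([0.9, 0.2], [1, 0]))  # 0.9 * (1 - 0.2)
```

Working in log space avoids underflow when the product runs over many marks.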
2.2 Baum-Welch Algorithm
In electrical engineering, computer science, statistical computing and
bioinformatics, the Baum-Welch algorithm is used to find the unknown parameters of a
hidden Markov model (HMM). It makes use of the forward-backward algorithm and is
named for Leonard E. Baum and Lloyd R. Welch.
A Hidden Markov Model is a probabilistic model of the joint probability of a
collection of random variables $\{O_1, \ldots, O_T, Q_1, \ldots, Q_T\}$. The $O_t$
variables are discrete observations and the $Q_t$ variables are "hidden" and discrete.
Under an HMM, there are two conditional independence assumptions made about these random
variables that make the associated algorithms tractable. These independence assumptions
are:
1. The t-th hidden variable, given the (t-1)-st hidden variable, is independent of
previous variables, or:
$$P(Q_t \mid Q_{t-1}, O_{t-1}, \ldots, Q_1, O_1) = P(Q_t \mid Q_{t-1})$$   E2.2
2. The t-th observation depends only on the t-th state:
$$P(O_t \mid Q_T, O_T, \ldots, Q_1, O_1) = P(O_t \mid Q_t)$$   E2.3
In the following, we present the EM algorithm for finding the maximum-likelihood
estimate of the parameters of a hidden Markov model given a set of observed feature
vectors. This algorithm is also known as the Baum-Welch algorithm.
$Q_t$ is a discrete random variable with N possible values $\{S_1, \ldots, S_N\}$. We
further assume that the underlying "hidden" Markov chain defined by $P(Q_t \mid Q_{t-1})$
is time-homogeneous (i.e., is independent of the time t). Therefore, we can represent
$P(Q_t \mid Q_{t-1})$ as a time-independent stochastic transition matrix
$$A = \{a_{ij}\}, \quad a_{ij} = P(Q_t = S_j \mid Q_{t-1} = S_i)$$   E2.4
The special case of time t=1 is described by the initial state distribution
$$\pi_i = P(Q_1 = S_i)$$   E2.5
We say that we are in state j at time t if $Q_t = S_j$. A particular sequence of states
is described by $Q = (Q_1, \ldots, Q_T)$, where $Q_t \in \{S_1, \ldots, S_N\}$ is the
state at time t.
The observation is one of M possible observation symbols, $O_t \in \{o_1, \ldots, o_M\}$.
The probability of a particular observation at a particular time t for state j is
described by:
$$b_j(o_t) = P(O_t = o_t \mid Q_t = S_j)$$   E2.6
so $B = \{b_j(o_k)\}$ is an N by M matrix.
A particular observation sequence O is described as $O = (O_1 = o_1, \ldots, O_T = o_T)$.
Therefore, we can describe an HMM by λ = (A, B, π). Given an observation O, the
Baum-Welch algorithm finds:
$$\lambda^{*} = \arg\max_{\lambda} P(O \mid \lambda)$$   E2.7
that is, the HMM λ that maximizes the probability of the observation O.
Initialization: set λ = (A, B, π) with random initial conditions. The algorithm
updates the parameters of λ iteratively until convergence, following the procedure below:
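The forward procedure:
We define:
$$\alpha_i(t) = P(O_1 = o_1, \ldots, O_t = o_t, Q_t = S_i \mid \lambda)$$   E2.8
which is the probability of seeing the partial sequence $o_1, \ldots, o_t$ and ending up
in state $S_i$ at time t.
We can efficiently calculate $\alpha_i(t)$ recursively as:
1. $$\alpha_i(1) = \pi_i \, b_i(o_1)$$   E2.9
2. $$\alpha_j(t+1) = b_j(o_{t+1}) \sum_{i=1}^{N} \alpha_i(t) \, a_{ij}$$   E2.10
The backward procedure: We define
$\beta_i(t) = P(O_{t+1} = o_{t+1}, \ldots, O_T = o_T \mid Q_t = S_i, \lambda)$, which is
the probability of the ending partial sequence $o_{t+1}, \ldots, o_T$ given that we
started at state i, at time t. We can efficiently calculate $\beta_i(t)$ as:
1. $$\beta_i(T) = 1$$   E2.11
2. $$\beta_i(t) = \sum_{j=1}^{N} \beta_j(t+1) \, a_{ij} \, b_j(o_{t+1})$$   E2.12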
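Using α and β, we can calculate the following variables:
$$\gamma_i(t) = P(Q_t = S_i \mid O, \lambda) = \frac{\alpha_i(t)\,\beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\,\beta_j(t)}$$   E2.13
$$\xi_{ij}(t) = P(Q_t = S_i, Q_{t+1} = S_j \mid O, \lambda) = \frac{\alpha_i(t)\, a_{ij}\, \beta_j(t+1)\, b_j(o_{t+1})}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i(t)\, a_{ij}\, \beta_j(t+1)\, b_j(o_{t+1})}$$   E2.14
Having γ and ξ, we can define the update rules as follows:
$$\pi'_i = \gamma_i(1)$$   E2.15
$$a'_{ij} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)}$$   E2.16
$$b'_i(k) = \frac{\sum_{t=1}^{T} \delta_{o_t, o_k}\, \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}$$   E2.17
(note that $\delta_{o_t, o_k}$ is 1 when the symbol observed at time t equals $o_k$ and 0
otherwise, so the summation in the numerator of $b'_i(k)$ is only over observed symbols
equal to $o_k$).
Using the updated values of A, B and π, a new iteration is performed until convergence.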
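The update rules above can be sketched in a few lines of plain Python. This is a minimal illustration for a single short observation sequence without numerical scaling, so it is not suitable for genome-length input; the function names and the toy two-state model are assumptions made for the example and are not taken from the thesis program.

```python
def forward(A, B, pi, obs):
    """alpha[t][i]: probability of obs[0..t] and being in state i at t."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([B[j][obs[t]] * sum(prev[i] * A[i][j] for i in range(N))
                      for j in range(N)])
    return alpha

def backward(A, B, obs):
    """beta[t][i]: probability of obs[t+1..] given state i at t."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                       for j in range(N)) for i in range(N)]
    return beta

def baum_welch_step(A, B, pi, obs):
    """One re-estimation step; returns updated (A, B, pi) and P(O | lambda)
    under the parameters passed in."""
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    lik = sum(alpha[T - 1])  # equals sum_j alpha_j(t) * beta_j(t) for any t
    gamma = [[alpha[t][i] * beta[t][i] / lik for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / lik
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    # numerator sums gamma only over times where symbol k was observed
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(M)] for i in range(N)]
    return new_A, new_B, new_pi, lik

# Toy model: 2 states, 2 symbols (symbols are 0-based indices here).
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [[0.9, 0.1], [0.2, 0.8]]
pi0 = [0.5, 0.5]
obs = [0, 0, 1, 0, 1, 1]
A1, B1, pi1, lik0 = baum_welch_step(A0, B0, pi0, obs)
_, _, _, lik1 = baum_welch_step(A1, B1, pi1, obs)
```

Each step cannot decrease the likelihood, so `lik1` is at least `lik0`. A production implementation would rescale α and β (or work in log space) to avoid underflow on long sequences.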
2.3 Work Flow
Fig2.1 A Broad overview of the HMM work-flow, highlighting the most significant
inputs, transformations, and outputs at each step from start to end.
Detailed description of the work flow:
Data Sources and Data Processing (Will be discussed in the next chapter)
HMM training: For the HMM training, we first randomly initialized the parameters: the
initial state distribution (π), the initial transition matrix (A), and the initial
emission matrix (B).
Then we trained the model with different numbers of states, from 11 to 26, using the
Baum-Welch algorithm. During the training, we also applied some control to make the
training process better behaved. For example, in each iteration we calculate the values
of A and B with
  E2.17
and
  E2.18
In this way, we smooth the transition and emission matrices. In our program, based on
the number of states and the number of possible observations in each state, we set the
smoothing constant α to 0.000001 (this α is distinct from the forward variable of
Section 2.2).
For the termination conditions, we use two thresholds:
The first one is delta, the difference between the previous LLR (Log Likelihood
Ratio) and the current LLR. In every iteration, we calculate the LLR for the current
iteration and compare it with the previous LLR to get the value of delta. If delta is
less than or equal to a threshold (here 0.001), we consider the training sufficient and
stop.
The other is the number of iterations. We count the iterations throughout the
training process; while the count is below the threshold (here 300), we continue
training, otherwise we abort.
At the end of every iteration, we check both thresholds; if either the delta or the
iteration count reaches its threshold, we stop the training process and output the
training results.
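The two stopping rules can be sketched as a small driver loop. This is an illustrative skeleton only: `step` stands for any hypothetical callable that runs one Baum-Welch iteration and returns the current LLR, and the names `train`, `max_iter` and `tol` are assumptions for this example.

```python
def train(step, max_iter=300, tol=1e-3):
    """Call step() until the LLR changes by at most tol between two
    consecutive iterations, or until max_iter iterations have run.
    Returns (iterations used, last LLR)."""
    prev = None
    for n in range(1, max_iter + 1):
        cur = step()
        if prev is not None and abs(cur - prev) <= tol:
            return n, cur          # delta threshold reached
        prev = cur
    return max_iter, prev          # iteration threshold reached

# Fake LLR trace standing in for real training iterations.
trace = iter([-10.0, -5.0, -4.0, -3.9995, -3.9994])
iterations, final_llr = train(lambda: next(trace))
```

With the fake trace, the loop stops at the fourth iteration, where the change first drops to 0.0005.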
We trained our HMM using different numbers of states, from 11 to 26, and then
computed the Bayesian information criterion (BIC) value for each model. The BIC is an
asymptotic result derived under the assumption that the data distribution is in the
exponential family.
The formula for the BIC is:
$$-2 \ln p(x \mid k) \approx \mathrm{BIC} = -2 \ln L + k \ln(n)$$   E2.19
where
x is the observed data;
n is the number of data points in x, the number of observations, or equivalently, the
sample size;
k is the number of free parameters to be estimated. If the estimated model is a linear
regression, k is the number of regressors, including the intercept;
p(x|k) is the probability of the observed data given the number of parameters, or the
likelihood of the parameters given the dataset;
and L is the maximized value of the likelihood function for the estimated model.
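As a worked check of the BIC formula, the sketch below evaluates it for the 11-state model using the log-likelihood, k and n reported in Table 4.1. The function name `bic` is an assumption for this example; the inputs are the table values, and ln denotes the natural logarithm.

```python
import math

def bic(log_likelihood, k, n):
    """BIC = -2 ln L + k ln(n), with ln L passed in directly."""
    return -2.0 * log_likelihood + k * math.log(n)

# 11-state model from Table 4.1: ln L = -4,257,409; k = 45,165; n = 3,070,531
bic_11 = bic(-4257409, 45165, 3070531)
```

The result agrees with the tabulated BIC of 9,189,464 to within rounding, which suggests the "L (log ratio)" column holds natural-log likelihoods.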
Chapter 3: Data Processing
3.1 Data Sets:
We used 36 cell lines as our training datasets: 33 breast cancer cell lines plus the H1,
HCT116 and CD4+ T cell lines.
a) 33 Breast Cancer Cell Lines:
Breast cell lines were procured through the Integrative Cancer Biology Program
(ICBP) of the National Cancer Institute (Neve et al., 2006) [27].
b) H1 cell line:
The H1 MBD-seq data used in this thesis is from R. Alan Harris et al.
(Nature Biotechnology, 2010) [28].
c) HCT116:
The HCT116 MBD-seq data used in this thesis is from David Serre et al.
(Nucleic Acids Research, 2010) [4].
d) CD4+ T cell:
The CD4+ T cell MBD-seq data used in this thesis is from Jung K. Choi et al.
(Genome Biology, 2009) [29].
3.2 MBD-seq Protocol
Genomic DNA was isolated with the QIAamp DNA Mini Kit (Qiagen) following the
manufacturer's protocol. Genomic DNA of breast cell lines was procured through the
Integrative Cancer Biology Program (ICBP) of the National Cancer Institute.
MBDCap-seq, mapping and normalization. Methylated DNA was eluted using the
MethylMiner Methylated DNA Enrichment Kit (Invitrogen) according to the
manufacturer's instructions. Briefly, one microgram of genomic DNA was sonicated and
captured by MBD proteins. The methylated DNA was eluted in 1 M salt buffer. DNA in
each eluted fraction was precipitated with glycogen, sodium acetate and ethanol, and was
resuspended in TE buffer. Eluted DNA was used to generate libraries for sequencing
following the standard protocols from Illumina. MBDCap-seq libraries were sequenced
using the Illumina Genome Analyzer II as per the manufacturer's instructions. Image
analysis and base calling were performed with the standard Illumina pipeline.
Sequencing reads were mapped with the ELAND algorithm. Unique reads were reads of up
to 36 base pairs mapped to the human reference genome (hg18) with up to two
mismatches. Reads in satellite regions were excluded due to the large number of
amplifications.
3.3 Data Preprocess
The original DNA methylation data for the 33 breast cancer cell lines are in the export
format described in Appendix C, the H1 data are in BAM format (Appendix A), and the
HCT116 and CD4+ T cell data are in FASTQ format (Appendix E). For the export files
(33 breast cancer cell lines), we first divide the reads into three groups:
Group 1: uniquely matched reads (exactly one sequence on the genome matches the read)
Group 2: multiple-/non-matched reads (multiple sequences or no sequence match the read)
Group 3: QC reads (Quality Control; the read itself fails some quality requirements)
Then for the reads in group 2, we use a tool (Lonut) to get the most probable matched
reads and add them to the reads from group 1 to get the reads that we work with
(Total Reads After Processing).
For the BAM files (H1 cell line), we first use the tool samtools to convert them from
BAM format to SAM format, since a BAM file is a binary file. Since there are no QC reads
in a SAM file, we divide the reads into only two groups, group 1 and group 2, defined
as for the export files.
For the FASTQ files (HCT116 and CD4+ T cell), we first use the software Bowtie to map
the sequences onto the genome and then follow the same steps as for the 33 breast
cancer cell lines.
The outputs of all the processes above are in BED format (Appendix D).
Table 3.1 Data summary for 36 cell lines

| Cell line ID | Raw Reads | Unique Matched Reads | Not Unique Matched Reads | Total Reads After Process |
|---|---|---|---|---|
| BrCa-02(AU565) | 38,389,113 | 21,757,417 | 11,268,129 | 33,025,546 |
| BrCa-03(BT549) | 38,607,423 | 24,343,702 | 9,151,298 | 33,495,000 |
| BrCa-06(HCC1569) | 33,243,637 | 17,790,745 | 11,032,912 | 28,823,657 |
| BrCa-07(HCC1937) | 32,664,695 | 17,761,936 | 10,746,815 | 28,508,751 |
| BrCa-08(HCC2185) | 40,922,132 | 22,424,765 | 11,505,148 | 33,929,913 |
| BrCa-09(HCC70) | 42,112,586 | 24,613,958 | 11,832,051 | 36,446,009 |
| BrCa-10(LY2) | 38,858,773 | 23,020,926 | 11,294,571 | 34,315,497 |
| BrCa-11(MCF-7) | 43,128,546 | 24,876,183 | 12,608,935 | 37,485,118 |
| BrCa-12(MDAMB-231) | 36,495,183 | 22,767,185 | 8,963,014 | 31,730,199 |
| BrCa-14(MDAMB-468D) | 46,932,495 | 25,467,786 | 14,656,101 | 40,123,887 |
| BrCa-15(SUM149PT) | 36,129,334 | 21,592,142 | 10,491,546 | 32,083,688 |
| BrCa-16(SUM225CWN) | 27,600,744 | 17,390,015 | 7,502,375 | 24,892,390 |
| BrCa-20(BT20) | 38,329,851 | 21,775,872 | 11,679,619 | 33,455,491 |
| BrCa-25(HCC1954) | 38,223,154 | 21,961,680 | 10,936,018 | 32,897,698 |
| BrCa-28(MCF10A) | 47,587,907 | 27,946,727 | 12,391,220 | 40,337,947 |
| BrCa-32(SKBR3) | 41,094,509 | 24,365,279 | 10,698,249 | 35,063,528 |
| BrCa-33(SUM159PT) | 43,752,391 | 25,433,158 | 11,726,940 | 37,160,098 |
| BrCa-38(BT474) | 46,247,881 | 27,613,327 | 11,417,755 | 39,031,082 |
| BrCa-40(HCC1143) | 34,178,168 | 20,315,316 | 9,549,005 | 29,864,321 |
| BrCa-41(HCC1428) | 33,877,849 | 21,196,882 | 8,922,003 | 30,118,885 |
| BrCa-43(HCC202) | 34,308,392 | 20,183,614 | 9,788,109 | 29,971,723 |
| BrCa-44(HCC3153) | 30,572,196 | 16,910,888 | 9,429,028 | 26,339,916 |
| BrCa-49(MDAMB436) | 37,982,516 | 22,680,715 | 10,033,312 | 32,714,027 |
| BrCa-51(SUM185PE) | 28,071,302 | 16,021,760 | 7,372,317 | 23,394,077 |
| BrCa-55(600MPE) | 29,989,492 | 15,772,575 | 8,997,509 | 24,770,084 |
| BrCa-59(HCC1500) | 33,512,194 | 21,689,325 | 7,297,771 | 28,987,096 |
| BrCa-63(HS578T) | 29,774,913 | 19,954,550 | 6,690,775 | 26,645,325 |
| BrCa-64(MCF12A) | 39,816,671 | 19,667,828 | 14,407,417 | 34,075,245 |
| BrCa-65(MDAMB175VII) | 36,527,805 | 19,117,034 | 11,991,384 | 31,108,418 |
| BrCa-67(MDAMB453) | 34,634,234 | 18,920,039 | 11,290,635 | 30,210,674 |
| BrCa-68(SUM1315MO2) | 34,112,917 | 20,891,597 | 9,163,182 | 30,054,779 |
| BrCa-70(SUM52PE) | 32,326,562 | 19,320,660 | 9,731,807 | 29,052,467 |
| T47D | 43,275,135 | 26,119,748 | 10,784,904 | 36,904,652 |
| HCT116 | 19,041,613 | 4,906,885 | 3,615,483 | 8,522,368 |
| Tcell | 71,725,143 | 21,645,848 | 18,120,978 | 39,766,826 |
| H1 | 59,618,003 | 30,139,685 | 174,794 | 30,314,479 |
Fig 3.1 Bar figure for 36 cell lines
3.4 Input for HMM
First we divide the whole genome into 1000-base-pair non-overlapping intervals,
within which we independently make a call as to whether each of the 36 marks (one per
cell line) is detected as present or not, based on the count of tags mapping to the
interval. Each tag was uniquely assigned to one interval based on the location of the
5' end of the tag after applying a shift of 500 bases in the 5' to 3' direction of the
tag (mid-point). The threshold, t, for each mark was based on the total number of mapped
reads for the mark, and was set to the smallest integer t such that
$P(X > t) < 10^{-4}$, where X is a random variable with a Poisson distribution whose
mean is set to the empirical mean of the number of tags per interval. In each cell
line, if a mark is detected in an interval, we assign '1' to this interval; otherwise,
we assign '0' to it.
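The threshold rule can be illustrated with a short sketch that accumulates the Poisson distribution until the upper tail falls below the cutoff. The function name `poisson_threshold` and the example mean of 1.0 tag per interval are assumptions for illustration; the thesis does not report the empirical means.

```python
import math

def poisson_threshold(mean, p=1e-4):
    """Smallest integer t with P(X > t) < p for X ~ Poisson(mean).
    Intended for moderate means; exp(-mean) underflows for very large ones."""
    pmf = math.exp(-mean)   # P(X = 0)
    cdf = pmf               # P(X <= 0)
    t = 0
    while 1.0 - cdf >= p:
        t += 1
        pmf *= mean / t     # P(X = t) from P(X = t - 1)
        cdf += pmf
    return t

# Hypothetical cell line averaging one tag per 1000 bp interval.
t_example = poisson_threshold(1.0)
```

For a mean of 1.0 the tail probability first drops below 10^-4 at t = 6. How the tag count is compared with t when making the present call is not spelled out in the text, so that step is left out of the sketch.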
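We also divide the 36 cell lines into 12 groups based on several factors
(e.g. gene cluster, ER+/-, PR+/- and HER2 expression) as follows:
Table 3.2 12 groups for 36 cell lines

| Cell lines | Gene Cluster | ER | PR | HER2 | group (1 for +, 2 for -) | Groups |
|---|---|---|---|---|---|---|
| BT474 | Lu | + | + | 11.9994 | 111 | 1 |
| MCF10A | BaB | - | - | 6.837 | 222 | 2 |
| MCF12A | BaB | - | - | 7.226 | 222 | 2 |
| HCC1428 | Lu | + | + | 7.6065 | 112 | 3 |
| MCF7 | Lu | + | + | 8.4522 | 112 | 3 |
| T47D | Lu | + | + | 8.2666 | 112 | 3 |
| 600MPE | Lu | + | - | 9.2756 | 122 | 4 |
| LY2 | Lu | + | - | 6.9903 | 122 | 4 |
| MDAMB175 | Lu | + | - | 9.2384 | 122 | 4 |
| SUM52PE | Lu | + | - | 7.6287 | 122 | 4 |
| AU565 | Lu | - | - | 12.1189 | 221 | 5 |
| HCC202 | Lu | - | - | 12.1056 | 221 | 5 |
| SKBR3 | Lu | - | - | 11.5751 | 221 | 5 |
| HCC1569 | BaA | - | - | 11.7554 | 221 | 6 |
| HCC1954 | BaA | - | - | 11.5082 | 221 | 6 |
| SUM225CWN | BaA | - | - | 12.9908 | 221 | 6 |
| HCC2185 | Lu | - | - | 9.3429 | 222 | 7 |
| MDAMB453 | Lu | - | - | 10.172 | 222 | 7 |
| SUM185PE | Lu | - | - | 8.3417 | 222 | 7 |
| BT20 | BaA | - | - | 7.7677 | 222 | 8 |
| HCC1143 | BaA | - | - | 8.7032 | 222 | 8 |
| HCC1937 | BaA | - | - | 7.7687 | 222 | 8 |
| HCC3153 | BaA | - | - | 8.9164 | 222 | 8 |
| HCC70 | BaA | - | - | 7.9334 | 222 | 8 |
| MDAMB468 | BaA | - | - | 7.0596 | 222 | 8 |
| BT549 | BaB | - | - | 7.0328 | 222 | 9 |
| HCC1500 | BaB | - | - | 7.2479 | 222 | 9 |
| HS578T | BaB | - | - | 6.6301 | 222 | 9 |
| MDAMB231 | BaB | - | - | 6.5633 | 222 | 9 |
| MDAMB436 | BaB | - | - | 6.9034 | 222 | 9 |
| SUM1315 | BaB | - | - | 6.9018 | 222 | 9 |
| SUM149PT | BaB | - | - | 6.5676 | 222 | 9 |
| SUM159PT | BaB | - | - | 7.3181 | 222 | 9 |
| H1 | | | | | | 10 |
| HCT116 | | | | | | 11 |
| Tcell | | | | | | 12 |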
Then we need to decide whether an interval is hyper-methylated in each group.
For every group, we count the number of marks in each interval. If at least one mark is
present in an interval for a group, we say this group is hyper-methylated in this
interval. Again, for each group, if an interval is hyper-methylated, we assign this
interval '1', otherwise '0'.
Then for the input of the HMM, we combine the 12 groups as follows:
We denote the observation value for interval t as vt (t=1, …, T), and calculate the
value of vt in the following steps:
1. Initialize vt: set vt = 0.
2. Put the 12 groups in increasing order, from group 1 to group 12.
3. Going through the 12 groups in that order, for each group, if the corresponding
interval is 1, then vt = 2*vt + 1; otherwise, vt = 2*vt.
4. After going through the 12 groups, add 1 to vt, that is vt = vt + 1, so that vt
takes a value from 1 to 4096.
So based on the encoding mechanism, we can decode the value of an interval.
We simply write the number in binary form:
$$v_t - 1 = \sum_{i=1}^{12} a_i \, 2^{12-i}$$   E3.1
where $a_i \in \{0,1\}$ and $i \in \{1,\ldots,12\}$.
Then if $a_i = 1$, we know that group i is 1 (hyper-methylated) in this interval.
Thus, for vt = 1, we know that none of the 12 groups is 1, since 1 - 1 = 0.
And for vt = 169, we know this interval is a combination of 3 groups (9, 7, 5) with 1
(hyper-methylated), since
$169 - 1 = 128 + 32 + 8 = 2^{12-5} + 2^{12-7} + 2^{12-9}$.
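The encoding steps and the decoding rule can be sketched as a pair of small functions. The names `encode` and `decode` are assumptions for this example; the arithmetic follows the four steps and the binary expansion described in the text, with group 1 as the most significant bit.

```python
N_GROUPS = 12

def encode(calls):
    """calls: list of 0/1 hyper-methylation calls for groups 1..12, in order.
    Returns the observation value v_t in 1..4096."""
    v = 0
    for bit in calls:
        v = 2 * v + bit     # group 1 ends up as the most significant bit
    return v + 1

def decode(v, n_groups=N_GROUPS):
    """Return the list of group numbers that are hyper-methylated in v."""
    code = v - 1
    return [i for i in range(1, n_groups + 1)
            if (code >> (n_groups - i)) & 1]   # bit weight 2^(12 - i)

# Interval in which only groups 5, 7 and 9 are hyper-methylated.
calls = [1 if g in (5, 7, 9) else 0 for g in range(1, N_GROUPS + 1)]
v_example = encode(calls)
groups_example = decode(169)
```

Here `v_example` is 169 and `decode(169)` recovers groups 5, 7 and 9, matching the worked example in the text.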
3.5 Methylation Distribution Overview
We also need an overview of the DNA methylation distribution, so we modified a web
tool developed by Brian to visualize the data.
To find the meaning of the states (the training result), we also have to relate our
result to the source data. We first remap our intervals to the genome (hg18) to get
their positions on the genome. Then we correlate these remapped reads with genes (hg18).
In this way, we can divide the intervals into different regions based on their distances
to the gene 5' end. Since it is not necessary to include the gene desert regions, we
also filter out the regions that are farther than 100k from the gene body. Besides,
since different genes have gene bodies of different lengths, we artificially choose a
certain distance (2kbp) away from both the 5' and 3' ends in the gene body, and all the
intervals have a length of 2k base pairs.
Fig 3.2 Methylation distribution for 33 breast cancer cell lines
3.6 Gene Expression Data
Increasing evidence is revealing a role of methylation in the interaction of
environmental factors with genetic expression. Differences in maternal care during the
first 6 days of life in the rat induce differential methylation patterns in some promoter
regions, thus influencing gene expression [30].
We also correlated the results with the gene expression data (Richard M. Neve et
al., 2006) [27]. The gene expression data files are in *.cel format, so we used an R
package to transform them into readable files. We then merged them into a single file
and performed one-way hierarchical clustering (clustering the genes) for the 33 breast
cancer cell lines. To make the meanings of the states easier to find, we ordered the
cell lines by subgroup. In this way, we can see directly which cluster of genes is
correlated with a particular subgroup.
[Plot, apparently for Fig 3.2: x-axis, 2kb bins from 5TSS-50 to 3TSS+47; y-axis, 0 to 2.5; legend entries 1 to 9]
Chapter 4: Results and Discussion
4.1 Results from HMM
We trained our HMM using different numbers of states, from 11 to 26, and obtained the
following BIC values:
Table 4.1 BIC results for HMM results

| # states | L (log ratio) | k | n | BIC |
|---|---|---|---|---|
| 11 | -4,257,409 | 45,165 | 3,070,531 | 9,189,464 |
| 12 | -4,256,362 | 49,283 | 3,070,531 | 9,248,882 |
| 13 | -4,254,518 | 53,403 | 3,070,531 | 9,306,736 |
| 14 | -4,265,865 | 57,525 | 3,070,531 | 9,391,002 |
| 15 | -4,212,430 | 61,649 | 3,070,531 | 9,345,733 |
| 16 | -4,195,762 | 65,775 | 3,070,531 | 9,374,029 |
| 17 | -4,223,164 | 69,903 | 3,070,531 | 9,490,494 |
| 18 | -4,247,955 | 74,033 | 3,070,531 | 9,601,768 |
| 19 | -4,162,556 | 78,165 | 3,070,531 | 9,492,691 |
| 20 | -4,188,262 | 82,299 | 3,070,531 | 9,605,854 |
| 21 | -4,189,774 | 86,435 | 3,070,531 | 9,670,659 |
| 22 | -4,158,646 | 90,573 | 3,070,531 | 9,670,214 |
| 23 | -4,161,580 | 94,713 | 3,070,531 | 9,737,922 |
| 24 | -4,167,498 | 98,855 | 3,070,531 | 9,811,629 |
| 25 | -4,151,067 | 102,999 | 3,070,531 | 9,840,667 |
| 26 | -4,145,348 | 107,145 | 3,070,531 | 9,891,160 |
So based on the BIC we decide to choose the model with 11 states.
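The selection can be reproduced from the values in Table 4.1. In the sketch below, the log-likelihoods and n are copied from the table; the parameter count N(M-1) + N(N-1) + (N-1) is an observation that happens to reproduce every k in the table rather than a formula stated in the thesis, and the names `free_params` and `bic` are assumptions for this example.

```python
import math

M = 4096          # number of observation symbols (2^12)
N_OBS = 3070531   # n from Table 4.1

# Log-likelihood per number of states, copied from Table 4.1.
LOG_LIK = {
    11: -4257409, 12: -4256362, 13: -4254518, 14: -4265865,
    15: -4212430, 16: -4195762, 17: -4223164, 18: -4247955,
    19: -4162556, 20: -4188262, 21: -4189774, 22: -4158646,
    23: -4161580, 24: -4167498, 25: -4151067, 26: -4145348,
}

def free_params(n_states, n_symbols=M):
    """Emission rows, transition rows and the initial distribution each
    lose one degree of freedom because they sum to 1."""
    return (n_states * (n_symbols - 1)
            + n_states * (n_states - 1)
            + (n_states - 1))

def bic(log_likelihood, k, n):
    return -2.0 * log_likelihood + k * math.log(n)

scores = {s: bic(ll, free_params(s), N_OBS) for s, ll in LOG_LIK.items()}
best = min(scores, key=scores.get)   # model with the lowest BIC
```

The minimum falls at 11 states. Although larger models reach a higher likelihood, the penalty term k ln(n) grows by roughly 4,100 parameters per added state and outweighs the gain.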
The training results for the model with 11 states (transition and emission
matrices) follow.
The transition matrix is 11x11 since there are 11 states in total; it gives the
probabilities of moving from each state to every possible state.
The result is as follows:
Table 4.2 Transition Matrix
Here the rows are the states a transition starts from and the columns are the states
it transits to; each cell in the main table is the probability of a transition from the
from-state (the row number) to the to-state (the column number).
The corresponding heatmap:
Fig 4.1 Heatmap for transition matrix
From the transition matrix, we can see that most of the states have very high
probabilities of transiting to themselves, except states 1, 4 and 10. Biologically this
is reasonable: across the whole genome, if the current interval is methylated (or not
methylated), the next interval is very likely to be methylated (or not methylated) as
well. Some of the states are more likely to transit to other states; they probably occur
mostly in intervals whose next interval differs from them (from a methylated interval to
a non-methylated interval, or from a non-methylated interval to a methylated interval).
States 1, 4 and 10 do not individually have very high probabilities of transiting to
themselves, but when treated as a group, the group still has a very high probability of
transiting to itself. So we may treat them as a group in future analysis.
The emission matrix is 11 x 4096, which is too large to present here in full. We
also do not need the detailed observation probabilities of every combination of marks;
what we need is the probability that each group occurs in each state, so we add up the
probabilities of all observations that contain a particular group.
The results are as follows:
Table 4.3 Emission probabilities for each mark in each state
Here the rows are the states and the columns are the marks; each cell in the
main table is the probability that a mark (the column number) is present in the
corresponding state (the row number).
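The summation over observations that contain a particular group can be sketched as follows. The function name `group_marginals` is an assumption for this example, and the bit layout assumes the encoding of Section 3.4, where the emission entry at 0-based position v - 1 corresponds to observation value v and group 1 is the most significant bit.

```python
def group_marginals(emission_row, n_groups=12):
    """emission_row[code] is the emission probability of observation
    value v = code + 1 in one state. Returns, for each group 1..n_groups,
    the total probability of observations in which that group is 1."""
    if len(emission_row) != 2 ** n_groups:
        raise ValueError("row length must be 2 ** n_groups")
    marginals = [0.0] * n_groups
    for code, p in enumerate(emission_row):
        for i in range(n_groups):
            # group i + 1 has bit weight 2^(n_groups - 1 - i)
            if (code >> (n_groups - 1 - i)) & 1:
                marginals[i] += p
    return marginals

# Toy state with 2 groups: codes 00, 01, 10, 11 in that order.
toy = group_marginals([0.1, 0.2, 0.3, 0.4], n_groups=2)
```

In the toy case, group 1 is present in codes 10 and 11 (0.3 + 0.4) and group 2 in codes 01 and 11 (0.2 + 0.4). Applying the same function to each of the 11 rows would give an 11 x 12 table of the kind described for Table 4.3.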
To get a clearer view of the emission probabilities, we ordered the probabilities
within each state. The marks in decreasing order and the corresponding probabilities
are as follows:
Table 4.4 Ordered emission probabilities for each mark in each state - marks
Table 4.5 Ordered emission probabilities for each mark in each state - probabilities
Furthermore, among all the intervals on the whole genome there are many
non-methylated intervals, which may even make up a large part of the genome. We do not
consider these here, since we want to focus on the methylated intervals. So we apply a
threshold of 0.08 to the observation probability of each state, and get 8 states above
the threshold and 3 states under it.
Table 4.6 Filtered ordered emission probabilities for each mark in each state - marks
From the table above, we can see that the 3 states under the threshold are exactly
the ones that do not have very high probabilities of transiting to themselves. These 3
states are also quite similar to each other: they are all dominated by 3 marks
(7, 9, 10), followed by the other marks in very similar orders. Some states are
dominated by the non-breast-cancer cell lines (e.g. state 7, which is dominated by
groups 10 and 12, the H1 stem cell and CD4+ T cell lines).
4.2 Biology Meanings
After we get the states and their features, we have to assign them biological
meanings by correlating them with other data.
We correlated the results above with the gene expression data for the 33 breast
cancer cell lines, since increasing evidence is revealing a role of methylation in the
interaction of environmental factors with genetic expression.
We performed one-way hierarchical clustering on the 12113 genes obtained for the 33
breast cancer cell lines (Richard M. Neve et al., 2006) [27], as described in the data
processing chapter.
4.2.1 Gene Expression results for 33 breast cancer cell lines
Fig 4.2 33 Breast Cancer Cell Gene Expression One-Way Hierarchy Clustering
We further applied a threshold to the clustering results and obtained the following:
Fig 4.3 Grouped 33 Breast Cancer Cell Gene Expression One-Way Hierarchy Clustering
4.2.2 Results Based on Different Clusters
From the results above, we can see that the genes were divided into 9 clusters,
annotated in 9 different colors. From top to bottom, we denote them cluster 1 to
cluster 9. The figure above shows some very interesting features, but we discuss them
together with the methylation distribution in these clusters. Based on the clustering
results, we get the genes in each cluster.
Table 4.7 Number of genes in each cluster

| clusters | # genes |
|---|---|
| 1 | 1,362 |
| 2 | 860 |
| 3 | 1,098 |
| 4 | 1,274 |
| 5 | 800 |
| 6 | 4,010 |
| 7 | 1,211 |
| 8 | 866 |
| 9 | 632 |
Then for each group, we related the methylated intervals to the genome based
on their distances to the nearest gene in each cluster, as follows:
We first remap each methylated interval onto the genome as the distance from its
midpoint to the 5'TSS or 3'TSS of its nearest gene. We do this for all the intervals
and call this process find-region. Then we count the number of reads located in each
2kb interval from -100kb to 4kb relative to the 5'TSS and from -4kb to 100kb relative
to the 3'TSS. After that, we plot the distribution of methylated intervals for each
group in each cluster and look for the relationship between gene expression and
methylation.
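The binning step of find-region can be sketched as below. This is only an illustration: the function name `bin_label`, the signed-distance convention (negative means upstream), and the rule that bin k covers distances up to 2k kb with no bin 0 are assumptions chosen to match labels such as 5TSS-36 and 3TSS+22 used in this chapter; the thesis does not specify the exact bin boundaries.

```python
import math
from collections import Counter

BIN_SIZE = 2000  # 2kb bins

# Distance windows (in bp) kept for each anchor, as described in the text.
WINDOWS = {"5TSS": (-100_000, 4_000), "3TSS": (-4_000, 100_000)}

def bin_label(distance, anchor):
    """Map a signed distance (bp) from the anchor to a label such as
    '5TSS-36'. Returns None when the distance is outside the window."""
    lo, hi = WINDOWS[anchor]
    if not lo <= distance <= hi:
        return None
    k = max(1, math.ceil(abs(distance) / BIN_SIZE))  # assumed: no bin 0
    sign = "-" if distance < 0 else "+"
    return f"{anchor}{sign}{k}"

# Hypothetical midpoint distances for a handful of intervals.
distances = [(-71_000, "5TSS"), (-70_500, "5TSS"), (43_500, "3TSS"),
             (-150_000, "5TSS")]
counts = Counter(label for d, a in distances
                 if (label := bin_label(d, a)) is not None)
```

The interval 150 kb upstream falls outside the window and is dropped, mirroring the filtering of gene desert regions.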
Fig 4.4 Methylation distribution based on cluster 1 genes
Here we can see that group 3 and group 6 are highly methylated in the gene body
region. From the clustering results, group 3 and group 6 also have high gene
expression, which makes sense since methylation in the gene body can up-regulate gene
expression. Group 8 and group 9 have low methylation in the 5'TSS promoter region
(5TSS-2), which also makes sense since methylation in the promoter region can repress
gene expression.
[Plot for Fig 4.4 (cluster1): x-axis, 2kb bins from 5TSS-50 to 3TSS+47; y-axis, 0 to 2.5; legend entries 1 to 9]
Fig 4.5 Methylation distribution based on cluster 2 genes
Here group 1 is highly methylated in the gene body region (5TSS+2) and has high
gene expression.
Fig 4.6 Methylation distribution based on cluster 3 genes
Here group 2 is highly methylated in the gene body region (3TSS+1) and has high
gene expression. Group 8 has low methylation in the 5'TSS promoter region (5TSS-4) and
has high gene expression.
[Plots for Fig 4.5 (cluster2) and Fig 4.6 (cluster3): x-axis, 2kb bins from 5TSS-50 to 3TSS+47; y-axis, 0 to 3; legend entries 1 to 9]
Fig 4.7 Methylation distribution based on cluster 4 genes
Here group 2 is highly methylated in the 5'TSS promoter region (5TSS-1) and has
low gene expression. Group 7 has low methylation in the 5'TSS promoter region (5TSS-2)
and has high gene expression.
Fig 4.8 Methylation distribution based on cluster 5 genes
[Plots for Fig 4.7 (cluster4) and Fig 4.8 (cluster5): x-axis, 2kb bins from 5TSS-50 to 3TSS+47; y-axis, 0 to 3; legend entries 1 to 9]
Here group 4 is highly methylated in the gene body region (5TSS+1) and has high
gene expression. Group 8 has low methylation in the 5'TSS promoter region (5TSS-4) and
has high gene expression.
Fig 4.9 Methylation distribution based on cluster 6 genes
Here group 2 is highly methylated in the 5'TSS promoter region (5TSS-1) and has
low gene expression.
[Plot for Fig 4.9 (cluster6): x-axis, 2kb bins from 5TSS-50 to 3TSS+47; y-axis, 0 to 3; legend entries 1 to 9]
Fig 4.10 Methylation distribution based on cluster 7 genes
Here group 3 and group 4 are highly methylated in the gene body region (5TSS+1 and
5TSS+2) and have high gene expression.
Fig 4.11 Methylation distribution based on cluster 8 genes
Here group 5 is highly methylated in the 5'TSS promoter region (5TSS-2) and has
low gene expression.
[Plots for Fig 4.10 (cluster7) and Fig 4.11 (cluster8): x-axis, 2kb bins from 5TSS-50 to 3TSS+47; y-axis, 0 to 3; legend entries 1 to 9]
Fig 4.12 Methylation distribution based on cluster 9 genes
Here group 2 and group 6 are highly methylated in the gene body region (3TSS-1 and
5TSS+1) and have high gene expression.
[Plot for Fig 4.12 (cluster9): x-axis, 2kb bins from 5TSS-50 to 3TSS+47; y-axis, 0 to 3.5; legend entries 1 to 9]
4.2.3 States Meanings and Group Patterns
From the figures above, we can see that in each cluster the distributions of
methylated intervals differ considerably between groups, but it is still not easy to
find the biological meanings of the states. We therefore ordered all the groups in each
2kbp interval in decreasing order, as we did for the emission matrix.
Then we correlate the emission matrix with the ordered groups in each cluster. We
assume a state is coherent with an interval if the first 3 groups are the same.
Table 4.8 First 3 marks for each state
For example, state 3 has the ordered features (excluding groups 10, 11 and 12)
5, 8, 6, 3, 9, 4, …; an interval is then said to be coherent with state 3 if its
ordered features start with 5, 8, 6, followed by the other groups.
Based on the assumption above, we correlate our states with the clustering results
and get the following results:
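The coherence rule can be written as a short predicate. The function name `is_coherent` and its parameters are assumptions for this example; the state 3 ordering is the one quoted above, and the interval orderings are rows of Table 4.9.

```python
def is_coherent(state_order, interval_order, top=3, excluded=(10, 11, 12)):
    """True if the first `top` groups of the state's ordered features,
    after dropping the non-breast-cancer groups, match the first `top`
    groups of the interval's ordering."""
    kept = [g for g in state_order if g not in excluded]
    return kept[:top] == list(interval_order)[:top]

state_3 = [5, 8, 6, 3, 9, 4]                      # leading features of state 3
interval_a = [5, 8, 6, 4, 1, 9, 2, 3, 7]          # cluster2, 5TSS-26
interval_b = [9, 8, 7, 2, 5, 6, 1, 3, 4]          # cluster5, 5TSS-28

match_a = is_coherent(state_3, interval_a)
match_b = is_coherent(state_3, interval_b)
```

Only the order of the first three groups matters, so `interval_a` is coherent with state 3 while `interval_b` is not.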
Table 4.9 States and interval correlation results

| clusters | states | intervals | features |
|---|---|---|---|
| cluster1 | State_1 | 5TSS-36 | 9 7 8 4 1 6 2 5 3 |
| cluster2 | State_3 | 5TSS-26 | 5 8 6 4 1 9 2 3 7 |
| cluster2 | State_3 | 3TSS+22 | 5 8 6 7 9 1 4 2 3 |
| cluster2 | State_5 | 5TSS-12 | 8 9 5 4 2 7 3 1 6 |
| cluster3 | State_2 | 3TSS+26 | 4 1 9 6 2 3 8 7 5 |
| cluster3 | State_11 | 3TSS+28 | 6 8 9 4 7 5 2 3 1 |
| cluster3 | State_11 | 3TSS+30 | 6 8 9 3 4 5 7 1 2 |
| cluster4 | State_1 | 5TSS-23 | 9 7 8 5 2 4 1 6 3 |
| cluster4 | State_1 | 3TSS+3 | 9 7 8 6 2 5 1 4 3 |
| cluster5 | State_1 | 5TSS-6 | 9 7 8 6 3 4 1 2 5 |
| cluster5 | State_1 | 3TSS+17 | 9 7 8 5 3 1 4 2 6 |
| cluster5 | State_2 | 3TSS+25 | 4 1 9 5 2 3 8 7 6 |
| cluster5 | State_5 | 3TSS+46 | 8 9 5 4 7 6 3 2 1 |
| cluster5 | State_8 | 5TSS-28 | 9 8 7 2 5 6 1 3 4 |
| cluster6 | State_6 | 5TSS-34 | 8 7 6 4 9 5 3 1 2 |
| cluster8 | State_11 | 3TSS+48 | 6 8 9 7 5 2 4 1 3 |
| cluster9 | State_5 | 3TSS+21 | 8 9 5 3 4 6 1 7 2 |
| cluster9 | State_8 | 5TSS-25 | 9 8 7 6 4 5 1 3 2 |
| cluster9 | State_8 | 3TSS+20 | 9 8 7 4 6 3 1 5 2 |
| cluster9 | State_9 | 3TSS+25 | 9 8 2 7 3 6 1 4 5 |
In cluster 1, we get:
State_1 5TSS-36 9 7 8 4 1 6 2 5 3
which means the 36th 2kbp interval upstream of the 5'TSS is coherent with state 1.
From the results above, we can see that states 4, 7 and 10 do not appear. This is
reasonable, since states 4 and 10 are among the 3 states that we filtered out and they
are dominated by group 10, which is not a breast cancer cell line; they all can be
classified as the . State 7 is also mostly dominated by non-breast-cancer groups
(group 10 and group 12). State 1 appears in clusters 1, 4 and 5; moreover, in clusters
1 and 4 it is the only state. For the other states, we can look for deeper biological
meanings.
Table 4.10 States meanings
states      meanings
1           regions that are unlikely to be methylated; if methylated, it is with high
            probability in breast cancer cell lines
2           regions that are methylated in breast cancer cell lines, at far 3' distal
            regions (50k downstream of 3'core with high gene expression, and 52k
            downstream of 3'core with low gene expression)
3           regions that are methylated in breast cancer cell lines, at far distal (52k
            upstream of 5'TSS and 44k downstream of 3'core) regions with high gene
            expression
4 and 10    regions that are unlikely to be methylated; if methylated, it is with high
            probability in non-breast cancer cell lines
5           regions that are methylated in breast cancer cell lines, at near 5' distal and
            far 3' distal (24k upstream of 5'TSS and 92k downstream of 3'core) regions
            with high gene expression
6           regions that are methylated in breast cancer cell lines, at far 5' distal (68k
            upstream of 5'TSS) regions with low gene expression
7           regions that are mainly methylated in non-breast cancer cell lines (H1 stem
            cell and CD4+ T cell)
8           regions that are methylated in breast cancer cell lines, at far 5' distal and
            near 3' distal (56k and 50k upstream of 5'TSS and 40k downstream of 3'core)
            regions with high gene expression
9           regions that are methylated in breast cancer cell lines, at far 3' distal (50k
            downstream of 3'core) regions with low gene expression
11          regions that are methylated in breast cancer cell lines, at far 3' distal (56k,
            60k and 96k downstream of 3'TSS) regions with high gene expression
Near distal regions are the regions 10k-40k upstream/downstream of the 5'TSS/3'TSS
Far distal regions are the regions 41k-100k upstream/downstream of the 5'TSS/3'TSS
Proximal regions are the regions 4k-10k upstream/downstream of the 5'TSS/3'TSS
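The distances quoted in Table 4.10 follow from the interval labels of Table 4.9 if each label index is multiplied by the 2 kbp interval width (for example, 5TSS-26 gives 52k upstream of the 5'TSS). A minimal sketch of that conversion and of the region classes defined above is given below. The function name describe_interval is hypothetical, and treating exactly 10k as proximal is an assumption, since the stated ranges share that boundary.

```python
import re

INTERVAL_BP = 2000  # each interval is 2 kbp wide, as stated for Table 4.9


def describe_interval(label):
    """Turn a label such as '5TSS-36' into (anchor, direction, distance_bp, region)."""
    match = re.fullmatch(r"([35])TSS([+-])(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognized interval label: {label!r}")
    end, sign, index = match.group(1), match.group(2), int(match.group(3))
    distance = index * INTERVAL_BP
    direction = "upstream" if sign == "-" else "downstream"
    # Region classes from the notes under Table 4.10.
    # Assumption: a distance of exactly 10k is counted as proximal.
    if 4000 <= distance <= 10000:
        region = "proximal"
    elif 10000 < distance <= 40000:
        region = "near distal"
    elif 40000 < distance <= 100000:
        region = "far distal"
    else:
        region = "unclassified"
    return (f"{end}'TSS", direction, distance, region)


for label in ("5TSS-36", "5TSS-12", "3TSS+3", "3TSS+20"):
    print(label, describe_interval(label))
```

Because the interval indices are whole multiples of 2 kbp, no label falls in the gap between 40k and 41k left by the stated ranges.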
Then, based on the results above and their correlation with Table 4.6 and Figure 4.2, we
can derive the DNA methylation patterns for the subtypes of breast cancer as follows:
Table 4.11 Patterns for subtypes of Breast cancers
subtypes    patterns
Group 1     Lowly methylated at far 3' distal (56k and 92k downstream of 3'TSS) regions
            with high gene expression
Group 2     Lowly methylated at far 3' distal (40k, 42k, 60k and 96k downstream of
            3'TSS) regions with low gene expression, and at far 5' distal regions (68k
            upstream of 5'TSS) with high or low gene expression
Group 3     Lowly methylated at far 5' distal (46k and 72k upstream of 5'TSS), far 3'
            distal (44k, 96k downstream of 3'TSS) and 3' proximal (6k downstream of
            3'TSS) regions with high gene expression
Group 4     Highly methylated at far 3' distal (50k, 52k downstream of 3'TSS) regions
            with low gene expression
Group 5     Highly methylated at far distal (52k upstream of 5'TSS and 44k downstream
            of 3'core) regions with high gene expression
Group 6     Highly methylated at far 3' distal (56k, 60k, 96k downstream of 3'TSS)
            regions with high gene expression
Group 7     Lowly methylated at far 5' distal (52k upstream of 5'TSS) regions with high
            gene expression
Group 8     Highly methylated at far 3' distal (42k and 96k downstream of 3'TSS) and near
            5' distal (24k upstream of 5'TSS) regions with high gene expression, but at
            far 5' distal (68k upstream of 5'TSS) regions with low gene expression
Group 9     Highly methylated at 5' distal (46k, 50k, 56k and 72k upstream of 5'TSS), 3'
            distal (34k, 40k, and 50k downstream of 3'TSS) and 3' proximal (6k
            downstream of 3'TSS) regions with high gene expression
Chapter 5: Data Visualization
We also modified an existing database web tool to visualize the methylation data
in our 36 cell lines and give an intuitive view of the data. In the web tool, we
grouped the 36 cell lines into 9 groups, which makes it convenient to compare
the data in different groups. The details are as follows.
Step1: We divide the genome of each cell line into 100bp intervals and then count the
number of methylation reads that fall into each interval.
Step2: We normalize each of the 36 cell lines to the same level of 10,000,000 methylation
reads, since according to the statistical summary most of the cell lines have a read
count of this order.
Thus, for an original interval value D_ij, which is the number of methylation reads in the
jth interval on the ith cell line, we can calculate the normalized value
$$ D'_{ij} = D_{ij} \times \frac{10{,}000{,}000}{S_i} \qquad \text{(E5.1)} $$
where S_i is the total number of methylation reads in the ith cell line.
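Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the code of the web tool: it assumes that a read is assigned to an interval by its start coordinate, and it reads Step 2 as scaling each count D_ij by 10,000,000 / S_i. The function names bin_reads and normalize are hypothetical.

```python
from collections import Counter

BIN_SIZE = 100             # Step 1: interval width in bp
TARGET_READS = 10_000_000  # Step 2: common read depth for every cell line


def bin_reads(read_starts, bin_size=BIN_SIZE):
    """Count reads per interval index j.

    Assumption: a read belongs to the interval that contains its start coordinate.
    """
    return Counter(start // bin_size for start in read_starts)


def normalize(counts, total_reads, target=TARGET_READS):
    """Scale every raw count D_ij of one cell line by target / S_i."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    scale = target / total_reads
    return {j: d * scale for j, d in counts.items()}


# Toy example: five read starts on one chromosome of one cell line.
counts = bin_reads([5, 42, 150, 199, 1000])
# S_i is the total read count of the whole cell line, not of this toy list.
normalized = normalize(counts, total_reads=5_000_000)
print(dict(counts))
print(normalized)
```

Storing only the non-empty intervals in a Counter keeps the structure small, which matters because most 100bp intervals of a genome receive no reads.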
Step3: To express the methylation level of an interval, we use red colors from light to
dark to represent methylation levels from low to high. Our level boundaries are as
follows:
E5.2
So we use 7 shades of red with different brightness to represent the different
levels of methylation.
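A minimal sketch of such a mapping from a normalized value to one of 7 red shades is shown below. The boundary values in BOUNDARIES are hypothetical placeholders chosen only for illustration; they are not the boundaries of E5.2. The RGB endpoints and the function names are likewise assumptions.

```python
from bisect import bisect_right

# Hypothetical boundaries between the 7 levels (NOT the values of E5.2).
BOUNDARIES = [1, 2, 4, 8, 16, 32]


def level(value, boundaries=BOUNDARIES):
    """Return the level index: 0 for the lowest level, len(boundaries) for the highest."""
    return bisect_right(boundaries, value)


def red_shade(lvl, n_levels=7):
    """Map a level to an RGB triple that darkens from light red to dark red."""
    if not 0 <= lvl < n_levels:
        raise ValueError("level out of range")
    t = lvl / (n_levels - 1)
    # Assumed endpoints: light red (255, 220, 220) and dark red (120, 0, 0).
    red = round(255 - 135 * t)
    other = round(220 * (1 - t))
    return (red, other, other)


for value in (0, 3, 100):
    print(value, level(value), red_shade(level(value)))
```

Six boundaries split the value range into seven levels, which matches the number of shades stated above.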
Methylation data is stored as a start coordinate followed by up to 4 consecutive
methylation levels for fixed-step 100nt spans. This was done to reduce the
number of records in the database and improve performance: although methylation
data is dense globally, it consists of many disjoint segments, usually only a few
hundred consecutive nucleotides long, with a non-zero methylation value.
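The storage scheme can be illustrated with a short sketch. It assumes that a record holds one start coordinate and at most 4 levels for consecutive 100nt spans, and that spans with a zero level are simply not stored; the actual database schema is not described in this thesis, so the function name pack_records and the record layout are hypothetical.

```python
SPAN = 100      # fixed step in nucleotides
MAX_LEVELS = 4  # at most 4 consecutive levels per record (assumed reading)


def pack_records(levels_by_start, span=SPAN, max_levels=MAX_LEVELS):
    """Group non-zero spans into (start, [level, ...]) records.

    levels_by_start maps the start coordinate of a 100 nt span to its level.
    """
    records = []
    for start in sorted(levels_by_start):
        value = levels_by_start[start]
        if value == 0:
            continue  # zero spans are not stored
        if records:
            last_start, last_levels = records[-1]
            contiguous = start == last_start + span * len(last_levels)
            if contiguous and len(last_levels) < max_levels:
                last_levels.append(value)
                continue
        records.append((start, [value]))
    return records


spans = {1000: 3, 1100: 5, 1200: 2, 1300: 1, 1400: 4, 2000: 6}
for record in pack_records(spans):
    print(record)
```

In this toy input, six non-zero spans collapse into three records, which is the kind of reduction in record count described above.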
Besides the methylation data, we also associate the genome regions with genes, so
that we can see the correlation between genes and methylated regions. For example, in
Fig 5.1 we can see that region chr1:27060485-27070485 contains one gene,
SFN (NM_006142), and that in the region of this gene the cell lines in Group2 and Group4
are not methylated, while some cell lines in Group6 and Group7 are hyper-methylated.
Fig 5.1 Database Web Tool
Corresponding Link: http://motif.bmi.ohio-state.edu/hmm
Chapter 6: Conclusions and Suggestions for Future Work
6.1 Conclusions
Many researchers study breast cancer cell lines and methylation data, and some are
trying to solve biological problems with Hidden Markov Models. However, few have used
an HMM to analyze DNA methylation data in breast cancer cell lines. Moreover, those who
apply HMMs to biological problems usually train with only 2 states whose meanings are
already known, which makes the training less informative. In this thesis, we trained
with many more states whose meanings were not known before the training; after
training, we determined the meanings of the states by correlating them with other
biological data, which is the novel aspect of this work. For the program itself, we
modified the standard version of the HMM to make it work better for our data. The time
and space complexities of the standard version of the HMM are O(#iterations × n²T) and
O(n²T), where n is the number of states and T is the length of the input sequence. Ours
are O(#iterations × (m+n)nT) and O(nT), where m is the number of possible observations.
We made this modification because our server has a 30GB memory limit, and training the
standard HMM on the whole genome would require much more memory than that.
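To illustrate how the memory footprint of HMM training can be reduced, the sketch below performs one Baum-Welch iteration without storing the per-position transition posteriors. It is a generic textbook-style variant given here only as an assumption of what such a modification may look like; it is not the modified algorithm of this thesis, and it does not reproduce the O((m+n)nT) time bound (its time per iteration is O(n²T)). It keeps the scaled forward table (O(nT)), one backward vector (O(n)) and the accumulated counts (O(n² + nm)).

```python
import math


def baum_welch_step(obs, pi, A, B):
    """One EM iteration for a discrete HMM.

    obs: observation indices (length T >= 2); pi: initial probabilities (length n);
    A: n x n transition matrix; B: n x m emission matrix.
    Returns (new_pi, new_A, new_B, log_likelihood_of_the_input_parameters).
    """
    n, m, T = len(pi), len(B[0]), len(obs)
    if T < 2:
        raise ValueError("need at least two observations")

    # Scaled forward pass: the only table of size O(nT) that is kept.
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [pi[i] * B[i][obs[0]] for i in range(n)]
        else:
            prev = alpha[-1]
            row = [sum(prev[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                   for j in range(n)]
        c = sum(row)
        alpha.append([x / c for x in row])
        scale.append(c)

    # Backward pass with a single beta vector; expected counts are accumulated
    # on the fly instead of storing an n x n posterior for every position.
    trans = [[0.0] * n for _ in range(n)]
    emit = [[0.0] * m for _ in range(n)]
    beta = [1.0] * n
    gamma = beta
    for t in range(T - 1, -1, -1):
        gamma = [alpha[t][i] * beta[i] for i in range(n)]
        for i in range(n):
            emit[i][obs[t]] += gamma[i]
        if t > 0:
            weighted = [B[j][obs[t]] * beta[j] / scale[t] for j in range(n)]
            for i in range(n):
                for j in range(n):
                    trans[i][j] += alpha[t - 1][i] * A[i][j] * weighted[j]
            beta = [sum(A[i][j] * weighted[j] for j in range(n)) for i in range(n)]

    new_pi = gamma  # after the loop, gamma is the state posterior at t = 0
    new_A = [[x / sum(row) for x in row] for row in trans]
    new_B = [[x / sum(row) for x in row] for row in emit]
    log_likelihood = sum(math.log(c) for c in scale)
    return new_pi, new_A, new_B, log_likelihood
```

For a whole genome at 100bp resolution, T is in the tens of millions, so dropping the factor n from the dominant memory term is what makes the difference for a model with 11 states.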
Besides the HMM part, we also used programs published or developed by our lab
to process and analyze the data, which largely helped us to find the biological meanings.
For example, for the correlation of DNA methylation data and gene expression data, we
found relationships between them that are coherent with some published results.
6.2 Future Work
The HMM program we designed here is not parallelized, so it takes quite a long
time to train with the whole genome as the input. We could therefore parallelize the
program using OpenMP, MPI or other methods to make it more efficient. Also, our
prediction of biological meanings is restricted to the dataset we have; if we further
correlated our results with other data, it is quite possible that we could predict more
and deeper biological meanings. Moreover, at present we can only verify our results
against some published papers, and not very systematically, so where possible it would
be better to carry out biological experiments to further verify the predictions we made.
References:
[1] American Cancer Society (September 13, 2007). "What Are the Key Statistics for
Breast Cancer?". Archived from the original on January 5, 2008.
http://web.archive.org/web/20080105001124/http://www.cancer.org/docroot/CRI/
content/CRI_2_4_1X_What_are_the_key_statistics_for_breast_cancer_5.asp.
Retrieved 2008-02-03
[2] "Browse the SEER Cancer Statistics Review 1975-2006".
http://seer.cancer.gov/csr/1975_2006/browse_csr.php?section=4&page=sect_04_table.07.html.
[3] March, Jerry; Smith, Michael W. (2001). March's advanced organic chemistry:
reactions, mechanisms, and structure. New York: Wiley. ISBN 0-471-58589-0
[4] Serre D, Lee BH, Ting AH. MBD-isolated Genome Sequencing provides a
high-throughput and comprehensive survey of DNA methylation in the human genome.
Nucleic Acids Res. 2010; 38:391-9
[5] Campbell, M.K., (1995) Biochemistry. Saunders College: Philadelphia, pgs. 615-
16, 181
[6] Maclean, N., S.P. Gregory, and R.A. Flavell (1993) Eukaryotic Genes.
Butterworth and Co., London, pgs. 53-67
[7] Yen RW, Vertino PM, Nelkin BD, Yu JJ, el-Deiry W, Cumaraswamy A, Lennon
GG, Trask BJ, Celano P, Baylin SB. Isolation and characterization of the cDNA
encoding human DNA methyltransferase. Nucleic Acids Res. 1992; 20:2287-2291.
[8] Gruenbaum Y, Cedar H, Razin A. Substrate and sequence specificity of a
eukaryotic DNA methylase. Nature. 1982; 295:620-622.
[9] Bird A. Perceptions of epigenetics. Nature. 2007; 447:396-398.
[10] Holliday R, Pugh JE. DNA modification mechanisms and gene activity during
development. Science. 1975; 187:226-232.
[11] Cedar H, Stein R, Gruenbaum Y, Naveh-Many T, Sciaky-Gallili N, Razin A.
Effect of DNA methylation on gene expression. Cold Spring Harb. Symp. Quant.
Biol. 1983; 47 (Pt 2):605-609.
[12] Reik W, Collick A, Norris ML, Barton SC, Surani MA. Genomic imprinting
determines methylation of parental alleles in transgenic mice. Nature. 1987;
328:248-251.
[13] Riggs AD. X inactivation, differentiation, and DNA methylation. Cytogenet. Cell
Genet. 1975; 14:9-25.
[14] Feinberg AP, Vogelstein B. Alterations in DNA methylation in human colon
neoplasia. Semin. Surg. Oncol. 1987; 3:149-151.
[15] Spruck CH, III, Rideout WM, III, Jones PA. DNA methylation and cancer.
[Review]. EXS. 1993; 64:487-509.
[16] Virmani AK, Rathi A, Sathyanarayana UG, Padar A, Huang CX, Cunnigham HT,
Farinas AJ, Milchgrub S, Euhus DM, Gilcrease M, et al. Aberrant methylation of
the adenomatous polyposis coli (APC) gene promoter 1A in breast and lung
carcinomas. Clin. Cancer Res. 2001; 7:1998-2004.
[17] Tsuchiya T, Tamura G, Sato K, Endoh Y, Sakata K, Jin Z, Motoyama T, Usuba O,
Kimura W, Nishizuka S, et al. Distinct methylation patterns of two APC gene
promoters in normal and cancerous gastric epithelia. Oncogene. 2000;
19:3642-3646.
[18] Ibanez de Caceres I, Battagli C, Esteller M, Herman JG, Dulaimi E, Edelson MI,
Bergman C, Ehya H, Eisenberg BL, Cairns P. Tumor cell-specific BRCA1 and
RASSF1A hypermethylation in serum, plasma, and peritoneal fluid from ovarian
cancer patients. Cancer Res. 2004; 64:6476-6481.
[19] Rice JC, Massey-Brown KS, Futscher BW. Aberrant methylation of the BRCA1
CpG island promoter is associated with decreased BRCA1 mRNA in sporadic
breast cancer cells. Oncogene. 1998; 17:1807-1812.
[20] Bian YS, Osterheld MC, Fontolliet C, Bosman FT, Benhattar J. p16 inactivation
by methylation of the CDKN2A promoter occurs early during neoplastic
progression in Barrett's esophagus. Gastroenterology. 2002; 122:1113-1121.
[21] Holst CR, Nuovo GJ, Esteller M, Chew K, Baylin SB, Herman JG, Tlsty TD.
Methylation of p16(INK4a) promoters occurs in vivo in histologically normal
human mammary epithelia. Cancer Res. 2003; 63:1596-1601.
[22] Iqbal, K.; Jin, S.-G.; Pfeifer, G. P.; Szabo, P. E. (2011). "Reprogramming of the
paternal genome upon fertilization involves genome-wide oxidation of
5-methylcytosine". Proceedings of the National Academy of Sciences 108 (9):
3642-3647. doi:10.1073/pnas.1014033108. PMC 3048122. PMID 21321204.
http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=3048122
[23] Jaenisch, R.; Bird, A. (2003). "Epigenetic regulation of gene expression: how the
genome integrates intrinsic and environmental signals". Nature genetics 33 Suppl:
245-254. doi:10.1038/ng1089. PMID 12610534
[24] Craig, JM; Wong, NC (editor) (2011). Epigenetics: A Reference Manual. Caister
Academic Press. ISBN 978-1-904455-88-2
[25] Rabiner, Lawrence R. "A Tutorial on Hidden Markov Models and Selected
Applications in Speech Recognition". Proceedings of the IEEE , Vol. 77, No. 2,
February 1989, pp. 257-286.
[26] Baldi, P. & Brunak S. "Bioinformatics - The Machine Learning Approach".
Massachusetts Institute of Technology, 1998.
[27] Richard M. Neve et al. A collection of breast cancer cell lines for the study of
functionally distinct cancer subtypes. Cancer Cell 10, 515-527, December 2006
[28] R Alan Harris et al. "Comparison of sequencing-based methods to profile DNA
methylation and identification of monoallelic epigenetic modifications". Nature
Biotechnology, 2010
[29] Jung K Choi, Jae-Bum Bae, Jaemyun Lyu, Tae-Yoon Kim and Young-Joon Kim.
Nucleosome deposition and DNA methylation at coding region boundaries.
Genome Biology, 2009, 10:R89
[30] Weaver IC (2007). "Epigenetic programming by maternal behavior and
pharmacological intervention. Nature versus nurture: let's call the whole thing off".
Epigenetics 2 (1): 22-8. doi:10.4161/epi.2.1.3881. PMID 17965624.
http://www.landesbioscience.com/journals/epi/abstract.php?id=3881.
[31] Cock et al. The Sanger FASTQ file format for sequences with quality scores, and
the Solexa/Illumina FASTQ variants. Nucleic Acids Research, 2009
Appendix: Formats
A. BAM format
BAM format is the compressed binary version of the Sequence Alignment/Map
(SAM) format, a compact and index-able representation of nucleotide sequence
alignments. Many next-generation sequencing and analysis tools work with SAM/BAM.
For custom track display, the main advantage of indexed BAM over PSL and other
human-readable alignment formats is that only the portions of the files needed to display
a particular region are transferred to UCSC. This makes it possible to display alignments
from files that are so large that the connection to UCSC would time out when attempting
to upload the whole file to UCSC. Both the BAM file and its associated index file remain
on your web-accessible server (http or ftp), not on the UCSC server. UCSC temporarily
caches the accessed portions of the files to speed up interactive display.
B. SAM (Sequence Alignment/Map) format
SAM format is a generic format for storing large nucleotide sequence alignments.
SAM aims to be a format that:
- Is flexible enough to store all the alignment information generated by various
  alignment programs;
- Is simple enough to be easily generated by alignment programs or converted from
  existing alignment formats;
- Is compact in file size;
- Allows most operations on the alignment to work on a stream without loading
  the whole alignment into memory;
- Allows the file to be indexed by genomic position to efficiently retrieve all reads
  aligning to a locus.
C. EXPORT format
HWUSI-EAS68R 0012 3 1 1173 16855 0 1
AGATCGAGCTGGAGAAATTCCATGAATATACCACAC
cddddcLcc^dd\d^a`ccYcca^aM_]_b`b\TYc chr10.fa 12854193 R36 118 Y
HWUSI-EAS68R 0012 3 1 1174 2493 0 1
AAGACGGGAAAGGACTCACTCAAAGTCACACAGCTG
cTccc_M_^_L_UUL[MM]XGXZFZXSQV\aaYYaV chr3.fa 195586021 F 17C18 97 N
HWUSI-EAS68R 0012 3 1 1174 7057 0 1
CAACTTGGAGAATCACATTTGAAGTGCAAAGAACAC
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB NM N
D. BED format
BED format provides a flexible way to define the data lines that are displayed in an
annotation track. BED lines have three required fields and nine additional optional fields.
The number of fields per line must be consistent throughout any single set of data in an
annotation track. The order of the optional fields is binding: lower-numbered fields must
always be populated if higher-numbered fields are used.
The first three required BED fields are:
1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold
(e.g. scaffold10671).
2. chromStart - The starting position of the feature in the chromosome or scaffold.
The first base in a chromosome is numbered 0.
3. chromEnd - The ending position of the feature in the chromosome or scaffold.
The chromEnd base is not included in the display of the feature. For example, the
first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100,
and span the bases numbered 0-99.
The 9 additional optional BED fields are:
4. name - Defines the name of the BED line. This label is displayed to the left of the
BED line in the Genome Browser window when the track is open to full display
mode or directly to the left of the item in pack mode.
5. score - A score between 0 and 1000. If the track line useScore attribute is set to 1
for this annotation data set, the score value will determine the level of gray in
which this feature is displayed (higher numbers = darker gray). This table shows
the Genome Browser's translation of BED score values into shades of gray:
shade (lightest to darkest gray, left to right)
score in range:  ≤ 166  167-277  278-388  389-499  500-611  612-722  723-833  834-944  ≥ 945
6. strand - Defines the strand - either '+' or '-'.
7. thickStart - The starting position at which the feature is drawn thickly (for
example, the start codon in gene displays).
8. thickEnd - The ending position at which the feature is drawn thickly (for
example, the stop codon in gene displays).
9. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line
itemRgb attribute is set to "On", this RGB value will determine the display color
of the data contained in this BED line. NOTE: It is recommended that a simple
color scheme (eight colors or less) be used with this attribute to avoid
overwhelming the color resources of the Genome Browser and your Internet
browser.
10. blockCount - The number of blocks (exons) in the BED line.
11. blockSizes - A comma-separated list of the block sizes. The number of items in
this list should correspond to blockCount.
12. blockStarts - A comma-separated list of block starts. All of the blockStart
positions should be calculated relative to chromStart. The number of items in this
list should correspond to blockCount.
Example
Here's an example of an annotation track that uses a complete BED definition:
track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
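The field definitions above translate directly into a small parser. The following is a minimal sketch, assuming whitespace-separated fields and only the checks stated in this appendix; the names parse_bed_line, BED_FIELDS and _int_list are hypothetical, and track lines are not handled.

```python
def _int_list(text):
    """Parse a comma-separated list such as '567,488,' (a trailing comma is allowed)."""
    return [int(item) for item in text.split(",") if item]


# Field names in BED order, each paired with the converter for its value.
BED_FIELDS = [
    ("chrom", str), ("chromStart", int), ("chromEnd", int),
    ("name", str), ("score", int), ("strand", str),
    ("thickStart", int), ("thickEnd", int), ("itemRgb", str),
    ("blockCount", int), ("blockSizes", _int_list), ("blockStarts", _int_list),
]


def parse_bed_line(line):
    """Return a dict holding the fields that are present in one BED data line."""
    values = line.split()
    if not 3 <= len(values) <= len(BED_FIELDS):
        raise ValueError("a BED line has between 3 and 12 fields")
    record = {name: convert(value)
              for (name, convert), value in zip(BED_FIELDS, values)}
    if record["chromEnd"] < record["chromStart"]:
        raise ValueError("chromEnd is smaller than chromStart")
    if "score" in record and not 0 <= record["score"] <= 1000:
        raise ValueError("score must lie between 0 and 1000")
    if "blockStarts" in record:
        sizes, starts = record["blockSizes"], record["blockStarts"]
        if not len(sizes) == len(starts) == record["blockCount"]:
            raise ValueError("block lists do not match blockCount")
    return record


record = parse_bed_line("chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512")
print(record["name"], record["chromEnd"] - record["chromStart"], record["blockStarts"])
```

Because chromStart is 0-based and chromEnd is excluded, the feature length is simply chromEnd - chromStart, with no off-by-one correction.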
E. Fastq format
Fastq format is a text-based format for storing both a biological sequence (usually
nucleotide sequence) and its corresponding quality scores. Both the sequence letter and
quality score are encoded with a single ASCII character for brevity. It was originally
developed at the Wellcome Trust Sanger Institute to bundle a FASTA sequence and its
quality data, but has recently become the de facto standard for storing the output of high
throughput sequencing instruments such as the Illumina Genome Analyzer [31].
Example
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
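A record like the example above can be read with a few lines of code. The sketch below assumes four-line records and the Sanger encoding described in [31], in which the quality score is the ASCII code of the character minus 33; other FASTQ variants use a different offset. The function name parse_fastq is hypothetical.

```python
def parse_fastq(text, offset=33):
    """Return a list of (identifier, sequence, qualities) from FASTQ text.

    Assumptions: every record spans exactly four lines, and qualities use the
    Sanger offset of 33 unless another offset is passed in.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("truncated FASTQ input")
    records = []
    for k in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[k:k + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {k + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        scores = [ord(char) - offset for char in quality]
        records.append((header[1:], sequence, scores))
    return records


example = "\n".join([
    "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36",
    "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC",
    "+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36",
    "I" * 30 + "9IG9IC",
])
identifier, sequence, scores = parse_fastq(example)[0]
print(identifier)
print(len(sequence), min(scores), max(scores))
```

With this offset, the character I corresponds to a quality of 40 and the character 9 to a quality of 24.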
F. Bowtie output format (generated by the bowtie software from the FASTQ input files)
19 + chr9 8003
AGGCTATATGCGCGGCCAGCAGACCTGCAGGGCCCGCTCGTCCAGGGGGCGG
TGCTTGCTCTGGATCGTGTGCGG
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
1 67:C>G,72:G>C,73:C>G
28 + chr19 28101
AAATAAATAAATAAAAACAACTTGTCCAAGGTCAGACAGGCCGCCTCTTAGT
AAGCACACCTATCCTCTATAGTA
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
1 41:A>C,60:A>C,72:T>G
28 + chr1 25355
AAATAAATAAATAAAAACAACTTGTCCAAGGTCAGACAGGCCGCCTCTTAGT
AAGCACACCTATCCTCTATAGTA
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
1 41:A>C,60:A>C,72:T>G
35 + chr1 41809
CAAATACGGTGACTGTTTCTTACGTGGACGACGTTGTGTTGAACATGGGTGA
GTAAGACTGAAGCAGCCGTAATT
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0