
A HMM Approach to Identifying Distinct DNA Methylation Patterns for Subtypes of Breast Cancers

Authors:

Raghu Machiraju

The Ohio State University


A HMM Approach to Identifying Distinct DNA Methylation Patterns

for Subtypes of Breast Cancers

Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science

in the Graduate School of The Ohio State University

Maoxiong Xu, B.S.

Graduate Program in Computer Science and Engineering

The Ohio State University

2011

Thesis Committee:

Victor X. Jin, Advisor

Raghu Machiraju

Maoxiong Xu

2011

Abstract

The United States has the highest annual incidence rates of breast cancer in the world: 128.6 per 100,000 in whites and 112.6 per 100,000 among African Americans [1,2]. It is the second most common cancer (after skin cancer) and the second most common cause of cancer death (after lung cancer) [1]. Recent studies have demonstrated that hyper-methylation of CpG islands may be implicated in tumorigenesis, acting as a mechanism to inactivate the expression of a diverse array of genes (Baylin et al., 2001). Genes reported to be regulated by CpG hyper-methylation include tumor suppressor genes, cell cycle related genes, DNA mismatch repair genes, hormone receptors, and tissue or cell adhesion molecules (Yan et al., 2001). Breast cancer cells may or may not have three important receptors: estrogen receptor (ER), progesterone receptor (PR), and HER2. We therefore take ER, PR, and HER2 status into account when handling the data. In this thesis, we first train a Hidden Markov Model (HMM) on methylation data from both breast cancer cells and other cancer cells. We also perform hierarchical clustering on the gene expression data for the breast cancer cells and, based on the clustering results, obtain the methylation distribution in each cluster. Finally, we correlate the HMM training results with the methylation distribution to assign biological meanings to the states in the HMM results.


Dedicated to my father, mother, and wife,

for all of their love and support.

Acknowledgments

I have many people to thank for my making it this far: my advisor, Dr. Victor Jin,

for everything he's done; Dr. Raghu Machiraju, for his help and support; all of my lab

mates, for their knowledge, assistance, and encouragement; and the incredible

Biomedical Informatics Department staff for everything they do.

Vita

2005……………………………...Mudu Central High School

2009……………………………...B.S. Computer Science, Southeast University

2009 to present……….……..……M.S. Computer Science & Engineering, The

Ohio State University

Sep. 2010 to present……………...Graduate Teaching Associate, Department

of Bioinformatics, The Ohio State University

Publications

Cao AR, Rabinovich R, Xu M, Xu X, Jin VX, Farnham PJ: Genome-wide analysis of

transcription factor E2F1 mutant proteins reveals that N- and C-terminal protein

interaction domains do not participate in targeting E2F1 to the human genome.

J Biol Chem. 2011 Apr 8; 286(14):11985-96. Epub 2011 Feb 10.

Fields of Study

Major: Computer Science & Engineering

Machine Learning applied in Bioinformatics

Table of Contents

Abstract

Dedication

Acknowledgments

Vita

Table of Contents

List of Tables

List of Figures

Chapter 1: Introduction

1.1 Methylation

1.1.1 What Is Methylation?

1.1.2 DNA Methylation

1.1.3 DNA Methylation Mechanism

1.1.4 DNA Methylation in Cancer

1.2 Gene Expression

1.2.1 Gene Expression Measurement

1.2.2 mRNA Quantification

1.2.3 Regulation of Gene Expression

1.3 Hidden Markov Model

1.3.1 Introduction to Hidden Markov Model

1.3.2 Hidden Markov Model

1.3.3 Model Architecture

1.3.4 HMM Training and Decoding

1.3.5 HMMs in Computational Biology

1.3.6 Application of HMMs to Specific Problems

Chapter 2: Methods and Algorithms

2.1 The Probabilistic Model

2.2 Baum-Welch Algorithm

2.3 Work Flow

Chapter 3: Data Process

3.1 Data Sets

3.2 MBD-seq Protocol

3.3 Data Preprocess

3.4 Input for HMM

3.5 Methylation Distribution Overview

3.6 Gene Expression Data

Chapter 4: Results and Discussion

4.1 Results from HMM

4.2 Biology Meanings

4.2.1 Gene Expression Results for 33 Breast Cancer Cell Lines

4.2.2 Results Based on Different Clusters

4.2.3 States Meanings and Group Patterns

Chapter 5: Data Visualization

Chapter 6: Conclusions and Suggestions for Further Work

6.1 Conclusion

6.2 Future Work

References

Appendix: Formats

A. BAM format

B. SAM format

C. Export format

D. BED format

E. Fastq format

F. Bowtie output format

List of Tables

Table 3.1 Data summary for 36 cell lines

Table 3.2 12 Groups for 36 cell lines

Table 4.1 BIC results for HMM results

Table 4.2 Transition Matrix

Table 4.3 Emission probabilities for each mark in each state

Table 4.4 Ordered emission probabilities for each mark in each state - marks

Table 4.5 Ordered emission probabilities for each mark in each state - probabilities

Table 4.6 Filtered ordered emission probabilities for each mark in each state - marks

Table 4.7 Number of genes in each cluster

Table 4.8 First 3 marks for each state

Table 4.9 States and interval correlation results

Table 4.10 States meanings

Table 4.11 Patterns for subtypes of Breast cancers

List of Figures

Fig 1.1 Methylation

Fig 1.2 DNA methylation

Fig 1.3 DNA methylation mechanism

Fig 1.4 DNA methylation in cancer

Fig 1.5 Gene Expression

Figure 1.6: A simple HMM λ = (A, B, π), where N = 3, M = 3, a12, a23, a32 are non-zero, b1(a), b2(t), b3(g) = 1 and π = (1, 0, 0)

Fig 2.1 A broad overview of the HMM work-flow, highlighting the most significant inputs, transformations, and outputs at each step from start to end

Fig 3.1 Bar figure for 36 cell lines

Fig 3.2 Methylation distribution for 33 breast cancer cell lines

Fig 4.1 Heatmap for transition matrix

Fig 4.2 33 Breast Cancer Cell Gene Expression One-Way Hierarchy Clustering

Fig 4.3 Grouped 33 Breast Cancer Cell Gene Expression One-Way Hierarchy Clustering

Fig 4.4 Methylation distribution based on cluster 1 genes

Fig 4.5 Methylation distribution based on cluster 2 genes

Fig 4.6 Methylation distribution based on cluster 3 genes

Fig 4.7 Methylation distribution based on cluster 4 genes

Fig 4.8 Methylation distribution based on cluster 5 genes

Fig 4.9 Methylation distribution based on cluster 6 genes

Fig 4.10 Methylation distribution based on cluster 7 genes

Fig 4.11 Methylation distribution based on cluster 8 genes

Fig 4.12 Methylation distribution based on cluster 9 genes

Fig 5.1 Database Web Tool

Chapter 1: Introduction

1.1 Methylation

1.1.1 What Is Methylation?

In the view of chemical sciences, methylation means the addition of a methyl

group to a substrate or the substitution of an atom or group by a methyl group.

Methylation is a form of alkylation with, to be specific, a methyl group, rather than a

larger carbon chain, replacing a hydrogen atom.

In the view of biological systems, methylation is catalyzed by enzymes; such methylation can be involved in modification of heavy metals, regulation of gene expression, regulation of protein function, and RNA metabolism. Methylation of heavy metals can also occur outside of biological systems. Chemical methylation of tissue samples is also one method for reducing certain histological staining artifacts.

Fig 1.1 Methylation

In organic chemistry, the term methylation refers to the alkylation process that delivers a CH3 group [3]. This is commonly performed using electrophilic methyl sources (iodomethane, dimethyl sulfate, dimethyl carbonate) or, less commonly, the more powerful (and more dangerous) methylating reagents methyl triflate or methyl fluorosulfonate (magic methyl), all of which react via SN2 nucleophilic substitution. For example, a carboxylate may be methylated on oxygen to give a methyl ester, an alkoxide salt RO− may likewise be methylated to give an ether, ROCH3, or a ketone enolate may be methylated on carbon to produce a new ketone.

1.1.2 DNA Methylation

After every cycle of DNA replication, several modifications occur in the DNA. DNA methylation is one such post-synthesis modification. It is an epigenetic modification involved in both normal developmental processes and disease states through the modulation of gene expression and the maintenance of genomic organization [4]. DNA methylation has been shown to play a part in a number of biological processes, such as regulation of imprinted genes, X chromosome inactivation, and tumor suppressor gene silencing in cancerous cells. It also acts as a protection mechanism adopted by pathogen DNA (mainly bacterial) against the endonuclease activity that destroys any foreign DNA [5, 6].

Fig 1.2 DNA methylation

DNA cytosine methylation is the covalent addition of a methyl group to the 5

position of cytosine. In humans, DNA methylation occurs predominantly in a CpG

dinucleotide context and is catalyzed by DNA methyltransferases [7, 8, 9]. Dense clusters of

CpG dinucleotides, termed CpG islands, are present in roughly 40% of gene promoters,

and methylation of these regions is associated with transcriptional silencing [10, 11]. DNA

methylation is essential for normal developmental processes, such as imprinting [12] and

X chromosome inactivation [13]. Dysregulation of DNA methylation occurs in disease

states such as cancer, where promoter CpG island hyper-methylation leads to inactivation

of tumor suppressor genes [14, 15]. Thus, many tumor suppressors classically identified

through mutation analyses, such as APC [16, 17], BRCA1 [18, 19], and CDKN2A [20, 21], have

also been found to be transcriptionally silenced by promoter hyper-methylation.

1.1.3 DNA Methylation Mechanism

In DNA, methylation usually occurs in the CpG islands, a CG rich region,

upstream of the promoter region. The letter “p” here signifies that the C and G are

connected by a phosphodiester bond. In humans, DNA methylation is carried out by a

group of enzymes called DNA methyltransferases. These enzymes not only determine the

DNA methylation patterns during the early development, but are also responsible for

copying these patterns to the strands generated from DNA replication [6].

DNA methylation involves the addition of a methyl group to the 5 position of the cytosine pyrimidine ring or the number 6 nitrogen of the adenine purine ring (cytosine and adenine are two of the four bases of DNA). This modification can be inherited through cell division. DNA methylation is typically removed during zygote formation and re-established through successive cell divisions during development, although the latest research shows that hydroxylation of the methyl group occurs, rather than complete removal of methyl groups, in the zygote [22]. DNA methylation is a crucial part of normal organismal development and cellular differentiation in higher organisms. DNA methylation stably alters the gene expression pattern in cells such that cells can "remember where they have been" or decrease gene expression; for example, cells programmed to be pancreatic islets during embryonic development remain pancreatic islets throughout the life of the organism without continuing signals telling them that they need to remain islets. In addition, DNA methylation suppresses the expression of viral genes and other deleterious elements that have been incorporated into the genome of the host over time. DNA methylation also forms the basis of chromatin structure, which enables cells to form the myriad characteristics necessary for multicellular life from a single immutable sequence of DNA. DNA methylation also plays a crucial role in the development of nearly all types of cancer [23].

Fig 1.3 DNA methylation mechanism

1.1.4 DNA Methylation in Cancer

DNA methylation is an important regulator of gene transcription, and a large body of evidence has demonstrated that aberrant DNA methylation is associated with unscheduled gene silencing; genes with high levels of 5-methylcytosine in their promoter region are transcriptionally silent. DNA methylation is essential during embryonic development, and in somatic cells, patterns of DNA methylation are generally transmitted to daughter cells with high fidelity. Aberrant DNA methylation patterns have been associated with a large number of human malignancies and are found in two distinct forms: hyper-methylation and hypo-methylation compared to normal tissue. Hyper-methylation is one of the major epigenetic modifications that repress transcription via the promoter regions of tumor suppressor genes. It typically occurs at CpG islands in the promoter region and is associated with gene inactivation. Global hypo-methylation has also been implicated in the development and progression of cancer through different mechanisms [24].

Fig 1.4 DNA methylation in cancer

1.2 Gene Expression

Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins, but in non-protein coding genes such as rRNA genes or tRNA genes, the product is a functional RNA. The process of gene expression is used by all known life - eukaryotes (including multicellular organisms), prokaryotes (bacteria and archaea) and viruses - to generate the macromolecular machinery for life. Several steps in the gene expression process may be modulated, including transcription, RNA splicing, translation, and post-translational modification of a protein. Gene regulation gives the cell control over structure and function, and is the basis for cellular differentiation, morphogenesis and the versatility and adaptability of any organism. Gene regulation may also serve as a substrate for evolutionary change, since control of the timing, location, and amount of gene expression can have a profound effect on the functions (actions) of the gene in a cell or in a multicellular organism.

Fig 1.5 Gene Expression

In genetics gene expression is the most fundamental level at which genotype gives

rise to the phenotype. The genetic code is "interpreted" by gene expression, and the

properties of the expression products give rise to the organism's phenotype.

1.2.1 Gene Expression Measurement

Measuring gene expression is an important part of many life sciences: the ability to quantify the level at which a particular gene is expressed within a cell, tissue or organism can provide a great deal of information. For example, measuring gene expression can:

• Identify viral infection of a cell (viral protein expression)

• Determine an individual's susceptibility to cancer (oncogene expression)

• Find if a bacterium is resistant to penicillin (beta-lactamase expression)

Similarly, the analysis of the location of protein expression is a powerful tool, and this can be done on an organismal or cellular scale. Investigation of localization is particularly important for the study of development in multicellular organisms and as an indicator of protein function in single cells. Ideally, measurement of expression is done by detecting the final gene product (for many genes this is the protein); however, it is often easier to detect one of the precursors, typically mRNA, and infer the gene expression level.

1.2.2 mRNA Quantification

Levels of mRNA can be quantitatively measured by Northern blotting which

gives size and sequence information about the mRNA molecules. A sample of RNA is

separated on an agarose gel and hybridized to a radio-labeled RNA probe that is

complementary to the target sequence. The radio-labeled RNA is then detected by an

autoradiograph. The main problems with Northern blotting stem from the use of

radioactive reagents (which make the procedure time consuming and potentially

dangerous) and lower quality quantification than more modern methods (due to the fact

that quantification is done by measuring band strength in an image of a gel). Northern

blotting is, however, still widely used as the additional mRNA size information allows

the discrimination of alternately spliced transcripts.

A more modern low-throughput approach for measuring mRNA abundance is reverse transcription quantitative polymerase chain reaction (RT-PCR followed by qPCR). RT-PCR first generates a DNA template from the mRNA by reverse transcription. The DNA template is then used for qPCR, where the fluorescence of a probe changes as the DNA amplification process progresses. With a carefully constructed standard curve, qPCR can produce an absolute measurement such as the number of copies of mRNA, typically in units of copies per nanolitre of homogenized tissue or copies per cell. qPCR is very sensitive (detection of a single mRNA molecule is possible), but can be expensive due to the fluorescent probes required.

Northern blots and RT-qPCR are good for detecting whether a single gene is being expressed, but they quickly become impractical if many genes within the sample are being studied. Using DNA microarrays, transcript levels for many genes at once (expression profiling) can be measured. Recent advances in microarray technology allow for the quantification, on a single array, of transcript levels for every known gene in several organisms' genomes, including humans.

Alternatively "tag based" technologies like Serial analysis of gene expression

(SAGE), which can provide a relative measure of the cellular concentration of different

messenger RNAs, can be used. The great advantage of tag-based methods is the "open

architecture", allowing for the exact measurement of any transcript, with a known or

unknown sequence.

1.2.3 Regulation of Gene Expression

Regulation of gene expression refers to the control of the amount and timing of

appearance of the functional product of a gene. Control of expression is vital to allow a

cell to produce the gene products it needs when it needs them; in turn this gives cells the

flexibility to adapt to a variable environment, external signals, damage to the cell, etc.

Some simple examples of where gene expression is important are:

• Control of insulin expression so it gives a signal for blood glucose regulation

• X chromosome inactivation in female mammals to prevent an "overdose" of the genes it contains

• Cyclin expression levels control progression through the eukaryotic cell cycle

More generally gene regulation gives the cell control over all structure and

function, and is the basis for cellular differentiation, morphogenesis and the versatility

and adaptability of any organism.

Any step of gene expression may be modulated, from the DNA-RNA

transcription step to post-translational modification of a protein. The stability of the final

gene product, whether it is RNA or protein, also contributes to the expression level of the

gene - an unstable product results in a low expression level. In general gene expression is

regulated through changes in the number and type of interactions between molecules that

collectively influence transcription of DNA and translation of RNA.

DNA methylation is a widespread mechanism for epigenetic influence on gene

expression and is seen in bacteria and eukaryotes and has roles in heritable transcription

silencing and transcription regulation. In eukaryotes the structure of chromatin, controlled

by the histone code, regulates access to DNA with significant impacts on the expression

of genes in euchromatin and heterochromatin areas.

1.3 Hidden Markov Model (HMM)

1.3.1 Introduction to HMM

A Hidden Markov Model (HMM) is a stochastic model that captures the statistical properties of observed real world data. A good HMM accurately models the real world source of the observed data and has the ability to simulate the source. Machine learning techniques based on HMMs have been successfully applied to problems including speech recognition, optical character recognition, and, as we will examine, problems in computational biology.

Methylation finding or prediction has become one of the foremost computational biology problems for two reasons. First, completely sequenced genomes have become readily available. Second, and most important, there is a need to extract actual biological knowledge from this data to explain the molecular interactions that occur in cells and to define important cellular pathways. Discovering the location of hyper-methylation on the genome is a very important step towards building such a body of knowledge. This thesis will introduce several different statistical and algorithmic methods for hyper-methylation finding, with a focus on the statistical model-based approach using HMMs.

1.3.2 Hidden Markov Models

A basic Markov model of a process is a model where each state corresponds to an

observable event and the state transition probabilities depend only on the current and

predecessor state. This model is extended to a Hidden Markov model for application to

more complex processes, including speech recognition and computational gene finding.

A generalized Hidden Markov Model (HMM) consists of a finite set of states, an

alphabet of output symbols, a set of state transition probabilities and a set of emission

probabilities. The emission probabilities specify the distribution of output symbols that

may be emitted from each state. Therefore in a hidden model, there are two stochastic

processes; the process of moving between states and the process of emitting an output

sequence. The sequence of state transitions is a hidden process and is observed through

the sequence of emitted symbols.

Let us formalize the definition of an HMM in the following way, taken from an

HMM tutorial by Lawrence Rabiner [25]. An HMM is defined by the following elements:

1. Set S of N states, S = {S_1, S_2, …, S_N}.

2. Set O of M observation symbols, the output alphabet, O = {o_1, o_2, …, o_M}.

3. Set A of state transition probabilities, A = {a_ij}, where a_ij is the probability of moving from state i to state j. Writing Q_t for the state at position t:

$a_{ij} = P(Q_{t+1} = S_j \mid Q_t = S_i), \quad 1 \le i, j \le N$   (E1.1)

4. Set B of observation symbol probabilities at state j, B = {b_j(k)}, where b_j(k) is the probability of emitting symbol k at state j. Writing O_t for the symbol emitted at position t:

$b_j(k) = P(O_t = o_k \mid Q_t = S_j), \quad 1 \le j \le N,\ 1 \le k \le M$   (E1.2)

5. Set π, the initial state distribution, π = {π_i}, where π_i is the probability that the start state is state i:

$\pi_i = P(Q_1 = S_i), \quad 1 \le i \le N$   (E1.3)

Given the definitions above, the notation of a model is λ = (A, B, π).

Figure 1.6: A simple HMM λ = (A, B, π), where N = 3, M = 3, a12, a23, a32 are non-zero, b1(a), b2(t), b3(g) = 1 and π = (1, 0, 0). Note that states can be 'null' states that do not emit any symbol.
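The example model of Figure 1.6 can be written down directly as the triple λ = (A, B, π). The following is a minimal Python sketch, not code from this thesis: the names `sample`, `STATES` and `SYMBOLS` are illustrative, and the three non-zero transitions are assumed to equal 1.0 so that every row of A sums to one (the figure only states that they are non-zero).

```python
import random

# States and output alphabet of the three-state example in Figure 1.6.
STATES = ["S1", "S2", "S3"]
SYMBOLS = ["a", "t", "g"]

# Transition matrix A: only a12, a23 and a32 are non-zero.
# Assumption: each is set to 1.0 so that every row sums to one.
A = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]

# Emission matrix B: b1(a) = b2(t) = b3(g) = 1.
B = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Initial state distribution pi = (1, 0, 0).
PI = [1.0, 0.0, 0.0]


def sample(length, seed=0):
    """Draw a hidden state path and the symbol sequence it emits."""
    rng = random.Random(seed)
    state_ids = range(len(STATES))
    symbol_ids = range(len(SYMBOLS))
    state = rng.choices(state_ids, weights=PI)[0]
    path, emitted = [], []
    for _ in range(length):
        path.append(STATES[state])
        # First stochastic process: emit a symbol from the current state.
        emitted.append(SYMBOLS[rng.choices(symbol_ids, weights=B[state])[0]])
        # Second stochastic process: move to the next hidden state.
        state = rng.choices(state_ids, weights=A[state])[0]
    return path, "".join(emitted)
```

Because every probability in this toy model is 0 or 1, `sample(6)` always returns the path S1, S2, S3, S2, S3, S2 and the output `"atgtgt"`; with less extreme parameters the same output could arise from several hidden paths, which is what makes the state sequence hidden.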

1.3.3 Model Architecture

The set of states S, the output symbol alphabet O and the connections between the states constitute the architecture of a model. The architecture of an HMM is problem dependent. The model is constructed to correspond to the properties and constraints of the observed sequences and of the process itself. HMM architecture can also be learned from the data, but in most computational biology problems, it is advantageous to use known constraints that characterize the processes.

1.3.4 HMM Training and Decoding

Once the architecture of an HMM has been decided, the HMM must be trained to closely fit the process it models. Training involves adjusting the transition and output probabilities until the model sufficiently fits the process. These adjustments are performed using standard machine learning techniques to optimize P(O|λ), the probability of the observed sequence O = O1O2…OT given the model λ, over a set of training sequences (here T is the length of the observation sequence, i.e. the number of 1000bp intervals). The most common and straightforward algorithm for HMM training is expectation maximization (EM), which adapts the transition and output parameters by repeatedly re-estimating them until P(O|λ) has been locally maximized.
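The quantity P(O|λ) that training tries to maximize can be evaluated with the standard forward recursion. The sketch below is illustrative only: the two-state model and its numbers are made up for the example and are not parameters from this thesis, and the function name `forward_likelihood` is hypothetical.

```python
# Hypothetical two-state, two-symbol toy model (not from the thesis data).
A = [[0.7, 0.3],
     [0.4, 0.6]]   # A[i][j]: probability of moving from state i to state j
B = [[0.9, 0.1],
     [0.2, 0.8]]   # B[j][k]: probability of emitting symbol k in state j
PI = [0.5, 0.5]    # PI[i]: probability that the first state is i


def forward_likelihood(obs, A, B, pi):
    """Return P(O | lambda) for a sequence of symbol indices `obs`."""
    n = len(pi)
    # alpha[i] = P(O_1..O_t, Q_t = S_i | lambda), initialised for t = 1.
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for symbol in obs[1:]:
        # Sum over all predecessor states, then emit the next symbol.
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][symbol]
            for j in range(n)
        ]
    # Marginalise over the final hidden state.
    return sum(alpha)
```

For the observation sequence `[0, 1]` this toy model gives P(O|λ) = 0.0355 + 0.156 = 0.1915. For long sequences such as genome-wide interval data, the products underflow, so practical implementations usually rescale `alpha` at each step or work with logarithms.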

HMM decoding involves the prediction of hidden states given an observed

sequence. The problem is to discover the best sequence of states Q = Q1Q2…QT visited

that accounts for an emitted sequence O = O1O2…OT and a model λ. There may be

several different ways to define a best sequence of states. A common decoding algorithm

is the Viterbi algorithm. The Viterbi algorithm uses a dynamic programming approach to

find the most likely sequence of states Q given an observed sequence O and model λ.
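The dynamic programming idea behind Viterbi decoding can be sketched in a few lines of Python. This is a generic textbook-style illustration under assumed toy parameters (the two-state model below is invented for the example), not the implementation used in this thesis.

```python
import math

# Hypothetical two-state, two-symbol toy model (not from the thesis data).
A = [[0.7, 0.3],
     [0.4, 0.6]]   # transition probabilities a_ij
B = [[0.9, 0.1],
     [0.2, 0.8]]   # emission probabilities b_j(k)
PI = [0.5, 0.5]    # initial state distribution


def _log(p):
    """Logarithm that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, A, B, pi):
    """Return the most likely state index sequence Q for symbol indices `obs`."""
    n = len(pi)
    # delta[j]: log probability of the best path that ends in state j.
    delta = [_log(pi[j]) + _log(B[j][obs[0]]) for j in range(n)]
    backpointers = []
    for symbol in obs[1:]:
        prev, delta, pointers = delta, [], []
        for j in range(n):
            # Best predecessor state for arriving in state j.
            best = max(range(n), key=lambda i: prev[i] + _log(A[i][j]))
            pointers.append(best)
            delta.append(prev[best] + _log(A[best][j]) + _log(B[j][symbol]))
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda j: delta[j])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With this toy model, `viterbi([0, 0, 1, 1], A, B, PI)` returns `[0, 0, 1, 1]`: the decoder switches state where the emitted symbol changes. Log probabilities are used so that long sequences do not underflow.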

1.3.5 HMMs in Computational Biology

The field of computational biology involves the application of computer science

theories and approaches to biological and medical problems. Computational biology is

motivated by newly available and abundant raw molecular datasets gathered from a

variety of organisms. Though the availability of this data marks a new era in biological

research, it alone does not provide any biologically significant knowledge. The goal of

computational biology is then to elucidate additional information regarding protein

coding, protein function and many other cellular mechanisms from the raw datasets. This

new information is required for drug design, medical diagnosis, medical treatment and

countless fields of research.

The majority of raw molecular data used in computational biology corresponds to

sequences of nucleotides corresponding to the primary structure of DNA and RNA or

sequences of amino acids corresponding to the primary structure of proteins. Therefore

the problem of inferring knowledge from this data belongs to the broader class of

sequence analysis problems.

Two of the most studied sequence analysis problems are speech recognition and

language processing. Biological sequences have the same left-to-right linear aspect as

sequences of sounds corresponding to speech and sequences of words representing

language. Consequently, the major computational biology sequence analysis problems

can be mapped to linguistic problems. A common linguistic metaphor in computational

biology is that of protein family classification as speech recognition. The metaphor

suggests interpreting different proteins belonging to the same family as different

vocalizations of the same word. Another metaphor is gene finding in DNA sequences as

the parsing of language into words and semantically meaningful sentences. It follows that

biological sequences can be treated linguistically with the same techniques used for

speech recognition and language processing.

Since the theory of HMMs was formalized in the late 1960s, several scientists

have applied the theory to speech recognition and language processing. Just as HMMs

were first introduced as mathematical models of language, HMMs can be used as

mathematical models of molecular processes and biological sequences. In addition,

HMMs have been applied to linguistics because they are suited for problems where the

exact theory may be unknown but where there exist large amounts of data and

knowledge derived from observation. As this is also the situation in biology, HMM-based

approaches have been successfully applied to problems in computational biology. The

main benefit is that an HMM provides a good method for learning the theory from the

data and can provide a structured model of sequence data and molecular processes.

1.3.6 Application of HMMs to Specific Problems

It is clear that an HMM-based approach is a logical idea for tackling problems in

computational biology. Much work has been performed and applications have been built

using such an approach.

Baldi and Brunak [26] define three main groups of problems in computational biology for

which HMMs have been proven especially useful.

First, HMMs can be used for multiple alignments of DNA sequences, which is a

difficult task to perform using a naive dynamic programming approach. Second, the

structure of trained HMMs can uncover patterns in biological data. Such patterns have

been used to discover periodicities within specific regions of the data and to help predict

regions in sequences prone to forming specific structures. Third is the large set of

classification problems. HMM based approaches have been applied to structure

prediction - the problem of classifying each nucleotide according to which structure it belongs to. HMMs have also been used in protein profiling to discriminate between different protein families and predict a new protein's family or subfamily. HMM-based approaches have also been successful when applied to the problem of methylation function finding, that is, predicting the function of methylated regions according to the role they play. The remainder of this thesis is concerned with this last problem of computational methylation function prediction.

Chapter 2: Methods and Algorithms

2.1 The Probabilistic Model

The probabilistic model is based on a multivariate instance of a Hidden Markov Model (HMM). The model assumes a fixed number of hidden states N. In each hidden state, the emission distribution, that is, the probability distribution over each combination of marks, is modeled with a product of independent Bernoulli random variables. Formally, for each of the N states and M = 4096 (i.e. $2^{12}$) input marks, there is an emission parameter $b_{k,m}$ denoting the probability in state k (k = 1,…,N) that input mark m (m = 1,…,M) has a present call. Let $c \in C$ denote a chromosome, where C is the set of all chromosomes. Let t denote an interval on the genome, where t = 1,…,T corresponds to the 1000 bp intervals on the genome and T is the number of intervals into which the genome was divided. So $v_{c,t,m}$ is assigned '1' if mark m is detected in the t-th 1000 bp interval on chromosome c, and '0' otherwise. Let $a_{ij}$ denote the probability of transitioning from state i to state j, where i = 1,…,N and j = 1,…,N.

We also have parameters $\pi_i$ (i = 1,…,N), which denote the probability that the state of the first interval on the chromosome is i. Let $s_c \in S_c$ be an unobserved state sequence through chromosome c and $S_c$ be the set of all possible state sequences. Let $s_{c,t}$ denote the unobserved state on chromosome c at location t for state sequence $s_c$. The full likelihood of all of the observed data D for the parameters a, b, and π can then be expressed as:

$$P(D \mid a, b, \pi) = \prod_{c \in C} \sum_{s_c \in S_c} \pi_{s_{c,1}} \left( \prod_{t=2}^{T} a_{s_{c,t-1},\, s_{c,t}} \right) \prod_{t=1}^{T} \prod_{m=1}^{M} b_{s_{c,t},m}^{\,v_{c,t,m}} \left(1 - b_{s_{c,t},m}\right)^{1 - v_{c,t,m}} \qquad \text{(E2.1)}$$

2.2 Baum-Welch Algorithm

In electrical engineering, computer science, statistical computing and

bioinformatics, the Baum–Welch algorithm is used to find the unknown parameters of a

hidden Markov model (HMM). It makes use of the forward-backward algorithm and is

named for Leonard E. Baum and Lloyd R. Welch.

A Hidden Markov Model is a probabilistic model of the joint probability of a collection of random variables $\{O_1, \ldots, O_T, Q_1, \ldots, Q_T\}$. The $O_t$ variables are discrete observations and the $Q_t$ variables are “hidden” and discrete. Under an HMM, two conditional independence assumptions are made about these random variables that make the associated algorithms tractable. These independence assumptions are:

1. The t-th hidden variable, given the (t-1)-st hidden variable, is independent of previous variables, or:

$$P(Q_t \mid Q_{t-1}, O_{t-1}, \ldots, Q_1, O_1) = P(Q_t \mid Q_{t-1}) \qquad \text{(E2.2)}$$

2. The t-th observation depends only on the t-th state:

$$P(O_t \mid Q_T, O_T, \ldots, Q_1, O_1) = P(O_t \mid Q_t) \qquad \text{(E2.3)}$$

In the following, we present the EM algorithm for finding the maximum-likelihood

estimate of the parameters of a hidden Markov model given a set of observed feature

vectors. This algorithm is also known as the Baum-Welch algorithm.

$Q_t$ is a discrete random variable with N possible values $\{S_1, \ldots, S_N\}$. We further assume that the underlying “hidden” Markov chain defined by $P(Q_t \mid Q_{t-1})$ is time-homogeneous (i.e., is independent of the time t). Therefore, we can represent $P(Q_t \mid Q_{t-1})$ as a time-independent stochastic transition matrix

$$A = \{a_{ij}\}, \qquad a_{ij} = P(Q_t = S_j \mid Q_{t-1} = S_i) \qquad \text{(E2.4)}$$

The special case of time t = 1 is described by the initial state distribution

$$\pi_i = P(Q_1 = S_i) \qquad \text{(E2.5)}$$

We say that we are in state j at time t if $Q_t = S_j$. A particular sequence of states is described by $Q = (Q_1, \ldots, Q_T)$, where $Q_t \in \{S_1, \ldots, S_N\}$ is the state at time t.

The observation is one of M possible observation symbols, $O_t \in \{o_1, \ldots, o_M\}$. The probability of a particular observation at a particular time t for state j is described by:

$$b_j(o_t) = P(O_t = o_t \mid Q_t = S_j) \qquad \text{(E2.6)}$$

so $B = \{b_{ij}\}$ is an N by M matrix.

A particular observation sequence O is described as $O = (O_1 = o_1, \ldots, O_T = o_T)$. Therefore, we can describe an HMM by $\lambda = (A, B, \pi)$. Given an observation sequence O, the Baum-Welch algorithm finds:

$$\lambda^{*} = \arg\max_{\lambda} P(O \mid \lambda) \qquad \text{(E2.7)}$$

that is, the HMM λ that maximizes the probability of the observation sequence O.

Initialization: set $\lambda = (A, B, \pi)$ with random initial conditions. The algorithm updates the parameters of λ iteratively until convergence, following the procedure below.

The forward procedure: We define

$$\alpha_i(t) = P(O_1 = o_1, \ldots, O_t = o_t, Q_t = S_i \mid \lambda) \qquad \text{(E2.8)}$$

which is the probability of seeing the partial sequence $o_1, \ldots, o_t$ and ending up in state $S_i$ at time t. We can efficiently calculate $\alpha_i(t)$ recursively as:

1. $\alpha_i(1) = \pi_i \, b_i(o_1)$ (E2.9)

2. $\alpha_j(t+1) = b_j(o_{t+1}) \sum_{i=1}^{N} \alpha_i(t) \, a_{ij}$ (E2.10)

The backward procedure: We define $\beta_i(t) = P(O_{t+1} = o_{t+1}, \ldots, O_T = o_T \mid Q_t = S_i, \lambda)$. This is the probability of the ending partial sequence $o_{t+1}, \ldots, o_T$ given that we started at state i at time t. We can efficiently calculate $\beta_i(t)$ as:

1. $\beta_i(T) = 1$ (E2.11)

2. $\beta_i(t) = \sum_{j=1}^{N} \beta_j(t+1) \, a_{ij} \, b_j(o_{t+1})$ (E2.12)

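Using α and β, we can calculate the following variables:

$$\gamma_i(t) = P(Q_t = S_i \mid O, \lambda) = \frac{\alpha_i(t)\,\beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\,\beta_j(t)} \qquad \text{(E2.13)}$$

$$\xi_{ij}(t) = P(Q_t = S_i, Q_{t+1} = S_j \mid O, \lambda) = \frac{\alpha_i(t)\, a_{ij}\, \beta_j(t+1)\, b_j(o_{t+1})}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i(t)\, a_{ij}\, \beta_j(t+1)\, b_j(o_{t+1})} \qquad \text{(E2.14)}$$

Having γ and ξ, we can define update rules as follows:

$$\pi'_i = \gamma_i(1) \qquad \text{(E2.15)}$$

$$a'_{ij} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)} \qquad \text{(E2.16)}$$

$$b'_i(k) = \frac{\sum_{t=1}^{T} \delta_{o_t, o_k}\, \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)} \qquad \text{(E2.17)}$$

(note that the summation in the numerator of $b'_i(k)$ is only over observed symbols equal to $o_k$).

Using the updated values of A, B and π, a new iteration is performed until convergence.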
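As an illustration of the procedure above, the following is a minimal sketch of one Baum-Welch iteration for a single discrete observation sequence, written in plain Python. It is not the program used in this thesis: the function names (`forward`, `backward`, `baum_welch_step`) are hypothetical, observations are assumed to be 0-based symbol indices, and no scaling is applied, so it would underflow on genome-length sequences unless probabilities are rescaled or kept in log space.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def forward(A: Matrix, B: Matrix, pi: Sequence[float], obs: Sequence[int]) -> Matrix:
    """alpha[t][i] = P(o_1..o_t, Q_t = S_i | lambda), following E2.9 and E2.10."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([B[j][obs[t]] * sum(prev[i] * A[i][j] for i in range(n))
                      for j in range(n)])
    return alpha


def backward(A: Matrix, B: Matrix, obs: Sequence[int]) -> Matrix:
    """beta[t][i] = P(o_{t+1}..o_T | Q_t = S_i, lambda), following E2.11 and E2.12."""
    n, T = len(A), len(obs)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n))
    return beta


def baum_welch_step(A: Matrix, B: Matrix, pi: Sequence[float],
                    obs: Sequence[int]) -> Tuple[Matrix, Matrix, List[float], float]:
    """One re-estimation step (E2.13 to E2.17); also returns P(O | lambda) before the update."""
    n, m, T = len(A), len(B[0]), len(obs)
    alpha = forward(A, B, pi, obs)
    beta = backward(A, B, obs)
    # sum_j alpha_j(t) * beta_j(t) equals P(O | lambda) for every t, so it is computed once.
    likelihood = sum(alpha[T - 1])
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = list(gamma[0])
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    # The numerator only sums over time steps where symbol k was observed.
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_A, new_B, new_pi, likelihood
```

Calling `baum_welch_step` repeatedly and feeding each output back in as the next input gives the iterative procedure; the returned likelihood should not decrease from one call to the next.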

2.3 Work Flow

Fig2.1 A Broad overview of the HMM work-flow, highlighting the most significant

inputs, transformations, and outputs at each step from start to end.

Detailed description of the work flow:

Data sources and data processing: discussed in the next chapter.

HMM training: We first randomly initialized the parameters: the initial state distribution (π), the initial transition matrix (A), and the initial emission matrix (B).

We then trained the model with different numbers of states, from 11 to 26, using the Baum-Welch algorithm. During training we also applied a control step to keep the process well behaved. For example, at each iteration, we calculate the values of A and B with

  E2.17

and

  E2.18

This step smooths the transition and emission matrices. In our program, based on the number of states and the number of possible observations in each state, we set α to 0.000001.
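One way such smoothing is commonly done is additive (pseudocount) smoothing, in which a small constant is added to every entry and each row is renormalized so that no probability is exactly zero. The sketch below shows that form only as an assumption for illustration; it may differ from the formulas actually used in this thesis, and the function name `smooth_rows` is hypothetical.

```python
from typing import List


def smooth_rows(matrix: List[List[float]], alpha: float = 1e-6) -> List[List[float]]:
    """Add a pseudocount alpha to every entry and renormalize each row to sum to 1.

    This additive form is an assumed illustration, not the thesis's own formula.
    """
    smoothed = []
    for row in matrix:
        total = sum(row) + alpha * len(row)
        smoothed.append([(p + alpha) / total for p in row])
    return smoothed
```

Applied to a transition or emission matrix after each re-estimation, this keeps every row a valid probability distribution while preventing zero entries from becoming permanent.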

For the termination conditions, we use two thresholds.

The first is delta, the difference between the previous LLR (log-likelihood ratio) and the current LLR. At every iteration, we calculate the LLR for the current iteration, compare it with the previous LLR, and obtain the value of delta. If delta is less than or equal to a threshold (here 0.001), we consider the training sufficient and stop.

The second is the number of iterations. We count iterations throughout the training process; while the count is below the threshold (here 300) we continue training, and otherwise we abort.

At the end of every iteration, we check both thresholds; if either the delta threshold or the iteration limit is reached, we stop the training process and output the training results.
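The two stopping rules can be sketched as a small training loop. This is an illustrative sketch only: `train_until_converged` and the `step` callback are hypothetical names, and `step` is assumed to run one training iteration and return the current log-likelihood value.

```python
from typing import Callable, Optional, Tuple


def train_until_converged(step: Callable[[], float],
                          tol: float = 1e-3,
                          max_iter: int = 300) -> Tuple[float, int]:
    """Run step() until the change in its return value is <= tol or max_iter is reached."""
    previous: Optional[float] = None
    current = float("-inf")
    iterations = 0
    while iterations < max_iter:
        current = step()
        iterations += 1
        # Stop when the change between successive iterations is small enough.
        if previous is not None and abs(current - previous) <= tol:
            break
        previous = current
    return current, iterations
```

With the defaults above, the loop mirrors the thresholds stated in the text: a delta of 0.001 and at most 300 iterations.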

We trained our HMM with different numbers of states, from 11 to 26, and computed the Bayesian information criterion (BIC) for each model. The BIC is an asymptotic result derived under the assumption that the data distribution is in the exponential family.

The formula for the BIC is:

$$-2 \ln p(x \mid k) \approx \mathrm{BIC} = -2 \ln L + k \ln(n) \qquad \text{(E2.19)}$$

where

x is the observed data;

n is the number of data points in x, the number of observations, or equivalently, the sample size;

k is the number of free parameters to be estimated (if the estimated model is a linear regression, k is the number of regressors, including the intercept);

p(x|k) is the probability of the observed data given the number of parameters, or the likelihood of the parameters given the dataset;

and L is the maximized value of the likelihood function for the estimated model.
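The BIC computation can be sketched in a few lines. The function names are hypothetical, and the parameter count used here (free emission, transition, and initial-state probabilities of a discrete HMM) is an assumption rather than something stated in the text, although it appears consistent with the values reported in Table 4.1.

```python
import math


def num_free_parameters(n_states: int, n_symbols: int) -> int:
    """Assumed count: N(M-1) emission + N(N-1) transition + (N-1) initial parameters."""
    return n_states * (n_symbols - 1) + n_states * (n_states - 1) + (n_states - 1)


def bic(log_likelihood: float, k: int, n: int) -> float:
    """BIC = -2 ln L + k ln(n), where log_likelihood is ln L."""
    return -2.0 * log_likelihood + k * math.log(n)
```

For example, `bic(-4257409, num_free_parameters(11, 4096), 3070531)` evaluates to roughly 9,189,464, and the model with the lowest BIC is preferred.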

Chapter 3: Data Processing

3.1 Data Sets:

We used 36 cell lines as our training datasets: 33 breast cancer cell lines, plus the H1, HCT116 and CD4+ T cell lines.

a) 33 breast cancer cell lines:

Breast cell lines were procured through the Integrative Cancer Biology Program (ICBP) of the National Cancer Institute (Neve et al., 2006) [27].

b) H1 cell line:

The H1 MBD-seq data used in this thesis are from R. Alan Harris et al. (Nature Biotechnology, 2010) [28].

c) HCT116:

The HCT116 MBD-seq data used in this thesis are from David Serre et al. (Nucleic Acids Research, 2010) [4].

d) CD4+ T cell:

The CD4+ T cell MBD-seq data used in this thesis are from Jung K. Choi et al. (Genome Biology, 2009) [29].

3.2 MBD-seq Protocol

Genomic DNA was isolated with the QIAamp DNA Mini Kit (Qiagen) following the manufacturer's protocol. Genomic DNA of breast cell lines was procured through the Integrative Cancer Biology Program (ICBP) of the National Cancer Institute.

MBDCap-seq, mapping and normalization. Methylated DNA was eluted with the MethylMiner Methylated DNA Enrichment Kit (Invitrogen) according to the manufacturer's instructions. Briefly, one microgram of genomic DNA was sonicated and captured by MBD proteins. The methylated DNA was eluted in 1 M salt buffer. DNA in each eluted fraction was precipitated with glycogen, sodium acetate and ethanol, and was resuspended in TE buffer. Eluted DNA was used to generate libraries for sequencing following the standard protocols from Illumina. MBDCap-seq libraries were sequenced using the Illumina Genome Analyzer II as per the manufacturer's instructions. Image analysis and base calling were performed with the standard Illumina pipeline. Sequencing reads were mapped with the ELAND algorithm. Unique reads were reads of up to 36 base pairs mapped to the human reference genome (hg18) with up to two mismatches. Reads in satellite regions were excluded due to the large number of amplifications.

3.3 Data Preprocess

The original data for the 33 breast cancer cell lines are in the export format described in Appendix C, the H1 data are in BAM format, described in Appendix A, and the HCT116 and CD4+ T cell data are in FASTQ format, described in Appendix E. For the export files (33 breast cancer cell lines), we first divide the reads into three groups:

Group 1: uniquely matched reads (only one sequence on the genome matches the read)

Group 2: multiple- or non-matched reads (several sequences, or none, match the read)

Group 3: QC (quality control) reads (the read itself does not meet the quality requirements)

Then, for the reads in group 2, we use a tool (Lonut) to find the most likely matches and add them to the reads from group 1 to obtain the reads used in the rest of the analysis (Total Reads After Processing).

For the BAM file (H1 cell line), we first use the tool SAMtools to convert it from BAM format to SAM format, since a BAM file is binary. Because there are no QC reads in the SAM file, we divide the reads into only two groups, group 1 and group 2, defined as for the export files.

For the FASTQ files (HCT116 and CD4+ T cell), we first use the software Bowtie to map the sequences to the genome and then follow the same steps as for the 33 breast cancer cell lines.

The outputs of all the processes above are in BED format (Appendix D).

Table 3.1 Data summary for 36 cell lines

Cell line ID | Raw Reads | Unique Matched Reads | Not Unique Matched Reads | Total Reads After Process
BrCa-02(AU565) | 38,389,113 | 21,757,417 | 11,268,129 | 33,025,546
BrCa-03(BT549) | 38,607,423 | 24,343,702 | 9,151,298 | 33,495,000
BrCa-06(HCC1569) | 33,243,637 | 17,790,745 | 11,032,912 | 28,823,657
BrCa-07(HCC1937) | 32,664,695 | 17,761,936 | 10,746,815 | 28,508,751
BrCa-08(HCC2185) | 40,922,132 | 22,424,765 | 11,505,148 | 33,929,913
BrCa-09(HCC70) | 42,112,586 | 24,613,958 | 11,832,051 | 36,446,009
BrCa-10(LY2) | 38,858,773 | 23,020,926 | 11,294,571 | 34,315,497
BrCa-11(MCF-7) | 43,128,546 | 24,876,183 | 12,608,935 | 37,485,118
BrCa-12(MDAMB-231) | 36,495,183 | 22,767,185 | 8,963,014 | 31,730,199
BrCa-14(MDAMB-468D) | 46,932,495 | 25,467,786 | 14,656,101 | 40,123,887
BrCa-15(SUM149PT) | 36,129,334 | 21,592,142 | 10,491,546 | 32,083,688
BrCa-16(SUM225CWN) | 27,600,744 | 17,390,015 | 7,502,375 | 24,892,390
BrCa-20(BT20) | 38,329,851 | 21,775,872 | 11,679,619 | 33,455,491
BrCa-25(HCC1954) | 38,223,154 | 21,961,680 | 10,936,018 | 32,897,698
BrCa-28(MCF10A) | 47,587,907 | 27,946,727 | 12,391,220 | 40,337,947
BrCa-32(SKBR3) | 41,094,509 | 24,365,279 | 10,698,249 | 35,063,528
BrCa-33(SUM159PT) | 43,752,391 | 25,433,158 | 11,726,940 | 37,160,098
BrCa-38(BT474) | 46,247,881 | 27,613,327 | 11,417,755 | 39,031,082
BrCa-40(HCC1143) | 34,178,168 | 20,315,316 | 9,549,005 | 29,864,321
BrCa-41(HCC1428) | 33,877,849 | 21,196,882 | 8,922,003 | 30,118,885
BrCa-43(HCC202) | 34,308,392 | 20,183,614 | 9,788,109 | 29,971,723
BrCa-44(HCC3153) | 30,572,196 | 16,910,888 | 9,429,028 | 26,339,916
BrCa-49(MDAMB436) | 37,982,516 | 22,680,715 | 10,033,312 | 32,714,027
BrCa-51(SUM185PE) | 28,071,302 | 16,021,760 | 7,372,317 | 23,394,077
BrCa-55(600MPE) | 29,989,492 | 15,772,575 | 8,997,509 | 24,770,084
BrCa-59(HCC1500) | 33,512,194 | 21,689,325 | 7,297,771 | 28,987,096
BrCa-63(HS578T) | 29,774,913 | 19,954,550 | 6,690,775 | 26,645,325
BrCa-64(MCF12A) | 39,816,671 | 19,667,828 | 14,407,417 | 34,075,245
BrCa-65(MDAMB175VII) | 36,527,805 | 19,117,034 | 11,991,384 | 31,108,418
BrCa-67(MDAMB453) | 34,634,234 | 18,920,039 | 11,290,635 | 30,210,674
BrCa-68(SUM1315MO2) | 34,112,917 | 20,891,597 | 9,163,182 | 30,054,779
BrCa-70(SUM52PE) | 32,326,562 | 19,320,660 | 9,731,807 | 29,052,467
T47D | 43,275,135 | 26,119,748 | 10,784,904 | 36,904,652
HCT116 | 19,041,613 | 4,906,885 | 3,615,483 | 8,522,368
Tcell | 71,725,143 | 21,645,848 | 18,120,978 | 39,766,826
(unlabeled) | 59,618,003 | 30,139,685 | 174,794 | 30,314,479

Fig 3.1 Bar figure for 36 cell lines

3.4 Input for HMM

First we divided the whole genome into 1000-base-pair non-overlapping intervals, within which we independently made a call as to whether each of the 36 marks was detected as present or not, based on the count of tags mapping to the interval. Each tag was uniquely assigned to one interval based on the location of the 5' end of the tag after applying a shift of 500 bases in the 5' to 3' direction of the tag (mid-point). The threshold, t, for each mark was based on the total number of mapped reads for the mark, and was set to the smallest integer t such that $P(X > t) < 10^{-4}$, where X is a random variable with a Poisson distribution whose mean parameter is set to the empirical mean of the number of tags per interval. In each cell line, if a mark is detected in an interval, we assign '1' to this interval; otherwise, we assign '0' to it.
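The threshold rule can be illustrated with a short sketch that accumulates the Poisson distribution until the upper tail drops below the cutoff. The function name `poisson_threshold` is hypothetical, and the direct accumulation assumes a moderate mean (the term exp(-mean) underflows for very large means, so those are rejected).

```python
import math


def poisson_threshold(mean: float, p_cutoff: float = 1e-4) -> int:
    """Return the smallest integer t with P(X > t) < p_cutoff for X ~ Poisson(mean)."""
    if mean <= 0.0 or mean > 700.0:
        raise ValueError("mean must be positive and moderate for this direct summation")
    pmf = math.exp(-mean)  # P(X = 0)
    cdf = pmf
    t = 0
    while 1.0 - cdf >= p_cutoff:
        t += 1
        pmf *= mean / t    # P(X = t) from P(X = t - 1)
        cdf += pmf
    return t
```

Here `mean` would be the empirical mean number of tags per 1000 bp interval for one cell line, so each cell line gets its own threshold.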

Also we group the 36 cell lines into 12 groups based on some mechanism/factors

(e.g. Gene Cluster, ER+/-, PR+/- and HER2 expression) as follows:

Table 3.2 12 groups for 36 cell lines

Cell lines | Gene Cluster | HER2 | group (1 for +, 2 for -) | Groups
BT474 | | 11.9994 | 111 |
MCF10A | BaB | 6.837 | 222 |
MCF12A | BaB | 7.226 | 222 |
HCC1428 | | 7.6065 | 112 |
MCF7 | | 8.4522 | 112 |
T47D | | 8.2666 | 112 |
600MPE | | 9.2756 | 122 |
LY2 | | 6.9903 | 122 |
MDAMB175 | | 9.2384 | 122 |
SUM52PE | | 7.6287 | 122 |
AU565 | | 12.1189 | 221 |
HCC202 | | 12.1056 | 221 |
SKBR3 | | 11.5751 | 221 |
HCC1569 | BaA | 11.7554 | 221 |
HCC1954 | BaA | 11.5082 | 221 |
SUM225CWN | BaA | 12.9908 | 221 |
HCC2185 | | 9.3429 | 222 |
MDAMB453 | | 10.172 | 222 |
SUM185PE | | 8.3417 | 222 |
BT20 | BaA | 7.7677 | 222 |
HCC1143 | BaA | 8.7032 | 222 |
HCC1937 | BaA | 7.7687 | 222 |
HCC3153 | BaA | 8.9164 | 222 |
HCC70 | BaA | 7.9334 | 222 |
MDAMB468 | BaA | 7.0596 | 222 |
BT549 | BaB | 7.0328 | 222 |
HCC1500 | BaB | 7.2479 | 222 |
HS578T | BaB | 6.6301 | 222 |
MDAMB231 | BaB | 6.5633 | 222 |
MDAMB436 | BaB | 6.9034 | 222 |
SUM1315 | BaB | 6.9018 | 222 |
SUM149PT | BaB | 6.5676 | 222 |
SUM159PT | BaB | 7.3181 | 222 |
HCT116 | | | |
Tcell | | | |

Then we need to decide whether an interval is hyper-methylated in each group. To do so, we count the number of marks in each interval for every group. If at least one mark is present in an interval for a group, we say this group is hyper-methylated in this interval. Again, for each group, if an interval is hyper-methylated we assign it '1', otherwise '0'.

Then, for the input of the HMM, we combine the 12 groups as follows. We denote the observation value for interval t as v_t (t = 1, …, T) and calculate the value of v_t in the following steps:

1. Initialize v_t: set v_t = 0.

2. Put the 12 groups in increasing order, from group 1 to group 12.

3. Going through the 12 groups in order, for each group, if the corresponding interval is 1, then v_t = 2·v_t + 1; otherwise, v_t = 2·v_t.

4. After going through the 12 groups, add 1 to v_t, that is, v_t = v_t + 1, so that v_t takes values from 1 to 4096.

Based on this encoding, we can decode the value of an interval by writing v_t − 1 as a binary number of the form

$$v_t - 1 = \sum_{i=1}^{12} a_i \, 2^{12-i} \qquad \text{(E3.1)}$$

where $a_i \in \{0, 1\}$ and $i \in \{1, \ldots, 12\}$.

If $a_i = 1$, we know that group i in this interval is 1 (hyper-methylated).

Thus, for v_t = 1, none of the 12 groups is 1, since 1 − 1 = 0.

For v_t = 169, the interval is a combination of 3 groups (5, 7, 9) with value 1 (hyper-methylated), since

$169 - 1 = 128 + 32 + 8 = 2^{12-5} + 2^{12-7} + 2^{12-9}$
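The encoding and decoding steps can be sketched as two small functions. The names `encode_interval` and `decode_interval` are hypothetical; group 1 is treated as the most significant bit, as the steps above imply.

```python
from typing import List, Sequence


def encode_interval(group_calls: Sequence[int]) -> int:
    """Combine per-group 0/1 calls (group 1 first) into one observation value v_t."""
    v = 0
    for call in group_calls:
        v = 2 * v + (1 if call else 0)
    return v + 1  # shift so that values run from 1 to 2**len(group_calls)


def decode_interval(v: int, n_groups: int = 12) -> List[int]:
    """Return the group numbers (1-based) whose bit a_i is 1 in v - 1."""
    bits = v - 1
    return [i for i in range(1, n_groups + 1) if (bits >> (n_groups - i)) & 1]
```

For instance, `decode_interval(169)` recovers groups 5, 7 and 9, and encoding calls with only those three groups set to 1 gives back 169.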

3.5 Methylation Distribution Overview

To get an overview of the DNA methylation distribution, we modified a web tool developed by Brian to visualize the data.

To interpret the states (the training result), we also have to relate our results to the source data. We first remap our intervals to the genome (hg18) to obtain their genomic positions, and then correlate these remapped intervals with genes (hg18). In this way we can divide the intervals into different regions based on their distance to the gene 5' end. Since it is not necessary to include gene desert regions, we also filter out regions that are farther than 100 kb from the gene body. In addition, because different genes have gene bodies of different lengths, we choose a fixed distance (2 kbp) from both the 5' and 3' ends within the gene body, and all intervals have a length of 2 kbp.

Fig 3.2 Methylation distribution for 33 breast cancer cell lines

3.6 Gene Expression Data

Increasing evidence reveals a role of methylation in the interaction of environmental factors with gene expression. Differences in maternal care during the first 6 days of life in the rat induce differential methylation patterns in some promoter regions and thus influence gene expression [30].

We also correlated the results with the gene expression data (Richard M. Neve et al., 2006) [27]. The gene expression data files are in *.cel format, so we used an R package to transform them into readable files. We then merged them into a single file and performed one-way hierarchical clustering (clustering the genes) for the 33 breast cancer cell lines. To make the meanings of the states easier to find, we ordered the cell lines by subgroup, so that we can see directly which cluster of genes is correlated with a particular subgroup.


Chapter 4: Results and Discussion

4.1 Results from HMM

We trained our HMM with different numbers of states, from 11 to 26, and obtained the following BIC values:

Table 4.1 BIC results for HMM results

# states | L (log ratio) | k | n | BIC
11 | -4,257,409 | 45,165 | 3,070,531 | 9,189,464
12 | -4,256,362 | 49,283 | 3,070,531 | 9,248,882
13 | -4,254,518 | 53,403 | 3,070,531 | 9,306,736
14 | -4,265,865 | 57,525 | 3,070,531 | 9,391,002
15 | -4,212,430 | 61,649 | 3,070,531 | 9,345,733
16 | -4,195,762 | 65,775 | 3,070,531 | 9,374,029
17 | -4,223,164 | 69,903 | 3,070,531 | 9,490,494
18 | -4,247,955 | 74,033 | 3,070,531 | 9,601,768
19 | -4,162,556 | 78,165 | 3,070,531 | 9,492,691
20 | -4,188,262 | 82,299 | 3,070,531 | 9,605,854
21 | -4,189,774 | 86,435 | 3,070,531 | 9,670,659
22 | -4,158,646 | 90,573 | 3,070,531 | 9,670,214
23 | -4,161,580 | 94,713 | 3,070,531 | 9,737,922
24 | -4,167,498 | 98,855 | 3,070,531 | 9,811,629
25 | -4,151,067 | 102,999 | 3,070,531 | 9,840,667
26 | -4,145,348 | 107,145 | 3,070,531 | 9,891,160

Because the BIC is lowest for the model with 11 states, we choose that model.

The training results for the model with 11 states (transition and emission matrices) follow.

The transition matrix is 11×11, since there are 11 states in total, and it gives the probability of moving from each state to every possible state.

The result is as follows:

Table 4.2 Transition Matrix

Here the rows are the states in which a transition starts and the columns are the states to which it moves; each cell in the main table is the probability of a transition from the from-state (the row number) to the to-state (the column number).

The corresponding heatmap:

Fig 4.1 Heatmap for transition matrix

From the transition matrix, we can see that most states have a very high probability of transitioning to themselves, except states 1, 4 and 10. Biologically this is reasonable: for intervals across the whole genome, if the current interval is methylated (or not methylated), the next interval is very likely to be methylated (or not methylated) as well. Some states are more likely to transition to other states; these probably occur mostly in intervals whose next interval differs from them (from a methylated interval to a non-methylated interval, or from a non-methylated interval to a methylated interval). States 1, 4 and 10 individually do not have very high probabilities of transitioning to themselves, but when treated as a group, the group still has a very high probability of transitioning to itself. We therefore suggest treating them as a group in future analysis.

The emission matrix is 11 × 4096, which is too large to present here. Nor do we need the detailed observation probabilities of every combination of marks; what we need is the probability that each group occurs in each state, so we add up the probabilities of all observations that contain a particular group.
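The summation over observations can be sketched as follows, assuming the observation encoding of Section 3.4 (symbol v corresponds to the bit pattern of v − 1, with group 1 as the most significant bit). The function name and the list layout `emission_row[v - 1] = P(v | state)` are assumptions made for illustration.

```python
from typing import Sequence


def group_emission_probability(emission_row: Sequence[float],
                               group: int,
                               n_groups: int = 12) -> float:
    """Sum P(v | state) over all symbols v whose bit for `group` is 1.

    emission_row[index] is assumed to hold the probability of symbol v = index + 1.
    """
    shift = n_groups - group
    return sum(p for index, p in enumerate(emission_row) if (index >> shift) & 1)
```

Applying this to every state and every group would turn the 11 × 4096 emission matrix into an 11 × 12 table of per-group probabilities.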

The results are as follows:

Table 4.3 Emission probabilities for each mark in each state

Here the rows are the states and the columns are the marks; each cell in the main table is the probability that a mark (the column number) is present in the corresponding state (the row number).

To give a clearer view of the emission probabilities, we ordered the probabilities within each state; the following tables list the marks in decreasing order and the corresponding probabilities:

Table 4.4 Ordered emission probabilities for each mark in each state-mark

Table 4.5 Ordered emission probabilities for each mark in each state- probabilities

Furthermore, a large part of the whole genome consists of non-methylated intervals, which we do not want to consider here since we want to focus on the methylated intervals. We therefore apply a threshold of 0.08 to the observation probability of each state, which leaves 8 states above the threshold and 3 states below it.

Table 4.6 Filtered ordered emission probabilities for each mark in each state- marks

From the table above, we can see that the 3 states below the threshold are exactly the ones that do not have very high probabilities of transitioning to themselves. These 3 states are also quite similar to each other: they are all dominated by 3 marks (7, 9, 10), followed by the other marks in very similar orders. Some states are dominated by non-DNA-methylation-related cell lines (e.g. state 7, which is dominated by groups 10 and 12, the H1 stem cell and CD4+ T cell).

4.2 Biology Meanings

After obtaining the states and their features, we have to assign them biological meanings by correlating them with other data.

We correlated the results above with the gene expression data for the 33 breast cancer cell lines, since increasing evidence reveals a role of methylation in the interaction of environmental factors with gene expression.

We performed one-way hierarchical clustering on the 12113 genes obtained for the 33 breast cancer cell lines (Richard M. Neve et al., 2006) [27], as described in the data processing chapter.

4.2.1 Gene Expression results for 33 breast cancer cell lines

Fig 4.2 33 Breast Cancer Cell Gene Expression One-Way Hierarchy Clustering

We further applied a threshold to the clustering results and obtained the following:

Fig 4.3 Grouped 33 Breast Cancer Cell Gene Expression One-Way Hierarchy Clustering

4.2.2 Results Based on Different Clusters

From the results above, we can see that the genes were divided into 9 clusters, which are annotated in 9 different colors. From top to bottom, we denote them cluster 1 to cluster 9. The figure above shows some very interesting features, but we discuss them together with the methylation distribution in these clusters. Based on the clustering results, we obtain the genes in each cluster.

Table 4.7 Number of genes in each cluster

clusters | # genes
1 | 1,362
2 | 860
3 | 1,098
4 | 1,274
5 | 800
6 | 4,010
7 | 1,211
8 | 866
9 | 632

Then, for each group, we correlated the methylated intervals to the genome based on their distances to the nearest gene in each cluster, as follows.

We first remap each methylated interval onto the genome as the distance from its midpoint to the 5'TSS or 3'TSS of its nearest gene. We do this for all intervals and call this process find-region. We then count the number of reads located in each 2 kb interval from -100 kb to 4 kb relative to the 5'TSS and from -4 kb to 100 kb relative to the 3'TSS. After that, we plot the distribution of methylated intervals for each group in each cluster and look for the relationship between gene expression and methylation.

Fig 4.4 Methylation distribution based on cluster 1 genes

Here we can see that group 3 and group 6 are highly methylated in the gene body region; from the clustering results, group 3 and group 6 also have high gene expression, which makes sense since methylation in the gene body can up-regulate gene expression. Group 8 and group 9 have low methylation in the 5'TSS promoter region (5TSS-2), which also makes sense since methylation in the promoter region can repress gene expression.

[Plot panel "cluster1": x-axis in 2 kb bins from 5TSS-50 to 3TSS+47; y-axis ticks at 0.5, 1.5, 2.5]

Fig 4.5 Methylation distribution based on cluster 2 genes

Here group 1 is highly methylated in the gene body region (5TSS+2) and has high gene expression.

Fig 4.6 Methylation distribution based on cluster 3 genes

Here group 2 is highly methylated in the gene body region (3TSS+1) and has high gene expression. Group 8 has low methylation in the 5TSS promoter region (5TSS-4) and has high gene expression.

[Plot panel "cluster2": x-axis in 2 kb bins from 5TSS-50 to 3TSS+47; y-axis ticks at 0.5, 1.5, 2.5]

[Plot panel "cluster3": x-axis in 2 kb bins from 5TSS-50 to 3TSS+47; y-axis ticks at 0.5, 1.5, 2.5]

Fig 4.7 Methylation distribution based on cluster 4 genes

Here group 2 is highly methylated in the 5TSS promoter region (5TSS-1) and has low gene expression. Group 7 has low methylation in the 5TSS promoter region (5TSS-2) and has high gene expression.

Fig 4.8 Methylation distribution based on cluster 5 genes

[Plot panel "cluster4": x-axis in 2 kb bins from 5TSS-50 to 3TSS+47; y-axis ticks at 0.5, 1.5, 2.5]

[Plot panel "cluster5": x-axis in 2 kb bins from 5TSS-50 to 3TSS+47; y-axis ticks at 0.5, 1.5, 2.5]

Here group 4 is highly methylated in the gene body region (5TSS+1) and has high gene expression. Group 8 has low methylation in the 5TSS promoter region (5TSS-4) and has high gene expression.

Fig 4.9 Methylation distribution based on cluster 6 genes

Here group 2 is highly methylated in the 5TSS promoter region (5TSS-1) and has low gene expression.

[Plot panel "cluster6": x-axis in 2 kb bins from 5TSS-50 to 3TSS+47; y-axis ticks at 0.5, 1.5, 2.5]

Fig 4.10 Methylation distribution based on cluster 7 genes

Here group 3 and group 4 are highly methylated in the gene body region (5TSS+1 and 5TSS+2) and have high gene expression.

Fig 4.11 Methylation distribution based on cluster 8 genes

Here group 5 is highly methylated in the 5TSS promoter region (5TSS-2) and has low gene expression.

[Plot panel "cluster7": x-axis in 2 kb bins from 5TSS-50 to 3TSS+47; y-axis ticks at 0.5, 1.5, 2.5]

[Plot panel "cluster8": x-axis in 2 kb bins from 5TSS-50 to 3TSS+47; y-axis ticks at 0.5, 1.5, 2.5]

Fig 4.12 Methylation distribution based on cluster 9 genes

Here group 2 and group 6 are highly methylated in the gene body region (3TSS-1 and 5TSS+1) and have high gene expression.

[Plot panel "cluster9": x-axis in 2 kb bins from 5TSS-50 to 3TSS+47; y-axis ticks at 0.5, 1.5, 2.5, 3.5]

4.2.3 States Meanings and Group Patterns

From the figures above, we can see that in each cluster the distributions of methylated intervals differ considerably between groups, but it is still not easy to find the biological meanings of the states. We therefore ordered the groups within each 2 kbp interval in decreasing order, as we did for the emission matrix.

We then correlate the emission matrix with the ordered groups in each cluster. We regard a state as coherent with an interval if their first 3 groups are the same.

Table 4.8 First 3 marks for each state

For example, state 3 has the ordered features (5, 8, 6, 3, 9, 4, …), excluding groups 10, 11 and 12; an interval is then said to be coherent with state 3 if its ordered features start with 5, 8, 6, followed by the other groups.
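The coherence rule can be sketched with two helper functions. Both names are hypothetical, and the dictionaries mapping group numbers to values (emission probabilities for a state, or methylation levels for an interval) are an assumed data layout.

```python
from typing import Dict, Iterable, List, Sequence


def rank_groups(values_by_group: Dict[int, float],
                exclude: Iterable[int] = ()) -> List[int]:
    """Return group numbers sorted by decreasing value, skipping excluded groups."""
    skipped = set(exclude)
    kept = [g for g in values_by_group if g not in skipped]
    return sorted(kept, key=lambda g: values_by_group[g], reverse=True)


def is_coherent(state_order: Sequence[int],
                interval_order: Sequence[int],
                top: int = 3) -> bool:
    """A state and an interval are coherent if their first `top` groups match in order."""
    return list(state_order[:top]) == list(interval_order[:top])
```

With the state 3 ordering (5, 8, 6, 3, 9, 4, …), an interval whose ranked groups begin 5, 8, 6 is coherent, while one beginning 5, 6, 8 is not.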

Based on this assumption, we correlate our states with the clustering results and obtain the following:

Table 4.9 States and interval correlation results

clusters    states      intervals    features
cluster1    State_1     5TSS-36
cluster2    State_3     5TSS-26
cluster2    State_3     3TSS+22
cluster2    State_5     5TSS-12
cluster3    State_2     3TSS+26
cluster3    State_11    3TSS+28
cluster3    State_11    3TSS+30
cluster4    State_1     5TSS-23
cluster4    State_1     3TSS+3
cluster5    State_1     5TSS-6
cluster5    State_1     3TSS+17
cluster5    State_2     3TSS+25
cluster5    State_5     3TSS+46
cluster5    State_8     5TSS-28
cluster6    State_6     5TSS-34
cluster8    State_11    3TSS+48
cluster9    State_5     3TSS+21
cluster9    State_8     5TSS-25
cluster9    State_8     3TSS+20
cluster9    State_9     3TSS+25

In cluster 1, we get:

State_1 5TSS-36 9 7 8 4 1 6 2 5 3

which means that the 36th 2kbp interval upstream of the 5'TSS is coherent with state 1.

From the results above, we can see that states 4, 7 and 10 do not appear. This is reasonable: states 4 and 10 are among the 3 states that we filtered out, and they are dominated by group 10, which is not a breast cancer cell line, so they can all be classified as the . State 7 is likewise dominated mostly by non-methylation related groups (group 10 and group 12). State 1 appears in clusters 1, 4 and 5, and it is the only state in clusters 1 and 4. For the other states, we can find further biological meanings.

Table 4.10 States meanings

states      meanings
1           regions that are unlikely to be methylated; if methylated, it is with high probability in breast cancer cell lines
2           regions that are methylated in breast cancer cell lines at far 3' distal regions (50k downstream of the 3' core with high gene expression and 52k downstream of the 3' core with low gene expression)
3           regions that are methylated in breast cancer cell lines at far distal regions (52k upstream of the 5'TSS and 44k downstream of the 3' core) with high gene expression
4 and 10    regions that are unlikely to be methylated; if methylated, it is with high probability in non-breast cancer cell lines
5           regions that are methylated in breast cancer cell lines at near 5' distal and far 3' distal regions (24k upstream of the 5'TSS and 92k downstream of the 3' core) with high gene expression
6           regions that are methylated in breast cancer cell lines at far 5' distal regions (68k upstream of the 5'TSS) with low gene expression
7           regions that are mainly methylated in non-breast cancer cell lines (H1 stem cell and CD4+ T cell)
8           regions that are methylated in breast cancer cell lines at far 5' distal and near 3' distal regions (56k and 50k upstream of the 5'TSS and 40k downstream of the 3' core) with high gene expression
9           regions that are methylated in breast cancer cell lines at far 3' distal regions (50k downstream of the 3' core) with low gene expression
11          regions that are methylated in breast cancer cell lines at far 3' distal regions (56k, 60k and 96k downstream of the 3'TSS) with high gene expression

Near distal regions are the regions 10k-40k up/downstream of the 5'TSS/3'TSS.

Far distal regions are the regions 41k-100k up/downstream of the 5'TSS/3'TSS.

Proximal regions are the regions 4k-10k up/downstream of the 5'TSS/3'TSS.
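As an illustration of how the interval labels relate to these region classes, the sketch below converts a label such as 5TSS-26 into a distance (26 intervals of 2kbp, i.e. 52k upstream) and classifies it. The function names are hypothetical. Because the proximal and near distal ranges both include 10k, the sketch assigns 10k to near distal, which is an assumption rather than something stated in the text.

```python
import re

INTERVAL_KBP = 2  # each interval spans 2 kbp, as in the text


def parse_interval(label):
    """Split a label like '5TSS-26' into (gene end, direction, distance in kbp)."""
    m = re.fullmatch(r"([35])TSS([+-])(\d+)", label)
    if m is None:
        raise ValueError(f"unrecognised interval label: {label!r}")
    end, sign, index = m.group(1), m.group(2), int(m.group(3))
    direction = "upstream" if sign == "-" else "downstream"
    return end, direction, index * INTERVAL_KBP


def region_class(distance_kbp):
    """Classify a distance using the ranges defined above (10k -> near distal)."""
    if 4 <= distance_kbp < 10:
        return "proximal"
    if 10 <= distance_kbp <= 40:
        return "near distal"
    if 41 <= distance_kbp <= 100:
        return "far distal"
    return None  # outside the ranges defined in the text


for label in ("5TSS-26", "3TSS+22", "5TSS-12"):
    end, direction, kbp = parse_interval(label)
    print(label, f"{kbp}k {direction} of the {end}' end:", region_class(kbp))
```

For example, 5TSS-26 and 3TSS+22 (both coherent with state 3 in Table 4.9) map to 52k upstream and 44k downstream, both far distal.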

Based on the results above, together with Table 4.6 and Figure 4.2, we can derive the DNA methylation patterns for the subtypes of breast cancer as follows:

Table 4.11 Patterns for subtypes of Breast cancers

subtypes    patterns
Group 1     Low methylation at far 3' distal regions (56k and 92k downstream of the 3'TSS) with high gene expression
Group 2     Low methylation at far 3' distal regions (40k, 42k, 60k and 96k downstream of the 3'TSS) with low gene expression, and at far 5' distal regions (68k upstream of the 5'TSS) with high or low gene expression
Group 3     Low methylation at far 5' distal (46k and 72k upstream of the 5'TSS), far 3' distal (44k and 96k downstream of the 3'TSS) and 3' proximal (6k downstream of the 3'TSS) regions with high gene expression
Group 4     High methylation at far 3' distal regions (50k and 52k downstream of the 3'TSS) with low gene expression
Group 5     High methylation at far distal regions (52k upstream of the 5'TSS and 44k downstream of the 3' core) with high gene expression
Group 6     High methylation at far 3' distal regions (56k, 60k and 96k downstream of the 3'TSS) with high gene expression
Group 7     Low methylation at far 5' distal regions (52k upstream of the 5'TSS) with high gene expression
Group 8     High methylation at far 3' distal (42k and 96k downstream of the 3'TSS) and near 5' distal (24k upstream of the 5'TSS) regions with high gene expression, but at far 5' distal regions (68k upstream of the 5'TSS) with low gene expression
Group 9     High methylation at 5' distal (46k, 50k, 56k and 72k upstream of the 5'TSS), 3' distal (34k, 40k and 50k downstream of the 3'TSS) and 3' proximal (6k downstream of the 3'TSS) regions with high gene expression

Chapter 5: Data Visualization

We also modified a previously developed database web tool to visualize the methylation data in our 36 cell lines and give an intuitive view of the data. In the web tool, we grouped the 36 cell lines into 9 groups, which makes it convenient to compare the data across groups. The details are as follows:

Step1: We divide the genome of each cell line into 100bp intervals and then count the number of methylation reads that fall into each interval.

Step2: We normalize each of the 36 cell lines to the same level, namely 10,000,000 methylation reads per cell line, since according to the statistical summary most of the cell lines have about this number of methylation reads.

Thus, for an original interval value D_ij, which is the number of methylation reads in the jth interval on the ith cell line, we can calculate the normalized value D'_ij as

\[ D'_{ij} = \frac{D_{ij}}{S_i} \times 10{,}000{,}000 \qquad \text{(E5.1)} \]

where S_i is the total number of methylation reads in the ith cell line.
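Steps 1 and 2 can be sketched as follows. This is a minimal illustration with toy data, not the web tool's code: the function names are hypothetical, and assigning each read to an interval by its start coordinate is an assumption, since the text does not say how reads spanning two intervals are handled.

```python
from collections import Counter

BIN_SIZE = 100             # interval width in bp (Step 1)
TARGET_READS = 10_000_000  # common read depth used for normalization (Step 2)


def bin_counts(read_starts, bin_size=BIN_SIZE):
    """Count reads per interval; a read is assigned by its start coordinate."""
    return Counter(pos // bin_size for pos in read_starts)


def normalize(counts, total_reads, target=TARGET_READS):
    """Scale each interval count D_ij by target / S_i."""
    return {j: d * target / total_reads for j, d in counts.items()}


# Toy data: five read start positions on one chromosome of one cell line.
reads = [5, 42, 150, 199, 420]
counts = bin_counts(reads)
print(dict(counts))  # {0: 2, 1: 2, 4: 1}

# Suppose this cell line has S_i = 20,000,000 reads in total.
normalized = normalize(counts, total_reads=20_000_000)
print(normalized)    # {0: 1.0, 1: 1.0, 4: 0.5}
```

A cell line sequenced at twice the target depth has its counts halved, so interval values become comparable across cell lines.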

Step3: To express the methylation level of an interval, we use shades of red from light to dark to represent methylation levels from low to high. Here our level boundaries are as follows:

[E5.2: level boundaries]

So, we use 7 shades of red with different brightness to represent the different levels of methylation.
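The mapping from a normalized value to one of 7 shades can be sketched with a sorted list of cut points. The boundaries and hex colours below are hypothetical placeholders chosen only for illustration; they are not the values of E5.2.

```python
from bisect import bisect_right

# HYPOTHETICAL cut points: six boundaries split the values into seven levels.
BOUNDARIES = [1, 2, 4, 8, 16, 32]

# Seven shades of red from light to dark (illustrative hex colours).
SHADES = ["#ffe5e5", "#ffb3b3", "#ff8080", "#ff4d4d",
          "#e60000", "#b30000", "#800000"]


def methylation_shade(value):
    """Return the shade for a normalized methylation value."""
    # bisect_right counts how many boundaries are <= value, i.e. the level index.
    return SHADES[bisect_right(BOUNDARIES, value)]


print(methylation_shade(0))    # lightest shade
print(methylation_shade(3))    # third shade
print(methylation_shade(500))  # darkest shade
```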

Methylation data is stored as a start coordinate followed by up to 4 consecutive methylation levels for fixed-step 100nt spans. This was done to reduce the number of records in the database and improve performance: although methylation data is dense genome-wide, it consists of many disjoint segments, usually only a few hundred consecutive nucleotides long, with non-zero methylation values.
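The storage layout described above can be sketched as a simple packing routine. This is an assumption-laden illustration, not the database schema: the record shape (a start coordinate plus a tuple of up to 4 levels) and the function name are hypothetical.

```python
STEP = 100      # fixed span width in nt
MAX_LEVELS = 4  # methylation levels stored per record


def pack_records(levels_by_start):
    """Pack non-zero 100nt span levels into (start, levels) records.

    levels_by_start maps a span start coordinate to its methylation level.
    Consecutive spans are merged into one record, up to MAX_LEVELS per record.
    """
    records = []
    current = None  # [start coordinate, list of levels]
    for start in sorted(levels_by_start):
        level = levels_by_start[start]
        if (current is not None
                and len(current[1]) < MAX_LEVELS
                and start == current[0] + STEP * len(current[1])):
            current[1].append(level)   # extends the current run
        else:
            current = [start, [level]]  # gap or full record: start a new one
            records.append(current)
    return [(s, tuple(levels)) for s, levels in records]


spans = {1000: 3, 1100: 5, 1200: 2, 1300: 1, 1400: 4, 2000: 6}
print(pack_records(spans))
# [(1000, (3, 5, 2, 1)), (1400, (4,)), (2000, (6,))]
```

Six spans are stored in three records here; spans with zero methylation are simply absent, which is where the saving comes from when segments are short and disjoint.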

Besides the methylation data, we also associate the genome regions with genes, so that we can see the correlation between genes and methylated regions. For example, in Fig 5.1 we can see that the region chr1:27060485-27070485 contains one gene, SFN (NM_006142), and that within this gene the cell lines in Group2 and Group4 are not methylated, while some cell lines in Group6 and Group7 are hyper-methylated.

Fig 5.1 Database Web Tool

Corresponding Link: http://motif.bmi.ohio-state.edu/hmm

Chapter 6: Conclusions and Suggestions for Future Work

6.1 Conclusions

Many researchers study breast cancer cell lines and methylation data, and some use the Hidden Markov Model (HMM) to solve biological problems. However, few have applied HMMs to DNA methylation data in breast cancer cell lines. Moreover, those who use HMMs for biological problems usually train only 2 states whose meanings are already known, which makes the training less informative. In this thesis, we trained with many more states whose meanings were not known beforehand; after training, we inferred the meanings of the states by correlating them with other biological data, which is a novel aspect of this work. For the program itself, we modified the standard version of the HMM to make it work better for our data. The time and space complexities of the standard HMM are O(#iterations * n^2 T) and O(n^2 T), where n is the number of states and T is the length of the input sequence. Ours are O(#iterations * (m+n)nT) and O(nT), where m is the number of possible observations. We made this modification because our server has a 30GB memory limit, and running the standard HMM on the whole genome would require far more memory than that.
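One common way to avoid the O(n^2 T) memory of a textbook Baum-Welch implementation is to add the expected transition counts to a running n x n sum during the backward sweep instead of storing one n x n matrix per position. The sketch below shows this for a single re-estimation step in pure Python. It is not the thesis program and is not claimed to be the modification made there; in particular, its running time is still O(n^2 T) per iteration. It assumes T >= 2 and that every state has non-zero expected occupancy.

```python
def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation step using O(nT + n^2 + nm) memory."""
    n, m, T = len(A), len(B[0]), len(obs)

    # Forward pass with per-step scaling; alpha is the only n x T table kept.
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            raw = [pi[i] * B[i][obs[0]] for i in range(n)]
        else:
            raw = [sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                   for j in range(n)]
        c = sum(raw)
        scale.append(c)
        alpha.append([x / c for x in raw])

    # Backward sweep: keep one beta vector and accumulate expected counts.
    trans = [[0.0] * n for _ in range(n)]  # expected transition counts
    emit = [[0.0] * m for _ in range(n)]   # expected emission counts
    beta = [1.0] * n
    new_pi = None
    for t in range(T - 1, -1, -1):
        gamma = [alpha[t][i] * beta[i] for i in range(n)]
        for i in range(n):
            emit[i][obs[t]] += gamma[i]
        if t == 0:
            new_pi = gamma
            break
        # Contribution of the transition t-1 -> t, then step beta back to t-1.
        w = [B[j][obs[t]] * beta[j] / scale[t] for j in range(n)]
        for i in range(n):
            for j in range(n):
                trans[i][j] += alpha[t - 1][i] * A[i][j] * w[j]
        beta = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]

    new_A = [[x / sum(row) for x in row] for row in trans]
    new_B = [[x / sum(row) for x in row] for row in emit]
    return new_A, new_B, new_pi


# Toy model with n = 2 states and m = 2 observation symbols.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.6, 0.4]
new_A, new_B, new_pi = baum_welch_step(A, B, pi, [0, 1, 1, 0, 1])
print(new_A, new_B, new_pi)
```

Each re-estimated row is a probability distribution, so the rows of `new_A` and `new_B` and the vector `new_pi` each sum to 1 up to rounding.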

Besides the HMM part, we also used programs published or developed by our lab to process and analyze the data, which greatly helped us to find the biological meanings. For example, in correlating DNA methylation data with gene expression data, we found relationships between them that are consistent with some published results.

6.2 Future Work

The HMM program we designed here is not parallelized, so training on the whole genome takes a long time. We could parallelize the program using OpenMP, MPI or other methods to make it more efficient. Also, our prediction of biological meanings is restricted to the dataset we have; if we further correlated our results with other data, we could probably predict more and deeper biological meanings. Moreover, we can currently verify our results only against some published papers, and not very systematically, so if possible it would be better to carry out biological experiments to further verify the predictions we made.

References:

[1] American Cancer Society (September 13, 2007). "What Are the Key Statistics for Breast Cancer?". Archived from the original on January 5, 2008.
http://web.archive.org/web/20080105001124/http://www.cancer.org/docroot/CRI/content/CRI_2_4_1X_What_are_the_key_statistics_for_breast_cancer_5.asp. Retrieved 2008-02-03.

[2] "Browse the SEER Cancer Statistics Review 1975–2006".
http://seer.cancer.gov/csr/1975_2006/browse_csr.php?section=4&page=sect_04_table.07.html.

[3] March, Jerry; Smith, Michael W. (2001). March's advanced organic chemistry:

reactions, mechanisms, and structure. New York: Wiley. ISBN 0-471-58589-0

[4] Serre D, Lee BH, Ting AH. MBD-isolated Genome Sequencing provides a high-

throughput and comprehensive survey of DNA methylation in the human genome.

Nucleic Acids Res. 2010; 38:391–9

[5] Campbell, M.K., (1995) Biochemistry. Saunders College: Philadelphia, pgs. 615-

16, 181

[6] Maclean, N., S.P. Gregory, and R.A. Flavell (1993) Eukaryotic Genes.

Butterworth and Co., London, pgs. 53-67

[7] Yen RW, Vertino PM, Nelkin BD, Yu JJ, el-Deiry W, Cumaraswamy A, Lennon

GG, Trask BJ, Celano P, Baylin SB. Isolation and characterization of the cDNA

encoding human DNA methyltransferase. Nucleic Acids Res. 1992; 20:2287–2291.

[8] Gruenbaum Y, Cedar H, Razin A. Substrate and sequence specificity of a

eukaryotic DNA methylase. Nature. 1982; 295:620–622.

[9] Bird A. Perceptions of epigenetics. Nature. 2007; 447:396–398.

[10] Holliday R, Pugh JE. DNA modification mechanisms and gene activity during

development. Science. 1975; 187:226–232.

[11] Cedar H, Stein R, Gruenbaum Y, Naveh-Many T, Sciaky-Gallili N, Razin A.

Effect of DNA methylation on gene expression. Cold Spring Harb. Symp. Quant.

Biol. 1983; 47 (Pt 2):605–609.

[12] Reik W, Collick A, Norris ML, Barton SC, Surani MA. Genomic imprinting

determines methylation of parental alleles in transgenic mice. Nature. 1987;

328:248–251.

[13] Riggs AD. X inactivation, differentiation, and DNA methylation. Cytogenet. Cell

Genet. 1975; 14:9–25.

[14] Feinberg AP, Vogelstein B. Alterations in DNA methylation in human colon

neoplasia. Semin. Surg. Oncol. 1987; 3:149–151.

[15] Spruck CH, III, Rideout WM, III, Jones PA. DNA methylation and cancer. [Review]. EXS. 1993; 64:487–509.

[16] Virmani AK, Rathi A, Sathyanarayana UG, Padar A, Huang CX, Cunnigham HT,

Farinas AJ, Milchgrub S, Euhus DM, Gilcrease M, et al. Aberrant methylation of

the adenomatous polyposis coli (APC) gene promoter 1A in breast and lung

carcinomas. Clin. Cancer Res. 2001; 7:1998–2004.

[17] Tsuchiya T, Tamura G, Sato K, Endoh Y, Sakata K, Jin Z, Motoyama T, Usuba O,

Kimura W, Nishizuka S, et al. Distinct methylation patterns of two APC gene

promoters in normal and cancerous gastric epithelia. Oncogene. 2000; 19:3642–

3646.

[18] Ibanez de Caceres I, Battagli C, Esteller M, Herman JG, Dulaimi E, Edelson MI,

Bergman C, Ehya H, Eisenberg BL, Cairns P. Tumor cell-specific BRCA1 and

RASSF1A hypermethylation in serum, plasma, and peritoneal fluid from ovarian

cancer patients. Cancer Res. 2004; 64:6476–6481.

[19] Rice JC, Massey-Brown KS, Futscher BW. Aberrant methylation of the BRCA1

CpG island promoter is associated with decreased BRCA1 mRNA in sporadic

breast cancer cells. Oncogene. 1998; 17:1807–1812.

[20] Bian YS, Osterheld MC, Fontolliet C, Bosman FT, Benhattar J. p16 inactivation by methylation of the CDKN2A promoter occurs early during neoplastic progression in Barrett's esophagus. Gastroenterology. 2002; 122:1113–1121.

[21] Holst CR, Nuovo GJ, Esteller M, Chew K, Baylin SB, Herman JG, Tlsty TD.

Methylation of p16(INK4a) promoters occurs in vivo in histologically normal

human mammary epithelia. Cancer Res. 2003; 63:1596–1601.

[22] Iqbal, K.; Jin, S.-G.; Pfeifer, G. P.; Szabo, P. E. (2011). "Reprogramming of the paternal genome upon fertilization involves genome-wide oxidation of 5-methylcytosine". Proceedings of the National Academy of Sciences 108 (9): 3642–3647. doi:10.1073/pnas.1014033108. PMC 3048122. PMID 21321204.
http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=3048122

[23] Jaenisch, R.; Bird, A. (2003). "Epigenetic regulation of gene expression: how the

genome integrates intrinsic and environmental signals". Nature genetics 33 Suppl:

245–254. doi:10.1038/ng1089. PMID 12610534

[24] Craig, JM; Wong, NC (editor) (2011). Epigenetics: A Reference Manual. Caister

Academic Press. ISBN 978-1-904455-88-2

[25] Rabiner, Lawrence R. "A Tutorial on Hidden Markov Models and Selected

Applications in Speech Recognition". Proceedings of the IEEE , Vol. 77, No. 2,

February 1989, pp. 257-286.

[26] Baldi, P. & Brunak S. "Bioinformatics - The Machine Learning Approach".

Massachusetts Institute of Technology, 1998.

[27] Neve RM, et al. A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell. 2006; 10:515–527.

[28] Harris RA, et al. Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nature Biotechnology. 2010.

[29] Choi JK, Bae J-B, Lyu J, Kim T-Y, Kim Y-J. Nucleosome deposition and DNA methylation at coding region boundaries. Genome Biology. 2009; 10:R89.

[30] Weaver IC (2007). "Epigenetic programming by maternal behavior and

pharmacological intervention. Nature versus nurture: let's call the whole thing off".

Epigenetics 2 (1): 22–8. doi:10.4161/epi.2.1.3881. PMID 17965624.

http://www.landesbioscience.com/journals/epi/abstract.php?id=3881.

[31] Cock et al. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research. 2009.

Appendix: Formats

A. BAM format

BAM format is the compressed binary version of the Sequence Alignment/Map

(SAM) format, a compact and index-able representation of nucleotide sequence

alignments. Many next-generation sequencing and analysis tools work with SAM/BAM.

For custom track display, the main advantage of indexed BAM over PSL and other

human-readable alignment formats is that only the portions of the files needed to display

a particular region are transferred to UCSC. This makes it possible to display alignments

from files that are so large that the connection to UCSC would time out when attempting

to upload the whole file to UCSC. Both the BAM file and its associated index file remain

on your web-accessible server (http or ftp), not on the UCSC server. UCSC temporarily

caches the accessed portions of the files to speed up interactive display.

B. SAM (Sequence Alignment/Map) format

SAM format is a generic format for storing large nucleotide sequence alignments.

SAM aims to be a format that:

• Is flexible enough to store all the alignment information generated by various alignment programs;

• Is simple enough to be easily generated by alignment programs or converted from existing alignment formats;

• Is compact in file size;

• Allows most operations on the alignment to work on a stream without loading the whole alignment into memory;

• Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.

C. EXPORT format

HWUSI-EAS68R 0012 3 1 1173 16855 0 1

AGATCGAGCTGGAGAAATTCCATGAATATACCACAC

cddddcLcc^dd\d^a`ccYcca^aM_]_b`b\TYc chr10.fa 12854193 R36 118 Y

HWUSI-EAS68R 0012 3 1 1174 2493 0 1

AAGACGGGAAAGGACTCACTCAAAGTCACACAGCTG

cTccc_M_^_L_UUL[MM]XGXZFZXSQV\aaYYaV chr3.fa 195586021 F 17C18 97 N

HWUSI-EAS68R 0012 3 1 1174 7057 0 1

CAACTTGGAGAATCACATTTGAAGTGCAAAGAACAC

BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB NM N

D. BED format

BED format provides a flexible way to define the data lines that are displayed in an

annotation track. BED lines have three required fields and nine additional optional fields.

The number of fields per line must be consistent throughout any single set of data in an

annotation track. The order of the optional fields is binding: lower-numbered fields must

always be populated if higher-numbered fields are used.

The first three required BED fields are:

1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold

(e.g. scaffold10671).

2. chromStart - The starting position of the feature in the chromosome or scaffold.

The first base in a chromosome is numbered 0.

3. chromEnd - The ending position of the feature in the chromosome or scaffold.

The chromEnd base is not included in the display of the feature. For example, the

first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100,

and span the bases numbered 0-99.

The 9 additional optional BED fields are:

4. name - Defines the name of the BED line. This label is displayed to the left of the

BED line in the Genome Browser window when the track is open to full display

mode or directly to the left of the item in pack mode.

5. score - A score between 0 and 1000. If the track line useScore attribute is set to 1

for this annotation data set, the score value will determine the level of gray in

which this feature is displayed (higher numbers = darker gray). This table shows

the Genome Browser's translation of BED score values into shades of gray:

score in range (shade from lightest to darkest gray):

≤ 166    167-277    278-388    389-499    500-611    612-722    723-833    834-944    ≥ 945

6. strand - Defines the strand - either '+' or '-'.

7. thickStart - The starting position at which the feature is drawn thickly (for

example, the start codon in gene displays).

8. thickEnd - The ending position at which the feature is drawn thickly (for

example, the stop codon in gene displays).

9. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.

10. blockCount - The number of blocks (exons) in the BED line.

11. blockSizes - A comma-separated list of the block sizes. The number of items in

this list should correspond to blockCount.

12. blockStarts - A comma-separated list of block starts. All of the blockStart

positions should be calculated relative to chromStart. The number of items in this

list should correspond to blockCount.

Example

Here's an example of an annotation track that uses a complete BED definition:

track name=pairedReads description="Clone Paired Reads" useScore=1

chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512

chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
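To make the field layout concrete, the following sketch parses one whitespace-separated BED line into a dictionary using the field names listed above. The function name is hypothetical and the parser is deliberately minimal: it does not handle track lines, comments or validation of score ranges.

```python
OPTIONAL_FIELDS = ["name", "score", "strand", "thickStart", "thickEnd",
                   "itemRgb", "blockCount", "blockSizes", "blockStarts"]


def parse_bed_line(line):
    """Parse a BED data line with 3 to 12 whitespace-separated fields."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs chrom, chromStart and chromEnd")
    record = {
        "chrom": fields[0],
        "chromStart": int(fields[1]),  # 0-based, first base included
        "chromEnd": int(fields[2]),    # this base is not included
    }
    # Optional fields are positional, so pair them up in order.
    for key, value in zip(OPTIONAL_FIELDS, fields[3:]):
        record[key] = value
    for key in ("score", "thickStart", "thickEnd", "blockCount"):
        if key in record:
            record[key] = int(record[key])
    for key in ("blockSizes", "blockStarts"):
        if key in record:
            # Lists may end with a trailing comma, so drop empty items.
            record[key] = [int(x) for x in record[key].split(",") if x]
    return record


rec = parse_bed_line("chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512")
length = rec["chromEnd"] - rec["chromStart"]
print(rec["name"], rec["strand"], length, rec["blockSizes"], rec["blockStarts"])
# cloneA + 4000 [567, 488] [0, 3512]
```

Because chromStart is 0-based and chromEnd is exclusive, the feature length is simply chromEnd - chromStart.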

E. Fastq format

Fastq format is a text-based format for storing both a biological sequence (usually

nucleotide sequence) and its corresponding quality scores. Both the sequence letter and

quality score are encoded with a single ASCII character for brevity. It was originally

developed at the Wellcome Trust Sanger Institute to bundle a FASTA sequence and its

quality data, but has recently become the de facto standard for storing the output of high

throughput sequencing instruments such as the Illumina Genome Analyzer [31].

Example

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36

GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC

+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36

IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
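The four-line record above can be read with a short parser. This sketch assumes the Sanger variant of the format, in which each quality character encodes a Phred score as its ASCII code minus 33; other variants described in [31] use different offsets. The function name is hypothetical.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (name, sequence, Phred scores)."""
    header, seq, plus, qual = (ln.rstrip("\n") for ln in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    # Sanger encoding (assumed): Phred score = ASCII code - 33.
    phred = [ord(ch) - 33 for ch in qual]
    return header[1:], seq, phred


record = [
    "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36",
    "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC",
    "+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36",
    "I" * 30 + "9IG9IC",  # the quality line of the example above
]
name, seq, phred = parse_fastq_record(record)
print(name.split()[0], len(seq))  # SRR001666.1 36
print(phred[0], phred[30])        # 40 24
```

Under this encoding the character `I` corresponds to a Phred score of 40 and `9` to a score of 24.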

F. Bowtie output format (generated from software bowtie, and input is the fastq files)

19 + chr9 8003

AGGCTATATGCGCGGCCAGCAGACCTGCAGGGCCCGCTCGTCCAGGGGGCGG

TGCTTGCTCTGGATCGTGTGCGG

IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

167:C>G,72:G>C,73:C>G

28 + chr19 28101

AAATAAATAAATAAAAACAACTTGTCCAAGGTCAGACAGGCCGCCTCTTAGT

AAGCACACCTATCCTCTATAGTA

IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

141:A>C,60:A>C,72:T>G

28 + chr1 25355

AAATAAATAAATAAAAACAACTTGTCCAAGGTCAGACAGGCCGCCTCTTAGT

AAGCACACCTATCCTCTATAGTA

IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

141:A>C,60:A>C,72:T>G

35 + chr1 41809

CAAATACGGTGACTGTTTCTTACGTGGACGACGTTGTGTTGAACATGGGTGA

GTAAGACTGAAGCAGCCGTAATT

IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
