Next-Generation Sequencing and Assembly of Bacterial Genomes

February 2013

DOI:10.1201/b14289-18

In book: OMICS

Authors:

Artur Silva

Federal University of Pará

Rommel Thiago Jucá Ramos

Federal University of Pará

Sintia Silva Almeida

Universidade Federal do Ceará


One of the most important advances in biology has been our capacity to sequence the DNA of organisms. However, long after the conclusion of human genome sequencing, there are still regions of the genome that are unworkable; that is, they are difficult to mount and remain incomplete. Answers may come from second-generation sequencing, which has produced large volumes of data, generating millions of short reads per run, a reality that was unimaginable with Sanger sequencing.

(1) Overlap between two reads, in which the overlapping region does not need to be a perfect match; (2) example of correct assembly of a region of the genome that has two repetitive regions (box) using four reads (a to d); (3) assembly generated by the greedy approach, in which reads a and d are assembled first, incorrectly, because they have the better overlap; and (4) discordance between two reads (discordance region) that could extend a contig (bold sequence). Extension of the contig can be terminated at this point to avoid misassemblies.



Next-Generation Sequencing and

Assembly of Bacterial Genomes

Artur Silva, Rommel Ramos, Adriana Carneiro, Sintia Almeida,

Vinicius De Abreu,Anderson Santos, Siomar Soares, Anne Pinto,

Luis Guimarães, Eudes Barbosa, Paula Schneider, Vasudeo Zambare,

Debmalya Barh, Anderson Miyoshi, and Vasco Azevedo

15.1 Introduction

One of the most important advances in biology has been our capacity to sequence the DNA of organisms. However, long after the conclusion of human genome sequencing, there are still regions of the genome that are unworkable; that is, they are difficult to assemble and remain incomplete. Answers may come from second-generation sequencing, which has produced large volumes of data, generating millions of short reads per run, a reality that was unimaginable with Sanger sequencing.

Although we can now generate a high degree of sequencing coverage (Figure 15.1), the de novo assembly of short reads is more complex than reference assembly. Various algorithms and bioinformatics tools have been developed to take care of these new problems and computational challenges, such as identification of repeat regions, sequencing errors, and simultaneous manipulation of short reads [1,2].

CONTENTS

15.1 Introduction
15.2 Treatment of the Data
15.2.1 Base Quality
15.2.2 Error Correction (Tools)
15.3 Strategies for Assembling Genomes
15.3.1 Reference Mounting
15.3.2 De novo Assemblers
15.3.3 Challenges and Difficulties for de novo Assembly
15.4 Tools for Assembling Genomes
15.4.1 Tools for Reference Assembly
15.4.2 Tools for De novo Assembly
15.4.2.1 Tools that Use OLC
15.4.2.2 Tools that Use the DBG
15.4.2.3 Tools that Use Greedy Graph
15.5 Tools for Visualizing NGS Data and Producing a Scaffold
15.6 Closing Gaps
15.7 Description of the Gap Closing Strategy
References

K15973_C015.indd 367 12/20/2012 3:39:32 PM

368 OMICS: Applications in Biomedical, Agricultural, and Environmental Sciences

After reads are generated by the sequencer, it is necessary to join them in a logical fashion to assemble the final sequence. Over the years, various tools have been developed to resolve this issue, for example, the assemblers PHRAP (http://www.phrap.org), ARACHNE [3], and Celera [4]. They have a paradigm in common, often referred to as overlap-layout-consensus (OLC) [5]. This approach is quite similar to that used to solve a jigsaw puzzle, as described below.

The first step consists of aligning the reads, two by two, exhaustively; the pairs of reads should present a consistent overlap from one read to another, similar to the search for pieces in a jigsaw puzzle that fit each other and have colors that match. Especially in eukaryotic genomes, the main difficulty is in distinguishing inexact overlaps due to sequencing errors from similarities within the genome, such as highly conserved repeat regions [6]. Sequence alignment is a widely studied area of bioinformatics, which consists of finding the optimal alignment between two sequences as a function of an evaluation “score.” Most of these methods are based on the Needleman–Wunsch algorithm [7], which uses dynamic programming over the space of possible alignments between the sequences. Many extensions have been conceived, for example, for multiple alignments [8], local alignments [9], or rapid searches in large databases [10].
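As a minimal sketch of the dynamic-programming idea behind Needleman–Wunsch, the Python function below fills the score matrix for a global alignment and returns only the optimal score. The function name and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for illustration and are not taken from reference [7].

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against an empty string costs one gap per base.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two bases, gap in b, gap in a.
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


print(needleman_wunsch("ATGAT", "ATCAT"))
```

The table has (len(a) + 1) x (len(b) + 1) cells, which is why exhaustive pairwise alignment of millions of short reads becomes expensive.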

Current techniques can be executed rapidly on parallel processors. To process short reads generated by second-generation sequencing platforms, one of the solutions found for simultaneously manipulating thousands of sequences has been the use of cloud computing [11]. The assemblers detect a group of reads with consistent alignments with each other, forming contiguous sequences (contigs). This would be equivalent to partially forming an image by putting together pieces of a jigsaw puzzle. In both assemblies, genome and jigsaw puzzle, the process can be interrupted in ambiguous regions, where various continuations are possible, or at holes, where no connecting piece was found [12].

Finally, the assembler tries to order and orient the contigs with respect to each other in a de novo manner; that is, without the help of a reference sequence. Returning to the metaphor of a jigsaw puzzle, this would correspond to identifying corners and different parts of the image that relate to each other. In the final assembly phase, a scaffolded group of contigs will become available. Nevertheless, it is desirable to remove as many gaps as possible, eventually converging to a group of integral chromosomes, that is, those that do not include breaks. This phase, called finishing, can be expensive and could take considerable time, depending on the strategy used to close the remaining holes in the genome scaffold [13].

[Figure 15.1 image labels: High-throughput sequencing, Sanger, Genome, Depth]

FIGURE 15.1
Qualitative comparison between the sequences generated by Sanger and those from the NGS platforms. There is a higher abundance and depth of coverage with the short reads, but they are also significantly shorter, with little overlap available for assurance.


15.2 Treatment of the Data

Preprocessing of the data, involving a quality filter and correction of sequencing errors, is essential to increase the accuracy of the assemblies, as it prevents incorrect or low-quality reads from becoming part of the genome assembly process.

15.2.1 Base Quality

In 1998, Ewing and collaborators developed the PHRED algorithm, with the objective of determining the probability of occurrence of each of the four nucleotides (A, C, G, or T) for each base of a DNA sequence during the base-calling process; the intensity of the wavelength obtained from the fluorescence produced by ligations between nucleotides is used to calculate the PHRED quality value (Q), which is logarithmically related to the observed probability of error for each base (P), according to the formula presented in Equation 15.1.

$Q = -10 \log_{10} P$    (15.1)

In Table 15.1, we can observe examples of PHRED quality values associated with the probability that a base call is incorrect and the accuracy of the identification of the base.
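The relationship in Equation 15.1 can be checked numerically. The short Python sketch below converts between Q and P and prints the error probability and accuracy for the quality values 10 to 50; the function names are hypothetical and chosen only for this illustration.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a PHRED quality Q, from Q = -10 log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """PHRED quality Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q={q}  P={p:g}  accuracy={100 * (1 - p):.3f}%")
```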

TABLE 15.1
Error Probability and Precision Based on PHRED Quality Values

PHRED Quality Score    Error Probability    Accuracy (%)
10                     1/10                 90
20                     1/100                99
30                     1/1000               99.9
40                     1/10,000             99.99
50                     1/100,000            99.999

[Figure 15.2 image labels: Sequence (GCTAGCATGCTAGCTACGATGCATC), Phred quality, Cutoff]

FIGURE 15.2
Low quality of the ends of the reads obtained from automatic sequencers, which use the dideoxynucleotide method.

The sequences obtained from automatic sequencers are not considered to be reliable due to their low quality at the extremities (Figure 15.2) and because of contaminants. Consequently, working on the quality of the data is fundamental so that the following phases of processing biological information are not compromised [15]. In the case of the sequences obtained from next-generation sequencing (NGS), despite the high degree of coverage, base quality should be evaluated. In this way, the reads can be trimmed and quality filters applied. As examples of tools that do this quality treatment, we can cite Quality Assessment [16], Galaxy [17], ShortRead [18], and PIQA [19], the latter being used exclusively for Illumina data.
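A minimal sketch of trimming and filtering by base quality is given below. It is not the procedure of any of the tools cited above; the cutoff of 20, the minimum length of 15, and the function names are assumptions made for illustration, and qualities are taken to be already decoded into integer PHRED values.

```python
def trim_3prime(seq, quals, cutoff=20):
    """Remove bases from the 3' end until a base with quality >= cutoff is reached."""
    end = len(seq)
    while end > 0 and quals[end - 1] < cutoff:
        end -= 1
    return seq[:end], quals[:end]


def passes_filter(quals, min_mean=20, min_len=15):
    """Keep a read only if it is long enough and its mean quality is high enough."""
    return bool(quals) and len(quals) >= min_len and sum(quals) / len(quals) >= min_mean


seq, quals = trim_3prime("GCTAGCATGC", [30, 32, 35, 33, 31, 30, 28, 12, 9, 5])
print(seq, quals, passes_filter(quals, min_len=5))
```

Trimming only from the 3′ end reflects the quality decay toward the read ends described above; trimming both ends or using a sliding window would be equally plausible variants.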

Analysis of sequence quality, followed by data treatment, makes it possible to reduce alignment errors because it provides precise alignment parameters according to the data under study [20]. In the assembly of genomes, software such as Quality-value guided Short Read Assembler (QSRA) [21] proposes extending sequences through analysis of base quality, giving better results than the assembler on which it was based (VCAKE, which does not take base quality into account to extend sequences) [22]. In transcriptome studies with NGS platforms using RNAseq, evaluation of the quality of the data is extremely important because the coverage represents the level of expression; consequently, quality filters using stringent parameters can cause variations in the expression levels that are found [23].

15.2.2 Error Correction (Tools)

Despite the high degree of accuracy provided by NGS platforms, due to the extensive coverage generated by this equipment, sequencing errors can cause problems in the assembly of genomes when using a de novo approach because the generation of contigs is very sensitive to these errors [24]. Consequently, to obtain better results, it is necessary to correct the errors before assembling the genomes, which will make the data more reliable [25]. In re-sequencing projects, in which the reads obtained from the sequencers are aligned against a reference genome, error correction can avoid elimination (trimming) of the 3′ extremities of the reads, which are otherwise removed because of their low quality in an attempt to improve the alignments [26].

Some genome assemblers have already included error correction procedures: SHARCGS only considers reads that have been produced by the sequencer n times, a parameterized value, and those that present overlap with other reads [25]. In 2001, Pevzner and collaborators used the spectral alignment method to correct errors: given a string S and a spectrum T, formed by all of the contiguous strings of fixed size, a search is made for the smallest number of modifications that need to be made in S to transform it into a T-string, that is, a string whose fixed-size substrings all belong to T. This method of correction is implemented by the assembler EULER-SR [27] before the process of assembling the genome.
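The spectral alignment idea can be illustrated with a deliberately simplified Python sketch. It assumes that the spectrum T is the set of k-mers seen at least `min_count` times in the reads, and it only tries single-base substitutions; neither simplification is claimed to match EULER-SR, and all names are hypothetical.

```python
from collections import Counter


def build_spectrum(reads, k, min_count):
    """Spectrum T: k-mers that occur at least min_count times across the reads."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return {kmer for kmer, n in counts.items() if n >= min_count}


def is_t_string(s, spectrum, k):
    """True if every k-mer of s belongs to the spectrum."""
    return all(s[i:i + k] in spectrum for i in range(len(s) - k + 1))


def correct_one_substitution(s, spectrum, k):
    """Return s if it is a T-string, a one-substitution variant that is, or None."""
    if is_t_string(s, spectrum, k):
        return s
    for i, base in enumerate(s):
        for alt in "ACGT":
            if alt != base:
                candidate = s[:i] + alt + s[i + 1:]
                if is_t_string(candidate, spectrum, k):
                    return candidate
    return None


reads = ["ACGTACGGA"] * 3 + ["ACGTTCGGA"]  # the last read carries one error
spectrum = build_spectrum(reads, k=4, min_count=2)
print(correct_one_substitution("ACGTTCGGA", spectrum, k=4))
```

A full implementation would search for the minimum number of changes rather than stopping at one, and would have to cope with low-coverage regions where correct k-mers are rare.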

As examples of independent tools that can correct errors, we can cite Short-Read Error Correction (SHREC) [25] for SOLEXA/Illumina data, which uses a generalized suffix tree to process the data; the SOLiD Accuracy Enhancement Tool for SOLiD data, which is available in LifeScope Genomic Analysis Software (http://www.lifetechnologies.com/lifescope) and uses an approach similar to that of EULER-SR; and Hybrid SHREC [24], which is based on the SHREC algorithm, but can process files from various sequencing platforms.

15.3 Strategies for Assembling Genomes

Reads from the sequencer should be submitted to preprocessing, where base quality and sequencing errors are evaluated with software, commonly specific to the corresponding sequencing platform; they are then submitted to de novo assembly, and the resulting contigs are oriented and ordered to produce the scaffold (Figure 15.3). If the assembly is done with a reference genome, after preprocessing, the reads are mapped against this reference, and after alignment is finished, a consensus sequence is produced.

15.3.1 Reference Mounting

Basically, reference assembly consists of mapping the reads obtained from sequencing against a reference genome (Figure 15.4), preferably that of a phylogenetically closely related organism, making it possible to align a large part of the reads. However, the alignment configurations will also influence the quantity of reads that is utilized; consequently, parameters such as depth of coverage and the number of mismatches that are permitted should be defined based on the sequencing information: estimated coverage and PHRED quality of the bases [20]. Mapping using a reference sequence allows the identification of nucleotide substitutions as well as indels, principally with the use of NGS platforms, due to the high degree of sequence coverage [28].
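To make the role of the mismatch parameter concrete, the following Python sketch maps a read against a reference by brute force, reporting every ungapped position where the number of mismatches does not exceed a threshold. It is an illustration only: real mappers use indexes rather than a linear scan, and the function name, the sequences, and the default threshold are assumptions.

```python
def map_read(reference, read, max_mismatches=1):
    """Return (position, mismatches) for each ungapped placement within the threshold."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# One mismatch allowed: the read is placed despite a single substitution.
print(map_read("CCATGATGG", "ATCAT", max_mismatches=1))
# A repeated region in the reference yields two placements for the same read.
print(map_read("ATGATCCATGAT", "ATGAT", max_mismatches=0))
```

The second call shows why repeats complicate reference assembly: the read maps equally well to both copies, whatever the true copy number in the sequenced organism.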

[Figure 15.3 image labels: Preprocessing, De novo assembly, Scaffold]

FIGURE 15.3
Steps used for ab initio assembly of genomes. After data treatment in the preprocessing stage, ab initio assembly is run, generating the contigs, which then are oriented and ordered to generate the scaffold.

[Figure 15.4 image labels: ATGAT, ATCAT, Reference, Reads, GAP]

FIGURE 15.4
Alignment of reads against a reference genome, showing mismatches and a gap.

After assembly, regions of the reference genome that are not covered are observed, representing gaps, which can occur because a nucleotide sequence present in the reference does not occur in the sequenced organism, or because this region was not sequenced. Among the problems with sequence assembly using a reference, we can cite the representation of repeated regions, for example, the case of a reference genome that has two such regions and a sequenced organism that has only one; during mapping against the reference, both reference regions will be covered, which can result in assembly errors (Figure 15.5). For mapping reads using a reference genome, one can use software such as SHRiMP [29] and SOLiD BioScope (Applied Biosystems, Foster City, CA), both of which align SOLiD color-space reads, as well as SOAP2 [30], MAQ [31], RMAP [26], and ZOOM [32]. The program SOLiD BioScope is a Java-based application that has various integrated tools in a web interface for resequencing and transcriptome analyses. The reference assembly pipeline permits mapping of reads based on reference genomes, identifying single nucleotide polymorphisms, indels (insertions and deletions), and inversions, as well as generating a consensus genome (Applied Biosystems).

15.3.2 De novo Assemblers

De novo assembly consists of reconstructing genome sequences without the aid of any other information aside from the reads produced by the sequencing process. With this strategy, similarity alignments can be made among the reads themselves, or through the overlap of k-mers. This allows, at the end of the alignment process, the formation of contiguous sequences (contigs), as seen in Figure 15.6. Most assemblers are based on graph theory, in which vertices and edges can represent an overlap, a k-mer, or a read, depending on the strategy that is used, and the contigs are the paths formed in the graph. Thus, the assemblers can be divided into greedy, OLC, or de Bruijn graph (DBG) algorithms; the latter uses a Eulerian path [28].

DBG is the approach that is most widely used by assemblers of short reads because it works better with large numbers of reads, typical of NGS sequencers. The main programs that adopt this approach are AllPaths [33], Euler-SR [34], SOAPdenovo [35], and Velvet [36]. Among these programs, Velvet is the only one that assembles short sequences in the color-space format.

15.3.3 Challenges and Difficulties for de novo Assembly

[Figure 15.5 image labels: Reference genome, Reads (A, B, C), Repetitive regions]

FIGURE 15.5
Double mapping of reads A, B, and C in the reference genome, because it involves a repetitive region. However, in the sequenced genome, the number of repeats can be different from that observed in the reference genome.

[Figure 15.6 image labels: Short reads, Contig]

FIGURE 15.6
Alignment between the reads generated by the sequencing, finally obtaining scaffolded contigs.

The limitations of de novo assembly approaches are directly associated with the technological limitations and the features of the data generated by second-generation sequencers, as well as the sizes of the reads and the volume of data that is generated, which exponentially increases the processing time and sometimes makes assembly unviable. Within this context, various problems can occur, such as the grouping of repeat regions; there are also regions in which sequences are of low quality, base compression in the sequencing, and even regions with a low degree of coverage due to the random character of the sequencing [37,38].

One of the classic problems in de novo assembly is finding a path in the overlap graph that passes through each of the vertices only once (Hamiltonian path) or each edge only once (Eulerian path); this often results in the loss of connectivity between very distant sequences, showing that strategies based on graphs, especially the de Bruijn strategy, are extremely sensitive to sequencing errors [39]. These problems are more complex and common in assemblies made with short reads: because the reads are much shorter than Sanger sequences, many more of them are needed, which exponentially increases the size of the problem. The large sizes of conserved repeat regions also make the process of genome assembly difficult, and for eukaryotic genomes, which have very large repeat regions, this task is sometimes a problem that is hard to resolve [5]. Despite the problems cited above, studies show that up to 96.29% of a gene can be reconstructed using short sequences, with sizes starting at 25 nucleotides [40].

15.4 Tools for Assembling Genomes

Second-generation sequencers are capable of generating thousands of reads, providing a high degree of coverage and accuracy. As examples of these platforms, we can cite SOLiD, Illumina, and 454 FLX Titanium. Despite the reduction in sequencing costs, among other advantages, the reduction in the sizes of the reads, along with the increase in the number of reads, results in computational challenges for the processing of these data, principally for genome assemblies [28]. Assembling a genome consists of overlapping the reads generated by a sequencer, based on their similarity, to produce contiguous sequences (contigs), which in turn are aligned and oriented with respect to each other to construct the scaffold. This is called a reference assembly when it involves mapping reads against a reference sequence, whereas assembling reads without such a reference is called de novo assembly [31].

15.4.1 Tools for Reference Assembly

For alignment against a reference genome, there are two approaches: using hash tables and using prefix/suffix trees [20]. Many programs that use hash tables define the subsequences obtained from the search sequence as the key. The program tries to map identical sequences, known as seeds, from the reference, so that the sequences can be subsequently extended. However, the use of templates with spaced seeds gives better results because it tolerates internal mismatches (Figure 15.7). Even so, independent of whether seeds or spaced seeds are used, the alignments do not accept gaps; identification of such gaps is made in a step after the extension of the alignment.

Seed           11111111111
Spaced Seed    11101010101101011

FIGURE 15.7
Templates using a seed, in which an exact match of 11 bases is necessary to initiate the extension, and spaced seeds, in which a match of 11 bases is required, but permitting the existence of internal mismatches.
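The difference between the two templates of Figure 15.7 can be sketched in Python as follows. A "1" marks a position that must match and a "0" a position that is ignored; the function name and the test sequences are assumptions made for this illustration, not the implementation of any cited mapper.

```python
SEED = "11111111111"
SPACED_SEED = "11101010101101011"


def seed_hit(template, read, reference, read_pos=0, ref_pos=0):
    """True if read and reference agree at every '1' position of the template."""
    span = len(template)
    r = read[read_pos:read_pos + span]
    g = reference[ref_pos:ref_pos + span]
    if len(r) < span or len(g) < span:
        return False  # the template does not fit at this position
    return all(a == b for t, a, b in zip(template, r, g) if t == "1")


reference = "ACGTACGTACGTACGTACGT"
read = "ACGAACGTACGTACGTACGT"  # substitution at index 3, a '0' position of the spaced seed

print(seed_hit(SEED, read, reference))         # contiguous seed is broken by the mismatch
print(seed_hit(SPACED_SEED, read, reference))  # spaced seed still produces a hit
```

Both templates require 11 matching bases; the spaced seed simply spreads them over 17 positions, so a mismatch falling on a "0" does not prevent the hit.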

The mapping algorithms that use prefix/suffix trees search for exact matches using one of three representations: suffix trees, enhanced suffix arrays, and the Ferragina-Manzini index (FM index) [41]; then they extend the alignments considering mismatches. Among the tools that use suffix trees, we have MUMmer [42] and OASIS [43]. Among the alignment software based on enhanced suffix arrays, we can cite Vmatch [44] and Segemehl [45]. The FM index method uses a small amount of memory (from 0.5 to 2 bytes per nucleotide), which can vary as a function of implementation and the parameters that are used [20]; examples of such programs include Bowtie [46], BWA [47], BWA-SW [48], SOAP2 [30], and BWT-SW [49].
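As a rough illustration of index-based exact matching, the Python sketch below builds a plain suffix array and finds all occurrences of a pattern by binary search. This is a teaching simplification: the construction is naive and memory-hungry, and it is neither an enhanced suffix array nor an FM index as used by the tools above. All names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(text, suffix_array, pattern):
    """All start positions of pattern in text, via two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-base prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose m-base prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


reference = "ATGATCCATGAT"
sa = build_suffix_array(reference)
print(find_exact(reference, sa, "ATGAT"))
```

Once the index exists, each lookup costs a number of comparisons logarithmic in the reference length, which is the property that makes index-based mappers practical for millions of reads.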

15.4.2 Tools for De novo Assembly

According to Miller et al. [28], the de novo assembly of genomes consists of aligning the

reads with each other to produce contiguous sequences (contigs). The principal method-

ologies for NGS data are based on graphs, these being:

• OLC

• DBG

• Greedy

15.4.2.1 Tools that Use OLC

This is the most widely used approach for long sequences, such as those produced by Sanger sequencing; nevertheless, there are also applications based on this method for short reads, such as Edena [50]. The OLC method can be divided into three phases: overlap, layout, and consensus. In the overlap phase, each read is compared with all of the others to identify overlaps, considering the minimum size of overlap and k-mer, which will affect the accuracy of the contigs. Among the types of overlaps that are recorded, four categories are possible: containment, normal fitting, prefix fitting, and suffix fitting, as shown in Figure 15.8 [51].

[Figure 15.8 image: four panels, (a) to (d), each showing a pair of aligned read sequences]

FIGURE 15.8
Containment (a), partial overlap (b), prefix overlap (c), and suffix overlap (d).
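The overlap phase can be sketched for the simplest case of exact matches. The Python functions below compute the longest suffix-prefix overlap between two reads and give a coarse classification of a read pair; the category labels, the function names, and the minimum overlap of 3 are assumptions for illustration, and real assemblers tolerate mismatches in the overlap.

```python
def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b (0 if below min_len >= 1)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def classify_overlap(a, b, min_len=3):
    """Coarse, exact-match classification of how read b relates to read a."""
    if b in a:
        return "containment"
    if suffix_prefix_overlap(a, b, min_len):
        return "suffix of a overlaps prefix of b"
    if suffix_prefix_overlap(b, a, min_len):
        return "prefix of a overlaps suffix of b"
    return "no overlap"


print(suffix_prefix_overlap("GTAATTGCCATCGG", "GCCATCGGTTGTAC"))
print(classify_overlap("GTAATTGCCATCGG", "GCCATC"))
```

Comparing every read with every other read in this way requires a number of comparisons that grows with the square of the number of reads, which is the cost the text attributes to the overlap phase.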


In the layout phase, the information obtained from the previous phase is used to construct a graph (Figure 15.9), which is reconstructed at each update. At the end of this phase, the first draft representing the genome is generated; during this phase, various methods are used to simplify the paths and remove errors detected in the graph, such as bubbles and linear extensions, known as dead paths [28]. In the consensus phase, multiple alignments of the fragments are made progressively to develop a consensus sequence [28], as shown in Figure 15.10. As examples of assemblers that use the OLC strategy, we have Celera Assembler [4], ARACHNE [3], CAP, and PCAP [52]. Edena is the only program for the Solexa and SOLiD platforms that uses OLC [50].

15.4.2.2 Tools that Use the DBG

In 1995, Idury and Waterman introduced the use of a graph to represent a sequence assembly. Their method consisted of creating a vertex for each word and then connecting the vertices that correspond to overlapping k-mers, a k-mer being a sequence with a specific number of bases, k. The source vertex corresponds to the prefix of length k-1 of the corresponding overlap region (k-mer), and the destination vertex to the suffix of length k-1 of the same region, providing a reconstruction of the sequence through a path that traverses each edge exactly once. Pevzner and Tang [39] proposed a slightly different representation of the sequence graph, the so-called DBG, which uses a Eulerian path; that is, a path that visits each edge exactly once, in which the k-mers are represented as arcs or edges, and overlaps between k-mers join their ends.
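The edge-centric construction described above can be sketched in a few lines of Python. The sketch assumes that each distinct k-mer contributes exactly one edge, from its (k-1)-base prefix to its (k-1)-base suffix; this ignores k-mer multiplicity, reverse complements, and sequencing errors, and the function name is hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer vertex to the list of vertices reached by one k-mer edge."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:  # one edge per distinct k-mer (simplifying assumption)
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


print(de_bruijn(["ATGGC", "TGGCG"], k=3))
```

Note that the two reads are never compared with each other: their shared k-mers simply land on the same edges, which is how the overlap is found implicitly.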

FIGURE 15.9

Layout graph: G graph valid based on graph construction theory. First draft of what could be the genome.

FIGURE 15.10

Consensus graph: using progressive alignment guided by pairs.


The classic method of assembling fragments is based on the notion of an overlap graph. Each read corresponds to a vertex in the overlap graph, and two vertices are connected by an arc if the corresponding reads overlap [53]. Unlike the Hamiltonian path problem, the Eulerian path problem is less complex and can be solved even for graphs with millions of vertices, as there are linear-time algorithms that provide a solution [54]. It is important to emphasize that a de Bruijn graph is centered on the k-mer, which means that its topology is not affected by the fragmentation of the reads [12]. Compared with the overlap phase in OLC, the computational cost is much smaller because overlaps are not computed between all pairs of reads [28]. Operationally, each read is divided into k-mers for the previously defined value of k; these k-mers form the vertices of the DBG, which are linked wherever they overlap exactly.
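One well-known linear-time method for the Eulerian path problem is Hierholzer's algorithm, sketched below in Python. The chapter does not say which algorithm reference [54] uses, so this is an illustrative choice. The sketch assumes a non-empty graph, given as a mapping from each vertex to its list of successors, in which a Eulerian path exists; the function names are hypothetical.

```python
def eulerian_path(graph):
    """Hierholzer's algorithm: a path using every edge exactly once (assumed to exist)."""
    out_edges = {u: list(vs) for u, vs in graph.items()}  # copy, edges are consumed
    in_degree = {}
    for vs in graph.values():
        for v in vs:
            in_degree[v] = in_degree.get(v, 0) + 1
    # Start where out-degree exceeds in-degree by one, else at any vertex with edges.
    start = next(
        (u for u in out_edges if len(out_edges[u]) - in_degree.get(u, 0) == 1),
        next(iter(out_edges)),
    )
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out_edges.get(u):
            stack.append(out_edges[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: vertex is final in the path
    return path[::-1]


def spell(path):
    """Sequence spelled by a path of (k-1)-mers that overlap by k-2 bases."""
    return path[0] + "".join(vertex[-1] for vertex in path[1:])


graph = {"AT": ["TG"], "TG": ["GG"], "GG": ["GC"], "GC": ["CG"]}
print(spell(eulerian_path(graph)))
```

Each edge is pushed and popped once, so the running time is proportional to the number of edges, in contrast to the Hamiltonian path problem, for which no comparably efficient general method is known.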

After construction of the graph, it is generally possible to simplify it without any loss of information. The reads of the vertices are interrupted and initiated again at each simplification. Simplification of two vertices is similar to the concatenation of two character strings [51]. As in other approaches, the assemblers add accessory algorithms to their main algorithm to help remove assembly errors, such as reduction of redundant paths, removal of bubbles or path loops, and removal of linear extensions; that is, those that do not possess defined paths. The main programs that adopt the use of the DBG are AllPaths [33], Euler-SR for short sequences [34], SOAPdenovo [54], and Velvet [36], which is specialized in the local use of paired reads. Velvet is the only one that assembles short sequences in color-space. Other assembly programs, such as ABySS [55,56], were successful in constructing DBGs while eliminating the memory limitations that are common during assemblies; Table 15.2 shows the principal characteristics of each software package [28].

In 2009, inspired by the ideas of Pevzner et al. [5], Zerbino implemented the program Velvet, the structure of which differs in various aspects. Among these, k-mers are mapped to the vertices and not to the arcs, and each vertex can have an associated reverse-complement sequence to obtain a bidirectional graph [57]. In this way, the vertices can be connected by a directed edge or an arc. Due to the symmetry of the blocks, if an arc goes from vertex A to B, a symmetrical arc goes from B to A. With any alteration of an arc, it is implicit that the same change will be made symmetrically in the paired arc.

Each vertex can be represented by a single rectangle, which represents a series of overlapping k-mers (in this case, k = 5) listed directly above or below. The last nucleotide of each k-mer is colored red. The arcs are represented as arrows between vertices. The last k-mer of an arc's origin overlaps the first k-mer of its destination. Each arc has a symmetrical arc [12].

15.4.2.3 Tools that Use Greedy Graph

The greedy algorithms were widely implemented in assembly programs for Sanger data, such as PHRAP, TIGR Assembler, and CAP3 [1]. When new sequencing technologies became available, other software was developed for assembling the NGS data (short reads) using different greedy strategies, such as SSAKE [58], SHARCGS [59], and VCAKE [22]. These algorithms can use an OLC approach or a DBG, applying a basic operation (Figure 15.11): starting with any read of a group of data, add another, and in this way numerous iterations are run until all possible operations are tested and the overlaps are identified, in which a suffix of a read overlaps a prefix of another (Figure 15.11.1). Each operation uses the highest-scoring overlap, measured by the size of the overlap between reads, to make the next join [1,2,28].


TABLE 15.2
Feature Comparison between De Novo Assemblers for Whole-Genome Shotgun Data from NGS Platforms

Feature                                         Greedy Assemblers       OLC Assemblers          DBG Assemblers

Modeled features of reads
  Base substitutions                            -                       -                       Euler, AllPaths, SOAP
  Homopolymer miscount                          -                       CABOG                   -
  Concentrated error in 3′ end                  -                       -                       Euler
  Flow space                                    -                       Newbler                 -
  Color space                                   -                       Shorty                  Velvet

Removal of erroneous reads
  Based on k-mer frequencies                    -                       -                       Euler, Velvet, AllPaths
  Based on k-mer frequency and quality value    -                       -                       AllPaths
  For multiple values of k                      -                       -                       AllPaths
  By alignment to other reads                   -                       CABOG                   -
  By alignment and quality value                SHARCGS                 -                       -

Correction of erroneous base calls
  Based on k-mer frequencies                    -                       -                       Euler, SOAP
  Based on k-mer frequencies and quality value  -                       -                       AllPaths
  Based on alignments                           -                       CABOG                   -

Approaches to graph construction
  Implicit                                      SSAKE, SHARCGS, VCAKE   -                       -
  Reads as graph nodes                          -                       Edena, CABOG, Newbler   -
  k-mer as graph nodes                          -                       -                       Euler, Velvet, ABySS, SOAP
  Simple path as graph nodes                    -                       -                       AllPaths
  Multiple values of k                          -                       -                       Euler
  Multiple overlap stringencies                 SHARCGS                 -                       -

Approaches to graph reduction
  Filter overlaps                               -                       CABOG                   -
  Greedy contig extension                       SSAKE, SHARCGS, VCAKE   -                       -
  Collapse simple paths                         -                       CABOG, Newbler          Euler, SOAP, Velvet
  Erosion of spurs                              -                       CABOG, Edena            Euler, AllPaths, SOAP, Velvet
  Transitive overlap reduction                  -                       Edena                   -
  Bubble smoothing                              -                       Edena                   Euler, SOAP, Velvet
  Bubble detection                              -                       -                       AllPaths
  Reads separate tangled paths                  -                       -                       Euler, SOAP
  Break at low coverage                         -                       -                       SOAP, Velvet
  Break at high coverage                        -                       CABOG                   Euler
  High coverage indicates repeat                -                       CABOG                   Velvet
  Special use of long reads                     -                       Shorty                  Velvet

Source: Miller, J.R. et al., Genomics 95: 315–327, 2010. With permission.


The quality of an overlap is measured by its size and its identity (the percentage of bases shared between the two reads in the overlap region). However, simplification of the graph is based only on the size of the overlaps between reads, so mechanisms to prevent misassemblies must be implemented [1,28]. The term “greedy” refers to the fact that the algorithm makes each decision according to a local measure of quality (in the case of assembly, the quality of the overlaps between the reads), which may not lead to a globally optimal solution; for this reason, greedy assemblers can generate numerous misassemblies (Figures 15.11.2 and 15.11.3) [1].
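The effect of purely local decisions can be reproduced with the four reads of Figure 15.11. The sketch below is a toy greedy assembler under stated assumptions: overlaps must match exactly, the score is the overlap length only, ties are broken by input order (an artifact of this sketch), and the names `overlap_len`, `greedy_assemble`, and `min_len` are hypothetical rather than taken from any published tool.

```python
def overlap_len(a, b, min_len=3):
    """Longest exact suffix-of-a / prefix-of-b overlap, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap.

    Each merge is chosen on local evidence only, so repeats can mislead it.
    """
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    while len(seqs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i == j:
                    continue
                n = overlap_len(a, b, min_len)
                if n > best_n:  # strict '>' keeps the first pair found on ties
                    best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to join: the remaining sequences are the contigs
        merged = seqs[best_i] + seqs[best_j][best_n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)]
        seqs.append(merged)
    return seqs


# Reads a to d as transcribed from Figure 15.11; AGACAG is the repeated region.
a, b, c, d = "ACAGTTAGACAG", "AGACAGAACGA", "CGACGTAGAGGAGACAG", "AGACAGTATTGC"
correct = greedy_assemble([a, b, c, d])
misassembled = greedy_assemble([a, d, b, c])
```

Reads a and c both end in the repeat, and reads b and d both begin with it, so four overlaps share the same top score. With the first input order the sketch happens to join a to b and recovers the region in one contig; with the second it joins a to d first, as in Figure 15.11.3, and finishes with two chimeric contigs.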

During assembly, in which reads are added iteratively, the fragments are considered in descending order of overlap quality, as explained previously. To avoid misassemblies, the extension process is terminated when conflicting information is identified, for example, when two or more reads extend a contig but do not overlap each other (Figure 15.11.4) [1].
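The stopping rule can be illustrated with the contig and the two discordant reads of Figure 15.11.4. This is a simplified sketch, assuming exact overlaps and treating two reads as conflicting when the bases they would add to the contig disagree; `extend_contig` and its status strings are hypothetical names introduced for illustration.

```python
def overlap_len(a, b, min_len=3):
    """Longest exact suffix-of-a / prefix-of-b overlap, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def extend_contig(contig, reads, min_len=3):
    """Extend `contig` to the right only when every overlapping read agrees.

    Returns (sequence, status), where status is 'extended', 'conflict',
    or 'no_overlap'.
    """
    tails = []
    for read in reads:
        n = overlap_len(contig, read, min_len)
        if 0 < n < len(read):
            tails.append(read[n:])  # bases this read would add to the contig
    if not tails:
        return contig, "no_overlap"
    longest = max(tails, key=len)
    # All proposed extensions are consistent only if each is a prefix of the longest.
    if any(not longest.startswith(t) for t in tails):
        return contig, "conflict"
    return contig + longest, "extended"


contig = "ACAGTTAGACAGAACGACGTAGAGGAGACAGTATTGC"
discordant = ["TATTGCGATATGGAGTT", "ACAGTATTGCTTATATAGGGA"]
result = extend_contig(contig, discordant)
```

Both reads overlap the end of the contig, but they propose different continuations, so the contig is returned unchanged with the status `conflict`.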

Approaches based on greedy algorithms need a mechanism to avoid incorporating false-positive overlaps into contigs. Overlaps induced by repeat sequences can score higher than overlaps in regions without repeats, and an assembly that accepts a false-positive overlap will join unrelated sequences at the ends of a repeat and produce a chimera [28]. Some assemblers use additional algorithms to avoid including errors; SHARCGS, for example, includes a preprocessing step that filters out erroneous reads. The parameters of this filter can be modified by the user [59].

FIGURE 15.11

(1) Overlap between two reads, in which the overlapping region does not need to be a perfect match; (2) example of correct assembly of a region of the genome that has two repetitive regions (box) using four reads (a to d); (3) assembly generated by the greedy approach, in which reads a and d are incorrectly assembled first because a better overlap is identified; and (4) discordance between two reads (discordance region) that could extend a contig (bold sequence). Extension of the contig can be terminated to avoid misassemblies.

15.5 Tools for Visualizing NGS Data and Producing a Scaffold

The development of NGS platforms, also called high-throughput sequencing machines, has opened new opportunities for biological applications, including resequencing of genomes, sequencing of the transcriptome, ChIP-seq, and discovery of miRNA [60]. These NGS technologies created a need for new tools to visualize the results of the assemblies and alignments of short reads [61]. Consequently, new challenges arose: a need to rapidly and efficiently process an enormous quantity of reads, a need for high-quality interpretation of data, a user-friendly interface, and the capacity to accept the various file formats produced by different sequencers and assemblers [62]. The sequences generated by genome assembly can be visualized, for example, with Consed [63], which allows the data to be edited, and Hawkeye [64]. Various programs have been developed for NGS reads, including EagleView [65], Tablet [62], MapView [66], MaqView [31], SAMtools [48,55], and Integrative Genomics Viewer (http://www.broadinstitute.org).

The main differences between the visualizers lie in the user interface, the data processing speed, the supported input file formats, and the development of the scaffold. Loading NGS data into programs such as Consed and Hawkeye, for example, requires a large amount of memory, which is normally not available to users of desktop computers. EagleView is a visualization tool developed specifically for NGS data, but it does not permit visualization of paired reads and it has memory limitations. MapView permits analysis of genetic variation and supports paired-end and single-end reads as well as various input and output file formats [66].

The scaffold is made of DNA sequences that are reconstructed after sequencing; it can be composed of contigs, which should be ordered and oriented relative to each other with the help of a reference genome, and of gaps: regions where the DNA sequence is not known, either because it does not exist in the genome or because it was not covered by the sequencing or assembly [67]. Software options for generating genome scaffolds using a reference genome include Bambus [68], from the AMOS software package, which can exchange input and output with the software MUMmer [42]. Genscaff [69] uses contiguous sequences generated by an assembly program, without the help of a reference genome, through an implementation of graph theory. CLCBio Workbench (http://www.clcbio.com) and the Lasergene software package (http://www.dnastar.com), among other functionalities, produce and edit the scaffold; however, they are commercial software packages.
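As an illustration of ordering and orienting contigs against a reference, the following sketch builds a scaffold string from contig placements. It is not how Bambus or any other tool named above works internally: it assumes that each contig already has a reference start coordinate and strand (for example, from a prior alignment), that contigs and reference have no insertions or deletions relative to each other, and the names `revcomp`, `build_scaffold`, and the placement tuple layout are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA sequence (uppercase A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_scaffold(placements):
    """Order and orient contigs by reference position and pad gaps with N.

    `placements` is a list of (contig_sequence, ref_start, strand) tuples,
    with 0-based `ref_start` and strand '+' or '-'.
    """
    scaffold = ""
    for seq, start, strand in sorted(placements, key=lambda p: p[1]):
        if strand == "-":
            seq = revcomp(seq)  # orient the contig like the reference
        gap = start - len(scaffold)
        if gap > 0:
            scaffold += "N" * gap  # region not covered by any contig
        elif gap < 0:
            seq = seq[-gap:]  # contigs overlap: keep only the new bases
        scaffold += seq
    return scaffold


# Two contigs given out of order; the second lies on the reverse strand.
scaffold = build_scaffold([("CTGT", 10, "-"), ("ACAGTT", 0, "+")])
```

The run of N characters in the result marks the gap between the two contigs, which is the kind of region addressed in the next section.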

15.6 Closing Gaps

An artifact of genome assembly is the formation of gaps (holes or spaces). Usually, the strategy used to resolve these gaps is to design specific primers for the region and then align the sequences amplified with these primers, thereby closing the gap. However, for large gaps (2 kb or longer), new primers are needed. This process therefore requires considerable time and becomes expensive [70]. Given this situation, we describe here an in silico strategy to resolve this type of artifact, which uses the short reads generated by SOLiD (NGS) that were not mapped during the assembly process.

15.7 Description of the Gap Closing Strategy

Align the short reads to the flanking regions of the mapped gaps. Then, the nucleotides that have a PHRED quality of 20 or more and a minimum of 10× coverage should be added manually (Figure 15.12; steps 1 and 2). This extension will close the small gaps (1–100 bp).
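The acceptance rule for new nucleotides can be expressed as a short filter. A PHRED quality Q corresponds to an estimated error probability of 10^(-Q/10), so Q = 20 means a 1% chance that the base call is wrong. The sketch below automates what the text describes as a manual step and rests on several assumptions: the aligned reads have already been converted into per-position columns of (base, quality) pairs beyond the contig end, coverage is counted over the bases that pass the quality filter, and the consensus base is chosen by majority vote. The function name and parameters are hypothetical.

```python
from collections import Counter


def extend_into_gap(contig, columns, min_qual=20, min_cov=10):
    """Extend `contig` one base per column while the evidence is sufficient.

    `columns` is a list of alignment columns beyond the contig end; each
    column is a list of (base, phred_quality) pairs from the reads covering
    that position. Extension stops at the first column with fewer than
    `min_cov` bases of quality >= `min_qual`.
    """
    for column in columns:
        good = [base for base, qual in column if qual >= min_qual]
        if len(good) < min_cov:
            break  # not enough high-quality coverage: leave the rest of the gap open
        consensus, _count = Counter(good).most_common(1)[0]
        contig += consensus
    return contig


well_covered = [("A", 30)] * 12                    # 12 bases, all Q30
low_coverage = [("C", 30)] * 9 + [("C", 10)] * 5   # only 9 bases reach Q20
extended = extend_into_gap("GATTACA", [well_covered, low_coverage, well_covered])
```

Only the first column is added here: the second has nine qualifying bases, one short of the 10× threshold, so extension stops even though the third column would pass.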


If gaps remain, the short reads should be realigned against the reference genome (Figure 15.12; step 3), because once the new contigs have been produced, new short reads align in the flanking regions of the gaps, forming what we call a merged contig. In this way, the genome could possibly be closed completely in silico, without using PCR. We emphasize that this strategy was used during the assembly of the genome of a strain of Corynebacterium pseudotuberculosis, which was sequenced with the SOLiD platform, which generated 19,091,361 reads (140× coverage). This approach mapped 590 gap regions and closed 100% of the gaps [71].

References

1. Pop, M. 2009. Genome assembly reborn: Recent computational challenges. Briefings in Bioinformatics 10: 354–366.
2. Pop, M., and S.L. Salzberg. 2008. Bioinformatics challenges of new sequencing technology. Trends in Genetics 24: 142–149.
3. Batzoglou, S., D.B. Jaffe, K. Stanley et al. 2002. ARACHNE: A whole genome shotgun assembler. Genome Research 12: 177–189.
4. Myers, E.W., G.G. Sutton, A.L. Delcher et al. 2000. A whole-genome assembly of Drosophila. Science 287: 2196–2204.
5. Pevzner, P.A., H. Tang, and M.S. Waterman. 2001. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America 98.
6. Phillippy, A.M., M.C. Schatz, and M. Pop. 2008. Genome assembly forensics: Finding the elusive misassembly. Genome Biology 9: R55.
7. Needleman, S., and C. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443–453.
8. Higgins, D., and P. Sharp. 1988. CLUSTAL: A package for performing multiple sequence alignment on a microcomputer. Gene 73: 237–244.
9. Smith, T., and M. Waterman. 1981. Identification of common molecular subsequences. Journal of Molecular Biology 147: 195–197.
10. Altschul, S., W. Gish, W. Miller, E. Myers, and D. Lipman. 1990. Basic local alignment search tool. Journal of Molecular Biology 215: 403–410.
11. Bateman, A., and M. Wood. 2009. Cloud computing. Bioinformatics 25(12): 1474.

FIGURE 15.12

Description of the strategy for closing the gaps: step 1, short reads are aligned to the initial assembly; step 2, short reads that align at the ends of contigs are assembled into new contigs; step 3, the short reads are aligned against the updated sequence and the process is repeated until the gap is closed.


12. Zerbino, D. 2009. Genome Assembly and Comparison Using de Bruijn Graphs. PhD thesis, University of Cambridge.
13. Cole, C.G., O.T. McCann, J.E. Oliver et al. 2008. Finishing the finished human chromosome 22 sequence. Genome Biology 9: R78.
14. Sasson, S.A. 2010. From Millions to One: Theoretical and Concrete Approaches to De novo Assembly Using Short Read DNA Sequences. PhD thesis, Graduate School-New Brunswick Rutgers, The State University of New Jersey.
15. Chou, H.H., and M.H. Holmes. 2001. DNA sequence quality trimming and vector removal. Bioinformatics 17(12): 1093–1104.
16. Ramos, R.T., A.R. Carneiro, J. Baumbach, V. Azevedo, M.P. Schneider, and A. Silva. 2011. Analysis of quality raw data of second generation sequencers with quality assessment software. BMC Research Notes 4: 130.
17. Blankenberg, D., A. Gordon, G. Von Kuster, N. Coraor, J. Taylor, and A. Nekrutenko. 2010. Manipulation of FASTQ data with Galaxy. Bioinformatics 26: 1783–1785.
18. Morgan, M., S. Anders, M. Lawrence, P. Aboyoun, H. Pagès, and R. Gentleman. 2009. ShortRead: A Bioconductor package for input, quality assessment and exploration of high-throughput sequence data. Bioinformatics 25: 2607–2608.
19. Martínez-Alcántara, A., E. Ballesteros, C. Feng et al. 2009. PIQA: Pipeline for Illumina G1 genome analyzer data quality assessment. Bioinformatics 25: 2438–2439.
20. Li, H., and N. Homer. 2010. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics 11: 181–197.
21. Bryant, D.W., W.K. Wong, and T.C. Mockler. 2009. QSRA—a quality-value guided de novo short read assembler. BMC Bioinformatics 10: 69.
22. Jeck, W., J. Reinhardt, D. Baltrus et al. 2007. Extending assembly of short DNA sequences to handle error. BMC Bioinformatics 23: 2942–2944.
23. Marioni, J.C., C.E. Mason, S.M. Mane, M. Stephens, and Y. Gilad. 2008. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research 18: 1509–1517.
24. Salmela, L. 2010. Correction of sequencing errors in a mixed set of reads. Bioinformatics 26: 1284–1290.
25. Schroder, J., H. Schroder, S.J. Puglisi, R. Sinha, and B. Schmidt. 2009. SHREC: A short-read error correction method. Bioinformatics 25: 2157–2163.
26. Smith, A.D., Z. Xuan, and M.Q. Zhang. 2008. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics 9: 128.
27. Chaisson, M.J., and P.A. Pevzner. 2008. Short read fragment assembly of bacterial genomes. Genome Research 18: 324–330.
28. Miller, J.R., S. Koren, and G. Sutton. 2010. Assembly algorithms for next-generation sequencing data. Genomics 95: 315–327.
29. Rumble, S.M., P. Lacroute, A.V. Dalca, M. Fiume, A. Sidow, and M. Brudno. 2009. SHRiMP: Accurate mapping of short color-space reads. PLoS Computational Biology 5: 5.
30. Li, R., C. Yu, Y. Li et al. 2009. SOAP2: An improved ultrafast tool for short read alignment. Bioinformatics 25: 1966–1967.
31. Li, H., J. Ruan, and R. Durbin. 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 18: 1851–1858.
32. Lin, H., Z. Zhang, M.Q. Zhang, B. Ma, and M. Li. 2008. ZOOM! Zillions of oligos mapped. Bioinformatics 24: 2431–2437.
33. Butler, J., I. MacCallum, M. Kleber et al. 2008. ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Research 18: 810–820.
34. Chaisson, M., P. Pevzner, and H. Tang. 2004. Fragment assembly with short reads. Bioinformatics 20: 2067–2074.
35. Li, Y., Y. Hu, L. Bolund, and J. Wang. 2010. State of the art de novo assembly of human genomes from massively parallel sequencing data. Human Genomics 4(4): 271–277.


36. Zerbino, D.R., and E. Birney. 2008. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18: 821–829.
37. Ewing, B., and P. Green. 1998. Base-calling of automated sequencer traces using PHRED. II. Error probabilities. Genome Research 8(3): 186–194.
38. Flicek, P., and E. Birney. 2009. Sense from sequence reads: Methods for alignment and assembly. Nature Methods 6: S6–S12.
39. Pevzner, P.A., and H. Tang. 2001. Fragment assembly with double-barreled data. Bioinformatics 17: 225–233.
40. Kingsford, C., M.C. Schatz, and M. Pop. 2010. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11: 21.
41. Ferragina, P., and G. Manzini. 2000. Opportunistic data structures with applications. In Proceedings of the 41st Symposium on Foundations of Computer Science FOCS 2000, Redondo Beach, California: 390–398.
42. Kurtz, S., A. Phillippy, A.L. Delcher et al. 2004. Versatile and open software for comparing large genomes. Genome Biology 5: R12.
43. Meek, C., J.M. Patel, and S. Kasetty. 2003. OASIS: An online and accurate technique for local-alignment searches on biological sequences. In Proceedings of 29th International Conference on Very Large Data Bases VLDB 2003, Berlin: 910–921.
44. Abouelhoda, M.I., S. Kurtz, and E. Ohlebusch. 2004. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2: 53–86.
45. Hoffmann, S., C. Otto, S. Kurtz et al. 2009. Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Computational Biology 5: 1–10.
46. Langmead, B., C. Trapnell, M. Pop, and S.L. Salzberg. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10: R25.
47. Li, H., and R. Durbin. 2010. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26(5): 589–595.
48. Li, H., and R. Durbin. 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25: 1754–1760.
49. Lam, T.W., W.K. Sung, S.L. Tam, C.K. Wong, and S.M. Yiu. 2008. Compressed indexing and local alignment of DNA. Bioinformatics 24: 791–797.
50. Hernandez, D., P. François, L. Farinelli, M. Osterås, and J. Schrenzel. 2008. De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer. Genome Research 18: 802–809.
51. Myers, E.W. 1995. Towards simplifying and accurately formulating fragment assembly. Journal of Computational Biology 2: 1–21.
52. Huang, X., and S. Yang. 2005. Generating a genome assembly with PCAP. In Current Protocols in Bioinformatics Unit 11.3.
53. Lemos, M., A. Basílio, and A. Casanova. 2003. Um Estudo dos Algoritmos de Montagem de Fragmentos de DNA. PUC Rio, Rio de Janeiro.
54. Fleischner, H. 1990. Eulerian Graphs and Related Topics. London: Elsevier Science.
55. Li, H., B. Handsaker, A. Wysoker et al. 2009. 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16): 2078–2079.
56. Simpson, J.T., K. Wong, S.D. Jackman, J.E. Schein, S.J.M. Jones, and I. Birol. 2009. ABySS: A parallel assembler for short read sequence data. Genome Research 19: 1117–1123.
57. Medvedev, P., K. Georgiou, G. Myers, and M. Brudno. 2007. Computability of models for sequence assembly. In Proceedings of Workshop on Algorithms in Bioinformatics WABI 289–301.
58. Warren, R.L., C.G. Sutton, S.J. Jones, and R.A. Holt. 2007. Assembling millions of short DNA sequences using SSAKE. Bioinformatics 15: 234.
59. Dohm, J.C., C. Lottaz, T. Borodina, and H. Himmelbauer. 2007. SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Research 17: 1697–1706.
60. Shendure, J., and H. Ji. 2008. Next-generation DNA sequencing. Nature Biotechnology 26: 1135–1145.


61. Magi, A., M. Benelli, A. Gozzini, F. Girolami, and M.L. Brandi. 2010. Bioinformatics for next generation sequencing data. Genes 1: 294–307.
62. Milne, I., M. Bayer, L. Cardle et al. 2010. Tablet—Next generation sequence assembly visualization. Bioinformatics 26(3): 401–402.
63. Gordon, D., C. Abajian, and P. Green. 1998. Consed: A graphical tool for sequence finishing. Genome Research 8: 195–202.
64. Schatz, M.C., A.M. Phillippy, B. Shneiderman, and S.L. Salzberg. 2007. Hawkeye: An interactive visual analytics tool for genome assemblies. Genome Biology 8: R34.
65. Huang, W., and G. Marth. 2008. EagleView: A genome assembly viewer for next-generation sequencing technologies. Genome Research 18(9): 1538–1543.
66. Bao, H., H. Guo, J. Wang, R. Zhou, X. Lu, and S. Shi. 2009. MapView: Visualization of short reads alignment on a desktop computer. Bioinformatics 25(12): 1554–1555.
67. Schuster, S.C. 2008. Next-generation sequencing transforms today’s biology. Nature Methods 5: 16–18.
68. Pop, M., D.S. Kosack, and S.L. Salzberg. 2004. Hierarchical scaffolding with Bambus. Genome Research 14: 149–159.
69. Setúbal, J.C., and R. Werneck. 2001. A program for building contig scaffolds in double-barrelled shotgun genome sequencing. Campinas Instituto de Computação, Unicamp.
70. Tsai, I.J., D.T. Otto, and M. Berriman. 2010. Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps. Genome Biology 11: R41.
71. Silva, A., M.P. Schneider, L. Cerdeira et al. 2011. Complete genome sequence of Corynebacterium pseudotuberculosis I19, a strain isolated from a cow in Israel with bovine mastitis. Journal of Bacteriology 193(1): 323–324.
