Next-Generation Sequencing and Assembly of Bacterial Genomes

Artur Silva, Rommel Ramos, Adriana Carneiro, Sintia Almeida,
Vinicius De Abreu,Anderson Santos, Siomar Soares, Anne Pinto,
Luis Guimarães, Eudes Barbosa, Paula Schneider, Vasudeo Zambare,
Debmalya Barh, Anderson Miyoshi, and Vasco Azevedo
CONTENTS
15.1 Introduction
15.2 Treatment of the Data
15.2.1 Base Quality
15.2.2 Error Correction (Tools)
15.3 Strategies for Assembling Genomes
15.3.1 Reference Mounting
15.3.2 De novo Assemblers
15.3.3 Challenges and Difficulties for de novo Assembly
15.4 Tools for Assembling Genomes
15.4.1 Tools for Reference Assembly
15.4.2 Tools for De novo Assembly
15.4.2.1 Tools that Use OLC
15.4.2.2 Tools that Use the DBG
15.4.2.3 Tools that Use Greedy Graph
15.5 Tools for Visualizing NGS Data and Producing a Scaffold
15.6 Closing Gaps
15.7 Description of the Gap Closing Strategy
References
K15973_C015.indd 367 12/20/2012 3:39:32 PM
368 OMICS: Applications in Biomedical, Agricultural, and Environmental Sciences
15.1 Introduction
One of the most important advances in biology has been our capacity to sequence the DNA
of organisms. However, long after the conclusion of human genome sequencing, there are
still regions of the genome that are unworkable; that is, they are difficult to assemble and
remain incomplete. Answers may come from second-generation sequencing, which
produces large volumes of data, generating millions of short reads per run, a scale that
was unimaginable with Sanger sequencing.
Although we can now generate a high degree of sequencing coverage (Figure 15.1),
the de novo assembly of short reads is more complex than reference assembly.
Various algorithms and bioinformatics tools have been developed to address these new
problems and computational challenges, such as identification of repeat regions,
sequencing errors, and simultaneous manipulation of short reads [1,2].
After reads are generated by the sequencer, it is necessary to join them in a logical fashion
to assemble the final sequence. Over the years, various tools have been developed to resolve
this issue, for example, the assemblers PHRAP (http://www.phrap.org), ARACHNE [3],
and Celera [4]. They have a paradigm in common, often referred to as
overlap-layout-consensus (OLC) [5]. This approach is quite similar to that used to solve
a jigsaw puzzle, as described below.
The first step consists of aligning the reads, two by two, exhaustively; the pairs of
reads should present a consistent overlap from one read to another, similar to the search
for pieces in a jigsaw puzzle that fit each other and have colors that match. Especially
in eukaryotic genomes, the main difficulty is in distinguishing inexact overlaps due to
sequencing errors from similarities within the genome, such as highly conserved repeat
regions [6]. Sequence alignment is a widely studied area of bioinformatics, which consists
of finding the optimal alignment between two sequences as a function of an evaluation
score. Most of these methods are based on the Needleman–Wunsch algorithm [7], which
uses dynamic programming over the space of possible alignments between the sequences.
Many extensions have been conceived, for example, for multiple alignments [8], local
alignments [9], or rapid searches of large databases [10].
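The pairwise alignment step described above can be illustrated with a minimal sketch of Needleman–Wunsch global alignment. The scoring values (match = 1, mismatch = -1, gap = -1) and the function name are assumptions chosen for illustration; they are not taken from the chapter or from any particular assembler.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b (illustrative scoring values).

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two bases
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The table has (n + 1) x (m + 1) cells, so time and memory grow with the product of the two lengths, which is one reason exhaustive pairwise alignment becomes costly for large sets of reads.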
Current techniques can be executed rapidly on parallel processors. To process short
reads generated by second-generation sequencing platforms, one of the solutions found
for simultaneously manipulating thousands of sequences has been the use of cloud
computing [11]. The assemblers detect groups of reads with consistent alignments with each
other, forming contiguous sequences (contigs). This would be equivalent to partially
forming an image by putting together pieces of a jigsaw puzzle. In both assemblies, genome
and jigsaw puzzle, the process can be interrupted in ambiguous regions, where various
continuations are possible, or at holes, where no connecting piece was found [12].
Finally, the assembler tries to order and orient the contigs with respect to each other in
a de novo manner; that is, without the help of a reference sequence. Returning to the
metaphor of a jigsaw puzzle, this would correspond to identifying corners and different
parts of the image that relate to each other. At the end of the assembly phase, a scaffolded
group of contigs becomes available. Nevertheless, it is desirable to remove as many gaps as
possible, eventually converging to a group of complete chromosomes, that is, those that do
not include breaks. This phase, called finishing, can be expensive and could take
considerable time, depending on the strategy used to close the remaining holes in the
genome scaffold [13].
FIGURE 15.1
Qualitative comparison between the sequences generated by Sanger and those from the NGS platforms. There
is a higher abundance and depth of coverage with the short reads, but they are also significantly shorter, with
little overlap available for assurance.
15.2 Treatment of the Data
Preprocessing of the data, involving a quality filter and correction of sequencing
errors, is essential to increase the accuracy of the assemblies, as it prevents incorrect or
low-quality reads from becoming part of the genome assembly process.
15.2.1 Base Quality
In 1998, Ewing and collaborators developed the PHRED algorithm, with the objective of
determining the probability of occurrence of each of the four nucleotides (A, C, G, or T)
for each base of a DNA sequence during the base-calling process; the intensity of the
wavelength that is obtained through the fluorescence produced by ligations between
nucleotides is used to calculate the PHRED quality value (Q), which is logarithmically
related to the observed probability of error for each base (P), according to the formula
presented in Equation 15.1.
$Q = -10 \log_{10} P$ (15.1)
In Table 15.1, we can observe examples of PHRED quality values associated with the
probability of a base call being incorrect and the accuracy of the identification of the base.
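As a small worked example, Equation 15.1 and its inverse can be written directly in code. The function names are illustrative; the loop simply recomputes the error probabilities and accuracies for the quality scores 10 to 50.

```python
import math


def phred_to_error_probability(q):
    """Invert Equation 15.1: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Equation 15.1: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40, 50):
        p = phred_to_error_probability(q)
        print(f"Q={q}  P={p:g}  accuracy={100 * (1 - p):g}%")
```

Each increase of 10 in Q corresponds to a tenfold decrease in the probability of error.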
TABLE 15.1
Error Probability and Precision Based on PHRED Quality Values
PHRED Quality Score | Error Probability | Accuracy (%)
10 | 1/10 | 90
20 | 1/100 | 99
30 | 1/1000 | 99.9
40 | 1/10,000 | 99.99
50 | 1/100,000 | 99.999
The sequences obtained from automatic sequencers are not considered to be reliable
due to their low quality at the ends (Figure 15.2) and because of contaminants.
Consequently, working on the quality of the data is fundamental so that the following
phases of processing biological information are not compromised [15]. In the case of
the sequences obtained from next-generation sequencing (NGS), despite the high degree
of coverage, base quality should still be evaluated. In this way, the reads can be trimmed
and quality filters applied. As examples of tools that do this quality treatment, we can cite
Quality Assessment [16], Galaxy [17], ShortRead [18], and PIQA [19], the last of which is
used exclusively for Illumina data.
FIGURE 15.2
Low quality of the ends of the reads obtained from automatic sequencers, which use the dideoxynucleotide
method. (Plot of Phred quality along the sequence GCTAGCATGCTAGCTACGATGCATC, with a quality cutoff.)
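The trimming and filtering described above can be sketched as follows. This is a simplified illustration and not the algorithm of any of the cited tools: the cutoff of 20, the minimum length, and the function names are assumptions, and the Phred+33 character encoding used in `decode_phred33` is the common FASTQ convention rather than something stated in this chapter.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string to integers (assumes Phred+33 encoding)."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_low_quality_ends(seq, quals, cutoff=20):
    """Remove bases from both ends until a base with quality >= cutoff is met."""
    start, end = 0, len(seq)
    while start < end and quals[start] < cutoff:
        start += 1
    while end > start and quals[end - 1] < cutoff:
        end -= 1
    return seq[start:end], quals[start:end]


def passes_filter(quals, min_mean_quality=20, min_length=15):
    """Keep a read only if it is long enough and its mean quality is high enough."""
    if not quals or len(quals) < min_length:
        return False
    return sum(quals) / len(quals) >= min_mean_quality


if __name__ == "__main__":
    seq = "GCTAGCATGCTAGCTACGATGCATC"
    # Hypothetical qualities: poor at both ends, good in the middle.
    quals = [5, 8, 12] + [30] * 19 + [10, 6, 3]
    trimmed_seq, trimmed_quals = trim_low_quality_ends(seq, quals)
    print(trimmed_seq, passes_filter(trimmed_quals))
```

Trimming only from the ends mirrors the pattern in Figure 15.2, where quality drops at the extremities; internal low-quality bases are left for the later filter or for error correction.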
Analysis of sequence quality, followed by data treatment, makes it possible to reduce
alignment errors because it provides precise alignment parameters according to the data
that are the objects of study [20]. In the assembly of genomes, software such as the
Quality-value guided Short Read Assembler (QSRA) [21] extends sequences through
analysis of base quality, giving better results than the assembler on which it was based
(VCAKE, which does not take base quality into account to extend sequences) [22]. In
transcriptome studies with NGS platforms using RNA-seq, evaluation of the quality of the
data is extremely important because the coverage represents the level of expression;
consequently, quality filters using stringent parameters can cause variations in the
expression levels that are found [23].
15.2.2 Error Correction (Tools)
Despite the high degree of accuracy provided by NGS platforms, which results from the
extensive coverage generated by this equipment, sequencing errors can cause problems in
the assembly of genomes when using a de novo approach because the generation of contigs
is very sensitive to these errors [24]. Consequently, to obtain better results, it is
necessary to correct the errors before assembling the genomes, which makes the data more
reliable [25]. In resequencing projects, in which the reads obtained from the sequencers
are aligned against a reference genome, error correction can avoid elimination (trimming)
of the 3′ ends of the reads, which is otherwise done because of their low quality when one
tries to improve the alignments [26].
Some genome assemblers already include error correction procedures: SHARCGS
only considers reads that have been produced by the sequencer n times, where n is a
user-defined parameter, and that overlap with other reads [25]. In 2001, Pevzner and
collaborators used the spectral alignment method to correct errors. Given a string S and
a spectrum T, formed by all of the contiguous strings of fixed size, the method searches
for the smallest number of modifications that need to be made in S to transform it into
a T string. This method of correction is implemented by the assembler EULER-SR [27]
before the process of assembling the genome.
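The idea of spectral alignment can be illustrated with a deliberately restricted sketch. It is not the EULER-SR implementation. It assumes that the spectrum consists of the k-mers seen at least `min_count` times in the reads, that a T string is a string whose k-mers all belong to that spectrum, and that at most one substitution is tried per read. All names and thresholds are hypothetical.

```python
from collections import Counter


def kmers(s, k):
    """All contiguous substrings of length k, in order."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def build_spectrum(reads, k, min_count=2):
    """Spectrum T: k-mers observed at least min_count times (assumed threshold)."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, c in counts.items() if c >= min_count}


def is_t_string(s, spectrum, k):
    """True if every k-mer of s belongs to the spectrum."""
    return all(km in spectrum for km in kmers(s, k))


def correct_read(read, spectrum, k):
    """Try every single-base substitution; accept it only if it is unambiguous."""
    if is_t_string(read, spectrum, k):
        return read
    candidates = set()
    for i, base in enumerate(read):
        for alt in "ACGT":
            if alt != base:
                candidate = read[:i] + alt + read[i + 1:]
                if is_t_string(candidate, spectrum, k):
                    candidates.add(candidate)
    # Leave the read unchanged when zero or several corrections are possible.
    return candidates.pop() if len(candidates) == 1 else read


if __name__ == "__main__":
    reads = ["ACGTTGCAAG"] * 3 + ["ACGTAGCAAG"]  # last read carries one error
    spectrum = build_spectrum(reads, k=4)
    print([correct_read(r, spectrum, 4) for r in reads])
```

A full solution would search for the minimum number of edits of any kind, which is considerably more expensive than this one-substitution search.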
As examples of independent tools that can correct errors, we can cite Short-Read Error
Correction (SHREC) [25] for Solexa/Illumina data, which uses a generalized suffix tree
to process the data; the SOLiD Accuracy Enhancement Tool for SOLiD data, which
is available in LifeScope Genomic Analysis Software
(http://www.lifetechnologies.com/lifescope) and uses an approach similar to that of
EULER-SR; and Hybrid SHREC [24], which is based on the SHREC algorithm but can process
files from various sequencing platforms.
15.3 Strategies for Assembling Genomes
Reads from the sequencer should be submitted to preprocessing, where base quality
and sequencing errors are evaluated with software that is commonly specific to the
corresponding sequencing platform; they are then submitted to de novo assembly and then
oriented and ordered to produce the scaffold (Figure 15.3). If the assembly is done with a
reference genome, after preprocessing, the reads are mapped against this reference, and
after alignment is finished, a consensus sequence is produced.
15.3.1 Reference Mounting
Basically, reference assembly consists of mapping the reads obtained from sequencing
against a reference genome (Figure 15.4), preferably that of a phylogenetically closely
related organism, making it possible to align a large part of the reads. However, the
alignment configuration also influences the quantity of reads that is utilized;
consequently, parameters such as depth of coverage and the number of mismatches that are
permitted should be defined based on the sequencing information: estimated coverage and
PHRED quality of the bases [20]. Mapping against a reference sequence allows the
identification of nucleotide substitutions as well as indels, particularly with the use of
NGS platforms, due to the high degree of sequence coverage [28].
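A naive sketch can make the role of the mismatch parameter and of the depth of coverage concrete. It slides each read along the reference, accepts placements with at most `max_mismatches` substitutions, and then reports uncovered reference intervals. It is illustrative only: real mappers use indexes rather than exhaustive scanning, indels are ignored here, and all function names are hypothetical.

```python
def map_read(read, reference, max_mismatches=1):
    """Return every (position, mismatches) placement within the mismatch limit."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for x, y in zip(read, window) if x != y)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


def depth_of_coverage(reads, reference, max_mismatches=1):
    """Per-base depth, placing each read at its first best-scoring position."""
    depth = [0] * len(reference)
    for read in reads:
        hits = map_read(read, reference, max_mismatches)
        if not hits:
            continue  # unmapped read
        best = min(mm for _, mm in hits)
        pos = next(p for p, mm in hits if mm == best)
        for i in range(pos, pos + len(read)):
            depth[i] += 1
    return depth


def uncovered_regions(depth):
    """Half-open intervals [start, end) of the reference with zero depth."""
    gaps, start = [], None
    for i, d in enumerate(depth):
        if d == 0 and start is None:
            start = i
        elif d > 0 and start is not None:
            gaps.append((start, i))
            start = None
    if start is not None:
        gaps.append((start, len(depth)))
    return gaps


if __name__ == "__main__":
    reference = "ATGATCCGTAGGCTAACGTT"
    reads = ["ATGATCCG", "CCGTAGGC", "ATCATC"]
    depth = depth_of_coverage(reads, reference)
    print(depth, uncovered_regions(depth))
```

Raising `max_mismatches` lets more reads align but also increases the chance of placing a read in the wrong location, which is why the text recommends choosing it from the expected coverage and base quality.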
After mapping, regions of the reference genome that are not covered are observed,
representing gaps, which can occur because a nucleotide sequence present in the reference
does not occur in the sequenced organism, or because this region was not sequenced. Among
the problems with sequence assembly using a reference, we can cite the representation of
repeated regions, for example, the case of a reference genome that has two such regions
and a sequenced organism that has only one; during mapping against the reference, the two
reference regions will be covered, which can result in assembly errors (Figure 15.5). For
mapping reads using a reference genome, one can use software such as SHRiMP [29] and
SOLiD BioScope (Applied Biosystems, Foster City, CA), both of which align in SOLiD color
space, as well as SOAP2 [30], MAQ [31], RMAP [26], and ZOOM [32]. The program SOLiD
BioScope is a Java-based application that has various integrated tools in a web interface
for resequencing and transcriptome analyses. The reference assembly pipeline permits
mapping of reads based on reference genomes, identifying single nucleotide polymorphisms,
indels (insertions and deletions), and inversions, as well as generating a consensus
genome (Applied Biosystems).
FIGURE 15.3
Steps used for ab initio assembly of genomes (preprocessing, de novo assembly, scaffold). After data treatment
in the preprocessing stage, ab initio assembly is run, generating the contigs, which then are oriented and
ordered to generate the scaffold.
FIGURE 15.4
Alignment of reads against a reference genome, showing mismatches and a gap.
15.3.2 De novo Assemblers
This consists of reconstructing genome sequences without the aid of any other information
aside from the reads produced by the sequencing process. With this strategy, similarity
alignments can be made among the reads themselves, or through the overlap of k-mers.
This allows, at the end of the alignment process, the formation of contiguous sequences
(contigs), as seen in Figure 15.6. Most assemblers are based on graph theory, in which the
vertices and edges can represent an overlap, a k-mer, or a read, depending on the strategy
that is used, and the contigs are the paths formed in the graph. Thus, the assemblers
can be divided into greedy, OLC, or de Bruijn graph (DBG) algorithms; the latter uses a
Eulerian path [28].
DBG is the approach that is most widely used by assemblers of short reads because it
works better with large numbers of reads, typical of NGS sequencers. The main programs
that adopt this approach are AllPaths [33], Euler-SR [34], SOAPdenovo [35], and Velvet [36].
Among these programs, Velvet is the only one that assembles short sequences in the
color-space format.
15.3.3 Challenges and Difficulties for de novo Assembly
The limitations of de novo assembly approaches are directly associated with the
technological limitations and the features of the data generated by second-generation
sequencers, such as the sizes of the reads and the volume of data that is generated, which
exponentially increases the processing time and sometimes makes assembly unviable. Within
this context, various problems can occur, such as the grouping of repeat regions; there
are also regions in which sequences are of low quality, base compression in the
sequencing, and even regions with a low degree of coverage due to the random character of
the sequencing [37,38].
FIGURE 15.5
Double mapping of reads A, B, and C in the reference genome, because it involves a repetitive region. However,
in the sequenced genome, the number of repeats can be different from that observed in the reference genome.
FIGURE 15.6
Alignment between the reads generated by the sequencing, finally obtaining scaffolded contigs.
One of the classic examples of problems with de novo assembly is finding a path in the
overlap graph that passes through each of the vertices only once (Hamiltonian path) or
each edge only once (Eulerian path); this often results in the loss of connectivity between
very distant sequences, showing that strategies based on graphs, especially the de Bruijn
strategy, are extremely sensitive to sequencing errors [39]. These problems are more
complex and common in assemblies made with short reads: because the reads are much shorter
than Sanger sequences, many more of them are needed, which exponentially increases the
size of the problem. The large sizes of conserved repeat regions also make the process of
genome assembly difficult, and for eukaryotic genomes, which have very large repeat
regions, this task is sometimes a problem that is hard to resolve [5]. Despite the
problems cited above, studies show that up to 96.29% of a gene can be reconstructed using
short sequences, with sizes starting at 25 nucleotides [40].
15.4 Tools for Assembling Genomes
Second-generation sequencers are capable of generating thousands of reads, providing a
high degree of coverage and accuracy. As examples of these platforms, we can cite SOLiD,
Illumina, and 454 FLX Titanium. Despite the reduction in sequencing costs, among other
advantages, the reduction in the sizes of the reads, along with the increase in the number
of reads, results in computational challenges for the processing of these data,
principally for genome assemblies [28]. Assembling a genome consists of overlapping the
reads generated by a sequencer, based on their similarity, to produce contiguous sequences
(contigs), which in turn are ordered and oriented with respect to each other to construct
the scaffold. This is called a reference assembly when it involves mapping reads against a
reference sequence, whereas assembling reads without such a reference is called de novo
assembly [31].
15.4.1 Tools for Reference Assembly
For alignment against a reference genome, there are two approaches: using hash tables
and using prefix/suffix trees [20]. Many programs that use hash tables define the
subsequences obtained from the search sequence as the key. The program tries to map
identical sequences from the reference, known as seeds, so that the sequences can be
subsequently extended. However, the use of templates with spaced seeds gives better
results because it tolerates internal mismatches (Figure 15.7). Even so, independent of
whether seeds or spaced seeds are used, the alignments do not accept gaps; identification
of such gaps is made in a step after the extension of the alignment.
FIGURE 15.7
Seed: 11111111111
Spaced seed: 11101010101101011
Templates using a seed in which an exact match of 11 bases is necessary to initiate the extension, and a spaced
seed in which a match of 11 bases is required, but permitting the existence of internal mismatches.
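The difference between the two templates of Figure 15.7 can be sketched with a small hash-table index. The two template strings are taken from the figure; everything else (function names, the toy reference, the way hits are reported) is an assumption made for illustration and does not reproduce any particular mapper.

```python
SEED = "11111111111"
SPACED_SEED = "11101010101101011"


def masked_key(sequence, start, template):
    """Concatenate the bases that fall under the '1' positions of the template."""
    return "".join(sequence[start + i] for i, bit in enumerate(template) if bit == "1")


def build_index(reference, template):
    """Hash table: masked key -> list of reference positions."""
    index = {}
    for pos in range(len(reference) - len(template) + 1):
        index.setdefault(masked_key(reference, pos, template), []).append(pos)
    return index


def seed_hits(read, index, template):
    """Candidate start positions of the read on the reference (to be extended later)."""
    hits = set()
    for offset in range(len(read) - len(template) + 1):
        for ref_pos in index.get(masked_key(read, offset, template), []):
            hits.add(ref_pos - offset)
    return sorted(hits)


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTAGCCATGGATCCGATTACAGGCTTAA"
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    read = list(reference[3:20])
    for i in (5, 12):          # two mismatches under '0' positions of the spaced seed
        read[i] = swap[read[i]]
    read = "".join(read)
    print("contiguous:", seed_hits(read, build_index(reference, SEED), SEED))
    print("spaced:    ", seed_hits(read, build_index(reference, SPACED_SEED), SPACED_SEED))
```

In this toy case every window of 11 consecutive bases in the read contains a mismatch, so the contiguous seed cannot recover the true position, whereas the spaced seed ignores the two mismatched columns and does.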
The mapping algorithms that use prefix/suffix trees search for exact alignments,
represented by suffix trees, enhanced suffix arrays, and the Ferragina-Manzini index
(FM index) [41]; then they extend the alignments considering mismatches. Among the tools
that use suffix trees, we have MUMmer [42] and OASIS [43]. Among the alignment software
based on enhanced suffix arrays, we can cite Vmatch [44] and Segemehl [45]. The FM index
method uses a small amount of memory (from 0.5 to 2 bytes per nucleotide), which can vary
as a function of the implementation and the parameters that are used [20]; examples of
such programs include Bowtie [46], BWA [47], BWA-SW [48], SOAP2 [30], and BWT-SW [49].
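The exact-match search performed with an FM index can be sketched with the textbook backward-search procedure over the Burrows-Wheeler transform. This is a teaching version under simplifying assumptions: the suffix array is built by naive sorting and kept in full, and the occurrence table is stored uncompressed, so it does not show the memory savings mentioned above. The function names are hypothetical.

```python
def build_fm_index(text):
    """Return (suffix_array, first, occ) for text + '$' (naive construction)."""
    text += "$"  # sentinel, assumed absent from the text and smaller than A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each sorted suffix
    alphabet = sorted(set(text))
    # first[c] = number of characters in the text that are smaller than c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first, occ


def find_exact(pattern, index):
    """Backward search: sorted positions where pattern occurs exactly."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # current range of suffix-array rows, half-open
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


if __name__ == "__main__":
    index = build_fm_index("ATGATCCGATGA")
    print(find_exact("ATGA", index), find_exact("GAT", index))
```

The search takes one step per pattern character regardless of the reference length, which is the property that makes this family of indexes attractive for mapping very large numbers of short reads.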
15.4.2 Tools for De novo Assembly
According to Miller et al. [28], the de novo assembly of genomes consists of aligning the
reads with each other to produce contiguous sequences (contigs). The principal method-
ologies for NGS data are based on graphs, these being:
• OLC
• DBG
• Greedy
15.4.2.1 Tools that Use OLC
This is the most widely used approach for long sequences, such as those produced by
Sanger sequencing; nevertheless, there are also applications based on this method for
short reads, such as Edena [50]. The OLC method can be divided into three phases: overlap,
layout, and consensus. In the overlap phase, each read is compared with all of the others
to identify overlaps, considering the minimum size of overlap and k-mer, which will affect
the accuracy of the contigs. Among the types of overlaps that are recorded, four
categories are possible: containment, normal fitting, prefix fitting, and suffix fitting,
as shown in Figure 15.8 [51].
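The overlap phase can be sketched as an all-against-all comparison that records suffix-prefix overlaps and sets aside contained reads. Only exact overlaps are considered here, which is an assumption made to keep the example short; practical assemblers tolerate mismatches. The minimum overlap of 3 and the function names are illustrative.

```python
def suffix_prefix_overlap(a, b, min_overlap=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if too short)."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def find_overlaps(reads, min_overlap=3):
    """Return ({(i, j): overlap length}, set of indexes of contained reads)."""
    contained = {
        j
        for j, b in enumerate(reads)
        for i, a in enumerate(reads)
        if i != j and len(b) < len(a) and b in a
    }
    overlaps = {}
    for i, a in enumerate(reads):
        if i in contained:
            continue
        for j, b in enumerate(reads):
            if j == i or j in contained:
                continue
            length = suffix_prefix_overlap(a, b, min_overlap)
            if length:
                overlaps[(i, j)] = length
    return overlaps, contained


if __name__ == "__main__":
    reads = ["GTAATTGCCATCGG", "GCCATCGGTTGTA", "CCATC"]
    print(find_overlaps(reads))
```

The nested loops compare every ordered pair of reads, which is the quadratic cost that makes the overlap phase expensive for the very large read sets produced by NGS platforms.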
FIGURE 15.8
Containment (a), partial overlap (b), prefix overlap (c), and suffix overlap (d).
In the layout phase, the information obtained from the previous phase is used to
construct a graph (Figure 15.9), which is rebuilt at each update. At the end of this
phase, the first draft representing the genome is generated; in this phase, various
methods are used to simplify the paths and remove errors detected in the graph, such as
bubbles and linear extensions, known as dead ends [28]. In the consensus phase, multiple
alignments of the fragments are made progressively to develop a consensus sequence [28],
as shown in Figure 15.10. As examples of the assemblers that use the OLC strategy, we have
Celera Assembler [4], ARACHNE [3], CAP, and PCAP [52]. Edena is the only program for the
Solexa and SOLiD platforms that uses OLC [50].
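The consensus phase can be illustrated with a much simpler stand-in for progressive multiple alignment: a per-column majority vote over reads whose positions in the layout are already known. The gap-free layout, the offsets, and the function name are assumptions made for illustration.

```python
from collections import Counter


def consensus(layout):
    """Majority vote per column.

    layout: list of (offset, read) pairs, where offset is the assumed start
    of the read in the layout. Columns with no read are reported as 'N'.
    Ties are resolved in favor of the base that was seen first.
    """
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


if __name__ == "__main__":
    layout = [
        (0, "ACAGTTAGAC"),
        (4, "TTAGACAGAA"),
        (4, "TTACACAGAA"),  # hypothetical read with one sequencing error
        (8, "ACAGAACGAC"),
    ]
    print(consensus(layout))
```

With enough coverage, an isolated sequencing error is outvoted by the other reads covering the same column, which is why depth of coverage improves consensus accuracy.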
15.4.2.2 Tools that Use the DBG
In 1995, Idury and Waterman introduced the use of a graph to represent a sequence
assembly. Their method consisted of creating a vertex for each word. Then, the vertices
that correspond to overlapping k-mers are connected; a k-mer is a sequence with a specific
number (k) of bases. The origin vertex corresponds to the prefix of length k-1 of the
corresponding overlap region (k-mer), and the destination vertex to the suffix of length
k-1 of the same region, providing a reconstruction of the sequence through a path that
traverses each edge exactly once. Pevzner and Tang [39] proposed a representation that was
slightly different from the graph of the sequence, the so-called DBG, which uses a
Eulerian path; that is, a path that visits each edge exactly once, in which the k-mers are
represented as arcs or edges, and overlapping k-mers join at their ends.
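The construction described above can be sketched in a few lines: each distinct k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix, and a Eulerian path is found with Hierholzer's algorithm. The sketch assumes error-free reads, a single strand, and a graph that actually contains a Eulerian path; none of the error-handling steps of real assemblers are included, and the function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Edges of the graph: one prefix -> suffix edge per distinct k-mer."""
    edges = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                edges[kmer[:-1]].append(kmer[1:])
    return edges


def eulerian_path(edges):
    """Hierholzer's algorithm; assumes that a Eulerian path exists."""
    in_degree = defaultdict(int)
    for targets in edges.values():
        for target in targets:
            in_degree[target] += 1
    # Start at the vertex with one more outgoing than incoming edge, if any.
    start = next(
        (node for node, targets in edges.items()
         if len(targets) - in_degree[node] == 1),
        next(iter(edges)),
    )
    remaining = {node: list(targets) for node, targets in edges.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Sequence spelled by a path of (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    reads = ["ACGTTGC", "GTTGCAA", "TGCAAG"]
    print(spell(eulerian_path(de_bruijn_graph(reads, k=4))))
```

Hierholzer's algorithm visits each edge once, which reflects the linear-time behavior that makes the Eulerian formulation tractable on very large graphs. When the genome contains repeats longer than k-1, several Eulerian paths may exist and the reconstruction is no longer unique.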
FIGURE 15.9
Layout graph: G graph valid based on graph construction theory. First draft of what could be the genome.
FIGURE 15.10
Consensus graph: using progressive alignment guided by pairs.
The classic method of assembling fragments is based on the notion of an overlap graph.
Each read corresponds to a vertex in the overlap graph, and two vertices are connected
by an arc if the corresponding reads overlap [53]. Unlike the Hamiltonian path problem,
the Eulerian path problem is less complex and can be solved even for graphs with millions
of vertices, as there are linear-time algorithms that can provide a solution [54]. It is
important to emphasize that a de Bruijn graph is centered on the k-mer, which means that
its topology is not affected by the fragmentation of the reads [12]. Compared with the
overlap phase in OLC, the computational cost is much smaller because the reads are not
compared all against all [28]. Operationally, the DBG, containing vertices of length k, is
constructed by dividing the reads into k-mers, which are linked by exact, identical
overlaps for the previously defined value of k.
After construction of the graph, it is generally possible to simplify it without any loss
of information. The reads of the vertices are interrupted and initiated again at each
simplification. Simplification of two vertices is similar to the concatenation of two
strings of characters [51]. As in other approaches, the assemblers add accessory
algorithms to their main algorithm to help remove assembly errors, such as reduction of
redundant paths, removal of bubbles or path loops, and linear extensions; that is, those
that do not possess defined paths. The main programs that adopt the use of the DBG are
AllPaths [33], Euler-SR for short sequences [34], SOAPdenovo [54], and Velvet [36], which
specializes in the use of paired reads. Velvet is the only one that assembles short
sequences in color space. Other assembly programs, such as ABySS [55,56], were successful
in constructing DBGs while eliminating the limitations in the use of memory that are
common during assemblies; Table 15.2 shows the principal characteristics of each
program [28].
In 2009, inspired by the ideas of Pevzner et al. [5], Zerbino implemented the program
Velvet, the structure of which differs in various aspects. Among these, the k-mers are
mapped to the vertices and not to the arcs, and each vertex has an associated
reverse-complement sequence to obtain a bidirectional graph [57]. In this way, the
vertices can be connected by a directed edge or an arc. Due to the symmetry of the blocks,
if an arc goes from vertex A to B, a symmetrical arc goes from B to A. With any alteration
of an arc, it is implicit that the same change will be made symmetrically in the paired arc.
Each vertex can be represented by a single rectangle, which represents a series of
overlapping k-mers (in this case, k = 5) listed directly above or below. The last
nucleotide of each k-mer is colored red. The arcs are represented as arrows between
vertices. The last k-mer of the origin of an arc overlaps the first k-mer of its
destination. Each arc has a symmetrical arc [12].
15.4.2.3 Tools that Use Greedy Graph
Greedy algorithms were widely implemented in assembly programs for Sanger data,
such as PHRAP, TIGR Assembler, and CAP3 [1]. When new sequencing technologies
became available, other programs were developed for assembling NGS data (short reads)
using different greedy strategies, such as SSAKE [58], SHARCGS [59], and VCAKE [22].
These algorithms can use an OLC approach or a DBG, applying a basic operation
(Figure 15.11): starting with any read in the data set, another read is added, and in this
way numerous iterations are run until all possible operations have been tested and the
overlaps, in which a suffix of one read overlaps a prefix of another, have been identified
(Figure 15.11.1). Each operation uses the highest-scoring overlap, measured by the size of
the overlap between reads, to make the next join [1,2,28].
TABLE 15.2
Feature Comparison between De Novo Assemblers for Whole-Genome Shotgun Data from NGS Platforms
Algorithm Feature | Greedy Assemblers | OLC Assemblers | DBG Assemblers
Modeled features of reads
Base substitutions | - | - | Euler, AllPaths, SOAP
Homopolymer miscount | - | CABOG | -
Concentrated error in 3′ end | - | - | Euler
Flow space | - | Newbler | -
Color space | - | Shorty | Velvet
Removal of erroneous reads
Based on k-mer frequencies | - | - | Euler, Velvet, AllPaths
Based on k-mer frequency and quality value | - | - | AllPaths
For multiple values of k | - | - | AllPaths
By alignment to other reads | - | CABOG | -
By alignment and quality value | SHARCGS | - | -
Correction of erroneous base calls
Based on k-mer frequencies | - | - | Euler, SOAP
Based on k-mer frequencies and quality value | - | - | AllPaths
Based on alignments | - | CABOG | -
Approaches to graph construction
Implicit | SSAKE, SHARCGS, VCAKE | - | -
Reads as graph nodes | - | Edena, CABOG, Newbler | -
k-mer as graph nodes | - | - | Euler, Velvet, ABySS, SOAP
Simple path as graph nodes | - | - | AllPaths
Multiple values of k | - | - | Euler
Multiple overlap stringencies | SHARCGS | - | -
Approaches to graph reduction
Filter overlaps | - | CABOG | -
Greedy contig extension | SSAKE, SHARCGS, VCAKE | - | -
Collapse simple paths | - | CABOG, Newbler | Euler, SOAP, Velvet
Erosion of spurs | - | CABOG, Edena | Euler, AllPaths, SOAP, Velvet
Transitive overlap reduction | - | Edena | -
Bubble smoothing | - | Edena | Euler, SOAP, Velvet
Bubble detection | - | - | AllPaths
Reads separate tangled paths | - | - | Euler, SOAP
Break at low coverage | - | - | SOAP, Velvet
Break at high coverage | - | CABOG | Euler
High coverage indicates repeat | - | CABOG | Velvet
Special use of long reads | - | Shorty | Velvet
Source: Miller, J.R. et al., Genomics 95: 315–327, 2010. With permission.
The quality of the overlaps is measured by the size and the identity (percentage bases
shared between two reads in the overlap region). Also, simplication of the graph is based
only on the size of the overlaps between reads, it being necessary to implement mecha-
nisms to prevent misassemblies [1,28]. The term “greedy” refers to the fact that the deci-
sions taken by the algorithm occur as a function of a local quality (in the case of assembly,
the quality of the overlaps between the reads), which may not be an optimal global solu-
tion; in this way, assemblers based on greedy can generate numerous misassemblies
(Figure 15.11.2 and 15.11.3) [1].
During the assembly process, where the reads are added by iteration, the fragments
are considered in descending order according to their quality, as explained previously.
Consequently, to avoid missassemblies, the extension process is nalized when conicting
information is identied, for example, when two or more reads extend a contig, but with
no overlap between them (Figure 15.11.4) [1].
Approaches based on greedy algorithms need a mechanism to avoid incorporating false-positive overlaps into contigs. Overlaps induced by repeat sequences can score higher than overlaps in regions without repeats; moreover, an assembly that accepts a false-positive overlap will join unrelated sequences at the ends of a repeat and produce a chimera [28]. Some assemblers use other algorithms to avoid including errors; SHARCGS, for example, includes a preprocessing step that filters out erroneous reads. The parameters of this filter can be modified by the user of the program [59].
FIGURE 15.11
(1) Overlap between two reads, in which the overlapping region does not need to be a perfect match; (2) example of correct assembly of a region of the genome that has two repetitive regions (box) using four reads (a to d); (3) assembly generated by the greedy approach, in which reads a and d are assembled first, incorrectly, because of the identification of a better overlap; and (4) discordance between two reads (discordance region) that could extend a contig (bold sequence). Extension of the contig can be terminated at this point to avoid misassemblies.
15.5 Tools for Visualizing NGS Data and Producing a Scaffold
The development of NGS platforms, also called high-throughput sequencing machines, has opened new opportunities for biological applications, including resequencing of genomes, sequencing of the transcriptome, ChIP-seq, and discovery of miRNA [60]. These NGS technologies created a need for new tools to visualize the results of the assemblies and alignments of short reads [61]. Consequently, new challenges arose: a need to process an enormous quantity of reads rapidly and efficiently, a need for high-quality interpretation of data, a user-friendly interface, and the capacity to accept the various file formats produced by different sequencers and assemblers [62]. The sequences generated by the genome assembly process can be visualized, for example, with the software Consed [63], which allows the data to be edited, and Hawkeye [64]. Various programs have been developed for NGS reads, including EagleView [65], Tablet [62], MapView [66], MaqView [31], SAMtools [48,55], and Integrative Genomics Viewer (http://www.broadinstitute.org).
The main differences between the visualizers lie in the interface presented to the user, the data processing speed, the input file formats accepted, and the support for developing the scaffold. Loading NGS data into programs such as Consed and Hawkeye, for example, requires a large amount of memory, which is normally not available to users of desktop computers. EagleView is a visualization tool developed specifically for NGS, but it does not permit visualization of paired reads and it has memory limitations. MapView permits analysis of genetic variation and supports paired-end and single-end reads as well as various input and output file formats [66].
The scaffold is made of DNA sequences that are reconstructed after sequencing. It is composed of contigs, which should be ordered and oriented relative to each other with the help of a reference genome, and of gaps: regions where the DNA sequence is not known, either because it does not exist in the reference genome or because it was not covered by the sequencing or the assembly [67]. Software options for generating genome scaffolds using a reference genome include Bambus [68], from the software package AMOS, which can exchange input and output with the software MUMmer [42]. Genscaff [69] uses contiguous sequences generated by an assembly program without the help of a reference genome, through an implementation of graph theory. CLCBio Workbench (http://www.clcbio.com) and the software package Lasergene (http://www.dnastar.com), besides other functionalities, produce and edit the scaffold; however, they are commercial software packages.
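The structure of a reference-guided scaffold can be sketched in a few lines of Python. This is only an illustration under stated assumptions and does not reproduce the method of Bambus, Genscaff, or any other tool named above: the `Placement` record and its fields are hypothetical and stand in for the output of an aligner, gaps are filled with the letter N (a common convention), and contigs whose placements overlap are simply concatenated.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Placement:
    sequence: str   # contig sequence as produced by the assembler
    ref_start: int  # 0-based start of its alignment on the reference
    reverse: bool   # True if the contig aligned to the reverse strand


def build_scaffold(placements):
    """Order and orient contigs by reference position; pad gaps with N."""
    pieces = []
    cursor = None  # reference coordinate just past the previous contig
    for p in sorted(placements, key=lambda p: p.ref_start):
        seq = reverse_complement(p.sequence) if p.reverse else p.sequence
        if cursor is not None:
            gap = max(p.ref_start - cursor, 0)
            pieces.append("N" * gap)
        pieces.append(seq)
        cursor = p.ref_start + len(seq)
    return "".join(pieces)


# Toy input (hypothetical): two contigs, the second on the reverse strand.
contigs = [Placement("TTGG", 6, True), Placement("ACGT", 0, False)]
scaffold = build_scaffold(contigs)
print(scaffold)
```

The run of N characters marks a gap whose length is estimated from the reference; closing such gaps is the subject of the next section.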
15.6 Closing Gaps
An artifact related to the assembly of a genome is the formation of gaps (holes or spaces). Usually, the strategy used to resolve these gaps is to design specific primers for the region and then align the sequences amplified with these primers, thereby closing the gap. However, for large gaps (2 kb or longer), new primers are needed. This process therefore requires considerable time and becomes expensive [70]. Given this situation, we describe here an in silico strategy to resolve this type of artifact, which consists of using the short reads generated by SOLiD (NGS) that were not mapped during the assembly process.
15.7 Description of the Gap Closing Strategy
Align the short sequences in the flanking regions of the mapped genes. Then, the nucleotides that have a PHRED quality of 20 or more and a minimum of 10× coverage should be added manually (Figure 15.12; steps 1 and 2). This extension will close the small gaps (1–100 bp).
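The acceptance rule for new bases can be written down as a short Python sketch. It is a hedged illustration, not the procedure actually used by the authors, who describe adding the bases manually: the function name and the input layout are hypothetical, and the majority vote used to pick a base within a column is an assumption. Each input item holds the part of one aligned read that overhangs the contig end into the gap, together with its PHRED scores (a score of 20 corresponds to an estimated error probability of 1%).

```python
from collections import Counter


def extend_from_overhangs(overhangs, min_qual=20, min_cov=10):
    """Return the bases that can be appended to a contig end.

    `overhangs` is a list of (bases, quals) pairs, one per read, all
    starting at the first column past the contig end. A column is accepted
    while at least `min_cov` bases with quality >= `min_qual` cover it.
    """
    if min_cov < 1:
        raise ValueError("min_cov must be at least 1")
    extension = []
    column = 0
    while True:
        # Count only the bases in this column that pass the quality filter.
        votes = Counter(
            bases[column]
            for bases, quals in overhangs
            if column < len(bases) and quals[column] >= min_qual
        )
        if sum(votes.values()) < min_cov:
            break  # coverage too low: stop extending
        extension.append(votes.most_common(1)[0][0])
        column += 1
    return "".join(extension)


# Toy data (hypothetical): ten good reads and two with a low-quality last base.
overhangs = [("ACG", [30, 30, 30])] * 10 + [("ACT", [30, 30, 10])] * 2
added = extend_from_overhangs(overhangs)
print(added)
```

In this toy case three bases are added; with only nine of the reads, no column reaches 10× coverage and nothing is appended.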
If gaps remain, the short reads should be realigned against the reference genome (Figure 15.12; step 3), because once the new contigs have been produced, new short reads align in the flanking regions of the gaps, forming what we call a merged contig. In this way, the genome could possibly be closed completely in silico, without using PCR. We emphasize that this strategy was used during the assembly of the genome of a strain of Corynebacterium pseudotuberculosis, which was sequenced with the SOLiD platform, generating 19,091,361 reads (140× coverage). The strategy mapped 590 gap regions and closed 100% of the gaps [71].
FIGURE 15.12
Description of the strategy for closing the gaps: step 1, short reads are aligned to the initial assembly; step 2, short reads that align at the ends of contigs are assembled into new contigs; step 3, the short reads are aligned against the updated sequence, and the process is repeated until the gap is closed.
References
1. Pop, M. 2009. Genome assembly reborn: Recent computational challenges. Briefings in Bioinformatics
10: 354–366.
2. Pop, M., and S.L. Salzberg. 2008. Bioinformatics challenges of new sequencing technology.
Trends in Genetics 24: 142–149.
3. Batzoglou, S., D.B. Jaffe, K. Stanley et al. 2002. ARACHNE: A whole genome shotgun
assembler. Genome Research 12: 177–189.
4. Myers, E.W., G.G. Sutton, A.L. Delcher et al. 2000. A whole-genome assembly of drosophila.
Science 287: 2196–2204.
5. Pevzner, P.A., H. Tang, and M.S. Waterman. 2001. An eulerian path approach to DNA fragment
assembly. Proceedings of the National Academy of Sciences of the United States of America 98.
6. Phillippy, A.M., M.C. Schatz, and M. Pop. 2008. Genome assembly forensics: finding the elusive
misassembly. Genome Biology 9: R55.
7. Needleman, S., and C. Wunsch. 1970. A general method applicable to the search for similarities
in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443–453.
8. Higgins, D., and P. Sharp. 1988. CLUSTAL: a package for performing multiple sequence
alignment on a microcomputer. Gene 73: 237–244.
9. Smith, T., and M. Waterman. 1981. Identification of common molecular subsequences. Journal of
Molecular Biology 147: 195–197.
10. Altschul, S., W. Gish, W. Miller, E. Myers, and D. Lipman. 1990. Basic local alignment search
tool. Journal of Molecular Biology 215: 403–410.
11. Bateman, A., and M. Wood. 2009. Cloud computing. Bioinformatics 25(12): 1474.
12. Zerbino, D. 2009. Genome Assembly and Comparison Using de Bruijn Graphs. PhD thesis,
University of Cambridge.
13. Cole, C.G., O.T. McCann, J.E. Oliver et al. 2008. Finishing the finished human chromosome 22
sequence. Genome Biology 9: R78.
14. Sasson, S.A. 2010. From Millions to One: Theoretical and Concrete Approaches to De novo
Assembly Using Short Read DNA Sequences. PhD thesis, Graduate School-New Brunswick
Rutgers, The State University of New Jersey.
15. Chou, H.H., and M.H. Holmes. 2001. DNA sequence quality trimming and vector removal.
Bioinformatics 17(12): 1093–1104.
16. Ramos, R.T., A.R. Carneiro, J. Baumbach, V. Azevedo, M.P. Schneider, and A. Silva. 2011.
Analysis of quality raw data of second generation sequencers with quality assessment
software. BMC Research Notes 4: 130.
17. Blankenberg, D., A. Gordon, G. Von Kuster, N. Coraor, J. Taylor, and A. Nekrutenko. 2010.
Manipulation of FASTQ data with Galaxy. Bioinformatics 26: 1783–1785.
18. Morgan, M., S. Anders, M. Lawrence, P. Aboyoun, H. Pagès, and R. Gentleman. 2009. ShortRead:
a bioconductor package for input, quality assessment and exploration of high-throughput
sequence data. Bioinformatics 25: 2607–2608.
19. Martínez-Alcántara, A., E. Ballesteros, C. Feng et al. 2009. PIQA: Pipeline for Illumina G1
genome analyzer data quality assessment. Bioinformatics 25: 2438–2439.
20. Li, H., and N. Homer. 2010. A survey of sequence alignment algorithms for next-generation
sequencing. Briefings in Bioinformatics 11: 181–197.
21. Bryant, D.W., W.K. Wong, and T.C. Mockler. 2009. QSRA—a quality-value guided de novo
short read assembler. BMC Bioinformatics 10: 69.
22. Jeck, W., J. Reinhardt, D. Baltrus et al. 2007. Extending assembly of short DNA sequences to
handle error. BMC Bioinformatics 23: 2942–2944.
23. Marioni, J.C., C.E. Mason, S.M. Mane, M. Stephens, and Y. Gilad. 2008. RNA-seq: An
assessment of technical reproducibility and comparison with gene expression arrays. Genome Research
18: 1509–1517.
24. Salmela, L. 2010. Correction of sequencing errors in a mixed set of reads. Bioinformatics 26:
1284–1290.
25. Schroder, J., H. Schroder, S.J. Puglisi, R. Sinha, and B. Schmidt. 2009. SHREC: a short-read error
correction method. Bioinformatics 25: 2157–2163.
26. Smith, A.D., Z. Xuan, and M.Q. Zhang. 2008. Using quality scores and longer reads improves
accuracy of Solexa read mapping. BMC Bioinformatics 9: 128.
27. Chaisson, M.J., and P.A. Pevzner. 2008. Short read fragment assembly of bacterial genomes.
Genome Research 18: 324–330.
28. Miller, J.R., S. Koren, and G. Sutton. 2010. Assembly algorithms for next-generation sequencing
data. Genomics 95: 315–327.
29. Rumble, S.M., P. Lacroute, A.V. Dalca, M. Fiume, A. Sidow, and M. Brudno. 2009. SHRiMP:
Accurate mapping of short color-space reads. PLoS Computational Biology 5: 5.
30. Li, R., C. Yu, Y. Li et al. 2009. SOAP2: an improved ultrafast tool for short read alignment.
Bioinformatics 25: 1966–1967.
31. Li, H., J. Ruan, and R. Durbin. 2008. Mapping short DNA sequencing reads and calling variants
using mapping quality scores. Genome Research 18: 1851–1858.
32. Lin, H., Z. Zhang, M.Q. Zhang, B. Ma, and M. Li. 2008. ZOOM! Zillions of oligos mapped.
Bioinformatics 24: 2431–2437.
33. Butler, J., I. MacCallum, M. Kleber et al. 2008. ALLPATHS: De novo assembly of whole-genome
shotgun microreads. Genome Research 18: 810–820.
34. Chaisson, M., P. Pevzner, and H. Tang. 2004. Fragment assembly with short reads. Bioinformatics
20: 2067–2074.
35. Li, Y., Y. Hu, L. Bolund, and J. Wang. 2010. State of the art de novo assembly of human genomes
from massively parallel sequencing data. Human Genomics 4(4): 271–277.
36. Zerbino, D.R., and E. Birney. 2008. Velvet: Algorithms for de novo short read assembly using de
Bruijn graphs. Genome Research 18: 821–829.
37. Ewing, B., and P. Green. 1998. Base-calling of automated sequencer traces using PHRED. II.
Error probabilities. Genome Research 8(3): 186–194.
38. Flicek, P., and E. Birney. 2009. Sense from sequence reads: Methods for alignment and assembly.
Nature Methods 6: S6–S12.
39. Pevzner, P.A., and H. Tang. 2001. Fragment assembly with double-barreled data. Bioinformatics
17: 225–233.
40. Kingsford, C., M.C. Schatz, and M. Pop. 2010. Assembly complexity of prokaryotic genomes
using short reads. BMC Bioinformatics 11: 21.
41. Ferragina, P., and G. Manzini. 2000. Opportunistic data structures with applications. In
Proceedings of the 41st Symposium on Foundations of Computer Science FOCS 2000. California:
Redondo Beach 390–398.
42. Kurtz, S., A. Phillippy, A.L. Delcher et al. 2004. Versatile and open software for comparing large
genomes. Genome Biology 5: R12.
43. Meek, C., J.M. Patel, and S. Kasetty. 2003. OASIS: An online and accurate technique for
local-alignment searches on biological sequences. In Proceedings of 29th International Conference on
Very Large Data Bases VLDB 2003, Berlin: 910–921.
44. Abouelhoda, M.I., S. Kurtz, and E. Ohlebusch. 2004. Replacing suffix trees with enhanced suffix
arrays. Journal of Discrete Algorithms 2: 53–86.
45. Hoffmann, S., C. Otto, S. Kurtz et al. 2009. Fast mapping of short sequences with mismatches,
insertions and deletions using index structures. PLoS Computational Biology 5: 1–10.
46. Langmead, B., C. Trapnell, M. Pop, and S.L. Salzberg. 2009. Ultrafast and memory-efficient
alignment of short DNA sequences to the human genome. Genome Biology 10: R25.
47. Li, H., and R. Durbin. 2010. Fast and accurate long-read alignment with Burrows–Wheeler
transform. Bioinformatics 26(5): 589–595.
48. Li, H. and R. Durbin. 2009. Fast and accurate short read alignment with Burrows–Wheeler
transform. Bioinformatics 25: 1754–1760.
49. Lam, T.W., W.K. Sung, S.L. Tam, C.K. Wong, and S.M. Yiu. 2008. Compressed indexing and local
alignment of DNA. Bioinformatics 24: 791–797.
50. Hernandez, D., P. François, L. Farinelli, M. Osterås, and J. Schrenzel. 2008. De novo bacterial
genome sequencing: Millions of very short reads assembled on a desktop computer. Genome
Research 18: 802–809.
51. Myers, E.W. 1995. Towards simplifying and accurately formulating fragment assembly. Journal
of Computational Biology 2: 1–21.
52. Huang, X., and S. Yang. 2005. Generating a genome assembly with PCAP. In Current Protocols in
Bioinformatics Unit 11.3.
53. Lemos, M., A. Basílio, and A. Casanova. 2003. Um Estudo dos Algoritmos de Montagem de
Fragmentos de DNA. PUC Rio, Rio de Janeiro.
54. Fleishner, H. 1990. Eulerian Graphs and Related Topics. London: Elsevier Science.
55. Li, H., B. Handsaker, A. Wysoker et al. 2009. 1000 Genome Project Data Processing Subgroup.
The Sequence Alignment/Map format and SAMtools. Bioinformatics 16: 2078–2079.
56. Simpson, J.T., K. Wong, S.D. Jackman, J.E. Schein, S.J.M. Jones, and I. Birol. 2009. ABySS: A
parallel assembler for short read sequence data. Genome Research 19: 1117–1123.
57. Medvedev, P., K. Georgiou, G. Myers, and M. Brudno. 2007. Computability of models for
sequence assembly. In Proceedings of Workshop on Algorithms in Bioinformatics WABI 289–301.
58. Warren, R.L., C.G. Sutton, S.J. Jones, and R.A. Holt. 2007. Assembling millions of short DNA
sequences using SSAKE. Bioinformatics 15: 234.
59. Dohm, J.C., C. Lottaz, T. Borodina, and H. Himmelbauer. 2007. SHARCGS, a fast and highly
accurate short-read assembly algorithm for de novo genomic sequencing. Genome Research 17:
1697–1706.
60. Shendure, J., and H. Ji. 2008. Next-generation DNA sequencing. Nature Biotechnology 26:
1135–1145.
61. Magi, A., M. Benelli, A. Gozzini, F. Girolami, and M.L. Brandi. 2010. Bioinformatics for next
generation sequencing data. Genes 1: 294–307.
62. Milne, I., M. Bayer, L. Cardle et al. 2010. Tablet—Next generation sequence assembly
visualization. Bioinformatics 3: 401–402.
63. Gordon, D., C. Abajian, and P. Green. 1998. Consed: a graphical tool for sequence finishing.
Genome Research 8: 195–202.
64. Schatz, M.C., A.M. Phillippy, B. Shneiderman, and S.L. Salzberg. 2007. Hawkeye: an interactive
visual analytics tool for genome assemblies. Genome Biology 8: R34.
65. Huang, W. and G. Marth. 2008. EagleView: A genome assembly viewer for next-generation
sequencing technologies. Genome Research 9: 1538–1543.
66. Bao, H., H. Guo, J. Wang, R. Zhou, X. Lu, and S. Shi. 2009. MapView: Visualization of short
reads alignment on a desktop computer. Bioinformatics 12: 1554–1555.
67. Schuster, S.C. 2008. Next-generation sequencing transforms today’s biology. Nature Methods 5:
16–18.
68. Pop, M., D.S. Kosack, and S.L. Salzberg. 2004. Hierarchical scaffolding with Bambus. Genome
Research 14: 149–159.
69. Setúbal, J.C., and R. Werneck. 2001. A program for building contig scaffolds in double-barrelled
shotgun genome sequencing. Campinas Instituto de Computação, Unicamp.
70. Tsai, I.J., D.T. Otto, and M. Berriman. 2010. Improving draft assemblies by iterative mapping
and assembly of short reads to eliminate gaps. Genome Biology 11: R41.
71. Silva, A., M.P. Schneider, L. Cerdeira et al. 2011. Complete genome sequence of Corynebacterium
pseudotuberculosis I19, a strain isolated from a cow in Israel with bovine mastitis. Journal of
Bacteriology 193(1): 323–324.