
Improvement of Phylogenies after Removing Divergent and Ambiguously Aligned Blocks from Protein Sequence Alignments

September 2007
Systematic Biology 56(4):564-77


DOI:10.1080/10635150701472164


Authors:

Gerard Talavera

Institut Botànic de Barcelona


Asymmetric (a), intermediate (b), and symmetric (c) trees used in the simulations. The scale bar, in substitutions/position, corresponds to the trees with a divergence ×1.


ClustalW alignment strategies that give rise to the statistically best topologies. When two or more strategies do not show statistical differences in Robinson-Foulds distances, all equivalent strategies are represented. The complete alignment is represented by a black block, and the relaxed and stringent Gblocks strategies by grey and white blocks, respectively.


Average Robinson-Foulds distances to the real tree from the tree calculated with Mafft complete (solid line, solid symbols) and ClustalW complete alignments (solid line, empty symbols). The tree distances obtained with the same alignments after treatment with Gblocks with relaxed conditions (dotted lines) are also shown. Trees were reconstructed by ML (circles), NJ (squares), and parsimony (triangles). The most divergent asymmetric tree was used for the simulations.


Average bootstrap values of NJ trees obtained from ClustalW (a) and Mafft (b) alignments simulated from the asymmetric tree with three different divergence levels. Complete (solid, red line), Gblocks relaxed (dotted, blue line with diamonds), and Gblocks stringent (dotted, green line with squared symbols) alignments are shown.


Average Robinson-Foulds distances from the ClustalW guide tree to the real tree (red line with crossed symbols), from the guide tree to the NJ tree of the Gblocks alignment with relaxed conditions (green line with squared symbols), and from the guide tree to the NJ tree of the complementary positions of the same Gblocks alignment (blue line with diamonds). The asymmetric tree with three different divergence levels was used for the simulations.


Downloaded By: [USYB - Systematic Biology] At: 02:07 18 July 2007

Syst. Biol. 56(4):564–577, 2007

© Society of Systematic Biologists

ISSN: 1063-5157 print / 1076-836X online

DOI: 10.1080/10635150701472164

Improvement of Phylogenies after Removing Divergent and Ambiguously Aligned Blocks from Protein Sequence Alignments

GERARD TALAVERA AND JOSE CASTRESANA

Department of Physiology and Molecular Biodiversity, Institute of Molecular Biology of Barcelona, CSIC, Jordi Girona 18, 08034 Barcelona, Spain;

E-mail: jcvagr@ibmb.csic.es (J.C.)

Abstract.—Alignment quality may have as much impact on phylogenetic reconstruction as the phylogenetic methods used. Not only the alignment algorithm, but also the method used to deal with the most problematic alignment regions, may have a critical effect on the final tree. Although some authors remove such problematic regions, either manually or using automatic methods, in order to improve phylogenetic performance, others prefer to keep such regions to avoid losing any information. Our aim in the present work was to examine whether phylogenetic reconstruction improves after alignment cleaning or not. Using simulated protein alignments with gaps, we tested the relative performance in diverse phylogenetic analyses of the whole alignments versus the alignments with problematic regions removed with our previously developed Gblocks program. We also tested the performance of more or less stringent conditions in the selection of blocks. Alignments constructed with different alignment methods (ClustalW, Mafft, and Probcons) were used to estimate phylogenetic trees by maximum likelihood, neighbor joining, and parsimony. We show that, in most alignment conditions, and for alignments that are not too short, removal of blocks leads to better trees. That is, despite losing some information, there is an increase in the actual phylogenetic signal. Overall, the best trees are obtained by maximum-likelihood reconstruction of alignments cleaned by Gblocks. In general, a relaxed selection of blocks is better for short alignments, whereas a stringent selection is more adequate for longer ones. Finally, we show that cleaned alignments produce better topologies although, paradoxically, with lower bootstrap. This indicates that divergent and problematic alignment regions may lead, when present, to apparently better supported although, in fact, more biased topologies. [Bootstrap support; Gblocks; phylogeny; sequence alignment.]

Methods for the simultaneous generation of multiple alignments and phylogenetic trees are actively being pursued (Fleissner et al., 2005; Lunter et al., 2005; Redelings and Suchard, 2005; Wheeler, 2001), but, at present, common practice of phylogenetic analysis requires, as a first step, the generation of a multiple alignment of the sequences to be analyzed. It has been repeatedly shown that the quality of the alignment may have an enormous impact on the final phylogenetic tree (Kjer, 1995; Morrison and Ellis, 1997; Ogden and Rosenberg, 2006; Smythe et al., 2006; Xia et al., 2003). This is particularly true when sequences compared are very divergent and of different length, which makes necessary the introduction of gaps in the alignments.

Due to the computational requirements of optimal algorithms for multiple sequence alignments, different heuristic strategies have been proposed. The most widely used approach has been the progressive method of alignment (Feng and Doolittle, 1987) that, together with enhancements related to the introduction of gap penalties, was implemented in ClustalW (Thompson et al., 1994). In progressive methods, an initial dendrogram generated from the pairwise comparisons of the sequences is used to recursively build the multiple alignment, using dynamic programming (Needleman and Wunsch, 1970) in the last step. Dynamic programming is an exact algorithm that assures the best possible alignments for given gap penalties but, due to heavy computational requirements, it is only used for pairs of sequences or pairs of clades of the dendrogram and not for the whole multiple alignment. Several other heuristic multiple alignment methods have been recently introduced. They include T-Coffee (Notredame et al., 2000), Mafft (Katoh et al., 2005; Katoh et al., 2002), Muscle (Edgar, 2004), Probcons (Do et al., 2005), and Kalign (Lassmann and Sonnhammer, 2005), among others. All of them are based on the progressive method but include several iterative refinements to construct the final multiple alignment. The latter methods have been shown to outperform purely progressive methods in terms of alignment accuracy and, some of them, even in computational time. However, it has not been shown whether the greater alignment accuracy of more sophisticated methods leads to a significant improvement in phylogenetic reconstruction.
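The pairwise dynamic programming step mentioned above (Needleman and Wunsch, 1970) can be illustrated with a minimal sketch. It assumes a simple match/mismatch score and a linear gap penalty; these values are illustrative only and are not the scoring scheme of ClustalW or of any other program discussed here.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment with a linear gap penalty (illustrative scores)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

The table has (n+1)(m+1) cells for two sequences; extending the same exact recursion to many sequences at once makes the table grow exponentially with the number of sequences, which is why progressive methods apply it only to pairs of sequences or pairs of clades.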

Proteins have some regions that, due to their functional or structural importance, are very well conserved, whereas other regions evolve faster both in terms of nucleotide substitutions and insertions or deletions (Henikoff and Henikoff, 1994; Herrmann et al., 1996; Pesole et al., 1992). That is, evolutionary rate heterogeneity affects whole regions in addition to single positions. This type of regional rate heterogeneity is very challenging for phylogenetic reconstruction, not only in terms of homoplasy due to saturation (Yang, 1998), but also in terms of errors in homology during alignment.

Dealing with regions of problematic alignment is a matter of active debate in phylogenetics. Although some authors consider that it is best to remove such regions before the tree analysis (Castresana, 2000; Grundy and Naylor, 1999; Löytynoja and Milinkovitch, 2001; Rodrigo et al., 1994; Swofford et al., 1996), others think that there is an important loss of information upon removal of any fragment of the sequences already obtained (Aagesen, 2004; Lee, 2001) and that this practice should only be used as a last resort (Gatesy et al., 1993). A third, intermediate option is the recoding of such regions using different strategies (Geiger, 2002; Lutzoni et al., 2000; Young and Healy, 2003), which allows the use of at least part of the information. Although these coded characters are most commonly analyzed with parsimony, it is also possible to use them as independent partitions in Bayesian or likelihood frameworks.

In the present work we test, by using simulated protein alignments with gaps, which are the best alignment strategies for optimal phylogenetic reconstruction. Two preliminary considerations are necessary here. First, simulations of sequences may not cover all the complexity of evolution but have the advantage over real sequences that we know the tree from which they have been generated. There are some alignment sets curated from structural information that can be used to test alignment accuracy (Thompson et al., 2005), but the phylogenetic tree is unknown in these sets, which makes their use for testing phylogenetic accuracy problematic. Second, we have been working with simulated sequences that try to reflect the evolutionary patterns of proteins, and thus many of the conclusions extracted from our work cannot be directly extrapolated to other markers such as rRNA, which show very different evolutionary constraints (Gutell et al., 1994; Kjer, 1995; Xia et al., 2003).

In our analysis we used different alignment strategies of the simulated sequences to test if they make any difference in the final phylogenetic tree. We have selected ClustalW as the currently most used progressive alignment method (Thompson et al., 1994) and Mafft (Katoh et al., 2005) and Probcons (Do et al., 2005) as examples of more recently developed methods that have been shown to obtain very high scores in terms of alignment accuracy (Blackshields et al., 2006; Nuin et al., 2006). Simultaneously with the performance of the alignment programs, we tested whether removing blocks of problematic alignment actually leads to more accurate trees. We used for this purpose our previously developed Gblocks program (Castresana, 2000), which selects blocks following a reproducible set of conditions. Briefly, selected blocks must be free from large segments of contiguous nonconserved positions, and flanking positions must be highly conserved to ensure alignment accuracy. Several parameters can be modified to make the selection of blocks more or less stringent. Phylogenetic trees made by maximum likelihood (ML), neighbor joining (NJ), and parsimony of the reconstructed alignments show that, in almost all conditions tested, and at least for alignments that are not too short, the elimination of problematic regions by Gblocks leads to significantly better phylogenetic trees.

MATERIALS AND METHODS

We simulated protein sequences by means of Rose (Stoye et al., 1998). This program allows the simulation of different substitution rates in different positions with a predetermined spatial pattern. This is a very important feature for testing the behavior of a program like Gblocks, which selects from alignments blocks of contiguous conserved positions with few nonconserved positions inside. This is the reason why a program that simulates among-site rate heterogeneity, but not regional heterogeneity, would not be valid to test the behavior of Gblocks. Thus, an important preliminary step in our simulations was the selection from real proteins of spatial patterns of site rates in order to use these parameters with Rose.

Selection of Evolutionary Rate Patterns

We extracted patterns of rate heterogeneity from real protein alignments using the program TreePuzzle (Strimmer and von Haeseler, 1996) with a model of among-site rate heterogeneity that assumed a Gamma distribution of rates. This distribution was approximated with 16 rate categories, which is the maximum number allowed in TreePuzzle. In particular, we took, from each position, the category and associated relative rate that contributed the most to the likelihood. Positions with rates >1 receive more mutations than the average and positions with rates <1 receive fewer mutations. This list of relative rates (whose average should be 1) was given to Rose to simulate different positions with different rates, creating conserved and divergent regions with lengths and boundaries that approximated those of a real protein. Proteins for extracting rate patterns were NAD2 and NAD4 (subunits 2 and 4 of the mitochondrial NADH dehydrogenase) from several metazoans (Castresana et al., 1998b), and COG0285 from the COG database, which includes mainly bacterial sequences (Tatusov et al., 2003). The three selected profiles produced similar conclusions regarding the best block selection strategy, and we used the NAD2 pattern to perform most of the tests. This pattern contained 361 positions but, after the introduction of further gaps by the simulation algorithm, the final simulated alignments reached approximately 400 positions. In order to simulate alignments of different length, independent simulations obtained with this pattern were concatenated 1, 2, 3, 4, and 8 times to generate final alignments of, approximately, 400, 800, 1200, 1600, and 3200 positions, respectively. The PAM evolutionary model (Dayhoff et al., 1978) was used to simulate the evolution of amino acids.
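The rule of taking, for each position, the rate of the category that contributes most can be sketched as follows. The function names, the three toy categories (instead of 16), and the per-site contribution values are hypothetical; this is not the TreePuzzle output format, only an illustration of the selection rule and of the check that the resulting rates average about 1.

```python
def most_likely_site_rates(category_rates, site_contributions):
    """For each site, return the relative rate of the category with the
    largest contribution (one list of contributions per site)."""
    rates = []
    for contributions in site_contributions:
        best = max(range(len(contributions)), key=lambda k: contributions[k])
        rates.append(category_rates[best])
    return rates


def mean_rate(rates):
    """Average relative rate; expected to be close to 1 for a rate profile."""
    return sum(rates) / len(rates)


# Hypothetical toy input: 3 rate categories and 3 alignment positions.
toy_rates = [0.2, 1.0, 1.8]
toy_contributions = [
    [0.7, 0.2, 0.1],  # conserved position, slowest category dominates
    [0.1, 0.3, 0.6],  # divergent position, fastest category dominates
    [0.2, 0.5, 0.3],  # average position
]
profile = most_likely_site_rates(toy_rates, toy_contributions)
```

Here `profile` is `[0.2, 1.8, 1.0]`: one rate below 1, one above 1, and one at the average, which is the kind of per-position list that the text describes passing to Rose.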

Selection of Phylogenetic Trees

Simulations with Rose were performed along phylogenetic trees of 16 tips with three different topologies: a purely asymmetric tree (Fig. 1a), an intermediate tree (Fig. 1b), and a symmetric tree (Fig. 1c). These known trees or “real trees” were manually constructed. The average and maximum length from the root to the tips was, for the asymmetric tree, 0.89 and 1.30 substitutions/position, respectively. The other trees had very similar values. The branch lengths of each of the three trees in Figure 1 were multiplied by factors of 0.5, 1, and 2, so that we used 9 phylogenetic trees in total. These trees had several short internal branches that made them difficult to resolve; thus, they are trees where the alignment strategy as well as the phylogenetic algorithm used were differentially effective. Simpler trees in terms of longer internodes were easily and equally reproduced by all methods and were not used here. Similarly, trees with a total smaller divergence tended to produce conserved alignments where the alignment method was not an issue and were also not used here. Finally, these trees did not contain many closely related sequences, since we wanted to specifically measure differences in reproducing the overall shape of the tree and not differences in recovering the relationships among close sequences.

FIGURE 1. Asymmetric (a), intermediate (b), and symmetric (c) trees used in the simulations. The scale bar, in substitutions/position, corresponds to the trees with a divergence ×1.
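Multiplying the branch lengths of a tree by a divergence factor can be sketched on a Newick string. The function name and the four-tip example tree are hypothetical (the trees of Figure 1 have 16 tips and are not reproduced here), and the sketch assumes plain decimal branch lengths written after a colon.

```python
import re

# Matches a branch length such as ":0.05" or ":1e-3" in a Newick string.
_BRANCH = re.compile(r":(\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def scale_newick(newick, factor):
    """Return the Newick string with every branch length multiplied by factor.

    Lengths are written with up to 6 significant digits ('g' format), which
    hides floating-point noise but would truncate longer values.
    """
    def repl(match):
        return ":" + format(float(match.group(1)) * factor, "g")
    return _BRANCH.sub(repl, newick)


# Hypothetical toy tree, scaled by the three divergence factors of the text.
base_tree = "((A:0.1,B:0.2):0.05,C:0.3);"
scaled = {factor: scale_newick(base_tree, factor) for factor in (0.5, 1, 2)}
```

Applying the three factors to each of three topologies gives the 3 × 3 = 9 trees described above; the topology itself is unchanged, only the amount of divergence.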

Gaps Introduced during the Simulations

The Rose program does not have any specific model for the introduction of gaps along the alignment. Rather, gaps are introduced with equal probability in all positions with a relative rate ≥1 (Stoye et al., 1998), which is a limitation of this program. To try to overcome this limitation, we used two different gap strategies within Rose. First, we used a single gap threshold for the whole alignment. After several trials, we considered a threshold of 0.0007 as a reasonable one for the divergence levels we analyzed, as deduced from visual inspection of the alignment (that is, checking by eye that blocks of divergence and conservation were not so different from the real proteins used to construct the rate profiles). Even so, this threshold tended to produce too many gaps in conserved regions (not shown). Second, we generated alignments with two different gap thresholds, 0.001 and 0.0001, which we associated, respectively, with divergent and conserved regions of the profiles. To do so, we divided the rate profiles into blocks of homogeneous divergence (that is, each block was either mostly conserved or mostly divergent, which resulted in around 10 to 20 blocks for the different profiles). Then, we did the simulations for each block separately, and with its own gap threshold (high for divergent blocks and low for more conserved blocks). Finally, the different simulated blocks were concatenated. The phylogenetic results were similar with both gap strategies, but we mostly worked with simulations that had the two different gap thresholds, which we considered more realistic. In all cases we chose a vector of indels of the form [0.5, 0.4, 0.3, 0.2, 0.1], which reflects the relative frequency of indels with lengths from 1 to 5 amino acids, respectively.
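One way to read the indel vector is as a set of relative weights for lengths 1 to 5. The sketch below normalizes them into probabilities and draws lengths accordingly; how Rose uses the vector internally is an assumption here, and the function names are hypothetical.

```python
import random

# Relative frequencies of indels of length 1..5, as given in the text.
INDEL_WEIGHTS = [0.5, 0.4, 0.3, 0.2, 0.1]


def indel_length_probabilities(weights):
    """Normalize relative weights into a {length: probability} mapping."""
    total = sum(weights)
    return {length: w / total for length, w in enumerate(weights, start=1)}


def sample_indel_length(weights, rng):
    """Draw one indel length (1-based) in proportion to the weights."""
    lengths = list(range(1, len(weights) + 1))
    return rng.choices(lengths, weights=weights, k=1)[0]


probabilities = indel_length_probabilities(INDEL_WEIGHTS)
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
draws = [sample_indel_length(INDEL_WEIGHTS, rng) for _ in range(1000)]
```

Under this reading the weights sum to 1.5, so a single-residue indel has probability 0.5/1.5 = 1/3 and a five-residue indel 0.1/1.5 = 1/15.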

Realignments of Simulated Sequences

Alignments generated by Rose were stripped of gaps and new alignments were reconstructed using ClustalW version 1.83 (Thompson et al., 1994), Mafft version 5.531 (Katoh et al., 2002, 2005), and Probcons version 1.1 (Do et al., 2005). Default parameters were used in ClustalW and Probcons. All defaults were also used in Mafft except that a neighbor joining instead of a UPGMA tree was used as guide tree (option –nj). Problematic alignment blocks were removed from the alignments using Gblocks 0.91 (Castresana, 2000), for which two different parameter sets were used. In one of them, which we call here stringent selection, and which is the default one in Gblocks 0.91, “Minimum Number of Sequences for a Conserved Position” was 9, “Minimum Number of Sequences for a Flank Position” was 13, “Maximum Number of Contiguous Nonconserved Positions” was 8, “Minimum Length of a Block” was 10, and “Allowed Gap Positions” was “None”. In the second set, which we call relaxed selection, we changed “Minimum Number of Sequences for a Flank Position” to 9, “Maximum Number of Contiguous Nonconserved Positions” to 10, “Minimum Length of a Block” to 5, and “Allowed Gap Positions” to “With Half”. The latter option allows the selection of positions with gaps when they are present in less than half of the sequences.
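The two parameter sets, and the gap rule that differs most between them, can be written down as a small sketch. This is not the Gblocks implementation: only the “Allowed Gap Positions” rule is modeled, the conservation, flank, and block-length rules are omitted, and the dictionary keys and function name are hypothetical.

```python
# Parameter values as stated in the text for alignments of 16 sequences.
STRINGENT = {"min_seqs_conserved": 9, "min_seqs_flank": 13,
             "max_contiguous_nonconserved": 8, "min_block_length": 10,
             "allowed_gaps": "none"}
RELAXED = {"min_seqs_conserved": 9, "min_seqs_flank": 9,
           "max_contiguous_nonconserved": 10, "min_block_length": 5,
           "allowed_gaps": "half"}


def columns_passing_gap_rule(alignment, allowed_gaps):
    """Return indices of columns that pass the gap rule alone.

    'none': a column with any gap is rejected.
    'half': a column is accepted if fewer than half of the sequences
            have a gap in it.
    """
    if allowed_gaps not in ("none", "half"):
        raise ValueError("allowed_gaps must be 'none' or 'half'")
    n_seqs = len(alignment)
    passing = []
    for col in range(len(alignment[0])):
        gaps = sum(1 for seq in alignment if seq[col] == "-")
        ok = gaps == 0 if allowed_gaps == "none" else gaps < n_seqs / 2
        if ok:
            passing.append(col)
    return passing


# Hypothetical toy alignment of four sequences.
toy = ["AC-T", "A--T", "ACGT", "ACGT"]
```

With the toy alignment, the stringent rule keeps columns 0 and 3, while the relaxed rule also keeps column 1, where only one of four sequences has a gap; column 2 is rejected by both because exactly half of the sequences have a gap there.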

Original simulated alignments and Mafft realignments for 30 example simulations (the first five simulations generated with the symmetric and asymmetric trees) are provided as supplementary information (available online at http://systematicbiology.org).

Phylogenetic Reconstruction

Phylogenetic trees from the complete and the two different Gblocks alignments were estimated by ML, NJ, and parsimony. For ML trees we used the Phyml program version 2.4.4 (Guindon and Gascuel, 2003), with the Jones-Taylor-Thornton model of protein evolution (Jones et al., 1992) and four rate categories in the Gamma distribution. The Gamma distribution parameter and the proportion of invariable sites were estimated by the program. For NJ trees we used Protdist of the Phylip package version 3.63 (Felsenstein, 1989) with the Jones-Taylor-Thornton model to calculate pairwise protein distances, and Neighbor of the same package to calculate the NJ tree. For parsimony we used Protpars of the Phylip package (Felsenstein, 1989) with 50 random initializations to ensure a thorough tree search. If no parsimony tree was obtained, which occurred in less than 1% of the simulations, the corresponding simulation was totally excluded from the analysis. When several equally parsimonious trees were found, only the first one was used. We did not estimate Bayesian trees because of the enormous computational time required to run a sufficient number of generations for all the simulations performed.

For each alignment length, alignment strategy, and phylogenetic method, 300 simulations were run in a grid of 24 processors. The symmetric difference or Robinson-Foulds (Robinson and Foulds, 1981) topological distance from the calculated tree to the real tree was obtained using Vanilla 1.2 (Drummond and Strimmer, 2001), and the average over all simulations was calculated. This program reports half the number of total discordant clades between two trees. For bootstrap analyses, 100 bootstraps were calculated. Due to heavy computational requirements of the bootstrap analyses, the number of simulations was reduced to 150. We checked that a higher number of bootstraps and simulations did not improve the accuracy of the bootstrap results. Bootstrap values were separately calculated for right and wrong partitions of the tree with the help of Bioperl functions (Stajich et al., 2002). Statistical differences among Robinson-Foulds distances in different alignment conditions were detected by the Tukey-Kramer test with an alpha level of 0.05 using the JMP package version 5.1 (SAS Institute, Cary, NC).
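The symmetric-difference distance can be sketched by comparing the sets of nontrivial bipartitions of two trees. This is an illustrative reimplementation, not the Vanilla program: trees are represented as nested tuples of tip names (a hypothetical input format), both trees are assumed to have the same tips, and they are compared as unrooted topologies.

```python
def tip_set(tree):
    """All tip names below a node; a tip is a string, a clade is a tuple."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(tip_set(child) for child in tree))


def bipartitions(tree):
    """Nontrivial splits of the tree, each stored as the side that does
    not contain a fixed reference tip, so equal splits compare equal."""
    all_tips = tip_set(tree)
    reference = min(all_tips)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_tips - clade if reference in clade else clade
        # Skip trivial splits (empty side, single tip, or all but one tip).
        if 1 < len(side) < len(all_tips) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def robinson_foulds(tree1, tree2):
    """Number of splits present in exactly one of the two trees."""
    return len(bipartitions(tree1) ^ bipartitions(tree2))


def half_robinson_foulds(tree1, tree2):
    """Half the symmetric difference, the convention described in the text."""
    return robinson_foulds(tree1, tree2) / 2


# Hypothetical five-tip trees that differ in the placement of B and C.
real_tree = ((("A", "B"), "C"), ("D", "E"))
estimated_tree = ((("A", "C"), "B"), ("D", "E"))
```

For the two toy trees, one split is unique to each tree, so the symmetric difference is 2 and the halved value is 1.0; identical topologies give 0.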

RESULTS AND DISCUSSION

General Alignment Strategy: Complete versus Gblocks Alignments

The differences in alignments produced by different methods can be appreciated in Figure 2. A fragment of the alignment of simulated sequences (Fig. 2a) was stripped of gaps and realigned by ClustalW (Fig. 2b), Mafft (Fig. 2c), and Probcons (Fig. 2d). As has been noted before (Higgins et al., 2005), ClustalW tends to produce more compact alignments. That is, ClustalW generates many divergent regions that are almost devoid of gaps, resulting in a relatively simple alignment (Higgins et al., 2005). This can be clearly appreciated in the most problematic region in the center of this alignment (Fig. 2b). Although Mafft also tends to make alignments more compact than the real ones (Fig. 2c), the deviation from the real situation is not as large as with ClustalW, at least with default gap penalties. Probcons produces the least compact alignments of the three programs tested (Fig. 2d). For example, simulations from asymmetric trees with divergence ×1, which had an average original length of 1097 positions, were compacted to an average of 966 positions by Probcons, to 904 positions by Mafft, and to 862 positions by ClustalW (Table 1). Similar relative degrees of compression were obtained in other types of simulations.

TABLE 1. Average number of positions of the complete alignments (total length) and the average percentage of positions selected by Gblocks with relaxed and stringent conditions. Simulation of sequences was done following the asymmetric tree and the heterogeneity pattern of the NAD2 protein concatenated two times.

Divergence   Method     Total length   % Gblocks relaxed   % Gblocks stringent
×0.5         ClustalW   826.6          79.4                54.3
×0.5         Mafft      852.5          74.2                51.6
×0.5         Probcons   871.8          70.3                50.9
×1           ClustalW   862.4          64.2                42.0
×1           Mafft      903.7          59.0                39.8
×1           Probcons   966.4          51.8                37.6
×2           ClustalW   901.8          46.4                30.2
×2           Mafft      961.7          42.9                28.4
×2           Probcons   1117.9         34.7                24.5
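The percentages in Table 1 can be turned into approximate numbers of retained positions by multiplying each average total length by the corresponding percentage. This is a rough worked example: the product of two averages only approximates the average number of positions kept per simulation, and the function name is hypothetical.

```python
# Table 1 values: (average total length, % kept relaxed, % kept stringent),
# keyed by (alignment method, divergence factor).
TABLE1 = {
    ("ClustalW", 0.5): (826.6, 79.4, 54.3),
    ("ClustalW", 1): (862.4, 64.2, 42.0),
    ("ClustalW", 2): (901.8, 46.4, 30.2),
    ("Mafft", 0.5): (852.5, 74.2, 51.6),
    ("Mafft", 1): (903.7, 59.0, 39.8),
    ("Mafft", 2): (961.7, 42.9, 28.4),
    ("Probcons", 0.5): (871.8, 70.3, 50.9),
    ("Probcons", 1): (966.4, 51.8, 37.6),
    ("Probcons", 2): (1117.9, 34.7, 24.5),
}


def approx_positions_kept(method, divergence):
    """Approximate positions kept under (relaxed, stringent) conditions."""
    total, relaxed_pct, stringent_pct = TABLE1[(method, divergence)]
    return total * relaxed_pct / 100, total * stringent_pct / 100


relaxed_kept, stringent_kept = approx_positions_kept("ClustalW", 1)
```

For ClustalW at divergence ×1 this gives roughly 554 positions under relaxed conditions and roughly 362 under stringent conditions, out of an average of about 862.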

Gblocks removes problematic regions of a multiple alignment according to a number of rules. First, blocks selected for inclusion must be free from a large number of contiguous nonconserved positions, must be flanked by highly conserved positions, and must have a minimum length, as controlled by the corresponding parameters (see Materials and Methods). In addition, positions with gaps can be removed either always or only when more than half of the sequences contain gaps (Castresana, 2000). The latter parameter has a large influence on the total number of selected positions. We have used Gblocks on simulated realigned sequences with two different conditions. The condition that we call stringent does not allow any gap position. The relaxed condition allows gap positions if they are present in less than half of the sequences, and it is also less restrictive in the other parameters (see Materials and Methods). The effect of the two different parameter sets of Gblocks selection can be appreciated in Figure 2, for ClustalW (Fig. 2b), Mafft (Fig. 2c), and Probcons alignments (Fig. 2d). In all three cases, the relaxed parameters (grey blocks) allow the selection of more positions than the stringent parameters (white blocks). Table 1 shows the average number of positions of the complete alignments and the percentage of positions left after treatment with Gblocks with the two different parameter sets. Values in this table are for the asymmetric tree, but similar values were found for other trees.

In order to infer which type of alignment algorithm (ClustalW, Mafft, or Probcons) and which treatment of the resulting alignment (no treatment or Gblocks treatment with stringent or relaxed conditions) was best for phylogenetic analysis, we calculated phylogenetic trees from all these alignments, and measured the topological distance with respect to the real tree. Figure 3 shows, for the simulations with the asymmetric tree, the average topological distances to the real tree from the trees generated with ClustalW alignments, with and without the use of Gblocks. In addition, the distance to the tree obtained from the Gblocks complementary alignment (that is, the alignment resulting after concatenation of all the blocks rejected by Gblocks) is also shown.
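The relationship between a cleaned alignment and its complementary alignment can be sketched as a split of the columns into selected and rejected positions. The function name and the toy data are hypothetical, and the list of selected columns is assumed to come from a block-selection step that is not shown.

```python
def split_alignment(alignment, selected_columns):
    """Split an alignment into the selected columns and their complement.

    Both outputs keep the original column order, so the complement is the
    concatenation of all rejected positions.
    """
    selected = set(selected_columns)
    n_cols = len(alignment[0])
    kept = ["".join(seq[i] for i in range(n_cols) if i in selected)
            for seq in alignment]
    complement = ["".join(seq[i] for i in range(n_cols) if i not in selected)
                  for seq in alignment]
    return kept, complement


# Hypothetical toy alignment in which columns 0 and 3 were selected.
kept, complement = split_alignment(["ACGT", "A-GT"], [0, 3])
```

Every column ends up in exactly one of the two outputs, so a tree built from the complement uses only the positions that the cleaned alignment discards.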


[Figure 2: fragment of the alignment of simulated sequences (a) and its realignments by ClustalW (b), Mafft (c), and Probcons (d). The sequence panels are not recoverable from the extracted text.]

-DCARSGKVQQYFSAQY

MSA--VIIYSLIPQCLQVKFTSCIDY KSLICSPAACGEPGTCYADKTWFFQFKLTA---GLEGNAGSDFPQVDPAN EGYKSERFTVQW-KYKARDRATIQHHWSVKTYRSQSK

-DCTRSGKVQQYFSAQYMIG--GVIYSLIPQCLQVKFTSCINFKSFICPPAACAENLPERCQFWFFD-----T---GEGGGAGSDFPQVDPANDGYKAERFTVQW-HYKPRDRAAISHHWSAKSLRKNSL

-DCTRSGKVQQYFSAQYLGG--GVVYSLIPQCHQVKFTSKIDYKSLICAPAACG---VDFPANCQTWFFGGGG---TLSGGAGSDFPQVDPANDGYKAERFTVQW-KYQAKN

RASINHHWSAKSYRKKSP

-DCTRSGKVQQYFTAQYMSQ--GKICSLIPDCLKVKFTSCLDYKSFNVSAAACGDPGTCYAARAWFFQFKLSV---GLDGNAGSAYEQASPANEGYKSERFTVQW-KYKARDRATIQHHWSVKVYRRRTT

-DCTREGRVEQYFSANYRSS--GILYSLILVCLQVKFTACINFKSFSCSPASCGTPSLCYADKNWFYQFKLS-----VEGNGGSNFPQVDPANDGYKTDRFTVQW-VYKARD RASI

KHHWSVDTYREGSC

EDCLRSGKVQQYFSAQYLD---GVGVSLIPQCLQVEFTSRIDFKSFV CHPAECGLS-----TPA-GCAQWA-------------EAGGAGSDFPQVDVANSGYKAERFTVQWQ-YKTRNRATIDHHRSAKSLPKKSL

DDCTRSGKVKQYFGAQYAAM--GVIYSLIPQCLQVKITSRIDYKNFICAQKACAKP-----GIPEF-------G--S---A--GRASGAESDFGQVDPANKGYKTD RFTVQWQ-YRGRGRADIKYHWHACSYQQISA

EDCTRSGKVQQYFSAQYMST--GIIC

SLIPQCLQVKFTSCIDYKTFICSPAACGPP-----GTCYADKVWFFHFKLS---N--GLDGSAGSDFPQVDPANEGYKSERFTVQWK-YRARDRANIQHHWSVKTYRSQSK

GDCTRAGKVQEYFSAQYLAI--GKAYA LIPQCLQVKFTSRIDYKDFICSPGACGAP-----ANCYYNVVWVHQFKLD---A--GGSVNAGSDFPRVDPANGGFKKKRFTVQWK-YGARDRVAIEHHWSAKTFRQRSG

NDCTRSGKVQQYFSAQYIGN--AVRTSLIPLCLQVNFTSRSDFKVF

ACAPAECGDVGLTLPAPR-ACHVWH F----AEGTA--HAAANAGTDFPQIEGANKGYKAERFTVQWK-Y-VQSRARIVHHWSARTLRKRSL

NDCLRSGKVQVYFSAQY ANS--GVKAALIPEALQVKFTSFIDFKSFVCSPAQCGVS-----LPA-GVGPWYNAILFP---E--GATGGAGSDFPQVEPANNGYKAERFGVQWA-YLTRNRATINHHWSARVLPKKSF

EDCTRSGQVQQYFSAQYKAA--GVVYSLIQQCLQVKFTSRVDYKSFICSPNACGQP-----ARAYYGKTFK----LS---A--GVD

GNAGSEFLQIDPANDGYKSERFTVQWK-YRARDRATINHHWSVKTYRGQSK

DECTRSGKVQQFFSPQY ITSFFGPIYSIIPQCLQVNFTARIDFKTFVCSKGACGLV-----APV-TCKEWFF----T---G--GLKGGAGSDYAQVDPANGGYKAERFTVQWPEIKARSRATIDHHWSAKAYHKKSL

DDCLRSGKVQQYFSAQYMGN--GVKASLIPQCLQVKFTSKIDFTSFICVPTECGIS-----LPA-DCAAWFF----P---D--VDRG

GAGSDFPQVDPGNDGYKAE HFTVQWK-YKARNRTTINHHWSAKTLRKKSL

DDCTRSGRVQQYFSAQYLSG--GIIYSLIPKCLQVKFTSCIDYKSFICSPAACADS-----PACYADATWFFQFKLS---D--GVPGNAGSDFPQVDPANEGYKSERFTVQWK-YKAPDRA TINHHWSVKTYRAES T

DDCLRSGNRQQYFTAVY GNL--GVPTSLIPNCLQVKFTSVIQFSTFIYAPPKCPQD-----TPG-GASTF-------------SMHVSADSGYSQVEGENH

GLKMGHFDVQW--YRPRARAVIDHHWSALQNRSFFG

EDCARSGKVQQYFSAQYMSA--VIIYSLIPQCLQVKFTSCIDYKSLICSPAACGEP-----GTCYADKTWFFQFKLT---A--GLE GNAGSDFPQVDPANEGYKSERFTVQWK-YKARDRATIQHHWSVKTYRSQSK

DCTRSGKVQQYFSAQYMIG--GVIYSLIPQCLQVKFTSCINFKSFICPPAACAEN-----LPE-RCQFWFF----D---T--GEGGGAGSDFPQVDPAND GYKAERFTVQWH-YKPRDRAAISHHWSAKSLRKNSL

DDCTRSGKVQQYFSAQYLGG--GVVYSLIPQCHQVKFTSKIDYKSLICAPAACGVD-----FPA-NCQTWFF----G---GGGTLSGGAGSDFPQVDPANDGYKAERFTVQWK-YQAKNRASINHHWSAKSYRKKSP

SDCTRSGKVQQYFTAQYMSQ--GKICSLIPDCLKVKFTSCLD YKSFNVSAAACGDP-----GTCYAARAWFF

QFKLS---V--GLDGNAGSAYEQASPANE GYKSERFTVQWK-YKARDRATIQHHWSVKVYRRRTT

DDCTREGRVEQYFSANYRSS--GILYSLILVCLQVKFTACINFKSFSCSPASCGTP-----SLCYADKNWF YQF--K---L--SVEGNGGSNFPQVDPANDGYKTD RFTVQWV-YKARDRASIKHHWSVDTYREGSC

FIGURE 2. Fragment of a simulated alignment (a) and the realignment of the same sequences (after gap removal) by ClustalW (b), Mafft

(c), and Probcons (d). The simulation corresponds to an asymmetric tree with divergence ×1. The blocks below each alignment represent the

fragments selected by Gblocks with relaxed conditions (grey blocks) and with stringent conditions (white blocks). Positions of the alignments

where more than 50% of the sequences are identical are shown with black boxes.
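The 50% identity rule used for the black boxes can be written down compactly. The following Python sketch is only illustrative: the function name `majority_identical_columns`, the toy rows, and the choice to skip gap characters when counting residues are assumptions, not details taken from the figure or from Gblocks.

```python
from collections import Counter


def majority_identical_columns(rows, threshold=0.5):
    """Return 0-based column indices where a single residue is shared by
    more than `threshold` of all sequences (gaps are never counted as residues)."""
    if not rows:
        return []
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned rows must have the same length")

    selected = []
    for col in range(length):
        residues = [row[col] for row in rows if row[col] != "-"]
        if not residues:
            continue  # column made only of gaps
        top_count = Counter(residues).most_common(1)[0][1]
        # Strictly more than the threshold fraction of ALL sequences.
        if top_count > threshold * len(rows):
            selected.append(col)
    return selected


# Hypothetical four-sequence toy alignment.
toy = ["DDCTR", "SDCTR", "DDC-R", "EDCLR"]
print(majority_identical_columns(toy))
```

In the toy example, columns where the most common residue appears in exactly half of the sequences are not selected, because the rule requires more than 50%.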

FIGURE 3. Average Robinson-Foulds distances to the real tree from the tree calculated with ClustalW complete alignments (solid, red line with crossed symbols), the same alignments after treatment with Gblocks relaxed (dotted, blue line with diamonds) and stringent (dotted, green line with squared symbols) conditions, and the complementary alignments of the Gblocks relaxed alignment (solid, orange line with triangles). The asymmetric tree with three different divergence levels was used for the simulations with different alignment lengths. Trees were reconstructed by ML, NJ, and parsimony.

Figure 4 represents for each tree (and for two lengths, 800 and 3200 amino acids, representative of single-gene and concatenated-gene phylogenies) the best alignment strategies after statistically comparing the average topological distances by means of the Tukey-Kramer test. An overview of these two figures shows that, when the alignments are cleaned by Gblocks with either of the two parameter sets used (dotted lines in Figure 3), the topological distance to the real tree decreases with respect to the complete alignment (solid, red line) in almost all divergences and alignment lengths tested, and with the three tree reconstruction methods used: ML, NJ, and parsimony. The improvement in topological accuracy upon Gblocks treatment is more noticeable for the highest divergences (×2). This is expected since there are more problematic blocks in these alignments, as shown by the lower percentage of positions selected by Gblocks (Table 1). In addition, the improvement from Gblocks treatment is particularly large for NJ and parsimony. These two methods produce quite poor topologies when using the complete alignments but, upon using Gblocks, particularly with the most stringent conditions (green line, squared symbols), there is a substantial gain in topological accuracy. ML produces the overall best trees (see also below) although, in the lowest divergence (×0.5), there is almost no difference in topological quality between the Gblocks and the complete alignments. In fact, for short genes (400 to 800 amino acids) the complete alignment gives rise to better trees than the Gblocks alignments, although there is no statistical difference between the complete alignment and the Gblocks alignment with relaxed parameters (Fig. 4).
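The topological distance used in these comparisons, the Robinson-Foulds distance, counts the bipartitions (splits) present in one tree but not in the other. The sketch below is a minimal illustration, not the software used for the simulations: trees are represented as nested Python tuples of taxon names, the function names are hypothetical, and the distance is reported unnormalized, which may differ from the scaling used in the figures.

```python
def leaves(tree):
    """Set of taxon names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Non-trivial splits of an unrooted tree given as nested tuples.

    Each split is stored as the side that does NOT contain a fixed
    anchor taxon, so the same split always has the same representation.
    """
    all_taxa = leaves(tree)
    anchor = min(all_taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_taxa - clade if anchor in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(all_taxa) - 1:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("both trees must contain the same taxa")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Hypothetical five-taxon example: the second tree misplaces taxon C.
real_tree = ((("A", "B"), "C"), ("D", "E"))
estimated = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(real_tree, estimated))
```

Averaging this value over many simulated replicates gives a curve of the kind plotted against alignment length, with smaller values meaning trees closer to the real one.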


FIGURE 4. ClustalW alignment strategies that give rise to the statistically best topologies. When two or more strategies do not show statistical differences in Robinson-Foulds distances, all equivalent strategies are represented. The complete alignment is represented by a black block, and the relaxed and stringent Gblocks strategies by grey and white blocks, respectively.

The example above thus shows that the removal of divergent and problematic regions of an alignment is, in principle, beneficial for phylogenetic analyses of relatively divergent sequences. It is true, as previously argued (Aagesen, 2004; Lee, 2001), that there is some phylogenetic information in the blocks removed by methods like Gblocks. This can be appreciated in Figure 3, which shows the topological distances to the real trees from the trees obtained with the blocks excluded by Gblocks (complementary alignment; solid, orange line). These distances, although very large, become quite reduced for long alignments, indicating that trees obtained from the complementary regions are not random; that is, there is some phylogenetic information in the regions rejected by Gblocks. However, what seems to matter is not the total phylogenetic signal but the signal-to-noise ratio. Despite the relatively simple simulations performed, regions excluded by Gblocks seem to add more noise than signal, thus lowering the quality of the trees from the complete alignments with respect to the Gblocks-cleaned alignments.
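The complementary alignment mentioned above is simply the set of columns left out by the block selection. The following Python sketch shows one way to derive it, assuming the selected blocks are available as half-open `(start, end)` column intervals. That representation and the function names are hypothetical and do not correspond to the Gblocks output format.

```python
def complement_blocks(selected, length):
    """Column intervals NOT covered by `selected`.

    `selected` holds half-open (start, end) intervals; `length` is the
    total number of alignment columns.
    """
    excluded = []
    cursor = 0
    for start, end in sorted(selected):
        if start > cursor:
            excluded.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < length:
        excluded.append((cursor, length))
    return excluded


def extract_columns(rows, blocks):
    """Concatenate the given column intervals for every aligned row."""
    return ["".join(row[start:end] for start, end in blocks) for row in rows]


# Hypothetical toy alignment of 10 columns with two selected blocks.
rows = ["ABCDEFGHIJ", "abcdefghij"]
selected = [(0, 3), (6, 8)]
excluded = complement_blocks(selected, length=10)

print(extract_columns(rows, selected))  # cleaned alignment
print(extract_columns(rows, excluded))  # complementary alignment
```

Building a tree from each of the two outputs and comparing both to the real tree is the kind of comparison summarized by the Gblocks and complementary curves.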

Similar conclusions about the beneficial effect of Gblocks can be drawn from Mafft alignments of the same asymmetric trees (Figs. 5 and 6). In this case, Gblocks is not an advantage over the complete alignment in the two most conserved alignments (×0.5 and ×1) when using the ML method although, again, Gblocks relaxed and the complete alignments are not statistically different. The picture for Probcons (Fig. 1 of the online Appendix, available at http://systematicbiology.org) is similar to that for Mafft. Figure 2 of the online Appendix shows a comparison of the three alignment programs with default gap costs, using the trees produced after Gblocks cleaning with relaxed conditions. Under the conditions of these simulations, ClustalW is slightly worse, regarding the trees produced, than the two other programs. The performances of Mafft and Probcons are very similar, and only for NJ and parsimony do Probcons alignments work slightly better. Probcons, however, is highly demanding in computational time. Thus, for the rest of the tests we only compared the performances of ClustalW and Mafft.


FIGURE 5. Average Robinson-Foulds distances to the real tree from the tree calculated with Mafft complete alignments (solid, red line with crossed symbols), the same alignments after treatment with Gblocks relaxed (dotted, blue line with diamonds) and stringent (dotted, green line with squared symbols) conditions, and the complementary alignments of the Gblocks relaxed alignment (solid, orange line with triangles). The asymmetric tree with three different divergence levels was used for the simulations with different alignment lengths. Trees were reconstructed by ML, NJ, and parsimony.

The results for the symmetric and intermediate trees of both alignment algorithms are shown in the corresponding columns of Figures 4 and 6 for the ClustalW and Mafft methods, respectively (and in Figures 3 to 6 in the online Appendix for all alignment lengths). Two results are noteworthy from these analyses. First, differences in phylogenetic performance between different alignments derived from symmetric trees are quantitatively smaller, in agreement with previous work (Ogden and Rosenberg, 2006). See, for example, the similarity of the three graphs of ML trees of ClustalW alignments (Fig. 3 in the online Appendix). Second, in these trees there are two conditions where the Gblocks alignments produce ML trees that are statistically worse than the complete alignments: the symmetric and intermediate trees of divergence ×1 with Mafft alignments of 800 amino acids (Fig. 6). These are the only two conditions where we observed this. However, we do not think that this justifies not using Gblocks in these types of trees, even if we could know the shape of the tree in advance. In real alignments, evolution must be much more complex than what we simulated. For example, we did not simulate biased amino acid compositions (Castresana et al., 1998a) or different models of evolution in different parts of trees (Philippe and Laurent, 1998), all of which will have stronger biasing effects in nonconserved blocks. Because the difference in topological accuracy between the Gblocks and the complete alignments is very small in these two conditions, it is very likely that the addition of any of these effects in the simulations would have made both the Gblocks relaxed and complete alignments of at least equal performance.

FIGURE 6. Mafft alignment strategies that give rise to the statistically best topologies. When two or more strategies do not show statistical differences in Robinson-Foulds distances, all equivalent strategies are represented. The complete alignment is represented by a black block, and the relaxed and stringent Gblocks strategies by grey and white blocks, respectively.

All simulations shown so far were performed following a pattern of rate variation of the NAD2 protein. To test the influence of different rate patterns, we used in the simulations profiles derived from two other proteins (NAD4 and COG0285). From the Mafft alignments of these simulations we calculated the corresponding ML trees (Fig. 7 in the online Appendix). Different patterns (and thus different percentages of block selection) gave rise to different performances of the complete and the Gblocks alignments, but the results were similar in relative terms. We also tested the performance of a different gap model, in which gaps were introduced homogeneously along the alignment, instead of using two different gap thresholds in different regions of the alignments (see Materials and Methods). The results were again similar with the simpler gap strategy, as shown for the ML reconstruction of the asymmetric trees (Fig. 8 of the online Appendix).

Phylogenetic Methods Used

The data shown above indicate that ML is the phylogenetic method that best extracts reliable information from problematic alignment regions, since trees derived from complete alignments are relatively good. This contrasts with the trees obtained by NJ and parsimony, which are quite poor from the complete alignments, indicating that they greatly benefited from the use of Gblocks. ML is also the method that produces the overall best trees, in agreement with previous simulation analyses (see references in Felsenstein, 2004). To show this, Figure 7 presents the superimposed graphs for the most divergent asymmetric tree as an example. The better performance of ML in all alignment conditions is clearly appreciated in this graph.

FIGURE 7. Average Robinson-Foulds distances to the real tree from the tree calculated with Mafft complete (solid line, solid symbols) and ClustalW complete alignments (solid line, empty symbols). The tree distances obtained with the same alignments after treatment with Gblocks with relaxed conditions (dotted lines) are also shown. Trees were reconstructed by ML (circles), NJ (squares), and parsimony (triangles). The most divergent asymmetric tree was used for the simulations.

Short versus Long Alignments

Alignment length turned out to be a very important factor to be taken into account when deciding the best alignment cleaning strategy. Figures 3 and 5 show that, in general, for shorter alignments the best Gblocks condition is the relaxed one, whereas for longer alignments the stringent condition tends to work better. This can also be appreciated by comparing the slopes of the graphs corresponding to the complete alignments, and those of the Gblocks alignments with relaxed and stringent conditions. The slope downwards (towards better trees) is less pronounced for the complete alignments and more pronounced for Gblocks with stringent conditions. This means that for single genes (400 to 800 amino acids) the gain in signal-to-noise ratio after elimination of problematic blocks may not compensate for the total loss of information. However, for longer alignments, for example, those used in phylogenomic studies where several genes are concatenated (Delsuc et al., 2005; Jeffroy et al., 2006), there is enough total information so that selecting the best pieces with Gblocks using the stringent conditions allows one to get closer to the real tree. This basic tendency is observed under all simulation conditions we tested.

Bootstrap Support in Trees Obtained from Gblocks Alignments

Previous performance tests of Gblocks with real data showed that Gblocks alignments obtained less support in ML analysis, because the number of trees not significantly different from the ML tree was smaller in the complete alignment than in the Gblocks alignment (Castresana, 2000). Later, in numerous studies in our group and in other groups, the same effect was observed using bootstrap values of NJ trees, which were lower in the Gblocks alignments. Our simulations reproduced the same behavior again. In NJ trees obtained from 100 bootstrap samples, the average bootstrap support of all partitions was higher for the complete alignments, and lower for Gblocks alignments (Fig. 8). However, the same simulations (see topological distances of NJ trees in Figures 3 and 5) showed that the best trees were obtained with Gblocks conditions and the worst topologies with the complete alignments, thus following the opposite direction, regarding quality, to the bootstrap values, at least for the maximum divergence. A similar trend was found for NJ trees of simulations with symmetric trees (Fig. 9 of the online Appendix) and for bootstrapped ML trees (Fig. 10 of the online Appendix). One may think that the bootstraps of Gblocks trees are lower due to the smaller length of the Gblocks alignments, but it is still very paradoxical that the best topology is associated with a lower bootstrap.

FIGURE 8. Average bootstrap values of NJ trees obtained from ClustalW (a) and Mafft (b) alignments simulated from the asymmetric tree with three different divergence levels. Complete (solid, red line), Gblocks relaxed (dotted, blue line with diamonds), and Gblocks stringent (dotted, green line with squared symbols) alignments are shown.

FIGURE 9. Average Robinson-Foulds distances from the ClustalW guide tree to the real tree (red line with crossed symbols), from the guide tree to the NJ tree of the Gblocks alignment with relaxed conditions (green line with squared symbols), and from the guide tree to the NJ tree of the complementary positions of the same Gblocks alignment (blue line with diamonds). The asymmetric tree with three different divergence levels was used for the simulations.

The explanation for this contradictory behavior of Gblocks may be that divergent and problematic alignment regions are biased towards an erroneous topology (Lake, 1991). This could happen if the initial guide tree used in the progressive alignment methods very strongly drives the alignment in the divergent and most gappy regions, where alignment programs may easily create similarity at the expense of homology (Higgins et al., 2005). In addition, when alignment software is faced with an ambiguous alignment decision, the algorithmic solution makes consistent but arbitrary decisions that bias the support indices. That is, these repeated alignment decisions will increase the bootstrap support, and this bias will be stronger in the most divergent regions, where there is more uncertainty. Three results are consistent with this possibility. Firstly, we have observed in our simulations that the initial guide dendrogram used by ClustalW is indeed very different from the real tree, as measured by the Robinson-Foulds distance between both trees (Fig. 9). If all divergent regions tend to easily reproduce this initial dendrogram, we would expect the guide tree to be more similar to the tree obtained from the Gblocks excluded regions than to the tree obtained from the Gblocks alignment. Figure 9 shows that this is the case, particularly in the most divergent simulations. Secondly, we see that the effect of increased bootstrap support in the complete alignment with respect to the Gblocks alignments is higher in ClustalW, which highly depends on the initial dendrogram, than in Mafft (Fig. 8). For example, in simulations of 400 amino acids and at ×2 divergence, there is an increase from 60% to 76% bootstrap support in ClustalW when comparing the Gblocks stringent and complete alignments, and only from 60% to 70% in Mafft. In the latter method, the successive iterations of the alignment algorithm may make the final alignment more independent from the initial crude dendrogram, which would explain why trees generated from these alignments are slightly less biased. And thirdly, when we calculated separately the bootstraps of right and wrong partitions for each tree we observed, apart from lower values for wrong partitions, a slightly higher bias in them (Fig. 11 of the online Appendix). The bias is also present in the right partitions, probably because some of the recurrent software decisions in the divergent regions are actually correct. Thus, the bias coming from divergent regions seems to increase the bootstrap of all partitions, although the effect is slightly larger in the wrong ones. All this indicates that bootstrap support cannot be used as a measure of reliability of the tree topology when divergent regions are present in the alignment.
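The separation of bootstrap support into right and wrong partitions can be illustrated with a small Python sketch. It assumes that each tree has already been reduced to a set of splits (here `frozenset` objects of taxon names) and that the real tree is known, as in a simulation. The function names and the toy data are hypothetical and are not taken from the analyses reported here.

```python
def split_support(estimated_splits, replicate_split_sets):
    """Bootstrap support (%) of each split of the estimated tree:
    the share of replicate trees that contain the same split."""
    if not replicate_split_sets:
        raise ValueError("at least one bootstrap replicate is required")
    total = len(replicate_split_sets)
    return {
        split: 100.0 * sum(split in rep for rep in replicate_split_sets) / total
        for split in estimated_splits
    }


def mean(values):
    """Arithmetic mean, or None for an empty collection."""
    values = list(values)
    return sum(values) / len(values) if values else None


def summarize_support(estimated_splits, replicate_split_sets, true_splits):
    """Average support over all, right, and wrong partitions."""
    support = split_support(estimated_splits, replicate_split_sets)
    right = [v for s, v in support.items() if s in true_splits]
    wrong = [v for s, v in support.items() if s not in true_splits]
    return {"all": mean(support.values()), "right": mean(right), "wrong": mean(wrong)}


# Hypothetical five-taxon example with four bootstrap replicates.
fs = frozenset
true_splits = {fs("CDE"), fs("DE")}
estimated = {fs("BDE"), fs("DE")}  # one right and one wrong partition
replicates = [estimated, estimated, true_splits, true_splits]
print(summarize_support(estimated, replicates, true_splits))
```

Running the same summary on complete and on cleaned alignments would show whether the extra support contributed by divergent regions falls mainly on right or on wrong partitions.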

CONCLUSIONS

We have shown, under the conditions of these simulations, that the information contained in divergent and ambiguously aligned regions of multiple alignments is, in general, not beneficial for phylogenetic reconstruction. Thus, using Gblocks or a similar method for removing problematic blocks seems to be justified for phylogenetic analysis, particularly for divergent alignments. In this work, we have used simulations of moderately divergent and very heterogeneous proteins, which are typically used in deep phylogenies (e.g., bacterial groups, eukaryotic lineages, metazoan phyla). However, we do not know how removal of blocks would affect more conserved and less heterogeneous alignments. We have also not tested how a finer tuning of parameters of alignment programs and Gblocks may improve the phylogenies. Although we have only used protein alignments, the same conclusions are expected to apply to protein-coding DNA alignments of similar divergence. On the other hand, although we predict that the general conclusion holds that ambiguously aligned regions in any data set are best excluded when they provide more noise than signal, rRNA alignments as well as alignments from noncoding DNA have very different features from coding alignments, and our simulations were not specifically designed to explore the properties of these kinds of sequences. However, our purpose in this work is not to give strict rules about the best alignment strategy and associated parameters. Rather, our simulations are mainly informative about general tendencies. Thus, in the following we summarize important tendencies observed in our simulations and give some general rules regarding the best alignment strategy that can be applied to real situations of protein alignments.

NJ and parsimony seem to be unable to extract useful phylogenetic information from the problematic alignment regions, because the complete alignments are always much worse than the Gblocks treated alignments, so using Gblocks seems particularly advisable for these methods. Most probably, these two methods are not able to take into account the multiple substitutions that occur in these excessively saturated blocks. On the other hand, ML, less affected by saturation, is able to extract some information from these blocks, since in some conditions the complete alignments are similar or even better than the Gblocks alignments. However, the misidentified homology that may occur in these regions affects all phylogenetic methods, which may explain why using Gblocks is more beneficial at high divergences for all methods.

Regarding the use of stringent or relaxed conditions for Gblocks, two important rules can be extracted from our analysis. First, for ML trees relaxed conditions of Gblocks seem to give rise to better trees, whereas for NJ and parsimony stringent conditions are better. Second, alignment length is a crucial parameter to be taken into account. For short alignments, such as in studies of single short genes, the removal of blocks by Gblocks may leave too few positions, so in these cases it may be better to use very relaxed conditions of Gblocks. In the shortest alignments, which have very little information, use of Gblocks may even be detrimental. At any rate, one should be aware that with this type of short alignment it is only possible to obtain a very approximate topology, possibly quite distant from the real tree. For phylogenomic studies, where there is enough information from the concatenation of several genes (Jeffroy et al., 2006), the use of Gblocks with stringent conditions tends to give rise to the best phylogenetic trees.

ACKNOWLEDGMENTS

This work was supported financially by a research grant in bioinformatics from the Fundación BBVA (Spain), and grant number BIO2002-04426-C02-02 from the Plan Nacional de Investigación Científica, Desarrollo e Innovación Tecnológica (I+D+I) of the MEC, cofinanced with FEDER funds. We thank V. Soria-Carrasco for useful technical assistance, and three anonymous reviewers, K. Kjer, and R.D.M. Page for critical comments that helped improve the manuscript.


REFERENCES

Aagesen, L. 2004. The information content of an ambiguously alignable region, a case study of the trnL intron from the Rhamnaceae. Organ. Divers. Evol. 4:35–49.

Blackshields, G., I. M. Wallace, M. Larkin, and D. G. Higgins. 2006. Analysis and comparison of benchmarks for multiple sequence alignment. In Silico Biol. 6:321–339.

Castresana, J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17:540–552.

Castresana, J., G. Feldmaier-Fuchs, and S. Pääbo. 1998a. Codon reassignment and amino acid composition in hemichordate mitochondria. Proc. Natl. Acad. Sci. USA 95:3703–3707.

Castresana, J., G. Feldmaier-Fuchs, S. Yokobori, N. Satoh, and S. Pääbo. 1998b. The mitochondrial genome of the hemichordate Balanoglossus carnosus and the evolution of deuterostome mitochondria. Genetics 150:1115–1123.

Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. A model of evolutionary change in proteins. Pages 345–352 in Atlas of protein sequence structure (M. O. Dayhoff, ed.) National Biomedical Research Foundation, Washington, D.C.

Delsuc, F., H. Brinkmann, and H. Philippe. 2005. Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6:361–375.

Do, C. B., M. S. Mahabhashyam, M. Brudno, and S. Batzoglou. 2005. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 15:330–340.

Drummond, A., and K. Strimmer. 2001. PAL: An object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics 17:662–663.

Edgar, R. C. 2004. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.

Felsenstein, J. 1989. PHYLIP – Phylogeny inference package (version 3.4). Cladistics 5:164–166.

Felsenstein, J. 2004. Inferring phylogenies. Sinauer Associates, Sunderland, Massachusetts.

Feng, D. F., and R. F. Doolittle. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25:351–360.

Fleissner, R., D. Metzler, and A. von Haeseler. 2005. Simultaneous statistical multiple alignment and phylogeny reconstruction. Syst. Biol. 54:548–561.

Gatesy, J., R. DeSalle, and W. Wheeler. 1993. Alignment-ambiguous nucleotide sites and the exclusion of systematic data. Mol. Phylogenet. Evol. 2:152–157.

Geiger, D. L. 2002. Stretch coding and block coding: Two new strategies to represent questionably aligned DNA sequences. J. Mol. Evol. 54:191–199.

Grundy, W. N., and G. J. Naylor. 1999. Phylogenetic inference from conserved sites alignments. J. Exp. Zool. 285:128–139.

Guindon, S., and O. Gascuel. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696–704.

Gutell, R. R., N. Larsen, and C. R. Woese. 1994. Lessons from an evolving rRNA: 16S and 23S rRNA structures from a comparative perspective. Microbiol. Rev. 58:10–26.

Henikoff, S., and J. G. Henikoff. 1994. Protein family classification based on searching a database of blocks. Genomics 19:97–107.

Herrmann, G., A. Schon, R. Brack-Werner, and T. Werner. 1996. CONRAD: A method for identification of variable and conserved regions within proteins by scale-space filtering. Comput. Appl. Biosci. 12:197–203.

Higgins, D. G., G. Blackshields, and I. M. Wallace. 2005. Mind the gaps: Progress in progressive alignment. Proc. Natl. Acad. Sci. USA 102:10411–10412.

Jeffroy, O., H. Brinkmann, F. Delsuc, and H. Philippe. 2006. Phylogenomics: The beginning of incongruence? Trends Genet. 22:225–231.

Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275–282.

Katoh, K., K. Kuma, H. Toh, and T. Miyata. 2005. MAFFT version 5: Improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33:511–518.

Katoh, K., K. Misawa, K. Kuma, and T. Miyata. 2002. MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059–3066.

Kjer, K. M. 1995. Use of rRNA secondary structure in phylogenetic studies to identify homologous positions: an example of alignment and data presentation from the frogs. Mol. Phylogenet. Evol. 4:314–330.

Lake, J. A. 1991. The order of sequence alignment can bias the selection of tree topology. Mol. Biol. Evol. 8:378–385.

Lassmann, T., and E. L. Sonnhammer. 2005. Kalign – An accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6:298.

Lee, M. S. 2001. Unalignable sequences and molecular evolution. Trends Ecol. Evol. 16:681–685.

Löytynoja, A., and M. C. Milinkovitch. 2001. SOAP, cleaning multiple alignments from unstable blocks. Bioinformatics 17:573–574.

Lunter, G., I. Miklos, A. Drummond, J. L. Jensen, and J. Hein. 2005. Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics 6:83.

Lutzoni, F., P. Wagner, V. Reeb, and S. Zoller. 2000. Integrating ambiguously aligned regions of DNA sequences in phylogenetic analyses without violating positional homology. Syst. Biol. 49:628–651.

Morrison, D. A., and J. T. Ellis. 1997. Effects of nucleotide sequence alignment on phylogeny estimation: A case study of 18S rDNAs of apicomplexa. Mol. Biol. Evol. 14:428–441.

Needleman, S. B., and C. D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443–453.

Notredame, C., D. G. Higgins, and J. Heringa. 2000. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302:205–217.

Nuin, P. A., Z. Wang, and E. R. Tillier. 2006. The accuracy of several multiple sequence alignment programs for proteins. BMC Bioinformatics 7:471.

Ogden, T. H., and M. S. Rosenberg. 2006. Multiple sequence alignment accuracy and phylogenetic inference. Syst. Biol. 55:314–328.

Pesole, G., M. Attimonelli, G. Preparata, and C. Saccone. 1992. A statistical method for detecting regions with different evolutionary dynamics in multialigned sequences. Mol. Phylogenet. Evol. 1:91–96.

Philippe, H., and J. Laurent. 1998. How good are deep phylogenetic trees? Curr. Opin. Genet. Dev. 8:616–623.

Redelings, B. D., and M. A. Suchard. 2005. Joint Bayesian estimation of alignment and phylogeny. Syst. Biol. 54:401–418.

Robinson, D. F., and L. R. Foulds. 1981. Comparison of phylogenetic trees. Math. Biosci. 53:131–147.

Rodrigo, A. G., P. R. Bergquist, and P. L. Bergquist. 1994. Inadequate support for an evolutionary link between the Metazoa and the Fungi. Syst. Biol. 43:578–584.

Smythe, A. B., M. J. Sanderson, and S. A. Nadler. 2006. Nematode small subunit phylogeny correlates with alignment parameters. Syst. Biol. 55:972–992.

Stajich, J. E., et al. 2002. The Bioperl toolkit: Perl modules for the life

sciences. Genome Res. 12:1611–1618.

Stoye, J., D. Evers, and F. Meyer. 1998. Rose: Generating sequence fam-

ilies. Bioinformatics 14:157–163.

Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: A quar-

tet maximum-likelihood method for reconstructing tree topologies.

Mol. Biol. Evol. 13:964–969.

Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phy-

logenetic inference. Pages 407–514 in Molecular systematics (D. M.

Hillis, C. Moritz, and B. K. Mable, eds.). Sinauer Associates, Sunder-

land, Massachusetts.

Tatusov, R. L., N. D. Fedorova, J. D. Jackson, A. R. Jacobs, B. Kiryutin,

E. V. Koonin, D. M. Krylov, R. Mazumder, S. L. Mekhedov, A. N.

Nikolskaya, B. S. Rao, S. Smirnov, A. V. Sverdlov, S. Vasudevan, Y. I.

Wolf, J. J. Yin, and D. A. Natale. 2003. The COG database: an updated

version includes eukaryotes. BMC Bioinformatics 4:41.

Downloaded By: [USYB - Systematic Biology] At: 02:07 18 July 2007

2007 TALAVERA AND CASTRESANA—IMPROVEMENT OF PHYLOGENIES AFTER REMOVING BLOCKS 577

Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL

W: Improving the sensitivity of progressive multiple sequence

alignment through sequence weighting, position-speciﬁc gap penal-

ties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.

Thompson, J. D., P. Koehl, R. Ripp, and O. Poch. 2005. BAliBASE 3.0:

Latest developments of the multiple sequence alignment benchmark.

Proteins 61:127–136.

Wheeler, W. 2001. Homology and the optimization of DNA sequence

data. Cladistics 17:S3–S11.

Xia, X., Z. Xie, and K. M. Kjer. 2003. 18S ribosomal RNA and tetrapod

phylogeny. Syst. Biol. 52:283–295.

Yang, Z. 1998. On the best evolutionary rate for phylogenetic analysis.

Syst. Biol. 47:125–133.

Young, N. D., and J. Healy. 2003. GapCoder automates the use

of indel characters in phylogenetic analysis. BMC Bioinformatics

4:6.

First submitted 7 February 2007; reviews returned 6 March 2007; final acceptance 24 March 2007

Associate Editor: Karl Kjer

Editors: Rod Page and Jack Sullivan
