Downloaded By: [USYB - Systematic Biology] At: 02:07 18 July 2007
Syst. Biol. 56(4):564–577, 2007
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150701472164
Improvement of Phylogenies after Removing Divergent and Ambiguously Aligned
Blocks from Protein Sequence Alignments
GERARD TALAVERA AND JOSE CASTRESANA
Department of Physiology and Molecular Biodiversity, Institute of Molecular Biology of Barcelona, CSIC, Jordi Girona 18, 08034 Barcelona, Spain;
E-mail: jcvagr@ibmb.csic.es (J.C.)
Abstract.—Alignment quality may have as much impact on phylogenetic reconstruction as the phylogenetic methods used.
Not only the alignment algorithm, but also the method used to deal with the most problematic alignment regions, may
have a critical effect on the final tree. Although some authors remove such problematic regions, either manually or using
automatic methods, in order to improve phylogenetic performance, others prefer to keep such regions to avoid losing any
information. Our aim in the present work was to examine whether phylogenetic reconstruction improves after alignment
cleaning or not. Using simulated protein alignments with gaps, we tested the relative performance in diverse phylogenetic
analyses of the whole alignments versus the alignments with problematic regions removed with our previously developed
Gblocks program. We also tested the performance of more or less stringent conditions in the selection of blocks. Alignments
constructed with different alignment methods (ClustalW, Mafft, and Probcons) were used to estimate phylogenetic trees by
maximum likelihood, neighbor joining, and parsimony. We show that, in most alignment conditions, and for alignments
that are not too short, removal of blocks leads to better trees. That is, despite losing some information, there is an increase
in the actual phylogenetic signal. Overall, the best trees are obtained by maximum-likelihood reconstruction of alignments
cleaned by Gblocks. In general, a relaxed selection of blocks is better for short alignments, whereas a stringent selection is more
adequate for longer ones. Finally, we show that cleaned alignments produce better topologies although, paradoxically, with
lower bootstrap. This indicates that divergent and problematic alignment regions may lead, when present, to apparently
better supported although, in fact, more biased topologies. [Bootstrap support; Gblocks; phylogeny; sequence alignment.]
Methods for the simultaneous generation of multiple
alignments and phylogenetic trees are actively being pur-
sued (Fleissner et al., 2005; Lunter et al., 2005; Redelings
and Suchard, 2005; Wheeler, 2001), but, at present, com-
mon practice of phylogenetic analysis requires, as a first
step, the generation of a multiple alignment of the se-
quences to be analyzed. It has been repeatedly shown
that the quality of the alignment may have an enor-
mous impact on the final phylogenetic tree (Kjer, 1995;
Morrison and Ellis, 1997; Ogden and Rosenberg, 2006;
Smythe et al., 2006; Xia et al., 2003). This is particularly
true when sequences compared are very divergent and
of different length, which makes necessary the introduc-
tion of gaps in the alignments.
Due to the computational requirements of optimal
algorithms for multiple sequence alignments, different
heuristic strategies have been proposed. The most widely
used approach has been the progressive method of align-
ment (Feng and Doolittle, 1987) that, together with en-
hancements related to the introduction of gap penalties,
was implemented in ClustalW (Thompson et al., 1994).
In progressive methods, an initial dendrogram gener-
ated from the pairwise comparisons of the sequences is
used to recursively build the multiple alignment, using
dynamic programming (Needleman and Wunsch, 1970)
in the last step. Dynamic programming is an exact algo-
rithm that assures the best possible alignments for given
gap penalties but, due to heavy computational require-
ments, it is only used for pairs of sequences or pairs of
clades of the dendrogram and not for the whole multi-
ple alignment. Several other heuristic multiple alignment
methods have been recently introduced. They include
T-Coffee (Notredame et al., 2000), Mafft (Katoh et al.,
2005; Katoh et al., 2002), Muscle (Edgar, 2004), Probcons
(Do et al., 2005), and Kalign (Lassmann and Sonnham-
mer, 2005), among others. All of them are based on the
progressive method but include several iterative refine-
ments to construct the final multiple alignment. The
latter methods have been shown to outperform purely
progressive methods in terms of alignment accuracy and,
some of them, even in computational time. However, it
has not been shown whether the greater alignment accu-
racy of more sophisticated methods leads to a significant
improvement in phylogenetic reconstruction.
Proteins have some regions that, due to their func-
tional or structural importance, are very well con-
served, whereas other regions evolve faster both in terms
of nucleotide substitutions and insertions or deletions
(Henikoff and Henikoff, 1994; Herrmann et al., 1996;
Pesole et al., 1992). That is, evolutionary rate heterogeneity
affects whole regions in addition to single positions.
This type of regional rate heterogeneity is very challeng-
ing for phylogenetic reconstruction, not only in terms of
homoplasy due to saturation (Yang, 1998), but also in
terms of errors in homology during alignment.
Dealing with regions of problematic alignment is a
matter of active debate in phylogenetics. Although some
authors consider that it is best to remove such regions
before the tree analysis (Castresana, 2000; Grundy and
Naylor, 1999; Löytynoja and Milinkovitch, 2001; Rodrigo
et al., 1994; Swofford et al., 1996), others think that there
is an important loss of information upon removal of any
fragment of the sequences already obtained (Aagesen,
2004; Lee, 2001) and that this practice should only be
used as the last resort (Gatesy et al., 1993). A third,
intermediate option, is the recoding of such regions us-
ing different strategies (Geiger, 2002; Lutzoni et al., 2000;
Young and Healy, 2003), which allows the use of at least
part of the information. Although these coded charac-
ters are most commonly analyzed with parsimony, it is
also possible to use them as independent partitions in
Bayesian or likelihood frameworks.
In the present work we test, by using simulated pro-
tein alignments with gaps, which are the best alignment
strategies for optimal phylogenetic reconstruction. Two
preliminary considerations are necessary here. First, sim-
ulations of sequences may not cover all the complexity
of evolution but have the advantage over real sequences
that we know the tree from which they have been gener-
ated. There are some alignment sets curated from struc-
tural information that can be used to test alignment
accuracy (Thompson et al., 2005), but the phylogenetic
tree is unknown in these sets, which makes it problematic
to use them for testing phylogenetic accuracy. Second,
we have been working with simulated sequences that
try to reflect the evolutionary patterns of proteins, and
thus many of the conclusions extracted from our work
cannot be directly extrapolated to other markers such
as rRNA, which show very different evolutionary con-
straints (Gutell et al., 1994; Kjer, 1995; Xia et al., 2003).
In our analysis we used different alignment strategies
of the simulated sequences to test if they make any dif-
ference in the final phylogenetic tree. We have selected
ClustalW as the currently most used progressive align-
ment method (Thompson et al., 1994) and Mafft (Katoh
et al., 2005) and Probcons (Do et al., 2005) as examples of
more recently developed methods that have been shown
to obtain very high scores in terms of alignment accuracy
(Blackshields et al., 2006; Nuin et al., 2006). Simultane-
ously with the performance of the alignment programs,
we tested whether removing blocks of problematic align-
ment actually leads to more accurate trees. We used for
this purpose our previously developed Gblocks program
(Castresana, 2000), which selects blocks following a re-
producible set of conditions. Briefly, selected blocks must
be free from large segments of contiguous nonconserved
positions, and flanking positions must be highly con-
served to ensure alignment accuracy. Several parameters
can be modified to make the selection of blocks more
or less stringent. Phylogenetic trees made by maximum
likelihood (ML), neighbor joining (NJ), and parsimony
of the reconstructed alignments show that, in almost all
conditions tested, and at least for alignments that are
not too short, the elimination of problematic regions by
Gblocks leads to significantly better phylogenetic trees.
MATERIALS AND METHODS
We simulated protein sequences by means of Rose
(Stoye et al., 1998). This program allows the simula-
tion of different substitution rates in different positions
with a predetermined spatial pattern. This is a very im-
portant feature for testing the behavior of a program
like Gblocks, which selects from alignments blocks of
contiguous conserved positions with few nonconserved
positions inside. This is the reason why a program that
simulates among-site rate heterogeneity, but not regional
heterogeneity, would not be valid to test the behavior
of Gblocks. Thus, an important preliminary step in our
simulations was the selection from real proteins of spa-
tial patterns of site rates in order to use these parameters
with Rose.
Selection of Evolutionary Rate Patterns
We extracted patterns of rate heterogeneity from
real protein alignments using the program TreePuzzle
(Strimmer and von Haeseler, 1996) with a model of
among-site rate heterogeneity that assumed a Gamma
distribution of rates. This distribution was approximated
with 16 rate categories, which is the maximum number
allowed in TreePuzzle. In particular, we took, from each
position, the category and associated relative rate that
contributed the most to the likelihood. Positions with
rates >1 receive more mutations than the average and positions
with rates <1 receive fewer mutations. This list of
relative rates (whose average should be 1) was given to
Rose to simulate different positions with different rates,
creating conserved and divergent regions with lengths
and boundaries that approximated those of a real pro-
tein. Proteins for extracting rate patterns were NAD2 and
NAD4 (subunits 2 and 4 of the mitochondrial NADH de-
hydrogenase) from several metazoans (Castresana et al.,
1998b), and COG0285 from the COG database, which in-
cludes mainly bacterial sequences (Tatusov et al., 2003).
The three selected profiles produced similar conclusions
regarding the best block selection strategy, and we used
the NAD2 pattern to perform most of the tests. This
pattern contained 361 positions but, after the introduc-
tion of further gaps by the simulation algorithm, the
final simulated alignments reached approximately 400
positions. In order to simulate alignments of different
length, independent simulations obtained with this pat-
tern were concatenated 1, 2, 3, 4, and 8 times to generate
final alignments of, approximately, 400, 800, 1200, 1600,
and 3200 positions, respectively. The PAM evolutionary
model (Dayhoff et al., 1978) was used to simulate the
evolution of amino acids.
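The choice of one relative rate per position described above can be sketched as follows. This is only an illustration under stated assumptions, not TreePuzzle or Rose code: the function names and the toy input are hypothetical, the per-site category probabilities are assumed to have been parsed from the program output already, and the rescaling step is an assumption added so that the average rate is exactly 1.

```python
def most_likely_rates(site_category_probs, category_rates):
    """For each site, return the relative rate of the category that
    contributes the most (here: has the largest probability) at that site."""
    rates = []
    for probs in site_category_probs:
        best = max(range(len(probs)), key=lambda k: probs[k])
        rates.append(category_rates[best])
    return rates


def rescale_to_mean_one(rates):
    """Rescale the rates so that their average is exactly 1 (assumed step)."""
    mean = sum(rates) / len(rates)
    return [r / mean for r in rates]


# Hypothetical toy input: 3 rate categories (the paper used 16) and 4 sites.
category_rates = [0.2, 1.0, 2.5]
site_category_probs = [
    [0.7, 0.2, 0.1],  # conserved site
    [0.1, 0.3, 0.6],  # divergent site
    [0.2, 0.5, 0.3],  # average site
    [0.8, 0.1, 0.1],  # conserved site
]

raw_pattern = most_likely_rates(site_category_probs, category_rates)
pattern = rescale_to_mean_one(raw_pattern)
```

Because the rates are kept in alignment order, runs of low and high values in `pattern` preserve the regional structure (conserved and divergent stretches) that a purely among-site model would lose.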
Selection of Phylogenetic Trees
Simulations with Rose were performed along phylo-
genetic trees of 16 tips with three different topologies,
a purely asymmetric tree (Fig. 1a), an intermediate tree
(Fig. 1b), and a symmetric tree (Fig. 1c). These known
trees or “real trees” were manually constructed. The av-
erage and maximum length from the root to the tips
was, for the asymmetric tree, 0.89 and 1.30 substitu-
tions/position, respectively. The other trees had very
similar values. The branch lengths of the three trees in
Figure 1 were multiplied by factors of 0.5, 1, and 2, re-
spectively, so that we used in total 9 phylogenetic trees.
These trees had several short internal branches that made
them difficult to resolve; thus, they are trees where the
alignment strategy as well as the phylogenetic algorithm
used were differentially effective. Simpler trees in terms
of longer internodes were easily and equally reproduced
by all methods and were not used here. Similarly, trees
with a total smaller divergence tended to produce con-
served alignments where the alignment method was not
an issue and were also not used here. Finally, these trees did
FIGURE 1. Asymmetric (a), intermediate (b), and symmetric (c) trees used in the simulations. The scale bar, in substitutions/position,
corresponds to the trees with a divergence ×1.
not contain many closely related sequences, since we
wanted to specifically measure differences in reproduc-
ing the overall shape of the tree and not differences in
recovering the relationships among close sequences.
Gaps Introduced during the Simulations
The Rose program does not have any specific model
for the introduction of gaps along the alignment. Rather,
gaps are introduced with equal probability in all posi-
tions with a relative rate 1 (Stoye et al., 1998), which
is a limitation of this program. To try to overcome this
limitation, we used two different gap strategies within
Rose. First, we used a single gap threshold for the whole
alignment. After several trials, we considered a thresh-
old of 0.0007 as a reasonable one for the divergence
levels we analyzed, as deduced from visual inspection
of the alignment (that is, eyeing that blocks of diver-
gence and conservation were not so different from the
real proteins used to construct the rate profiles). Even so,
this threshold tended to produce too many gaps in con-
served regions (not shown). In addition, we also gener-
ated alignments with two different gap thresholds, 0.001
and 0.0001, which we associated, respectively, to diver-
gent and to conserved regions of the profiles. For doing
so, we divided the rate profiles in blocks of homoge-
neous divergence (that is, each block was either mostly
conserved or mostly divergent, which resulted in around
10 to 20 blocks for the different profiles). Then, we did
the simulations for each block separately, and with its
own gap threshold (high for divergent blocks and low for
more conserved blocks). Finally, the different simulated
blocks were concatenated. The phylogenetic results were
similar with both gap strategies, but we mostly worked
with simulations that had the two different gap thresh-
olds, which we considered more realistic. In all cases we
chose a vector of indels of the form [0.5, 0.4, 0.3, 0.2,
0.1], which reflects the relative frequency of indels with
lengths from 1 to 5 amino acids, respectively.
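As a small worked example, the indel vector can be read as a set of relative weights for indel lengths 1 to 5. The sketch below normalizes those weights into probabilities and draws indel lengths from them. It is an assumption about how such a vector can be interpreted, not a reproduction of the Rose implementation, and the function names are hypothetical.

```python
import random

# Relative frequencies of indels of length 1..5, as given in the text.
INDEL_VECTOR = [0.5, 0.4, 0.3, 0.2, 0.1]


def indel_length_probabilities(vector):
    """Normalize relative frequencies so that they sum to 1."""
    total = sum(vector)
    return [w / total for w in vector]


def draw_indel_length(vector, rng):
    """Draw one indel length (1-based) with probability proportional to its weight."""
    lengths = list(range(1, len(vector) + 1))
    return rng.choices(lengths, weights=vector, k=1)[0]


probs = indel_length_probabilities(INDEL_VECTOR)
# Expected indel length under these weights: 3.5 / 1.5, about 2.33 residues.
mean_length = sum(n * p for n, p in zip(range(1, 6), probs))

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
sample = [draw_indel_length(INDEL_VECTOR, rng) for _ in range(10)]
```

Under this reading, one-residue indels are five times as frequent as five-residue indels, and no indel longer than 5 amino acids is ever introduced in a single event.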
Realignments of Simulated Sequences
Alignments generated by Rose were cleaned from gaps
and new alignments were reconstructed using ClustalW
version 1.83 (Thompson et al., 1994), Mafft version 5.531
(Katoh et al., 2002, 2005), and Probcons version 1.1 (Do
et al., 2005). Default parameters were used in ClustalW
and Probcons. All defaults were also used in Mafft ex-
cept that a neighbor joining instead of a UPGMA tree was
used as guide tree (option –nj). Alignments were cleaned
from problematic alignment blocks using Gblocks 0.91
(Castresana, 2000), for which two different parameter
sets were used. In one of them, which we call here strin-
gent selection, and which is the default one in Gblocks
0.91, “Minimum Number of Sequences for a Conserved
Position” was 9, “Minimum Number of Sequences for a
Flank Position” was 13, “Maximum Number of Contigu-
ous Nonconserved Positions” was 8, “Minimum Length
of a Block” was 10, and “Allowed Gap Positions” was
“None”. In the second set, which we call relaxed selec-
tion, we changed “Minimum Number of Sequences for
a Flank Position” to 9, “Maximum Number of Contigu-
ous Nonconserved Positions” to 10, “Minimum Length
of a Block” to 5, and “Allowed Gap Positions” to “With
Half”. The latter option allows the selection of positions
with gaps when they are present in less than half of the
sequences.
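The two parameter sets can be written down side by side as plain data, together with the gap rule that distinguishes them. The sketch below is illustrative only: the dictionary keys and the function name are hypothetical, and only the "Allowed Gap Positions" rule is implemented, not the full block selection performed by Gblocks.

```python
# Parameter values as stated in the text for alignments of 16 sequences.
STRINGENT = {
    "min_seqs_conserved_position": 9,
    "min_seqs_flank_position": 13,
    "max_contiguous_nonconserved": 8,
    "min_block_length": 10,
    "allowed_gap_positions": "none",
}

# The relaxed set changes four of the five values.
RELAXED = dict(
    STRINGENT,
    min_seqs_flank_position=9,
    max_contiguous_nonconserved=10,
    min_block_length=5,
    allowed_gap_positions="half",
)


def column_passes_gap_rule(column, allowed_gap_positions):
    """Return True if an alignment column is acceptable under the gap rule.

    column: string with one character per sequence, '-' marking a gap.
    """
    n_gaps = column.count("-")
    if allowed_gap_positions == "none":
        return n_gaps == 0
    if allowed_gap_positions == "half":
        # Gaps are tolerated when present in less than half of the sequences.
        return n_gaps < len(column) / 2
    raise ValueError("unknown gap setting: " + allowed_gap_positions)


column = "AAAAAAAAA-------"  # 16 sequences, 7 of them with a gap
keep_relaxed = column_passes_gap_rule(column, RELAXED["allowed_gap_positions"])
keep_stringent = column_passes_gap_rule(column, STRINGENT["allowed_gap_positions"])
```

A column with 7 gaps out of 16 sequences is a candidate for selection under the relaxed setting but is always rejected under the stringent one, which is one reason the relaxed set retains more positions.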
Original simulated alignments and Mafft realignments
for 30 example simulations (the first five simulations gen-
erated with the symmetric and asymmetric trees) are pro-
vided as supplementary information (available online at
http://systematicbiology.org).
Phylogenetic Reconstruction
Phylogenetic trees from the complete and the two dif-
ferent Gblocks alignments were estimated by ML, NJ,
and parsimony. For ML trees we used the Phyml pro-
gram version 2.4.4 (Guindon and Gascuel, 2003), with
the Jones-Taylor-Thornton model of protein evolution
(Jones et al., 1992) and four rate categories in the Gamma
distribution. The Gamma distribution parameter and
the proportion of invariable sites were estimated by the
program. For NJ trees we used Protdist of the Phylip
package version 3.63 (Felsenstein, 1989) with the Jones-
Taylor-Thornton model to calculate pairwise protein dis-
tances, and Neighbor of the same package to calculate the
NJ tree. For parsimony we used Protpars of the Phylip
package (Felsenstein, 1989) with 50 random initializa-
tions to ensure a thorough tree search. If no parsimony
tree was obtained, which occurred in less than 1% of the
simulations, the corresponding simulation was totally
excluded from the analysis. When several equally parsi-
monious trees were found, only the first one was used.
We did not estimate Bayesian trees because of the enormous
computational time required to run a sufficient number
of generations for all the simulations performed.
For each alignment length, alignment strategy, and
phylogenetic method, 300 simulations were run in a grid
of 24 processors. The symmetric difference or Robinson-
Foulds (Robinson and Foulds, 1981) topological distance
from the calculated tree to the real tree was obtained us-
ing Vanilla 1.2 (Drummond and Strimmer, 2001), and the
average over all simulations was calculated. This program reports
half the number of total discordant clades between
two trees. For bootstrap analyses, 100 bootstraps were
calculated. Due to heavy computational requirements of
the bootstrap analyses, the number of simulations was
reduced to 150. We checked that a higher number of boot-
straps and simulations did not improve the accuracy of
the bootstrap results. Bootstrap values were separately
calculated for right and wrong partitions of the tree with
the help of Bioperl functions (Stajich et al., 2002). Statisti-
cal differences among Robinson-Foulds distances in dif-
ferent alignment conditions were detected by the Tukey-
Kramer test with an alpha level of 0.05 using the JMP
package version 5.1 (SAS Institute, Cary, NC).
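The topological distance used above can be illustrated with a short sketch that counts the bipartitions (splits) found in only one of two trees and halves that count, matching the convention described for the reported values. This is not the code of Vanilla. Trees are represented here as nested tuples of leaf names, which is an assumption made purely for the example, and all function names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf names of a tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial splits of the unrooted tree, each stored as the side
    that does not contain an arbitrary reference leaf."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - leaves if reference in leaves else leaves
        # Skip trivial splits (a single leaf against all the others).
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return leaves

    visit(tree)
    return splits


def half_rf_distance(tree_a, tree_b):
    """Half the number of splits present in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b)) / 2


real_tree = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))
estimated = ((("A", "C"), ("B", "D")), (("E", "F"), ("G", "H")))
distance = half_rf_distance(real_tree, estimated)  # two splits differ per tree
```

For two fully resolved trees on the same tips, this halved value equals the number of internal branches of the real tree that are missing from the estimated tree.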
RESULTS AND DISCUSSION
General Alignment Strategy: Complete versus
Gblocks Alignments
The differences in alignments produced by different
methods can be appreciated in Figure 2. A fragment
of the alignment of simulated sequences (Fig. 2a) was
stripped of gaps and realigned by ClustalW (Fig. 2b),
Mafft (Fig. 2c), and Probcons (Fig. 2d). As has been
noted before (Higgins et al., 2005), ClustalW tends to
produce more compact alignments. That is, ClustalW
generates many divergent regions that are almost de-
void of gaps, resulting in a relatively simple alignment
(Higgins et al., 2005). This can be clearly appreciated in
the most problematic region in the center of this align-
ment (Fig. 2b). Although Mafft also tends to make align-
ments more compact than the real ones (Fig. 2c), the
deviation from the real situation is not as large as with
ClustalW, at least with default gap penalties. Probcons
TABLE 1. Average number of positions of the complete alignments and the average percentage of positions selected by Gblocks with relaxed
and stringent conditions. Simulation of sequences was done following the asymmetric tree and the heterogeneity pattern of the NAD2 protein
concatenated two times.
            ClustalW                        Mafft                           Probcons
            Total   % Gblocks  % Gblocks    Total   % Gblocks  % Gblocks    Total   % Gblocks  % Gblocks
Divergence  length  relaxed    stringent    length  relaxed    stringent    length  relaxed    stringent
×0.5        826.6   79.4       54.3         852.5   74.2       51.6         871.8   70.3       50.9
×1          862.4   64.2       42.0         903.7   59.0       39.8         966.4   51.8       37.6
×2          901.8   46.4       30.2         961.7   42.9       28.4         1117.9  34.7       24.5
produces the least compact alignments of the three pro-
grams tested (Fig. 2d). For example, simulations from
asymmetric trees with divergence ×1, which had an av-
erage original length of 1097 positions, were compacted
to an average of 966 positions by Probcons, to 904 posi-
tions by Mafft and to 862 positions by ClustalW (Table 1).
Similar relative degrees of compression were obtained in
other types of simulations.
Gblocks removes problematic regions of a multiple
alignment according to a number of rules. First, blocks
selected for inclusion must be free from a large number
of contiguous nonconserved positions, must be flanked
by highly conserved positions, and must have a mini-
mum length, as controlled by the corresponding param-
eters (see Materials and Methods). In addition, positions
with gaps can be removed either always or only when
more than half of the sequences contain gaps (Castre-
sana, 2000). The latter parameter has a large influence
on the total number of selected positions. We have used
Gblocks in simulated realigned sequences with two dif-
ferent conditions. The condition that we call stringent
does not allow any gap position. The relaxed condition
allows gap positions if they are present in less than half
of the sequences, and it is also less restrictive in the other
parameters (see Materials and Methods). The effect of
the two different parameter sets of Gblocks selection can
be appreciated in Figure 2, for ClustalW (Fig. 2b), Mafft
(Fig. 2c), and Probcons alignments (Fig. 2d). In all cases,
the relaxed parameters (grey blocks) allow the selection
of more positions than the stringent parameters (white
blocks). Table 1 shows the average number of positions of
the complete alignments and the percentage of positions
left after treatment with Gblocks with the two different
parameter sets. Values in this table are for the asymmetric
tree, but similar values were found for other trees.
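To make the percentages in Table 1 easier to compare, they can be converted into approximate numbers of retained positions. The sketch below simply multiplies the two averages reported in the table. The result is only an approximation, because the product of two averages is not the average of the per-simulation products, and the variable and function names are hypothetical.

```python
# (total length, % selected relaxed, % selected stringent), copied from Table 1.
TABLE_1 = {
    "ClustalW": {"x0.5": (826.6, 79.4, 54.3), "x1": (862.4, 64.2, 42.0), "x2": (901.8, 46.4, 30.2)},
    "Mafft": {"x0.5": (852.5, 74.2, 51.6), "x1": (903.7, 59.0, 39.8), "x2": (961.7, 42.9, 28.4)},
    "Probcons": {"x0.5": (871.8, 70.3, 50.9), "x1": (966.4, 51.8, 37.6), "x2": (1117.9, 34.7, 24.5)},
}


def retained_positions(total_length, percent_selected):
    """Approximate number of positions left after block selection."""
    return total_length * percent_selected / 100.0


approx_retained = {
    program: {
        divergence: (
            retained_positions(total, relaxed),
            retained_positions(total, stringent),
        )
        for divergence, (total, relaxed, stringent) in rows.items()
    }
    for program, rows in TABLE_1.items()
}

for program, rows in approx_retained.items():
    for divergence, (relaxed, stringent) in rows.items():
        print(f"{program:9s} {divergence:5s} relaxed={relaxed:6.1f} stringent={stringent:6.1f}")
```

Read this way, the table shows that the number of selected positions falls as divergence increases for all three programs, even though the complete alignments become longer.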
In order to infer which type of alignment algorithm
(ClustalW, Mafft, or Probcons) and which treatment of
the resulting alignment (no treatment or Gblocks treat-
ment with stringent or relaxed conditions) was best for
phylogenetic analysis, we calculated phylogenetic trees
from all these alignments, and measured the topologi-
cal distance with respect to the real tree. Figure 3 shows,
for the simulations with the asymmetric tree, the aver-
age topological distances to the real tree from the trees
generated with ClustalW alignments, with and with-
out the use of Gblocks. In addition, the distance to the
tree obtained from the Gblocks complementary align-
ment (that is, the alignment resulting after concatena-
tion of all the blocks rejected by Gblocks) is also shown.
[FIGURE 2. Fragment of the alignment of simulated sequences (a) and its realignments by ClustalW (b), Mafft (c), and Probcons (d). Alignment panels not reproduced.]
SDCTRSGKVQQYFTAQYMSQ--GKICSLIPDCLKVKFTSCLD YKSFNVSAAACGDP-----GTCYAARAWFF
QFKLS---V--GLDGNAGSAYEQASPANE GYKSERFTVQWK-YKARDRATIQHHWSVKVYRRRTT
DDCTREGRVEQYFSANYRSS--GILYSLILVCLQVKFTACINFKSFSCSPASCGTP-----SLCYADKNWF YQF--K---L--SVEGNGGSNFPQVDPANDGYKTD RFTVQWV-YKARDRASIKHHWSVDTYREGSC
d)
FIGURE 2. Fragment of a simulated alignment (a) and the realignment of the same sequences (after gap removal) by ClustalW (b), Mafft
(c), and Probcons (d). The simulation corresponds to an asymmetric tree with divergence ×1. The blocks below each alignment represent the
fragments selected by Gblocks with relaxed conditions (grey blocks) and with stringent conditions (white blocks). Positions of the alignments
where more than 50% of the sequences are identical are shown with black boxes.
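The black-box criterion in the caption (more than 50% of the sequences identical at a position) can be sketched in a few lines. This is only an illustration, not the procedure used to draw the figure: the function name is hypothetical, and treating gap characters as non-matching is an assumption.

```python
from collections import Counter

def majority_identical_columns(sequences, threshold=0.5):
    """Flag alignment columns where one residue occurs in more than
    `threshold` of the sequences. Gaps ('-') never count as a match
    (an assumption made for this sketch)."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n = len(sequences)
    flags = []
    for column in zip(*sequences):
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)
        # top is empty when the column contains only gaps
        flags.append(bool(top) and top[0][1] / n > threshold)
    return flags
```

Applied to each panel of an alignment, the returned list marks the positions that would be boxed under this reading of the caption.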
Figure 4 represents for each tree (and for two representa-
tive lengths, 800 and 3200 amino acids, as representatives
of single-gene and concatenated-gene phylogenies) the
best alignment strategies after statistically comparing the
average topological distances by means of the Tukey-Kramer
test. An overview of these two figures shows
that, when the alignments are cleaned by Gblocks with
either of the two parameter sets used (dotted lines in Figure 3),
the topological distance to the real tree decreases
with respect to the complete alignment (solid, red line)
in almost all divergences and alignment lengths tested,
and with the three tree reconstruction methods used:
Downloaded By: [USYB - Systematic Biology] At: 02:07 18 July 2007
2007 TALAVERA AND CASTRESANA—IMPROVEMENT OF PHYLOGENIES AFTER REMOVING BLOCKS 569
FIGURE 3. Average Robinson-Foulds distances to the real tree from the tree calculated with ClustalW complete alignments (solid, red line with
crossed symbols), the same alignments after treatment with Gblocks relaxed (dotted, blue line with diamonds) and stringent (dotted, green line
with squared symbols) conditions, and the complementary alignments of the Gblocks relaxed alignment (solid, orange line with triangles). The
asymmetric tree with three different divergence levels was used for the simulations with different alignment lengths. Trees were reconstructed
by ML, NJ, and parsimony.
ML, NJ, and parsimony. The improvement in topolog-
ical accuracy upon Gblocks treatment is more noticeable
for the highest divergences (×2). This is expected since
there are more problematic blocks in these alignments,
as shown by the lower percentage of positions selected
by Gblocks (Table 1). In addition, the improvement from
Gblocks treatment is particularly large for NJ and parsi-
mony. These two methods produce quite poor topologies
when using the complete alignments but, upon using
Gblocks, particularly with the most stringent conditions
(green line, squared symbols), there is a substantial gain
in topological accuracy. ML produces the overall best
trees (see also below) although, in the lowest divergence
(×0.5), there is almost no difference in topological qual-
ity between the Gblocks and the complete alignments.
In fact, for short genes (400 to 800 amino acids) the com-
plete alignment gives rise to better trees than the Gblocks
alignments, although there is no statistical difference be-
tween the complete alignment and the Gblocks align-
ment with relaxed parameters (Fig. 4).
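For readers unfamiliar with the topological distance used throughout these comparisons, the following is a minimal sketch of the Robinson-Foulds distance between two unrooted trees on the same taxa. It is not the software used in this work; representing trees as nested tuples of taxon names, and the function names, are assumptions made for illustration.

```python
def leaf_set(tree):
    """Return the set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Collect the nontrivial bipartitions (splits) of a tree.
    Each split is stored as the side that does NOT contain a fixed
    reference taxon, so that rooting does not matter."""
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if ref in below else below
        # skip trivial splits (a single taxon against all the others)
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found

def robinson_foulds(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must contain the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))
```

A distance of 0 means that the two topologies are identical; larger values mean that more partitions differ between the reconstructed tree and the real one.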
570 SYSTEMATIC BIOLOGY VOL. 56
FIGURE 4. ClustalW alignment strategies that give rise to the statistically best topologies. When two or more strategies do not show statistical
differences in Robinson-Foulds distances, all equivalent strategies are represented. The complete alignment is represented by a black block, and
the relaxed and stringent Gblocks strategies by grey and white blocks, respectively.
It is thus shown from the example above that the re-
moval of divergent and problematic regions of an align-
ment is, in principle, beneficial for phylogenetic analyses
of relatively divergent sequences. In fact, it is true, as pre-
viously argued (Aagesen, 2004; Lee, 2001), that there is
some phylogenetic information in the blocks removed
by methods like Gblocks. This can be appreciated in Fig-
ure 3, which shows the topological distances to the real
trees from the trees obtained with the blocks excluded by
Gblocks (complementary alignment; solid, orange line).
These distances, although very large, become quite re-
duced for long alignments, indicating that trees obtained
from the complementary regions are not random; that is,
there is some phylogenetic information in the regions re-
jected by Gblocks. However, what seems to matter is not
the total phylogenetic signal but the signal-to-noise ratio.
Despite the relatively simple simulations performed, re-
gions excluded by Gblocks seem to add more noise than
signal, thus lowering the quality of the trees from the
complete alignments with respect to the Gblocks-cleaned
alignments.
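The relation between a cleaned alignment and its complementary alignment can be made concrete with a short sketch. It assumes that the selected blocks are already available as 0-based, end-exclusive column ranges; this coordinate convention and the function name are hypothetical and do not reflect the Gblocks output format.

```python
def split_alignment(alignment, blocks):
    """Split an alignment (dict name -> aligned sequence) into the
    columns inside the selected blocks and the complementary columns."""
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    keep = [False] * length
    for start, end in blocks:
        if not 0 <= start <= end <= length:
            raise ValueError("block outside the alignment")
        for i in range(start, end):
            keep[i] = True
    selected = {name: "".join(c for c, k in zip(seq, keep) if k)
                for name, seq in alignment.items()}
    complement = {name: "".join(c for c, k in zip(seq, keep) if not k)
                  for name, seq in alignment.items()}
    return selected, complement
```

Building a tree from each of the two returned alignments and measuring its distance to the real tree gives the kind of comparison between selected and rejected regions discussed above.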
Similar conclusions about the beneficial effect of
Gblocks can be drawn from Mafft alignments of the same
asymmetric trees (Figs. 5 and 6). In this case, Gblocks is
not an advantage over the complete alignment in the two
most conserved alignments (×0.5 and ×1) when using
the ML method although, again, Gblocks relaxed and
the complete alignments are not statistically different.
The picture for Probcons (Fig. 1 of the online Appendix,
available at http://systematicbiology.org) is similar to
that for Mafft. Figure 2 of the online Appendix shows
a comparison of the three alignment programs with de-
fault gap costs, using the trees produced after Gblocks
cleaning with relaxed conditions. Under the conditions
of these simulations, ClustalW is slightly worse, regard-
ing the trees produced, than the two other programs. The
performances of Mafft and Probcons are very similar, and
only for NJ and parsimony do Probcons alignments work
slightly better. Probcons, however, is highly demanding
in computational time. Thus, for the rest of the tests
we only compared the performances of ClustalW and
Mafft.
FIGURE 5. Average Robinson-Foulds distances to the real tree from the tree calculated with Mafft complete alignments (solid, red line with
crossed symbols), the same alignments after treatment with Gblocks relaxed (dotted, blue line with diamonds) and stringent (dotted, green line
with squared symbols) conditions, and the complementary alignments of the Gblocks relaxed alignment (solid, orange line with triangles). The
asymmetric tree with three different divergence levels was used for the simulations with different alignment lengths. Trees were reconstructed
by ML, NJ, and parsimony.
The results for the symmetric and intermediate trees of
both alignment algorithms are shown in the correspond-
ing columns of Figures 4 and 6 for the ClustalW and
Mafft methods, respectively (and in Figures 3 to 6 in the
online Appendix for all alignment lengths). Two results
are noteworthy from these analyses. First, differences
in phylogenetic performance between different align-
ments derived from symmetric trees are quantitatively
smaller, in agreement with a previous work (Ogden and
Rosenberg, 2006). See, for example, the similarity of the
three graphs of ML trees of ClustalW alignments (Fig. 3
in the online Appendix). Second, in these trees there are
two conditions where the Gblocks alignments produce
ML trees that are statistically worse than the complete
alignments: the symmetric and intermediate trees of di-
vergence ×1 with Mafft alignments of 800 amino acids
(Fig. 6). These are the only two conditions where we ob-
served this. However, we do not think that this justifies
FIGURE 6. Mafft alignment strategies that give rise to the statistically best topologies. When two or more strategies do not show statistical
differences in Robinson-Foulds distances, all equivalent strategies are represented. The complete alignment is represented by a black block, and
the relaxed and stringent Gblocks strategies by grey and white blocks, respectively.
not using Gblocks in these types of trees, even if we
could know the shape of the tree in advance. In real
alignments, evolution must be much more complex than
what we simulated. For example, we did not simu-
late biased amino acid compositions (Castresana et al.,
1998a) or different models of evolution in different parts
of trees (Philippe and Laurent, 1998), all of which will
have stronger biasing effects in nonconserved blocks. Be-
cause the difference in topological accuracy between the
Gblocks and the complete alignments is very small in
these two conditions, it is very likely that the addition of
any of these effects in the simulations would have made
both the Gblocks relaxed and complete alignments of at
least equal performance.
All simulations shown so far were performed follow-
ing a pattern of rate variation of the NAD2 protein. To
test the influence of different rate patterns, we used in
the simulations profiles derived from two other proteins
(NAD4 and COG0285). From the Mafft alignments of
these simulations we calculated the corresponding ML
trees (Fig. 7 in the online Appendix). Different patterns
(and thus different percentages of block selection) gave
rise to different performances of the complete and the
Gblocks alignments, but the results were similar in rela-
tive terms. We also tested the performance of a different
gap model, in which gaps were introduced homoge-
neously along the alignment, instead of using two differ-
ent gap thresholds in different regions of the alignments
(see Materials and Methods). The results were again similar
with the simpler gap strategy, as shown for the ML
reconstruction of the asymmetric trees (Fig. 8 of the
online Appendix).
Phylogenetic Methods Used
The data shown above indicate that ML is the phyloge-
netic method that best extracts reliable information from
problematic alignment regions, since trees derived from
complete alignments are relatively good. This contrasts
with the trees obtained by NJ and parsimony, which are
quite poor from the complete alignments, indicating that
they greatly benefited from the use of Gblocks. ML is also
the method that produces the overall best trees, in agree-
ment with previous simulation analysis (see references
FIGURE 7. Average Robinson-Foulds distances to the real tree from the tree calculated with Mafft complete (solid line, solid symbols) and
ClustalW complete alignments (solid line, empty symbols). The tree distances obtained with the same alignments after treatment with Gblocks
with relaxed conditions (dotted lines) are also shown. Trees were reconstructed by ML (circles), NJ (squares), and parsimony (triangles). The
most divergent asymmetric tree was used for the simulations.
in Felsenstein, 2004). To show this, Figure 7 presents the
superimposed graphs for the most divergent asymmet-
ric tree as an example. The better performance of ML
in all alignment conditions is clearly appreciated in this
graph.
Short versus Long Alignments
Alignment length turned out to be a very important
factor to be taken into account when deciding the best
alignment cleaning strategy. Figures 3 and 5 show that,
in general, for shorter alignments the best Gblocks con-
dition is the relaxed one, whereas for longer alignments
the stringent condition tends to work better. This can also
be appreciated by comparing the slopes of the graphs
corresponding to the complete alignments, and those of
the Gblocks alignments with relaxed and stringent con-
ditions. The slope downwards (towards better trees) is
less pronounced for the complete alignments and more
pronounced for Gblocks with stringent conditions. This
means that for single genes (400 to 800 amino acids) the
gain in signal-to-noise ratio after elimination of prob-
lematic blocks may not compensate the total loss of in-
formation. However, for longer alignments, for example,
those used in phylogenomic studies where several genes
are concatenated (Delsuc et al., 2005; Jeffroy et al., 2006),
there is enough total information so that selecting the
best pieces with Gblocks using the stringent conditions
allows one to get closer to the real tree. This basic tendency
is observed under all simulation conditions we tested.
Bootstrap Support in Trees Obtained
from Gblocks Alignments
Previous performance tests of Gblocks with real data
showed that Gblocks alignments obtained less support
in ML analysis, because the number of trees not sig-
nificantly different from the ML tree was smaller in
the complete alignment than in the Gblocks alignment
(Castresana, 2000). Later, in numerous studies in our
group and in other groups, the same effect was observed
using bootstrap values of NJ trees, which were lower
in the Gblocks alignments. Our simulations reproduced
the same behavior again. In NJ trees obtained from 100
bootstrap samples, the average bootstrap support of all
partitions was higher for the complete alignments, and
lower for Gblocks alignments (Fig. 8). However, the same
simulations (see topological distances of NJ trees in Figures
3 and 5) showed that the best trees were obtained
with Gblocks conditions and the worst topologies with
the complete alignments, thus following the opposite direction,
regarding quality, to the bootstrap values, at least
for the maximum divergence. A similar trend was found
for NJ trees of simulations with symmetric trees (Fig. 9
of the online Appendix) and for bootstrapped ML trees
(Fig. 10 of the online Appendix). One may think that the
bootstraps of Gblocks trees are lower due to the smaller
length of the Gblocks alignments, but it is still very paradoxical
that the best topology is associated with a lower
bootstrap.
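As a reminder of how bootstrap support is obtained, the sketch below resamples alignment columns with replacement and then scores each partition by the percentage of replicates that recover it. It is a simplified illustration rather than the programs used in this work: tree building for each replicate is left out, and the function names and the representation of partitions as hashable labels are assumptions.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample the columns of an alignment (dict name -> sequence)
    with replacement, keeping each column intact across sequences."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all aligned sequences must have the same length")
    columns = [rng.randrange(length) for _ in range(length)]
    return {n: "".join(alignment[n][i] for i in columns) for n in names}

def split_support(reference_splits, replicate_splits):
    """Percentage of replicate trees that contain each reference split.
    `replicate_splits` is a list with one set of splits per replicate."""
    total = len(replicate_splits)
    if total == 0:
        raise ValueError("at least one replicate is required")
    return {s: 100.0 * sum(s in rep for rep in replicate_splits) / total
            for s in reference_splits}
```

Because every replicate draws from the same columns, any systematic signal in those columns, whether correct or not, is repeated across replicates and is reflected in the support values.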
The explanation for this contradictory behavior of
Gblocks may be that divergent and problematic align-
ment regions are biased towards an erroneous topology
(Lake, 1991). This could happen if the initial guide tree
used in the progressive alignment methods very strongly
drives the alignment in the divergent and most
gappy regions, where alignment programs may easily
create similarity at the expense of homology (Higgins
et al., 2005). In addition, when alignment software is
faced with an ambiguous alignment decision, the algo-
rithmic solution makes consistent but arbitrary decisions
FIGURE 8. Average bootstrap values of NJ trees obtained from ClustalW (a) and Mafft (b) alignments simulated from the asymmetric tree
with three different divergence levels. Complete (solid, red line), Gblocks relaxed (dotted, blue line with diamonds), and Gblocks stringent
(dotted, green line with squared symbols) alignments are shown.
FIGURE 9. Average Robinson-Foulds distances from the ClustalW guide tree to the real tree (red line with crossed symbols), from the guide
tree to the NJ tree of the Gblocks alignment with relaxed conditions (green line with squared symbols), and from the guide tree to the NJ tree
of the complementary positions of the same Gblocks alignment (blue line with diamonds). The asymmetric tree with three different divergence
levels was used for the simulations.
that bias the support indices. That is, these repeated alignment
decisions will increase the bootstrap support, and
this bias will be stronger in the most divergent regions,
where there is more uncertainty. Three results are con-
sistent with this possibility. Firstly, we have observed
in our simulations that the initial guide dendrogram
used by ClustalW is indeed very different from the real
tree, as measured by the Robinson-Foulds distance of
both trees (Fig. 9). If all divergent regions tend to eas-
ily reproduce this initial dendrogram, we would expect
that the guide tree is more similar to the tree obtained
from the Gblocks excluded regions than to the Gblocks
alignment. Figure 9 shows that this is the case, partic-
ularly in the most divergent simulations. Secondly, we
see that the effect of increased bootstrap support in the
complete alignment with respect to the Gblocks align-
ments is higher in ClustalW, which highly depends on
the initial dendrogram, than in Mafft (Fig. 8). For exam-
ple, in simulations of 400 amino acids and at ×2 diver-
gence, there is an increase from 60% to 76% bootstrap
support in ClustalW when comparing the Gblocks strin-
gent and complete alignments, and only from 60% to
70% in Mafft. In the latter method, the successive it-
erations of the alignment algorithm may make the fi-
nal alignment more independent from the initial crude
dendrogram, thus explaining that trees generated from
these alignments are slightly less biased. Thirdly,
when we separately calculated the bootstraps of right and
wrong partitions for each tree, we observed, apart from
lower values for wrong partitions, a slightly higher bias
in them (Fig. 11 of the online Appendix). The bias is
also present in the right partitions, probably because
some of the recurrent software decisions in the diver-
gent regions are actually correct. Thus, the bias coming
from divergent regions seems to increase the bootstrap
of all partitions, although the effect is slightly larger in
the wrong ones. All this indicates that bootstrap sup-
port cannot be used as a measure of reliability of the
tree topology when divergent regions are present in the
alignment.
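The separate treatment of right and wrong partitions mentioned above can be sketched as follows. The sketch assumes that the partitions of the reconstructed tree are available with their bootstrap values and that the partitions of the real tree are known, as is the case in simulations; the function name and data layout are hypothetical.

```python
def support_by_correctness(split_support, true_splits):
    """Average bootstrap support of right and wrong partitions.
    `split_support` maps each partition of the reconstructed tree to its
    bootstrap value; `true_splits` is the set of partitions of the real
    tree. Returns (mean_right, mean_wrong); a mean is None if no
    partition falls in that class."""
    right = [v for s, v in split_support.items() if s in true_splits]
    wrong = [v for s, v in split_support.items() if s not in true_splits]

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(right), mean(wrong)
```

Comparing both averages between the complete and the Gblocks alignments shows whether the increase in support falls mostly on partitions that are absent from the real tree.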
CONCLUSIONS
We have shown, under the conditions of these simu-
lations, that the information contained in divergent and
ambiguously aligned regions of multiple alignments is,
in general, not beneficial for phylogenetic reconstruction.
Thus, using Gblocks or a similar method for removing
problematic blocks seems to be justified for phylogenetic
analysis, particularly for divergent alignments. In this
work, we have used simulations of moderately divergent
and very heterogeneous proteins, which are typically
used in deep phylogenies (i.e., bacterial groups,
eukaryotic lineages, metazoan phyla). However, we do
not know how removal of blocks would affect more con-
served and less heterogeneous alignments. We have also
not tested how a finer tuning of parameters of align-
ment programs and Gblocks may improve the phyloge-
nies. Although we have only used protein alignments,
the same conclusions are expected to apply to protein-
coding DNA alignments of similar divergence. On the
other hand, although we predict that the general conclusion
holds for any data set, namely that ambiguously aligned regions
are best excluded when they provide more noise than signal,
rRNA alignments as well as alignments from noncoding
DNA have very different features from coding
alignments, and our simulations were not specifically
designed to explore the properties of these kinds of sequences.
However, our purpose in this work is not to give
strict rules about the best alignment strategy and asso-
ciated parameters. Rather, our simulations are mainly
informative about general tendencies. Thus, in the fol-
lowing we summarize important tendencies observed in
our simulations and give some general rules regarding
the best alignment strategy that can be applied to real
situations of protein alignments.
NJ and parsimony seem to be unable to extract
useful phylogenetic information from the problematic
alignment regions, because the complete alignments are
always much worse than the Gblocks treated alignments,
so using Gblocks seems particularly advisable for these
methods. Most probably, these two methods are not able
to take into account the multiple substitutions that oc-
cur in these excessively saturated blocks. On the other
hand, ML, less affected by saturation, is able to extract
some information from these blocks, since in some condi-
tions the complete alignments are similar or even better
than the Gblocks alignments. However, the misidenti-
fied homology that may occur in these regions affects
all phylogenetic methods, which may explain why us-
ing Gblocks is more beneficial at high divergences for all
methods.
Regarding the use of stringent or relaxed conditions
for Gblocks, two important rules can be extracted from
our analysis. First, for ML trees relaxed conditions of
Gblocks seem to give rise to better trees, whereas for NJ
and parsimony stringent conditions are better. Second,
alignment length is a crucial parameter to be taken into
account. For short alignments, such as in studies of sin-
gle short genes, the removal of blocks by Gblocks may
leave too few positions, so in these cases it may be better
to use very relaxed conditions of Gblocks. In the short-
est alignments, which have very little information, use
of Gblocks may be even detrimental. At any rate, one
should be aware that with this type of short alignments
it is only possible to obtain a very approximate topology,
possibly quite distant from the real tree. For phyloge-
nomic studies, where there is enough information from
the concatenation of several genes (Jeffroy et al., 2006),
the use of Gblocks with stringent conditions tends to give
rise to the best phylogenetic trees.
ACKNOWLEDGMENTS
This work was supported financially by a research grant in bioinformatics
from the Fundación BBVA (Spain), and grant number BIO2002-04426-C02-02
from the Plan Nacional de Investigación Científica,
Desarrollo e Innovación Tecnológica (I+D+I) of the MEC, cofinanced
with FEDER funds. We thank V. Soria-Carrasco for useful technical as-
sistance, and three anonymous reviewers, K. Kjer, and R.D.M. Page for
critical comments that helped improve the manuscript.
REFERENCES
Aagesen, L. 2004. The information content of an ambiguously alignable
region, a case study of the trnL intron from the Rhamnaceae. Organ.
Divers. Evol. 4:35–49.
Blackshields, G., I. M. Wallace, M. Larkin, and D. G. Higgins. 2006.
Analysis and comparison of benchmarks for multiple sequence
alignment. In Silico Biol. 6:321–339.
Castresana, J. 2000. Selection of conserved blocks from multiple align-
ments for their use in phylogenetic analysis. Mol. Biol. Evol. 17:540–
552.
Castresana, J., G. Feldmaier-Fuchs, and S. Pääbo. 1998a. Codon reassignment
and amino acid composition in hemichordate mitochondria.
Proc. Natl. Acad. Sci. USA 95:3703–3707.
Castresana, J., G. Feldmaier-Fuchs, S. Yokobori, N. Satoh, and S. Pääbo.
1998b. The mitochondrial genome of the hemichordate Balanoglossus
carnosus and the evolution of deuterostome mitochondria. Genetics
150:1115–1123.
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. A model of evo-
lutionary change in proteins. Pages 345–352 in Atlas of protein se-
quence structure (M. O. Dayhoff, ed.) National Biomedical Research
Foundation, Washington, D.C.
Delsuc, F., H. Brinkmann, and H. Philippe. 2005. Phylogenomics
and the reconstruction of the tree of life. Nat. Rev. Genet. 6:361–
375.
Do, C. B., M. S. Mahabhashyam, M. Brudno, and S. Batzoglou. 2005.
ProbCons: Probabilistic consistency-based multiple sequence align-
ment. Genome Res. 15:330–340.
Drummond, A., and K. Strimmer. 2001. PAL: An object-oriented pro-
gramming library for molecular evolution and phylogenetics. Bioin-
formatics 17:662–663.
Edgar, R. C. 2004. MUSCLE: Multiple sequence alignment with
high accuracy and high throughput. Nucleic Acids Res. 32:1792–
1797.
Felsenstein, J. 1989. PHYLIP—Phylogeny inference package (version
3.4). Cladistics 5:164–166.
Felsenstein, J. 2004. Inferring phylogenies. Sinauer Associates, Sunder-
land, Massachusetts.
Feng, D. F., and R. F. Doolittle. 1987. Progressive sequence alignment
as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25:351–
360.
Fleissner, R., D. Metzler, and A. von Haeseler. 2005. Simultaneous sta-
tistical multiple alignment and phylogeny reconstruction. Syst. Biol.
54:548–561.
Gatesy, J., R. DeSalle, and W. Wheeler. 1993. Alignment-ambiguous nu-
cleotide sites and the exclusion of systematic data. Mol. Phylogenet.
Evol. 2:152–157.
Geiger, D. L. 2002. Stretch coding and block coding: Two new strate-
gies to represent questionably aligned DNA sequences. J. Mol. Evol.
54:191–199.
Grundy, W. N., and G. J. Naylor. 1999. Phylogenetic inference from
conserved sites alignments. J. Exp. Zool. 285:128–139.
Guindon, S., and O. Gascuel. 2003. A simple, fast, and accurate algo-
rithm to estimate large phylogenies by maximum likelihood. Syst.
Biol. 52:696–704.
Gutell, R. R., N. Larsen, and C. R. Woese. 1994. Lessons from an evolving
rRNA: 16S and 23S rRNA structures from a comparative perspective.
Microbiol. Rev. 58:10–26.
Henikoff, S., and J. G. Henikoff. 1994. Protein family classification based
on searching a database of blocks. Genomics 19:97–107.
Herrmann, G., A. Schon, R. Brack-Werner, and T. Werner. 1996. CON-
RAD: A method for identification of variable and conserved re-
gions within proteins by scale-space filtering. Comput. Appl. Biosci.
12:197–203.
Higgins, D. G., G. Blackshields, and I. M. Wallace. 2005. Mind the
gaps: Progress in progressive alignment. Proc. Natl. Acad. Sci. USA
102:10411–10412.
Jeffroy, O., H. Brinkmann, F. Delsuc, and H. Philippe. 2006. Phy-
logenomics: The beginning of incongruence? Trends Genet. 22:225–
231.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation
of mutation data matrices from protein sequences. Comput. Appl.
Biosci. 8:275–282.
Katoh, K., K. Kuma, H. Toh, and T. Miyata. 2005. MAFFT version 5:
Improvement in accuracy of multiple sequence alignment. Nucleic
Acids Res. 33:511–518.
Katoh, K., K. Misawa, K. Kuma, and T. Miyata. 2002. MAFFT: A novel
method for rapid multiple sequence alignment based on fast Fourier
transform. Nucleic Acids Res. 30:3059–3066.
Kjer, K. M. 1995. Use of rRNA secondary structure in phylogenetic
studies to identify homologous positions: an example of alignment
and data presentation from the frogs. Mol. Phylogenet. Evol. 4:314–330.
Lake, J. A. 1991. The order of sequence alignment can bias the selection
of tree topology. Mol. Biol. Evol. 8:378–385.
Lassmann, T., and E. L. Sonnhammer. 2005. Kalign—An accurate and
fast multiple sequence alignment algorithm. BMC Bioinformatics
6:298.
Lee, M. S. 2001. Unalignable sequences and molecular evolution.
Trends Ecol. Evol. 16:681–685.
Löytynoja, A., and M. C. Milinkovitch. 2001. SOAP, cleaning multiple
alignments from unstable blocks. Bioinformatics 17:573–574.
Lunter, G., I. Miklos, A. Drummond, J. L. Jensen, and J. Hein. 2005.
Bayesian coestimation of phylogeny and sequence alignment. BMC
Bioinformatics 6:83.
Lutzoni, F., P. Wagner, V. Reeb, and S. Zoller. 2000. Integrating am-
biguously aligned regions of DNA sequences in phylogenetic anal-
yses without violating positional homology. Syst. Biol. 49:628–
651.
Morrison, D. A., and J. T. Ellis. 1997. Effects of nucleotide sequence
alignment on phylogeny estimation: A case study of 18S rDNAs of
apicomplexa. Mol. Biol. Evol. 14:428–441.
Needleman, S. B., and C. D. Wunsch. 1970. A general method applica-
ble to the search for similarities in the amino acid sequence of two
proteins. J. Mol. Biol. 48:443–453.
Notredame, C., D. G. Higgins, and J. Heringa. 2000. T-Coffee: A novel
method for fast and accurate multiple sequence alignment. J. Mol.
Biol. 302:205–217.
Nuin, P. A., Z. Wang, and E. R. Tillier. 2006. The accuracy of several
multiple sequence alignment programs for proteins. BMC Bioinfor-
matics 7:471.
Ogden, T. H., and M. S. Rosenberg. 2006. Multiple sequence alignment
accuracy and phylogenetic inference. Syst. Biol. 55:314–328.
Pesole, G., M. Attimonelli, G. Preparata, and C. Saccone. 1992. A sta-
tistical method for detecting regions with different evolutionary
dynamics in multialigned sequences. Mol. Phylogenet. Evol. 1:91–
96.
Philippe, H., and J. Laurent. 1998. How good are deep phylogenetic
trees? Curr. Opin. Genet. Dev. 8:616–623.
Redelings, B. D., and M. A. Suchard. 2005. Joint Bayesian estimation of
alignment and phylogeny. Syst. Biol. 54:401–418.
Robinson, D. F., and L. R. Foulds. 1981. Comparison of phylogenetic
trees. Math. Biosci. 53:131–147.
Rodrigo, A. G., P. R. Bergquist, and P. L. Bergquist. 1994. Inadequate
support for an evolutionary link between the Metazoa and the Fungi.
Syst. Biol. 43:578–584.
Smythe, A. B., M. J. Sanderson, and S. A. Nadler. 2006. Nematode small
subunit phylogeny correlates with alignment parameters. Syst. Biol.
55:972–992.
Stajich, J. E., et al. 2002. The Bioperl toolkit: Perl modules for the life
sciences. Genome Res. 12:1611–1618.
Stoye, J., D. Evers, and F. Meyer. 1998. Rose: Generating sequence fam-
ilies. Bioinformatics 14:157–163.
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: A quar-
tet maximum-likelihood method for reconstructing tree topologies.
Mol. Biol. Evol. 13:964–969.
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phy-
logenetic inference. Pages 407–514 in Molecular systematics (D. M.
Hillis, C. Moritz, and B. K. Mable, eds.). Sinauer Associates, Sunder-
land, Massachusetts.
Tatusov, R. L., N. D. Fedorova, J. D. Jackson, A. R. Jacobs, B. Kiryutin,
E. V. Koonin, D. M. Krylov, R. Mazumder, S. L. Mekhedov, A. N.
Nikolskaya, B. S. Rao, S. Smirnov, A. V. Sverdlov, S. Vasudevan, Y. I.
Wolf, J. J. Yin, and D. A. Natale. 2003. The COG database: an updated
version includes eukaryotes. BMC Bioinformatics 4:41.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL
W: Improving the sensitivity of progressive multiple sequence
alignment through sequence weighting, position-specific gap
penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.
Thompson, J. D., P. Koehl, R. Ripp, and O. Poch. 2005. BAliBASE 3.0:
Latest developments of the multiple sequence alignment benchmark.
Proteins 61:127–136.
Wheeler, W. 2001. Homology and the optimization of DNA sequence
data. Cladistics 17:S3–S11.
Xia, X., Z. Xie, and K. M. Kjer. 2003. 18S ribosomal RNA and tetrapod
phylogeny. Syst. Biol. 52:283–295.
Yang, Z. 1998. On the best evolutionary rate for phylogenetic analysis.
Syst. Biol. 47:125–133.
Young, N. D., and J. Healy. 2003. GapCoder automates the use
of indel characters in phylogenetic analysis. BMC Bioinformatics
4:6.
First submitted 7 February 2007; reviews returned 6 March 2007;
final acceptance 24 March 2007
Associate Editor: Karl Kjer
Editors: Rod Page and Jack Sullivan
... For the construction of the phylogenetic tree, the ITS sequence obtained was aligned with 13 Fusarium spp. sequences obtained from NCBI using MUSCLE (Edgar 2004), employing Gblocks (Talavera and Castresana 2007) within the Phylogeny.fr platform (Dereeper et al. 2008) for multiple sequence alignment. ...
Article
The use of synthetic fungicides has increased significantly in recent years, necessitating the search for more sustainable and environmentally friendly alternatives. In this regard, chitosan has emerged as an option to reduce reliance on these products. This study evaluated the effect of chitosan as a biocontrol agent against Fusarium oxysporum in tomato fruits. A fully randomized experimental design incorporating 6 treatments was employed, consisting of four chitosan treatments (0.5, 1, 2, and 3 g L⁻¹), a negative control involving the application of a synthetic fungicide, and a positive control inoculated with F. oxysporum. Samples were taken from infected tomato fruits. The F4 isolate of Fusarium sp. was identified as F. oxysporum and demonstrated the highest level of virulence. Among the four chitosan treatments, the 3 g L⁻¹ treatment showed the highest percentage of mycelial growth inhibition (PMGI) at 79.92% and the greatest reduction in biomass at 0.65 g, which did not differ significantly from the synthetic fungicide. Regarding disease severity and incidence, there were significant variations among the chitosan treatments, with the highest results obtained with the 2 and 3 g L⁻¹ treatments. All chitosan treatments reduced disease severity in tomato fruits. Applying chitosan to tomato fruits presents an alternative for diminishing reliance on synthetic fungicides.
... 7.0 (Katoh & Standley, 2013), and ambiguously aligned positions were excluded using Gblocks (Talavera & Castresana, 2007). ...
Article
Despite the worldwide distribution and rich diversity of the superfamily Tenebrionoidea, knowledge of the characteristics of its mitochondrial genomes (mtgenomes) is still very limited, and its phylogenetics and evolution remain unresolved. In the present study, we newly sequenced mtgenomes from 19 species belonging to Tenebrionoidea, and a total of 90 mitochondrial genomes from 16 families of Tenebrionoidea were used for phylogenetic analysis. All 82 species with complete mtgenomes from the 16 families investigated have 37 genes, and their characteristics are identical to those reported for mtgenomes of other tenebrionoids. The Ka/Ks analysis suggests that all 13 PCGs have undergone strong purifying selection. The phylogenetic analysis suggests the monophyly of Mordellidae, Meloidae, Oedemeridae, Pyrochroidae, Salpingidae, Scraptiidae, Lagriidae, and Tenebrionidae, and that Mordellidae is close to Ripiphoridae. The “Tenebrionidae clade” and “Meloidae clade” are each monophyletic, and they are sister groups. In the “Meloidae clade,” Meloidae is close to Anthicidae. In the “Tenebrionidae clade,” the families Lagriidae and Tenebrionidae are sister groups. The divergence time analysis suggests that Tenebrionoidea originated in the late Jurassic; Meloidae, Mordellidae, Lagriidae, and Tenebrionidae in the Cretaceous; and Oedemeridae in the Paleogene. The work lays a foundation for the study of the mtgenomes, phylogenetics, and evolution of the superfamily Tenebrionoidea.
... The resulting alignments were then converted to codon sequences using PAL2NAL v14.1 [73]. To ensure alignment quality, we applied Gblocks v0.91 [74] to select conserved blocks from each alignment. These selected blocks were used to construct a ML tree using IQ-TREE v2.0.3 [75] with 5000 ultrafast bootstraps. ...
Article
Background Cliffs are recognized as one of the most challenging environments for plants, characterized by harsh conditions such as drought, infertile soil, and steep terrain. However, they surprisingly host ancient and diverse plant communities and play a crucial role in protecting biodiversity. The Taihang Mountains, which act as a natural boundary in eastern China, support a rich variety of plant species, including many unique to cliff habitats. However, little is known about how cliff plants adapt to harsh habitats or about their demographic history in this region. Results To better understand the demographic history and adaptation of cliff plants in this area, we analyzed the chromosome-level genome of a representative cliff plant, T. rupestris var. ciliata, which has a genome size of 769.5 Mb, with a scaffold N50 of 104.92 Mb. The rapid expansion of transposable elements may have contributed to the increase in genome size and to its ability to adapt to unique and challenging cliff habitats. Comparative analysis of the genome evolution between Taihangia and non-cliff plants in Rosaceae revealed a significant expansion of gene families associated with oxidative phosphorylation, which is likely a response to the abiotic stresses faced by cliff plants. This expansion may explain the long-term adaptation of Taihangia to harsh cliff environments. The effective population size of the two varieties has continuously decreased due to climatic fluctuations during the Quaternary period. Furthermore, significant differences in gene expression between the two varieties may explain the varied leaf phenotypes and adaptations to harsh conditions in different natural distributions. Conclusion Our study highlights the extraordinary adaptation of T. rupestris var. ciliata, shedding light on the evolution of cliff plants worldwide.
... To investigate the evolutionary position of oil-Camellia, a phylogenetic tree was constructed using the 2237 conserved single-copy genes among the six species. The conserved protein sequences of these single-copy orthologues were aligned and extracted by using MAFFT (v.7.471) (Kazutaka and Standley, 2013) and Gblocks (v.0.91b) (Talavera and Castresana, 2007) and then concatenated to generate a supermatrix. The maximum-likelihood phylogenetic tree was generated under the 'PROTGAMMAAUTO' model using RAxML (v.8.2.1264) (Stamatakis, 2014) to automatically determine the best reasonable tree by conducting 1000 bootstrap replicates. ...
Article
Oil‐Camellia (Camellia oleifera), belonging to the genus Camellia in the family Theaceae, is an important woody edible oil tree species. The Camellia oil in its mature seed kernels consists mainly of more than 90% unsaturated fatty acids, together with tea polyphenols, flavonoids, squalene and other active substances, and is one of the best quality edible vegetable oils in the world. However, genetic research and molecular breeding on oil‐Camellia are challenging due to its complex genetic background. Here, we report a chromosome‐scale genome assembly for a hexaploid oil‐Camellia cultivar Changlin40. This assembly contains 8.80 Gb genomic sequences with scaffold N50 of 180.0 Mb and 45 pseudochromosomes comprising 15 homologous groups with three members each, which contain 135 868 genes with an average length of 3936 bp. Referring to the diploid genome, intragenomic and intergenomic comparisons of synteny indicate homologous chromosomal similarity and changes. Moreover, comparative and evolutionary analyses reveal three rounds of whole‐genome duplication (WGD) events, and suggest that hexaploid Changlin40 diverged from the diploid approximately 9.06 million years ago (MYA). Furthermore, through the combination of genomics, transcriptomics and metabolomics approaches, a complex regulatory network was constructed that allows the identification of potential key structural genes (SAD, FAD2 and FAD3) and transcription factors (AP2 and C2H2) that regulate the metabolism of Camellia oil, especially unsaturated fatty acid biosynthesis. Overall, the genomic resource generated from this study has great potential to accelerate research on the molecular biology and genetic improvement of hexaploid oil‐Camellia, as well as to improve understanding of polyploid genome evolution.
... The individual genes of PCGs and nuclear sequences were aligned using the MAFFT algorithm [55] implemented in PhyloSuite [56] with the L-INS-I strategy. The alignments were trimmed using G-blocks [57] and then concatenated into the aforementioned different datasets. The pre-defined partitions for different datasets follow a consistent pattern, with protein-coding genes divided according to codon positions, while other genes are each placed into a separate partition. ...
Article
The firefly genus Oculogryphus Jeng, Engel & Yang, 2007 is a rare-species group endemic to Asia. Since its establishment, its position has been controversial but never rigorously tested. To address this perplexing issue, we are the first to present the complete mitochondrial sequence of Oculogryphus, using the material of O. chenghoiyanae Yiu & Jeng, 2018 determined through a comprehensive morphological identification. Our analyses demonstrate that its mitogenome exhibits similar characteristics to that of Stenocladius, including a rearranged gene order between trnC and trnW, and a long intergenic spacer (702 bp) between the two rearranged genes, within which six remnants (29 bp) of trnW were identified. Further, we incorporated this sequence into phylogenetic analyses of Lampyridae based on different molecular markers and datasets using ML and BI analyses. The results consistently place Oculogryphus within the same clade as Stenocladius in all topologies, and the gene rearrangement is a synapomorphy for this clade. It suggests that Oculogryphus should be classified together with Stenocladius in the subfamily Ototretinae at the moment. This study provides molecular evidence confirming the close relationship between Oculogryphus and Stenocladius and discovers a new phylogenetic marker helpful in clarifying the monophyly of Ototretinae, which also sheds a new light on firefly evolution.
... Gene dataset was filtered using a local version of program TranslatorX vLocal.pl [65] in the following four steps: (1) use the standard genetic code to translate the nucleotide sequences into amino acids sequences; (2) align these peptide sequences of each putative SCG with MAFFT v.7.505 [66]; (3) further trim the amino acid sequences that exclude ambiguous portions using Gblocks [67]; (4) convert the alignments into the corresponding nucleotide alignments. ...
... For each TSET subunit found in stramenopiles and haptophytes, sequences of the putative homologs were retrieved and aligned with the query sequences and their AP/COP paralogues (outgroup) in Seaview using ClustalO (default parameters) [33,34]. Alignments were trimmed in Seaview using Gblocks (options for a less stringent selection) [33,35], and phylogenetic trees were computed in Seaview using PhyML with default parameters (LG model, aLRT branch support). Putative homolog sequences that did not group with the reference sequences were considered as false positives and removed using the Taxus software. ...
Preprint
Eukaryotic cell biology is largely understood from paradigms established on few model organisms, largely from the animal and fungi (opisthokonts) and to a lesser extent plants. These organisms, however, constitute only a small proportion of eukaryotic diversity, and the principles of their cell biology may not be universal to other, understudied but globally impactful, organisms. Intriguingly, there are cellular components that are present in diverse eukaryotes, but are not in the animals and fungi on which the best developed models of cell biology are derived. Consequently, these components are not included in the generally adopted frameworks of cellular function that are meant to explain eukaryotic biology. The membrane complex TSET is the best studied such example, well established to play a role in cell division and endocytosis in plants. It is found across eukaryotes, but is highly reduced in opisthokonts. Its general prevalence, abundance, and relevance in eukaryotic cellular activity is unclear. Here we show that TSET is encoded in genomes of five cosmopolitan and critical groups of primarily photosynthetic eukaryotes (green algae, red algae, stramenopiles, haptophytes and cryptophytes), with particular prevalence in the green algae and some stramenopile groups. A meta-analysis of published gene expression data from the model diatom Phaeodactylum tricornutum shows that this complex is coregulated with components of the endomembrane trafficking machinery. Moreover, meta-transcriptomic data from Tara Oceans reveals that TSET genes are both present and expressed by diatoms in the wild. These data suggest that TSET may be playing an important and underrecognized role in cellular activities within marine ecosystems. 
More broadly, the results support the idea that use of systems-level data for non-model organisms can illuminate our understanding of core principles of eukaryotic cell function, and may reveal important and under-appreciated players that deserve to be integrated into the pervasive models of cellular capacity.
... We generated a multiple sequence alignment (MSA) with MAFFT v7.475 [93] for each of the 12 genes. Then, we extracted the conserved blocks with Gblocks v0.91b [94] in the semi-strict mode (controlled by the -b5=h parameter). Later, we built a matrix of concatenated sequences using the 12 conserved blocks with the tool Catsequences [95], obtaining a total length of 7,220 bp. ...
Article
Background To unravel the evolutionary history of a complex group, a comprehensive reconstruction of its phylogenetic relationships is crucial. This requires meticulous taxon sampling and careful consideration of multiple characters to ensure a complete and accurate reconstruction. The phylogenetic position of the Orestias genus has been estimated partly on unavailable or incomplete information. As a consequence, it was assigned to the family Cyprinodontidae, relating this Andean fish to other geographically distant genera distributed in the Mediterranean, Middle East and North and Central America. In this study, using complete genome sequencing, we aim to clarify the phylogenetic position of Orestias within the Cyprinodontiformes order. Results We sequenced the genome of three Orestias species from the Andean Altiplano. Our analysis revealed that the small genome size in this genus (~ 0.7 Gb) was caused by a contraction in transposable element (TE) content, particularly in DNA elements and short interspersed nuclear elements (SINEs). Using predicted gene sequences, we generated a phylogenetic tree of Cyprinodontiformes using 902 orthologs extracted from all 32 available genomes as well as three outgroup species. We complemented this analysis with a phylogenetic reconstruction and time calibration considering 12 molecular markers (eight nuclear and four mitochondrial genes) and a stratified taxon sampling to consider 198 species of nearly all families and genera of this order. Overall, our results show that phylogenetic closeness is directly related to geographical distance. Importantly, we found that Orestias is not part of the Cyprinodontidae family, and that it is more closely related to the South American fish fauna, being the Fluviphylacidae the closest sister group. Conclusions The evolutionary history of the Orestias genus is linked to the South American ichthyofauna and it should no longer be considered a member of the Cyprinodontidae family. Instead, we submit that Orestias belongs to the Orestiidae family, as suggested by Freyhof et al. (2017), and that it is the sister group of the Fluviphylacidae family, distributed in the Amazonian and Orinoco basins. These two groups likely diverged during the Late Eocene concomitant with hydrogeological changes in the South American landscape.
... For the transcriptome data of other sea buckthorn individuals, we performed separate assemblies based on the same sampling locations. The transcriptome data from multiple individuals at each sampling point in each batch were combined to generate a single assembly result, representing the transcriptome profile of that sampling point in that batch. Extract the longest transcript as the Unigene, predict the CDS sequence using TransDecoder v.5.5.0, extract single-copy orthologous genes using OrthoFinder v.2.5.4 (Emms & Kelly, 2019), align the protein sequences using MUSCLE v.5.1 (Edgar, 2021), trim the aligned sequences using Gblocks v.0.91b (Talavera & Castresana, 2007), construct a phylogenetic tree using RAxML v.8.2.12 (Kozlov et al., 2019), extract the corresponding CDS sequences for the proteins, perform codon-based alignment using prank v.SNPs or INDELs loci that are stably homozygous within each species and show differentiation between the two species, using SNP and INDEL information from three H. sinensis individuals and three H. neurocarpa individuals. As the hybrid offspring, if H. goniocarpa individual is a hybrid F1 generation, these SNP loci in its genome should exhibit heterozygous. ...
Preprint
The natural hybridization of sea buckthorn is widely observed by researchers. While studies have identified the parents of these hybrid offspring, distinguishing between F1 and Fn generations is challenging for natural hybrids. As a result, the genetic composition of these hybrid offspring remains underexplored. In this study, we propose a novel method for identifying hybrid F1 generations using transcriptome data and reference genomes. We successfully identified eight individuals from two natural hybrid populations of sea buckthorn, all of which were confirmed to be hybrid F1 generations. Additionally, we first noted limitations in detecting heterozygous sites during SNP calling in transcriptome data, where allele-specific expression and low expression of genes or transcripts can lead to heterozygous SNPs being incorrectly identified as homozygous. Furthermore, we constructed a phylogenomic tree of the sea buckthorn genus using transcriptome data and compared the relationships among various sea buckthorn species using SNP and indel molecular markers obtained through transcriptome data.
Article
Dipcadi ursulae is identified as having leaves with a central white band, long acuminate bracts, and sweet-scented flowers. Dipcadi ursulae var. longiracemosum has been raised to the species level due to its morphological and anatomical features. Morphometric analysis based on 89 morphological characters is provided to support the status change. A unique new variety of Dipcadi, which is morphologically closely related to Dipcadi ursulae but differs in having pure white campanulate-urceolate flowers, and unequal spreading perianth lobes, is described here. Anatomical, phylogenetic and morphometric evidence is provided to support the status of the new variety.
Article
An earlier analysis of the trnL intron in the Colletieae (Rhamnaceae) showed polyphyly of the genus Discaria. Polyphyly of Discaria is supported only by an AT-rich region of ambiguous alignment within the trnL intron. Polyphyly of the genus relies on extracting the information of the AT-rich region correctly. Ambiguously aligned regions are commonly excluded from phylogenetic analysis. In the present study the question was raised whether random or noisy data could generate a pattern like the one found in the AT-rich region of ambiguous alignment. The original pattern was resistant to changes in alignment parameter cost when submitted to a sensitivity analysis using direct optimization. Artificially generated random or noisy data gave well-resolved trees but these were found to be extremely sensitive to changes in parameter costs. However, information from additional data, such as conserved regions, restricts the influence of random data. It is here suggested that the information in ambiguously aligned regions need not be dismissed, provided that an appropriate method that finds all possible optimal alignments is used to extract the information. In addition to commonly used support measures, some information of robustness to changes in alignment parameter costs is needed in order to make the most reliable conclusions.
Article
Phylogenetic analyses of non-protein-coding nucleotide sequences such as ribosomal RNA genes, internal transcribed spacers, and introns are often impeded by regions of the alignments that are ambiguously aligned. These regions are characterized by the presence of gaps and their uncertain positions, no matter which optimization criteria are used. This problem is particularly acute in large-scale phylogenetic studies and when aligning highly diverged sequences. Accommodating these regions, where positional homology is likely to be violated, in phylogenetic analyses has been dealt with very differently by molecular systematists and evolutionists, ranging from the total exclusion of these regions to the inclusion of every position regardless of ambiguity in the alignment. We present a new method that allows the inclusion of ambiguously aligned regions without violating homology. In this three-step procedure, first homologous regions of the alignment containing ambiguously aligned sequences are delimited. Second, each ambiguously aligned region is unequivocally coded as a new character, replacing its respective ambiguous region. Third, each of the coded characters is subjected to a specific step matrix to account for the differential number of changes (summing substitutions and indels) needed to transform one sequence to another. The optimal number of steps included in the step matrix is the one derived from the pairwise alignment with the greatest similarity and the least number of steps. In addition to potentially enhancing phylogenetic resolution and support, by integrating previously nonaccessible characters without violating positional homology, this new approach can improve branch length estimations when using parsimony.
Article
A multiple sequence alignment program, MAFFT, has been developed. The CPU time is drastically reduced as compared with existing methods. MAFFT includes two novel techniques. (i) Homologous regions are rapidly identified by the fast Fourier transform (FFT), in which an amino acid sequence is converted to a sequence composed of volume and polarity values of each amino acid residue. (ii) We propose a simplified scoring system that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length. Two different heuristics, the progressive method (FFT‐NS‐2) and the iterative refinement method (FFT‐NS‐i), are implemented in MAFFT. The performances of FFT‐NS‐2 and FFT‐NS‐i were compared with other methods by computer simulations and benchmark tests; the CPU time of FFT‐NS‐2 is drastically reduced as compared with CLUSTALW with comparable accuracy. FFT‐NS‐i is over 100 times faster than T‐COFFEE, when the number of input sequences exceeds 60, without sacrificing the accuracy.
Article
In the eight years since we last examined the amino acid exchanges seen in closely related proteins, the information has doubled in quantity and comes from a much wider variety of protein types. The matrices derived from these data that describe the amino acid replacement probabilities between two sequences at various evolutionary distances are more accurate and the scoring matrix that is derived is more sensitive in detecting distant relationships than the one that we previously derived. The method used in this chapter is essentially the same as that described in the Atlas, Volume 3 and Volume 5. Accepted Point Mutations An accepted point mutation in a protein is a replacement of one amino acid by another, accepted by natural selection. It is the result of two distinct processes: the
Article
A metric on general phylogenetic trees is presented. This extends the work of most previous authors, who constructed metrics for binary trees. The metric presented in this paper makes possible the comparison of the many nonbinary phylogenetic trees appearing in the literature. This provides an objective procedure for comparing the different methods for constructing phylogenetic trees. The metric is based on elementary operations which transform one tree into another. Various results obtained in applying these operations are given. They enable the distance between any pair of trees to be calculated efficiently. This generalizes previous work by Bourque to the case where interior vertices can be labeled, and labels may contain more than one element or may be empty.
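As an illustration of the kind of tree comparison this abstract describes, the distance is commonly computed as the number of bipartitions (splits) present in one tree but not in the other. The following is a minimal sketch under stated assumptions: trees are given as nested Python tuples of leaf names, both trees are treated as unrooted, and the function names `leaf_set`, `splits`, and `rf_distance` are hypothetical helpers, not part of any published tool.

```python
def leaf_set(tree):
    """Return the frozenset of leaf names under a nested-tuple tree node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial splits of a tree treated as unrooted.

    Each split is stored as the side that does NOT contain a fixed
    anchor leaf, so the same split always has the same representation.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # arbitrary but deterministic reference leaf
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Count splits found in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same leaf set")
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical five-taxon example trees.
t1 = (("A", "B"), "C", ("D", "E"))
t2 = (("A", "C"), "B", ("D", "E"))
print(rf_distance(t1, t2))
```

Storing each split as the side without the anchor leaf means that differently rooted drawings of the same unrooted tree yield identical split sets, so their distance is zero.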
Article
We describe a new method (T-Coffee) for multiple sequence alignment that provides a dramatic improvement in accuracy with a modest sacrifice in speed as compared to the most commonly used alternatives. The method is broadly based on the popular progressive approach to multiple alignment but avoids the most serious pitfalls caused by the greedy nature of this algorithm. With T-Coffee we pre-process a data set of all pair-wise alignments between the sequences. This provides us with a library of alignment information that can be used to guide the progressive alignment. Intermediate alignments are then based not only on the sequences to be aligned next but also on how all of the sequences align with each other. This alignment information can be derived from heterogeneous sources such as a mixture of alignment programs and/or structure superposition. Here, we illustrate the power of the approach by using a combination of local and global pair-wise alignments to generate the library. The resulting alignments are significantly more reliable, as determined by comparison with a set of 141 test cases, than any of the popular alternatives that we tried. The improvement, especially clear with the more difficult test cases, is always visible, regardless of the phylogenetic spread of the sequences in the tests.