© 1997 Oxford University Press
4876–4882 Nucleic Acids Research, 1997, Vol. 25, No. 24
The CLUSTAL_X windows interface: flexible strategies
for multiple sequence alignment aided by quality
analysis tools
Julie D. Thompson, Toby J. Gibson1, Frédéric Plewniak, François Jeanmougin* and
Desmond G. Higgins2
Institut de Genetique et de Biologie Moleculaire et Cellulaire (CNRS/INSERM/ULP), BP 163, 67404 Illkirch Cedex,
France, 1European Molecular Biology Laboratory, Postfach 10.2209, 69012 Heidelberg, Germany and 2Department of
Biochemistry, University College, Cork, Ireland
Received September 24, 1997; Revised and Accepted October 28, 1997
ABSTRACT
CLUSTAL X is a new windows interface for the
widely-used progressive multiple sequence alignment
program CLUSTAL W. The new system is easy to use,
providing an integrated system for performing multiple
sequence and profile alignments and analysing the
results. CLUSTAL X displays the sequence alignment in
a window on the screen. A versatile sequence colouring
scheme allows the user to highlight conserved features
in the alignment. Pull-down menus provide all the
options required for traditional multiple sequence and
profile alignment. New features include: the ability to
cut-and-paste sequences to change the order of the
alignment, selection of a subset of the sequences to be
realigned, and selection of a sub-range of the alignment
to be realigned and inserted back into the original
alignment. Alignment quality analysis can be performed
and low-scoring segments or exceptional residues can
be highlighted. Quality analysis and realignment of
selected residue ranges provide the user with a
powerful tool to improve and refine difficult alignments
and to trap errors in input sequences. CLUSTAL X has
been compiled on SUN Solaris, IRIX5.3 on Silicon
Graphics, Digital UNIX on DECstations, Microsoft
Windows (32 bit) for PCs, Linux ELF for x86 PCs, and
Macintosh PowerMac.
INTRODUCTION
The most widely used method in molecular biology for aligning sets of
nucleotide or amino acid sequences is to build up a multiple
alignment progressively (1,2). The most closely related groups of
sequences are aligned first and then these groups are gradually
aligned together, keeping the early alignments fixed. This approach
works well when the sequences are sufficiently closely related.
However, a globally optimal solution (or a biologically significant
one) cannot be guaranteed. In more difficult cases, where many
sequences have <30% residue identity, this automatic method
becomes less reliable. Any misaligned regions introduced in
previous stages of the progressive alignment are not corrected later
as new information from other sequences is added. In such cases,
the automatic alignments need to be refined, either manually or
automatically.
Numerous sequence editors have been developed which allow
the user to display and manually make or modify an alignment
(e.g. 3–7). These programs are useful for making small refinements
to an alignment, but the totally manual alignment of large numbers
of sequences is not feasible. Manual alignment is also highly
subjective and hence is at least as likely as the automatic alignment
process to result in errors in the alignment.
The CLUSTAL X interface has been written to provide a single
environment in which the user can perform multiple alignments,
view the results and, if necessary, refine and improve the
alignment. Tools for alignment quality analysis have been
developed which allow the user to highlight low-scoring regions
in the alignment. Options are available for automatically
correcting these low-scoring regions by realigning a misaligned
sequence or a selected region of an alignment.
In earlier CLUSTAL programs (8–11), nested text menus
provided all the options to do multiple sequence/profile alignments
and simple phylogenetic tree generation. The output alignments
were written to file for display, printing or further manipulation.
With these simple menus, the CLUSTAL programs could be highly
portable, and run on essentially all computers. Portability has been
a major factor in the widespread usage of the CLUSTAL series for
sequence alignment. On the other hand, much more attractive and
powerful user interfaces can be built using non-portable windows
systems.
The NCBI Software Development Toolkit (Version 1.8, National
Center for Biotechnology Information, Bethesda, MD) provides one
solution to the windows portability problem. It interfaces between
the application code and various host windowing systems including
the X Window System, Macintosh windows and Microsoft
Windows. We have made use of the toolkit to provide a portable
windows interface, termed CLUSTAL X.
*To whom correspondence should be addressed. Tel: +33 388 65 32 71; Fax: +33 388 65 32 01; Email: jeanmougin@igbmc.u-strasbg.fr
CLUSTAL X is a new graphical interface to the CLUSTAL W
program which displays the sequence alignment in a window on
the screen, allowing the user to move easily between different parts
of the alignment. Pull-down menus provide all the options familiar
to users of the text-menu-driven CLUSTAL W, plus several new
features. A versatile, configurable, colouring system is used to
highlight conserved residue features in the alignment. Options to
mark suspect regions and realign selected residue ranges give the
user more information and control over the alignment process and
allow difficult alignments to be gradually built and refined.
MATERIALS AND METHODS
We make use of the NCBI VIBRANT (Virtual Interface for
Biological Research and Technology) development library which
acts as an interface between the application code of CLUSTAL
X and the host windowing system. The NCBI libraries are linked
to the CLUSTAL X code providing mechanisms for displaying
windows, menus, buttons etc. In this way, the CLUSTAL X code
remains independent of the underlying operating system and
computer. The CLUSTAL X code is written in ANSI C, and
should be portable to any machine capable of supporting the
NCBI Vibrant toolkit.
CLUSTAL X is available for a number of platforms including
SUN Solaris, IRIX 5.3 on Silicon Graphics, Digital UNIX on
DECstations, Microsoft Windows (32 bit) for PCs, Linux ELF for
x86 PCs, and Macintosh PowerMac. The source code is provided
for anyone wishing to port to any other platform supported by the
Vibrant project.
The source code for CLUSTAL X and several executable
versions for different machines are freely available by anonymous
ftp to ftp-igbmc.u-strasbg.fr. Hypertext documentation can be
viewed at www-igbmc.u-strasbg.fr/BioInfo/clustalx/. The NCBI
Vibrant Toolkit is available by anonymous ftp from
ncbi.nlm.nih.gov.
Installation
The CLUSTAL X program is easily installed by copying the
executable file to a system directory which can be seen by all
users. Several parameter files (named *.par), and an on-line help
text file (clustalx.hlp for MS Windows, otherwise clustalx_help)
are also required by CLUSTAL X. These files should be copied
to one of the following directories: (i) the user’s current directory,
(ii) the user’s home directory, (iii) any of the directories specified
by the PATH environment variable.
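The lookup described above can be sketched in Python as follows. This is only an illustration, not the CLUSTAL X code: the function name `find_support_file` is hypothetical, and the sketch assumes the three locations are tried in the order listed, which the text does not state.

```python
import os
from pathlib import Path
from typing import Optional


def find_support_file(name: str,
                      cwd: Optional[Path] = None,
                      home: Optional[Path] = None,
                      path_var: Optional[str] = None) -> Optional[Path]:
    """Return the first existing copy of `name`, or None if it is not found.

    Assumed search order (hypothetical): current directory, home directory,
    then each directory named in the PATH environment variable.
    """
    cwd = Path.cwd() if cwd is None else cwd
    home = Path.home() if home is None else home
    path_var = os.environ.get("PATH", "") if path_var is None else path_var

    directories = [cwd, home]
    # Skip empty PATH entries so they are not confused with the current directory.
    directories += [Path(d) for d in path_var.split(os.pathsep) if d]

    for directory in directories:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```

A call such as `find_support_file("clustalx_help")` would then report which copy of the help file is picked up, which can be useful when several installations coexist.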
RESULTS
Algorithms checking alignment quality
Three methods of alignment quality analysis are implemented.
(i) A ‘quality’ score is calculated for each column in the alignment,
which depends on the amino acid variability in the column. A high
score indicates a highly conserved column; a low score indicates
a less well-conserved position. The scores are automatically plotted
in the window display (Fig. 1). (ii) The residues which get
exceptionally low scores in the above quality calculation can be
highlighted in the alignment display (Fig. 1). (iii) Low-scoring
segments in each sequence of the alignment can be highlighted.
These are found by summing negative scores in the profile built
from the sequence alignment (Figs 1 and 2).
Highlighted residues may be expected to occur at a moderate
frequency in all the sequences because of their steady divergence
due to the natural processes of evolution, although the most
divergent sequences are likely to have the most outliers. However,
the highlighted residues are especially useful in pointing to
sequence misalignments. These can arise due to various reasons:
(i) partial or total misalignments caused by a failure in the
alignment algorithm, (ii) partial or total misalignments because at
least one of the sequences in the given set is partly or completely
unrelated to the other sequences, (iii) frameshift translation errors
in a protein sequence causing local mismatched regions to be
heavily highlighted (see Discussion for more details).
Occasionally, highlighted residues may point to regions of
some biological significance. This might happen for example if
a protein alignment contains a sequence which has acquired new
functions relative to the main sequence set.
Quality scores. Suppose we have an alignment of M sequences of
length N. Then, the alignment can be written as:
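$$
\begin{matrix}
A_{1,1} & A_{1,2} & A_{1,3} & \cdots & A_{1,N} \\
A_{2,1} & A_{2,2} & A_{2,3} & \cdots & A_{2,N} \\
\vdots  &         &         &        & \vdots  \\
A_{M,1} & A_{M,2} & A_{M,3} & \cdots & A_{M,N}
\end{matrix}
$$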
Suppose we also define a residue comparison matrix C of size R*R,
where R is the number of residues. C(a,b) is the score for aligning
residue a with residue b (all scores in this matrix are positive).
The problem is to calculate a score for the conservation of the jth
position in the alignment. Vingron and Sibbald (12) used a
geometric analysis based on a continuous sequence space in order
to compare sequence weighting methods. The method defines an
N-dimensional space, where N is the length of the alignment. Each
sequence can be placed in the space, and the distance between two
sequences is defined as the euclidean distance between the
sequences in the space. We have applied an analogous approach to
each position in the alignment. An R-dimensional space is defined,
in which each column of the alignment can be considered. For a
specified position j in the alignment, each sequence consists of a
single residue which is assigned a point S in the space. For
sequence i, position j, the point S is defined as:
$$
S = \begin{pmatrix}
C(1, A_{ij}) \\
C(2, A_{ij}) \\
\vdots \\
C(R, A_{ij})
\end{pmatrix}
$$
We then calculate a consensus value X for the jth position in the
alignment. X is defined as:
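$$
X = \begin{pmatrix}
\dfrac{\sum_{i=1}^{R} F_{i,j} \, C(i,1)}{M} \\[2ex]
\dfrac{\sum_{i=1}^{R} F_{i,j} \, C(i,2)}{M} \\
\vdots \\
\dfrac{\sum_{i=1}^{R} F_{i,j} \, C(i,R)}{M}
\end{pmatrix}
$$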
Figure 1. The CLUSTAL X window in multiple alignment mode. An alignment of some EFTU proteins is displayed. Low-scoring segments are highlighted using
a white character on a black background. Exceptional residues are shown as a white character on a grey background. The quality analysis reveals two anomalously
low scoring regions, ruler positions 16–25 in EFTU_ODOSI and 61–71 in EFTU_MYCPN. These were found to be caused by frameshift errors. Two more sequences
(EFTU_RICPR and EFTU_SPIPL), not shown here, have 4-residue sequencing errors in this region which CLUSTAL X will also highlight.
where Fi,j is the count of residues i at position j in the alignment.
Now, if S is the position of sequence i in the R-dimensional
space, we can calculate the distance Di between each sequence
residue i and the consensus position X.
$$
D_i = \sqrt{\sum_{r=1}^{R} (X_r - S_r)^2}
$$
where Xr is the rth dimension of position X, and Sr is the rth
dimension of position S.
We define the quality score for the jth position in the alignment
as the mean of the sequence distances Di:
$$
\text{Quality Score} = \frac{\sum_{i=1}^{M} D_i}{M}
$$
Finally the scores are normalised by multiplying by the percentage
of sequences which have residues (and not gaps) at this
position. These scores are used in measure (i) above, as an
estimate of the conservation of each alignment column (Fig. 1).
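The column calculation above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the CLUSTAL X implementation: the three-letter alphabet and the `TOY_MATRIX` scores are made up, the function names are hypothetical, and gapped sequences are simply left out of the consensus and of the mean, a detail the text does not specify. The value returned is the raw mean distance exactly as defined above, so a fully conserved column gives 0; any rescaling applied before plotting is not described in the text and is not attempted here.

```python
import math
from typing import Dict, List, Sequence, Tuple

GAP = "-"
Matrix = Dict[Tuple[str, str], float]

# Made-up, all-positive toy scores for a three-letter alphabet (not a real matrix).
TOY_ALPHABET = ["A", "G", "W"]
_TOY_PAIRS = {("A", "A"): 10, ("G", "G"): 10, ("W", "W"): 10,
              ("A", "G"): 6, ("A", "W"): 1, ("G", "W"): 2}
TOY_MATRIX: Matrix = {}
for (a, b), score in _TOY_PAIRS.items():
    TOY_MATRIX[(a, b)] = float(score)
    TOY_MATRIX[(b, a)] = float(score)


def column_distances(column: Sequence[str], alphabet: Sequence[str],
                     C: Matrix) -> List[float]:
    """Distances D_i between each non-gap residue and the column consensus X."""
    residues = [a for a in column if a != GAP]
    if not residues:
        return []
    n = len(residues)
    # Point S for each sequence: S_r = C(r, A_ij) for every residue type r.
    points = [[C[(r, a)] for r in alphabet] for a in residues]
    # Consensus X: X_r = sum over the residues present of C(residue, r), divided by n.
    consensus = [sum(C[(a, r)] for a in residues) / n for r in alphabet]
    return [math.sqrt(sum((x - s) ** 2 for x, s in zip(consensus, p)))
            for p in points]


def column_quality(column: Sequence[str], alphabet: Sequence[str],
                   C: Matrix) -> float:
    """Mean distance to the consensus, scaled by the fraction of non-gap rows."""
    distances = column_distances(column, alphabet, C)
    if not distances:
        return 0.0
    mean_distance = sum(distances) / len(distances)
    return mean_distance * len(distances) / len(column)
```

With these toy scores, a column containing A and G gives a smaller value than one containing A and W, because A and G lie closer together in the score space.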
Exceptional residues. It would be useful for each column in the
alignment to identify those sequences in the above calculations
which are found a long way from the consensus point (i.e., which
have a large distance Di), thus lowering the quality score for the
column. For the jth position in the alignment, we take the set of
sequences which have a residue at this position (and not a gap).
The distances Di for this set of sequences are arranged in an array
in order of size from smallest to largest. We can find the upper and
lower quartiles (the distances lying one-quarter of the way from
the top and bottom of the array, respectively) and the inter quartile
range (the difference between the two quartiles).
A residue Aij is considered as an exception in measure (ii) above
if the sequence distance Di is greater than (upper quartile + inter
quartile range × scaling factor). The scaling factor can be adjusted
by the user to select the proportion of residue exceptions that will
be highlighted in the alignment display. Exceptional residues in
an EFTU alignment are shown in Figure 1.
This calculation runs automatically, in a very short time, each
time the screen is updated.
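The quartile test can be sketched in Python as below. This is an illustration only: the function name `exceptional_indices` is hypothetical, and the way the quartiles are picked from the sorted array (the elements one quarter of the way in from each end, using integer division) is an assumption, since the text does not fix the exact indices.

```python
from typing import List, Sequence


def exceptional_indices(distances: Sequence[float], scale: float) -> List[int]:
    """Indices of distances above upper quartile + inter-quartile range * scale.

    `distances` holds the D_i values of the sequences that have a residue
    (not a gap) in the column; `scale` is the user-adjustable scaling factor.
    """
    if not distances:
        return []
    ordered = sorted(distances)
    n = len(ordered)
    k = n // 4                      # assumed quartile offset from each end
    lower_quartile = ordered[k]
    upper_quartile = ordered[n - 1 - k]
    threshold = upper_quartile + (upper_quartile - lower_quartile) * scale
    return [i for i, d in enumerate(distances) if d > threshold]
```

Raising `scale` moves the threshold further from the upper quartile, so fewer residues are flagged; a value of 0 flags everything above the upper quartile.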
Low-scoring segments. Given the above alignment of M sequences
of length N and a residue exchange matrix, we can build a profile
which is weighted for sequence divergence. Methods for calculating
sequence weights are discussed by Henikoff and Henikoff (13).
Here we calculate sequence weights directly from a neighbour-
joining tree, using the ‘branch-proportional’ method which
corrects for unequal representation by downweighting similar
sequences and upweighting divergent ones (14). Each sequence is
assigned a weight Wi. In the residue comparison matrix C, the
scores for common residue substitutions are positive while rarer
substitutions are scored negatively. By default, the Gonnet PAM
250 matrix (15) is used, but the user may supply a different matrix
e.g. a lower PAM value is appropriate if the sequences are closely
related. The profile P has a column of scores for each position in
the alignment. The column is of height R and consists of a score
Figure 2. Detection and correction of misaligned segments with CLUSTAL X. (A) A set of EFTUs, tested for low scoring regions, highlights a part of the EFTU_ECOLI
sequence (which we deliberately misaligned by incorrect gap insertion). The range selected to be realigned is marked above the alignment. (B) After removal of gaps
and realignment of the selected residue range, the sequence EFTU_ECOLI is now correctly aligned, and the erroneous gaps have been removed. The low-scoring segment
check, the column conservation indicators above the alignment, and the quality graph below it, all reflect the improvement in the alignment.
for each residue in the matrix C. The profile score for residue r at
position j in the alignment is defined as:
$$
P(r,j) = \frac{\sum_{i=1}^{M} C(r, A_{ij}) \, W_i}{\sum_{i=1}^{M} W_i}
$$
For the jth position in the ith sequence the score Sij is defined as:
$$
S_{ij} = P(A_{ij}, j)
$$
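As a minimal sketch of the two definitions above, the following Python computes P(r,j) for one column and the scores Sij along one sequence. The function names and the toy matrix values are hypothetical, the weights are supplied by the caller rather than derived from a tree, and the treatment of gaps (a gap contributes a score of 0 while its weight still counts, and a gap in the scored sequence gets Sij = 0) is an assumption not taken from the text.

```python
from typing import Dict, List, Sequence, Tuple

GAP = "-"
Matrix = Dict[Tuple[str, str], float]

# Made-up two-letter matrix with a negative score for the rare substitution.
TOY_MATRIX: Matrix = {("A", "A"): 2.0, ("W", "W"): 5.0,
                      ("A", "W"): -3.0, ("W", "A"): -3.0}


def profile_score(column: Sequence[str], weights: Sequence[float],
                  r: str, C: Matrix) -> float:
    """P(r, j): weighted mean of C(r, A_ij) over the sequences in one column."""
    total_weight = sum(weights)
    # Assumption: gaps add nothing to the numerator but keep their weight below.
    weighted = sum(C[(r, a)] * w for a, w in zip(column, weights) if a != GAP)
    return weighted / total_weight


def sequence_scores(alignment: Sequence[str], weights: Sequence[float],
                    i: int, C: Matrix) -> List[float]:
    """S_ij = P(A_ij, j) for every position j of sequence i."""
    scores = []
    for j, residue in enumerate(alignment[i]):
        column = [row[j] for row in alignment]
        # Assumption: a gap in sequence i itself is given a score of 0.
        scores.append(0.0 if residue == GAP
                      else profile_score(column, weights, residue, C))
    return scores
```

For the toy alignment `["AA", "AW", "AA"]` with weights `[1.0, 1.0, 2.0]`, the lone W in the second sequence scores negatively against a column dominated by A, which is the signal the segment search relies on.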
The low-scoring regions in the ith sequence are found by
summing the scores Sij along the alignment in both the forward
and backward directions. If the sum is found to be positive, it is
reset to zero. The forward phase can be described by the following
recurrence relations:
$$
F_j = \begin{cases}
F_{j-1} + S_{ij} & \text{if } F_{j-1} + S_{ij} < 0 \\
0 & \text{if } F_{j-1} + S_{ij} \ge 0 \\
0 & \text{if } j = 0
\end{cases}
$$
Having found the regions in the sequence which have negative Fj
scores, these regions are then refined by removing those positions
at the end of each segment which have a positive profile score Sij.
The Fj scores for these positions are reset to zero.
Similarly, the backward phase can be described as:
$$
B_j = \begin{cases}
B_{j+1} + S_{ij} & \text{if } B_{j+1} + S_{ij} < 0 \\
0 & \text{if } B_{j+1} + S_{ij} \ge 0 \\
0 & \text{if } j = N+1
\end{cases}
$$
The regions in the sequence which have negative Bj scores are
again refined by removing those positions at the beginning of
each segment which have a positive profile score Sij. The Bj
scores for these positions are reset to zero.
The calculation is repeated for each sequence compared with
a profile for all aligned sequences, except itself. The low-scoring
segments, defined as those positions for which both Fj and Bj are
negative, are then highlighted in the display (Figs 1 and 2).
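The forward and backward passes, the trimming step and the final intersection can be sketched in Python as follows. This is an illustration under assumptions rather than the program's code: the function names are hypothetical, the input is simply the list of Sij values for one sequence (however they were obtained), and no minimum segment length is applied because the text does not mention one.

```python
from typing import List, Sequence


def _running_sums(scores: Sequence[float]) -> List[float]:
    """Cumulative sums that are reset to zero whenever they reach zero or above."""
    sums, running = [], 0.0
    for s in scores:
        running = min(0.0, running + s)
        sums.append(running)
    return sums


def _trim(sums: List[float], scores: Sequence[float], from_end: bool) -> None:
    """Zero the positive-scoring positions at one edge of each negative run."""
    n, j = len(sums), 0
    while j < n:
        if sums[j] >= 0:
            j += 1
            continue
        start = j
        while j < n and sums[j] < 0:
            j += 1
        end = j - 1
        k = end if from_end else start
        step = -1 if from_end else 1
        while start <= k <= end and scores[k] > 0:
            sums[k] = 0.0
            k += step


def low_scoring_positions(scores: Sequence[float]) -> List[bool]:
    """True where both the forward (F) and backward (B) sums are negative."""
    forward = _running_sums(scores)
    _trim(forward, scores, from_end=True)          # drop positive tail positions
    backward = _running_sums(scores[::-1])[::-1]   # same pass, right to left
    _trim(backward, scores, from_end=False)        # drop positive head positions
    return [f < 0 and b < 0 for f, b in zip(forward, backward)]
```

For the scores `[2, 1, -3, -2, 1, 1, 3, 2, -1, 2]`, the positions flagged are the two strongly negative ones (indices 2 and 3) and the isolated negative at index 8; the positive positions that merely sit inside a still-negative running sum are trimmed away.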
The low-scoring segment calculation is done when the user
selects the ‘Calculate low scoring segments’ option. It takes only
a few seconds to perform unless the alignment is very large,
making it a practical tool for interactive use.
Implementation
CLUSTAL X displays a window on the screen, including a set of
pull-down menus. On-line help is available. The exact format of
the screen will depend on the host computer and the operating
system. The user may select one of two modes: (i) multiple
alignment mode which displays a single display area for multiple
sequence alignment, or (ii) profile alignment mode which has two
display areas, allowing the user to use previously aligned
sequences for alignment. Figure 1 shows a CLUSTAL X window
in multiple alignment mode.
Alignments or individual sequences can be loaded into the
display areas displayed on the screen using the menu options.
Scroll bars allow easy movement to different parts of the
alignment. Extra lines are added to the sequence data displaying a
ruler, an indicator of alignment conservation, plus any secondary
structure data which was found in the input alignment file.
The order of the sequences in the display can be changed by
clicking on the sequence names, and selecting the cut and paste
options from the menus. In profile alignment mode, these options
also allow the user to move sequences from one profile to the other.
The sequences can be saved at any time to an alignment file in
one of a number of file formats. The sequence display can also be
saved in a colour postscript file for printing on a postscript printer.
Colouring the alignment display. The sequences are automatically
coloured to highlight conserved regions of the alignment. The
colours used, and the specification of the conservation of residues
can be configured by the user. The ‘rules’ governing the colouring
of residues are read from a colour parameter file, which can be
loaded at any time. Two types of colouring ‘rules’ are defined. (i) A
residue can be assigned a specific colour regardless of its position
in the alignment. In this case, all occurrences of the residue will be
coloured in the alignment display. (ii) A residue can be assigned
different colours depending on the consensus of the alignment at
each position.
In this way, for example, conserved hydrophobic or hydrophilic
positions in the alignment can be highlighted (Fig. 1).
Realigning divergent regions. In difficult cases, with a family of
highly divergent sequences, it is possible that misalignments are
introduced during the multiple alignment process. CLUSTAL X
provides two simple mechanisms for realigning the most divergent
regions. (i) Misaligned sequences may be selected by clicking on
the sequence names. A single menu option then removes these
sequences from the alignment set and realigns them to the remaining
sequences. (ii) The second option allows the user to specify a range
of the alignment to be realigned. In this case, the selected sub-range
of the alignment is removed and multiply aligned using the standard
progressive multiple sequence alignment method. The sub-range is
then fitted back into the full alignment.
Using these two options, the original multiple sequence
alignment may be iteratively improved and refined.
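The second option (cut out a column range, realign it, fit it back) can be outlined in Python as follows. This is a sketch only: `realign_range` and `pack_left` are hypothetical names, and `pack_left` is a trivial stand-in that strips gaps and pads on the right, not the progressive alignment method the program applies to the sub-range. Any function taking the rows of the sub-range and returning rows of equal length could be passed instead.

```python
from typing import Callable, List, Sequence

Aligner = Callable[[List[str]], List[str]]


def realign_range(alignment: Sequence[str], start: int, end: int,
                  aligner: Aligner) -> List[str]:
    """Realign columns start..end-1 and splice the result back in place."""
    if not 0 <= start < end <= len(alignment[0]):
        raise ValueError("selected range lies outside the alignment")
    segment = [row[start:end] for row in alignment]
    realigned = aligner(segment)
    # The realigned block must keep one row per sequence, all of equal length.
    if len(realigned) != len(alignment) or len({len(r) for r in realigned}) != 1:
        raise ValueError("aligner returned rows of unequal length or count")
    return [row[:start] + middle + row[end:]
            for row, middle in zip(alignment, realigned)]


def pack_left(rows: List[str]) -> List[str]:
    """Stand-in aligner: remove gaps, then right-pad rows to a common width."""
    stripped = [row.replace("-", "") for row in rows]
    width = max(len(s) for s in stripped)
    return [s.ljust(width, "-") for s in stripped]
```

Because the flanking columns are copied through unchanged, the result stays rectangular even when the realigned block is shorter or longer than the range it replaces.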
DISCUSSION
Ideally, methods for multiple sequence alignment should guarantee
to find the biologically correct alignment for a set of sequences. In
practice, this is difficult to achieve. Firstly, it is difficult to define
an optimal alignment between divergent nucleotide or protein
sequences, even given tertiary structural information. Secondly,
methods that find an optimal multiple sequence alignment have
been impractical to implement, mainly due to their computational
cost. As computer performance improves, methods which iterate
toward an optimal alignment are likely to become useful (16,17).
Meanwhile the heuristic approach of progressive alignment is
most often used, as the algorithm is reasonably fast and minimises
error in alignments of moderate difficulty. However, because the
full information in the sequence set is not used to align each
sequence, it can be possible to see one or more misaligned
sequence segments in the output alignment. In such cases, the
sequences would be expected to align correctly if the full
information was used, or if alignment parameters such as gap
penalties were adjusted.
When we developed CLUSTAL W, we gave the user the ability
to iterate the alignment process by realigning an alignment, or by
profile aligning sequences to an alignment. In this way, the user
could choose to iterate the alignment process, thereby overcoming
some of the defects of progressive alignment. With CLUSTAL X,
we have taken this capability further, by building in algorithms to
target the problem regions of an alignment and letting the user
realign solely the suspect residue ranges. Using these tools, high
quality alignments of divergent sequence sets are produced more
quickly and with greater confidence than has previously been
possible by progressive alignment.
Many programs have been developed which allow, to a greater
or lesser degree, manual intervention in the automatic alignment
process. For example, SOMAP (18) was designed to run under
the DEC VMS operating system. The program allows the user to
manually build up a multiple sequence alignment. It can accept
automatic alignments created by the original CLUSTAL program
(8) to provide a starting-point for the manual editing process.
SEAVIEW (7) is a UNIX X Window-based multiple sequence
alignment editor which is interfaced to the CLUSTAL W
program. SEQPUP (Don Gilbert, Biology Department, Indiana
University, Bloomington, IN 47405) is a sequence editor and
analysis program which can launch external applications such as
CLUSTAL W to perform sequence alignment. SEQLAB
[Wisconsin Package Version 9.0, Genetics Computer Group
(GCG), Madison, WI] is a graphical user interface based on the
OSF/Motif windowing system. It displays sequence alignments
on the screen and includes powerful sequence editing facilities.
The PILEUP program is interfaced to the SEQLAB editor to
perform automatic multiple sequence alignments. Numerous
Mac and PC alignment editors have also been developed. Most of
these editors will accept alignment output from CLUSTAL
programs. However, using CLUSTAL X, the amount of time
spent editing alignments by hand should be minimised, while a
hand-edited alignment can itself be returned for error checking.
CLUSTAL X is not confined to either VMS or UNIX
work-stations but also runs on Macintosh and PC computers. The
program provides a flexible approach to the problem of the
multiple alignment of large numbers of sequences. The methods
used can be applied equally well to both nucleotide and amino acid
sequences. An initial automatic alignment using the traditional
progressive, pairwise approach provides a good starting point for
further refinement. The alignments are displayed on the screen, and
the user can move around easily between different parts of the
alignment. A versatile residue colouring scheme based on the
conservation of each position in the alignment automatically
highlights conserved or special features.
Alignment analysis and error detection
Tools for alignment quality analysis have been developed and
incorporated into the package. A ‘quality’ estimate for each
position in the alignment is plotted on the screen (Fig. 1). Highly
conserved positions in the alignment will get a high ‘quality’ score,
whilst either low conservation, or exceptional residues at a partially
conserved position, will lower the score for the column. The
exceptional residues, which may be due to misalignment of the
sequences or simply divergence, can be highlighted in the
alignment (Fig. 1). Sometimes these may be of biological interest,
although most divergence is due to neutral evolutionary processes.
Several methods for calculating the conservation of an
alignment column have been developed. Zvelebil et al. (19) used
physico-chemical properties of amino acids to quantify the
conservation of a position in an alignment, in order to predict
protein secondary structure. Smith and Smith (20) define the
‘information density’ of a sub-region, assuming that all amino
acids are informationally equivalent. Sander and Schneider (21)
calculate a variation entropy for each column. Brouillet et al. (22)
calculate the mean and standard deviation of the pairwise distances
between amino acids in each column of an alignment, using a 20 × 20
distance matrix. None of these methods were found to be ideal for
incorporation into CLUSTAL X. Apart from the last of these, none of the
methods use a standard residue exchange matrix, as is needed for
consistency with the alignment process, as well as providing a
natural way to allow the user to customise the quality analysis by
varying the matrix. The advantage of the geometric interpretation
developed for CLUSTAL X is that statistical methods can then be
applied to define a mean value for the column and distances can be
measured between each sequence and the mean: upper limits for the
expected distance between any residue and the mean value can be
defined and thus exceptional residues can be identified.
Low-scoring segments in the sequences can also be highlighted in
the alignment (Figs 1 and 2). Low-scoring segments most often
result from one of three major causes: high divergence between the
sequences; errors in input sequences, most notably frameshifts; and
misalignments. If the cause can be ascribed to high divergence, the
alignment may not be wrong, but should be regarded as unreliable
in the low-scoring segment. In particularly unreliable segments,
CLUSTAL X may mark out every sequence! The alignment in such
a region is likely to be meaningless. Frameshift errors are more
frequent than usually realised. In the alignment of EFTUs taken
from Swiss-Prot release 34, four sequences have short frameshifts
within the region shown in Figure 1. Suspect sequences can be
investigated with frameshifting alignment programs such as
PairWise in WiseTools (24), or Framesearch in the GCG package.
It is important to detect and remove sequences containing errors,
as they confound many types of inferences based on multiple
alignments, and may themselves also cause the propagation of
further alignment errors.
We have found the low scoring segments test to be remarkably
powerful, picking up a number of frameshifts and leading to the
correction of many misalignments. Not every highlighted region
is false but, by checking them over, the major errors are almost
always uncovered. Nevertheless, there are situations where the
test may give a false sense of alignment accuracy. This could
happen when aligning sequences with strong amino acid residue
biases (‘reduced sequence complexity’). Tandem repeats are
another case, since superposition of the wrong repeats could still
give a high scoring alignment. Alignments of highly divergent
membrane proteins are tricky on both counts since there are many
transmembrane helices with hydrophobic amino acid biases.
More specialised, detailed alignment analysis programs are
available (24–27). The advantages of CLUSTAL X are that the
quality analyses are very fast as well as being integrated into the
alignment package and the results are displayed graphically on
the screen, with any low-scoring regions highlighted by shading
the alignment background. This interactive system provides an
efficient and flexible approach to alignment analysis and
correction.
Correcting misaligned regions
In Figure 2, a ‘model’ protein misalignment has been set up. For
clarity, the closely related EFTU sequences have been deliberately
misaligned. Genuine misalignments would normally be highly
divergent with only a few identities in particularly conserved
columns. In such cases, if the correct alignment can be ascertained,
this may be by matches between residue similarities rather than
identities.
In the example, a misaligned segment of EFTU_ECOLI is first
detected and marked by applying the low-scoring segments
algorithm (Fig. 2A). Next, a region of the alignment spanning the
error is selected using the cursor. The menu option ‘reset all gaps
before alignment’ is toggled on: in this example there are falsely
inserted gaps that must be deleted. This is not always the case, and
if the existing gaps seem correct, the option can stay switched off.
Now the ‘realign selected residue range’ option is invoked. The
misaligned region is now rapidly and correctly aligned again, and
the false gaps are deleted (Fig. 2B). This time the low-scoring
segments algorithm finds only short segments ascribable to
natural sequence divergence. Realignments in which the gaps are
left in may result in columns with nothing but padding characters,
in which case there is a menu option available to delete these.
The realignment process uses the alignment parameter default
settings, or as they are set up by the user. Misaligned regions are
often more divergent than other regions of the alignment, which
means that the alignment score may not be much higher than
misaligned alternatives. Therefore it may be necessary to lower
gap penalties to allow the sequences to align: this is tested by trial
and error. However, the user should be aware of two factors that
already affect the gap penalties in the local realignment. There is
no gap penalty at the ends of a selected region, so it is free to put
new gaps there: judicious selection of the range boundaries can
direct gaps to desired sites. Gap penalties are also lowered at
existing gaps if these are retained. These factors mean that the
selected range may give a better alignment without having to
lower the gap penalties.
Further uses for the low scoring segments
In CLUSTAL X, the new algorithm for marking low-scoring
segments has been implemented for visual interaction. However,
the algorithm has the potential for wider usage. There are
currently many projects to automatically produce databases of
multiple sequence alignments. The alignments tend not to be of
high quality as it has been difficult to distinguish good and bad
aligned regions rapidly and reliably. Removing sequences with
low-scoring segments below a cut-off score should dramatically
improve these alignments, as all sequences that contain major
errors, or are too divergent to align, can be trapped.
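A cut-off filter of this kind might be sketched as follows. The sketch assumes that a per-residue score is already available for each sequence (positive where the residue agrees with its column, negative where it does not). It uses a standard minimum-sum contiguous segment scan as a stand-in, not the low-scoring segments algorithm of CLUSTAL X itself. The names worst_segment and filter_alignment, the example scores and the cut-off value are all hypothetical.

```python
def worst_segment(scores):
    """Return (min_sum, start, end) of the lowest-scoring contiguous run.

    `end` is exclusive. Returns (0.0, 0, 0) if no run sums below zero.
    """
    best = (0.0, 0, 0)
    run_sum, run_start = 0.0, 0
    for i, s in enumerate(scores):
        if run_sum >= 0:
            # A non-negative prefix cannot lower the sum: start a new run.
            run_sum, run_start = 0.0, i
        run_sum += s
        if run_sum < best[0]:
            best = (run_sum, run_start, i + 1)
    return best


def filter_alignment(per_seq_scores, cutoff):
    """Keep the names of sequences whose worst segment is not below cutoff."""
    return [name for name, scores in per_seq_scores.items()
            if worst_segment(scores)[0] >= cutoff]


example = {
    "seqA": [2, 1, -1, 2, 3],
    "seqB": [2, -3, -4, -2, 1],   # contains a long low-scoring stretch
    "seqC": [1, -2, 1, 1, 2],
}
print(filter_alignment(example, cutoff=-3))
```

In practice the cut-off would have to be tuned by inspection, since a value that is too strict would also discard sequences that are merely divergent.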
The algorithm also has the potential to automatically establish the
domain boundaries in sets of partially related multi-domain proteins.
In this case the Smith–Waterman best local alignment algorithm,
finding the approximate regions encompassing the homologous
domains, would be harnessed to the forwards-backwards approach,
summing both the positive and negative scoring segments in order
to define sharp boundaries. A simpler application would be
end-trimming in an alignment, since the termini of proteins are
often poorly conserved.
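End-trimming can be sketched with a similar running-sum idea. The example below assumes a per-column quality score and a user-chosen threshold, and keeps the contiguous block of columns that maximises the summed excess over that threshold. It is an illustrative stand-in under those assumptions rather than the forwards-backwards procedure described in this paper. The name trim_ends and the numbers are hypothetical.

```python
def trim_ends(column_scores, threshold):
    """Return (start, end) of the column block to keep, end exclusive.

    The block maximises sum(score - threshold), so poorly conserved
    termini (scores below threshold) are cut off at a sharp boundary.
    Returns (0, 0) if no column exceeds the threshold.
    """
    best_sum, best = 0.0, (0, 0)
    run_sum, run_start = 0.0, 0
    for i, s in enumerate(column_scores):
        if run_sum <= 0:
            # A non-positive prefix cannot help: start a new block here.
            run_sum, run_start = 0.0, i
        run_sum += s - threshold
        if run_sum > best_sum:
            best_sum, best = run_sum, (run_start, i + 1)
    return best


quality = [10, 20, 15, 80, 90, 40, 85, 95, 30, 10]
start, end = trim_ends(quality, threshold=50)
rows = ["MKTAYIAKQR", "MRSAYLAKQH"]
print((start, end), [r[start:end] for r in rows])
```

Note that the single below-threshold column inside the conserved block is retained, because the surrounding columns more than compensate for it. Only the termini are removed.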
ACKNOWLEDGEMENTS
J.T. was supported by institute funds from INSERM, CNRS and
the Ministère de la Recherche et Technologie and the EMBL. We
thank the many users of CLUSTAL W who have reported
bugs/suggestions, and those who beta tested CLUSTAL X. We
would also like to thank Dino Moras, Kevin Leonard, Matti
Saraste and Frank Gannon for support during this work.