The CLUSTAL_X Windows Interface: Flexible Strategies for Multiple Sequence Alignment Aided by Quality Analysis Tools

January 1998
Nucleic Acids Research 25(24):4876-82

DOI:10.1093/nar/25.24.4876

Authors:

Julie Thompson

University of Strasbourg

Frédéric Plewniak

University of Strasbourg

© 1997 Oxford University Press

4876–4882 Nucleic Acids Research, 1997, Vol. 25, No. 24

The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools

Julie D. Thompson, Toby J. Gibson1, Frédéric Plewniak, François Jeanmougin* and

Desmond G. Higgins2

Institut de Genetique et de Biologie Moleculaire et Cellulaire (CNRS/INSERM/ULP), BP 163, 67404 Illkirch Cedex,

France, 1European Molecular Biology Laboratory, Postfach 10.2209, 69012 Heidelberg, Germany and 2Department of

Biochemistry, University College, Cork, Ireland

Received September 24, 1997; Revised and Accepted October 28, 1997

ABSTRACT

CLUSTAL X is a new windows interface for the

widely-used progressive multiple sequence alignment

program CLUSTAL W. The new system is easy to use,

providing an integrated system for performing multiple

sequence and profile alignments and analysing the

results. CLUSTAL X displays the sequence alignment in

a window on the screen. A versatile sequence colouring

scheme allows the user to highlight conserved features

in the alignment. Pull-down menus provide all the

options required for traditional multiple sequence and

profile alignment. New features include: the ability to

cut-and-paste sequences to change the order of the

alignment, selection of a subset of the sequences to be

realigned, and selection of a sub-range of the alignment

to be realigned and inserted back into the original

alignment. Alignment quality analysis can be performed

and low-scoring segments or exceptional residues can

be highlighted. Quality analysis and realignment of

selected residue ranges provide the user with a

powerful tool to improve and refine difficult alignments

and to trap errors in input sequences. CLUSTAL X has

been compiled on SUN Solaris, IRIX5.3 on Silicon

Graphics, Digital UNIX on DECstations, Microsoft

Windows (32 bit) for PCs, Linux ELF for x86 PCs, and

Macintosh PowerMac.

INTRODUCTION

The most widely used method in molecular biology to align sets of

nucleotide or amino acid sequences is to build up a multiple

alignment progressively (1–2). The most closely related groups of

sequences are aligned first and then these groups are gradually

aligned together, keeping the early alignments fixed. This approach

works well when the sequences are sufficiently closely related.

However, a globally optimal solution (or a biologically significant

one) cannot be guaranteed. In more difficult cases, where many

sequences have <30% residue identity, this automatic method

becomes less reliable. Any misaligned regions introduced in

previous stages of the progressive alignment are not corrected later

as new information from other sequences is added. In such cases,

the automatic alignments need to be refined, either manually or

automatically.

Numerous sequence editors have been developed which allow

the user to display and manually make or modify an alignment

(e.g. 3–7). These programs are useful for making small refinements

to an alignment, but the totally manual alignment of large numbers

of sequences is not feasible. Manual alignment is also highly

subjective and hence at least as likely as the automatic alignment

process to result in errors in the alignment.

The CLUSTAL X interface has been written to provide a single

environment in which the user can perform multiple alignments,

view the results and, if necessary, refine and improve the

alignment. Tools for alignment quality analysis have been

developed which allow the user to highlight low-scoring regions

in the alignment. Options are available for automatically

correcting these low-scoring regions by realigning a misaligned

sequence or a selected region of an alignment.

In earlier CLUSTAL programs (8–11), nested text menus

provided all the options to do multiple sequence/profile alignments

and simple phylogenetic tree generation. The output alignments

were written to file for display, printing or further manipulation.

With these simple menus, the CLUSTAL programs could be highly

portable, and run on essentially all computers. Portability has been

a major factor in the widespread usage of the CLUSTAL series for

sequence alignment. On the other hand, much more attractive and

powerful user interfaces can be built using non-portable windows

systems.

The NCBI Software Development Toolkit (Version 1.8, National

Center for Biotechnology Information, Bethesda, MD) provides one

solution to the windows portability problem. It interfaces between

the application code and various host windowing systems including

the X Window System, Macintosh windows and Microsoft

Windows. We have made use of the toolkit to provide a portable

windows interface, termed CLUSTAL X.

*To whom correspondence should be addressed. Tel: +33 388 65 32 71; Fax: +33 388 65 32 01; Email: jeanmougin@igbmc.u-strasbg.fr

CLUSTAL X is a new graphical interface to the CLUSTAL W

program which displays the sequence alignment in a window on

the screen, allowing the user to move easily between different parts

of the alignment. Pull-down menus provide all the options familiar

to users of the text-menu-driven CLUSTAL W, plus several new

features. A versatile, configurable, colouring system is used to

highlight conserved residue features in the alignment. Options to

mark suspect regions and realign selected residue ranges give the

user more information and control over the alignment process and

allow difficult alignments to be gradually built and refined.

MATERIALS AND METHODS

We make use of the NCBI VIBRANT (Virtual Interface for

Biological Research and Technology) development library which

acts as an interface between the application code of CLUSTAL

X and the host windowing system. The NCBI libraries are linked

to the CLUSTAL X code providing mechanisms for displaying

windows, menus, buttons etc. In this way, the CLUSTAL X code

remains independent of the underlying operating system and

computer. The CLUSTAL X code is written in ANSI C, and

should be portable to any machine capable of supporting the

NCBI Vibrant toolkit.

CLUSTAL X is available for a number of platforms including

SUN Solaris, IRIX 5.3 on Silicon Graphics, Digital UNIX on

DECstations, Microsoft Windows (32 bit) for PCs, Linux ELF for

x86 PCs, and Macintosh PowerMac. The source code is provided

for anyone wishing to port to any other platform supported by the

Vibrant project.

The source code for CLUSTAL X and several executable

versions for different machines are freely available by anonymous

ftp to ftp-igbmc.u-strasbg.fr. Hypertext documentation can be

viewed at www-igbmc.u-strasbg.fr/BioInfo/clustalx/. The NCBI

Vibrant Toolkit is available by anonymous ftp from

ncbi.nlm.nih.gov.

Installation

The CLUSTAL X program is easily installed by copying the

executable file to a system directory which can be seen by all

users. Several parameter files (named *.par), and an on-line help

text file (clustalx.hlp for MS Windows, otherwise clustalx_help)

are also required by CLUSTAL X. These files should be copied

to one of the following directories: (i) the user’s current directory,

(ii) the user’s home directory, (iii) any of the directories specified

by the PATH environment variable.
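
The lookup described above can be sketched in Python as follows. This is only an illustration: the function name `find_support_file` is hypothetical, and the search order (current directory, then home directory, then the PATH directories) is an assumption, since the text lists the directories without stating which takes precedence.

```python
import os
from pathlib import Path


def find_support_file(name, cwd=None, home=None, path_env=None):
    """Return the first existing location of `name`, or None if not found.

    The three optional arguments exist only so the sketch can be tested;
    by default the real current directory, home directory and PATH are used.
    """
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()
    if path_env is None:
        path_env = os.environ.get("PATH", "")

    # Assumed precedence: current directory, home directory, PATH entries.
    directories = [cwd, home]
    directories += [Path(p) for p in path_env.split(os.pathsep) if p]

    for directory in directories:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```

For example, `find_support_file("clustalx_help")` would return the path of the help file if it sits in any of the three kinds of location, and `None` otherwise.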

RESULTS

Algorithms checking alignment quality

Three methods of alignment quality analysis are implemented.

(i) A ‘quality’ score is calculated for each column in the alignment,

which depends on the amino acid variability in the column. A high

score indicates a highly conserved column; a low score indicates

a less well-conserved position. The scores are automatically plotted

in the window display (Fig. 1). (ii) The residues which get

exceptionally low scores in the above quality calculation can be

highlighted in the alignment display (Fig. 1). (iii) Low-scoring

segments in each sequence of the alignment can be highlighted.

These are found by summing negative scores in the profile built

from the sequence alignment (Figs 1 and 2).

Highlighted residues may be expected to occur at a moderate

frequency in all the sequences because of their steady divergence

due to the natural processes of evolution, although the most

divergent sequences are likely to have the most outliers. However,

the highlighted residues are especially useful in pointing to

sequence misalignments. These can arise for various reasons:

(i) partial or total misalignments caused by a failure in the

alignment algorithm, (ii) partial or total misalignments because at

least one of the sequences in the given set is partly or completely

unrelated to the other sequences, (iii) frameshift translation errors

in a protein sequence causing local mismatched regions to be

heavily highlighted (see Discussion for more details).

Occasionally, highlighted residues may point to regions of

some biological significance. This might happen for example if

a protein alignment contains a sequence which has acquired new

functions relative to the main sequence set.

Quality scores. Suppose we have an alignment of M sequences of

length N. Then, the alignment can be written as:

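$$
\begin{array}{ccccc}
A_{1,1} & A_{1,2} & A_{1,3} & \cdots & A_{1,N} \\
A_{2,1} & A_{2,2} & A_{2,3} & \cdots & A_{2,N} \\
\vdots & \vdots & \vdots & & \vdots \\
A_{M,1} & A_{M,2} & A_{M,3} & \cdots & A_{M,N}
\end{array}
$$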

Suppose we also define a residue comparison matrix C of size R*R,

where R is the number of residues. C(a,b) is the score for aligning

residue a with residue b (all scores in this matrix are positive).

The problem is to calculate a score for the conservation of the jth

position in the alignment. Vingron and Sibbald (12) used a

geometric analysis based on a continuous sequence space in order

to compare sequence weighting methods. The method defines an

N-dimensional space, where N is the length of the alignment. Each

sequence can be placed in the space, and the distance between two

sequences is defined as the Euclidean distance between the

sequences in the space. We have applied an analogous approach to

each position in the alignment. An R-dimensional space is defined,

in which each column of the alignment can be considered. For a

specified position j in the alignment, each sequence consists of a

single residue which is assigned a point S in the space. For

sequence i, position j, the point S is defined as:

$$
S = \begin{pmatrix} C(1, A_{ij}) \\ C(2, A_{ij}) \\ \vdots \\ C(R, A_{ij}) \end{pmatrix}
$$

We then calculate a consensus value X for the jth position in the

alignment. X is defined as:

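$$
X = \frac{1}{M} \begin{pmatrix} \sum_{i=1}^{R} F_{i,j} \times C(i,1) \\ \sum_{i=1}^{R} F_{i,j} \times C(i,2) \\ \vdots \\ \sum_{i=1}^{R} F_{i,j} \times C(i,R) \end{pmatrix}
$$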

Figure 1. The CLUSTAL X window in multiple alignment mode. An alignment of some EFTU proteins is displayed. Low-scoring segments are highlighted using

a white character on a black background. Exceptional residues are shown as a white character on a grey background. The quality analysis reveals two anomalously

low scoring regions, ruler positions 16–25 in EFTU_ODOSI and 61–71 in EFTU_MYCPN. These were found to be caused by frameshift errors. Two more sequences

(EFTU_RICPR and EFTU_SPIPL), not shown here, have 4-residue sequencing errors in this region which CLUSTAL X will also highlight.

where Fi,j is the count of residues i at position j in the alignment.

Now, if S is the position of sequence i in the R-dimensional

space, we can calculate the distance Di between each sequence

residue i and the consensus position X.

$$
D_i = \sqrt{\sum_{r=1}^{R} (X_r - S_r)^2}
$$

where Xr is the rth dimension of position X, and Sr is the rth

dimension of position S.

We define the quality score for the jth position in the alignment

as the mean of the sequence distances Di:

$$
\text{Quality Score} = \frac{1}{M} \sum_{i=1}^{M} D_i
$$

Finally, the scores are normalised by multiplying by the percentage of sequences which have residues (and not gaps) at this position. These scores are used in measure (i) above, as an estimate of the conservation of each alignment column (Fig. 1).
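
The per-column calculation can be sketched in Python as below. It is a minimal illustration under stated assumptions, not the CLUSTAL X code: the function name `column_quality`, the dictionary form of the matrix `C` and the gap character are hypothetical, and gaps are simply skipped when forming the consensus point. The sketch returns the raw mean distance (zero for a fully conserved column) together with the fraction of sequences carrying a residue, which is the normalisation factor mentioned above; any rescaling applied before the scores are plotted is not reproduced.

```python
import math

GAP = "-"  # assumed gap character


def column_quality(column, residues, C):
    """Return (mean_distance, residue_fraction) for one alignment column.

    column   -- string with one character per sequence (GAP for a gap)
    residues -- string listing the R residue symbols
    C        -- dict mapping (a, b) to a positive comparison score
    """
    present = [a for a in column if a != GAP]
    if not present:
        return 0.0, 0.0

    # Point S for each residue: its scores against every residue type.
    points = [[C[(r, a)] for r in residues] for a in present]

    # Consensus point X: the mean of the points (assumption: gaps skipped).
    n = len(present)
    X = [sum(p[k] for p in points) / n for k in range(len(residues))]

    # Euclidean distance D_i of each point from the consensus point.
    D = [math.sqrt(sum((x - s) ** 2 for x, s in zip(X, p))) for p in points]

    return sum(D) / n, n / len(column)


# Toy two-letter alphabet with a made-up positive matrix.
toy_residues = "AG"
toy_C = {(a, b): (2 if a == b else 1) for a in toy_residues for b in toy_residues}
conserved = column_quality("AAAA", toy_residues, toy_C)
mixed = column_quality("AAGG", toy_residues, toy_C)
```

In the toy example, the fully conserved column gives a mean distance of zero, while the mixed column gives a larger value because its residues lie away from the consensus point.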

Exceptional residues. For each column in the alignment, it is

useful to identify those sequences in the above calculations

which are found a long way from the consensus point (i.e., which

have a large distance Di), thus lowering the quality score for the

column. For the jth position in the alignment, we take the set of

sequences which have a residue at this position (and not a gap).

The distances Di for this set of sequences are arranged in an array

in order of size from smallest to largest. We can find the upper and

lower quartiles (the distances lying one-quarter of the way from

the top and bottom of the array, respectively) and the interquartile

range (the difference between the two quartiles).

A residue Aij is considered as an exception in measure (ii) above

if the sequence distance Di is greater than (upper quartile + interquartile

quartile range × scaling factor). The scaling factor can be adjusted

by the user to select the proportion of residue exceptions that will

be highlighted in the alignment display. Exceptional residues in

an EFTU alignment are shown in Figure 1.

This calculation runs automatically, in a very short time, each

time the screen is updated.
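
A short Python sketch of this outlier test is given below. The function name `exceptional_indices` is hypothetical, and the way the quartiles are read from the sorted array (simple index positions, with no interpolation) is an assumption; the text only says they lie one-quarter of the way from each end.

```python
def exceptional_indices(distances, scale=1.0):
    """Return indices whose distance exceeds upper quartile + IQR * scale.

    distances -- the D_i values of the sequences that have a residue
                 (not a gap) in the column
    scale     -- the user-adjustable scaling factor
    """
    if len(distances) < 4:
        return []  # assumption: too few values to define quartiles

    ordered = sorted(distances)
    n = len(ordered)
    lower = ordered[n // 4]          # one-quarter of the way from the bottom
    upper = ordered[n - 1 - n // 4]  # one-quarter of the way from the top
    cutoff = upper + (upper - lower) * scale

    return [i for i, d in enumerate(distances) if d > cutoff]


toy_distances = [0.1, 0.2, 0.2, 0.3, 0.2, 0.1, 0.3, 5.0]
flagged = exceptional_indices(toy_distances, scale=1.0)
```

Raising `scale` raises the cutoff, so fewer residues are flagged; this mirrors the user control described above.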

Low-scoring segments. Given the above alignment of M sequences

of length N and a residue exchange matrix, we can build a profile

which is weighted for sequence divergence. Methods for calculating sequence weights are discussed by Henikoff and Henikoff (13). Here we calculate sequence weights directly from a neighbour-joining tree, using the ‘branch-proportional’ method which

corrects for unequal representation by downweighting similar

sequences and upweighting divergent ones (14). Each sequence is

assigned a weight Wi. In the residue comparison matrix C, the

scores for common residue substitutions are positive while rarer

substitutions are scored negatively. By default, the Gonnet PAM

250 matrix (15) is used, but the user may supply a different matrix;

for example, a lower PAM value is appropriate if the sequences are closely

related. The profile P has a column of scores for each position in

the alignment. The column is of height R and consists of a score

Figure 2. Detection and correction of misaligned segments with CLUSTAL X. (A) A set of EFTUs, tested for low scoring regions, highlights a part of the EFTU_ECOLI

sequence (which we deliberately misaligned by incorrect gap insertion). The range selected to be realigned is marked above the alignment. (B) After removal of gaps

and realignment of the selected residue range, the sequence EFTU_ECOLI is now correctly aligned, and the erroneous gaps have been removed. The low-scoring segment

check, the column conservation indicators above the alignment, and the quality graph below it, all reflect the improvement in the alignment.

for each residue in the matrix C. The profile score for residue r at

position j in the alignment is defined as:

$$
P(r,j) = \sum_{i=1}^{M} C(r, A_{ij}) \times W_i
$$

For the jth position in the ith sequence the score Sij is defined as:

$$
S_{ij} = P(A_{ij}, j)
$$
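
The two definitions above can be sketched in Python as follows. The function names, the dictionary form of the matrix `C`, the treatment of gaps (they contribute nothing to the profile and score zero) and the toy data are all assumptions made for illustration. The `skip` argument reflects the statement in the text that each sequence is compared with a profile of all aligned sequences except itself.

```python
def profile_scores(alignment, weights, C, residues, skip=None, gap="-"):
    """Build the profile: P[j][r] = sum over sequences i of C(r, A_ij) * W_i.

    skip -- optional index of a sequence to leave out of the profile
    """
    length = len(alignment[0])
    profile = []
    for j in range(length):
        column = {r: 0.0 for r in residues}
        for i, seq in enumerate(alignment):
            a = seq[j]
            if i == skip or a == gap:
                continue  # assumption: gaps contribute nothing
            for r in residues:
                column[r] += C[(r, a)] * weights[i]
        profile.append(column)
    return profile


def sequence_scores(alignment, weights, C, residues, i, gap="-"):
    """Return S_ij = P(A_ij, j) for sequence i, using a profile without it."""
    profile = profile_scores(alignment, weights, C, residues, skip=i, gap=gap)
    return [0.0 if a == gap else profile[j][a]
            for j, a in enumerate(alignment[i])]


# Toy data: two-letter alphabet, matches score +2 and mismatches -1.
toy_residues = "AG"
toy_C = {(a, b): (2 if a == b else -1) for a in toy_residues for b in toy_residues}
toy_alignment = ["AAG", "AAG", "AGA"]
toy_scores = sequence_scores(toy_alignment, [1.0, 1.0, 1.0], toy_C, toy_residues, i=2)
```

In the toy alignment the third sequence agrees with the others only in the first column, so its score is positive there and negative in the remaining two columns.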

The low-scoring regions in the ith sequence are found by

summing the scores Sij along the alignment in both the forward

and backward directions. If the sum is found to be positive, it is

reset to zero. The forward phase can be described by the following

recurrence relations:

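$$
F_j = \begin{cases}
F_{j-1} + S_{ij} & \text{if } F_{j-1} + S_{ij} < 0 \\
0 & \text{if } F_{j-1} + S_{ij} \ge 0 \\
0 & \text{if } j = 0
\end{cases}
$$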

Having found the regions in the sequence which have negative Fj

scores, these regions are then refined by removing those positions

at the end of each segment which have a positive profile score Sij.

The Fj scores for these positions are reset to zero.

Similarly, the backward phase can be described as:

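$$
B_j = \begin{cases}
B_{j+1} + S_{ij} & \text{if } B_{j+1} + S_{ij} < 0 \\
0 & \text{if } B_{j+1} + S_{ij} \ge 0 \\
0 & \text{if } j = N + 1
\end{cases}
$$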

The regions in the sequence which have negative Bj scores are

again refined by removing those positions at the beginning of

each segment which have a positive profile score Sij. The Bj

scores for these positions are reset to zero.

The calculation is repeated for each sequence compared with

a profile for all aligned sequences, except itself. The low-scoring

segments, defined as those positions for which both Fj and Bj are

negative, are then highlighted in the display (Figs 1 and 2).
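
The forward and backward passes and the two refinement steps can be sketched in Python as below, taking the per-position scores Sij of one sequence as input. The function names are hypothetical, positions are 0-based, and the refinement is interpreted as trimming a contiguous run of positive-scoring positions from the relevant end of each segment; the text does not spell out that detail.

```python
def _negative_runs(values):
    """Yield (start, end) index pairs of maximal runs of negative values."""
    start = None
    for j, v in enumerate(values):
        if v < 0 and start is None:
            start = j
        elif v >= 0 and start is not None:
            yield start, j - 1
            start = None
    if start is not None:
        yield start, len(values) - 1


def low_scoring_positions(scores):
    """Return the positions where both F_j and B_j are negative."""
    n = len(scores)

    # Forward pass: running sum, reset to zero whenever it is not negative.
    F, total = [0.0] * n, 0.0
    for j in range(n):
        total = min(0.0, total + scores[j])
        F[j] = total

    # Backward pass: the same recurrence, run from the other end.
    B, total = [0.0] * n, 0.0
    for j in reversed(range(n)):
        total = min(0.0, total + scores[j])
        B[j] = total

    # Refinement: drop positive-scoring positions at the end of each
    # forward segment and at the beginning of each backward segment.
    for start, end in list(_negative_runs(F)):
        k = end
        while k >= start and scores[k] > 0:
            F[k] = 0.0
            k -= 1
    for start, end in list(_negative_runs(B)):
        k = start
        while k <= end and scores[k] > 0:
            B[k] = 0.0
            k += 1

    return [j for j in range(n) if F[j] < 0 and B[j] < 0]


toy_scores = [3, 2, -4, -5, 1, -3, 2, 2, 4, 3]
segment = low_scoring_positions(toy_scores)  # positions 2 to 5
```

In the toy example the isolated positive score at position 4 stays inside the segment because it is flanked by negative scores, whereas the positive flanks on either side are excluded by the refinement.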

The low-scoring segment calculation is done when the user

selects the ‘Calculate low scoring segments’ option. It takes only

a few seconds to perform unless the alignment is very large,

making it a practical tool for interactive use.

Implementation

CLUSTAL X displays a window on the screen, including a set of

pull-down menus. On-line help is available. The exact format of

the screen will depend on the host computer and the operating

system. The user may select one of two modes: (i) multiple

alignment mode which displays a single display area for multiple

sequence alignment, or (ii) profile alignment mode which has two

display areas, allowing the user to use previously aligned

sequences for alignment. Figure 1 shows a CLUSTAL X window

in multiple alignment mode.

Alignments or individual sequences can be loaded into the

display areas on the screen using the menu options.

Scroll bars allow easy movement to different parts of the

alignment. Extra lines are added to the sequence data displaying a

ruler, an indicator of alignment conservation, plus any secondary

structure data which was found in the input alignment file.

The order of the sequences in the display can be changed by

clicking on the sequence names, and selecting the cut and paste

options from the menus. In profile alignment mode, these options

also allow the user to move sequences from one profile to the other.

The sequences can be saved at any time to an alignment file in

one of a number of file formats. The sequence display can also be

saved in a colour PostScript file for printing on a PostScript printer.

Colouring the alignment display. The sequences are automatically

coloured to highlight conserved regions of the alignment. The

colours used, and the specification of the conservation of residues

can be configured by the user. The ‘rules’ governing the colouring

of residues are read from a colour parameter file, which can be

loaded at any time. Two types of colouring ‘rules’ are defined. (i) A

residue can be assigned a specific colour regardless of its position

in the alignment. In this case, all occurrences of the residue will be

coloured in the alignment display. (ii) A residue can be assigned

different colours depending on the consensus of the alignment at

each position.

In this way, for example, conserved hydrophobic or hydrophilic

positions in the alignment can be highlighted (Fig. 1).
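
The two types of rule can be illustrated with the Python sketch below. Everything in it is hypothetical: the rule tables, colour names, residue grouping and the 60% threshold are invented for the example and do not reproduce the format or contents of the CLUSTAL X colour parameter file.

```python
# Rule type (i): a fixed colour for a residue, wherever it occurs.
FIXED_COLOURS = {"G": "orange", "P": "yellow"}

# Rule type (ii): a colour that depends on the column consensus.
# Each rule: (residues coloured, residue group counted in the column,
#             minimum fraction of the column in that group, colour).
HYDROPHOBIC = set("AVLIMFWC")
CONSENSUS_RULES = [
    (HYDROPHOBIC, HYDROPHOBIC, 0.6, "blue"),
]


def colour_column(column, gap="-"):
    """Return one colour name (or None) per character of an alignment column."""
    present = [a for a in column if a != gap]
    colours = []
    for a in column:
        colour = FIXED_COLOURS.get(a)
        if colour is None and a != gap and present:
            for targets, group, threshold, name in CONSENSUS_RULES:
                share = sum(b in group for b in present) / len(present)
                if a in targets and share >= threshold:
                    colour = name
                    break
        colours.append(colour)
    return colours


conserved_hydrophobic = colour_column("AVLG")
not_conserved = colour_column("AKEG")
```

Here the alanine is coloured only in the column where hydrophobic residues dominate, while the glycine receives its fixed colour in both columns.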

Realigning divergent regions. In difficult cases, with a family of

highly divergent sequences, it is possible that misalignments are

introduced during the multiple alignment process. CLUSTAL X

provides two simple mechanisms for realigning the most divergent

regions. (i) Misaligned sequences may be selected by clicking on

the sequence names. A single menu option then removes these

sequences from the alignment set and realigns them to the remaining

sequences. (ii) The second option allows the user to specify a range

of the alignment to be realigned. In this case, the selected sub-range

of the alignment is removed and multiply aligned using the standard

progressive multiple sequence alignment method. The sub-range is

then fitted back into the full alignment.

Using these two options, the original multiple sequence

alignment may be iteratively improved and refined.
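
The second mechanism, cutting out a range of columns, realigning it and fitting it back, can be sketched in Python as follows. The function `realign_range` and its parameters are hypothetical, and the aligner is passed in as a placeholder callable; a real run would use the progressive alignment method, which is not reimplemented here.

```python
def realign_range(alignment, start, end, realign, reset_gaps=True, gap="-"):
    """Realign columns start..end-1 (0-based) and splice them back in.

    realign    -- callable taking and returning a list of strings; stands in
                  for the progressive multiple alignment step
    reset_gaps -- if True, existing gaps in the range are removed first
    """
    block = [seq[start:end] for seq in alignment]
    if reset_gaps:
        block = [segment.replace(gap, "") for segment in block]

    new_block = realign(block)

    # Pad to a common width so the full alignment stays rectangular.
    width = max(len(segment) for segment in new_block)
    new_block = [segment.ljust(width, gap) for segment in new_block]

    return [seq[:start] + segment + seq[end:]
            for seq, segment in zip(alignment, new_block)]


def trivial_aligner(segments):
    """Placeholder aligner: returns the segments unchanged."""
    return list(segments)


toy_alignment = ["ACG--TT", "AC-G-TT", "ACGG-TT"]
fixed = realign_range(toy_alignment, 2, 5, trivial_aligner)
```

Because the flanking columns are copied unchanged, only the selected range can differ after the call, and the alignment may become shorter or longer depending on the width of the realigned block.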

DISCUSSION

Ideally, methods for multiple sequence alignment should guarantee

to find the biologically correct alignment for a set of sequences. In

practice, this is difficult to achieve. Firstly, it is difficult to define

an optimal alignment between divergent nucleotide or protein

sequences, even given tertiary structural information. Secondly,

methods that find an optimal multiple sequence alignment have

been impractical to implement, mainly due to their computational

cost. As computer performance improves, methods which iterate

toward an optimal alignment are likely to become useful (16,17).

Meanwhile the heuristic approach of progressive alignment is

most often used, as the algorithm is reasonably fast and minimises

error in alignments of moderate difficulty. However, because the

full information in the sequence set is not used to align each

sequence, one or more misaligned sequence segments may appear in the output alignment. In such cases, the sequences would be expected to align correctly if the full information were used, or if alignment parameters such as gap

penalties were adjusted.

When we developed CLUSTAL W, we gave the user the ability

to iterate the alignment process by realigning an alignment, or by

profile aligning sequences to an alignment. In this way, the user

could choose to iterate the alignment process, thereby overcoming

some of the defects of progressive alignment. With CLUSTAL X,

we have taken this capability further, by building in algorithms to

target the problem regions of an alignment and letting the user

realign solely the suspect residue ranges. Using these tools, high

quality alignments of divergent sequence sets are produced more

quickly and with greater confidence than has previously been

possible by progressive alignment.

Many programs have been developed which allow, to a greater

or lesser degree, manual intervention in the automatic alignment

process. For example, SOMAP (18) was designed to run under

the DEC VMS operating system. The program allows the user to

manually build up a multiple sequence alignment. It can accept

automatic alignments created by the original CLUSTAL program

(8) to provide a starting-point for the manual editing process.

SEAVIEW (7) is a UNIX X Window-based multiple sequence

alignment editor which is interfaced to the CLUSTAL W

program. SEQPUP (Don Gilbert, Biology Department, Indiana

University, Bloomington, IN 47405) is a sequence editor and

analysis program which can launch external applications such as

CLUSTAL W to perform sequence alignment. SEQLAB

[Wisconsin Package Version 9.0, Genetics Computer Group

(GCG), Madison, WI] is a graphical user interface based on the

OSF/Motif windowing system. It displays sequence alignments

on the screen and includes powerful sequence editing facilities.

The PILEUP program is interfaced to the SEQLAB editor to

perform automatic multiple sequence alignments. Numerous

Mac and PC alignment editors have also been developed. Most of

these editors will accept alignment output from CLUSTAL

programs. However, using CLUSTAL X, the amount of time

spent editing alignments by hand should be minimised, while a

hand-edited alignment can itself be returned for error checking.

CLUSTAL X is not confined to either VMS or UNIX

workstations but also runs on Macintosh and PC computers. The

program provides a flexible approach to the problem of the

multiple alignment of large numbers of sequences. The methods

used can be applied equally well to both nucleotide and amino acid

sequences. An initial automatic alignment using the traditional

progressive, pairwise approach provides a good starting point for

further refinement. The alignments are displayed on the screen, and

the user can move around easily between different parts of the

alignment. A versatile residue colouring scheme based on the

conservation of each position in the alignment automatically

highlights conserved or special features.

Alignment analysis and error detection

Tools for alignment quality analysis have been developed and

incorporated into the package. A ‘quality’ estimate for each

position in the alignment is plotted on the screen (Fig. 1). Highly

conserved positions in the alignment will get a high ‘quality’ score,

whilst either low conservation, or exceptional residues at a partially

conserved position, will lower the score for the column. The

exceptional residues, which may be due to misalignment of the

sequences or simply divergence, can be highlighted in the

alignment (Fig. 1). Sometimes these may be of biological interest,

although most divergence is due to neutral evolutionary processes.

Several methods for calculating the conservation of an

alignment column have been developed. Zvelebil et al. (19) used

physico-chemical properties of amino acids to quantify the

conservation of a position in an alignment, in order to predict

protein secondary structure. Smith and Smith (20) define the

‘information density’ of a sub-region, assuming that all amino

acids are informationally equivalent. Sander and Schneider (21)

calculate a variation entropy for each column. Brouillet et al. (22)

calculate the mean and standard deviation of the pairwise distances

between amino acids in each column of an alignment, using a 20 × 20

distance matrix. None of these methods was found to be ideal for

incorporation into CLUSTAL X. Apart from the last, none of the

methods use a standard residue exchange matrix, as is needed for

consistency with the alignment process, as well as providing a

natural way to allow the user to customise the quality analysis by

varying the matrix. The advantage of the geometric interpretation

developed for CLUSTAL X is that statistical methods can then be

applied to define a mean value for the column and distances can be

measured between each sequence and the mean: upper limits for the

expected distance between any residue and the mean value can be

defined and thus exceptional residues can be identified.

Low-scoring segments in the sequences can also be highlighted in

the alignment (Figs 1 and 2). Low-scoring segments most often

result from one of three major causes: high divergence between the

sequences; errors in input sequences, most notably frameshifts; and

misalignments. If the cause can be ascribed to high divergence, the

alignment may not be wrong, but should be regarded as unreliable

in the low-scoring segment. In particularly unreliable segments,

CLUSTAL X may mark out every sequence! The alignment in such

a region is likely to be meaningless. Frameshift errors are more

frequent than usually realised. In the alignment of EFTUs taken

from Swiss-Prot release 34, four sequences have short frameshifts

within the region shown in Figure 1. Suspect sequences can be

investigated with frameshifting alignment programs such as

PairWise in WiseTools (24), or Framesearch in the GCG package.

It is important to detect and remove sequences containing errors,

as they confound many types of inferences based on multiple

alignments, and may themselves also cause the propagation of

further alignment errors.

We have found the low-scoring segments test to be remarkably

powerful, picking up a number of frameshifts and leading to the

correction of many misalignments. Not every highlighted region

is false but, by checking them over, the major errors are almost

always uncovered. Nevertheless, there are situations where the

test may give a false sense of alignment accuracy. This could

happen when aligning sequences with strong amino acid residue

biases (‘reduced sequence complexity’). Tandem repeats are

another case, since superposition of the wrong repeats could still

give a high scoring alignment. Alignments of highly divergent

membrane proteins are tricky on both counts since there are many

transmembrane helices with hydrophobic amino acid biases.

More specialised, detailed alignment analysis programs are

available (24–27). The advantages of CLUSTAL X are that the

quality analyses are very fast as well as being integrated into the

alignment package and the results are displayed graphically on

the screen, with any low-scoring regions highlighted by shading

the alignment background. This interactive system provides an

efficient and flexible approach to alignment analysis and

correction.

Correcting misaligned regions

In Figure 2, a ‘model’ protein misalignment has been set up. For

clarity, the closely related EFTU sequences have been deliberately

misaligned. Genuine misalignments would normally be highly

divergent with only a few identities in particularly conserved

columns. In such cases, if the correct alignment can be ascertained,

this may be by matches between residue similarities rather than

identities.

In the example, a misaligned segment of EFTU_ECOLI is first

detected and marked by applying the low-scoring segments

algorithm (Fig. 2A). Next, a region of the alignment spanning the

error is selected using the cursor. The menu option ‘reset all gaps

before alignment’ is toggled on: in this example there are falsely

inserted gaps that must be deleted. This is not always the case, and

if the existing gaps seem correct, the option can stay switched off.

Now the ‘realign selected residue range’ option is invoked. The

misaligned region is now rapidly and correctly aligned again, and

the false gaps are deleted (Fig. 2B). This time the low-scoring

segments algorithm finds only short segments ascribable to

natural sequence divergence. Realignments in which the gaps are

left in may result in columns with nothing but padding characters,

in which case there is a menu option available to delete these.
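
The effect of deleting such padding-only columns can be illustrated with a minimal Python sketch. This is not the CLUSTAL X source code: the function name drop_gap_only_columns, the set of gap characters and the toy alignment are assumptions made for illustration.

```python
def drop_gap_only_columns(rows, gap_chars="-."):
    """Remove alignment columns that contain only gap (padding) characters.

    `rows` is a list of equal-length aligned sequence strings. The gap
    characters are an assumption; adjust them to the alignment format in use.
    """
    if not rows:
        return []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all aligned rows must have the same length")
    # Keep a column if at least one sequence has a real residue in it.
    keep = [c for c in range(width)
            if any(r[c] not in gap_chars for r in rows)]
    return ["".join(r[c] for c in keep) for r in rows]


# Toy alignment: columns 3 and 4 hold nothing but gaps.
aligned = ["MK--TA", "M---SA", "MR--TA"]
print(drop_gap_only_columns(aligned))  # ['MKTA', 'M-SA', 'MRTA']
```

Columns that contain a residue in at least one sequence are kept, so gaps that are still needed to hold the alignment together are untouched.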

The realignment process uses the default alignment parameter

settings, unless the user has changed them. Misaligned regions are

often more divergent than other regions of the alignment, which

means that the correct alignment may not score much higher than

misaligned alternatives. Therefore it may be necessary to lower

gap penalties to allow the sequences to align: this is tested by trial

and error. However, the user should be aware of two factors that

already affect the gap penalties in the local realignment. There is

no gap penalty at the ends of a selected region, so the program is

free to place new gaps there: judicious selection of the range boundaries can

direct gaps to desired sites. Gap penalties are also lowered at

existing gaps if these are retained. These factors mean that the

selected range may give a better alignment without having to

lower the gap penalties.
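
The influence of unpenalised end gaps can be seen in a small pairwise example. The sketch below is a plain dynamic programming scorer with a linear gap penalty, not the CLUSTAL W profile alignment routine; the function name align_score and the match, mismatch and gap values are illustrative assumptions.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, free_ends=True):
    """Best alignment score of sequences a and b with a linear gap penalty.

    With free_ends=True, gaps before the first and after the last aligned
    residue cost nothing; with free_ends=False every gap is penalised.
    The scoring values are illustrative, not CLUSTAL defaults.
    """
    n, m = len(a), len(b)
    # dp[i][j] = best score for aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    if not free_ends:
        # Leading gaps are charged in the fully global case.
        for i in range(1, n + 1):
            dp[i][0] = i * gap
        for j in range(1, m + 1):
            dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # align two residues
                           dp[i - 1][j] + gap,      # gap in b
                           dp[i][j - 1] + gap)      # gap in a
    if free_ends:
        # Trailing gaps are free: take the best cell on the last row or column.
        return max(max(dp[n]), max(row[m] for row in dp))
    return dp[n][m]


print(align_score("ACGT", "CG", free_ends=True))   # 2
print(align_score("ACGT", "CG", free_ends=False))  # -2
```

In this toy case the short sequence aligns cleanly inside the longer one when end gaps are free, whereas the same placement is penalised twice when they are not.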

Further uses for the low-scoring segments

In CLUSTAL X, the new algorithm for marking low-scoring

segments has been implemented for visual interaction. However,

the algorithm has the potential for wider usage. There are

currently many projects to automatically produce databases of

multiple sequence alignments. The alignments tend not to be of

high quality as it has been difficult to distinguish well-aligned

from badly aligned regions rapidly and reliably. Removing sequences with

low-scoring segments below a cut-off score should dramatically

improve these alignments, as all sequences that contain major

errors, or are too divergent to align, can be trapped.
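
A filter of this kind might be sketched as follows, assuming that a per-residue quality score is already available for every sequence in the alignment. The function names, the toy scores and the cut-off value are hypothetical, and the minimum-sum segment scan used here stands in for the CLUSTAL X low-scoring segment calculation rather than reproducing it.

```python
def lowest_segment_score(scores):
    """Return the minimum sum over all contiguous, non-empty segments."""
    if not scores:
        raise ValueError("scores must not be empty")
    best = current = scores[0]
    for s in scores[1:]:
        # Either extend the current segment or start a new one at s.
        current = min(s, current + s)
        best = min(best, current)
    return best


def filter_sequences(residue_scores, cutoff):
    """Keep the names of sequences whose worst segment is not below cutoff.

    `residue_scores` maps a sequence name to its list of per-residue scores
    (assumed to come from some alignment quality measure).
    """
    return [name for name, scores in residue_scores.items()
            if lowest_segment_score(scores) >= cutoff]


toy_scores = {
    "seqA": [2, 1, -1, 3, 2],       # one weak residue only
    "seqB": [2, -3, -4, -2, 1, 3],  # a run of poorly scoring residues
    "seqC": [1, 1, 1],
}
print(filter_sequences(toy_scores, cutoff=-5))  # ['seqA', 'seqC']
```

The cut-off would have to be calibrated on real data; the value used here only serves to separate the toy sequences.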

The algorithm also has the potential to automatically establish the

domain boundaries in sets of partially-related multi-domain proteins.

In this case the Smith–Waterman best local alignment algorithm,

finding the approximate regions encompassing the homologous

domains, would be harnessed to the forwards-backwards approach,

summing both the positive and negative scoring segments in order

to define sharp boundaries. A simpler application would be

end-trimming in an alignment, since the termini of proteins are

often poorly conserved.
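
End-trimming can be illustrated with a deliberately simple sketch. It uses a column conservation threshold (the fraction of sequences sharing the most common residue) instead of the low-scoring segments algorithm, so it is a substitute technique chosen for brevity; the function names and the threshold are assumptions.

```python
from collections import Counter


def column_conservation(column, gap_chars="-."):
    """Fraction of all sequences that share the most common residue."""
    residues = [c for c in column if c not in gap_chars]
    if not residues:
        return 0.0
    return Counter(residues).most_common(1)[0][1] / len(column)


def trim_ends(rows, threshold=0.5):
    """Strip poorly conserved columns from both ends of an alignment."""
    width = len(rows[0]) if rows else 0
    scores = [column_conservation([r[c] for r in rows]) for c in range(width)]
    start = 0
    while start < width and scores[start] < threshold:
        start += 1  # skip weak columns at the N-terminal end
    end = width
    while end > start and scores[end - 1] < threshold:
        end -= 1    # skip weak columns at the C-terminal end
    return [r[start:end] for r in rows]


aligned = ["-AKLMG", "Q-KLMT", "T-KLM-"]
print(trim_ends(aligned, threshold=0.6))  # ['KLM', 'KLM', 'KLM']
```

Only the termini are examined, so poorly conserved columns in the interior of the alignment are left in place.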

ACKNOWLEDGEMENTS

J.T. was supported by institute funds from INSERM, CNRS and

the Ministère de la Recherche et Technologie and the EMBL. We

thank the many users of CLUSTAL W who have reported

bugs/suggestions, and those who beta tested CLUSTAL X. We

would also like to thank Dino Moras, Kevin Leonard, Matti

Saraste and Frank Gannon for support during this work.

REFERENCES

1 Feng,D.F. and Doolittle,R.F. (1987) J. Mol. Evol., 25, 351–360.

2 Taylor,W.R. (1988) J. Mol. Evol., 28, 161–169.

3 Stockwell,P.A. and Peterson,G.B. (1987) Comput. Applic. Biosci., 3, 37–43.

4 Thirup,S. and Larsen,N.E. (1990) Proteins, 7, 291–295.

5 Clark,S.P. (1992) Comput. Applic. Biosci., 8, 535–538.

6 De Rijk,P. and De Wachter,R. (1993) Comput. Applic. Biosci., 9, 735–740.

7 Galtier,N., Gouy,M. and Gautier,C. (1996) Comput. Applic. Biosci., 12, 543–548.

8 Higgins,D.G. and Sharp,P.M. (1988) Gene, 73, 237–244.

9 Higgins,D.G., Bleasby,A.J. and Fuchs,R. (1992) Comput. Applic. Biosci., 8, 189–191.

10 Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 4673–4680.

11 Higgins,D.G., Thompson,J.D. and Gibson,T.J. (1996) Methods Enzymol., 266, 383–402.

12 Vingron,M. and Sibbald,P.R. (1993) Proc. Natl. Acad. Sci. USA, 90, 8777–8781.

13 Henikoff,S. and Henikoff,J.G. (1994) J. Mol. Biol., 243, 574–578.

14 Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Comput. Applic. Biosci., 10, 19–29.

15 Benner,S.A., Cohen,M.A. and Gonnet,G.H. (1994) Protein Engng, 7, 1323–1332.

16 Notredame,C. and Higgins,D.G. (1996) Nucleic Acids Res., 24, 1515–1524.

17 Gotoh,O. (1996) J. Mol. Biol., 264, 823–838.

18 Parry-Smith,D.J. and Attwood,T.K. (1991) Comput. Applic. Biosci., 7, 233–235.

19 Zvelebil,M.J.J.M., Barton,G.J., Taylor,W.R. and Sternberg,M.J.E. (1987) J. Mol. Biol., 195, 957–961.

20 Smith,R.F. and Smith,T.F. (1990) Proc. Natl. Acad. Sci. USA, 87, 118–122.

21 Sander,C. and Schneider,R. (1991) Proteins Struct. Funct. Genet., 9, 56–68.

22 Brouillet,S., Risler,J.L., Henaut,A. and Slonimski,P.P. (1992) Biochimie, 74, 571–580.

23 Birney,E., Thompson,J.D. and Gibson,T.J. (1996) Nucleic Acids Res., 24, 2730–2739.

24 Schuler,G.D., Altschul,S.F. and Lipman,D.J. (1991) Proteins Struct. Funct. Genet., 9, 180–190.

25 Vingron,M. and Argos,P. (1991) J. Mol. Biol., 218, 33–43.

26 Friemann,A. and Schmitz,S. (1992) Comput. Applic. Biosci., 8, 261–265.

27 Livingstone,C.D. and Barton,G.J. (1993) Comput. Applic. Biosci., 9, 745–756.
