Seq2Logo: A method for construction and visualization of amino acid binding motifs and sequence profiles including sequence weighting, pseudo counts and two-sided representation of amino acid enrichment and depletion

May 2012
Nucleic Acids Research 40(Web Server issue):W281-7

DOI:10.1093/nar/gks469

Source
PubMed

License
CC BY-NC 3.0

Authors:

Martin C F Thomsen

Technical University of Denmark

Morten Nielsen

Technical University of Denmark

Martin Christen Frølund Thomsen and Morten Nielsen*

Center for Biological Sequence Analysis, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark

Received January 6, 2012; Revised April 30, 2012; Accepted May 2, 2012

ABSTRACT

Seq2Logo is a web-based sequence logo generator. Sequence logos are a graphical representation of the information content stored in a multiple sequence alignment (MSA) and provide a compact and highly intuitive representation of the position-specific amino acid composition of binding motifs, active sites, etc. in biological sequences. Accurate generation of sequence logos is often compromised by sequence redundancy and a low number of observations. Moreover, most methods available for sequence logo generation focus on displaying the position-specific enrichment of amino acids, discarding the equally valuable information related to amino acid depletion. Seq2Logo aims to resolve these issues by allowing the user to include sequence weighting to correct for data redundancy, pseudo counts to correct for a low number of observations, and different logotype representations, each capturing different aspects of amino acid enrichment and depletion. Besides allowing input in the format of peptides and MSAs, Seq2Logo accepts input as Blast sequence profiles, giving non-expert end-users easy access to characterize and identify functionally conserved/variable amino acids in any given protein of interest. The output from the server is a sequence logo and a PSSM. Seq2Logo is available at http://www.cbs.dtu.dk/biotools/Seq2Logo (14 May 2012, date last accessed).

INTRODUCTION

The idea of generating a logo from aligned sets of sequences was introduced in 1990 by Schneider and Stephens (1). The intention of a sequence logo is to concentrate into a single plot the general consensus, the order of predominance of residues at every position, the relative frequencies of every residue at every position, the amount of information present at every position and significant locations. This logo is then able to present all of the relevant information to the viewer in a fast and concise manner.

Several web servers exist to generate sequence logos from MSAs (2–5). All these servers suffer from different limitations in handling sequence redundancy and a low number of observations. Moreover, to the best of our knowledge, all public sequence logo servers, with the exception of the iceLogo (4) and Two Sample Logo (5) methods, focus on displaying the position-specific enrichment of amino acids, discarding the equally valuable information related to amino acid depletion. Seq2Logo aims to resolve these issues by allowing the user to include sequence weighting to correct for data redundancy, pseudo counts to correct for a low number of observations (6–8) and five different logotype representations, each capturing different aspects of amino acid enrichment and depletion. In addition to the usual Shannon logo (9), Seq2Logo includes the option to create Kullback–Leibler (KL) (10) logos where the depleted (under-represented) amino acids are represented on the negative y-axis. Besides the conventional KL logo, Seq2Logo can also display a weighted KL logo, where the relative height of each amino acid is proportional to the log-odds ratio, and a probability weighted KL logo, where the relative height of each amino acid is proportional to the product of the probability and the log-odds ratio. Finally, inspired by the work of Fujii et al. (11), Seq2Logo also includes an option to visualize PSSM (position-specific scoring matrix) logos, where the height of a bar is given by the sum of the absolute values of the PSSM weight matrix values and the height of a given amino acid is proportional to the absolute value of its weight matrix score. In particular, the weighted KL logo provides a visual and highly intuitive representation of both amino acid enrichment and depletion in, for instance, receptor binding motifs. Besides allowing input in the format of peptides and MSAs, the Seq2Logo server accepts inputs such as Blast sequence profiles, giving non-expert end-users easy access to characterize and identify functionally conserved/variable amino acids in any given protein of interest.

*To whom correspondence should be addressed. Tel: +45 4525 2425; Fax: +45 4593 1585; Email: mniel@cbs.dtu.dk

Published online 25 May 2012. Nucleic Acids Research, 2012, Vol. 40, Web Server issue W281–W287

doi:10.1093/nar/gks469

© The Author(s) 2012. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

MATERIALS AND METHODS

Seq2Logo implements two strategies to improve the accuracy of the estimated sequence logo. The first strategy is sequence weighting, which corrects for data redundancy. The second strategy is pseudo counts, which correct for a low number of observations. Sequence weighting is implemented as described in (6,8) and pseudo counts as described in (7). For details, see Supplementary Data.
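The two corrections can be illustrated with a minimal Python sketch. It is not the Seq2Logo implementation (whose details are in the Supplementary Data); it assumes ungapped peptides of equal length, uses the position-based weighting scheme of reference (6), and blends observed and pseudo frequencies with the general form (alpha*f + beta*g)/(alpha + beta) from reference (7). The function names, the toy peptides and the pseudo frequency vector g are hypothetical and chosen only for illustration.

```python
from collections import Counter


def henikoff_weights(alignment):
    """Position-based sequence weights (sketch, assumes no gaps).

    Each column adds 1/(r*s) to a sequence's weight, where r is the number
    of distinct residues in the column and s is the number of sequences
    sharing this sequence's residue in that column.
    """
    weights = [0.0] * len(alignment)
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        counts = Counter(column)
        r = len(counts)
        for i, residue in enumerate(column):
            weights[i] += 1.0 / (r * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]  # normalised to sum to 1


def weighted_frequencies(alignment, weights, col):
    """Weighted residue frequencies observed in one alignment column."""
    freqs = {}
    for seq, w in zip(alignment, weights):
        freqs[seq[col]] = freqs.get(seq[col], 0.0) + w
    return freqs


def blend_with_pseudo_counts(f, g, alpha, beta):
    """Mix observed frequencies f with pseudo frequencies g.

    alpha weights the observations and beta weights the prior; how alpha
    and g are derived is left to the caller in this sketch.
    """
    residues = set(f) | set(g)
    return {a: (alpha * f.get(a, 0.0) + beta * g.get(a, 0.0)) / (alpha + beta)
            for a in residues}


# Toy data (hypothetical): three identical peptides and one distinct peptide.
peptides = ["ALAKA", "ALAKA", "ALAKA", "GVTRS"]
weights = henikoff_weights(peptides)
col0 = weighted_frequencies(peptides, weights, 0)
blended = blend_with_pseudo_counts({"A": 1.0}, {"A": 0.5, "G": 0.5}, alpha=1, beta=1)
print(weights, col0, blended)
```

In this toy set the three redundant peptides share half of the total weight and the single distinct peptide receives the other half, so the redundant copies no longer dominate the estimated frequencies.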

In a sequence logo, the height of the bar is equal to the information content at each amino acid position. The information content is calculated using the relation $I = \sum_a p_a \log_2(p_a/q_a)$, where $p_a$ and $q_a$ are the observed probability (calculated from the data) and background probability, respectively, of the amino acid $a$. If an equiprobable background amino acid distribution is applied, a conventional Shannon sequence logo is displayed. If a background amino acid distribution reflecting the prevalence of the different amino acids is applied, a Kullback–Leibler sequence logo is displayed. The choice of the Kullback–Leibler logotype in Seq2Logo not only provides correction for the uneven distribution of amino acids, but also expresses the depleted amino acids (where $p_a < q_a$) on the negative side of the y-axis. This enables the user to quickly identify enriched and depleted (under-represented) amino acids. To enhance the identification and information of the depleted amino acids, Seq2Logo includes another logotype called weighted Kullback–Leibler. This logotype presents each individual amino acid proportional to its relative log-odds score [$\log_2(p_a/q_a)$]. Another logotype is included called probability weighted Kullback–Leibler, where the relative height of each individual amino acid is proportional to $p_a \log_2(p_a/q_a)$. Finally, Seq2Logo includes an option to display PSSM-logos (11), where the height of a bar is equal to the sum of the absolute values of the PSSM weight matrix values and the height of each amino acid is proportional to the absolute value of the weight matrix score (with negative values displayed on the negative y-axis).
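The relations above can be made concrete with a short Python sketch for a single logo position. It is an illustration under stated assumptions, not the server's code: all probabilities are assumed strictly positive (as they would be after pseudo count correction), the total stack height (positive plus negative part) is assumed to equal the information content I, and the function names and the four-letter toy alphabet are hypothetical.

```python
import math


def information_content(p, q):
    """I = sum_a p_a * log2(p_a / q_a) for one position."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in p)


def letter_heights(p, q, logotype):
    """Signed letter heights for one position (sketch).

    'kl'            : relative height ~ p_a, sign taken from the log-odds
    'weighted_kl'   : relative height ~ log2(p_a / q_a)
    'p_weighted_kl' : relative height ~ p_a * log2(p_a / q_a)
    A Shannon logo corresponds to 'kl' with an equiprobable background q.
    """
    log_odds = {a: math.log2(p[a] / q[a]) for a in p}
    if logotype == "kl":
        raw = {a: math.copysign(p[a], log_odds[a]) for a in p}
    elif logotype == "weighted_kl":
        raw = log_odds
    elif logotype == "p_weighted_kl":
        raw = {a: p[a] * log_odds[a] for a in p}
    else:
        raise ValueError("unknown logotype: " + logotype)
    total = sum(abs(v) for v in raw.values())
    if total == 0:
        return {a: 0.0 for a in p}  # p equals q: nothing to display
    scale = information_content(p, q) / total
    return {a: v * scale for a, v in raw.items()}


# Toy four-letter alphabet with a uniform background (hypothetical values).
q = {"L": 0.25, "M": 0.25, "A": 0.25, "P": 0.25}
p = {"L": 0.5, "M": 0.25, "A": 0.125, "P": 0.125}

info = information_content(p, q)
kl = letter_heights(p, q, "kl")
wkl = letter_heights(p, q, "weighted_kl")
pwkl = letter_heights(p, q, "p_weighted_kl")
print(info, kl, wkl, pwkl)
```

In this toy position the depleted letters A and P receive negative heights in all three logotypes, while the weighted variant gives them more visual weight than the frequency-scaled KL logo does.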

THE WEB SERVER

The Seq2Logo server has a simple interface that allows non-expert users to generate and customize accurate logos from any amino acid sequence data of interest.

Input

The interface is split into two parts for easy overview. The first and most important part is submission (Figure 1, left panel). Here, the user can upload or paste in the input data and specify the logotype (Shannon, Kullback–Leibler, weighted Kullback–Leibler, probability weighted Kullback–Leibler or PSSM-logo) and the conditions for handling the input data (sequence weighting and pseudo counts). Seq2Logo can read sequence data in the following formats: Fasta, ClustalW, raw peptide sequences and weight/Blast matrix (for details on each format, refer to Supplementary Data). The format is detected automatically through the identification of key elements of each format. In the submission part, the user further specifies which output files should be created. In the graphical layout part (Figure 1, right panel), the user can customize the graphical layout of the logo plot. Page size sets the resolution of the image, and stacks per line and lines per page determine how the logo is laid out. Amino acid colors are defined by assigning each amino acid symbol to a color. There are six colors to choose from: red, green, blue, yellow, purple or orange. All amino acids left unassigned are shown in black. Several predefined color schemes are available. The user can also rotate the position numbers on the x-axis and hide various features of the graph.

Output

An example of the output from Seq2Logo, generated using the input specifications from Figure 1, is shown in Figure 2. The figure shows the amino acids enriched at each peptide position on the positive y-axis and the corresponding depleted amino acids on the negative y-axis. In this case, the logo is calculated from a set of 13 artificial peptide sequences proposed to bind the HLA-A*02:01 class I major histocompatibility complex (MHC) molecule. This molecule has a binding motif with strong interactions at P2 and P9, both positions with a preference for hydrophobic amino acids (12).

One of the distinct strengths of Seq2Logo is its ability to deal with data redundancy and a low number of observations. To the best of our knowledge, no other public sequence logo server shares this ability. Figure 3 illustrates how important these features are for generating accurate sequence logos describing a binding motif. The figure displays Shannon sequence logos generated by Seq2Logo, using different options to improve the accuracy, as well as sequence logos generated by Weblogo (2) and EnoLOGOS (3). When comparing the logos calculated from the small sample data set with the logo obtained from the larger data set, it is apparent that the inclusion of sequence weighting and pseudo counts has a significant positive impact on the overall accuracy of the binding motif description.

Figure 1. The submission (left) and graphical layout (right) parts of the web interface. In the submission part, the user specifies the input file, the format of the output files, the logotype and the conditions for handling the input data. In the graphical layout part, the user customizes the graphical layout of the logo plot: page size, stacks per line, lines per page, colours, bars, rotation of position numbers and title.

Figure 2. Output from Seq2Logo. The upper panel shows the sequence logo calculated from a set of 13 artificial peptide sequences using the specification defined in Figure 1 (sequence weighting using clustering, pseudo count with a weight of 200 and logotype as Kullback–Leibler). Enriched amino acids are shown on the positive y-axis and depleted amino acids on the negative y-axis. The lower panel gives the position-specific (log-odds) scoring matrix (PSSM) calculated by Seq2Logo. Each line corresponds to a position and gives the consensus amino acid and the log-odds scores for the 20 amino acids.

The other distinct feature of Seq2Logo compared to most other public sequence logo servers is the display of depleted amino acids on the negative y-axis in Kullback–Leibler logos. Most sequence logo servers display the relative height of the different amino acids in a manner proportional to their frequency, thus displaying only the position-specific enrichment of amino acids and discarding the equally valuable information related to amino acid depletion. To address this issue, Seq2Logo includes a series of distinct logotypes (see Figure 4). In addition to the usual Shannon logo, Seq2Logo includes the option to create Kullback–Leibler (KL) logos where depleted amino acids are represented on the negative y-axis. Besides the conventional KL logo, Seq2Logo can also display a weighted KL logo, where the relative height of each amino acid is proportional to the log-odds ratio, and a probability weighted KL logo, where the relative height of each amino acid is proportional to the product of the probability and the log-odds ratio. In particular, the weighted KL logo provides a visual and highly intuitive representation of both amino acid enrichment and depletion in, for instance, receptor binding motifs. Besides these information-based logotypes, Seq2Logo offers the possibility of displaying PSSM-logos calculated either from a log-odds weight matrix derived by Seq2Logo from a multiple sequence alignment or from a user-defined PSSM. In the PSSM-logo, the heights of the bar and of each amino acid at a position are proportional to the absolute values of the PSSM weight matrix scores. This logotype is particularly powerful when illustrating depletion of a small set of amino acids from otherwise variable positions in a sequence motif. One such example is N-linked glycosylation sites, which are known to have the motif N-X-S/T, where X can be any amino acid but P. Visualizing this motif as an information-based sequence logo will not capture the depletion of P at the position between N and S/T, as all amino acids except P are found at this position, making the overall information content very small. On the other hand, when the motif is visualized as a PSSM-logo, the strong depletion of P at the position between N and S/T becomes apparent (see Figure 5).
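The argument about the X position of the N-X-S/T motif can be checked numerically with a small Python sketch. The numbers are hypothetical: an equiprobable background, a residual probability of 0.001 for P (standing in for a pseudo count corrected estimate) and the remaining probability shared equally among the other 19 amino acids. The log-odds score is taken as log2(p_a/q_a), which may differ from the scaling used in the PSSM produced by the server.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def information_content(p, q):
    """I = sum_a p_a * log2(p_a / q_a) for one position."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in p)


def log_odds_scores(p, q):
    """Per-residue log-odds scores; in a PSSM-logo each letter's height
    is proportional to the absolute value of its score."""
    return {a: math.log2(p[a] / q[a]) for a in p}


# Hypothetical X position: every residue is about equally likely, except P.
q = {a: 1.0 / 20 for a in AMINO_ACIDS}
p_pro = 0.001  # assumed residual probability for proline
p = {a: (1.0 - p_pro) / 19 for a in AMINO_ACIDS}
p["P"] = p_pro

info = information_content(p, q)
scores = log_odds_scores(p, q)
stack_height = sum(abs(s) for s in scores.values())
most_extreme = max(scores, key=lambda a: abs(scores[a]))

print("information content (bits): %.3f" % info)
print("log-odds score of P: %.2f" % scores["P"])
print("PSSM-logo stack height: %.2f" % stack_height)
print("largest letter:", most_extreme)
```

Under these assumptions the information content of the position stays well below 0.1 bits, so an information-based logo shows an almost empty stack, whereas the log-odds score of P is strongly negative and dominates the PSSM-logo stack.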

A powerful way to characterize sequence conservation/variation within a protein family is by use of sequence profiles. Such sequence profiles can be obtained using Psi-Blast (7). Seq2Logo accepts such sequence profiles as input in the Blast profile format, giving non-expert end-users easy access to characterize and identify functionally conserved/variable amino acids in any given protein of interest. Blast sequence profiles can be generated in-house using a command like ‘blastpgp -d db -e 0.00001 -j 4 -Q blastprofile -i fasta -o out’, where -d db is the sequence database searched by Blast, -e defines the e-value cut-off for significant hits, -j defines the number of Psi-Blast iterations, -i is the input file in FASTA format, -Q is the output file for the Blast profile (the file to be used by Seq2Logo to visualize the sequence profile) and -o is the file for the Blast output. Alternatively, the Blast2logo webserver (www.cbs.dtu.dk/biotools/Blast2logo (14 May 2012, date last accessed)) can be used to obtain the sequence profile. Figure 6 demonstrates the use of Seq2Logo to display a sequence profile for Rhamnogalacturonan acetylesterase (PDBid 1K7C, chain A). The active site of 1K7C.A is defined by the residues S9, G42, N74, D192 and H195 (13). All these residues are highly conserved in the sequence logo (in fact, they are among the 10 residues with the highest information content, data not shown). Another striking observation from the logo is the lack of sequence information in the area between positions 75 and 105, suggesting that this part of the protein is highly variable (most likely an insertion) within the protein family. Both these observations illustrate the power of sequence profiles combined with Seq2Logo as a simple tool to identify functionally important residues and insertions in protein sequences.

Figure 3. Sequence logos generated from small sequence samples. All logos except the right logo in the lower row were calculated from a set of 13 artificial peptide sequences proposed to bind HLA-A*02:01 (see Figure 1). The upper row shows logos calculated by Seq2Logo using: (i) no sequence weighting or pseudo count correction, (ii) sequence weighting by clustering and no pseudo count correction and (iii) sequence weighting by clustering and pseudo count correction with a weight on prior of 200. The lower row shows logos calculated using: (i) Weblogo with ‘small sample correction’, (ii) EnoLOGOS and (iii) Seq2Logo from a set of 229 HLA-A*02:01 9mer ligands downloaded from the SYFPEITHI database (12) with sequence weighting by clustering and pseudo count correction with a weight on prior of 200.

Figure 4. The different logotype representations covered by Seq2Logo. Sequence logos generated from a set of 13 artificial peptide sequences proposed to bind HLA-A*02:01 (see Figure 1). All logos were calculated using clustering and pseudo counts with a weight on prior of 200. Upper row, left panel: Shannon, right panel: Kullback–Leibler. Lower row, left panel: weighted Kullback–Leibler, right panel: probability weighted Kullback–Leibler.

Figure 5. PSSM-logo for the N-linked glycosylation motif. The motif was calculated from a set of 2128 unique experimentally verified N-glycosylation sites downloaded from the UniprotKB protein database. Only peptide fragments of length 11 (5 before and 5 after the N) were included in the analysis.

INTEGRATING SEQ2LOGO WITH OTHER PREDICTION SERVERS

To improve usability and enable Seq2Logo to cooperate with other programs and servers, a form-handler was implemented on the server that makes it possible to send input data directly to Seq2Logo. This simple form-handler allows a quick and easy transfer of data to Seq2Logo and defines a platform for using Seq2Logo as a visualization tool for other programs. The form data sent to Seq2Logo is inserted directly into the input field. Instructions on how to implement this transfer can be found at: http://www.cbs.dtu.dk/biotools/Seq2Logo-1.0/bin/easytransferbutton.html (14 May 2012, date last accessed).

DISCUSSION AND CONCLUSION

Sequence logos provide a powerful way to visualize amino acid preferences in a receptor binding motif, as well as sequence conservation/variation and the location of functionally essential residues in multiple sequence alignments. Accurate estimation of a sequence motif is often compromised by data redundancy and a low number of observations. Inappropriate handling of these issues can lead to inaccurate estimation of the sequence motif and a subsequent poor sequence logo representation. Moreover, the majority of sequence logo web servers visualize the information related to amino acid depletion poorly, since they focus on displaying the position-specific enrichment of amino acids.

Here, we have proposed a novel sequence logo generator, Seq2Logo, that aims to address these shortcomings and allows non-expert end-users, via an easy-to-use web interface, to generate accurate sequence logos from protein sequence data. We have demonstrated that Seq2Logo can deal with sequence redundancy and a low number of observations in a manner superior to that of other publicly available sequence logo generators like Weblogo and enoLOGOS. Besides the conventional Shannon sequence logo, Seq2Logo also incorporates distinct logotypes where depleted amino acids are displayed on the negative y-axis. These logotypes allow Seq2Logo to display, for instance, receptor-binding motifs in a format that highlights both favored and disfavored amino acids at the different positions in the motif.

Figure 6. Seq2Logo visualization of a Blast sequence profile for 1K7C chain A. The Blast profile was obtained using Blast2logo (www.cbs.dtu.dk/biotools/Blast2logo (14 May 2012, date last accessed)) searching against the nr70 sequence database with default options. The active site of 1K7C:A is defined by the residues S9, G42, N74, D192 and H195 (13). All these residues show up as highly conserved in the sequence logo.

A sequence profile is a powerful way to capture position-specific information about sequence conservation/variation within a protein family. Seq2Logo accepts sequence profiles in the Blast format as input and can, in a very simple and intuitive manner, be used in combination with Blast as a tool to visualize sequence profiles and identify functionally conserved/variable amino acids in any given protein of interest.

Finally, to allow other servers dealing with multiple sequence alignments and binding motifs to cooperate directly with Seq2Logo and benefit from its improved features, the server includes a form-handler that enables communication with Seq2Logo via a simple HTML form. This feature has allowed for a simple and effective improvement to two of our own web servers, NNAlign (14) and Blast2logo (www.cbs.dtu.dk/biotools/Blast2logo (14 May 2012, date last accessed)), and we believe it will become very useful for other web server developers working on, for instance, receptor-binding motif characterization.

In its current form, Seq2Logo can only handle amino acid input data. The reason for this limitation is that most of its unique features, like pseudo count estimates from Blosum substitution matrices and sequence weighting, are specific for amino acid data. The ability to also handle nucleic acids will be part of a future update of the method.

In conclusion, we believe Seq2Logo to be an important and novel tool for non-expert users to construct accurate sequence logos describing receptor binding motifs and sequence variations in multiple sequence alignments.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Methods and Supplementary References [6–8,15,16].

FUNDING

Funding for open access charge: National Institutes of Health (NIH) [contract nos HHSN272200900045C and HHSNN26600400006C].

Conflict of interest statement. None declared.

REFERENCES

1. Schneider,T.D. and Stephens,R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 18, 6097–6100.

2. Crooks,G.E., Hon,G., Chandonia,J.M. and Brenner,S.E. (2004) WebLogo: a sequence logo generator. Genome Res., 14, 1188–1190.

3. Workman,C.T., Yin,Y., Corcoran,D.L., Ideker,T., Stormo,G.D. and Benos,P.V. (2005) enoLOGOS: a versatile web tool for energy normalized sequence logos. Nucleic Acids Res., 33, W389–W392.

4. Colaert,N., Helsens,K., Martens,L., Vandekerckhove,J. and Gevaert,K. (2009) Improved visualization of protein consensus sequences by iceLogo. Nat. Methods, 6, 786–787.

5. Vacic,V., Iakoucheva,L.M. and Radivojac,P. (2006) Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics, 22, 1536–1537.

6. Henikoff,S. and Henikoff,J.G. (1994) Position-based sequence weights. J. Mol. Biol., 243, 574–578.

7. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

8. Nielsen,M., Lundegaard,C., Worning,P., Hvid,C.S., Lamberth,K., Buus,S., Brunak,S. and Lund,O. (2004) Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach. Bioinformatics, 20, 1388–1397.

9. Shannon,C.E. (1948) A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423, 623–656.

10. Kullback,S. and Leibler,R.A. (1951) On Information and Sufficiency. Ann. Math. Stat., 22, 79–86.

11. Fujii,K., Zhu,G., Liu,Y., Hallam,J., Chen,L., Herrero,J. and Shaw,S. (2004) Kinase peptide specificity: improved determination and relevance to protein phosphorylation. Proc. Natl Acad. Sci. USA, 101, 13744–13749.

12. Rammensee,H., Bachmann,J., Emmerich,N.P., Bachor,O.A. and Stevanovic,S. (1999) SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics, 50, 213–219.

13. Porter,C.T., Bartlett,G.J. and Thornton,J.M. (2004) The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res., 32, D129–D133.

14. Andreatta,M., Schafer-Nielsen,C., Lund,O., Buus,S. and Nielsen,M. (2011) NNAlign: a web-based prediction method allowing non-expert end-user discovery of sequence motifs in quantitative peptide data. PLoS One, 6, e26781.

15. Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919.

16. Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Selection of representative protein data sets. Protein Sci., 1, 409–417.


Interpretable Prediction of SARS-CoV-2 Epitope-specific TCR Recognition Using a Pre-Trained Protein Language Model

Article

Feb 2024
IEEE ACM T COMPUT BI

The emergence of the novel coronavirus, designated as severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), has posed a significant threat to public health worldwide. There has been progress in reducing hospitalizations and deaths due to SARS-CoV-2. However, challenges stem from the emergence of SARS-CoV-2 variants, which exhibit high transmission rates, increased disease severity, and the ability to evade humoral immunity. Epitope-specific T-cell receptor (TCR) recognition is key in determining the T-cell immunogenicity for SARS-CoV-2 epitopes. Although several data-driven methods for predicting epitope-specific TCR recognition have been proposed, they remain challenging due to the enormous diversity of TCRs and the lack of available training data. Self-supervised transfer learning has recently been proven useful for extracting information from unlabeled protein sequences, increasing the predictive performance of fine-tuned models, and using a relatively small amount of training data. This study presents a deep-learning model generated by fine-tuning pre-trained protein embeddings from a large corpus of protein sequences. The fine-tuned model showed markedly high predictive performance and outperformed the recent Gaussian process-based prediction model. The output attentions captured by the deep-learning model suggested critical amino acid positions in the SARS-CoV-2 epitope-specific TCRβ sequences that are highly associated with the viral escape of T-cell immune response.

Target specific peptide design using latent space approximate trajectory collector

Preprint

Mar 2023

Despite the prevalence and many successes of deep learning applications in de novo molecular design, the problem of peptide generation targeting specific proteins remains unsolved. A main barrier for this is the scarcity of the high-quality training data. To tackle the issue, we propose a novel machine learning based peptide design architecture, called Latent Space Approximate Trajectory Collector (LSATC). It consists of a series of samplers on an optimization trajectory on a highly non-convex energy landscape that approximates the distributions of peptides with desired properties in a latent space. The process involves little human intervention and can be implemented in an end-to-end manner. We demonstrate the model by the design of peptide extensions targeting β-catenin, a key nuclear effector protein involved in canonical Wnt signalling. When compared with a random sampler, LSATC can sample peptides with 36% lower mean binding scores in a 16 times smaller interquartile range (IQR) and 284% less mean hydrophobicity with a 1.4 times smaller IQR. LSATC also largely outperforms other common generative models. Finally, we utilize a clustering algorithm to select 4 peptides from the 100 LSATC designed peptides for experimental validation. The result confirms that all the four peptides extended by LSATC show improved β-catenin binding by at least 20.0%, and two of the peptides show a 3 fold increase in binding affinity as compared to the base peptide.

Fragment ion intensity prediction improves the identification rate of non-tryptic peptides in timsTOF

Article

May 2024

Immunopeptidomics is crucial for immunotherapy and vaccine development. Because the generation of immunopeptides from their parent proteins does not adhere to clear-cut rules, rather than being able to use known digestion patterns, every possible protein subsequence within human leukocyte antigen (HLA) class-specific length restrictions needs to be considered during sequence database searching. This leads to an inflation of the search space and results in lower spectrum annotation rates. Peptide-spectrum match (PSM) rescoring is a powerful enhancement of standard searching that boosts the spectrum annotation performance. We analyze 302,105 unique synthesized non-tryptic peptides from the ProteomeTools project on a timsTOF-Pro to generate a ground-truth dataset containing 93,227 MS/MS spectra of 74,847 unique peptides, that is used to fine-tune the deep learning-based fragment ion intensity prediction model Prosit. We demonstrate up to 3-fold improvement in the identification of immunopeptides, as well as increased detection of immunopeptides from low input samples.

MAIT cell-MR1 reactivity is highly conserved across multiple divergent species

Article

May 2024
J BIOL CHEM

Mucosal-associated invariant T (MAIT) cells are a subset of unconventional T cells that recognize small molecule metabolites presented by major histocompatibility complex class I related protein 1 (MR1), via an αβ T cell receptor (TCR). MAIT TCRs feature an essentially invariant TCR α-chain, which is highly conserved between mammals. Similarly, MR1 is the most highly conserved major histocompatibility complex-I–like molecule. This extreme conservation, including the mode of interaction between the MAIT TCR and MR1, has been shown to allow for species-mismatched reactivities unique in T cell biology, thereby allowing the use of selected species-mismatched MR1–antigen (MR1–Ag) tetramers in comparative immunology studies. However, the pattern of cross-reactivity of species-mismatched MR1–Ag tetramers in identifying MAIT cells in diverse species has not been formally assessed. We developed novel cattle and pig MR1–Ag tetramers and utilized these alongside previously developed human, mouse, and pig-tailed macaque MR1–Ag tetramers to characterize cross-species tetramer reactivities. MR1–Ag tetramers from each species identified T cell populations in distantly related species with specificity that was comparable to species-matched MR1–Ag tetramers. However, there were subtle differences in staining characteristics with practical implications for the accurate identification of MAIT cells. Pig MR1 is sufficiently conserved across species that pig MR1–Ag tetramers identified MAIT cells from the other species. However, MAIT cells in pigs were at the limits of phenotypic detection. In the absence of sheep MR1–Ag tetramers, a MAIT cell population in sheep blood was identified phenotypically, utilizing species-mismatched MR1–Ag tetramers. Collectively, our results validate the use and define the limitations of species-mismatched MR1–Ag tetramers in comparative immunology studies.

New plastids, old proteins: repeated endosymbiotic acquisitions in kareniacean dinoflagellates

Article

Mar 2024
EMBO REP

Dinoflagellates are a diverse group of ecologically significant micro-eukaryotes that can serve as a model system for plastid symbiogenesis due to their susceptibility to plastid loss and replacement via serial endosymbiosis. Kareniaceae harbor fucoxanthin-pigmented plastids instead of the ancestral peridinin-pigmented ones and support them with a diverse range of nucleus-encoded plastid-targeted proteins originating from the haptophyte endosymbiont, dinoflagellate host, and/or lateral gene transfers (LGT). Here, we present predicted plastid proteomes from seven distantly related kareniaceans in three genera (Karenia, Karlodinium, and Takayama) and analyze their evolutionary patterns using automated tree building and sorting. We project a relatively limited (~10%) haptophyte signal pointing towards a shared origin in the family Chrysochromulinaceae. Our data establish significant variations in the functional distributions of these signals, emphasizing the importance of micro-evolutionary processes in shaping the chimeric proteomes. Analysis of plastid genome sequences recontextualizes these results with a striking finding: the extant kareniacean plastids are in fact not all of the same origin, as two of the studied species (Karlodinium armiger, Takayama helix) possess plastids from different haptophyte orders than the rest.

Structure, dynamics and redox reactivity of an all-purpose flavodoxin

Article

Feb 2024
J BIOL CHEM

The flavodoxin of Rhodopseudomonas palustris CGA009 (Rp9Fld) supplies highly reducing equivalents to crucial enzymes such as hydrogenase, especially when the organism is iron-restricted. By acquiring those electrons from photodriven electron flow via the bifurcating electron transfer flavoprotein, Rp9Fld provides solar power to vital metabolic processes. To understand Rp9Fld's ability to work with diverse partners, we solved its crystal structure. We observed the canonical flavodoxin (Fld) fold and features common to other long-chain Flds but not all the surface loops thought to recognize partner proteins. Moreover, some of the loops display alternative structures and dynamics. To advance studies of protein–protein associations and conformational consequences, we assigned the ¹⁹F NMR signals of all five tyrosines (Tyrs). Our electrochemical measurements show that incorporation of 3-¹⁹F-Tyr in place of Tyr has only a modest effect on Rp9Fld's redox properties even though Tyrs flank the flavin on both sides. Meanwhile, the ¹⁹F probes demonstrate the expected paramagnetic effect, with signals from nearby Tyrs becoming broadened beyond detection when the flavin semiquinone is formed. However, the temperature dependencies of chemical shifts and linewidths reveal dynamics affecting loops close to the flavin and regions that bind to partners in a variety of systems. These coincide with patterns of amino acid type conservation but not retention of specific residues, arguing against detailed specificity with respect to partners. We propose that the loops surrounding the flavin adopt altered conformations upon binding to partners and may even participate actively in electron transfer.

A large-scale study of peptide features defining immunogenicity of cancer neo-epitopes

Article

Jan 2024

Accurate prediction of immunogenicity for neo-epitopes arising from a cancer associated mutation is a crucial step in many bioinformatics pipelines that predict outcome of checkpoint blockade treatments or that aim to design personalised cancer immunotherapies and vaccines. In this study, we performed a comprehensive analysis of peptide features relevant for prediction of immunogenicity using the Cancer Epitope Database and Analysis Resource (CEDAR), a curated database of cancer epitopes with experimentally validated immunogenicity annotations from peer-reviewed publications. The developed model, ICERFIRE (ICore-based Ensemble Random Forest for neo-epitope Immunogenicity pREdiction), extracts the predicted ICORE from the full neo-epitope as input, i.e. the nested peptide with the highest predicted major histocompatibility complex (MHC) binding potential combined with its predicted likelihood of antigen presentation (%Rank). Key additional features integrated into the model include assessment of the BLOSUM mutation score of the neo-epitope, and antigen expression levels of the wild-type counterpart which is often reflecting a neo-epitope's abundance. We demonstrate improved and robust performance of ICERFIRE over existing immunogenicity and epitope prediction models, both in cross-validation and on external validation datasets.

Petromagnetic Properties In The Naica Mining District, Chihuahua, Mexico: Searching For Source of Mineralization

Article

Jan 2003
EARTH PLANETS SPACE

Ore mineral and host lithologies have been sampled with 89 oriented samples from 14 sites in the Naica District, northern Mexico. Magnetic parameters permit characterisation of the samples: saturation magnetization, density, low- and high-temperature magnetic susceptibility, remanence intensity, Koenigsberger ratio, Curie temperature and hysteresis parameters. Rock magnetic properties are controlled by variations in titanomagnetite content and hydrothermal alteration. Post-mineralization hydrothermal alteration seems to be the major event that affected the minerals and magnetic properties. Curie temperatures are characteristic of titanomagnetites or titanomaghemites. Hysteresis parameters indicate that most samples have pseudo-single domain (PSD) magnetic grains. Alternating field (AF) demagnetization and isothermal remanence (IRM) acquisition both indicate that natural and laboratory remanences are carried by MD-PSD spinels in the host rocks. The trend of NRM intensity vs susceptibility suggests that the carrier of remanent and induced magnetization is the same in all cases (spinels). The Koenigsberger ratio ranges from 0.05 to 34.04, indicating the presence of MD and PSD magnetic grains. Constraints on the geometry of the intrusive source body developed in the model of the magnetic anomaly are obtained by quantifying the relative contributions of induced and remanent magnetization components.

NNAlign: A Web-Based Prediction Method Allowing Non-Expert End-User Discovery of Sequence Motifs in Quantitative Peptide Data

Article

Nov 2011
PLOS ONE

Recent advances in high-throughput technologies have made it possible to generate both gene and protein sequence data at an unprecedented rate and scale thereby enabling entirely new “omics”-based approaches towards the analysis of complex biological processes. However, the amount and complexity of data that even a single experiment can produce seriously challenges researchers with limited bioinformatics expertise, who need to handle, analyze and interpret the data before it can be understood in a biological context. Thus, there is an unmet need for tools allowing non-bioinformatics users to interpret large data sets. We have recently developed a method, NNAlign, which is generally applicable to any biological problem where quantitative peptide data is available. This method efficiently identifies underlying sequence patterns by simultaneously aligning peptide sequences and identifying motifs associated with quantitative readouts. Here, we provide a web-based implementation of NNAlign allowing non-expert end-users to submit their data (optionally adjusting method parameters), and in return receive a trained method (including a visual representation of the identified motif) that subsequently can be used as prediction method and applied to unknown proteins/peptides. We have successfully applied this method to several different data sets including peptide microarray-derived sets containing more than 100,000 data points. NNAlign is available online at http://www.cbs.dtu.dk/services/NNAlign.

A Mathematical Theory of Communication

Article

Jul 1948

Claude Elwood Shannon

Bell System Technical Journal, also pp. 623-656 (October)

On Information and Sufficiency

Article

Mar 1951
Ann Math Stat

A Mathematical Theory of Communication

Article

Jan 2001

Claude E. Shannon

Amino Acid Substitution Matrices from Protein Blocks

Article

Nov 1992

Methods for alignment of protein sequences typically measure similarity by using a substitution matrix with scores for all possible exchanges of one amino acid with another. The most widely used matrices are based on the Dayhoff model of evolutionary rates. Using a different approach, we have derived substitution matrices from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins. This led to marked improvements in alignments and in searches using queries from each of the groups.

A Mathematical Theory of Communication

Article

Jan 1948

Claude E. Shannon

Improved visualization of protein consensus sequences by iceLogo

Article

Nov 2009
Br J Pharmacol

Selection of a representative set

Article

Mar 2008

The Protein Data Bank currently contains about 600 data sets of three-dimensional protein coordinates determined by X-ray crystallography or NMR. There is considerable redundancy in the data base, as many protein pairs are identical or very similar in sequence. However, statistical analyses of protein sequence-structure relations require nonredundant data. We have developed two algorithms to extract from the data base representative sets of protein chains with maximum coverage and minimum redundancy. The first algorithm focuses on optimizing a particular property of the selected proteins and works by successive selection of proteins from an ordered list and exclusion of all neighbors of each selected protein. The other algorithm aims at maximizing the size of the selected set and works by successive thinning out of clusters of similar proteins. Both algorithms are generally applicable to other data bases in which criteria of similarity can be defined and relate to problems in graph theory. The largest nonredundant set extracted from the current release of the Protein Data Bank has 155 protein chains. In this set, no two proteins have sequence similarity higher than a certain cutoff (30% identical residues for aligned subsequences longer than 80 residues), yet all structurally unique protein families are represented. Periodically updated lists of representative data sets are available by electronic mail from the file server '[email protected]'. The selection may be useful in statistical approaches to protein folding as well as in the analysis and documentation of the known spectrum of three-dimensional protein structures.

Sequence Logos: A New Way to Display Consensus Sequences

Article

Nov 1990

A graphical method is presented for displaying the patterns in a set of aligned sequences. The characters representing the sequence are stacked on top of each other for each position in the aligned sequences. The height of each letter is made proportional to its frequency, and the letters are sorted so the most common one is on top. The height of the entire stack is then adjusted to signify the information content of the sequences at that position. From these ‘sequence logos’, one can determine not only the consensus sequence but also the relative frequency of bases and the information content (measured in bits) at every position in a site or sequence. The logo displays both significant residues and subtle sequence patterns.
