Seq2Logo: A method for construction and visualization of amino acid binding motifs and sequence profiles including sequence weighting, pseudo counts and two-sided representation of amino acid enrichment and depletion

Martin Christen Frølund Thomsen and Morten Nielsen*
Center for Biological Sequence Analysis, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark
Received January 6, 2012; Revised April 30, 2012; Accepted May 2, 2012
ABSTRACT
Seq2Logo is a web-based sequence logo generator. Sequence logos are a graphical representation of the information content stored in a multiple sequence alignment (MSA) and provide a compact and highly intuitive representation of the position-specific amino acid composition of binding motifs, active sites, etc. in biological sequences. Accurate generation of sequence logos is often compromised by sequence redundancy and a low number of observations. Moreover, most methods available for sequence logo generation focus on displaying the position-specific enrichment of amino acids, discarding the equally valuable information related to amino acid depletion. Seq2Logo aims to resolve these issues by allowing the user to include sequence weighting to correct for data redundancy, pseudo counts to correct for a low number of observations and different logotype representations, each capturing different aspects of amino acid enrichment and depletion. Besides allowing input in the format of peptides and MSA, Seq2Logo accepts input as Blast sequence profiles, providing easy access for non-expert end-users to characterize and identify functionally conserved/variable amino acids in any given protein of interest. The output from the server is a sequence logo and a PSSM. Seq2Logo is available at http://www.cbs.dtu.dk/biotools/Seq2Logo (14 May 2012, date last accessed).
INTRODUCTION
The idea of generating a logo from aligned sets of sequences
was introduced in 1990 by Schneider and Stephens (1). The
intention of a sequence logo is to concentrate into a single
plot the general consensus, the order of predominance
of residues at every position, the relative frequencies of
every residue at every position, the amount of information
present at every position and significant locations. This
logo is then able to present all of the relevant information
to the viewer in a fast and concise manner.
Several webservers exist to generate sequence logos from MSAs (2–5). All of these servers have limitations in their handling of sequence redundancy and low numbers of observations. Moreover, to the best of our knowledge, all public sequence logo servers, with the exception of the iceLogo (4) and Two Sample Logo (5) methods, focus on displaying the position-specific enrichment of amino acids, discarding the equally valuable information related to amino acid depletion. Seq2Logo aims to resolve these issues by allowing the user to include sequence weighting to correct for data redundancy, pseudo counts to correct for a low number of observations (6–8) and five different logotype representations, each capturing different aspects of amino acid enrichment and depletion. In addition to the usual Shannon logo (9), Seq2Logo includes the option to create Kullback–Leibler (KL) (10) logos, where the depleted (under-represented) amino acids are represented on the negative y-axis. Besides the conventional KL logo, Seq2Logo can also display a weighted KL logo, where the relative height of each amino acid is proportional to the log-odds ratio, and a probability weighted KL logo, where the relative height of each amino acid is proportional to the product of the probability and the log-odds ratio. Finally, inspired by the work of Fujii et al. (11), Seq2Logo also includes an option to visualize PSSM (position-specific scoring matrix) logos, where the height of a bar is given by the sum of the absolute values of the PSSM weight matrix values and the height of a given amino acid is proportional to the absolute value of its weight matrix score. In particular, the weighted KL logo provides a visual and highly intuitive representation of both amino acid enrichment and depletion in, for instance, receptor binding motifs. Besides allowing input in the format of peptides and MSAs, the Seq2Logo server accepts inputs such as Blast sequence profiles, providing easy access for non-expert end-users to characterize and identify functionally conserved/variable amino acids in any given protein of interest.
*To whom correspondence should be addressed. Tel: +45 4525 2425; Fax: +45 4593 1585; Email: mniel@cbs.dtu.dk
Published online 25 May 2012. Nucleic Acids Research, 2012, Vol. 40, Web Server issue W281–W287
doi:10.1093/nar/gks469
© The Author(s) 2012. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
MATERIALS AND METHODS
Seq2Logo implements two strategies to improve the accuracy of the estimated sequence logo. The first strategy is sequence weighting, which corrects for data redundancy. The second strategy is pseudo counts, which correct for a low number of observations. Sequence weighting is implemented as described in (6,8) and pseudo counts as described in (7). For details, see Supplementary Data.
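As an illustration of the two corrections, the sketch below computes position-based sequence weights in the style of Henikoff and Henikoff (6) and then mixes the weighted frequencies with pseudo frequencies using the blending form p_a = (alpha * f_a + beta * g_a) / (alpha + beta) known from PSI-BLAST (7). It is a minimal sketch and not the server's code: the function names are hypothetical, the pseudo frequencies g_a are passed in by the caller rather than derived from a Blosum matrix, and reading the "weight on prior" as beta is an assumption.

```python
from collections import Counter


def henikoff_weights(alignment):
    """Position-based sequence weights in the style of Henikoff & Henikoff.

    At each column a sequence receives 1 / (r * s), where r is the number of
    distinct residues in the column and s is the number of sequences sharing
    this sequence's residue at that column.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    weights = [0.0] * len(alignment)
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment)
        r = len(counts)
        for i, seq in enumerate(alignment):
            weights[i] += 1.0 / (r * counts[seq[pos]])
    return weights


def weighted_frequencies(alignment, weights, pos):
    """Weighted residue frequencies f_a for one alignment column."""
    total = sum(weights)
    freqs = {}
    for seq, w in zip(alignment, weights):
        freqs[seq[pos]] = freqs.get(seq[pos], 0.0) + w / total
    return freqs


def blend_with_prior(freqs, prior, alpha, beta):
    """Pseudo count correction: p_a = (alpha * f_a + beta * g_a) / (alpha + beta).

    `prior` holds the pseudo frequencies g_a (a hypothetical input here; in
    practice they would be derived from a substitution matrix such as Blosum).
    """
    residues = set(freqs) | set(prior)
    return {
        a: (alpha * freqs.get(a, 0.0) + beta * prior.get(a, 0.0)) / (alpha + beta)
        for a in residues
    }


# Two identical peptides and one distinct peptide: the duplicates share weight.
peptides = ["AAA", "AAA", "CCC"]
w = henikoff_weights(peptides)
f = weighted_frequencies(peptides, w, 0)
```

In this toy input the two identical peptides together receive the same total weight as the single distinct one, so A and C end up with equal weighted frequency at every position instead of the 2:1 ratio of the raw counts.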
In a sequence logo, the height of the bar is equal to the information content at each amino acid position. The information content is calculated using the relation $I = \sum_a p_a \log_2(p_a/q_a)$, where $p_a$ and $q_a$ are the observed probability (calculated from the data) and background probability, respectively, of the amino acid $a$. If an equiprobable background amino acid distribution is applied, a conventional Shannon sequence logo is displayed. If a background amino acid distribution reflecting the prevalence of the different amino acids is applied, a Kullback–Leibler sequence logo is displayed. The choice of the Kullback–Leibler logotype in Seq2Logo not only provides correction for the uneven distribution of amino acids, but also expresses the depleted amino acids (where $p_a < q_a$) on the negative side of the y-axis. This enables the user to quickly identify enriched and depleted (under-represented) amino acids. To enhance the identification and information of the depleted amino acids, Seq2Logo includes another logotype called weighted Kullback–Leibler. This logotype presents each individual amino acid proportional to its relative log-odds score [$\log_2(p_a/q_a)$]. Another logotype is included called probability weighted Kullback–Leibler, where the relative height of each individual amino acid is proportional to $p_a \cdot \log_2(p_a/q_a)$. Finally, Seq2Logo includes an option to display PSSM-logos (11), where the height of a bar is equal to the sum of the absolute value of the PSSM weight matrix values and the height of each amino acid is proportional to the absolute value of the weight matrix score (with negative values displayed on the negative y-axis).
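The relations above can be sketched for a single logo column as follows. This is an illustrative reading of the text, not the server's implementation: the function name `logo_column` and the logotype labels are hypothetical, and the way the bar height is shared between letters (absolute letter heights summing to I) is an assumption. Passing a uniform background q gives the Shannon information content.

```python
import math


def logo_column(p, q, logotype="kl"):
    """Return (information content in bits, signed letter heights) for one column.

    p and q map residues to observed and background probabilities.
    """
    # Residues with p_a = 0 are skipped; pseudo counts normally prevent this.
    odds = {a: math.log2(p[a] / q[a]) for a in p if p[a] > 0}
    info = sum(p[a] * odds[a] for a in odds)  # I = sum_a p_a log2(p_a / q_a)

    if logotype == "kl":
        raw = {a: p[a] for a in odds}  # proportional to frequency
    elif logotype == "weighted_kl":
        raw = {a: abs(odds[a]) for a in odds}  # proportional to |log-odds|
    elif logotype == "p_weighted_kl":
        raw = {a: p[a] * abs(odds[a]) for a in odds}  # p_a * |log-odds|
    else:
        raise ValueError(f"unknown logotype: {logotype}")

    norm = sum(raw.values())
    if norm == 0:
        return info, {a: 0.0 for a in odds}
    # Assumption: the letters share the bar height I; depleted residues
    # (p_a < q_a, negative log-odds) are drawn below the axis.
    heights = {a: math.copysign(info * raw[a] / norm, odds[a]) for a in odds}
    return info, heights


# Toy two-letter column: A is enriched, C is depleted.
p = {"A": 0.75, "C": 0.25}
q = {"A": 0.5, "C": 0.5}
info, kl_heights = logo_column(p, q, "kl")
_, wkl_heights = logo_column(p, q, "weighted_kl")
```

In the toy column the depleted residue C is small in the KL logo because it is rare, but becomes the taller letter in the weighted KL logo because its log-odds score has the larger magnitude.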
THE WEB SERVER
The Seq2Logo server has a simple interface that allows
non-expert users to generate and customize accurate
logos from any amino acid sequence data of interest.
Input
The interface is split into two parts for easy overview. The first and most important part is submission (Figure 1, left panel). Here, the user can upload or paste in the input data, and specify the logotype (Shannon, Kullback–Leibler, weighted Kullback–Leibler, probability weighted Kullback–Leibler or PSSM-logo) and the conditions for handling the input data (sequence weighting and pseudo counts). Seq2Logo can read sequence data in the following formats: Fasta, ClustalW, raw peptide sequences and Weight/Blast matrix (for details on each format, refer to Supplementary Data). The format is detected automatically through the identification of key elements of each format. In the submission part, the user further specifies which output files should be created. In the graphical layout part (Figure 1, right panel), the user can customize the graphical layout of the logo plot. Page size sets the resolution of the image, while stacks per line and lines per page determine how the logo is laid out. Amino acid colors are defined by assigning each amino acid symbol to a color. There are six colors to choose from: red, green, blue, yellow, purple or orange. All amino acids left out will be black. Several predefined color schemes are available. The user can also rotate the position numbers on the x-axis and hide various features of the graph.
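Automatic format detection of this kind could be sketched as below. The heuristics are assumptions chosen for illustration (a leading ">" for Fasta, a leading "CLUSTAL" header for ClustalW, lines made only of residue letters for raw peptides, anything else treated as a matrix); the actual key elements used by the server are described in the Supplementary Data and may differ. The function name `guess_format` is hypothetical.

```python
# Residue letters accepted in raw peptide input (assumed alphabet, with
# X for unknown residues and "-" for gaps).
PEPTIDE_ALPHABET = set("ACDEFGHIKLMNPQRSTVWYX-")


def guess_format(text):
    """Guess the input format from simple key elements of the text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty input")
    if lines[0].startswith(">"):
        return "fasta"
    if lines[0].upper().startswith("CLUSTAL"):
        return "clustalw"
    if all(set(line.upper()) <= PEPTIDE_ALPHABET for line in lines):
        return "peptides"
    # Fallback: assume a weight or Blast matrix (numbers and column labels).
    return "matrix"
```

For example, `guess_format("ALAKAAAAM\nALAKAAAAN")` returns `"peptides"`, whereas input whose first line starts with `>` is classified as `"fasta"`.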
Output
An example of the output from Seq2Logo generated using
the input specifications from Figure 1 is shown in
Figure 2. The figure shows on the positive y-axis, the
amino acids enriched at each peptide position and on
the negative y-axis the corresponding depleted amino
acids. In this case, the logo is calculated from a set of 13
artificial peptide sequences proposed to bind the
HLA-A*02:01 class I major histocompatibility complex
(MHC) molecule. This molecule has a binding motif
with strong interactions at P2 and P9, both positions
with a preference for hydrophobic amino acids (12).
One of the distinct powers of Seq2Logo is its ability to deal with data redundancy and low numbers of observations. To the best of our knowledge, no other public sequence logo server shares this ability. Figure 3 illustrates how crucial these features are for generating accurate sequence logos describing a binding motif. The figure displays Shannon sequence logos generated by Seq2Logo, using different options to improve the accuracy, as well as sequence logos generated by WebLogo (2) and enoLOGOS (3). When comparing the logos calculated from the small sample data set with the logo obtained from the larger data set, it is apparent that the inclusion of sequence weighting and pseudo counts has a significant positive impact on the overall accuracy of the binding motif description.
Figure 1. The submission (left) and graphical layout (right) parts of the web interface. In the submission part, the user specifies the input file, the format of output files, the logotype and the conditions for the handling of the input data. In the graphical layout part, the user customizes the graphical layout of the logo plot: page size, stacks per line, lines per page, colors, bars, rotation of position numbers and title.
Figure 2. Output from Seq2Logo. The upper panel shows the sequence logo calculated from a set of 13 artificial peptide sequences using the specification defined in Figure 1 (sequence weighting using clustering, pseudo count with a weight of 200 and logotype as Kullback–Leibler). Enriched amino acids are shown on the positive y-axis and depleted amino acids on the negative y-axis. The lower panel gives the position-specific (log-odds) scoring matrix (PSSM) calculated by Seq2Logo. Each line corresponds to a position and gives the consensus amino acid and the log-odds scores for the 20 amino acids.
The other distinct feature of Seq2Logo compared to most other public sequence logo servers is the display of depleted amino acids on the negative y-axis in Kullback–Leibler logos. Most sequence logo servers display the relative height of the different amino acids in a manner proportional to their frequency, thus displaying only the position-specific enrichment of amino acids and discarding the equally valuable information related to amino acid depletion. To improve on this issue, Seq2Logo includes a series of distinct logotypes (see Figure 4). In addition to the usual Shannon logo, Seq2Logo includes the option to create Kullback–Leibler (KL) logos, where depleted amino acids are represented on the negative y-axis. Besides the conventional KL logo, Seq2Logo can also display a weighted KL logo, where the relative height of each amino acid is proportional to the log-odds ratio, and a probability weighted KL logo, where the relative height of each amino acid is proportional to the product of the probability and the log-odds ratio. In particular, the weighted KL logo provides a visual and highly intuitive representation of both amino acid enrichment and depletion in, for instance, receptor binding motifs. Besides these information-based logotypes, Seq2Logo offers the possibility of displaying PSSM-logos calculated either from a log-odds weight matrix derived by Seq2Logo from a multiple sequence alignment or from a user-defined PSSM. In the PSSM-logo, the height of the bar and of each amino acid at each position is proportional to the absolute value of the PSSM weight matrix values. This logotype is particularly powerful when illustrating depletion of a small set of amino acids from otherwise variable positions in a sequence motif. One such example is N-linked glycosylation sites, which are known to have the motif N-X-S/T, where X can be any amino acid but P. Visualizing this motif as an information-based sequence logo will not capture the depletion of P at the position between N and S/T, as all amino acids except P are found at this position, making the overall information content very small. On the other hand, when the motif is visualized as a PSSM-logo, the strong depletion of P at the position between N and S/T becomes apparent (see Figure 5).
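The N-X-S/T argument can be checked numerically with a small worked example. The numbers are hypothetical and chosen only for illustration: proline is given a residual probability of 0.001 at the X position, the other 19 amino acids are taken as equally frequent, the background is assumed uniform and the PSSM scores are assumed to be plain log2 odds. None of these values are taken from the data behind Figure 5.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical column for the X position of N-X-S/T: proline is almost
# absent, the other 19 residues are equally frequent (illustrative numbers).
p_pro = 0.001
p = {a: (p_pro if a == "P" else (1.0 - p_pro) / 19) for a in AMINO_ACIDS}
q = {a: 1.0 / 20 for a in AMINO_ACIDS}  # assumed uniform background

log_odds = {a: math.log2(p[a] / q[a]) for a in AMINO_ACIDS}

# Information-based logo: bar height I = sum_a p_a log2(p_a / q_a).
information = sum(p[a] * log_odds[a] for a in AMINO_ACIDS)

# PSSM-logo: bar height is the sum of absolute scores, and each letter's
# share of the bar is proportional to its absolute score.
pssm_bar = sum(abs(score) for score in log_odds.values())
pro_share = abs(log_odds["P"]) / pssm_bar

print(f"information content: {information:.3f} bits")
print(f"log-odds score for P: {log_odds['P']:.2f}")
print(f"share of PSSM-logo bar taken by P: {pro_share:.2f}")
```

Under these assumptions the information content of the column stays below a tenth of a bit, so the position is nearly invisible in an information-based logo, while the strongly negative score for P takes up most of the PSSM-logo bar.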
A powerful way to characterize sequence conservation/variation within a protein family is by use of sequence profiles. Such sequence profiles can be obtained using Psi-Blast (7). Seq2Logo accepts input of such sequence profiles in the Blast profile format, allowing easy access for non-expert end-users to characterize and identify functionally conserved/variable amino acids in any given protein of interest. Blast sequence profiles can be generated in-house using a command like ‘blastpgp -d db -e 0.00001 -j 4 -Q blastprofile -i fasta -o out’, where -d db is the sequence database searched by Blast, -e defines the e-value cut-off for significant hits, -j defines the number of Psi-Blast iterations, -i is the input file in FASTA format, -Q is the output file for the Blast profile (the file to be used by Seq2Logo to visualize the sequence profile) and -o is the file for the Blast output. Alternatively, the Blast2logo webserver (www.cbs.dtu.dk/biotools/Blast2logo (14 May 2012, date last accessed)) can be used to obtain the sequence profile. Figure 6 demonstrates the use of Seq2Logo to display a sequence profile for Rhamnogalacturonan acetylesterase (PDBid 1K7C, chain A). The active site of 1K7C.A is defined by the residues S9, G42, N74, D192 and H195 (13). All these residues are highly conserved in the sequence logo (in fact they are among the 10 residues with the highest information content, data not shown). Another striking observation from the logo is the lack of sequence information in the area between positions 75 and 105, suggesting that this part of the protein is highly variable (most likely an insertion) within the protein family. Both these observations illustrate the power of sequence profiles combined with Seq2Logo as a simple tool to identify functionally important residues and insertions in protein sequences.
Figure 3. Sequence logos generated from small sequence samples. All logos except the right logo in the lower row were calculated from a set of 13 artificial peptide sequences proposed to bind HLA-A*02:01 (see Figure 1). The upper row shows logos calculated by Seq2Logo using: (i) no sequence weighting or pseudo count correction, (ii) sequence weighting by clustering and no pseudo count correction and (iii) sequence weighting by clustering and pseudo count correction with a weight on prior of 200. The lower row shows logos calculated using: (i) WebLogo with ‘small sample correction’, (ii) enoLOGOS and (iii) Seq2Logo from a set of 229 HLA-A*02:01 9mer ligands downloaded from the SYFPEITHI database (12) with sequence weighting by clustering and pseudo count correction with a weight on prior of 200.
Figure 4. The different logotype representations covered by Seq2Logo. Sequence logos generated from a set of 13 artificial peptide sequences proposed to bind HLA-A*02:01 (see Figure 1). All logos were calculated using clustering and pseudo counts with a weight on prior of 200. Upper row, left panel: Shannon, right panel: Kullback–Leibler. Lower row, left panel: weighted Kullback–Leibler, right panel: probability weighted Kullback–Leibler.
Figure 5. PSSM-logo for the N-linked glycosylation motif. The motif was calculated from a set of 2128 unique experimentally verified N-glycosylation sites downloaded from the UniProtKB protein database. Only peptide fragments of length 11 (5 before and 5 after the N) were included in the analysis.
INTEGRATING SEQ2LOGO WITH OTHER PREDICTION SERVERS
To improve the usability and make Seq2Logo able to cooperate with other programs and servers, a form-handler was implemented on the server that makes it possible to send input data directly to Seq2Logo. This simple form-handler allows a quick and easy transfer of data to Seq2Logo and defines a platform for using Seq2Logo as a visualization tool for other programs. The form data sent to Seq2Logo is inserted directly into the input field. Instructions on how to implement this transfer can be found at: http://www.cbs.dtu.dk/biotools/Seq2Logo-1.0/bin/easytransferbutton.html (14 May 2012, date last accessed).
DISCUSSION AND CONCLUSION
Sequence logos provide a powerful way to visualize amino acid preferences in a receptor binding motif, as well as sequence conservation/variation and the location of functionally essential residues in multiple sequence alignments. Accurate estimation of a sequence motif is often compromised by data redundancy and a low number of observations. Inappropriate handling of these issues can lead to inaccurate estimation of the sequence motif and a subsequent poor sequence logo representation. Moreover, the majority of sequence logo webservers visualize the information related to amino acid depletion poorly, since they focus on displaying the position-specific enrichment of amino acids.
Here, we have proposed a novel sequence logo generator, Seq2Logo, that aims to address these shortcomings and allows non-expert end-users, via an easy-to-use web interface, to generate accurate sequence logos from protein sequence data. We have demonstrated that Seq2Logo can deal with sequence redundancy and low numbers of observations in a manner superior to that of other publicly available sequence logo generators like WebLogo and enoLOGOS. Besides the conventional Shannon sequence logo, Seq2Logo also incorporates distinct logotypes where depleted amino acids are displayed on the negative y-axis. These logotypes offer a unique possibility for Seq2Logo to display, for instance, receptor-binding motifs in a format that highlights both favored and disfavored amino acids at the different positions in the motif.
Figure 6. Seq2Logo visualization of a Blast sequence profile for 1K7C chain A. The Blast profile was obtained using Blast2logo (www.cbs.dtu.dk/
biotools/Blast2logo (14 May 2012, date last accessed)) searching against the nr70 sequence database with default options. The active site of 1K7C:A
is defined by the residues S9, G42, N74, D192 and H195 (13). All these residues show up as highly conserved in the sequence logo.
A sequence profile is a powerful way to capture position-specific information about sequence conservation/variation within a protein family. Seq2Logo accepts sequence profiles in the Blast format as input and can, in a very simple and intuitive manner, be used in combination with Blast as a tool to visualize sequence profiles and identify functionally conserved/variable amino acids in any given protein of interest.
Finally, to allow other servers dealing with multiple sequence alignments and binding motifs to cooperate directly with Seq2Logo and benefit from its improved features, the server includes a form-handler that enables communication with Seq2Logo via a simple html form. This feature has allowed for a simple and effective improvement to two of our own webservers, NNAlign (14) and Blast2logo (www.cbs.dtu.dk/biotools/Blast2logo (14 May 2012, date last accessed)), and we believe it will become very useful for other webserver developers within the field of, for instance, receptor-binding motif characterization.
In its current form, Seq2Logo can only handle amino acid input data. The reason for this limitation is that most of its unique features, like pseudo count estimates from Blosum substitution matrices and sequence weighting, are specific for amino acid data. The ability to also handle nucleic acids will be part of a future update of the method.
In conclusion, we believe Seq2Logo to be an important and novel tool for non-expert users to construct accurate sequence logos describing receptor binding motifs and sequence variations in multiple sequence alignments.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online:
Supplementary Methods and Supplementary References
[6–8,15,16].
FUNDING
Funding for open access charge: National Institutes of
Health (NIH) [contract nos HHSN272200900045C and
HHSNN26600400006C].
Conflict of interest statement. None declared.
REFERENCES
1. Schneider,T.D. and Stephens,R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 18, 6097–6100.
2. Crooks,G.E., Hon,G., Chandonia,J.M. and Brenner,S.E. (2004) WebLogo: a sequence logo generator. Genome Res., 14, 1188–1190.
3. Workman,C.T., Yin,Y., Corcoran,D.L., Ideker,T., Stormo,G.D. and Benos,P.V. (2005) enoLOGOS: a versatile web tool for energy normalized sequence logos. Nucleic Acids Res., 33, W389–W392.
4. Colaert,N., Helsens,K., Martens,L., Vandekerckhove,J. and Gevaert,K. (2009) Improved visualization of protein consensus sequences by iceLogo. Nat. Methods, 6, 786–787.
5. Vacic,V., Iakoucheva,L.M. and Radivojac,P. (2006) Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics, 22, 1536–1537.
6. Henikoff,S. and Henikoff,J.G. (1994) Position-based sequence weights. J. Mol. Biol., 243, 574–578.
7. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
8. Nielsen,M., Lundegaard,C., Worning,P., Hvid,C.S., Lamberth,K., Buus,S., Brunak,S. and Lund,O. (2004) Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach. Bioinformatics, 20, 1388–1397.
9. Shannon,C.E. (1948) A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423, 623–656.
10. Kullback,S. and Leibler,R.A. (1951) On Information and Sufficiency. Ann. Math. Stat., 22, 79–86.
11. Fujii,K., Zhu,G., Liu,Y., Hallam,J., Chen,L., Herrero,J. and Shaw,S. (2004) Kinase peptide specificity: improved determination and relevance to protein phosphorylation. Proc. Natl Acad. Sci. USA, 101, 13744–13749.
12. Rammensee,H., Bachmann,J., Emmerich,N.P., Bachor,O.A. and Stevanovic,S. (1999) SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics, 50, 213–219.
13. Porter,C.T., Bartlett,G.J. and Thornton,J.M. (2004) The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res., 32, D129–D133.
14. Andreatta,M., Schafer-Nielsen,C., Lund,O., Buus,S. and Nielsen,M. (2011) NNAlign: a web-based prediction method allowing non-expert end-user discovery of sequence motifs in quantitative peptide data. PLoS One, 6, e26781.
15. Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919.
16. Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Selection of representative protein data sets. Protein Sci., 1, 409–417.
Nucleic Acids Research, 2012, Vol. 40, Web Server issue W287
... We also identified potential peptide binders from alternative reading frames for Article https://doi.org/10.1038/s41467-024-47576-y all three HLAs, including NP + 2 180-188 and PB1 + 2 [80][81][82][83][84][85][86][87][88][89] , which were common to HLA-B*07:02 and HLA-B*08:01. ...
... This was not performed for the C1R.B*07:02 and C1R.B*35:01 datasets to avoid elimination of binders shared with the transfectant HLA given HLA-B*07:02, HLA-B*35:01 and HLA-B*35:03 all favor Proline at P2. Gibbs cluster analysis was performed on the unfiltered human 8-13mers for each data set using GibbsCluster2.0 78,79 (recommended configuration MHC class I ligands of length [8][9][10][11][12][13], and those peptide sequences that clustered with experimentally identified HLA-C*04:01 ligands were also excluded ( Supplementary Fig. 6). Sequence Logos were generated with Seq2-Logo2.0 using default settings 80 and graphs were generated using GraphPad Prism 9.5 for Windows (GraphPad Software, San Diego, California USA, www.graphpad.com). ...
Article
Full-text available
Influenza B viruses (IBVs) cause substantive morbidity and mortality, and yet immunity towards IBVs remains understudied. CD8⁺ T-cells provide broadly cross-reactive immunity and alleviate disease severity by recognizing conserved epitopes. Despite the IBV burden, only 18 IBV-specific T-cell epitopes restricted by 5 HLAs have been identified currently. A broader array of conserved IBV T-cell epitopes is needed to develop effective cross-reactive T-cell based IBV vaccines. Here we identify 9 highly conserved IBV CD8⁺ T-cell epitopes restricted to HLA-B*07:02, HLA-B*08:01 and HLA-B*35:01. Memory IBV-specific tetramer⁺CD8⁺ T-cells are present within blood and tissues. Frequencies of IBV-specific CD8⁺ T-cells decline with age, but maintain a central memory phenotype. HLA-B*07:02 and HLA-B*08:01-restricted NP30-38 epitope-specific T-cells have distinct T-cell receptor repertoires. We provide structural basis for the IBV HLA-B*07:02-restricted NS1196-206 (11-mer) and HLA-B*07:02-restricted NP30-38 epitope presentation. Our study increases the number of IBV CD8⁺ T-cell epitopes, and defines IBV-specific CD8⁺ T-cells at cellular and molecular levels, across tissues and age.
... accessed on 1 June 2023) using the sequence of the epitope (RAHYNIVTF) did not provide any results. In this scenario, the PDB was surveyed for complexes of H2-D b with peptides sharing some sequence similarity with the HPV-16 E7 epitope [49][50][51][52][53][54][55][56][57] . A promising candidate was the PDB entry 1FG2 which corresponds to a complex of H2-D b with the GP33 peptide (KAVYNFATC) [23]. ...
... server [47,48]. The visualization of position-dependent amino acid residues as logo motifs is also provided [49]. ...
Article
Full-text available
A detailed comprehension of MHC-epitope recognition is essential for the design and development of new antigens that could be effectively used in immunotherapy. Yet, the high variability of the peptide together with the large abundance of MHC variants binding makes the process highly specific and large-scale characterizations extremely challenging by standard experimental techniques. Taking advantage of the striking predictive accuracy of AlphaFold, we report a structural and dynamic-based strategy to gain insights into the molecular basis that drives the recognition and interaction of MHC class I in the immune response triggered by pathogens and/or tumor-derived peptides. Here, we investigated at the atomic level the recognition of E7 and TRP-2 epitopes to their known receptors, thus offering a structural explanation for the different binding preferences of the studied receptors for specific residues in certain positions of the antigen sequences. Moreover, our analysis provides clues on the determinants that dictate the affinity of the same epitope with different receptors. Collectively, the data here presented indicate the reliability of the approach that can be straightforwardly extended to a large number of related systems.
... Sequence conservation analysis of peptides derived from breakpoint regions of gene fusions was performed using the Seq2Logo tool with the Probability Weighted Kullback-Leibler (PWKL) method (Thomsen & Nielsen, 2012). Initially, the peptide sequences were aligned using a suitable multiple sequence alignment tool Clustal Omega. ...
Article
Ataxia represents a heterogeneous group of neurodegenerative disorders characterized by a loss of balance and coordination, often resulting from mutations in genes vital for cerebellar function and maintenance. Recent advances in genomics have identified gene fusion events as critical contributors to various cancers and neurodegenerative diseases. However, their role in ataxia pathogenesis remains largely unexplored. Our study delved into this possibility by analyzing RNA sequencing data from 1443 diverse samples, including cell and mouse models, patient samples, and healthy controls. We identified 7067 novel gene fusions, potentially pivotal in disease onset. These fusions, notably in-frame, could produce chimeric proteins, disrupt gene regulation, or introduce new functions. We observed conservation of specific amino acids at fusion breakpoints and identified potential aggregate formations in fusion proteins, known to contribute to ataxia. Through AI-based protein structure prediction, we identified topological changes in three high-confidence fusion proteins—TEN1-ACOX1, PEX14-NMNAT1, and ITPR1-GRID2—which could potentially alter their functions. Subsequent virtual drug screening identified several molecules and peptides with high-affinity binding to fusion sites. Molecular dynamics simulations confirmed the stability of these protein-ligand complexes at fusion breakpoints. Additionally, we explored the role of non-coding RNA fusions as miRNA sponges. One such fusion, RP11-547P4-FLJ33910, showed strong interaction with hsa-miR-504-5p, potentially acting as its sponge. This interaction correlated with the upregulation of hsa-miR-504-5p target genes, some previously linked to ataxia. In conclusion, our study unveils new aspects of gene fusions in ataxia, suggesting their significant role in pathogenesis and opening avenues for targeted therapeutic interventions.
... These attention-based results are similar to those of MSA-based approaches, which suggests that the conserved positions are highly relevant for epitope-specific TCR recognition (The sequence logos of the MSAs for the CDR3β sequences are provided in the Supplementary Fig. 1, available in the online supplemental material, of this paper.) [60]. ...
Article
The emergence of the novel coronavirus, designated as severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), has posed a significant threat to public health worldwide. There has been progress in reducing hospitalizations and deaths due to SARS-CoV-2. However, challenges stem from the emergence of SARS-CoV-2 variants, which exhibit high transmission rates, increased disease severity, and the ability to evade humoral immunity. Epitope-specific T-cell receptor (TCR) recognition is key in determining the T-cell immunogenicity for SARS-CoV-2 epitopes. Although several data-driven methods for predicting epitope-specific TCR recognition have been proposed, they remain challenging due to the enormous diversity of TCRs and the lack of available training data. Self-supervised transfer learning has recently been proven useful for extracting information from unlabeled protein sequences, increasing the predictive performance of fine-tuned models, and using a relatively small amount of training data. This study presents a deep-learning model generated by fine-tuning pre-trained protein embeddings from a large corpus of protein sequences. The fine-tuned model showed markedly high predictive performance and outperformed the recent Gaussian process-based prediction model. The output attentions captured by the deep-learning model suggested critical amino acid positions in the SARS-CoV-2 epitope-specific TCRβ sequences that are highly associated with the viral escape of T-cell immune response.
Preprint
Despite the prevalence and many successes of deep learning applications in de novo molecular design, the problem of peptide generation targeting specific proteins remains unsolved. A main barrier for this is the scarcity of high-quality training data. To tackle the issue, we propose a novel machine learning based peptide design architecture, called Latent Space Approximate Trajectory Collector (LSATC). It consists of a series of samplers on an optimization trajectory on a highly non-convex energy landscape that approximates the distributions of peptides with desired properties in a latent space. The process involves little human intervention and can be implemented in an end-to-end manner. We demonstrate the model by the design of peptide extensions targeting β-catenin, a key nuclear effector protein involved in canonical Wnt signalling. When compared with a random sampler, LSATC can sample peptides with 36% lower mean binding scores in a 16 times smaller interquartile range (IQR) and 284% less mean hydrophobicity with a 1.4 times smaller IQR. LSATC also largely outperforms other common generative models. Finally, we utilize a clustering algorithm to select 4 peptides from the 100 LSATC designed peptides for experimental validation. The result confirms that all four peptides extended by LSATC show improved β-catenin binding by at least 20.0%, and two of the peptides show a 3-fold increase in binding affinity as compared to the base peptide.
Article
Immunopeptidomics is crucial for immunotherapy and vaccine development. Because the generation of immunopeptides from their parent proteins does not adhere to clear-cut rules, rather than being able to use known digestion patterns, every possible protein subsequence within human leukocyte antigen (HLA) class-specific length restrictions needs to be considered during sequence database searching. This leads to an inflation of the search space and results in lower spectrum annotation rates. Peptide-spectrum match (PSM) rescoring is a powerful enhancement of standard searching that boosts the spectrum annotation performance. We analyze 302,105 unique synthesized non-tryptic peptides from the ProteomeTools project on a timsTOF-Pro to generate a ground-truth dataset containing 93,227 MS/MS spectra of 74,847 unique peptides, that is used to fine-tune the deep learning-based fragment ion intensity prediction model Prosit. We demonstrate up to 3-fold improvement in the identification of immunopeptides, as well as increased detection of immunopeptides from low input samples.
Article
Mucosal-associated invariant T (MAIT) cells are a subset of unconventional T cells that recognize small molecule metabolites presented by major histocompatibility complex class I related protein 1 (MR1), via an αβ T cell receptor (TCR). MAIT TCRs feature an essentially invariant TCR α-chain, which is highly conserved between mammals. Similarly, MR1 is the most highly conserved major histocompatibility complex-I–like molecule. This extreme conservation, including the mode of interaction between the MAIT TCR and MR1, has been shown to allow for species-mismatched reactivities unique in T cell biology, thereby allowing the use of selected species-mismatched MR1–antigen (MR1–Ag) tetramers in comparative immunology studies. However, the pattern of cross-reactivity of species-mismatched MR1–Ag tetramers in identifying MAIT cells in diverse species has not been formally assessed. We developed novel cattle and pig MR1–Ag tetramers and utilized these alongside previously developed human, mouse, and pig-tailed macaque MR1–Ag tetramers to characterize cross-species tetramer reactivities. MR1–Ag tetramers from each species identified T cell populations in distantly related species with specificity that was comparable to species-matched MR1–Ag tetramers. However, there were subtle differences in staining characteristics with practical implications for the accurate identification of MAIT cells. Pig MR1 is sufficiently conserved across species that pig MR1–Ag tetramers identified MAIT cells from the other species. However, MAIT cells in pigs were at the limits of phenotypic detection. In the absence of sheep MR1–Ag tetramers, a MAIT cell population in sheep blood was identified phenotypically, utilizing species-mismatched MR1–Ag tetramers. Collectively, our results validate the use and define the limitations of species-mismatched MR1–Ag tetramers in comparative immunology studies.
Article
Dinoflagellates are a diverse group of ecologically significant micro-eukaryotes that can serve as a model system for plastid symbiogenesis due to their susceptibility to plastid loss and replacement via serial endosymbiosis. Kareniaceae harbor fucoxanthin-pigmented plastids instead of the ancestral peridinin-pigmented ones and support them with a diverse range of nucleus-encoded plastid-targeted proteins originating from the haptophyte endosymbiont, dinoflagellate host, and/or lateral gene transfers (LGT). Here, we present predicted plastid proteomes from seven distantly related kareniaceans in three genera (Karenia, Karlodinium, and Takayama) and analyze their evolutionary patterns using automated tree building and sorting. We project a relatively limited (~10%) haptophyte signal pointing towards a shared origin in the family Chrysochromulinaceae. Our data establish significant variations in the functional distributions of these signals, emphasizing the importance of micro-evolutionary processes in shaping the chimeric proteomes. Analysis of plastid genome sequences recontextualizes these results with the striking finding that the extant kareniacean plastids are in fact not all of the same origin, as two of the studied species (Karlodinium armiger, Takayama helix) possess plastids from different haptophyte orders than the rest.
Article
The flavodoxin of Rhodopseudomonas palustris CGA009 (Rp9Fld) supplies highly reducing equivalents to crucial enzymes such as hydrogenase, especially when the organism is iron-restricted. By acquiring those electrons from photodriven electron flow via the bifurcating electron transfer flavoprotein, Rp9Fld provides solar power to vital metabolic processes. To understand Rp9Fld's ability to work with diverse partners, we solved its crystal structure. We observed the canonical flavodoxin (Fld) fold and features common to other long-chain Flds but not all the surface loops thought to recognize partner proteins. Moreover, some of the loops display alternative structures and dynamics. To advance studies of protein–protein associations and conformational consequences, we assigned the ¹⁹F NMR signals of all five tyrosines (Tyrs). Our electrochemical measurements show that incorporation of 3-¹⁹F-Tyr in place of Tyr has only a modest effect on Rp9Fld's redox properties even though Tyrs flank the flavin on both sides. Meanwhile, the ¹⁹F probes demonstrate the expected paramagnetic effect, with signals from nearby Tyrs becoming broadened beyond detection when the flavin semiquinone is formed. However, the temperature dependencies of chemical shifts and linewidths reveal dynamics affecting loops close to the flavin and regions that bind to partners in a variety of systems. These coincide with patterns of amino acid type conservation but not retention of specific residues, arguing against detailed specificity with respect to partners. We propose that the loops surrounding the flavin adopt altered conformations upon binding to partners and may even participate actively in electron transfer.
Article
Accurate prediction of immunogenicity for neo-epitopes arising from a cancer associated mutation is a crucial step in many bioinformatics pipelines that predict outcome of checkpoint blockade treatments or that aim to design personalised cancer immunotherapies and vaccines. In this study, we performed a comprehensive analysis of peptide features relevant for prediction of immunogenicity using the Cancer Epitope Database and Analysis Resource (CEDAR), a curated database of cancer epitopes with experimentally validated immunogenicity annotations from peer-reviewed publications. The developed model, ICERFIRE (ICore-based Ensemble Random Forest for neo-epitope Immunogenicity pREdiction), extracts the predicted ICORE from the full neo-epitope as input, i.e. the nested peptide with the highest predicted major histocompatibility complex (MHC) binding potential combined with its predicted likelihood of antigen presentation (%Rank). Key additional features integrated into the model include assessment of the BLOSUM mutation score of the neo-epitope, and antigen expression levels of the wild-type counterpart which is often reflecting a neo-epitope's abundance. We demonstrate improved and robust performance of ICERFIRE over existing immunogenicity and epitope prediction models, both in cross-validation and on external validation datasets.
Article
Ore mineral and host lithologies have been sampled with 89 oriented samples from 14 sites in the Naica District, northern Mexico. Magnetic parameters are used to characterise the samples: saturation magnetization, density, low- and high-temperature magnetic susceptibility, remanence intensity, Koenigsberger ratio, Curie temperature and hysteresis parameters. Rock magnetic properties are controlled by variations in titanomagnetite content and hydrothermal alteration. Post-mineralization hydrothermal alteration seems to be the major event that affected the minerals and magnetic properties. Curie temperatures are characteristic of titanomagnetites or titanomaghemites. Hysteresis parameters indicate that most samples have pseudo-single domain (PSD) magnetic grains. Alternating field (AF) demagnetization and isothermal remanence (IRM) acquisition both indicate that natural and laboratory remanences are carried by MD-PSD spinels in the host rocks. The trend of NRM intensity vs susceptibility suggests that the carrier of remanent and induced magnetization is the same in all cases (spinels). The Koenigsberger ratio ranges from 0.05 to 34.04, indicating the presence of MD and PSD magnetic grains. Constraints on the geometry of the intrusive source body developed in the model of the magnetic anomaly are obtained by quantifying the relative contributions of induced and remanent magnetization components.
Article
Recent advances in high-throughput technologies have made it possible to generate both gene and protein sequence data at an unprecedented rate and scale thereby enabling entirely new “omics”-based approaches towards the analysis of complex biological processes. However, the amount and complexity of data that even a single experiment can produce seriously challenges researchers with limited bioinformatics expertise, who need to handle, analyze and interpret the data before it can be understood in a biological context. Thus, there is an unmet need for tools allowing non-bioinformatics users to interpret large data sets. We have recently developed a method, NNAlign, which is generally applicable to any biological problem where quantitative peptide data is available. This method efficiently identifies underlying sequence patterns by simultaneously aligning peptide sequences and identifying motifs associated with quantitative readouts. Here, we provide a web-based implementation of NNAlign allowing non-expert end-users to submit their data (optionally adjusting method parameters), and in return receive a trained method (including a visual representation of the identified motif) that subsequently can be used as prediction method and applied to unknown proteins/peptides. We have successfully applied this method to several different data sets including peptide microarray-derived sets containing more than 100,000 data points. NNAlign is available online at http://www.cbs.dtu.dk/services/NNAlign.
Article
Bell System Technical Journal, also pp. 623-656 (October)
Article
Methods for alignment of protein sequences typically measure similarity by using a substitution matrix with scores for all possible exchanges of one amino acid with another. The most widely used matrices are based on the Dayhoff model of evolutionary rates. Using a different approach, we have derived substitution matrices from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins. This led to marked improvements in alignments and in searches using queries from each of the groups.
Article
An abstract is not available.
Article
The Protein Data Bank currently contains about 600 data sets of three-dimensional protein coordinates determined by X-ray crystallography or NMR. There is considerable redundancy in the data base, as many protein pairs are identical or very similar in sequence. However, statistical analyses of protein sequence-structure relations require nonredundant data. We have developed two algorithms to extract from the data base representative sets of protein chains with maximum coverage and minimum redundancy. The first algorithm focuses on optimizing a particular property of the selected proteins and works by successive selection of proteins from an ordered list and exclusion of all neighbors of each selected protein. The other algorithm aims at maximizing the size of the selected set and works by successive thinning out of clusters of similar proteins. Both algorithms are generally applicable to other data bases in which criteria of similarity can be defined and relate to problems in graph theory. The largest nonredundant set extracted from the current release of the Protein Data Bank has 155 protein chains. In this set, no two proteins have sequence similarity higher than a certain cutoff (30% identical residues for aligned subsequences longer than 80 residues), yet all structurally unique protein families are represented. Periodically updated lists of representative data sets are available by electronic mail from the file server '[email protected]'. The selection may be useful in statistical approaches to protein folding as well as in the analysis and documentation of the known spectrum of three-dimensional protein structures.
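The first algorithm described in this abstract (successive selection from an ordered list, excluding all neighbors of each selected protein) can be illustrated with a minimal Python sketch. The function names, the toy `identity` measure, and the 0.8 cutoff are assumptions made for illustration only; they are not taken from the original paper, which defines similarity on aligned subsequences.

```python
def select_representatives(ordered_items, is_similar):
    """Greedy selection: walk the list in its given order and keep an item
    only if it is not a neighbor of any item that was already kept."""
    selected = []
    for item in ordered_items:
        if not any(is_similar(item, kept) for kept in selected):
            selected.append(item)
    return selected


def identity(a, b):
    """Hypothetical similarity: fraction of identical positions, no alignment."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


# Toy sequences, already sorted by whatever property should be favored.
seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY", "ACDEFGHAAA"]
reps = select_representatives(seqs, lambda a, b: identity(a, b) > 0.8)
print(reps)
```

Because items earlier in the list win over their neighbors, the ordering encodes the property being optimized (for example, structure quality).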
Article
A graphical method is presented for displaying the patterns in a set of aligned sequences. The characters representing the sequence are stacked on top of each other for each position in the aligned sequences. The height of each letter is made proportional to its frequency, and the letters are sorted so the most common one is on top. The height of the entire stack is then adjusted to signify the information content of the sequences at that position. From these ‘sequence logos’, one can determine not only the consensus sequence but also the relative frequency of bases and the information content (measured in bits) at every position in a site or sequence. The logo displays both significant residues and subtle sequence patterns.
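The stacking rule described in this abstract can be sketched in a few lines of Python. This is a simplified illustration, assuming a gap-free DNA alignment and taking the information content of a column as the maximum entropy minus the observed Shannon entropy; any small-sample correction is omitted, and the function name `logo_heights` is hypothetical.

```python
import math
from collections import Counter


def logo_heights(aligned, alphabet="ACGT"):
    """For each alignment column, return (letter, height in bits) pairs,
    sorted so the tallest (most frequent) letter comes last, i.e. on top."""
    max_bits = math.log2(len(alphabet))
    columns = []
    for col in zip(*aligned):
        counts = Counter(col)
        total = sum(counts[a] for a in alphabet)
        freqs = {a: counts[a] / total for a in alphabet if counts[a] > 0}
        entropy = -sum(p * math.log2(p) for p in freqs.values())
        info = max_bits - entropy  # total stack height for this column
        stack = sorted(((a, p * info) for a, p in freqs.items()),
                       key=lambda pair: pair[1])
        columns.append(stack)
    return columns


# Column 0 is fully conserved; column 1 is uniform over the four bases.
cols = logo_heights(["AA", "AC", "AG", "AT"])
for position, stack in enumerate(cols):
    print(position, stack)
```

A fully conserved DNA column reaches the maximum stack height of 2 bits, while a column with all four bases at equal frequency has a stack height of zero.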