trome, trEST and trGEN: Databases of predicted protein sequences

Abstract

We previously introduced two protein databases (trEST and trGEN) of hypothetical protein sequences predicted from EST and HTG sequences, respectively. Here, we present updates to these two databases and a new database (trome), which uses alignments of EST data to HTG or full genomes to generate virtual transcripts and coding sequences. The new database is of higher quality and, because it stores the information in a much denser format, is much smaller. All three databases are in a Swiss-Prot-like format and are updated weekly (trEST and trGEN) or every 3 months (trome). They can be downloaded by anonymous ftp from ftp://ftp.isrec.isb-sib.ch/pub/databases.
Peter Sperisen (1,*), Christian Iseli (1,3), Marco Pagni (1,3), Brian J. Stevenson (1,3), Philipp Bucher (1,2) and C. Victor Jongeneel (1,3)
(1) Swiss Institute of Bioinformatics, (2) ISREC and (3) Office of Information Technology, Ludwig Institute for Cancer Research, Chemin des Boveresses 155, 1066 Epalinges s/Lausanne, Switzerland
Received September 15, 2003; Revised and Accepted September 29, 2003
DESCRIPTION OF DATABASES
High-throughput genome (HTG) and expressed sequence tag
(EST) sequences are currently the most abundant nucleotide
sequence classes in the public databases. The large volume,
high degree of fragmentation and lack of gene structure
annotations prevent efficient searches of HTG and EST data
for protein sequence homologies by standard search methods.
We have compiled three databases of predicted and annotated
protein sequences to facilitate the use of proteomics tools. All
databases are distributed in a Swiss-Prot-like format, with
features that are specific to the databases presented below.
trome
trome is an attempt to map transcribed RNA from different
sources to the NCBI RefSeq genome sequence (1,2). For
Homo sapiens, for example, the transcribed RNA sources are:
the human EST section of the EMBL database (3), the human
HTC section of the EMBL database, human mRNA documented
in the EMBL database, ORESTES sequences from
the LICR/FAPESP Human Cancer Genome project (4,5),
human mRNA documented in the NCBI-curated RefSeq
database [http://www.ncbi.nih.gov/RefSeq (6)], the published
CHR21 gene list and SEREX sequences
[http://www2.licr.org/CancerImmunomeDB (7)]. For other species, similar
sources are used. Currently four species are represented:
H.sapiens, Mus musculus, Drosophila melanogaster and
Caenorhabditis elegans (Table 1).
The mapping of the transcribed RNA sources to the genome is
a three-step process (1,2):
(i) The program Megablast (8) is used to identify pairwise
similarities between all known transcript sequences and the
genomic data.
(ii) For each pair of matching RNA and genomic sequences,
local alignments are generated using a modified version of
sim4 (9).
(iii) The output of sim4 is filtered to eliminate all
alignments that do not contain at least one region (exon)
matching with at least 95% identity over its high-quality
part and 88% over the remainder.
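As a rough illustration of the filter in step (iii), the sketch below keeps an alignment only if at least one of its exons reaches both identity thresholds. The `Exon` and `Alignment` records, their field names and the assumption that each exon carries two precomputed identity fractions are made up for this example; they do not reflect the actual sim4 output format or the code used to build trome.

```python
from dataclasses import dataclass
from typing import List

# Thresholds quoted in step (iii) of the text.
MIN_IDENTITY_HIGH_QUALITY = 0.95
MIN_IDENTITY_REMAINDER = 0.88


@dataclass
class Exon:
    """One aligned region (hypothetical record, not the sim4 format)."""
    identity_high_quality: float  # identity over the high-quality part (0..1)
    identity_remainder: float     # identity over the rest of the region (0..1)


@dataclass
class Alignment:
    """A transcript-to-genome alignment made of one or more exons."""
    transcript_id: str
    contig_id: str
    exons: List[Exon]


def exon_passes(exon: Exon) -> bool:
    """True if the exon reaches both identity thresholds."""
    return (exon.identity_high_quality >= MIN_IDENTITY_HIGH_QUALITY
            and exon.identity_remainder >= MIN_IDENTITY_REMAINDER)


def filter_alignments(alignments: List[Alignment]) -> List[Alignment]:
    """Keep alignments that contain at least one exon passing the thresholds."""
    return [a for a in alignments if any(exon_passes(e) for e in a.exons)]
```

Under this reading, a single well-matching exon is enough to retain the whole alignment; an alignment with no exons, or with exons that each miss one of the two thresholds, is dropped.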
The output of sim4 is then used to generate directed acyclic
graphs using the program tromer (a locally developed program
that automates the reconstruction of transcripts from
transcript-to-genome mappings). These graphs, in which the edges
represent exons or introns and the nodes represent splice donors
or acceptors, describe transcribed loci of the genome and contain in
a condensed form the information about all alternative
splice variants that are experimentally documented. They
can be used to reconstruct virtual transcripts from the
underlying genomic sequence by following a path from 3′ tags
along experimentally verified exon boundaries. Transcript
generation is a three-step process: (i) a seed edge is selected;
(ii) this edge is extended toward the 5′ end and (iii) toward the
3′ end. The seed edge is first selected among unused 5′-most
exons, then among any unused edges. The extension process
always attempts to include unused edges that were derived
from the same RNA elements as the seed edge. The resulting
virtual transcripts are translated into protein sequences using
the program ESTScan (10). These protein sequences are the
basis of the trome database. ESTScan detects the coding frame
and corrects most frameshift errors introduced by sequencing
errors, but predicts their position only within a range of a few amino
acids. Simulation experiments have shown that in 95% of the
cases the range is seven or fewer amino acids. To visualize this
uncertainty, the FT key UNSURE is used, indicating the
range within which the predicted sequence is more likely to
contain errors. However, because the transcribed
RNA data are mapped onto the genome, this is a rare event in contrast to
corrections found in the database trEST (see below). The new
FT key EXON was introduced to indicate the positions of the
exon boundaries with respect to the NT contig:
FT   EXON          1    173       Exon E0;NT_026943[46041..46562].
FT   EXON        174    174       AA on splice site: tt/g -> L.
FT   EXON        175    405       Exon E1;NT_026943[56062..56757].
FT   EXON        406    406       AA on splice site: a/tg -> M.
FT   EXON        407    514       Exon E2;NT_026943[63614..64087].
FT   UNSURE      507    514       Frameshift error at pos.: 514;
                                  base inserted:
*To whom correspondence should be addressed. Tel: +41 21 6925956; Fax: +41 21 6925945; Email: peter.sperisen@isb-sib.ch
Nucleic Acids Research, 2004, Vol. 32, Database issue D509–D511
DOI: 10.1093/nar/gkh067
Nucleic Acids Research, Vol. 32, Database issue © Oxford University Press 2004; all rights reserved
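The transcript generation procedure described in this section (seed selection, then extension toward the 5′ end and toward the 3′ end) can be sketched on a toy graph as follows. This is only an illustration and not the tromer implementation: the `Edge` record, the integer node labels, the tie-breaking by edge name and the exact ranking of candidate edges are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Edge:
    """An exon or intron between two splice-site nodes (hypothetical record)."""
    name: str
    start: int                # node at the 5' side of the edge
    end: int                  # node at the 3' side of the edge
    sources: FrozenSet[str]   # RNA elements that support this edge


def _choose(candidates: List[Edge], used: Set[str], seed_sources: FrozenSet[str]) -> Edge:
    """Prefer unused edges that share an RNA element with the seed edge."""
    def rank(edge: Edge):
        shared = bool(edge.sources & seed_sources)
        fresh = edge.name not in used
        return (not (shared and fresh), not shared, not fresh, edge.name)
    return min(candidates, key=rank)


def generate_transcripts(edges: List[Edge]) -> List[List[str]]:
    """Return paths (lists of edge names) until every edge has been used once."""
    incoming = defaultdict(list)
    outgoing = defaultdict(list)
    for edge in edges:
        outgoing[edge.start].append(edge)
        incoming[edge.end].append(edge)

    used: Set[str] = set()
    transcripts: List[List[str]] = []
    while True:
        unused = [e for e in edges if e.name not in used]
        if not unused:
            break
        # (i) Seed: first an unused 5'-most edge (nothing enters its start
        # node), otherwise any unused edge.
        five_prime_most = [e for e in unused if not incoming[e.start]]
        seed = (five_prime_most or unused)[0]
        used.add(seed.name)
        path = [seed]
        # (ii) Extend toward the 5' end.
        while incoming[path[0].start]:
            nxt = _choose(incoming[path[0].start], used, seed.sources)
            used.add(nxt.name)
            path.insert(0, nxt)
        # (iii) Extend toward the 3' end.
        while outgoing[path[-1].end]:
            nxt = _choose(outgoing[path[-1].end], used, seed.sources)
            used.add(nxt.name)
            path.append(nxt)
        transcripts.append([e.name for e in path])
    return transcripts
```

Because the graph is acyclic, each extension loop terminates, and because every round consumes at least its seed edge, the outer loop ends once all edges have appeared in some path. For a locus with three exons and one exon-skipping intron, the sketch produces one path through all exons and a second path that uses the skipping intron.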
trEST and trGEN
trEST is an attempt to produce contigs from clusters of ESTs
and to translate them into proteins (11). In the past 2 years the
following improvements have been introduced:
(i) Initially, trEST was composed only of protein sequences
that were generated through the translation of contigs
produced from UniGene clusters (12) using ESTScan.
Protein sequences of coding ESTs that are not present in any
UniGene cluster have now also been added to trEST.
(ii) The species list has been extended (see Table 1).
(iii) As described for the trome database, the FT key
UNSURE is used to reflect the uncertainty range in
frameshift error correction by the program ESTScan, as well as the
correction of internal stop codons. The parameters used with
ESTScan are adapted to the error-prone contigs produced from
UniGene clusters and to the ESTs, since frameshift errors
and internal stop codons are found more frequently than
in the database trome.
trEST is cross-referenced to the UniGene database for the
entries that are based on UniGene clusters and to the EMBL
database for the ESTs that do not belong to UniGene clusters.
The amino acid sequences of the trGEN database are
predicted from genomic sequences from the NCBI database or
from HTG sequences from the EMBL database. The
sequences are searched for putative genes and their coding
regions using the program Genscan (13). The following
improvements were introduced:
(i) The species Rattus norvegicus was added.
(ii) The predictions for H.sapiens, M.musculus,
R.norvegicus and A.thaliana are now made on the basis of
the NCBI reference genome sequences (NT contigs).
(iii) The new FT key GENSCAN was introduced. It
contains the predictions made by the program Genscan. These
are: FIRST EXON, INTERNAL EXON, LAST EXON and
SINGLE EXON, together with their associated p-values (sum
over all parses containing the exon) calculated by Genscan,
which indicate the degree of certainty that
should be ascribed to the exons predicted by the program:
FT   GENSCAN       1    226       FIRST EXON; p-value: 0.159.
FT   GENSCAN     227    262       INTERNAL EXON; p-value: 0.093.
FT   GENSCAN     263    269       INTERNAL EXON; p-value: 0.065.
FT   GENSCAN     270    320       LAST EXON; p-value: 0.074.
(iv) The ID is composed of either the EMBL or NCBI
accession number of the contig on which the protein was
predicted, plus a number (_#) that enumerates the proteins as
they are found on the contig.
(v) trGEN is cross-referenced to either the EMBL or the
NCBI RefSeq database, with a cross-link to the underlying
contig.
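The GENSCAN feature lines shown in item (iii) can be read with a few lines of code, for example to select exons above a chosen p-value. The sketch below assumes that the fields of an FT line are separated by whitespace and that the description always has the form shown in the example; the database user manuals should be consulted for the exact column layout. The `GenscanExon` record and the function name are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List

# "FT <key> <from> <to> <description>", fields separated by whitespace (assumed).
FT_LINE = re.compile(r"^FT\s+(\S+)\s+(\d+)\s+(\d+)\s+(.*)$")
# "<exon type>; p-value: <number>" as in the GENSCAN example lines.
GENSCAN_DESC = re.compile(r"^(.+?);\s*p-value:\s*(\d+(?:\.\d+)?)")


@dataclass
class GenscanExon:
    exon_type: str   # FIRST EXON, INTERNAL EXON, LAST EXON or SINGLE EXON
    start: int       # "from" position as given in the FT line
    end: int         # "to" position as given in the FT line
    p_value: float   # value reported after "p-value:"


def parse_genscan_features(lines: Iterable[str]) -> List[GenscanExon]:
    """Extract GENSCAN features from FT lines; other lines are skipped."""
    exons = []
    for line in lines:
        ft = FT_LINE.match(line.strip())
        if not ft or ft.group(1) != "GENSCAN":
            continue
        desc = GENSCAN_DESC.match(ft.group(4))
        if not desc:
            continue
        exons.append(GenscanExon(
            exon_type=desc.group(1).strip(),
            start=int(ft.group(2)),
            end=int(ft.group(3)),
            p_value=float(desc.group(2)),
        ))
    return exons
```

Applied to the four example lines above, the parser returns four records, and filtering on `p_value >= 0.09` keeps the FIRST EXON and the first INTERNAL EXON.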
UPDATE TO THE DATABASES
The trEST and the trGEN databases are updated on a weekly
basis. The trome database is updated roughly every 3 months.
ACCESS
FTP
The files for the three databases are available by anonymous
ftp from the directories ftp://ftp.isrec.isb-sib.ch/pub/databases/trest,
ftp://ftp.isrec.isb-sib.ch/pub/databases/trgen
and ftp://ftp.isrec.isb-sib.ch/pub/databases/trome. In addition,
user manuals that provide more details about the format of
each database are available in the individual directories.
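For scripted access, the directory layout given above can be wrapped in a small helper. The sketch below uses the host and directory names quoted in the text together with Python's standard `ftplib`; the function names are hypothetical, the file names inside each directory are not assumed, and the sketch does not check whether the server is reachable.

```python
from ftplib import FTP
from typing import List

# Host and paths as quoted in the text.
FTP_HOST = "ftp.isrec.isb-sib.ch"
BASE_DIR = "/pub/databases"
DATABASES = ("trest", "trgen", "trome")


def database_dir(name: str) -> str:
    """Return the server directory of one of the three databases."""
    if name not in DATABASES:
        raise ValueError(f"unknown database: {name!r}")
    return f"{BASE_DIR}/{name}"


def database_url(name: str) -> str:
    """Return the ftp URL of a database directory."""
    return f"ftp://{FTP_HOST}{database_dir(name)}"


def list_database_files(name: str, timeout: float = 30.0) -> List[str]:
    """List the files in a database directory via anonymous ftp (needs network)."""
    with FTP(FTP_HOST, timeout=timeout) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(database_dir(name))
        return ftp.nlst()
```

`database_url("trome")` reproduces the URL given in the text, while `list_database_files` opens a connection only when it is called.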
WWW
Several web pages offer services that include the trome, trEST
and trGEN databases. http://www.ch.embnet.org/software/fetch.html
allows one to retrieve individual entries of trome,
trEST and trGEN. http://www.ch.embnet.org/software/aBLAST.html
allows the three databases of hypothetical
proteins to be searched using BLAST.
ACKNOWLEDGEMENTS
This work was supported by a grant from the Swiss federal
office for education and science (OFES): 01.0101, which is part
of the TEMBLOR project (QLRI-CT-2001-00015) of the
Quality of Life and Management of Living Resources
Programme, and by the Ludwig Institute for Cancer Research.
Table 1. Number of entries in each database for each species represented
(established September 5, 2003)

Species                      trGEN     trEST     trome
Homo sapiens                196110   1709305    121072
Mus musculus                225759   1018677     95754
Rattus norvegicus           293151    274511      n.a.
Drosophila melanogaster     128071     28254     20717
Arabidopsis thaliana         31424     39924      n.a.
Oryza sativa                111723     57269      n.a.
Bos taurus                    n.a.     88378      n.a.
Danio rerio                   n.a.    125453      n.a.
Hordeum vulgare               n.a.     79315      n.a.
Triticum aestivum             n.a.    146462      n.a.
Xenopus laevis                n.a.    116084      n.a.
Zea mays                      n.a.    121846      n.a.
Caenorhabditis elegans        n.a.      n.a.     25841
Total                       986238   3805478    263384

Due to the limited amount of data, not all species are represented in all
databases. Missing data are indicated by 'n.a.'.

REFERENCES
1. Iseli,C., Stevenson,B.J., de Souza,S.J., Samaia,H.B., Camargo,A.A.,
Buetow,K.H., Strausberg,R.L., Simpson,A.J., Bucher,P. and
Jongeneel,C.V. (2002) Long-range heterogeneity at the 3′ ends of human
mRNAs. Genome Res., 12, 1068–1074.
2. Stevenson,B.J., Iseli,C., Beutler,B. and Jongeneel,C.V. (2003) Use of
transcriptome data to unravel the fine structure of genes involved in
sepsis. J. Infect. Dis., 187 (Suppl. 2), S308–S314.
3. Stoesser,G., Sterk,P., Tuli,M.A., Stoehr,P.J. and Cameron,G.N. (1997)
The EMBL Nucleotide Sequence Database. Nucleic Acids Res., 25, 7–14.
4. Dias,N.E., Correa,R.G., Verjovski-Almeida,S., Briones,M.R.,
Nagai,M.A., da Silva,W.,Jr, Zago,M.A., Bordin,S., Costa,F.F.,
Goldman,G.H. et al. (2000) Shotgun sequencing of the human
transcriptome with ORF expressed sequence tags. Proc. Natl Acad. Sci.
USA, 97, 3491–3496.
5. Camargo,A.A., Samaia,H.P., Dias-Neto,E., Simao,D.F., Migotto,I.A.,
Briones,M.R., Costa,F.F., Nagai,M.A., Verjovski-Almeida,S., Zago,M.A.
et al. (2001) The contribution of 700 000 ORF sequence tags to the
definition of the human transcriptome. Proc. Natl Acad. Sci. USA, 98,
12103–12108.
6. Pruitt,K.D., Katz,K.S., Sicotte,H. and Maglott,D.R. (2000) Introducing
RefSeq and LocusLink: curated human genome resources at the NCBI.
Trends Genet., 16, 44–47.
7. Tureci,O., Sahin,U. and Pfreundschuh,M. (1997) Serological analysis of
human tumor antigens: molecular definition and implications. Mol. Med.
Today, 3, 342–349.
8. Zhang,Z., Schwartz,S., Wagner,L. and Miller,W. (2000) A greedy
algorithm for aligning DNA sequences. J. Comput. Biol., 7, 203–214.
9. Florea,L., Hartzell,G., Zhang,Z., Rubin,G.M. and Miller,W. (1998) A
computer program for aligning a cDNA sequence with a genomic DNA
sequence. Genome Res., 8, 967–974.
10. Iseli,C., Jongeneel,V. and Bucher,P. (1999) ESTScan: a program for
detecting, evaluating, and reconstructing potential coding regions in EST
sequences. Proceedings of the Seventh ISMB, pp. 138–148.
11. Pagni,M., Iseli,C., Junier,T., Falquet,L., Jongeneel,V. and Bucher,P.
(2001) trEST, trGEN and Hits: access to databases of predicted protein
sequences. Nucleic Acids Res., 29, 148–151.
12. Schuler,G.D. (1997) Pieces of the puzzle: expressed sequence tags and
the catalog of human genes. J. Mol. Med., 75, 694–698.
13. Burge,C.B. and Karlin,S. (1998) Finding the genes in genomic DNA.
Curr. Opin. Struct. Biol., 8, 346–354.