
trome, trEST and trGEN: Databases of predicted protein sequences

February 2004
Nucleic Acids Research 32(Database issue):D509-11


DOI:10.1093/nar/gkh067



We previously introduced two new protein databases (trEST and trGEN) of hypothetical protein sequences predicted from EST and HTG sequences, respectively. Here, we present the updates made on these two databases plus a new database (trome), which uses alignments of EST data to HTG or full genomes to generate virtual transcripts and coding sequences. This new database is of higher quality and since it contains the information in a much denser format it is of much smaller size. These new databases are in a Swiss-Prot-like format and are updated on a weekly basis (trEST and trGEN) or every 3 months (trome). They can be downloaded by anonymous ftp from ftp://ftp.isrec.isb-sib.ch/pub/databases.


Peter Sperisen*, Christian Iseli, Marco Pagni, Brian J. Stevenson, Philipp Bucher and C. Victor Jongeneel

Swiss Institute of Bioinformatics, ISREC and Office of Information Technology, Ludwig Institute for Cancer Research, Chemin des Boveresses 155, 1066 Epalinges s/Lausanne, Switzerland

Received September 15, 2003; Revised and Accepted September 29, 2003


DESCRIPTION OF DATABASES

High-throughput genome (HTG) and expressed sequence tag (EST) sequences are currently the most abundant nucleotide sequence classes in the public databases. The large volume, high degree of fragmentation and lack of gene structure annotations prevent efficient searches of HTG and EST data for protein sequence homologies by standard search methods. We have compiled three databases of predicted and annotated protein sequences to facilitate the use of proteomics tools. All databases are distributed in a Swiss-Prot-like format, with features that are specific to the databases presented below.

trome

trome is an attempt to map transcribed RNA from different sources to the NCBI RefSeq genome sequence (1,2). As an example, for Homo sapiens the transcribed RNA sources are: the human EST section of the EMBL database (3), the human HTC section of the EMBL database, human mRNA documented in the EMBL database, ORESTES sequences from the LICR/FAPESP Human Cancer Genome project (4,5), human mRNA documented in the NCBI-curated RefSeq database [http://www.ncbi.nih.gov/RefSeq (6)], published CHR21 gene list and SEREX sequences [http://www2.licr.org/CancerImmunomeDB (7)]. For other species, similar sources are used. Currently four species are represented: H.sapiens, Mus musculus, Drosophila melanogaster, Arabidopsis thaliana and Caenorhabditis elegans (Table 1).

The mapping of the transcribed RNA sources to the genome is a three-step process (1,2):

(i) The program Megablast (8) is used to identify pairwise similarities between all known transcript sequences and the genomic data.

(ii) For each pair of matching RNA and genomic sequences, local alignments are generated using a modified version of sim4 (9).

(iii) The output of sim4 is filtered to eliminate all alignments that do not contain at least one region (exon) matching with at least 95% identity over its high-quality part and 88% over the remainder.
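Step (iii) can be sketched as a simple predicate over the exons of an alignment. This is only an illustration: the `Exon` record and its two identity fields are hypothetical and do not reflect the actual output format of sim4; only the two thresholds (95% and 88%) are taken from the text.

```python
from dataclasses import dataclass
from typing import List

HQ_MIN_IDENTITY = 0.95    # threshold for the high-quality part (from the text)
REST_MIN_IDENTITY = 0.88  # threshold for the remainder (from the text)


@dataclass
class Exon:
    """One aligned region of a transcript-to-genome alignment (hypothetical layout)."""
    identity_hq: float    # fraction of identical bases in the high-quality part
    identity_rest: float  # fraction of identical bases in the remainder


def exon_passes(exon: Exon) -> bool:
    """True if the exon meets both identity thresholds."""
    return (exon.identity_hq >= HQ_MIN_IDENTITY
            and exon.identity_rest >= REST_MIN_IDENTITY)


def keep_alignment(exons: List[Exon]) -> bool:
    """Keep an alignment only if at least one of its exons passes."""
    return any(exon_passes(e) for e in exons)


def filter_alignments(alignments: List[List[Exon]]) -> List[List[Exon]]:
    """Drop every alignment that has no exon meeting the thresholds."""
    return [a for a in alignments if keep_alignment(a)]
```

An alignment with no exons at all is discarded by this sketch, since `any` over an empty list is false.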

The output of sim4 is then used to generate directed acyclic graphs using the program tromer (a locally developed program that automates the reconstruction of transcripts from transcript-to-genome mappings). These graphs, in which the edges represent exons/introns and the nodes represent splice donors/acceptors, represent transcribed loci of the genome and contain in a condensed form the information about all possible alternative splice variants that are experimentally documented. They can be used to reconstruct virtual transcripts from the underlying genomic sequence following a path from 3′ tags along experimentally verified exon boundaries. Transcript generation is a three-step process: (i) a seed edge is selected; (ii) this edge is extended toward the 5′ end and (iii) toward the 3′ end. The seed edge is first selected among unused 5′-most exons, then among any unused edge. The extension process always attempts to include unused edges that were derived from the same RNA elements as the seed edge. The resulting virtual transcripts are translated into protein sequences using the program ESTScan (10). These protein sequences are the basis of the trome database. ESTScan detects the coding frame and corrects most frameshift errors introduced by sequencing errors, but predicts their position only within a range of a few amino acids. Simulation experiments have shown that in 95% of the cases the range is seven or fewer amino acids. To visualize this uncertainty, the FT key UNSURE is used, indicating the range within which the predicted sequence is more likely to contain errors. However, because the transcribed RNA data are mapped onto the genome, this is a rare event in contrast to the corrections found in the database trEST (see below). The new FT key EXON was introduced to indicate the positions of the exon boundaries with respect to the NT contig:

FT EXON 1 173 Exon E0;NT_026943[46041..46562].
FT EXON 174 174 AA on splice site: tt/g -> L.
FT EXON 175 405 Exon E1;NT_026943[56062..56757].
FT EXON 406 406 AA on splice site: a/tg -> M.
FT EXON 407 514 Exon E2;NT_026943[63614..64087]
FT UNSURE 507 514 Frameshift error at pos.: 514;
base inserted:

*To whom correspondence should be addressed. Tel: +41 21 6925956; Fax: +41 21 6925945; Email: peter.sperisen@isb-sib.ch
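The transcript generation procedure described above (seed selection, 5′ extension, 3′ extension) can be sketched as follows. This is not the tromer implementation, whose source is not shown here. As a simplifying assumption, exons are modeled as records linked to their upstream and downstream neighbours rather than as edges between splice-site nodes, and the `Exon` class, `pick_next` and `build_transcripts` are hypothetical names. The tie-breaking rule (prefer unused exons, then exons sharing RNA sources with the seed) is one possible reading of the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Exon:
    """An exon of a transcribed locus (hypothetical, simplified record)."""
    sources: Set[str]                                     # RNA sequences supporting it
    upstream: List[str] = field(default_factory=list)     # exons spliced to its 5' side
    downstream: List[str] = field(default_factory=list)   # exons spliced to its 3' side


def pick_next(candidates: List[str], graph: Dict[str, Exon],
              taken: Set[str], seed_sources: Set[str]) -> Optional[str]:
    """Choose a neighbour: unused exons first, then those sharing sources with the seed."""
    if not candidates:
        return None
    return min(candidates,
               key=lambda n: (n in taken, not (graph[n].sources & seed_sources)))


def build_transcripts(graph: Dict[str, Exon]) -> List[List[str]]:
    """Generate exon paths until every exon of the (acyclic) graph has been used."""
    used: Set[str] = set()
    transcripts: List[List[str]] = []
    while len(used) < len(graph):
        # (i) seed: unused 5'-most exons first, then any unused exon
        unused = [n for n in graph if n not in used]
        five_prime_most = [n for n in unused if not graph[n].upstream]
        seed = (five_prime_most or unused)[0]
        seed_sources = graph[seed].sources
        path = [seed]
        # (ii) extend toward the 5' end
        while True:
            nxt = pick_next(graph[path[0]].upstream, graph, used | set(path), seed_sources)
            if nxt is None:
                break
            path.insert(0, nxt)
        # (iii) extend toward the 3' end
        while True:
            nxt = pick_next(graph[path[-1]].downstream, graph, used | set(path), seed_sources)
            if nxt is None:
                break
            path.append(nxt)
        used.update(path)
        transcripts.append(path)
    return transcripts


# Toy locus with two alternative middle exons (E1 or E2).
graph = {
    "E0": Exon({"rna1", "rna2"}, downstream=["E1", "E2"]),
    "E1": Exon({"rna1"}, upstream=["E0"], downstream=["E3"]),
    "E2": Exon({"rna2"}, upstream=["E0"], downstream=["E3"]),
    "E3": Exon({"rna1", "rna2"}, upstream=["E1", "E2"]),
}
print(build_transcripts(graph))  # [['E0', 'E1', 'E3'], ['E0', 'E2', 'E3']]
```

In the toy locus the second transcript is seeded from E2, the only exon left unused after the first pass, and is then extended in both directions through exons that were already used.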

trEST and trGEN

trEST is an attempt to produce contigs from clusters of ESTs and to translate them into proteins (11). In the past 2 years the following improvements have been introduced:

(i) Initially trEST was composed only of protein sequences that were generated through the translation of contigs produced from UniGene clusters (12) using ESTScan. Protein sequences of coding ESTs that are not present in any UniGene cluster have now also been introduced into trEST.

(ii) The species list has been extended (see Table 1).

(iii) Exactly as described for the trome database, the FT key UNSURE is used to reflect the uncertainty range in frameshift error correction by the program ESTScan, as well as the correction of internal stop codons. The parameters used with ESTScan are adapted to the error-prone contigs produced from UniGene clusters and to the ESTs, since frameshift errors and internal stop codons are found more frequently than in the database trome.

trEST is cross-referenced to the UniGene database for the entries that are based on UniGene clusters and to the EMBL database for the ESTs that do not belong to UniGene clusters.

The amino acid sequences of the trGEN database are predicted from genomic sequences from the NCBI database or from HTG sequences from the EMBL database. The sequences are searched for putative genes and their coding regions using the program Genscan (13). The following improvements were introduced:

(i) The species Rattus norvegicus was added.

(ii) The predictions for H.sapiens, M.musculus, R.norvegicus and A.thaliana are now made on the basis of the NCBI reference genome sequences (NT contigs).

(iii) The new FT key GENSCAN was introduced. It contains the predictions made by the program Genscan. These are: FIRST EXON, INTERNAL EXON, LAST EXON and SINGLE EXON, together with their associated p-values (sum over all parses containing the exon) calculated by Genscan, which indicate the degree of certainty that should be ascribed to exons predicted by the program:

FT GENSCAN 1 226 FIRST EXON; p-value: 0.159.
FT GENSCAN 227 262 INTERNAL EXON; p-value: 0.093.
FT GENSCAN 263 269 INTERNAL EXON; p-value: 0.065.
FT GENSCAN 270 320 LAST EXON; p-value: 0.074.

(iv) The ID is composed of either the EMBL or NCBI accession number of the contig on which the protein was predicted, plus a number (_#) that enumerates the proteins as they are found on the contig.

(v) trGEN is cross-referenced to either the EMBL or the NCBI RefSeq database, with a cross-link to the underlying contig.
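The EXON, UNSURE and GENSCAN feature lines shown in this article can be read with a few regular expressions. The sketch below assumes whitespace-separated fields exactly as printed here; the distributed files may use fixed-width columns, as Swiss-Prot does, so the user manuals in the ftp directories remain the authority on the format. The names `Feature`, `parse_ft_line`, `exon_location` and `genscan_p_value` are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Feature(NamedTuple):
    key: str          # e.g. EXON, UNSURE, GENSCAN
    start: int        # first amino acid position
    end: int          # last amino acid position
    description: str  # free text after the positions


FT_LINE = re.compile(r"^FT\s+(\S+)\s+(\d+)\s+(\d+)\s*(.*)$")
CONTIG_RANGE = re.compile(r"(\w+)\[(\d+)\.\.(\d+)\]")
P_VALUE = re.compile(r"p-value:\s*(\d+(?:\.\d+)?)")


def parse_ft_line(line: str) -> Optional[Feature]:
    """Parse one feature line; return None for anything else (e.g. continuation lines)."""
    m = FT_LINE.match(line.rstrip())
    if m is None:
        return None
    key, start, end, description = m.groups()
    return Feature(key, int(start), int(end), description)


def exon_location(feature: Feature) -> Optional[Tuple[str, int, int]]:
    """Extract (contig, start, end) from an EXON description such as 'NT_026943[46041..46562]'."""
    m = CONTIG_RANGE.search(feature.description)
    if m is None:
        return None  # e.g. 'AA on splice site' lines carry no contig coordinates
    return m.group(1), int(m.group(2)), int(m.group(3))


def genscan_p_value(feature: Feature) -> Optional[float]:
    """Extract the p-value from a GENSCAN description, if present."""
    m = P_VALUE.search(feature.description)
    return float(m.group(1)) if m else None
```

For example, `parse_ft_line("FT GENSCAN 1 226 FIRST EXON; p-value: 0.159.")` yields a `Feature` with key `GENSCAN` and positions 1 to 226, from which `genscan_p_value` extracts 0.159.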

UPDATE TO THE DATABASES

The trEST and the trGEN databases are updated on a weekly basis. The trome database is updated roughly every 3 months.

ACCESS

FTP

The files for the three databases are available by anonymous ftp from the directories ftp://ftp.isrec.isb-sib.ch/pub/databases/trest, ftp://ftp.isrec.isb-sib.ch/pub/databases/trgen and ftp://ftp.isrec.isb-sib.ch/pub/databases/trome. In addition, user manuals that provide more details about the format of each database are found in the individual directories.
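A minimal sketch of anonymous ftp access using Python's standard `ftplib` is given below. The host and directory names are those stated above. The function names are hypothetical, no file names are assumed (the sketch only lists a directory), and whether the server is still reachable is not something this example can guarantee.

```python
from ftplib import FTP
from typing import List

FTP_HOST = "ftp.isrec.isb-sib.ch"        # host given in the text
BASE_DIR = "/pub/databases"              # base directory given in the text
DATABASES = ("trest", "trgen", "trome")  # subdirectories given in the text


def database_dir(name: str) -> str:
    """Return the server-side directory of one of the three databases."""
    if name not in DATABASES:
        raise ValueError(f"unknown database: {name!r}")
    return f"{BASE_DIR}/{name}"


def database_url(name: str) -> str:
    """Return the ftp URL of a database directory."""
    return f"ftp://{FTP_HOST}{database_dir(name)}"


def list_database_files(name: str, timeout: float = 30.0) -> List[str]:
    """List the files of a database directory over anonymous ftp (needs network access)."""
    with FTP(FTP_HOST, timeout=timeout) as ftp:
        ftp.login()                  # no arguments means anonymous login
        ftp.cwd(database_dir(name))
        return ftp.nlst()
```

Calling `list_database_files("trome")` would return whatever file names the server reports; individual files could then be fetched with `FTP.retrbinary`.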

WWW

Several web pages offer services that include the trome, trEST and trGEN databases. http://www.ch.embnet.org/software/fetch.html allows one to retrieve individual entries of trome, trEST and trGEN. http://www.ch.embnet.org/software/aBLAST.html allows the three databases of hypothetical proteins to be searched using BLAST.

ACKNOWLEDGEMENTS

This work was supported by a grant from the Swiss federal office for education and health (OFES): 01.0101 which is part of the TEMBLOR project (QLRI-CT-2001-00015) of the Quality of Life and Management of living Resources Programme and by the Ludwig Institute for Cancer Research.

REFERENCES

1. Iseli,C., Stevenson,B.J., de Souza,S.J., Samaia,H.B., Camargo,A.A., Buetow,K.H., Strausberg,R.L., Simpson,A.J., Bucher,P. and Jongeneel,C.V. (2002) Long-range heterogeneity at the 3′ ends of human mRNAs. Genome Res., 12, 1068–1074.

2. Stevenson,B.J., Iseli,C., Beutler,B. and Jongeneel,C.V. (2003) Use of transcriptome data to unravel the fine structure of genes involved in sepsis. J. Infect. Dis., 187 (Suppl. 2), S308–S314.

3. Stoesser,G., Sterk,P., Tuli,M.A., Stoehr,P.J. and Cameron,G.N. (1997) The EMBL Nucleotide Sequence Database. Nucleic Acids Res., 25, 7–14.

4. Dias,N.E., Correa,R.G., Verjovski-Almeida,S., Briones,M.R., Nagai,M.A., da Silva,W.,Jr, Zago,M.A., Bordin,S., Costa,F.F., Goldman,G.H. et al. (2000) Shotgun sequencing of the human transcriptome with ORF expressed sequence tags. Proc. Natl Acad. Sci. USA, 97, 3491–3496.

5. Camargo,A.A., Samaia,H.P., Dias-Neto,E., Simao,D.F., Migotto,I.A., Briones,M.R., Costa,F.F., Nagai,M.A., Verjovski-Almeida,S., Zago,M.A. et al. (2001) The contribution of 700 000 ORF sequence tags to the definition of the human transcriptome. Proc. Natl Acad. Sci. USA, 98, 12103–12108.

6. Pruitt,K.D., Katz,K.S., Sicotte,H. and Maglott,D.R. (2000) Introducing RefSeq and LocusLink: curated human genome resources at the NCBI. Trends Genet., 16, 44–47.

7. Tureci,O., Sahin,U. and Pfreundschuh,M. (1997) Serological analysis of human tumor antigens: molecular definition and implications. Mol. Med. Today, 3, 342–349.

8. Zhang,Z., Schwartz,S., Wagner,L. and Miller,W. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol., 7, 203–214.

9. Florea,L., Hartzell,G., Zhang,Z., Rubin,G.M. and Miller,W. (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res., 8, 967–974.

10. Iseli,C., Jongeneel,V. and Bucher,P. (1999) ESTScan: A program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proceedings of the Seventh ISMB, pp. 138–148.

11. Pagni,M., Iseli,C., Junier,T., Falquet,L., Jongeneel,V. and Bucher,P. (2001) trEST, trGEN and Hits: access to databases of predicted protein sequences. Nucleic Acids Res., 29, 148–151.

12. Schuler,G.D. (1997) Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med., 75, 694–698.

13. Burge,C.B. and Karlin,S. (1998) Finding the genes in genomic DNA. Curr. Opin. Struct. Biol., 8, 346–354.

Table 1. Number of entries in each database for each species represented (established September 5, 2003)

Species                      trGEN     trEST    trome
Homo sapiens                196110   1709305   121072
Mus musculus                225759   1018677    95754
Rattus norvegicus           293151    274511     n.a.
Drosophila melanogaster     128071     28254    20717
Arabidopsis thaliana         31424     39924     n.a.
Oryza sativa                111723     57269     n.a.
Bos taurus                    n.a.     88378     n.a.
Danio rerio                   n.a.    125453     n.a.
Hordeum vulgare               n.a.     79315     n.a.
Triticum aestivum             n.a.    146462     n.a.
Xenopus laevis                n.a.    116084     n.a.
Zea mays                      n.a.    121846     n.a.
Caenorhabditis elegans        n.a.      n.a.    25841
Total                       986238   3805478   263384

Due to the limited amount of data, not all species are represented in all databases. Missing data are indicated by 'n.a.'.
