
A gene atlas of the mouse and human protein-encoding transcriptomes

May 2004
Proceedings of the National Academy of Sciences 101(16):6062-7


DOI:10.1073/pnas.0400782101

Source
PubMed

Authors:

Tim Wiltshire

University of North Carolina at Chapel Hill

Serge Batalov

Genomics Institute of the Novartis Research Foundation



Fig. 3. Six genes on mouse chromosome 12 share a distinctive pattern of expression. (A) A genomic view of this region (not to scale). Locations of the genes on the mouse genome assembly: Dlk (103.508 Mb), Gtl2 (103.593 Mb), 1110006E14Rik (103.646 Mb), Rian (103.696 Mb), 5330411G14Rik (103.788 Mb), C130007E11Rik (103.798 Mb), and Dio3 (104.328 Mb). (B) These genes share enriched expression in brain regions (green bars) and umbilical cord (red bar). The y axes indicate normalized expression values, whereas each bar along the x axis indicates a sample profiled in our data set. (C) Three of these genes (Dlk1, Gtl2, and Rian) have been previously reported to be imprinted. Using our allele-specific probe expression analysis approach (see text), we confirmed the imprinted regulation of Gtl2 and Rian and report two previously undescribed imprinted transcripts at this locus (5330411G14Rik and C130007E11Rik). The y axes indicate the normalized signal intensity for individual probes on the array, and each bar represents a pooled sample from a cross indicated by color (see key).



Andrew I. Su*†, Tim Wiltshire*†, Serge Batalov*†, Hilmar Lapp*, Keith A. Ching*, David Block*, Jie Zhang*, Richard Soden*, Mimi Hayakawa*, Gabriel Kreiman*‡, Michael P. Cooke*, John R. Walker*, and John B. Hogenesch*§¶

*The Genomics Institute of the Novartis Research Foundation, 10675 John J. Hopkins Drive, San Diego, CA 92121; and §Department of Neuropharmacology, The Scripps Research Institute, 10550 North Torrey Pines Road, San Diego, CA 92037

Edited by Peter K. Vogt, The Scripps Research Institute, La Jolla, CA, and approved March 2, 2004 (received for review February 3, 2004)

The tissue-specific pattern of mRNA expression can indicate important clues about gene function. High-density oligonucleotide arrays offer the opportunity to examine patterns of gene expression on a genome scale. Toward this end, we have designed custom arrays that interrogate the expression of the vast majority of protein-encoding human and mouse genes and have used them to profile a panel of 79 human and 61 mouse tissues. The resulting data set provides the expression patterns for thousands of predicted genes, as well as known and poorly characterized genes, from mice and humans. We have explored this data set for global trends in gene expression, evaluated commonly used lines of evidence in gene prediction methodologies, and investigated patterns indicative of chromosomal organization of transcription. We describe hundreds of regions of correlated transcription and show that some are subject to both tissue and parental allele-specific expression, suggesting a link between spatial expression and imprinting.

The completion of the human and mouse genome sequences opened an historic era in mammalian biology. One common conclusion from these projects was the determination that mammals have only ≈30,000 protein-encoding genes (1, 2). Yet, despite the apparent tractability of this figure (earlier estimates were much higher), to date all existing research has determined the function of only a fraction of these genes. Currently, only ≈15,000 human and ≈10,000 mouse genes are described in the literature (Medline, www.ncbi.nih.gov/Pubmed). The challenge and opportunity for genomics strategies and techniques are to accelerate the functional annotation of novel genes from the uncharted genome.

High-throughput technologies for biological annotation have the capacity to partially address the discrepancy between the identification of genes and the understanding of their function. For example, proteins have well defined molecular roles encoded in their primary amino acid sequence as domains. Using sequence informatics, these domains can be used as a tool to search the entire genome to find protein family members that likely function in an analogous manner. Gene expression arrays have also been a useful tool for genome-wide studies where changes in gene expression can be associated with physiological or pathophysiological states (3). Recently, other high-throughput techniques such as RNA interference (4) and cDNA overexpression (5) have been developed, further accelerating functional genome annotation. The integration of these diverse strategies is critical to annotation efforts and remains a significant challenge.

Previously, we generated a preliminary description of the human and mouse transcriptome using oligonucleotide arrays that interrogate the expression of ≈10,000 human and ≈7,000 mouse target genes (6). We explored this data set for insights into gene function, transcriptional regulation, disease etiology, and comparative genomics. However, this data set was based on commercially available gene expression arrays and therefore was biased toward previously characterized genes. In this report, we significantly extend this earlier work by determining the expression patterns of previously uncharacterized protein-encoding genes and de novo gene predictions from the mouse and human genome projects. Using custom-designed whole-genome gene expression arrays that target 44,775 human and 36,182 mouse transcripts, we have built a more extensive gene atlas using a panel of RNAs derived from 79 human and 61 mouse tissues. This data set constitutes one of the largest quantitative evaluations of gene expression of the protein-encoding transcriptome to date.

Building on our previous analyses, these expression patterns were examined for global trends in gene expression. We also provide experimental validation of thousands of gene predictions and use these data to determine which of the commonly used types of evidence for gene prediction most accurately correlates with expressed genes. In addition, we used this data set to search for chromosomal regions of correlated transcription (RCTs), which may indicate higher-order mechanisms of transcriptional regulation. Furthermore, we show that some of these tissue-specific coregulated genes are subject to another form of regulation, parental imprinting, and thus that several of these regions are under the control of both tissue- and parental allele-specific expression. Finally, we have made these data publicly available for searching and visualization by keyword, accession number, sequence, expression pattern, and coregulation at our web site (http://symatlas.gnf.org).

Materials and Methods

Microarray Chip Design. We identified a nonredundant set of target sequences for the human and mouse using the following sources: RefSeq (15,491 human and 12,029 mouse sequences) (7); Celera (49,859 human and 29,331 mouse sequences) (8); Ensembl (33,698 human sequences); and RIKEN (46,299 mouse sequences) (9). First, all sequences were screened with REPEATMASKER (www.repeatmasker.org) to remove repetitive elements. Next, sequence identity between individual sequences was established by using pairwise BLAT (10) or BLAST (11) and SIM4 (12). The results from single-linkage clustering were further triaged to produce a final target set of 44,775 human and 36,182 mouse targets with the highest degree of confidence of computational prediction [biasing toward sequences containing Interpro domains (13) and away from noncoding RNAs]. Finally, the human sequence set was pruned of all targets already represented on the Affymetrix (Santa Clara, CA) commercially available HG-U133A array, leaving 22,645 target sequences for our custom array. One hundred target sequences from the HG-U133A chip were also included in the GNF1H design for the normalization procedure (see below). The final human and mouse target sets were submitted to the Affymetrix chip design pipeline for fabrication of the GNF1H and GNF1M arrays, respectively.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviations: RCT, regions of correlated transcription; AUC, area under the curve; LCR, locus control regions.

†A.I.S., T.W., and S.B. contributed equally to this work.

‡Present address: Center for Biological and Computational Learning, Massachusetts Institute of Technology, MIT E25-201, Cambridge, MA 02142.

¶To whom correspondence should be addressed. E-mail: hogenesch@gnf.org.

6062–6067 | PNAS | April 20, 2004 | vol. 101 | no. 16 www.pnas.org/cgi/doi/10.1073/pnas.0400782101
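The single-linkage clustering step mentioned under Microarray Chip Design can be illustrated with a small union-find sketch. This is not the authors' pipeline: the function name, the integer indexing of sequences, and the example pairs are hypothetical, and the list of "similar" pairs is assumed to come from an upstream alignment step such as BLAT or BLAST.

```python
def single_linkage_clusters(n_items, similar_pairs):
    """Group items so that any two joined by a chain of similar pairs share a cluster."""
    parent = list(range(n_items))

    def find(i):
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for i in range(n_items):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical input: six sequences, three pairs judged similar by alignment.
pairs = [(0, 1), (1, 2), (3, 4)]
clusters = single_linkage_clusters(6, pairs)
print(clusters)
```

Under single linkage, sequences 0 and 2 end up together even though they were never compared directly, because both are linked to sequence 1.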

Tissue Preparation. Human tissue samples were obtained from several sources: Clinomics Biosciences (Pittsfield, MA), Clontech, AllCells (Berkeley, CA), Clonetics/BioWhittaker (Walkersville, MD), AMS Biotechnology (Abingdon, Oxfordshire, U.K.), and the University of California at San Diego. When samples from four or more subjects were available, equal numbers of male and female subjects were used to make two independent pools; when fewer than four samples were available, RNA samples were pooled, and duplicate amplifications were performed for each pool (detailed annotation for human samples is on our web site and in Table 1, which is published as supporting information on the PNAS web site). Adult (10- to 12-week-old) mouse tissue samples were independently generated from two groups of four male and three female C57BL/6 mice by dissection, and tissues were subsequently quickly frozen on dry ice. Tissues were pulverized while frozen, and total RNA was extracted with Trizol (Invitrogen, Carlsbad) by using ≈100 mg of tissue, then further processed by using the RNeasy miniprep kit according to manufacturer's protocols (Qiagen, Chatsworth, CA). The quality of all samples was determined with an Agilent Bioanalyzer (Palo Alto, CA).

Microarray Procedure. Microarray analysis was performed essentially as described (14). Briefly, 5 μg of total RNA was used to synthesize cDNA that was then used as a template to generate biotinylated cRNA. cRNA was fragmented and hybridized to Affymetrix custom or commercially available gene expression arrays. The arrays were then washed and scanned with a laser scanner, and images were analyzed by using the MAS5 algorithm (15). Arrays were normalized by using global median scaling. The human HG-U133A and GNF1H chips, which were hybridized to the same biological sample, were first paired and prenormalized by using the common targets. The condensed data files are available from our web site (http://symatlas.gnf.org) and Gene Expression Omnibus (www.ncbi.nih.gov/geo) (16). Raw CEL files will be provided upon request (http://symatlas.gnf.org).
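As an illustration of global median scaling, the sketch below rescales each array so that all arrays share one median. The text does not state the target value used; here the target defaults to the median of the per-array medians, which is an assumption, and the function name and example values are hypothetical.

```python
from statistics import median


def median_scale(arrays, target=None):
    """Multiply each array by a constant so that its median equals a common target."""
    medians = [median(a) for a in arrays]
    if target is None:
        # Assumed default: the median of the per-array medians.
        target = median(medians)
    return [[v * target / m for v in a] for a, m in zip(arrays, medians)]


# Hypothetical signal values for three arrays with different overall brightness.
raw = [[10.0, 20.0, 30.0], [20.0, 40.0, 60.0], [5.0, 10.0, 15.0]]
scaled = median_scale(raw)
print(scaled)
```

Scaling by a single constant per array preserves the relative ordering of probe sets within an array and only removes array-wide intensity differences.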

Identification of RCTs. All target genes were mapped to their corresponding genome assembly (human to National Center for Biotechnology Information Hs34 assembly, mouse to February 2003 Mm30 assembly) by using BLAT (10). To account for multiple probes interrogating a single gene, target sequences were also compared to UniGene (www.ncbi.nih.gov/UniGene) by using BLAST. Target sequences that map within 25 kb of each other and to a common UniGene cluster were pooled, and their expression values were averaged and treated as a single target in the RCT analysis. Next, each chromosome was scanned in window sizes of 3–10 adjacent genes. Windows where >50% of all pairwise comparisons of expression pattern showed a Pearson correlation coefficient >0.6 were identified as RCTs. Randomization studies of gene order confirmed the significance of both the overall number of RCTs and the average pairwise correlation of each individual RCT (P < 0.005, correcting for multiple testing). Pairwise sequence similarity within each RCT was assessed by using TBLASTX (11), where a similarity value is the product of the alignment similarity and the percentage of total sequence length aligned. Synteny between the human and mouse genome assemblies was derived from a published analysis of syntenic anchors (17). For the analysis of evolutionarily conserved RCTs, only the 32 tissues profiled in common between the mouse and human data sets were used. All analyses and visualizations were performed by using R (www.r-project.org).
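The window scan described under Identification of RCTs can be sketched as follows. This is a minimal illustration in Python rather than the R code the authors used. It covers only the window criterion (>50% of pairwise Pearson correlations >0.6 in windows of 3–10 adjacent genes) and omits the pooling of targets and the randomization test. Function names and the example profiles are hypothetical, and treating a flat profile as uncorrelated is an assumption.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a flat profile counts as uncorrelated
    return sxy / sqrt(sxx * syy)


def find_rct_windows(profiles, min_size=3, max_size=10, r_cutoff=0.6, frac_cutoff=0.5):
    """Return (start, size) for each window of adjacent genes meeting the criterion.

    `profiles` lists per-gene expression vectors in chromosomal order.
    """
    hits = []
    n = len(profiles)
    for size in range(min_size, max_size + 1):
        for start in range(0, n - size + 1):
            pairs = list(combinations(profiles[start:start + size], 2))
            n_high = sum(1 for x, y in pairs if pearson(x, y) > r_cutoff)
            if n_high / len(pairs) > frac_cutoff:
                hits.append((start, size))
    return hits


# Hypothetical chromosome: three co-expressed neighbors followed by two unrelated genes.
profiles = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [1, 3, 5, 7],
    [4, 3, 2, 1],
    [1, -1, -1, 1],
]
print(find_rct_windows(profiles))
```

In this example only the window covering the first three genes passes; extending it to four genes gives exactly 50% correlated pairs, which does not exceed the strict >50% cutoff.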

Imprinting Analysis. Allele-specific probe expression analysis was used to identify genes with an imprinted expression pattern. Two distinct mouse strains, C57BL/6J (B6) and Mus musculus castaneus (CAST/Ei), were bred to produce four independent mouse crosses (male::female): B6::B6, B6::CAST/Ei, CAST/Ei::B6, and CAST/Ei::CAST/Ei. Each litter of embryonic day 14–16 embryos was pooled, and RNA from four to five separate litters was labeled and hybridized to GNF1M arrays. A probe-level analysis was performed to detect naturally occurring polymorphisms between the two strains. Individual probes (but not entire probe sets) showing a significantly different signal between the two homozygous groups were identified as putative polymorphisms in the target gene. Next, the hybridization signal from the two reciprocal crosses was examined for statistically significant differences in signal based on the paternal or maternal allele, as assessed by t test (P < 0.001), indicating a pattern of male or female imprinting.
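The comparison of the two reciprocal crosses for a single polymorphic probe can be sketched with a two-sample t statistic. The text does not say which t test variant was used, so the Welch (unequal variance) form below is an assumption, as are the function name and the signal values. The sketch stops at the statistic and degrees of freedom; converting these to a P value for comparison with the 0.001 threshold requires the t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical signal of one probe across four pooled litters per reciprocal cross.
b6_x_cast = [100.0, 110.0, 90.0, 105.0]
cast_x_b6 = [400.0, 420.0, 390.0, 410.0]
t, df = welch_t(b6_x_cast, cast_x_b6)
print(round(t, 2), round(df, 2))
```

A large absolute t value for a probe that also differs between the two homozygous groups is the kind of signal the allele-specific analysis looks for.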

Results and Discussion

The tissue-specific RNA expression pattern of a gene can indicate important clues to its physiological function. To build an extensive atlas of tissue-specific gene expression, we created custom arrays that interrogate the expression of known and predicted protein-encoding genes from the mouse and human genomes. The design process used a nonredundant set of known genes and gene predictions compiled from Refseq, Celera, Ensembl (for human), and RIKEN (for mouse). For our GNF1H custom human array, we further removed gene targets that were already represented on the commercially available HG-U133A array from Affymetrix. Finally, we biased the final selection toward gene predictions with likely protein-coding regions. In total, the U133A/GNF1H chipset interrogates 44,775 probe sets, and our custom-designed GNF1M mouse array interrogates 36,182 probe sets. As of the most current annotation in January 2004, these correspond to 33,698 and 33,825 unique human and mouse genes, respectively, after accounting for multiple probe sets interrogating single genes and split transcripts.

Using these whole-genome gene expression arrays, we measured the expression of an extensive set of transcripts and transcript predictions on a single technology platform across a diverse panel of 79 human and 61 mouse tissues. This gene atlas represents the normal transcriptome and allowed us to examine global trends in gene expression. Classical reassociation kinetics (Rot) has been used to assess global trends in gene expression at a population level (18). The analysis of our data set expands this knowledge by examining transcript expression across a large number of tissues at the individual transcript level. We find that 52% (16,454) and 59% (17,924) of target genes are detected in at least one tissue in the human and mouse, respectively (Fig. 4A, which is published as supporting information on the PNAS web site). The average number of transcripts expressed in a single tissue was ≈8,200 (mouse). These observations generally concur with previous findings derived from Rot analyses, which indicate that ≈10,000–15,000 mRNAs are expressed in a given tissue at ≈1–10 copies per cell, and that 90% of these are common between two tissues (19). However, although Rot analysis suggests that the majority of transcripts are present in many or all tissues, our data show that <1% of human target sequences are ubiquitously expressed. Approximately 3% of mouse target sequences are detected in all samples profiled, although this number will certainly decline as the number of samples increases. Not surprisingly, the expression of these ubiquitously expressed housekeeping genes is ≈30-fold higher than for all genes in the data set (Fig. 4B).
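The two summary fractions discussed above (targets detected in at least one tissue, and targets detected in every sample) can be computed from an expression matrix as sketched below. The detection criterion is not specified in this passage, so a simple signal cutoff is assumed; the function name, the cutoff of 200, and the example matrix are hypothetical.

```python
def detection_summary(expr, threshold):
    """Return (fraction detected in >=1 sample, fraction detected in all samples).

    `expr` holds one row of signal values per target; a value above
    `threshold` counts as detected (an assumed criterion).
    """
    n = len(expr)
    in_any = sum(1 for row in expr if any(v > threshold for v in row))
    in_all = sum(1 for row in expr if all(v > threshold for v in row))
    return in_any / n, in_all / n


# Hypothetical matrix: four targets profiled in three tissues.
expr = [
    [500, 600, 700],  # detected everywhere (ubiquitous)
    [50, 300, 20],    # detected in one tissue only
    [10, 20, 30],     # never detected
    [250, 100, 90],   # detected in one tissue only
]
frac_any, frac_all = detection_summary(expr, threshold=200)
print(frac_any, frac_all)
```

Because the "all samples" fraction requires every value in a row to pass, it can only stay the same or shrink as more samples are added, in line with the expectation stated in the text.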


Another valuable use of this data set is characterization of novel predicted genes derived from the mouse and human genome projects (1, 2). Many of these exist solely as in silico predictions, and therefore evidence of their expression can serve as validation of these predictions. Furthermore, determining the expression pattern of an uncharacterized gene can indicate the appropriate tissue(s) from which the transcript can be cloned, as well as provide a base layer of physiological annotation. Gene prediction is an inexact art, where distinct methods and researchers often produce largely nonoverlapping sets of gene predictions (20). For the human data, we subdivided the transcripts into four classes based on annotation information at the time of design: known genes found in Refseq, genes predicted independently by two groups (Celera and Ensembl), singleton predictions found by the Ensembl group only, and singleton predictions found by the Celera group only. As expected, the set of known genes (Refseq) has the highest rate of detection in our data set, because 79% have detectable expression in at least one sample (Fig. 1). Because all Refseq genes are known to be expressed, this suggests that our methodologies and current tissue libraries have a minimum false-negative rate of ≈21% in detection of expression. This can certainly be improved with the profiling of additional tissues and cell types. Consensus gene predictions have a higher rate of detectable expression (53%) than either singleton gene predictions offered by Ensembl or Celera only (30% and 24%, respectively) (Fig. 1). Although the Ensembl-only group had a slightly higher rate of detection, a greater total number of Celera-only predictions was detected (2,918 Celera vs. 618 Ensembl predictions). Analogous results are seen in the mouse data set, in which Refseq genes had a higher rate of detection than gene predictions by Celera (79% vs. 46%). The differences among these three classes are also reflected in the quantitative measures of gene expression. On average, human Refseq genes are expressed at a level 2-fold higher than consensus predictions, which in turn are 66% higher than singleton predictions (P ≪ 0.001; data not shown). This observation likely reflects a historical bias in biology toward studying highly abundant proteins. In total, we find evidence of expression for 5,641 (31.2%) human and 2,629 (46.2%) mouse gene predictions through detection of their transcribed mRNA product in at least one tissue. In addition, we describe the expression pattern for 9,708 mouse RIKEN-derived genes, many of which lack significant expression annotation. It is important to note that the gene predictions for which we do not observe detectable expression are not necessarily incorrect, because the appropriate tissue(s) for a given gene may have not been profiled, the gene may be present in a small number of copies (e.g., in a small subset of cells within a tissue), or the probe set may not properly interrogate the expression of the gene (e.g., UTRs, split transcripts, or missing or mistaken terminal exons). Despite these caveats, this data set provides the expression pattern of thousands of gene predictions and poorly characterized transcripts from the mouse and human genome projects, offering the opportunity to study the function of these genes in their most relevant tissues.

Given the differing methods and subsequent results from gene prediction efforts, we next investigated which characteristics of a predicted transcript were better indicators of its detectable expression. In the methodology used by Celera, the following lines of evidence were considered in their gene prediction algorithm: “conservation between mouse and human genomic DNA, similarity to human [and] rodent transcripts (ESTs and cDNAs), and similarity of the translation of human genomic DNA to known proteins” (1). Using the detectable expression of a gene product as validation of the prediction, we created receiver operating characteristic curves for each line of evidence that plot the true positive rate as a function of the false positive rate. The area under the curve (AUC) measures the strength of the predictor; a perfect predictor would have AUC = 1, and a random factor would have AUC = 0.5. When comparing the predictor strength among the three lines of evidence above in the human data set, we find that although no single line of evidence is universally predictive of expression, EST evidence has the most predictive value (AUC = 0.77) (Fig. 5, which is published as supporting information on the PNAS web site), an observation likely linked to the fact that highly expressed genes are more likely to be represented in EST databases. Protein homology support and sequence similarity between human and mouse genomic sequences both had a lesser impact on the validation of gene predictions (AUC of 0.66 and 0.65, respectively). The availability of additional mammalian genome sequences should increase the power of sequence conservation in gene prediction. Somewhat surprisingly, simply the length of the transcript prediction was also a reasonable predictor of detection in our data set (AUC = 0.68), suggesting that incomplete transcript predictions were significant factors in the nonobservation of many gene targets.
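The AUC values quoted above can be understood through an equivalent definition: the probability that a randomly chosen detected gene has a higher evidence score than a randomly chosen undetected gene, with ties counted as one half. The sketch below computes AUC this way. It is an illustration only; the function name, the evidence scores, and the detection labels are hypothetical and are not taken from the data set.

```python
def roc_auc(scores, labels):
    """AUC as the chance that a random positive outscores a random negative."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical evidence scores (e.g., strength of EST support) for six predictions,
# with True marking predictions whose expression was detected.
evidence = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
detected = [True, True, False, True, False, False]
auc = roc_auc(evidence, detected)
print(auc)
```

A score that ranks every detected gene above every undetected one gives AUC = 1, and a score unrelated to detection gives values near 0.5, matching the two reference points given in the text.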

We and others have used gene expression information, genome sequence, and de novo motif discovery tools to search for enhancer elements that direct tissue-specific gene expression (21, 22). In contrast to enhancers that generally direct the expression of a single gene, locus control regions (LCR) are characterized by their ability to promote the expression of multiple genes at a single locus. To date, only a handful of LCRs have been reported (23). Recently, Spellman and Rubin (24) used Drosophila gene expression arrays to identify ≈200 clusters of adjacent and similarly expressed genes and suggest that these patterns are most consistent with regulation of chromatin structure. Others (25–27) have also performed similar analyses in humans, Caenorhabditis elegans, and yeast on more limited sets of experimental conditions.

To identify loci in our data set whose expression may be controlled in a locus-dependent manner, we mapped the transcripts represented on our gene expression arrays to genome assemblies and scanned each chromosome for windows of genes with correlated expression patterns. We called these sites RCTs as a general term encompassing LCRs and correlated expression achieved through gene duplication. It is important to note that detection of these RCTs is heavily influenced by comparison algorithms, normalization procedures, and underlying data. In particular, the inclusion of several purified immune cell populations in our human sample set skewed the normalization procedure and led to an increase in RCTs whose expression is enriched in these samples. In total, we identified 156 and 108 RCTs in human and mouse, respectively (descriptions of all RCTs are available for download from http://symatlas.gnf.org). Tissues with very specific clusters of genes such as those in the immune system, liver, testis, and placenta had more RCTs than other tissues in both the mouse and human data sets. Mechanistically, expression of these RCTs is likely mediated through either common promoter elements (resulting from gene duplication) or through higher-order gene regulation such as site-specific chromatin remodeling. To separate these two possibilities, we identified likely paralogs using TBLASTX, a local six-frame translated nucleotide-to-nucleotide alignment algorithm (11). RCTs whose genes share significant sequence similarity in their coding sequences are likely to be products of gene duplication, whereas dissimilar genes may result from an LCR or other higher-order transcriptional regulation.

Fig. 1. Validation of gene predictions in humans. Gene targets on the GNF1H array were divided into four categories: contained in Refseq, predicted by both gene prediction efforts considered (“Consensus”), and predicted by only one group (“Ensembl-only” and “Celera-only”). On the left axis (solid bars), rates of validation are shown, where detectable expression in at least one tissue is taken as evidence of the validity of a gene prediction. The right axis (blue line) indicates the total number of validated genes per group.

As expected, we found RCTs with both related and unrelated genes. Fig. 2A illustrates an example of an RCT driven by gene duplication. This cluster of genes on mouse chromosome 9 represents a family of 11 uncharacterized F-box and WD40 repeat containing proteins that are specifically expressed in fertilized eggs and oocytes. Because of their high degree of sequence similarity, we hypothesize that their correlated expression pattern is a result of duplicated regulatory elements present in their structural genes, and that these genes may play an important role in the specialized protein expression of oocytes.

In contrast, we also note a cluster of three genes with no apparent sequence similarity on human chromosome 13 that are highly enriched in samples derived from brain tissues, particularly the fetal brain sample (Fig. 2B). The genes in this cluster are neurobeachin, an uncharacterized mRNA, and doublecortin- and calmodulin kinase-like 1 protein (DCAMKL1). It is appealing to hypothesize that the correlated expression patterns of these genes and their colocalization at a chromosomal locus indicate a common role in a neurological process or network. Because these genes do not share sequence similarity, this region may also contain a previously unrecognized LCR or strong regional enhancer. Overall, 97 (62%) and 78 (72%) of the human and mouse RCTs identified have an average pairwise sequence similarity of <20% and do not encode related genes.
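The sequence similarity measure behind the <20% figure is defined in Materials and Methods as the product of the alignment similarity and the percentage of total sequence length aligned. A minimal sketch of that arithmetic follows. Which sequence length serves as the denominator is not stated, so `total_length` is an assumed input, and the function names, the 0.20 cutoff parameter, and the example numbers are hypothetical.

```python
def similarity_value(alignment_similarity, aligned_length, total_length):
    """Alignment similarity weighted by the fraction of the sequence that aligned."""
    return alignment_similarity * (aligned_length / total_length)


def is_unrelated_rct(pairwise_values, cutoff=0.20):
    """True when the average pairwise similarity falls below the cutoff."""
    return sum(pairwise_values) / len(pairwise_values) < cutoff


# Hypothetical pair: 80% similar over an alignment covering 300 of 1,200 positions.
value = similarity_value(0.8, 300, 1200)
print(value)

# Hypothetical three-gene RCT with three pairwise similarity values.
print(is_unrelated_rct([0.05, 0.10, 0.02]))
```

Weighting by aligned length keeps a short, highly similar stretch (for example a shared domain) from marking two otherwise different genes as paralogs.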

We next examined both the mouse and human data for RCTs that were identified in both data sets and are likely evolutionarily conserved. The majority of the RCTs were not found in both human and mouse, in many cases because the orthologs or syntenic regions have not yet been defined or the patterns were not conserved. However, in some cases, the apparent lack of conservation likely reflects physiological differences between the two organisms. For example, we observed RCTs with expression enriched in the olfactory bulb present in the mouse but not the human data set. Nevertheless, several RCTs were conserved, including a cluster of pancreas-specific genes mapping to human chromosome 2 and its syntenic region on mouse chromosome 6 (Fig. 2C). The human cluster is comprised of five genes, including pancreatitis-associated proteins (PAP), three regenerating islet-derived proteins (REG1A, REG1B, and REGL), and one protein of unknown function (LPPM429). The mouse cluster contains the ortholog to PAP, four isoforms of regenerating islet-derived proteins, and islet neogenesis-associated protein-related protein. The conservation of this RCT in human and mouse suggests that these genes perform analogous and important roles in both of these mammals.

Fig. 2. RCT. (A) An RCT was identified on mouse chromosome 9, consisting of 11 genes that share a highly conserved expression pattern. (Upper) The y axis is average normalized expression value, the x axis contains the 61 different tissues, and red bars are fertilized egg and oocytes. The correlation plot (Lower Left) visualizes the pairwise correlation coefficients. Each row represents a gene, ordered vertically according to their position on the chromosome. The center yellow vertical strip represents autocorrelation (R = 1); positions to the right of center represent correlation of the gene to its downstream neighbors, whereas positions to the left represent correlation to the upstream neighbors. Yellow indicates high correlation; blue indicates low correlation (scale at bottom). The sequence similarity plot (using TBLASTX, Lower Right) has the same structure as the correlation plot, except pairwise sequence similarity is shown. In this RCT with high expression levels in fertilized eggs and oocytes, the genes share a high degree of sequence similarity, likely indicating they are all members of a single gene family and the result of one or more gene duplication events. (B) An example RCT is identified on human chromosome 13, which contains three genes with highly correlated expression (red bars are brain regions, green bar is fetal brain). In contrast to the first example, these genes share very little pairwise sequence similarity. (C) An evolutionarily conserved RCT is shown from human chromosome 2 (Left) and the syntenic region on mouse chromosome 6 (Right). These RCTs share a pancreas-enriched expression pattern (red bar), as well as significant sequence similarity.

Su et al. PNAS | April 20, 2004 | vol. 101 | no. 16 | 6065

GENETICS
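The correlation plot described in the Fig. 2 legend (each row a gene in chromosomal order, the center column the gene's correlation with itself, columns to either side its correlation with upstream and downstream neighbors) can be illustrated with a minimal sketch. This is not the procedure used to call RCTs in this study; the function names, the `window` parameter, and the toy expression values are assumptions made only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # flat profile: correlation is undefined, treated as 0 in this sketch
    return sxy / sqrt(sxx * syy)


def neighbor_correlations(profiles, window=2):
    """Build the rows of a neighbor-correlation plot.

    `profiles` holds one expression profile per gene, ordered by chromosomal
    position. Each returned row has 2 * window + 1 entries: the correlation of
    the gene with the neighbors at offsets -window .. +window. The middle entry
    is the gene against itself; None marks a position with no neighbor.
    """
    rows = []
    for i, profile in enumerate(profiles):
        row = []
        for offset in range(-window, window + 1):
            j = i + offset
            if 0 <= j < len(profiles):
                row.append(pearson(profile, profiles[j]))
            else:
                row.append(None)
        rows.append(row)
    return rows


# Hypothetical values for three adjacent genes across four tissues.
toy_profiles = [
    [1, 2, 3, 10],   # gene A
    [2, 4, 6, 20],   # gene B: same pattern as gene A at twice the level
    [9, 1, 5, 2],    # gene C: unrelated pattern
]

for row in neighbor_correlations(toy_profiles, window=1):
    print(row)
```

In this toy input, genes A and B would appear as a correlated pair, whereas gene C would not. A real analysis would also need a rule for deciding how high and how extended the correlation must be before a region is reported, which this sketch does not attempt.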

After mapping all target genes to their respective genome assemblies, we noted a region of mouse chromosome 7 (130 Mb) that contained several genes previously shown to be imprinted (28–30), three of which (H19, Igf2, and Cdkn1c) shared a pattern of enriched expression in placenta, umbilical cord, and embryonic tissues. We also noted another pair of adjacent genes (Zim1 and Peg3) elsewhere on chromosome 7 (6 Mb) that shared this tissue-specific expression pattern, and whose expression has been shown to be imprinted (31). Prompted by these observations, we examined our set of RCTs for other imprinted genes that were clustered in a single locus. On mouse chromosome 12 (103 Mb), we observed an RCT that consists of six adjacent genes, all with enriched expression in brain regions and umbilical cord (Fig. 3 A and B). Recently, several groups showed that two genes in this locus, Dlk1 and Gtl2, are imprinted (reviewed in ref. 32). Later, it was also shown that another gene at this locus, Rian, and several adjacent tandemly repeated C/D small nucleolar RNA genes are also imprinted (33, 34). Furthermore, although we do not have a probe set on our array that reliably detects its expression, Dio3 is located proximal to this locus and has also been shown to exhibit genomic imprinting (35). The imprinting status of the three remaining RIKEN clones at this locus (1110006E14Rik, 5330411G14Rik, and C130007E11Rik) is not known, although they share the brain- and umbilical cord-enriched expression characteristic of all of the genes in the RCT.

Fig. 3. Six genes on mouse chromosome 12 share a distinctive pattern of expression. (A) A genomic view of this region (not to scale). Locations of the genes on the mouse genome assembly: Dlk (103.508 Mb), Gtl2 (103.593 Mb), 1110006E14Rik (103.646 Mb), Rian (103.696 Mb), 5330411G14Rik (103.788 Mb), C130007E11Rik (103.798 Mb), and Dio3 (104.328 Mb). (B) These genes share enriched expression in brain regions (green bars) and umbilical cord (red bar). The y axes indicate normalized expression values, whereas each bar along the x axis indicates a sample profiled in our data set. (C) Three of these genes (Dlk1, Gtl2, and Rian) have been previously reported to be imprinted. Using our allele-specific probe expression analysis approach (see text), we confirmed the imprinted regulation of Gtl2 and Rian and report two previously undescribed imprinted transcripts at this locus (5330411G14Rik and C130007E11Rik). The y axes indicate the normalized signal intensity for individual probes on the array, and each bar represents a pooled sample from a cross indicated by color (see key).

6066 | www.pnas.org/cgi/doi/10.1073/pnas.0400782101 Su et al.

To investigate whether these three genes were also imprinted, we used two distinct mouse strains, C57BL/6J (B6) and M. m. castaneus (CAST/Ei), to set up four independent mouse crosses (male::female): B6::B6, B6::CAST/Ei, CAST/Ei::B6, and CAST/Ei::CAST/Ei. Four independent litters of pooled embryonic day 14–16 embryos were dissected, and RNA expression was analyzed by allele-specific probe expression analysis, which allows us to determine whether the transcript is expressed exclusively or preferentially from either the paternal or maternal allele. This analysis reconfirmed the imprinted expression of Gtl2 and Rian (Fig. 3C). Because no probes could distinguish between the B6 and CAST/Ei forms of Dlk1, we were unable to reconfirm its imprinted expression. Two of the uncharacterized RIKEN genes at this locus, 5330411G14Rik and C130007E11Rik, showed expression from the maternal allele only, further expanding the number of known imprinted genes at this locus (Fig. 3C). Because these cDNAs are within 10 kb of one another, it is possible they are derived from the same structural gene. The third gene (1110006E14Rik), like Dlk1, did not contain a probe capable of ascertaining its imprinting status. During the preparation of this manuscript, another gene in this locus sharing the 3′-end of C130007E11Rik was also shown to be imprinted (36). In sum, the allele-specific probe expression analysis method has identified another two imprinted transcripts at this locus. Furthermore, based on the observation that well-characterized imprinted loci on mouse chromosomes 7 and 12 share a common pattern of gene expression in our data, we speculate that the LCR machinery that regulates the parental expression of these genes may also influence their tissue-specific expression pattern.
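The reasoning behind the reciprocal-cross design can be sketched as follows. The example assumes a single probe that hybridizes more strongly to the B6 form of a transcript than to the CAST/Ei form, so that the two parental crosses bracket the signal expected from a B6 or a CAST/Ei allele alone. The scoring rule, the threshold, the function name, and the signal values are all hypothetical; the text does not specify how probe signals were scored.

```python
def classify_parental_expression(signal, threshold=0.8):
    """Classify a transcript from one strain-discriminating probe.

    `signal` maps (sire, dam) strain pairs to normalized probe intensity for
    the four crosses. The probe is assumed to favor the B6 allele. Returns
    "maternal", "paternal", or "not imprinted or inconclusive".
    """
    b6_only = signal[("B6", "B6")]
    cast_only = signal[("CAST", "CAST")]
    if b6_only <= cast_only:
        raise ValueError("probe does not discriminate in favor of the B6 allele")
    span = b6_only - cast_only

    # Father is B6, mother is CAST/Ei: B6-like signal reflects the paternal allele.
    paternal_share = (signal[("B6", "CAST")] - cast_only) / span
    # Father is CAST/Ei, mother is B6: B6-like signal reflects the maternal allele.
    maternal_share = (signal[("CAST", "B6")] - cast_only) / span

    if maternal_share >= threshold and paternal_share <= 1 - threshold:
        return "maternal"
    if paternal_share >= threshold and maternal_share <= 1 - threshold:
        return "paternal"
    return "not imprinted or inconclusive"


# Hypothetical intensities for a transcript expressed only from the maternal allele.
maternal_example = {
    ("B6", "B6"): 1000.0,
    ("CAST", "CAST"): 100.0,
    ("B6", "CAST"): 120.0,   # B6 allele inherited from the father: signal stays low
    ("CAST", "B6"): 950.0,   # B6 allele inherited from the mother: signal is high
}
print(classify_parental_expression(maternal_example))
```

A transcript expressed equally from both alleles would give intermediate signal in both reciprocal crosses and would fall into the inconclusive category. When no probe distinguishes the two strains, as reported above for Dlk1 and 1110006E14Rik, the two parental crosses give the same signal and no call can be made.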

Conclusion

Here we report an extensive compendium of gene expression of the protein-encoding transcriptomes of the mouse and humans. Further augmentation by additional samples, including region-specific dissections using laser capture microdissection or even cell type-specific gene expression, will undoubtedly increase the utility of these resources. We have investigated this data set for global signatures in tissue-specific gene regulation, expression characteristics of de novo predicted transcripts, and chromosomal RCTs. The identification of several known imprinted loci in our tissue-specific RCT list suggests that these regulatory mechanisms that direct tissue- or parental allele-specific expression may be intertwined. Consistent with this observation, we were able to identify two previously undescribed transcripts that were imprinted on mouse chromosome 12 based on the observation that they share a tissue-specific expression pattern with their neighbors.

With the sequencing phase of the human and mouse genome projects nearly complete, and with the rapid progress in the sequencing of other mammalian genomes, we are now poised to develop and exploit a variety of methods to ascertain the function of the thousands of recently described genes. In this regard, the genome-scale RNA expression data described herein provide a framework for the functional annotation process. By making the underlying data available on our web site (http://symatlas.gnf.org) and through the Gene Expression Omnibus (www.ncbi.nih.gov/geo), we anticipate that this study will aid researchers throughout the global research community to reap the harvests of the human and mouse genome projects.

We thank the following individuals for providing human RNA samples: Gino Van Heeke, Novartis (bronchial epithelial cells); Graeme Bilbe, Novartis (fetal thyroid); Clifford Shults, University of California at San Diego (whole blood); Bill Sugden, University of Wisconsin, Madison (721 B-lymphoblasts); Joseph D Buxbaum, Mt. Sinai School of Medicine, New York (prefrontal cortex). We also thank Ines Hoffmann and Satchin Panda for preparation of mouse embryonic samples and Peter Dimitrov, Christian Zmasek, and Michael Heuer for technical expertise. This work was supported by the Novartis Research Foundation.

1. Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304–1351.
2. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001) Nature 409, 860–921.
3. Su, A. I., Welsh, J. B., Sapinoso, L. M., Kern, S. G., Dimitrov, P., Lapp, H., Schultz, P. G., Powell, S. M., Moskaluk, C. A., Frierson, H. F., Jr., et al. (2001) Cancer Res. 61, 7388–7393.
4. Aza-Blanc, P., Cooper, C. L., Wagner, K., Batalov, S., Deveraux, Q. L. & Cooke, M. P. (2003) Mol. Cell 12, 627–637.
5. Chanda, S. K., White, S., Orth, A. P., Reisdorph, R., Miraglia, L., Thomas, R. S., DeJesus, P., Mason, D. E., Huang, Q., Vega, R., et al. (2003) Proc. Natl. Acad. Sci. USA 100, 12153–12158.
6. Su, A. I., Cooke, M. P., Ching, K. A., Hakak, Y., Walker, J. R., Wiltshire, T., Orth, A. P., Vega, R. G., Sapinoso, L. M., Moqrich, A., et al. (2002) Proc. Natl. Acad. Sci. USA 99, 4465–4470.
7. Pruitt, K. D. & Maglott, D. R. (2001) Nucleic Acids Res. 29, 137–140.
8. Kerlavage, A., Bonazzi, V., di Tommaso, M., Lawrence, C., Li, P., Mayberry, F., Mural, R., Nodell, M., Yandell, M., Zhang, J., et al. (2002) Nucleic Acids Res. 30, 129–136.
9. Okazaki, Y., Furuno, M., Kasukawa, T., Adachi, J., Bono, H., Kondo, S., Nikaido, I., Osato, N., Saito, R., Suzuki, H., et al. (2002) Nature 420, 563–573.
10. Kent, W. J. (2002) Genome Res. 12, 656–664.
11. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997) Nucleic Acids Res. 25, 3389–3402.
12. Florea, L., Hartzell, G., Zhang, Z., Rubin, G. M. & Miller, W. (1998) Genome Res. 8, 967–974.
13. Kanapin, A., Batalov, S., Davis, M. J., Gough, J., Grimmond, S., Kawaji, H., Magrane, M., Matsuda, H., Schonbach, C., Teasdale, R. D., et al. (2003) Genome Res. 13, 1335–1344.
14. Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., et al. (1996) Nat. Biotechnol. 14, 1675–1680.
15. Hubbell, E., Liu, W. M. & Mei, R. (2002) Bioinformatics 18, 1585–1592.
16. Edgar, R., Domrachev, M. & Lash, A. E. (2002) Nucleic Acids Res. 30, 207–210.
17. Kent, W. J., Baertsch, R., Hinrichs, A., Miller, W. & Haussler, D. (2003) Proc. Natl. Acad. Sci. USA 100, 11484–11489.
18. Bishop, J. O., Morton, J. G., Rosbash, M. & Richardson, M. (1974) Nature 250, 199–204.
19. Hastie, N. D. & Bishop, J. O. (1976) Cell 9, 761–774.
20. Hogenesch, J. B., Ching, K. A., Batalov, S., Su, A. I., Walker, J. R., Zhou, Y., Kay, S. A., Schultz, P. G. & Cooke, M. P. (2001) Cell 106, 413–415.
21. Harmer, S. L., Hogenesch, J. B., Straume, M., Chang, H. S., Han, B., Zhu, T., Wang, X., Kreps, J. A. & Kay, S. A. (2000) Science 290, 2110–2113.
22. DeRisi, J. L., Iyer, V. R. & Brown, P. O. (1997) Science 278, 680–686.
23. Li, Q., Peterson, K. R., Fang, X. & Stamatoyannopoulos, G. (2002) Blood 100, 3077–3086.
24. Spellman, P. T. & Rubin, G. M. (2002) J. Biol. 1, 5.
25. Caron, H., van Schaik, B., van der Mee, M., Baas, F., Riggins, G., van Sluis, P., Hermus, M. C., van Asperen, R., Boon, K., Voute, P. A., et al. (2001) Science 291, 1289–1292.
26. Roy, P. J., Stuart, J. M., Lund, J. & Kim, S. K. (2002) Nature 418, 975–979.
27. Cohen, B. A., Mitra, R. D., Hughes, J. D. & Church, G. M. (2000) Nat. Genet. 26, 183–186.
28. Bell, A. C. & Felsenfeld, G. (2000) Nature 405, 482–485.
29. Hark, A. T., Schoenherr, C. J., Katz, D. J., Ingram, R. S., Levorse, J. M. & Tilghman, S. M. (2000) Nature 405, 486–489.
30. Thorvaldsen, J. L., Duran, K. L. & Bartolomei, M. S. (1998) Genes Dev. 12, 3693–3702.
31. Kim, J., Lu, X. & Stubbs, L. (1999) Hum. Mol. Genet. 8, 847–854.
32. Georges, M., Charlier, C. & Cockett, N. (2003) Trends Genet. 19, 248–252.
33. Hatada, I., Morita, S., Obata, Y., Sotomaru, Y., Shimoda, M. & Kono, T. (2001) J. Biochem. 130, 187–190.
34. Cavaille, J., Seitz, H., Paulsen, M., Ferguson-Smith, A. C. & Bachellerie, J. P. (2002) Hum. Mol. Genet. 11, 1527–1538.
35. Yevtodiyenko, A., Carr, M. S., Patel, N. & Schmidt, J. V. (2002) Mamm. Genome 13, 633–638.
36. Seitz, H., Youngson, N., Lin, S. P., Dalbert, S., Paulsen, M., Bachellerie, J. P., Ferguson-Smith, A. C. & Cavaille, J. (2003) Nat. Genet. 34, 261–262.


Deficiency in Ever2 does not increase susceptibility of mice to pathogenesis by the mouse papillomavirus, MmuPV1

Article

Jun 2024
J VIROL

Epidermodysplasia verruciformis (EV) is a rare genetic skin disorder that is characterized by the development of papillomavirus-induced skin lesions that can progress to squamous cell carcinoma (SCC). Certain high-risk, cutaneous β-genus human papillomaviruses (β-HPVs), in particular HPV5 and HPV8, are associated with inducing EV in individuals who have a homozygous mutation in one of three genes tied to this disease: EVER1 , EVER2 , or CIB1. EVER1 and EVER2 are also known as TMC6 and TMC8, respectively. Little is known about the biochemical activities of EVER gene products or their roles in facilitating EV in conjunction with β-HPV infection. To investigate the potential effect of EVER genes on papillomavirus infection, we pursued in vivo infection studies by infecting Ever2 -null mice with mouse papillomavirus (MmuPV1). MmuPV1 shares characteristics with β-HPVs including similar genome organization, shared molecular activities of their early, E6 and E7, oncoproteins, the lack of a viral E5 gene, and the capacity to cause skin lesions that can progress to SCC. MmuPV1 infections were conducted both in the presence and absence of UVB irradiation, which is known to increase the risk of MmuPV1-induced pathogenesis. Infection with MmuPV1 induced skin lesions in both wild-type and Ever2 -null mice with and without UVB. Many lesions in both genotypes progressed to malignancy, and the disease severity did not differ between Ever2 -null and wild-type mice. However, somewhat surprisingly, lesion growth and viral transcription was decreased, and lesion regression was increased in Ever2 -null mice compared with wild-type mice. These studies demonstrate that Ever2 -null mice infected with MmuPV1 do not exhibit the same phenotype as human EV patients infected with β-HPVs. 
IMPORTANCE Humans with homozygous mutations in the EVER2 gene develop epidermodysplasia verruciformis (EV), a disease characterized by predisposition to persistent β-genus human papillomavirus (β-HPV) skin infections, which can progress to skin cancer. To investigate how EVER2 confers protection from papillomaviruses, we infected the skin of homozygous Ever2 -null mice with mouse papillomavirus MmuPV1. Like in humans with EV, infected Ever2 -null mice developed skin lesions that could progress to cancer. Unlike in humans with EV, lesions in these Ever2 -null mice grew more slowly and regressed more frequently than in wild-type mice. MmuPV1 transcription was higher in wild-type mice than in Ever2 -null mice, indicating that mouse EVER2 does not confer protection from papillomaviruses. These findings suggest that there are functional differences between MmuPV1 and β-HPVs and/or between mouse and human EVER2.

Identification of Transcripts with Shared Roles in the Pathogenesis of Postmenopausal Osteoporosis and Cardiovascular Disease

Article

Full-text available

May 2024
INT J MOL SCI

Epidemiological evidence suggests existing comorbidity between postmenopausal osteoporosis (OP) and cardiovascular disease (CVD), but identification of possible shared genes is lacking. The skeletal global transcriptomes were analyzed in trans-iliac bone biopsies (n = 84) from clinically well-characterized postmenopausal women (50 to 86 years) without clinical CVD using microchips and RNA sequencing. One thousand transcripts highly correlated with areal bone mineral density (aBMD) were further analyzed using bioinformatics, and common genes overlapping with CVD and associated biological mechanisms, pathways and functions were identified. Fifty genes (45 mRNAs, 5 miRNAs) were discovered with established roles in oxidative stress, inflammatory response, endothelial function, fibrosis, dyslipidemia and osteoblastogenesis/calcification. These pleiotropic genes with possible CVD comorbidity functions were also present in transcriptomes of microvascular endothelial cells and cardiomyocytes and were differentially expressed between healthy and osteoporotic women with fragility fractures. The results were supported by a genetic pleiotropy-informed conditional False Discovery Rate approach identifying any overlap in single nucleotide polymorphisms (SNPs) within several genes encoding aBMD- and CVD-associated transcripts. The study provides transcriptional and genomic evidence for genes of importance for both BMD regulation and CVD risk in a large collection of postmenopausal bone biopsies. Most of the transcripts identified in the CVD risk categories have no previously recognized roles in OP pathogenesis and provide novel avenues for exploring the mechanistic basis for the biological association between CVD and OP.

A review on available proteomic databases, annotation techniques and data projects important in male reproductive physiology research

Article

Full-text available

Apr 2024

Recent breakthroughs in the field of medicine and physiology have come through the application of bioinformatics and computational biology in experimental designs and in silico analyses. Genomics and proteomics-based strategies are currently used for data presentation, sequence analysis and alignment, primer designs, mapping and annotation of the entire human genome. This advancement has made it possible to identify the roles of specific genes and proteins, understand their physiological functions, as well as pathophysiology during mutation, malformation and diseased conditions. This review describes the available proteomic databases; essential proteins used as markers of fertility in male, annotation techniques and data projects on male reproductive physiology. The materials used in this descriptive review were searched using PubMed and Google scholar databases. The following terms and phrases were reviewed: 'genomics', 'proteomics', 'Interrelationship between bioinformatics and Life sciences', 'bioinformatics tools for analyzing male reproductive system', 'Male reproductive system functions', 'infertility', and 'Application of bioinformatics in male reproductive physiology research'. Analyses in proteomics and genomics have further expanded the understanding of male reproductive physiology research through different bioinformatics tools and databases. A better understanding of the mechanisms behind spermatogenesis, testicular gene expression, protein involvement in male reproduction, the discovery of cancer genes in reproductive organs, and possible markers to identify infertility in males have evolved. There is, therefore, a need for further application of bioinformatics in the study of male reproductive system with the introduction of more databases, better identification of cancer genes in male reproductive organs and male infertility's possible biomarkers.

Estimating the selection pressure and evolutionary rate of proteins on the non-neutral hypothesis of synonymous mutations

Preprint

Full-text available

Apr 2024

The Non-synonymous/Synonymous substitution rate (Ka/Ks) ratio is a commonly used metric to estimate the selection pressure and evolutionary rate of proteins in comparative genomics, which plays critical roles in both biology and medicine. A fundamental assumption of Ka/Ks is that synonymous mutations are evolutionarily neutral and not subject to natural selection as they do not alter protein sequences and function. Recently, however, a number of studies have demonstrated that synonymous mutations are non-neutral and functional through a number of mechanisms, such as altering miRNA regulation. This further implies that synonymous mutations also participate in the process of natural selection and thus Ka/Ks should be improved as well. For this purpose, here we propose an improved Ka/Ks ratio, i Ka/Ks, which redefines the neutral mutation rate by considering the altered status of miRNA regulation of the synonymous mutations, and thereby incorporate the impact of synonymous mutations on miRNA regulation. As a result, i Ka/Ks shows better performance than Ka/Ks when comparing them using their correlation with expression distance in the human-mouse study. Moreover, case studies showed that i Ka/Ks is able to identify the positive/negative selective genes that are missed by Ka/Ks. For example, TMEM72/Tmem72 is estimated to be positively selected by i Ka/Ks (1.13) but negatively selected by the conventional Ka/Ks ratio (0.21). Further evidence showed its rapid evolution, which further support the power of the new algorithm.

Deciphering cell types by integrating scATAC-seq data with genome sequences

Article

Full-text available

Apr 2024

The single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) technology provides insight into gene regulation and epigenetic heterogeneity at single-cell resolution, but cell annotation from scATAC-seq remains challenging due to high dimensionality and extreme sparsity within the data. Existing cell annotation methods mostly focus on the cell peak matrix without fully utilizing the underlying genomic sequence. Here we propose a method, SANGO, for accurate single-cell annotation by integrating genome sequences around the accessibility peaks within scATAC data. The genome sequences of peaks are encoded into low-dimensional embeddings, and then iteratively used to reconstruct the peak statistics of cells through a fully connected network. The learned weights are considered as regulatory modes to represent cells, and utilized to align the query cells and the annotated cells in the reference data through a graph transformer network for cell annotations. SANGO was demonstrated to consistently outperform competing methods on 55 paired scATAC-seq datasets across samples, platforms and tissues. SANGO was also shown to be able to detect unknown tumor cells through attention edge weights learned by the graph transformer. Moreover, from the annotated cells, we found cell-type-specific peaks that provide functional insights/biological signals through expression enrichment analysis, cis-regulatory chromatin interaction analysis and motif enrichment analysis.

Impact of benzo[a]pyrene, PCB153 and sex hormones on human ESC-Derived thyroid follicles using single cell transcriptomics

Article

May 2024
ENVIRON INT

Discrete latent embedding of single-cell chromatin accessibility sequencing data for uncovering cell heterogeneity

Article

Full-text available

May 2024

Single-cell epigenomic data has been growing continuously at an unprecedented pace, but their characteristics such as high dimensionality and sparsity pose substantial challenges to downstream analysis. Although deep learning models—especially variational autoencoders—have been widely used to capture low-dimensional feature embeddings, the prevalent Gaussian assumption somewhat disagrees with real data, and these models tend to struggle to incorporate reference information from abundant cell atlases. Here we propose CASTLE, a deep generative model based on the vector-quantized variational autoencoder framework to extract discrete latent embeddings that interpretably characterize single-cell chromatin accessibility sequencing data. We validate the performance and robustness of CASTLE for accurate cell-type identification and reasonable visualization compared with state-of-the-art methods. We demonstrate the advantages of CASTLE for effective incorporation of existing massive reference datasets in a weakly supervised or supervised manner. We further demonstrate CASTLE’s capacity for intuitively distilling cell-type-specific feature spectra that unveil cell heterogeneity and biological implications quantitatively.

Multifaceted role of TRIM21 in inflammation

Article

May 2024

Tripartite motif (TRIM) family members participate in a variety of cellular activities, such as intracellular signaling, development, cellular death, protein quality control, immunological defense, waste degradation, and the emergence of cancer. These proteins usually act as E3 ubiquitin ligase. The final line of resistance against infectious viruses is a cytosolic ubiquitin ligase and antibody receptor called TRIM containing 21. TRIM21, a protein with a tripartite structure, has been linked to autoimmune erythematosus, Sjogren’s disorder, and innate immunity. TRIM21 may either promote the formation of specific cancer-activating proteins, resulting in their proteasomal degradation, or it may do neither, depending on the kind of cancer and cancer-causing trigger. The current research has shown that the antiviral action of TRIM mostly depends on their role as E3-ubiquitin ligases and a significant portion of the TRIM family mediates the transmission of innate immune cell signals and the subsequent production of cytokines. We highlighted the function of TRIM family members in various inflammatory diseases.

The dynamic TRPV2 ion channel proximity proteome reveals functional links of calcium flux with cellular adhesion factors NCAM and L1CAM in neurite outgrowth

Article

May 2024
CELL CALCIUM

An integrated multi-omics analysis reveals osteokines involved in global regulation

Article

Apr 2024
CELL METAB

Bone secretory proteins, termed osteokines, regulate bone metabolism and whole-body homeostasis. However, fundamental questions as to what the bona fide osteokines and their cellular sources are and how they are regulated remain unclear. In this study, we analyzed bone and extraskeletal tissues, osteoblast (OB) conditioned media, bone marrow supernatant (BMS), and serum, for basal osteokines and those responsive to aging and mechanical loading/unloading. We identified 375 candidate osteokines and their changes in response to aging and mechanical dynamics by integrating data from RNA-seq, scRNA-seq, and proteomic approaches. Furthermore, we analyzed their cellular sources in the bone and inter-organ communication facilitated by them (bone-brain, liver, and aorta). Notably, we discovered that senescent OBs secrete fatty-acid-binding protein 3 to propagate senescence toward vascular smooth muscle cells (VSMCs). Taken together, we identified previously unknown candidate osteokines and established a dynamic regulatory network among them, thus providing valuable resources to further investigate their systemic roles.

Petromagnetic Properties In The Naica Mining District, Chihuahua, Mexico: Searching For Source of Mineralization

Article

Full-text available

Jan 2003
EARTH PLANETS SPACE

Ore mineral and host lithologies have been sampled with 89 oriented samples from 14 sites in the Naica District, northern Mexico. Magnetic parameters permit to charac- terise samples: saturation magnetization, density, low- high-temperature magnetic sus- ceptibility, remanence intensity, Koenigsberger ratio, Curie temperature and hystere- sis parameters. Rock magnetic properties are controlled by variations in titanomag- netite content and hydrothermal alteration. Post-mineralization hydrothermal alter- ation seems the major event that affected the minerals and magnetic properties. Curie temperatures are characteristic of titanomagnetites or titanomaghemites. Hysteresis parameters indicate that most samples have pseudo-single domain (PSD) magnetic grains. Alternating filed (AF) demagnetization and isothermal remanence (IRM) ac- quisition both indicate that natural and laboratory remanences are carried by MD-PSD spinels in the host rocks. The trend of NRM intensity vs susceptibility suggests that the carrier of remanent and induced magnetization is the same in all cases (spinels). The Koenigsberger ratio range from 0.05 to 34.04, indicating the presence of MD and PSD magnetic grains. Constraints on the geometry of the intrusive source body devel- oped in the model of the magnetic anomaly are obtained by quantifying the relative contributions of induced and remanent magnetization components.

Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs

Article

Full-text available

Dec 2002

Only a small proportion of the mouse genome is transcribed into mature messenger RNA transcripts. There is an international collaborative effort to identify all full-length mRNA transcripts from the mouse, and to ensure thaZhut each is represented in a physical collection of clones. Here we report the manual annotation of 60,770 full-length mouse complementary DNA sequences. These are clustered into 33,409 'transcriptional units', contributing 90.1% of a newly established mouse transcriptome database. Of these transcriptional units, 4,258 are new protein-coding and 11,665 are new non-coding messages, indicating that non-coding RNA is a major component of the transcriptome. 41% of all transcriptional units showed evidence of alternative splicing. In protein-coding transcripts, 79% of splice variations altered the protein product. Whole-transcriptome analyses resulted in the identification of 2,431 sense-antisense pairs. The present work, completely supported by physical clones, provides the most comprehensive survey of a mammalian transcriptome so far, and is a valuable resource for functional genomics.

Gapped BLAST and PSI-BLAST: A new generation of protein database search programs

Article

Full-text available

Sep 1997

The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSIBLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

Cavaille, J, Seitz, H, Paulsen, M, Ferguson-Smith, AC and Bachellerie, JP. Identification of tandemly-repeated C/D snoRNA genes at the imprinted human 14q32 domain reminiscent of those at the Prader-Willi/Angelman syndrome region. Hum Mol Genet 11: 1527-1538

Article

Jun 2002
HUM MOL GENET

Jérôme Cavaillé

A human imprinted domain at 14q32contains two co-expressed and reciprocally imprinted genes, DLK1 and GTL2, which are expressed from the paternally and maternally inherited alleles, respectively. We have previously shown that another imprinted locus, on human 15q11–q13, contains a large number of tandemly repeated C/D small nucleolar RNA genes (or C/D snoRNAs) only expressed from the paternal allele. Here we show that the region downstream from the GTL2 gene is also characterized by a transcription unit spanning many repeated intron-encoded C/D snoRNA genes, most of them arranged into two tandem arrays of 31 and 9 copies. Intriguingly, these snoRNAs depart from previously reported rRNA or snRNA methylation guides by their tissue-specific expression and by their lack of complementarity to rRNA or snRNA within their sequences. Analysis of the orthologous region in the mouse shows that the previously reported maternally expressed Rian gene, located downstream of Gtl2 on the distal 12chromosome, encodes at least nine C/D snoRNAs. Through a systematic search in rodents, we could identify other C/D snoRNA genes in this domain. All snoRNAs identified on mouse distal 12are brain-specific and only expressed from the maternally inherited allele. The human imprinted 14q32domain therefore shares common genomic features with the imprinted 15q11–q13 loci. This link between tandemly repeated C/D snoRNA genes and genomic imprinting suggests a role for these snoRNAs and/or their host non-coding RNA genes in the evolution and/or mechanism of the epigenetic imprinting process.

BLAT - the BLAST-like alignment tool

Article

Nov 2001
GENOME RES

W James Kent

Analyzing vertebrate genomes requires rapid mRNA/DNA and cross-species protein alignments. A new tool, BLAT, is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences. BLAT's speed stems from an index of all nonoverlapping K-mers in the genome. This index fits inside the RAM of inexpensive computers, and need only be computed once for each genome assembly. BLAT has several major stages. It uses the index to find regions in the genome likely to be homologous to the query sequence. It performs an alignment between homologous regions. It stitches together these aligned regions (often exons) into larger alignments (typically genes). Finally, BLAT revisits small internal exons possibly missed at the first stage and adjusts large gap boundaries that have canonical splice sites where feasible. This paper describes how BLAT was optimized. Effects on speed and sensitivity are explored for various K-mer sizes, mismatch schemes, and number of required index matches. BLAT is compared with other alignment programs on various test sets and then used in several genome-wide applications. http://genome.ucsc.edu hosts a web-based BLAT server for the human genome.
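
The indexing idea in this abstract can be illustrated with a minimal sketch. It is not BLAT's actual implementation: the function names `build_index` and `find_hits`, the toy sequences, and the choice of K = 4 are all assumptions made for illustration. The sketch indexes only the nonoverlapping K-mers of the genome, then scans every overlapping K-mer of the query against that index to find candidate homologous positions.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map each nonoverlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    # Step by k so that the indexed k-mers do not overlap.
    for start in range(0, len(genome) - k + 1, k):
        index[genome[start:start + k]].append(start)
    return dict(index)


def find_hits(index, query, k):
    """Return (query_pos, genome_pos) pairs for query k-mers found in the index."""
    hits = []
    # The query is scanned with overlapping k-mers, so a match is still
    # found regardless of how the query lines up with the genome's k-mer grid.
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], []):
            hits.append((q, g))
    return hits


# Toy data (hypothetical): the query is an exact substring of the genome.
genome = "ACGTACGTTTGACCAGTA"
index = build_index(genome, 4)
hits = find_hits(index, "CGTTTGACCA", 4)
print(hits)
```

Each hit only marks a candidate region; in the stages the abstract describes, such regions would then be aligned and stitched together, which this sketch does not attempt.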

Analysis of the mouse transcriptome based on functional annotation of 60

Article

Jan 2002
NATURE

International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860−921

Article

Feb 2001

The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

The expression of three abundance classes of messenger RNA in mouse tissues

Article

Jan 1977
CELL

Using the technique of mRNA-cDNA hybridization, we have shown that there are between 11,500 and 12,500 different mRNAs in three different mouse tissues: kidneys, brains, and livers. Several experiments suggest that in each tissue the mRNAs are organized into three abundance classes rather than as a continuum with respect to concentration. Cross-hybridization experiments show that the most abundant class of mRNA in each tissue is characteristic, and that a high proportion of the total sequences are common between tissues. For a more complete analysis, cDNA was fractionated into three classes. Studies using isolated abundant cDNA show that some abundant sequences of liver and kidney are present in other tissues, but among the lower frequency classes. Thus tissue-specific differences in mRNA populations may be related to abundance as well as qualitative differences. Using isolated middle frequency cDNA of the kidney, it was shown that of the 550 or so sequences in this class, approximately 500 are shared with the liver. Similarly, between 9,500 and 10,500 of the low frequency kidney cDNAs are shared with the brain and liver, respectively, suggesting that the majority of mRNAs may be involved with "housekeeping" activities. In an attempt to see whether abundance of mRNA is related to repetition of the sequence in the genome, it was shown that abundant and middle frequency cDNA of the liver and kidney contain a component that anneals with DNA repeated approximately 100 fold. However, the low frequency cDNA of the kidney contains no repeated sequences.

Three abundance classes in HeLa cell mRNA

Article

Aug 1974

Approximately 35,000 different poly(A)-containing RNA sequences are present in HeLa cell cytoplasm. The sequences are grouped in three distinct abundance classes.

A TEL-JAK2 Fusion Protein with Constitutive Kinase Activity in Human Leukemia

Article

Dec 1997

The Janus family of tyrosine kinases (JAK) plays an essential role in development and in coupling cytokine receptors to downstream intracellular signaling events. A t(9;12)(p24;p13) chromosomal translocation in a T cell childhood acute lymphoblastic leukemia patient was characterized and shown to fuse the 3′ portion of JAK2 to the 5′ region of TEL, a gene encoding a member of the ETS transcription factor family. The TEL-JAK2 fusion protein includes the catalytic domain of JAK2 and the TEL-specific oligomerization domain. TEL-induced oligomerization of TEL-JAK2 resulted in the constitutive activation of its tyrosine kinase activity and conferred cytokine-independent proliferation to the interleukin-3–dependent Ba/F3 hematopoietic cell line.
