A gene atlas of the mouse and human protein-encoding transcriptomes

Andrew I. Su*, Tim Wiltshire*, Serge Batalov*, Hilmar Lapp*, Keith A. Ching*, David Block*, Jie Zhang*,
Richard Soden*, Mimi Hayakawa*, Gabriel Kreiman*, Michael P. Cooke*, John R. Walker*, and John B. Hogenesch*§¶
*The Genomics Institute of the Novartis Research Foundation, 10675 John J. Hopkins Drive, San Diego, CA 92121; and
§Department of Neuropharmacology, The Scripps Research Institute, 10550 North Torrey Pines Road, San Diego, CA 92037
Edited by Peter K. Vogt, The Scripps Research Institute, La Jolla, CA, and approved March 2, 2004 (received for review February 3, 2004)
The tissue-specific pattern of mRNA expression can indicate im-
portant clues about gene function. High-density oligonucleotide
arrays offer the opportunity to examine patterns of gene expres-
sion on a genome scale. Toward this end, we have designed custom
arrays that interrogate the expression of the vast majority of
protein-encoding human and mouse genes and have used them to
profile a panel of 79 human and 61 mouse tissues. The resulting
data set provides the expression patterns for thousands of pre-
dicted genes, as well as known and poorly characterized genes,
from mice and humans. We have explored this data set for global
trends in gene expression, evaluated commonly used lines of
evidence in gene prediction methodologies, and investigated pat-
terns indicative of chromosomal organization of transcription. We
describe hundreds of regions of correlated transcription and show
that some are subject to both tissue and parental allele-specific
expression, suggesting a link between spatial expression and
imprinting.
The completion of the human and mouse genome sequences
opened an historic era in mammalian biology. One common
conclusion from these projects was the determination that
mammals have only ≈30,000 protein-encoding genes (1, 2). Yet,
despite the apparent tractability of this figure (earlier estimates
were much higher), to date all existing research has determined
the function of only a fraction of these genes. Currently, only
≈15,000 human and ≈10,000 mouse genes are described in the
literature (Medline, www.ncbi.nih.gov/Pubmed). The challenge
and opportunity for genomics strategies and techniques are to
accelerate the functional annotation of novel genes from the
uncharted genome.
High-throughput technologies for biological annotation have
the capacity to partially address the discrepancy between the
identification of genes and the understanding of their function.
For example, proteins have well defined molecular roles encoded
in their primary amino acid sequence as domains. Using se-
quence informatics, these domains can be used as a tool to search
the entire genome to find protein family members that likely
function in an analogous manner. Gene expression arrays have
also been a useful tool for genome-wide studies where changes
in gene expression can be associated with physiological or
pathophysiological states (3). Recently, other high-throughput
techniques such as RNA interference (4) and cDNA overex-
pression (5) have been developed, further accelerating func-
tional genome annotation. The integration of these diverse
strategies is critical to annotation efforts and remains a signif-
icant challenge.
Previously, we generated a preliminary description of the
human and mouse transcriptome using oligonucleotide arrays
that interrogate the expression of 10,000 human and 7,000
mouse target genes (6). We explored this data set for insights
into gene function, transcriptional regulation, disease etiology,
and comparative genomics. However, this data set was based on
commercially available gene expression arrays and therefore was
biased toward previously characterized genes. In this report, we
significantly extend this earlier work by determining the expres-
sion patterns of previously uncharacterized protein-encoding
genes and de novo gene predictions from the mouse and human
genome projects. Using custom-designed whole-genome gene
expression arrays that target 44,775 human and 36,182 mouse
transcripts, we have built a more extensive gene atlas using a
panel of RNAs derived from 79 human and 61 mouse tissues.
This data set constitutes one of the largest quantitative evalua-
tions of gene expression of the protein-encoding transcriptome
to date.
Building on our previous analyses, these expression patterns
were examined for global trends in gene expression. We also
provide experimental validation of thousands of gene predic-
tions and use these data to determine which of the commonly
used types of evidence for gene prediction most accurately
correlates with expressed genes. In addition, we used this data set
to search for chromosomal regions of correlated transcription
(RCTs), which may indicate higher-order mechanisms of tran-
scriptional regulation. Furthermore, we show that some of these
tissue-specific coregulated genes are subject to another form of
regulation, parental imprinting, and thus that several of these
regions are under the control of both tissue- and parental
allele-specific expression. Finally, we have made these data
publicly available for searching and visualization by keyword,
accession number, sequence, expression pattern, and coregulation
at our web site (http://symatlas.gnf.org).
Materials and Methods
Microarray Chip Design. We identified a nonredundant set of
target sequences for the human and mouse using the following
sources: RefSeq (15,491 human and 12,029 mouse sequences)
(7); Celera (49,859 human and 29,331 mouse sequences) (8);
Ensembl (33,698 human sequences); and RIKEN (46,299 mouse
sequences) (9). First, all sequences were screened with REPEATMASKER
(www.repeatmasker.org) to remove repetitive elements.
Next, sequence identity between individual sequences was established
by using pairwise BLAT (10) or BLAST (11) and SIM4
(12). The results from single-linkage clustering were further
triaged to produce a final target set of 44,775 human and 36,182
mouse targets with the highest degree of confidence of compu-
tational prediction [biasing toward sequences containing Inter-
pro domains (13) and away from noncoding RNAs]. Finally, the
human sequence set was pruned of all targets already repre-
sented on the Affymetrix (Santa Clara, CA) commercially
available HG-U133A array, leaving 22,645 target sequences for
our custom array. One hundred target sequences from the
HG-U133A chip were also included in the GNF1H design for the
normalization procedure (see below). The final human and
mouse target sets were submitted to the Affymetrix chip design
pipeline for fabrication of the GNF1H and GNF1M arrays,
respectively.

This paper was submitted directly (Track II) to the PNAS office.
Abbreviations: RCT, regions of correlated transcription; AUC, area under the curve; LCR,
locus control regions.
A.I.S., T.W., and S.B. contributed equally to this work.
Present address: Center for Biological and Computational Learning, Massachusetts Institute
of Technology, MIT E25-201, Cambridge, MA 02142.
To whom correspondence should be addressed. E-mail: hogenesch@gnf.org.
© 2004 by The National Academy of Sciences of the USA
PNAS, April 20, 2004, vol. 101, no. 16, pp. 6062–6067; www.pnas.org/cgi/doi/10.1073/pnas.0400782101

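The single-linkage clustering step of the chip design can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the example identifiers, and the assumption that a list of matching pairs has already been derived from BLAT, BLAST, or SIM4 hits are all hypothetical. Under single linkage, two sequences fall in the same cluster whenever a chain of pairwise matches connects them.

```python
from collections import defaultdict

def single_linkage_clusters(ids, similar_pairs):
    """Group ids so that any two ids joined by a chain of similar pairs share a cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Walk up to the cluster root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(list)
    for i in ids:
        clusters[find(i)].append(i)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical identifiers; a real run would use sequence accessions.
ids = ["refseq_1", "celera_7", "ensembl_3", "riken_9", "celera_2"]
pairs = [("refseq_1", "celera_7"), ("celera_7", "ensembl_3")]
clusters = single_linkage_clusters(ids, pairs)
print(clusters)
```

Note that refseq_1 and ensembl_3 end up together even though they were never compared directly, which is the defining property of single linkage and the reason the resulting clusters needed further triage.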
Tissue Preparation. Human tissue samples were obtained from
several sources: Clinomics Biosciences (Pittsfield, MA), Clontech,
AllCells (Berkeley, CA), Clonetics/BioWhittaker (Walkersville, MD),
AMS Biotechnology (Abingdon, Oxfordshire,
U.K.), and the University of California at San Diego. When
samples from four or more subjects were available, equal num-
bers of male and female subjects were used to make two
independent pools; when fewer than four samples were available,
RNA samples were pooled, and duplicate amplifications were
performed for each pool (detailed annotation for human sam-
ples is on our web site and in Table 1, which is published as
supporting information on the PNAS web site). Adult (10- to
12-week-old) mouse tissue samples were independently generated
from two groups of four male and three female C57BL/6
mice by dissection, and tissues were subsequently quickly frozen
on dry ice. Tissues were pulverized while frozen, and total RNA
was extracted with Trizol (Invitrogen, Carlsbad) by using 100
mg of tissue, then further processed by using the RNeasy
miniprep kit according to manufacturer’s protocols (Qiagen,
Chatsworth, CA). The quality of all samples was determined with
an Agilent Bioanalyzer (Palo Alto, CA).
Microarray Procedure. Microarray analysis was performed essentially
as described (14). Briefly, 5 μg of total RNA was used to
synthesize cDNA that was then used as a template to generate
biotinylated cRNA. cRNA was fragmented and hybridized to
Affymetrix custom or commercially available gene expression
arrays. The arrays were then washed and scanned with a laser
scanner, and images were analyzed by using the MAS5 algorithm
(15). Arrays were normalized by using global median scaling.
The human HG-U133A and GNF1H chips, which were hybridized
to the same biological sample, were first paired and
prenormalized by using the common targets. The condensed
data files are available from our web site (http://symatlas.gnf.org)
and Gene Expression Omnibus (www.ncbi.nih.gov/geo) (16).
Raw CEL files will be provided upon request
(http://symatlas.gnf.org).
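Global median scaling, as named in the procedure above, can be illustrated with a short Python sketch. The function name, the sample values, and the choice of target (here the median of the per-array medians) are assumptions for illustration; the text does not state which target value was used.

```python
from statistics import median

def global_median_scale(arrays, target=None):
    """Rescale every array so that its median signal equals a common target."""
    medians = {name: median(values) for name, values in arrays.items()}
    if target is None:
        # Assumption: use the median of the per-array medians as the target.
        target = median(medians.values())
    return {
        name: [v * target / medians[name] for v in values]
        for name, values in arrays.items()
    }

# Hypothetical signal values for three arrays.
arrays = {
    "liver": [50.0, 100.0, 400.0],
    "testis": [100.0, 200.0, 800.0],
    "heart": [25.0, 50.0, 200.0],
}
scaled = global_median_scale(arrays)
print(scaled)
```

Because each array is multiplied by a single factor, relative expression within an array is preserved while overall intensity differences between arrays are removed.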
Identification of RCTs. All target genes were mapped to their
corresponding genome assembly (human to National Center for
Biotechnology Information Hs34 assembly, mouse to February
2003 Mm30 assembly) by using BLAT (10). To account for
multiple probes interrogating a single gene, target sequences
were also compared to UniGene (www.ncbi.nih.gov/UniGene)
by using BLAST. Target sequences that map within 25 kb of each
other and to a common UniGene cluster were pooled, and their
expression values were averaged and treated as a single target in
the RCT analysis. Next, each chromosome was scanned in
window sizes of 3–10 adjacent genes. Windows where >50% of
all pairwise comparisons of expression pattern showed a Pearson
correlation coefficient >0.6 were identified as RCTs. Randomization
studies of gene order confirmed the significance of both
the overall number of RCTs and the average pairwise correlation
of each individual RCT (P < 0.005, correcting for multiple
testing). Pairwise sequence similarity within each RCT was
assessed by using TBLASTX (11), where a similarity value is the
product of the alignment similarity and the percentage of total
sequence length aligned. Synteny between the human and mouse
genome assemblies was derived from a published analysis of
syntenic anchors (17). For the analysis of evolutionarily con-
served RCTs, only the 32 tissues profiled in common between
the mouse and human data sets were used. All analyses and
visualizations were performed by using R (www.r-project.org).
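The window scan described above can be sketched in Python (the original analyses were performed in R; this is an illustrative re-expression, not the authors' code). Several points are assumptions: the cutoffs are read as strict inequalities (more than 50% of pairs with r greater than 0.6), a flat profile is given a correlation of 0, overlapping windows are reported separately rather than merged, and the pooling of probes within 25 kb and the randomization test are omitted. The function names and example profiles are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return sxy / sqrt(sxx * syy)

def find_rcts(profiles, min_size=3, max_size=10, r_cutoff=0.6, pair_fraction=0.5):
    """Scan genes ordered by chromosomal position; return (start, end) windows,
    end exclusive, in which enough gene pairs are highly correlated."""
    n = len(profiles)
    cache = {}

    def r(i, j):
        if (i, j) not in cache:
            cache[(i, j)] = pearson(profiles[i], profiles[j])
        return cache[(i, j)]

    hits = []
    for size in range(min_size, max_size + 1):
        for start in range(n - size + 1):
            pairs = list(combinations(range(start, start + size), 2))
            high = sum(1 for i, j in pairs if r(i, j) > r_cutoff)
            if high / len(pairs) > pair_fraction:
                hits.append((start, start + size))
    return hits

# Five hypothetical adjacent genes profiled in four tissues.
profiles = [
    [1, 2, 3, 10],
    [2, 4, 6, 20],
    [1, 2, 4, 9],
    [10, 3, 2, 1],
    [5, 1, 5, 1],
]
rcts = find_rcts(profiles)
print(rcts)
```

In this toy input only the first three genes form a qualifying window; extending the window to the fourth gene drops the fraction of correlated pairs to exactly one half, which does not pass a strict cutoff.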
Imprinting Analysis. Allele-specific probe expression analysis was
used to identify genes with an imprinted expression pattern. Two
distinct mouse strains, C57BL/6J (B6) and Mus musculus castaneus
(CAST/Ei), were bred to produce four independent
mouse crosses (male::female): B6::B6, B6::CAST/Ei,
CAST/Ei::B6, and CAST/Ei::CAST/Ei. Each litter of embryonic
day 14–16 embryos was pooled, and RNA from four to five
separate litters was labeled and hybridized to GNF1M arrays. A
probe-level analysis was performed to detect naturally occurring
polymorphisms between the two strains. Individual probes (but
not entire probe sets) showing a significantly different signal
between the two homozygous groups were identified as putative
polymorphisms in the target gene. Next, the hybridization signal
from the two reciprocal crosses was examined for statistically
significant differences in signal based on the paternal or maternal
allele, as assessed by t test (P < 0.001), indicating a pattern
of male or female imprinting.
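The comparison of the two reciprocal crosses for a single polymorphic probe can be sketched as below. This is a hedged illustration only: the text specifies a t test at P < 0.001 but not its variant, so a pooled-variance two-sample test is assumed here, the P value is obtained by numerically integrating Student's t density rather than by calling a statistics library, and the signal values and function names are hypothetical.

```python
from math import gamma, pi, sqrt
from statistics import mean, variance

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    norm = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return norm * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided P value via Simpson integration of the density over [0, |t|]."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * central)

def student_t_test(x, y):
    """Pooled-variance two-sample t test; returns (t, df, p)."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / sqrt(pooled * (1 / nx + 1 / ny))
    return t, df, two_sided_p(t, df)

# Hypothetical signal of one allele-specific probe in four litters per cross.
cross_b6_x_cast = [1200.0, 1150.0, 1250.0, 1180.0]
cross_cast_x_b6 = [300.0, 320.0, 280.0, 310.0]
t_stat, df, p_value = student_t_test(cross_b6_x_cast, cross_cast_x_b6)
print(df, p_value < 0.001)
```

A probe whose signal depends on which parent contributed the matching allele, as in this toy input, is the pattern the analysis interprets as imprinting.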
Results and Discussion
The tissue-specific RNA expression pattern of a gene can
indicate important clues to its physiological function. To build an
extensive atlas of tissue-specific gene expression, we created
custom arrays that interrogate the expression of known and
predicted protein-encoding genes from the mouse and human
genomes. The design process used a nonredundant set of known
genes and gene predictions compiled from Refseq, Celera,
Ensembl (for human), and RIKEN (for mouse). For our GNF1H
custom human array, we further removed gene targets that were
already represented on the commercially available HG-U133A
array from Affymetrix. Finally, we biased the final selection
toward gene predictions with likely protein-coding regions. In
total, the U133A/GNF1H chipset interrogates 44,775 probe sets,
and our custom-designed GNF1M mouse array interrogates
36,182 probe sets. As of the most current annotation in January
2004, these correspond to 33,698 and 33,825 unique human and
mouse genes, respectively, after accounting for multiple probe
sets interrogating single genes and split transcripts.
Using these whole-genome gene expression arrays, we mea-
sured the expression of an extensive set of transcripts and
transcript predictions on a single technology platform across a
diverse panel of 79 human and 61 mouse tissues. This gene atlas
represents the normal transcriptome and allowed us to examine
global trends in gene expression. Classical reassociation kinetics
(Rot) has been used to assess global trends in gene expression at
a population level (18). The analysis of our data set expands this
knowledge by examining transcript expression across a large
number of tissues at the individual transcript level. We find that
52% (16,454) and 59% (17,924) of target genes are detected in
at least one tissue in the human and mouse, respectively (Fig. 4A,
which is published as supporting information on the PNAS web
site). The average number of transcripts expressed in a single
tissue was 8,200 (mouse). These observations generally concur
with previous findings derived from Rot analyses, which indicate
that 10,000–15,000 mRNAs are expressed in a given tissue at
1–10 copies per cell, and that 90% of these are common
between two tissues (19). However, although Rot analysis sug-
gests that the majority of transcripts are present in many or all
tissues, our data show that 1% of human target sequences are
ubiquitously expressed. Approximately 3% of mouse target
sequences are detected in all samples profiled, although this
number will certainly decline as the number of samples increases.
Not surprisingly, the expression of these ubiquitously expressed
housekeeping genes is 30-fold higher than for all genes in the
data set (Fig. 4B).
Another valuable use of this data set is characterization of
novel predicted genes derived from the mouse and human
genome projects (1, 2). Many of these exist solely as in silico
predictions, and therefore evidence of their expression can serve
as validation of these predictions. Furthermore, determining the
expression pattern of an uncharacterized gene can indicate the
appropriate tissue(s) from which the transcript can be cloned, as
well as provide a base layer of physiological annotation. Gene
prediction is an inexact art, where distinct methods and research-
ers often produce largely nonoverlapping sets of gene predic-
tions (20). For the human data, we subdivided the transcripts
into four classes based on annotation information at the time of
design: known genes found in Refseq, genes predicted indepen-
dently by two groups (Celera and Ensembl), singleton predic-
tions found by the Ensembl group only, and singleton predictions
found by the Celera group only. As expected, the set of known
genes (Refseq) has the highest rate of detection in our data set,
because 79% have detectable expression in at least one sample
(Fig. 1). Because all Refseq genes are known to be expressed, this
suggests that our methodologies and current tissue libraries have
a minimum false-negative rate of 21% in detection of expres-
sion. This can certainly be improved with the profiling of
additional tissues and cell types. Consensus gene predictions
have a higher rate of detectable expression (53%) than either
singleton gene predictions offered by Ensembl or Celera only
(30% and 24%, respectively) (Fig. 1). Although the Ensembl-
only group had a slightly higher rate of detection, a greater total
number of Celera-only predictions was detected (2,918 Celera vs.
618 Ensembl predictions). Analogous results are seen in the
mouse data set, in which Refseq genes had a higher rate of
detection than gene predictions by Celera (79% vs. 46%). The
differences among these three classes are also reflected in the
quantitative measures of gene expression. On average, human
Refseq genes are expressed at a level 2-fold higher than con-
sensus predictions, which in turn are 66% higher than singleton
predictions (P < 0.001; data not shown). This observation likely
reflects a historical bias in the biology of studying highly
abundant proteins. In total, we find evidence of expression for
5,641 (31.2%) human and 2,629 (46.2%) mouse gene predictions
through detection of their transcribed mRNA product in at least
one tissue. In addition, we describe the expression pattern for
9,708 mouse RIKEN-derived genes, many of which lack signif-
icant expression annotation. It is important to note that the gene
predictions for which we do not observe detectable expression
are not necessarily incorrect, because the appropriate tissue(s)
for a given gene may have not been profiled, the gene may be
present in a small number of copies (e.g., in a small subset of cells
within a tissue), or the probe set may not properly interrogate the
expression of the gene (e.g., UTRs, split transcripts, or missing
or mistaken terminal exons). Despite these caveats, this data set
provides the expression pattern of thousands of gene predictions
and poorly characterized transcripts from the mouse and human
genome projects, offering the opportunity to study the function
of these genes in their most relevant tissues.
Given the differing methods and subsequent results from gene
prediction efforts, we next investigated which characteristics of
a predicted transcript were better indicators of its detectable
expression. In the methodology used by Celera, the following
lines of evidence were considered in their gene prediction
algorithm: “conservation between mouse and human genomic
DNA, similarity to human [and] rodent transcripts (ESTs and
cDNAs), and similarity of the translation of human genomic
DNA to known proteins” (1). Using the detectable expression of
a gene product as validation of the prediction, we created
receiver operating characteristic curves for each line of evidence
that plot the true positive rate as a function of the false positive
rate. The area under the curve (AUC) measures the strength of
the predictor; a perfect predictor would have AUC = 1, and a
random factor would have AUC = 0.5. When comparing the
predictor strength among the three lines of evidence above in the
human data set, we find that although no single line of evidence
is universally predictive of expression, EST evidence has the
most predictive value (AUC = 0.77) (Fig. 5, which is published
as supporting information on the PNAS web site), an observa-
tion likely linked to the fact that highly expressed genes are more
likely to be represented in EST databases. Protein homology
support and sequence similarity between human and mouse
genomic sequences both had a lesser impact on the validation of
gene predictions (AUC of 0.66 and 0.65, respectively). The
availability of additional mammalian genome sequences should
increase the power of sequence conservation in gene prediction.
Somewhat surprisingly, simply the length of the transcript prediction
was also a reasonable predictor of detection in our data
set (AUC = 0.68), suggesting that incomplete transcript predictions
were significant factors in the nonobservation of many gene
targets.
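The AUC used above to rank lines of evidence can be computed directly from its probabilistic interpretation: it equals the probability that a randomly chosen detected prediction scores higher than a randomly chosen undetected one, with ties counted as one half. The Python sketch below is illustrative only; the function name and the example scores (imagined as counts of supporting ESTs) are hypothetical and not taken from the data set.

```python
def roc_auc(scores, detected):
    """Area under the ROC curve for a score used to predict detectable expression."""
    pos = [s for s, d in zip(scores, detected) if d]
    neg = [s for s, d in zip(scores, detected) if not d]
    if not pos or not neg:
        raise ValueError("need at least one detected and one undetected prediction")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # detected prediction ranks higher
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical evidence scores for six gene predictions.
est_support = [12, 7, 3, 5, 0, 1]
was_detected = [True, True, True, False, False, False]
auc = roc_auc(est_support, was_detected)
print(auc)
```

A score that separates the two groups perfectly gives 1, and a score unrelated to detection gives values near 0.5, matching the two reference points stated in the text.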
We and others have used gene expression information, ge-
nome sequence, and de novo motif discovery tools to search for
enhancer elements that direct tissue-specific gene expression
(21, 22). In contrast to enhancers that generally direct the
expression of a single gene, locus control regions (LCR) are
characterized by their ability to promote the expression of
multiple genes at a single locus. To date, only a handful of LCRs
have been reported (23). Recently, Spellman and Rubin (24)
used Drosophila gene expression arrays to identify 200 clusters
of adjacent and similarly expressed genes and suggest that these
patterns are most consistent with regulation of chromatin structure.
Others (25–27) have also performed similar analyses in
humans, Caenorhabditis elegans, and yeast on more limited sets
of experimental conditions.

Fig. 1. Validation of gene predictions in humans. Gene targets on the GNF1H
array were divided into four categories: contained in Refseq, predicted by
both gene prediction efforts considered (“Consensus”), and predicted by only
one group (“Ensembl-only” and “Celera-only”). On the left axis (solid bars),
rates of validation are shown, where detectable expression in at least one
tissue is taken as evidence of the validity of a gene prediction. The right axis
(blue line) indicates the total number of validated genes per group.

To identify potential loci in our data set, the expression of
which may be controlled in a locus-dependent manner, we
mapped the transcripts represented on our gene expression
arrays to genome assemblies and scanned each chromosome for
windows of genes with correlated expression patterns. We called
these sites RCTs as a general term encompassing LCRs and
correlated expression achieved through gene duplication. It is
important to note that detection of these RCTs is heavily
influenced by comparison algorithms, normalization procedures,
and underlying data. In particular, the inclusion of several
purified immune cell populations in our human sample set
skewed the normalization procedure and led to an increase in
RCTs whose expression is enriched in these samples. In total, we
identified 156 and 108 RCTs in human and mouse, respectively
(descriptions of all RCTs are available for download from
http://symatlas.gnf.org). Tissues with very specific clusters of
genes such as those in the immune system, liver, testis, and
placenta had more RCTs than other tissues in both the mouse
and human data sets. Mechanistically, expression of these RCTs
is likely mediated through either common promoter elements
(resulting from gene duplication) or through higher-order gene
regulation such as site-specific chromatin remodeling. To sepa-
rate these two possibilities, we identified likely paralogs using
TBLASTX, a local six-frame translated nucleotide-to-nucleotide
alignment algorithm (11). RCTs whose genes share significant
sequence similarity in their coding sequences are likely to be
products of gene duplication, whereas dissimilar genes may result
from an LCR or other higher-order transcriptional regulation.
As expected, we found RCTs with both related and unrelated
genes. Fig. 2A illustrates an example of an RCT driven by gene
duplication. This cluster of genes on mouse chromosome 9
represents a family of 11 uncharacterized F-box and WD40
repeat containing proteins that are specifically expressed in
fertilized eggs and oocytes. Because of their high degree of
sequence similarity, we hypothesize that their correlated expres-
sion pattern is a result of duplicated regulatory elements present
in their structural genes, and that these genes may play an
important role in the specialized protein expression of oocytes.
In contrast, we also note a cluster of three genes with no
apparent sequence similarity on human chromosome 13 that are
highly enriched in samples derived from brain tissues, particu-
larly the fetal brain sample (Fig. 2B). The genes in this cluster
are neurobeachin, an uncharacterized mRNA, and doublecortin-
and calmodulin kinase-like 1 protein (DCAMKL1). It is appeal-
ing to hypothesize that the correlated expression patterns of
these genes and their colocalization at a chromosomal locus
indicate a common role in a neurological process or network.
Because these genes do not share sequence similarity, this region
may also contain a previously unrecognized LCR or strong
regional enhancer. Overall, 97 (62%) and 78 (72%) of the human
and mouse RCTs identified have an average pairwise sequence
similarity of <20% and do not encode related genes.

Fig. 2. RCT. (A) An RCT was identified on mouse chromosome 9, consisting of 11 genes that share a highly conserved expression pattern. (Upper) The y axis
is average normalized expression value, the x axis contains the 61 different tissues, and red bars are fertilized egg and oocytes. The correlation plot (Lower Left)
visualizes the pairwise correlation coefficients. Each row represents a gene, ordered vertically according to their position on the chromosome. The center yellow
vertical strip represents autocorrelation (R = 1); positions to the right of center represent correlation of the gene to its downstream neighbors, whereas positions
to the left represent correlation to the upstream neighbors. Yellow indicates high correlation; blue indicates low correlation (scale at bottom). The sequence
similarity plot (using TBLASTX, Lower Right) has the same structure as the correlation plot, except pairwise sequence similarity is shown. In this RCT with high
expression levels in fertilized eggs and oocytes, the genes share a high degree of sequence similarity, likely indicating they are all members of a single gene family
and the result of one or more gene duplication events. (B) An example RCT is identified on human chromosome 13, which contains three genes with highly
correlated expression (red bars are brain regions, green bar is fetal brain). In contrast to the first example, these genes share very little pairwise sequence
similarity. (C) An evolutionarily conserved RCT is shown from human chromosome 2 (Left) and the syntenic region on mouse chromosome 6 (Right). These RCTs
share a pancreas-enriched expression pattern (red bar), as well as significant sequence similarity.

We next examined both the mouse and human data for RCTs
that were identified in both data sets and are likely evolutionarily
conserved. The majority of the RCTs were not found in both
human and mouse, in many cases because the orthologs or
syntenic regions have not yet been defined or the patterns were
not conserved. However, in some cases, the apparent lack of
conservation likely reflects physiological differences between
the two organisms. For example, we observed RCTs with
expression enriched in the olfactory bulb present in the mouse
but not the human data set. Nevertheless, several RCTs were
conserved, including a cluster of pancreas-specific genes map-
ping to human chromosome 2 and its syntenic region on mouse
chromosome 6 (Fig. 2C). The human cluster is comprised of five
genes, including pancreatitis-associated proteins (PAP), three
regenerating islet-derived proteins (REG1A, REG1B, and
REGL), and one protein of unknown function (LPPM429). The
mouse cluster contains the ortholog to PAP, four isoforms of
regenerating islet-derived proteins, and islet neogenesis-
associated protein-related protein. The conservation of this
RCT in human and mouse suggests that these genes perform
analogous and important roles in both of these mammals.
After mapping all target genes to their respective genome
assemblies, we noted a region of mouse chromosome 7 (130 Mb)
that contained several genes previously shown to be imprinted
(28–30), three of which (H19, Igf2, and Cdkn1c) shared a pattern
of enriched expression in placenta, umbilical cord, and embry-
onic tissues. We also noted another pair of adjacent genes (Zim1
and Peg3) elsewhere on chromosome 7 (6 Mb) that shared this
tissue-specific expression pattern, and whose expression has
been shown to be imprinted (31). Prompted by these observa-
tions, we examined our set of RCTs for other imprinted genes
that were clustered in a single locus. On mouse chromosome 12
(103 Mb), we observed an RCT that consists of six adjacent
genes, all with enriched expression in brain regions and umbilical
cord (Fig. 3 A and B). Recently, several groups showed that two
genes in this locus, Dlk1 and Gtl2, are imprinted (reviewed in ref.
32). Later, it was also shown that another gene at this locus, Rian,
and several adjacent tandemly repeated C/D small nucleolar
RNA genes are also imprinted (33, 34). Furthermore, although
we do not have a probe set on our array that reliably detects its
expression, Dio3 is located proximal to this locus and has also
been shown to exhibit genomic imprinting (35). The imprinting status
of the three remaining RIKEN clones at this locus
(1110006E14Rik, 5330411G14Rik, and C130007E11Rik) is not
known, although they share the brain- and umbilical cord-enriched
expression characteristic of all of the genes in the RCT.

Fig. 3. Six genes on mouse chromosome 12 share a distinctive pattern of expression. (A) A genomic view of this region (not to scale). Locations of the genes
on the mouse genome assembly: Dlk1 (103.508 Mb), Gtl2 (103.593 Mb), 1110006E14Rik (103.646 Mb), Rian (103.696 Mb), 5330411G14Rik (103.788 Mb),
C130007E11Rik (103.798 Mb), and Dio3 (104.328 Mb). (B) These genes share enriched expression in brain regions (green bars) and umbilical cord (red bar). The
y axes indicate normalized expression values, whereas each bar along the x axis indicates a sample profiled in our data set. (C) Three of these genes (Dlk1, Gtl2,
and Rian) have been previously reported to be imprinted. Using our allele-specific probe expression analysis approach (see text), we confirmed the imprinted
regulation of Gtl2 and Rian and report two previously undescribed imprinted transcripts at this locus (5330411G14Rik and C130007E11Rik). The y axes indicate
the normalized signal intensity for individual probes on the array, and each bar represents a pooled sample from a cross indicated by color (see key).

To investigate whether these three genes were also imprinted,
we used two distinct mouse strains, C57BL/6J (B6) and M. m.
castaneus (CAST/Ei), to set up four independent mouse crosses
(male::female): B6::B6, B6::CAST/Ei, CAST/Ei::B6, and
CAST/Ei::CAST/Ei. Four independent litters of pooled embryonic
day 14–16 embryos were dissected, and RNA expression
was analyzed by allele-specific probe expression analysis, which
allows us to determine whether the transcript is expressed
exclusively or preferentially from either the paternal or maternal
allele. This analysis reconfirmed the imprinted expression of
Gtl2 and Rian (Fig. 3C). Because no probes could distinguish
between the B6 and CAST/Ei forms of Dlk1, we were unable
to reconfirm its imprinted expression. Two of the uncharacterized
RIKEN genes at this locus, 5330411G14Rik and
C130007E11Rik, showed expression from the maternal allele
only, further expanding the number of known imprinted genes at
this locus (Fig. 3C). Because these cDNAs are within 10 kb of one
another, it is possible they are derived from the same structural
gene. The third gene (1110006E14Rik), like Dlk1, had no
probe capable of ascertaining its imprinting status. During the
preparation of this manuscript, another gene in this locus sharing
the 3′ end of C130007E11Rik was also shown to be imprinted
(36). In sum, the allele-specific probe expression analysis method
has identified two additional imprinted transcripts at this locus.
Furthermore, based on the observation that well-characterized
imprinted loci on mouse chromosomes 7 and 12 share a common
pattern of gene expression in our data, we speculate that the
LCR machinery that regulates the parental expression of these
genes may also influence their tissue-specific expression pattern.
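
The logic of reading parental origin from the four crosses described above can be sketched as follows. This is an illustration only, not the analysis code used in this study. It assumes a single probe that hybridizes to one strain's allele much more strongly than to the other's, and the field names, the fold-change cutoff `min_ratio`, and the intensity values are all hypothetical.

```python
from dataclasses import dataclass

EPS = 1e-9  # avoids division by zero for absent signals


@dataclass
class ProbeSignal:
    """Normalized intensity of one probe in each cross (sire x dam)."""
    b6_x_b6: float
    b6_x_cast: float    # sire B6, dam CAST/Ei
    cast_x_b6: float    # sire CAST/Ei, dam B6
    cast_x_cast: float


def _ratio(a, b):
    return (a + EPS) / (b + EPS)


def classify_probe(sig, min_ratio=4.0):
    """Classify expression as maternal, paternal, biallelic, or uninformative.

    min_ratio is a hypothetical fold-change cutoff, not a value from the paper.
    """
    # Step 1: the homozygous crosses show which strain's allele the probe detects.
    if _ratio(sig.b6_x_b6, sig.cast_x_cast) >= min_ratio:
        # Probe detects B6: that allele is paternal in B6 x CAST, maternal in CAST x B6.
        paternal, maternal = sig.b6_x_cast, sig.cast_x_b6
    elif _ratio(sig.cast_x_cast, sig.b6_x_b6) >= min_ratio:
        # Probe detects CAST/Ei: maternal in B6 x CAST, paternal in CAST x B6.
        paternal, maternal = sig.cast_x_b6, sig.b6_x_cast
    else:
        # The probe cannot tell the strains apart, so parental origin is unreadable.
        return "uninformative"
    # Step 2: compare the reciprocal crosses.
    if _ratio(maternal, paternal) >= min_ratio:
        return "maternal"
    if _ratio(paternal, maternal) >= min_ratio:
        return "paternal"
    return "biallelic"


if __name__ == "__main__":
    # Hypothetical intensities for a B6-specific probe on a maternally expressed transcript.
    print(classify_probe(ProbeSignal(1000.0, 50.0, 900.0, 40.0)))
```

The "uninformative" branch corresponds to the situation reported above for Dlk1 and 1110006E14Rik, where no probe could distinguish the two strains. A real analysis would combine several probes and replicate litters rather than rely on a single cutoff.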
Conclusion
Here we report an extensive compendium of gene expression for the
mouse and human protein-encoding transcriptomes. Further
augmentation by additional samples, including region-specific
dissections using laser capture microdissection or even cell type-specific
gene expression, will undoubtedly increase the utility of
these resources. We have investigated this data set for global
signatures in tissue-specific gene regulation, expression characteristics
of de novo predicted transcripts, and chromosomal RCTs. The
identification of several known imprinted loci in our tissue-specific
RCT list suggests that the regulatory mechanisms that direct
tissue-specific and parental allele-specific expression may be intertwined.
Consistent with this observation, we were able to identify two
previously undescribed imprinted transcripts on mouse
chromosome 12 based on the observation that they share a tissue-specific
expression pattern with their neighbors.
With the sequencing phase of the human and mouse genome
projects nearly complete, and with the rapid progress in the
sequencing of other mammalian genomes, we are now poised to
develop and exploit a variety of methods to ascertain the
function of the thousands of recently described genes. In this
regard, the genome-scale RNA expression data described herein
provide a framework for the functional annotation process. By
making the underlying data available on our web site
(http://symatlas.gnf.org) and through the Gene Expression Omnibus
(www.ncbi.nih.gov/geo), we anticipate that this study will aid
researchers throughout the global research community to reap
the harvests of the human and mouse genome projects.
We thank the following individuals for providing human RNA samples:
Gino Van Heeke, Novartis (bronchial epithelial cells); Graeme Bilbe,
Novartis (fetal thyroid); Clifford Shults, University of California at San
Diego (whole blood); Bill Sugden, University of Wisconsin, Madison
(721 B-lymphoblasts); and Joseph D. Buxbaum, Mt. Sinai School of Medicine,
New York (prefrontal cortex). We also thank Ines Hoffmann and Satchin
Panda for preparation of mouse embryonic samples and Peter Dimitrov,
Christian Zmasek, and Michael Heuer for technical expertise. This work
was supported by the Novartis Research Foundation.
1. Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304–1351.
2. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001) Nature 409, 860–921.
3. Su, A. I., Welsh, J. B., Sapinoso, L. M., Kern, S. G., Dimitrov, P., Lapp, H., Schultz, P. G., Powell, S. M., Moskaluk, C. A., Frierson, H. F., Jr., et al. (2001) Cancer Res. 61, 7388–7393.
4. Aza-Blanc, P., Cooper, C. L., Wagner, K., Batalov, S., Deveraux, Q. L. & Cooke, M. P. (2003) Mol. Cell 12, 627–637.
5. Chanda, S. K., White, S., Orth, A. P., Reisdorph, R., Miraglia, L., Thomas, R. S., DeJesus, P., Mason, D. E., Huang, Q., Vega, R., et al. (2003) Proc. Natl. Acad. Sci. USA 100, 12153–12158.
6. Su, A. I., Cooke, M. P., Ching, K. A., Hakak, Y., Walker, J. R., Wiltshire, T., Orth, A. P., Vega, R. G., Sapinoso, L. M., Moqrich, A., et al. (2002) Proc. Natl. Acad. Sci. USA 99, 4465–4470.
7. Pruitt, K. D. & Maglott, D. R. (2001) Nucleic Acids Res. 29, 137–140.
8. Kerlavage, A., Bonazzi, V., di Tommaso, M., Lawrence, C., Li, P., Mayberry, F., Mural, R., Nodell, M., Yandell, M., Zhang, J., et al. (2002) Nucleic Acids Res. 30, 129–136.
9. Okazaki, Y., Furuno, M., Kasukawa, T., Adachi, J., Bono, H., Kondo, S., Nikaido, I., Osato, N., Saito, R., Suzuki, H., et al. (2002) Nature 420, 563–573.
10. Kent, W. J. (2002) Genome Res. 12, 656–664.
11. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997) Nucleic Acids Res. 25, 3389–3402.
12. Florea, L., Hartzell, G., Zhang, Z., Rubin, G. M. & Miller, W. (1998) Genome Res. 8, 967–974.
13. Kanapin, A., Batalov, S., Davis, M. J., Gough, J., Grimmond, S., Kawaji, H., Magrane, M., Matsuda, H., Schonbach, C., Teasdale, R. D., et al. (2003) Genome Res. 13, 1335–1344.
14. Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., et al. (1996) Nat. Biotechnol. 14, 1675–1680.
15. Hubbell, E., Liu, W. M. & Mei, R. (2002) Bioinformatics 18, 1585–1592.
16. Edgar, R., Domrachev, M. & Lash, A. E. (2002) Nucleic Acids Res. 30, 207–210.
17. Kent, W. J., Baertsch, R., Hinrichs, A., Miller, W. & Haussler, D. (2003) Proc. Natl. Acad. Sci. USA 100, 11484–11489.
18. Bishop, J. O., Morton, J. G., Rosbash, M. & Richardson, M. (1974) Nature 250, 199–204.
19. Hastie, N. D. & Bishop, J. O. (1976) Cell 9, 761–774.
20. Hogenesch, J. B., Ching, K. A., Batalov, S., Su, A. I., Walker, J. R., Zhou, Y., Kay, S. A., Schultz, P. G. & Cooke, M. P. (2001) Cell 106, 413–415.
21. Harmer, S. L., Hogenesch, J. B., Straume, M., Chang, H. S., Han, B., Zhu, T., Wang, X., Kreps, J. A. & Kay, S. A. (2000) Science 290, 2110–2113.
22. DeRisi, J. L., Iyer, V. R. & Brown, P. O. (1997) Science 278, 680–686.
23. Li, Q., Peterson, K. R., Fang, X. & Stamatoyannopoulos, G. (2002) Blood 100, 3077–3086.
24. Spellman, P. T. & Rubin, G. M. (2002) J. Biol. 1, 5.
25. Caron, H., van Schaik, B., van der Mee, M., Baas, F., Riggins, G., van Sluis, P., Hermus, M. C., van Asperen, R., Boon, K., Voute, P. A., et al. (2001) Science 291, 1289–1292.
26. Roy, P. J., Stuart, J. M., Lund, J. & Kim, S. K. (2002) Nature 418, 975–979.
27. Cohen, B. A., Mitra, R. D., Hughes, J. D. & Church, G. M. (2000) Nat. Genet. 26, 183–186.
28. Bell, A. C. & Felsenfeld, G. (2000) Nature 405, 482–485.
29. Hark, A. T., Schoenherr, C. J., Katz, D. J., Ingram, R. S., Levorse, J. M. & Tilghman, S. M. (2000) Nature 405, 486–489.
30. Thorvaldsen, J. L., Duran, K. L. & Bartolomei, M. S. (1998) Genes Dev. 12, 3693–3702.
31. Kim, J., Lu, X. & Stubbs, L. (1999) Hum. Mol. Genet. 8, 847–854.
32. Georges, M., Charlier, C. & Cockett, N. (2003) Trends Genet. 19, 248–252.
33. Hatada, I., Morita, S., Obata, Y., Sotomaru, Y., Shimoda, M. & Kono, T. (2001) J. Biochem. 130, 187–190.
34. Cavaille, J., Seitz, H., Paulsen, M., Ferguson-Smith, A. C. & Bachellerie, J. P. (2002) Hum. Mol. Genet. 11, 1527–1538.
35. Yevtodiyenko, A., Carr, M. S., Patel, N. & Schmidt, J. V. (2002) Mamm. Genome 13, 633–638.
36. Seitz, H., Youngson, N., Lin, S. P., Dalbert, S., Paulsen, M., Bachellerie, J. P., Ferguson-Smith, A. C. & Cavaille, J. (2003) Nat. Genet. 34, 261–262.
Su et al. PNAS | April 20, 2004 | vol. 101 | no. 16 | 6067 | GENETICS
... We started with the hypothesis that wild-type EVER proteins perform some anti-viral function in the context of immune responses to viral infection. Consistent with this hypothesis, high expression levels of EVER1 and EVER2 gene products have been found in multiple immune cells including, but not limited to, CD4+ and CD8+ T lymphocytes, B lympho cytes, and natural killer (NK) cells (3,40). Therefore, we expected that knocking out EVER protein expression would allow viral infection to occur and persist by bypassing normal immune surveillance. ...
Article
Epidermodysplasia verruciformis (EV) is a rare genetic skin disorder that is characterized by the development of papillomavirus-induced skin lesions that can progress to squamous cell carcinoma (SCC). Certain high-risk, cutaneous β-genus human papillomaviruses (β-HPVs), in particular HPV5 and HPV8, are associated with inducing EV in individuals who have a homozygous mutation in one of three genes tied to this disease: EVER1 , EVER2 , or CIB1. EVER1 and EVER2 are also known as TMC6 and TMC8, respectively. Little is known about the biochemical activities of EVER gene products or their roles in facilitating EV in conjunction with β-HPV infection. To investigate the potential effect of EVER genes on papillomavirus infection, we pursued in vivo infection studies by infecting Ever2 -null mice with mouse papillomavirus (MmuPV1). MmuPV1 shares characteristics with β-HPVs including similar genome organization, shared molecular activities of their early, E6 and E7, oncoproteins, the lack of a viral E5 gene, and the capacity to cause skin lesions that can progress to SCC. MmuPV1 infections were conducted both in the presence and absence of UVB irradiation, which is known to increase the risk of MmuPV1-induced pathogenesis. Infection with MmuPV1 induced skin lesions in both wild-type and Ever2 -null mice with and without UVB. Many lesions in both genotypes progressed to malignancy, and the disease severity did not differ between Ever2 -null and wild-type mice. However, somewhat surprisingly, lesion growth and viral transcription was decreased, and lesion regression was increased in Ever2 -null mice compared with wild-type mice. These studies demonstrate that Ever2 -null mice infected with MmuPV1 do not exhibit the same phenotype as human EV patients infected with β-HPVs. 
IMPORTANCE Humans with homozygous mutations in the EVER2 gene develop epidermodysplasia verruciformis (EV), a disease characterized by predisposition to persistent β-genus human papillomavirus (β-HPV) skin infections, which can progress to skin cancer. To investigate how EVER2 confers protection from papillomaviruses, we infected the skin of homozygous Ever2 -null mice with mouse papillomavirus MmuPV1. Like in humans with EV, infected Ever2 -null mice developed skin lesions that could progress to cancer. Unlike in humans with EV, lesions in these Ever2 -null mice grew more slowly and regressed more frequently than in wild-type mice. MmuPV1 transcription was higher in wild-type mice than in Ever2 -null mice, indicating that mouse EVER2 does not confer protection from papillomaviruses. These findings suggest that there are functional differences between MmuPV1 and β-HPVs and/or between mouse and human EVER2.
... Cluster 1 genes are characterized by a broad specter of biological actions and their clinical relevance for CVD is therefore not always easily appreciable. However, support for the pleiotropic function of cluster 1 genes exists since profiling the protein-encoding transcriptome of human smooth muscle [36] detected 64.7% of the genes. Moreover, among the 53 aBMDassociated genes in cluster 1, 28 were associated with CVD based on their GO biological process term or CVD-related phenotype of associated SNPs (Table S1). ...
Article
Full-text available
Epidemiological evidence suggests existing comorbidity between postmenopausal osteoporosis (OP) and cardiovascular disease (CVD), but identification of possible shared genes is lacking. The skeletal global transcriptomes were analyzed in trans-iliac bone biopsies (n = 84) from clinically well-characterized postmenopausal women (50 to 86 years) without clinical CVD using microchips and RNA sequencing. One thousand transcripts highly correlated with areal bone mineral density (aBMD) were further analyzed using bioinformatics, and common genes overlapping with CVD and associated biological mechanisms, pathways and functions were identified. Fifty genes (45 mRNAs, 5 miRNAs) were discovered with established roles in oxidative stress, inflammatory response, endothelial function, fibrosis, dyslipidemia and osteoblastogenesis/calcification. These pleiotropic genes with possible CVD comorbidity functions were also present in transcriptomes of microvascular endothelial cells and cardiomyocytes and were differentially expressed between healthy and osteoporotic women with fragility fractures. The results were supported by a genetic pleiotropy-informed conditional False Discovery Rate approach identifying any overlap in single nucleotide polymorphisms (SNPs) within several genes encoding aBMD- and CVD-associated transcripts. The study provides transcriptional and genomic evidence for genes of importance for both BMD regulation and CVD risk in a large collection of postmenopausal bone biopsies. Most of the transcripts identified in the CVD risk categories have no previously recognized roles in OP pathogenesis and provide novel avenues for exploring the mechanistic basis for the biological association between CVD and OP.
... Gene Wiki) which runs a bot that gathers gene information from research databases and uses it to generate gene stubs. On the database is a report page with free text, interactors, references, and a box with external links and clickable icons leading to a graphical display of microarray expression data gathered with numerous tissues, including gonads (Su et al 2004). ...
Article
Full-text available
Recent breakthroughs in the field of medicine and physiology have come through the application of bioinformatics and computational biology in experimental designs and in silico analyses. Genomics and proteomics-based strategies are currently used for data presentation, sequence analysis and alignment, primer designs, mapping and annotation of the entire human genome. This advancement has made it possible to identify the roles of specific genes and proteins, understand their physiological functions, as well as pathophysiology during mutation, malformation and diseased conditions. This review describes the available proteomic databases; essential proteins used as markers of fertility in male, annotation techniques and data projects on male reproductive physiology. The materials used in this descriptive review were searched using PubMed and Google scholar databases. The following terms and phrases were reviewed: 'genomics', 'proteomics', 'Interrelationship between bioinformatics and Life sciences', 'bioinformatics tools for analyzing male reproductive system', 'Male reproductive system functions', 'infertility', and 'Application of bioinformatics in male reproductive physiology research'. Analyses in proteomics and genomics have further expanded the understanding of male reproductive physiology research through different bioinformatics tools and databases. A better understanding of the mechanisms behind spermatogenesis, testicular gene expression, protein involvement in male reproduction, the discovery of cancer genes in reproductive organs, and possible markers to identify infertility in males have evolved. There is, therefore, a need for further application of bioinformatics in the study of male reproductive system with the introduction of more databases, better identification of cancer genes in male reproductive organs and male infertility's possible biomarkers.
... It is generally believed that the smaller the Ka/Ks ratio of homologous genes, the closer their expression profiles are. To authenticate the superiority of the improved algorithm, we obtained the expression profile data of orthologous genes for both human and mouse from Su et al.'s study [22]. This dataset comprises microarray expression profiles of 2275 orthologous gene-pairs across 27 corresponding tissues in human and mouse, including heart, kidney, lung, liver, cerebellum, testis and so on. ...
Preprint
Full-text available
The Non-synonymous/Synonymous substitution rate (Ka/Ks) ratio is a commonly used metric to estimate the selection pressure and evolutionary rate of proteins in comparative genomics, which plays critical roles in both biology and medicine. A fundamental assumption of Ka/Ks is that synonymous mutations are evolutionarily neutral and not subject to natural selection as they do not alter protein sequences and function. Recently, however, a number of studies have demonstrated that synonymous mutations are non-neutral and functional through a number of mechanisms, such as altering miRNA regulation. This further implies that synonymous mutations also participate in the process of natural selection and thus Ka/Ks should be improved as well. For this purpose, here we propose an improved Ka/Ks ratio, i Ka/Ks, which redefines the neutral mutation rate by considering the altered status of miRNA regulation of the synonymous mutations, and thereby incorporate the impact of synonymous mutations on miRNA regulation. As a result, i Ka/Ks shows better performance than Ka/Ks when comparing them using their correlation with expression distance in the human-mouse study. Moreover, case studies showed that i Ka/Ks is able to identify the positive/negative selective genes that are missed by Ka/Ks. For example, TMEM72/Tmem72 is estimated to be positively selected by i Ka/Ks (1.13) but negatively selected by the conventional Ka/Ks ratio (0.21). Further evidence showed its rapid evolution, which further support the power of the new algorithm.
... We conducted SNPsea analysis 45 utilizing the default configurations for single-nucleotide polymorphisms within each group of cell-type-specific peaks and the corresponding background peak set. Specifically, we evaluated the extent of tissue-specific expression enrichments in the profiles of 17,581 genes across various human tissues, using the Gene Atlas dataset 46 . ...
Article
Full-text available
The single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) technology provides insight into gene regulation and epigenetic heterogeneity at single-cell resolution, but cell annotation from scATAC-seq remains challenging due to high dimensionality and extreme sparsity within the data. Existing cell annotation methods mostly focus on the cell peak matrix without fully utilizing the underlying genomic sequence. Here we propose a method, SANGO, for accurate single-cell annotation by integrating genome sequences around the accessibility peaks within scATAC data. The genome sequences of peaks are encoded into low-dimensional embeddings, and then iteratively used to reconstruct the peak statistics of cells through a fully connected network. The learned weights are considered as regulatory modes to represent cells, and utilized to align the query cells and the annotated cells in the reference data through a graph transformer network for cell annotations. SANGO was demonstrated to consistently outperform competing methods on 55 paired scATAC-seq datasets across samples, platforms and tissues. SANGO was also shown to be able to detect unknown tumor cells through attention edge weights learned by the graph transformer. Moreover, from the annotated cells, we found cell-type-specific peaks that provide functional insights/biological signals through expression enrichment analysis, cis-regulatory chromatin interaction analysis and motif enrichment analysis.
Article
Full-text available
Single-cell epigenomic data has been growing continuously at an unprecedented pace, but their characteristics such as high dimensionality and sparsity pose substantial challenges to downstream analysis. Although deep learning models—especially variational autoencoders—have been widely used to capture low-dimensional feature embeddings, the prevalent Gaussian assumption somewhat disagrees with real data, and these models tend to struggle to incorporate reference information from abundant cell atlases. Here we propose CASTLE, a deep generative model based on the vector-quantized variational autoencoder framework to extract discrete latent embeddings that interpretably characterize single-cell chromatin accessibility sequencing data. We validate the performance and robustness of CASTLE for accurate cell-type identification and reasonable visualization compared with state-of-the-art methods. We demonstrate the advantages of CASTLE for effective incorporation of existing massive reference datasets in a weakly supervised or supervised manner. We further demonstrate CASTLE’s capacity for intuitively distilling cell-type-specific feature spectra that unveil cell heterogeneity and biological implications quantitatively.
Article
Tripartite motif (TRIM) family members participate in a variety of cellular activities, such as intracellular signaling, development, cellular death, protein quality control, immunological defense, waste degradation, and the emergence of cancer. These proteins usually act as E3 ubiquitin ligase. The final line of resistance against infectious viruses is a cytosolic ubiquitin ligase and antibody receptor called TRIM containing 21. TRIM21, a protein with a tripartite structure, has been linked to autoimmune erythematosus, Sjogren’s disorder, and innate immunity. TRIM21 may either promote the formation of specific cancer-activating proteins, resulting in their proteasomal degradation, or it may do neither, depending on the kind of cancer and cancer-causing trigger. The current research has shown that the antiviral action of TRIM mostly depends on their role as E3-ubiquitin ligases and a significant portion of the TRIM family mediates the transmission of innate immune cell signals and the subsequent production of cytokines. We highlighted the function of TRIM family members in various inflammatory diseases.
Article
Bone secretory proteins, termed osteokines, regulate bone metabolism and whole-body homeostasis. However, fundamental questions as to what the bona fide osteokines and their cellular sources are and how they are regulated remain unclear. In this study, we analyzed bone and extraskeletal tissues, osteoblast (OB) conditioned media, bone marrow supernatant (BMS), and serum, for basal osteokines and those responsive to aging and mechanical loading/unloading. We identified 375 candidate osteokines and their changes in response to aging and mechanical dynamics by integrating data from RNA-seq, scRNA-seq, and proteomic approaches. Furthermore, we analyzed their cellular sources in the bone and inter-organ communication facilitated by them (bone-brain, liver, and aorta). Notably, we discovered that senescent OBs secrete fatty-acid-binding protein 3 to propagate senescence toward vascular smooth muscle cells (VSMCs). Taken together, we identified previously unknown candidate osteokines and established a dynamic regulatory network among them, thus providing valuable resources to further investigate their systemic roles.
Article
Full-text available
Ore mineral and host lithologies have been sampled with 89 oriented samples from 14 sites in the Naica District, northern Mexico. Magnetic parameters permit to charac- terise samples: saturation magnetization, density, low- high-temperature magnetic sus- ceptibility, remanence intensity, Koenigsberger ratio, Curie temperature and hystere- sis parameters. Rock magnetic properties are controlled by variations in titanomag- netite content and hydrothermal alteration. Post-mineralization hydrothermal alter- ation seems the major event that affected the minerals and magnetic properties. Curie temperatures are characteristic of titanomagnetites or titanomaghemites. Hysteresis parameters indicate that most samples have pseudo-single domain (PSD) magnetic grains. Alternating filed (AF) demagnetization and isothermal remanence (IRM) ac- quisition both indicate that natural and laboratory remanences are carried by MD-PSD spinels in the host rocks. The trend of NRM intensity vs susceptibility suggests that the carrier of remanent and induced magnetization is the same in all cases (spinels). The Koenigsberger ratio range from 0.05 to 34.04, indicating the presence of MD and PSD magnetic grains. Constraints on the geometry of the intrusive source body devel- oped in the model of the magnetic anomaly are obtained by quantifying the relative contributions of induced and remanent magnetization components.
Article
Full-text available
Only a small proportion of the mouse genome is transcribed into mature messenger RNA transcripts. There is an international collaborative effort to identify all full-length mRNA transcripts from the mouse, and to ensure thaZhut each is represented in a physical collection of clones. Here we report the manual annotation of 60,770 full-length mouse complementary DNA sequences. These are clustered into 33,409 'transcriptional units', contributing 90.1% of a newly established mouse transcriptome database. Of these transcriptional units, 4,258 are new protein-coding and 11,665 are new non-coding messages, indicating that non-coding RNA is a major component of the transcriptome. 41% of all transcriptional units showed evidence of alternative splicing. In protein-coding transcripts, 79% of splice variations altered the protein product. Whole-transcriptome analyses resulted in the identification of 2,431 sense-antisense pairs. The present work, completely supported by physical clones, provides the most comprehensive survey of a mammalian transcriptome so far, and is a valuable resource for functional genomics.
Article
Full-text available
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSIBLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
Article
A human imprinted domain at 14q32contains two co-expressed and reciprocally imprinted genes, DLK1 and GTL2, which are expressed from the paternally and maternally inherited alleles, respectively. We have previously shown that another imprinted locus, on human 15q11–q13, contains a large number of tandemly repeated C/D small nucleolar RNA genes (or C/D snoRNAs) only expressed from the paternal allele. Here we show that the region downstream from the GTL2 gene is also characterized by a transcription unit spanning many repeated intron-encoded C/D snoRNA genes, most of them arranged into two tandem arrays of 31 and 9 copies. Intriguingly, these snoRNAs depart from previously reported rRNA or snRNA methylation guides by their tissue-specific expression and by their lack of complementarity to rRNA or snRNA within their sequences. Analysis of the orthologous region in the mouse shows that the previously reported maternally expressed Rian gene, located downstream of Gtl2 on the distal 12chromosome, encodes at least nine C/D snoRNAs. Through a systematic search in rodents, we could identify other C/D snoRNA genes in this domain. All snoRNAs identified on mouse distal 12are brain-specific and only expressed from the maternally inherited allele. The human imprinted 14q32domain therefore shares common genomic features with the imprinted 15q11–q13 loci. This link between tandemly repeated C/D snoRNA genes and genomic imprinting suggests a role for these snoRNAs and/or their host non-coding RNA genes in the evolution and/or mechanism of the epigenetic imprinting process.
Article
Analyzing vertebrate genomes requires rapid mRNA/DNA and cross-species protein alignments. A new tool, BLAT, is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences. BLAT's speed stems from an index of all nonoverlapping K-mers in the genome. This index fits inside the RAM of inexpensive computers, and need only be computed once for each genome assembly. BLAT has several major stages. It uses the index to find regions in the genome likely to be homologous to the query sequence. It performs an alignment between homologous regions. It stitches together these aligned regions (often exons) into larger alignments (typically genes). Finally, BLAT revisits small internal exons possibly missed at the first stage and adjusts large gap boundaries that have canonical splice sites where feasible. This paper describes how BLAT was optimized. Effects on speed and sensitivity are explored for various K-mer sizes, mismatch schemes, and number of required index matches. BLAT is compared with other alignment programs on various test sets and then used in several genome-wide applications. http://genome.ucsc.edu hosts a web-basedBLAT server for the human genome.
Article
The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.
Article
Using the technique of mRNA-cDNA hybridization, we have shown that there are between 11,500 and 12,500 different mRNAs in three different mouse tissues: kidneys, brains, and livers. Several experiments suggest that in each tissue the mRNAs are organized into three abundance classes rather than as a continuum with respect to concentration. Cross-hybridization experiments show that the most abundant class of mRNA in each tissue is characteristic, and that a high proportion of the total sequences are common between tissues. For a more complete analysis, cDNA was fractionated into three classes. Studies using isolated abundant cDNA show that some abundant sequences of liver and kidney are present in other tissues, but among the lower frequency classes. Thus tissue-specific differences in mRNA populations may be related to abundance as well as qualitative differences. Using isolated middle frequency cDNA of the kidney, it was shown that of the 550 or so sequences in this class, approximately 500 are shared with the liver. Similarly, between 9,500 and 10,500 of the low frequency kidney cDNAs are shared with the brain and liver, respectively, suggesting that the majority of mRNAs may be involved with "housekeeping" activities. In an attempt to see whether abundance of mRNA is related to repetition of the sequence in the genome, it was shown that abundant and middle frequency cDNA of the liver and kidney contain a component that anneals with DNA repeated approximately 100 fold. However, the low frequency cDNA of the kidney contains no repeated sequences.
Article
Approximately 35,000 different poly(A)-containing RNA sequences are present in HeLa cell cytoplasm. The sequences are grouped in three distinct abundance classes.
Article
The Janus family of tyrosine kinases (JAK) plays an essential role in development and in coupling cytokine receptors to downstream intracellular signaling events. A t(9;12)(p24;p13) chromosomal translocation in a T cell childhood acute lymphoblastic leukemia patient was characterized and shown to fuse the 3′ portion of JAK2 to the 5′ region of TEL, a gene encoding a member of the ETS transcription factor family. The TEL-JAK2 fusion protein includes the catalytic domain of JAK2 and the TEL-specific oligomerization domain. TEL-induced oligomerization of TEL-JAK2 resulted in the constitutive activation of its tyrosine kinase activity and conferred cytokine-independent proliferation to the interleukin-3–dependent Ba/F3 hematopoietic cell line.