Systems-Level Analysis of Genome-Wide Association Data

INVESTIGATION
Charles R. Farber¹
Center for Public Health Genomics, Departments of Medicine (Division of Cardiology) and Biochemistry and Molecular
Genetics, University of Virginia, Charlottesville, Virginia 22908
ABSTRACT Genome-wide association studies (GWAS) have emerged as the method of choice for
identifying common variants affecting complex disease. In a GWAS, particular attention is placed, for
obvious reasons, on single-nucleotide polymorphisms (SNPs) that exceed stringent genome-wide significance
thresholds. However, it is expected that many SNPs with only nominal evidence of association (e.g., P < 0.05)
truly influence disease. Efforts to extract additional biological information from entire GWAS datasets have
primarily focused on pathway-enrichment analyses. However, these methods suffer from a number of
limitations and typically fail to lead to testable hypotheses. To evaluate alternative approaches, we performed
a systems-level analysis of GWAS data using weighted gene coexpression network analysis. A weighted gene
coexpression network was generated for 1918 genes harboring SNPs that displayed nominal evidence of
association (P ≤ 0.05) from a GWAS of bone mineral density (BMD) using microarray data on circulating
monocytes isolated from individuals with extremely low or high BMD. Thirteen distinct gene modules were
identified, each comprising coexpressed and highly interconnected GWAS genes. Through the
characterization of module content and topology, we illustrate how network analysis can be used to discover
disease-associated subnetworks and characterize novel interactions for genes with a known role in the
regulation of BMD. In addition, we provide evidence that network metrics can be used as a prioritizing tool
when selecting genes and SNPs for replication studies. Our results highlight the advantages of using
systems-level strategies to add value to and inform GWAS.
KEYWORDS
genome-wide association study (GWAS)
systems biology
coexpression network
osteoporosis
Genome-wide association studies (GWAS) have revolutionized complex
disease genetics. In just the last few years, GWAS have been used
to identify hundreds of variants affecting a diverse range of common
diseases and disease-associated quantitative traits (for a summary, see
http://www.genome.gov/gwastudies/). Although GWAS have proven
extremely effective at identifying common variants with relatively
large effects, the first wave of data suggests that for many diseases,
this class of variation accounts for only a small fraction of the genetic
risk. For example, a large-scale meta-analysis of ~32,000 individuals
identified 56 loci associated with bone mineral density (BMD),
a strong predictor of osteoporotic fracture. However, in aggregate
these single-nucleotide polymorphisms (SNPs) explained only 5.8%
of the variance in femoral neck BMD (Estrada et al. 2012).
It is possible that for most diseases, the missing heritability is
attributable to a combination of many more common variants with
increasingly smaller effect sizes and rare variants, both of which are
difficult to detect with GWAS in its current form (Altshuler et al.
2008). It has been suggested that additional genes and biological
mechanisms underlying a disease process could be extracted from
GWAS data by searching lists of genes harboring nominally significant
(e.g., P < 0.05) associations. Most of the initial attempts to
identify such pathways have used gene ontology (GO) and
pathway-enrichment tools to compare the number of genes in a specific
pathway harboring nominally significant SNPs to the number
expected at random. This approach has been applied to several GWAS
datasets with varying results (Askland et al. 2009; Baranzini et al.
2009; Elbers et al. 2009a; O'Dushlaine et al. 2009; Peng et al. 2010;
Ritchie 2009; Torkamani and Schork 2009; Torkamani et al. 2008;
Wang et al. 2007).
Copyright © 2013 Farber
doi: 10.1534/g3.112.004788
Manuscript received October 24, 2012; accepted for publication November 20, 2012
This is an open-access article distributed under the terms of the Creative
Commons Attribution Unported License (http://creativecommons.org/licenses/by/3.0/),
which permits unrestricted use, distribution, and reproduction in any
medium, provided the original work is properly cited.
Supporting information is available online at
http://www.g3journal.org/lookup/suppl/doi:10.1534/g3.112.004788/-/DC1
Gene expression microarray data in this article have been submitted to the GEO
database at NCBI as series GSE7158. Summarized (P-values) genome-wide
association data are available at
http://content.nejm.org/cgi/content/full/NEJMoa0801197/DC1.
¹Address for correspondence: Center for Public Health Genomics, P.O. Box
800717, University of Virginia, Charlottesville, VA 22908. E-mail: crf2s@virginia.edu
Volume 3 | January 2013 | 119
Several issues complicate pathway analysis. First, enrichment
results can vary widely across software tools (Elbers et al. 2009b).
Second, enrichment analyses are biased toward what we already
know concerning pathway membership, and most predefined gene
categories are very general in nature, making it more difficult to
develop testable hypotheses with the goal of investigating specific
disease mechanisms. Third, these strategies fail to provide information
on the relationships between associated genes. Such information
is critical to understanding how networks of polymorphic genes
work together to promote or provide protection against disease.
Recently, Baranzini et al. (2009) used protein–protein interaction
data to address this latter point by identifying interacting partners
that were nominally associated with multiple sclerosis. However,
missing from this approach was the ability to incorporate network
concepts with clinical information. The specific goal of this study
was to address these issues.
Weighted gene coexpression network analysis (WGCNA) is
a widely used analytical method that identifies functional connections
between genes using microarray gene expression data (Chen et al.
2008; Gargalovic et al. 2006; Ghazalpour et al. 2006; Horvath et al.
2006; Oldham et al. 2008; van Nas et al. 2009; Winden et al. 2009).
WGCNA groups genes into modules on the basis of their coexpression
similarities across a population of samples. The resulting modules
have been shown to be composed of genes that share similar functions
or are involved in the same pathway [as examples: (Ghazalpour et al.
2006; Horvath et al. 2006; Oldham et al. 2008; van Nas et al. 2009)].
The advantage of WGCNA is that connections between genes can be
established in an unbiased manner using disease-relevant expression
data.
In the present work we used WGCNA to perform a systems-level
analysis of GWAS data. The analysis was performed by combining
SNP-level association data from a large BMD GWAS with microarray
expression data from a disease-relevant cell type from subjects with
known BMD status (low vs. high). Using WGCNA, we identified
modules composed of genes that were highly interconnected with
one another and displayed nominal evidence of association with
BMD. Through the characterization of module content and topology,
our approach identified biological mechanisms, modules, individual
genes, and network concepts that likely play an important role in the
regulation of BMD.
Figure 1 Overview of the systems-level analysis of GWAS data.
MATERIALS AND METHODS
Converting SNP lists to gene lists using ProxyGeneLD
Several caveats complicate the conversion of a list of SNPs with
association P-values into gene-wide P-values using
raw GWAS data. The primary confounders are linkage disequilibrium
(LD) and biases due to gene size and the number of SNPs typed per
gene. LD makes gene identification difficult because many nominally
significant SNPs will be in LD with multiple genes. In addition, larger
genes and genes with a greater density of SNPs typed have an
increased probability of harboring nominally significant SNPs just by
chance. Recently, Hong et al. (2009) developed an algorithm (referred to
as ProxyGeneLD) that reduces biases by accounting for LD when
annotating genes. ProxyGeneLD works by identifying clusters of
GWAS SNPs (referred to as “proxy clusters”) in high LD (r² ≥ 0.80)
using HapMap data. It then assigns proxy clusters and singleton SNPs
(those that did not group within a proxy cluster) to the nearest gene.
The unadjusted gene-wide P-value is then calculated as the minimum
P-value of any SNP assigned to the gene, whether a singleton or a member
of a proxy cluster. P-value adjustments are made by multiplying the
unadjusted P-value by the number of SNPs assigned to that gene.
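The gene-wide P-value calculation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (minimum P-value per gene, multiplied by the number of assigned SNPs) and not the ProxyGeneLD implementation. The function name, the input layout, the gene labels, and the cap of adjusted P-values at 1 are assumptions made for the example; each input pair is treated as one assigned SNP.

```python
from collections import defaultdict

def gene_wide_p_values(assignments):
    """Return {gene: (unadjusted_p, adjusted_p)}.

    `assignments` is an iterable of (gene, snp_p_value) pairs, one pair per
    SNP assigned to the gene (hypothetical input layout).
    """
    by_gene = defaultdict(list)
    for gene, p_value in assignments:
        by_gene[gene].append(p_value)

    result = {}
    for gene, p_values in by_gene.items():
        unadjusted = min(p_values)                # best SNP assigned to the gene
        adjusted = unadjusted * len(p_values)     # multiply by number of assigned SNPs
        result[gene] = (unadjusted, min(1.0, adjusted))  # cap at 1 is an assumption
    return result

# Hypothetical example: GENE_A has three assigned SNPs, GENE_B has one.
example = [("GENE_A", 0.004), ("GENE_A", 0.20), ("GENE_A", 0.61), ("GENE_B", 0.03)]
pvals = gene_wide_p_values(example)
nominal = sorted(g for g, (_, adj) in pvals.items() if adj <= 0.05)
```

In this toy input both genes would remain nominally significant after adjustment, although GENE_A is penalized for carrying three SNPs.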
We used precomputed P-values from a recently published GWAS
performed by deCODE (Styrkarsdottir et al. 2008). These data are
available for download from
http://content.nejm.org/cgi/content/full/NEJMoa0801197/DC1
as individual text files. The GWAS consisted of
5,861 Icelandic subjects phenotyped for hip (HBMD) and spine
(SBMD) BMD and genotyped at 301,019 SNPs (Styrkarsdottir et al.
2008). All SNPs for both traits were annotated using ProxyGeneLD.
LD patterns were determined using CEU HapMap samples and genes
were defined as the transcript plus a 1-kbp extension upstream to
include promoter regions. P-values were assigned to a total of
16,878 genes. Genes with an adjusted P ≤ 0.05 for at least one of
the two BMD traits were referred to as the nominally significant
GWAS geneset (NSGG).
GO and pathway-enrichment analysis
We performed GO and pathway-enrichment analysis for the NSGG
and network modules by using the Database for Annotation,
Visualization and Integrated Discovery [DAVID (Dennis et al. 2003;
Huang da et al. 2009)]. Each analysis was performed using the
functional annotation charting and functional annotation clustering
options. Functional annotation charting tests each individual GO or
pathway term for enrichment. In contrast, functional annotation
clustering combines single categories with a significant overlap in gene
content and then assigns an enrichment score (ES; defined as the
−log10 of the geometric mean of the P-values for each single term
in the cluster) to each cluster, making interpretation of the results
more straightforward. Functional annotation clustering cannot be
performed for more than 3000 genes. Because the NSGG contained 3083
genes, we used the top 3000 ranked on adjusted P-value for the
analysis. The search was limited to KEGG and Biocarta pathways, PFAM
protein domains, and GO terms in the “Molecular Function,”
“Biological Process,” and “Cellular Component” categories. Single categories
were considered significantly enriched at a false discovery rate
(FDR) ≤ 5%. To assess the significance of functional clusters, we
created 10 sets of 3000 genes randomly selected from the aforementioned
list of 16,878 genes with assigned P-values. Functional annotation
clustering was performed for all 10 random gene sets. The maximum
random ES was 2.75. Therefore, we used an ES cutoff of ≥ 3.0 as the
threshold for significance in all analyses.
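As a small worked illustration of the enrichment score defined above, the following Python sketch computes the −log10 of the geometric mean of a cluster's P-values and applies the ES ≥ 3.0 cutoff. The function names are hypothetical, and DAVID itself is not called; the sketch only reproduces the arithmetic of the definition.

```python
import math

ES_CUTOFF = 3.0  # significance threshold for clusters used in this study

def enrichment_score(p_values):
    """-log10 of the geometric mean of the P-values of the terms in a cluster."""
    if not p_values or any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")
    # The log of a geometric mean equals the arithmetic mean of the logs.
    mean_log10 = sum(math.log10(p) for p in p_values) / len(p_values)
    return -mean_log10

def is_significant_cluster(p_values, cutoff=ES_CUTOFF):
    """True when the cluster's ES meets the cutoff."""
    return enrichment_score(p_values) >= cutoff

# A cluster whose terms all have P = 0.05 has ES of about 1.3, which is why an
# ES of 1.3 corresponds to a nominal P of 0.05.
es_nominal = enrichment_score([0.05, 0.05, 0.05])
```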
Gene expression data processing
To generate gene coexpression networks we used previously published
microarray data from 26 healthy Chinese females ages 20–45 yr, with
a mean age of 27.3 yr (Lei et al. 2009). In this study expression profiles
were generated from circulating monocytes that were isolated and purified
from subjects with low (n = 12) and high (n = 14) BMD. We
downloaded the Affymetrix CEL files from the National Center for
Biotechnology Information (NCBI)'s Gene Expression Omnibus (GSE7158;
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7158). The raw
data were imported and processed using the affy package (Gautier
et al. 2004) for the R Language and Environment for Statistical
Computing (Ihaka and Gentleman 1996). The robust multiarray average (RMA)
algorithm was used to normalize and generate probe-level expression data
(Irizarry et al. 2003).
WGCNA
Network analysis was performed using the WGCNA R package
(Langfelder and Horvath 2008). An extensive overview of WGCNA,
including numerous tutorials, can be found at
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/.
To begin, we identified all probes assaying the expression of NSGG genes.
To eliminate noise due to genes that were not expressed, we selected NSGG
probes whose levels exceeded the median level of expression across the entire array.
As part of our quality control, we performed clustering and principal
components analyses based on the expression of these probes. Two
samples from the high BMD group, GSM172405 and GSM172418,
were significant outliers and were removed from the analysis. A
preliminary calculation of network connectivity was used to identify the
most connected probe for each gene. A WGCNA network for the
selected probes was generated exactly as described in Farber (2010).
Gene Significance (GS) for each network gene was defined as the
absolute value of its Pearson correlation with BMD status. Module
Membership (MM) was calculated as the Pearson correlation between
each gene's expression and its module eigengene, calculated using
Singular Value Decomposition (Alter et al. 2000). Network depictions
were constructed using Cytoscape (Shannon et al. 2003).
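The two gene-level metrics defined above, GS and MM, can be illustrated with a dependency-free Python sketch. It is not the WGCNA code used in the study. The toy expression values, the 0/1 coding of BMD status, and the helper names are assumptions, and power iteration is used here in place of a full Singular Value Decomposition to obtain the leading singular vector that serves as the module eigengene.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def standardize(row):
    """Center a gene's expression to mean 0 and scale to unit sample variance."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (n - 1))
    return [(v - mean) / sd for v in row]

def module_eigengene(expr, n_iter=200):
    """Leading right singular vector of the standardized genes-x-samples matrix.

    Power iteration on X^T X stands in for a full SVD (assumption made to
    keep the sketch free of third-party libraries).
    """
    z = [standardize(row) for row in expr]
    n_samples = len(z[0])
    v = list(z[0])  # arbitrary non-zero starting vector
    for _ in range(n_iter):
        scores = [sum(r[j] * v[j] for j in range(n_samples)) for r in z]  # X v
        w = [sum(z[i][j] * scores[i] for i in range(len(z)))
             for j in range(n_samples)]                                   # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Orient the eigengene so that it tracks average module expression.
    avg = [sum(r[j] for r in z) / len(z) for j in range(n_samples)]
    if sum(a * b for a, b in zip(avg, v)) < 0:
        v = [-c for c in v]
    return v

# Toy data (hypothetical): 3 genes x 6 samples; status 0 = low BMD, 1 = high BMD.
expr = [
    [5.0, 5.2, 4.9, 6.1, 6.3, 6.0],
    [7.1, 7.0, 7.3, 8.0, 8.4, 8.1],
    [3.0, 3.4, 2.9, 3.1, 3.3, 3.2],
]
status = [0, 0, 0, 1, 1, 1]

eigengene = module_eigengene(expr)
gs = [abs(pearson(row, status)) for row in expr]   # Gene Significance
mm = [pearson(row, eigengene) for row in expr]     # Module Membership
```

In the toy data the first two genes track each other and BMD status closely, so they receive high GS and MM values, whereas the third gene fits the module less tightly.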
Table 1 Gene category and pathway enrichment analysis of NSGG genes
Functional Group   Top GO Term                               Top Term FDR   ESᵃ
1                  GO:0044424~intracellular part             9.5 × 10⁻⁷     6.3
2                  GO:0046872~metal ion binding              1.3 × 10⁻⁴     5.9
3                  GO:0032502~developmental process          9.5 × 10⁻⁵     5.6
4                  GO:0044446~intracellular organelle part   5.5 × 10⁻²     4.2
5                  GO:0019866~organelle inner membrane       2.1 × 10⁻¹     3.3
ᵃ ES, enrichment score defined as the −log10 (geometric mean of the uncorrected P-values for all single categories) for each DAVID cluster.
In silico replication
To compare replication success rates in hubs and genes with the
highest GWAS P-values, we used data from a second GWAS, the
Framingham Osteoporosis Study [FOS (Kiel et al. 2007)]. The FOS
GWAS consisted of 1141 subjects genotyped at ~100,000 SNPs. We
downloaded the association data [in the form of SNPs and precomputed
P-values generated using generalized estimating equation models
(Kiel et al. 2007)] for three BMD traits (femoral neck, lumbar
spine, and trochanter) from the database of Genotype and Phenotype
at NCBI (http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gap). SNP
lists for each of the three traits were converted to gene lists using
ProxyGeneLD precisely as described previously. A gene was considered
successfully replicated if it had an unadjusted P ≤ 0.05 for at least
one of the three BMD traits. The percentage of successfully replicated
genes was calculated in the blue, magenta, greenyellow, and brown
modules for the top 20%, 10%, and 5% of genes based on intramodular
connectivity (k.in). These rates were compared with those for the top
20%, 10%, and 5% of GWAS network genes selected based on adjusted
P-value from the deCODE (Styrkarsdottir et al. 2008) GWAS or GS.
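The replication comparison described above amounts to ranking genes by a metric, taking the top fraction, and counting how many have P ≤ 0.05 for at least one trait in the second GWAS. A minimal Python sketch follows. The gene labels and values are invented for illustration, and two choices are assumptions not stated in the text: the top fraction is truncated to a whole number of genes, and genes absent from the replication data are excluded from the denominator.

```python
def top_fraction(scores, fraction, largest_first=True):
    """Gene names in the top `fraction` of `scores` (dict: gene -> score)."""
    ranked = sorted(scores, key=scores.get, reverse=largest_first)
    n_top = max(1, int(len(ranked) * fraction))  # truncation is an assumption
    return ranked[:n_top]

def replication_rate(genes, replication_p, threshold=0.05):
    """Fraction of `genes` with P <= threshold for at least one trait.

    `replication_p` maps gene -> list of unadjusted P-values, one per trait.
    Genes missing from the replication data are skipped (assumption).
    """
    tested = [g for g in genes if g in replication_p]
    if not tested:
        return float("nan")
    hits = sum(1 for g in tested if min(replication_p[g]) <= threshold)
    return hits / len(tested)

# Hypothetical module of 10 genes with intramodular connectivity (k.in).
k_in = {"g1": 9.0, "g2": 8.0, "g3": 7.0, "g4": 6.0, "g5": 5.0,
        "g6": 4.0, "g7": 3.0, "g8": 2.0, "g9": 1.0, "g10": 0.5}
# Hypothetical replication P-values for three BMD traits.
fos_p = {"g1": [0.01, 0.40, 0.70], "g2": [0.30, 0.20, 0.90],
         "g3": [0.04, 0.03, 0.50], "g4": [0.60, 0.80, 0.10]}

hubs = top_fraction(k_in, 0.20)            # top 20% by k.in
hub_rate = replication_rate(hubs, fos_p)
```

Selecting genes by adjusted GWAS P-value instead of k.in would use the same helper with `largest_first=False`.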
RESULTS
Identifying genes with nominally significant genome-wide
associations
An overview of the systems-level analysis of GWAS data is presented
in Figure 1. The first step in the analysis was the identification of genes
displaying evidence of association using data from a BMD GWAS
[n = 5861 (Styrkarsdottir et al. 2008)]. We used the ProxyGeneLD
algorithm (Hong et al. 2009), which takes LD patterns into account
when assigning SNPs to genes and adjusts for gene length and SNP
density biases (see Materials and Methods), to generate gene-wide
adjusted P-values for two osteoporosis-related traits, HBMD and
SBMD. Gene-wide P-values were calculated for a total of 16,878 genes.
Of these, 1777 and 1861 had gene-wide adjusted P ≤ 0.05 for HBMD
and SBMD, respectively. By combining the two lists, 3083 unique genes
were identified with adjusted P ≤ 0.05 for at least one of the BMD
traits. We refer to these genes as the NSGG.
To determine whether gene length and SNP density were potential
confounders in the NSGG, we calculated the correlation between these
two variables and the HBMD unadjusted (defined as the minimum
P-value for proxy clusters and single SNPs assigned to a particular
gene) and adjusted P-values. As described previously, 1777 genes
had adjusted P ≤ 0.05. In contrast, 5228 genes had unadjusted
P ≤ 0.05. In the latter gene set, we observed a strong correlation
between unadjusted P and gene length (r = 0.46, P = 0) and SNP
density (r = 0.50, P = 0). However, these correlations were not observed
after adjustment, for either gene length (r = −0.01, P = 0.88) or SNP density
(r = −0.01, P = 0.74). Thus, our network analysis of GWAS genes
should not be influenced by these systematic biases.
Conventional pathway enrichment fails to pinpoint
specific biological mechanisms
We next determined whether the NSGG was enriched for biological
“themes” using the conventional approach of GO and pathway enrichment
analysis. DAVID (Dennis et al. 2003; Huang da et al. 2009) was
used for this analysis, although we also used WebGestalt (Zhang et al.
2005) and observed similar results. A total of 24 individual terms, all
of which were GO categories, were significantly enriched in the NSGG
at an FDR ≤ 5% (Supporting Information, File S1). The most significant
term was “protein binding” (GO:0005515; FDR = 1.7 × 10⁻¹⁰).
Other significant categories included “developmental process”
(GO:0032502; FDR = 9.5 × 10⁻⁵), “cation binding” (GO:0043169;
FDR = 2.5 × 10⁻³), and “cell differentiation” (GO:0030154; FDR =
2.7 × 10⁻²).
DAVID also generates category clusters by condensing sets of
related terms (Dennis et al. 2003; Huang da et al. 2009). This
condenses redundant categories, identifies terms containing a smaller
number of genes that on their own would require higher fold enrichments
to reach statistical significance, and makes interpreting the
results much easier. Each cluster receives an ES, which is defined as
the geometric mean (on a −log10 scale) of the P-values for all single
terms in the cluster. A total of 32 clusters had ESs > 1.3 (equivalent to
a nominal P ≤ 0.05); however, it was unclear whether this was an
appropriate significance cutoff. To determine the distribution of ESs
observed using a set of random genes we created 10 sets of 3000 genes
randomly selected from the whole genome and ran each through
DAVID. ESs for the random gene sets ranged from 1.36 to 2.75.
Therefore, we selected an ES cutoff of ≥ 3.0. Using this threshold,
a total of five significant clusters were identified in the NSGG (Table
1 and File S2). The top GO terms in each of the five clusters were
“intracellular part,” “metal ion binding,” “developmental process,”
“intracellular organelle part,” and “organelle inner membrane.” These
data indicate that the NSGG is enriched for groups of genes sharing
similar functionality; however, because the identified categories are
very general in nature this analysis does little to pinpoint specific
biological mechanisms underlying variation in BMD.
Generation of a weighted gene coexpression network
for NSGG genes
WGCNA reveals connections between genes using microarray
expression data by grouping genes based on a topological overlap
measure [TOM (Dong and Horvath 2007; Zhang and Horvath 2005)].
Two genes have a high TOM if they are highly interconnected with
the same set of genes (Dong and Horvath 2007; Zhang and Horvath
2005). To evaluate the coexpression relationships between NSGG
genes in a disease-relevant context we used microarray expression
profiles of purified circulating monocytes isolated from individuals
with discordant levels of BMD (Lei et al. 2009). The dataset included
24 profiles from young (mean age = 27.3 years) Chinese females, 12
with low BMD (mean Z-score = −1.72) and 12 with high BMD (mean
Z-score = 1.57). We chose to use this dataset because it represents the
largest study performed to date with both expression profiles for
a cell type relevant to BMD [monocytes are precursors to
bone-resorbing osteoclasts (Fujikawa et al. 1996)] and clinical information
on the subjects. After excluding non- and lowly expressed genes we
identified probes representing 1918 (62%) of the 3083 NSGG genes
and applied the WGCNA algorithm to generate a GWAS network.
The resulting network was composed of 13 distinct gene modules
(Figure 2). Sixty-three of the genes failed to fit within a distinct group
and were assigned to the “grey” module. The modules ranged in size
from 40 (salmon module) to 356 genes (turquoise module). A complete
list of module assignments and network metrics for all genes is
included in File S3.
Figure 2 WGCNA coexpression network composed of BMD GWAS
genes. Shown is the hierarchical clustering dendrogram for all 1918
genes used in the analysis. Each line is an individual gene. Genes were
clustered based on a dissimilarity measure (1 − TOM). The branches
correspond to modules of highly interconnected groups of genes. The
tips of the branches represent genes that are the least dissimilar and
thus share the most similar network connections. Below the dendrogram
each gene is color coded to indicate its module assignment.
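For readers unfamiliar with the topological overlap measure, the Python sketch below shows how it can be computed from a correlation matrix, using the standard unsigned formulation of Zhang and Horvath (2005): adjacency a_ij = |cor_ij|^β, and TOM_ij = (Σ_u a_iu·a_uj + a_ij) / (min(k_i, k_j) + 1 − a_ij), where k_i is the summed adjacency of gene i. The soft-threshold power β = 6 and the toy correlation values are illustrative assumptions; the parameters actually used in this study are those described in Farber (2010).

```python
def adjacency_from_correlation(cor, beta=6):
    """Unsigned weighted adjacency: a_ij = |cor_ij| ** beta, with a zero diagonal."""
    n = len(cor)
    return [[0.0 if i == j else abs(cor[i][j]) ** beta for j in range(n)]
            for i in range(n)]

def topological_overlap(adj):
    """Topological overlap matrix for a symmetric adjacency with zero diagonal."""
    n = len(adj)
    k = [sum(row) for row in adj]            # connectivity of each gene
    tom = [[1.0] * n for _ in range(n)]      # a gene overlaps fully with itself
    for i in range(n):
        for j in range(i + 1, n):
            # Connection strength shared through every third gene u.
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            value = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            tom[i][j] = tom[j][i] = value
    return tom

# Toy correlation matrix (hypothetical): genes 0 and 1 are tightly coexpressed.
cor = [
    [1.00, 0.94, 0.46],
    [0.94, 1.00, 0.23],
    [0.46, 0.23, 1.00],
]
tom = topological_overlap(adjacency_from_correlation(cor))
dissimilarity = [[1.0 - t for t in row] for row in tom]  # input to clustering
```

The dissimilarity 1 − TOM is the quantity on which the hierarchical clustering in Figure 2 is based.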
The WGCNA approach has been used to generate robust
networks in several diverse applications (Chen et al. 2008; Gargalovic
et al. 2006; Ghazalpour et al. 2006; Horvath et al. 2006; Oldham et al.
2008; van Nas et al. 2009; Winden et al. 2009), including experiments
with a similar or smaller number of samples relative to this study
(Gargalovic et al. 2006; Gong et al. 2007). Most WGCNA analyses,
however, use a series of preliminary filtering steps to select the most
biologically meaningful genes for network construction (Ghazalpour
et al. 2006). In such studies, the expression data exclusively determine
which genes are used in the analysis. Because our network
genes were not selected entirely based on expression profiles, we
wanted to ensure that the resulting modules were cohesive and
robust. To test cohesiveness, we calculated the mean MM for each
module. MM is the correlation between each gene in a module and
its module eigengene. Thus, it is a measure of how tightly a particular
gene fits into its module. The greater the mean MM for a module,
the more similar the coexpression relationships are across the module.
The mean MM ± SEM ranged from 0.60 ± 0.01 (brown module)
to 0.74 ± 0.01 (tan module), indicating that modules consisted
of genes sharing highly similar expression patterns. We addressed
robustness, as described previously (Ghazalpour et al. 2006), by
randomly splitting the dataset in half 1000 times and calculating
k.in in each half. The analysis was performed for the largest
(turquoise) and smallest (salmon) modules. The mean correlation ±
SEM between the real and random k.in values was 0.65 ± 0.05
and 0.52 ± 0.03 in the turquoise and salmon modules, respectively.
Thus, the GWAS network modules are cohesive and robust to
exclusion of half the data.
Characterization of module content reveals a key role
for oxidative phosphorylation in the regulation of BMD
One way in which network analysis can inform GWAS is to expose
pathway enrichments that were not observed in a large set of
nominally significant genes, such as the NSGG. We expected that
by parsing genes based on coexpression similarities, more refined
functions would be condensed within modules, revealing enrichments
for more specific processes. This would improve the process of
converting a detectable enrichment into a testable hypothesis.
To determine whether specific modules were enriched for novel
gene categories or pathways we repeated the DAVID analysis for each
module. Of the 13 modules, five had at least one cluster with an ES ≥
3.0. Interestingly, the turquoise module stood out as displaying
detailed enrichments that were not observed in the analysis of the entire
NSGG (Table 2 and File S4). In the turquoise module, significant
enrichments were observed for six clusters with the following top
terms: “cytoplasmic part” (ES = 8.1), “mitochondrion” (ES = 7.3),
“electron carrier activity” (ES = 4.9), “electron carrier activity” (ES =
4.2), “hydro-lyase activity” (ES = 3.9), and “RNA splicing” (ES = 3.5).
Within each cluster there were a number of terms that were not
significant in the entire NSGG, suggesting that partitioning genes into
coexpression modules can reveal hidden enrichments.
Table 2 Network modules with significant DAVID enrichments
Module      Number of Genes   Top Term for Each Cluster                           Top Term FDR   ESᵃ
Pink        112               GO:0044446~intracellular organelle part             0.78           3.1
                              GO:0019538~protein metabolic process                6.0 × 10⁻²     3.0
Black       134               GO:0043231~intracellular membrane-bound organelle   2.0 × 10⁻²     3.0
Red         134               GO:0044429~mitochondrial part                       2.0 × 10⁻²     3.4
                              GO:0044446~intracellular organelle part             1.3 × 10⁻¹     3.1
Blue        297               GO:0043231~intracellular membrane-bound organelle   2.7 × 10⁻⁶     5.9
                              hsa00040:Pentose and glucuronate interconversions   2.6 × 10⁻⁶     4.1
                              GO:0005634~nucleus                                  5.9 × 10⁻⁶     3.7
Turquoise   356               GO:0044444~cytoplasmic part                         7.1 × 10⁻¹⁰    8.1
                              GO:0005739~mitochondrion                            6.0 × 10⁻⁸     7.3
                              GO:0009055~electron carrier activity                7.5 × 10⁻⁵     4.9
                              GO:0009055~electron carrier activity                7.5 × 10⁻⁵     4.2
                              GO:0016836~hydro-lyase activity                     2.0 × 10⁻²     3.9
                              GO:0008380~RNA splicing                             9.2 × 10⁻²     3.5
ᵃ ES, enrichment score defined as the −log10 (geometric mean of the uncorrected P-values for all single categories) for each DAVID cluster.
Table 3 Members of the turquoise module involved in oxidative phosphorylation
Gene       Unadjusted GWAS P-value   k.inᵃ rank   k.totalᵇ rank   rᶜ
NDUFB6     1.0 × 10⁻²                1            8               −0.10
COX5B      4.9 × 10⁻³                2            9               −0.22
COX8A      5.0 × 10⁻³                3            5               −0.22
COX7A2     4.2 × 10⁻³                6            22              −0.21
NDUFA13    7.8 × 10⁻³                9            27              −0.16
ATP5J2     9.0 × 10⁻⁴                14           54              −0.20
NDUFS7     3.2 × 10⁻²                15           60              −0.35
COX6B1     1.4 × 10⁻²                20           49              −0.25
ATP5G2     1.2 × 10⁻³                24           41              −0.13
NDUFB1     6.0 × 10⁻³                29           70              −0.08
NDUFA2     3.8 × 10⁻²                32           128             −0.39
NDUFA11    6.1 × 10⁻³                36           113             −0.14
COX17      1.0 × 10⁻²                54           199             −0.19
NDUFV2     8.0 × 10⁻⁴                55           111             −0.12
NDUFA7     8.0 × 10⁻³                69           252             −0.30
ATP6V1H    5.6 × 10⁻⁴                181          491             0.48
ᵃ k.in = Intramodule (the turquoise module) connectivity.
ᵇ k.total = Total network connectivity.
ᶜ r = Pearson correlation between expression of gene in monocytes and BMD status (low vs. high).
To investigate the enrichments in more detail, we focused on
a single enriched term in cluster 2, the KEGG pathway “oxidative
phosphorylation” (oxphos), because it represented one of the most
specific enriched terms. This single term was not enriched in the
NSGG (FDR = 99.8); however, its enrichment in the turquoise module
was significant (FDR = 1.1 × 10⁻³). Of the 356 turquoise module
genes, 16 (4.5%) were involved in oxphos (Table 3). To determine
whether this enrichment was specific to the GWAS network, we
generated 100 random networks. Each network was created by selecting
3083 genes at random and applying the same gene filtering steps and network
parameters used to construct the real network. A total of 114 of the
20,080 genes (0.6%) with unique gene identifiers on the array
belonged to the KEGG oxphos pathway. As shown above, 16 of the
356 turquoise module genes (4.5%) were involved in oxphos. Using
a Fisher's exact test this enrichment is highly significant (4.5% vs.
0.6%; P = 1.8 × 10⁻⁹). We then performed this same test for each
of 1709 modules belonging to the 100 random networks. None of the
random module enrichment P-values was smaller than the P-value for the real
turquoise module, indicating that this enrichment is specific to the
BMD GWAS network.
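The enrichment comparison above (16 of 356 module genes vs. 114 of 20,080 array genes) can be approximated with a hypergeometric upper-tail probability, which corresponds to a one-sided Fisher's exact test. The Python sketch below is an illustration only. The one-sided formulation and the way the 2 × 2 table is set up are assumptions, so its result need not match the reported P-value exactly.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which belong to the pathway (one-sided Fisher's exact test)."""
    total = comb(N, n)
    upper = min(n, K)
    favorable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return favorable / total  # exact integer arithmetic until the final division

# Counts taken from the text: 16 oxphos genes among 356 turquoise module genes,
# 114 oxphos genes among 20,080 genes with unique identifiers on the array.
p_turquoise = hypergeom_upper_tail(k=16, n=356, K=114, N=20080)
expected_by_chance = 356 * 114 / 20080  # roughly 2 oxphos genes expected
```

Applying the same function to every module of the random networks and comparing the resulting P-values with `p_turquoise` would mirror the specificity check described in the text.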
Oxphos genes were also among the most connected in both the
turquoise module and the whole network (Table 3). In fact, the three
most connected turquoise hubs were oxphos genes. In addition, of the
16 total genes, 15 were in the top 20% of genes when ranked on k.in
(Table 3). Another observation was that the expression of all 15 highly
connected oxphos genes was negatively correlated with BMD status
(Table 3). Thus, by exploring the content of the turquoise module, we
have identified an association between genetic variation in oxphos
genes and BMD, determined that oxphos genes are module and
network hubs, and determined that oxphos gene expression in monocytes
was inversely correlated with BMD levels.
Discovery of a turquoise submodule highly correlated
with BMD status
In addition to content, module topology (the unique distribution of edges among nodes) can also be evaluated in WGCNA networks. We investigated turquoise module topology by generating a network view showing all edges with a TOM ≥ 0.15 and their corresponding nodes (Figure 3). The network consisted of 88 nodes and 256 edges. An initial inspection indicated that most nodes were grouped into a central core (containing many of the oxphos genes identified previously in this article) with two small submodules radiating from COX5B, an oxphos gene and the second most connected node in the module. We then overlaid information regarding the correlation between each gene's expression and BMD status in the monocyte expression study. We suggest that correlation is a meaningful measure of biological significance, especially when considering GWAS genes, because it is likely that the correlations reflect subtle genetically regulated differences in expression that are associated with alterations in BMD. As shown in Figure 3, most of the genes were either not correlated (nodes shaded white) or slightly negatively correlated with BMD (nodes shaded light green). None of the genes were significantly positively correlated (the maximum correlation in the turquoise module was 0.10). Interestingly, the genes in one of the submodules were among the most negatively correlated (shaded dark green) in the turquoise module and the entire network (Table 4). One of the submodule genes, IFI35, was the second most negatively correlated (r = -0.58, P = 2.7 × 10⁻³) with BMD in the NSGG network, and 4 of the 8 genes in the submodule were in the top 50. The average correlation for this group was -0.42. To determine the probability of randomly observing a group of 8 genes this negatively correlated (Table 4), we created 10⁶ sets of 8 genes selected at random from the turquoise module. None of the random gene sets had an average correlation as extreme as that of the turquoise submodule (most negative average r = -0.36).
Figure 3 Network view of the turquoise module reveals a submodule of genes negatively correlated with BMD status. This network contains all turquoise module edges with TOM ≥ 0.15 and their corresponding nodes. Genes are shaded based on their correlation with BMD from white (no correlation) to dark green (strong negative correlation). Node sizes are proportional to each gene's -log10 GWAS P (most significant unadjusted GWAS P-value for either HBMD or SBMD). The submodule of interest is on the right-hand side of the figure. Notice that this group of genes is highly interconnected and negatively correlated with BMD status.
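The random gene-set comparison described above amounts to a resampling test. A minimal Python sketch of that idea follows. The function name, the background correlations, the module size of 200 genes, and the reduced number of random sets are illustrative assumptions, not the author's code or data; only the eight submodule correlations are taken from Table 4.

```python
import random


def empirical_p_for_gene_set(correlations, gene_set, n_sets=10_000, seed=0):
    """Compare the mean correlation of gene_set with that of random sets of
    the same size drawn from the same module (lower-tail test)."""
    rng = random.Random(seed)
    genes = list(correlations)
    size = len(gene_set)
    observed = sum(correlations[g] for g in gene_set) / size
    hits = 0
    most_negative = float("inf")
    for _ in range(n_sets):
        sample = rng.sample(genes, size)  # sampling without replacement
        avg = sum(correlations[g] for g in sample) / size
        most_negative = min(most_negative, avg)
        if avg <= observed:
            hits += 1
    # Add-one convention so the estimate is never exactly zero.
    p_value = (hits + 1) / (n_sets + 1)
    return observed, most_negative, p_value


# Hypothetical module: 192 background genes with weak correlations plus the
# eight submodule correlations listed in Table 4.
toy_rng = random.Random(1)
module_r = {f"bg{i}": toy_rng.uniform(-0.2, 0.1) for i in range(192)}
submodule_r = dict(zip(
    ["IFI35", "TAP1", "EPSTI1", "CMPK2", "PARP12", "ZCCHC2", "LYSMD2", "LOC26010"],
    [-0.58, -0.48, -0.48, -0.47, -0.42, -0.37, -0.35, -0.24],
))
module_r.update(submodule_r)

observed, most_negative, p_value = empirical_p_for_gene_set(
    module_r, list(submodule_r), n_sets=2000
)
print(f"observed mean r = {observed:.2f}, most negative random mean = {most_negative:.2f}")
```

Under the add-one convention used in this sketch, 10⁶ random sets with none as extreme as the observed set would correspond to an empirical P of roughly 10⁻⁶.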
Using gene information and literature searches, we found no obvious functional connection between the genes that comprised this subnetwork. However, using expression data from a panel of mouse tissues [http://www.biogps.org (Lattin et al. 2008; Su et al. 2002, 2004)], we did observe that six of the genes are expressed in osteoclasts (EPSTI1, IFI35, PARP12, CMPK2, ZCCHC2, and TAP1) and the other two are expressed in osteoblasts (LOC26010 and LYSMD2). The group of osteoclast genes is also the most negatively correlated with BMD (Table 4). Next, we determined whether any of the eight genes were located in close proximity to suggestive or significant GWAS loci (P < 1.0 × 10⁻⁵) identified in a recent meta-analysis of BMD (Estrada et al. 2012). Interestingly, the transcription start sites of four of the eight (EPSTI1, IFI35, ZCCHC2, and LYSMD2) are less than 750 Kbp away from a GWAS association (Table 4). Therefore, these genes represent a highly interconnected submodule whose expression is negatively correlated with BMD. Together, these data suggest that they play a role in the regulation of BMD. Again, as demonstrated above, the functional interconnections between genes in this submodule, and their correlation with BMD, were revealed only by network analysis.
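The proximity check described above (transcription start site within 750 Kbp of a meta-analysis locus) can be sketched as a nearest-locus search. The example below is a hedged illustration: the function names, gene labels, and all coordinates are invented for demonstration and are not real genomic positions or the author's procedure.

```python
def nearest_locus_distance_kb(tss, loci):
    """Distance in Kbp from a TSS to the nearest locus on the same chromosome.

    tss is (chromosome, position_bp); loci is a list of such tuples.
    Returns None when no locus lies on that chromosome.
    """
    chrom, pos = tss
    distances = [abs(pos - locus_pos) for locus_chrom, locus_pos in loci
                 if locus_chrom == chrom]
    return min(distances) / 1000 if distances else None


def genes_near_loci(gene_tss, loci, max_kb=750):
    """Return {gene: distance_kb} for genes whose TSS is closer than max_kb."""
    near = {}
    for gene, tss in gene_tss.items():
        distance = nearest_locus_distance_kb(tss, loci)
        if distance is not None and distance < max_kb:
            near[gene] = distance
    return near


# Hypothetical coordinates, for illustration only.
gene_tss = {
    "geneA": ("chr1", 1_000_000),
    "geneB": ("chr1", 5_000_000),
    "geneC": ("chr2", 300_000),
    "geneD": ("chr3", 10_000),
}
gwas_loci = [("chr1", 1_742_000), ("chr2", 2_000_000)]

near = genes_near_loci(gene_tss, gwas_loci)
print(near)
```

Restricting the search to loci on the same chromosome is the one design choice that matters here; distances across chromosomes are undefined, so such genes are simply reported as having no nearby locus.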
Identifying functional connections between known and
novel genes
One of the advantages of our approach is the ability to identify connections between novel genes with evidence of association and those with a previously established role in disease. This information can be used in two ways. First, it can identify new pathways in which a known gene may participate; second, it can identify novel genes through "guilt by association." To investigate the network connections for a known gene, we focused on tumor necrosis factor (TNF), the most highly connected gene in the NSGG network with a known role in BMD. TNF was the 13th most connected gene in the entire network, with a total network connectivity (k.total) of 29.0 (max k.total = 35.2). It was the 6th most connected gene in the blue module, with k.in = 27.6 (max blue module k.in = 30.8). TNF is known to play a prominent role in osteoclastogenesis (Lam et al. 2000), and several studies have found associations between TNF polymorphisms and BMD (Fontova et al. 2002; Kim et al. 2009). In the deCODE GWAS, it was associated with HBMD and SBMD with unadjusted P-values of 1.2 × 10⁻³ and 1.6 × 10⁻², respectively. The fact that TNF is one of the hubs of a monocyte network provides additional support for the biological relevance of the GWAS network.
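For readers unfamiliar with the connectivity measures quoted above, the sketch below computes k.total and k.in for a toy weighted network, assuming the usual weighted-network definition in which a gene's connectivity is the sum of its connection strengths (to all other genes for k.total, to genes in its own module for k.in). The gene labels, edge weights, and module assignments are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn (gene_a, gene_b, weight) triples into a symmetric adjacency map."""
    adjacency = defaultdict(dict)
    for a, b, weight in edges:
        adjacency[a][b] = weight
        adjacency[b][a] = weight
    return adjacency


def connectivity(adjacency, modules):
    """Return {gene: (k_total, k_in)} from a weighted adjacency map."""
    result = {}
    for gene, neighbors in adjacency.items():
        k_total = sum(neighbors.values())
        k_in = sum(weight for other, weight in neighbors.items()
                   if modules[other] == modules[gene])
        result[gene] = (k_total, k_in)
    return result


# Hypothetical four-gene network: A, B, and C share a module; D does not.
edges = [("A", "B", 0.8), ("A", "C", 0.6), ("A", "D", 0.1),
         ("B", "C", 0.5), ("B", "D", 0.2)]
modules = {"A": "blue", "B": "blue", "C": "blue", "D": "grey"}

k = connectivity(build_adjacency(edges), modules)
# Rank genes by intramodular connectivity, highest first.
ranked_by_k_in = sorted(k, key=lambda gene: k[gene][1], reverse=True)
print(ranked_by_k_in)
```

Ranking on k.in rather than k.total is what identifies hubs within a module, which is how TNF is described as the 6th most connected gene in the blue module.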

Table 4 Genes comprising the turquoise submodule

Gene      Description                                                 Unadjusted GWAS P-value  rᵃ     r P-value    Meta-analysis distance (Kbp)ᵇ  Meta-analysis P-valueᶜ
IFI35     Interferon-induced protein 35                               1.0 × 10⁻²               -0.58  2.7 × 10⁻³   742                            5.1 × 10⁻⁷
TAP1      Transporter 1, ATP-binding cassette, subfamily B (MDR/TAP)  9.9 × 10⁻⁴               -0.48  1.7 × 10⁻²
EPSTI1    Epithelial stromal interaction 1 (breast)                   8.0 × 10⁻⁴               -0.48  1.8 × 10⁻²   510                            9.8 × 10⁻⁸
CMPK2     Cytidine monophosphate (UMP-CMP) kinase 2, mitochondrial    9.6 × 10⁻³               -0.47  2.2 × 10⁻²
PARP12    Poly (ADP-ribose) polymerase family, member 12              1.9 × 10⁻⁴               -0.42  4.0 × 10⁻²
ZCCHC2    Zinc finger, CCHC domain containing 2                       3.3 × 10⁻³               -0.37  7.5 × 10⁻²   172                            4.9 × 10⁻⁹
LYSMD2    LysM, putative peptidoglycan-binding, domain containing 2   6.0 × 10⁻³               -0.35  9.0 × 10⁻²   564                            1.4 × 10⁻⁶
LOC26010  Spermatogenesis associated, serine-rich 2-like              1.3 × 10⁻³               -0.24  2.6 × 10⁻¹

ᵃ r, Pearson correlation between expression of gene in monocytes and BMD status (low vs. high).
ᵇ The distance between the TSS for each respective gene and the location of a genome-wide suggestive or significant BMD association identified by Estrada et al. (2012).
ᶜ The P-value for the associations identified by Estrada et al. (2012).

We created a TNF submodule by identifying all edges within the blue module involving TNF with a TOM ≥ 0.15. The submodule contained 99 genes (Figure 4). Using DAVID, we identified three significant clusters that were enriched in the submodule, with terms related to "nuclear proteins" (ES = 4.0), "gene expression" (ES = 3.6), and "regulation of transcription" (ES = 3.0) (File S5). Of the 99 genes, 47 belonged to the GO cellular component category "nucleus" (FDR = 3.82 × 10⁻⁶, 1.9-fold enrichment), and 32 were in the GO molecular function category "transcription factor activity" (FDR = 1.8 × 10⁻⁴, 3.5-fold enrichment). In support of its disease relevance, the submodule included several genes with known roles in bone metabolism, such as nuclear receptor subfamily 3, group C, member 1 (glucocorticoid receptor; NR3C1); protein tyrosine phosphatase, receptor type, E (PTPRE); CD44 molecule (Indian blood group; CD44); NLR family, pyrin domain containing 3 (NLRP3); FBJ murine osteosarcoma viral oncogene homolog B (FOSB); and dual-specificity phosphatase 6 (DUSP6). Thus, our network analysis rediscovered TNF as a key intracellular signaling "hub" gene important in bone metabolism. More importantly, this network can be mined in future studies to identify novel genes that interact with TNF in some way (e.g., as downstream targets of TNF signaling) to affect bone mass.
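Extracting a hub-centered submodule, as done for TNF, reduces to filtering a table of pairwise topological overlap (TOM) values. The following Python sketch illustrates that filter under stated assumptions: the TOM values and the partner gene labels (g1 to g3) are hypothetical, and the 0.15 cutoff is the one quoted in the text.

```python
def edges_above(tom, threshold=0.15):
    """Keep gene pairs whose topological overlap meets the threshold."""
    return [(a, b, weight) for (a, b), weight in tom.items() if weight >= threshold]


def hub_submodule(tom, hub, threshold=0.15):
    """Return the nodes and edges directly connected to hub at the threshold."""
    edges = [(a, b, w) for a, b, w in edges_above(tom, threshold) if hub in (a, b)]
    nodes = {hub}
    for a, b, _ in edges:
        nodes.update((a, b))
    return nodes, edges


# Hypothetical pairwise TOM values; each unordered pair appears once.
tom = {
    ("TNF", "g1"): 0.30,
    ("TNF", "g2"): 0.15,
    ("TNF", "g3"): 0.10,
    ("g1", "g2"): 0.40,
    ("g2", "g3"): 0.05,
}

network_edges = edges_above(tom)              # module-wide network view
tnf_nodes, tnf_edges = hub_submodule(tom, "TNF")
print(sorted(tnf_nodes), len(tnf_edges))
```

The same `edges_above` filter applied to a whole module gives the kind of network view used for the turquoise module, where node and edge counts follow directly from the retained pairs.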
Relating network concepts to measures of
biological relevance
Exploring GWAS genes in the context of an expression network also allows one to relate network concepts, such as MM, to a measure of biological relevance. If a network property inherent to a specific module is associated with disease, this suggests that the module serves an important biological role. It may also be possible to use the property as a gene screening tool to select genes for downstream studies.
We focused on the association between the network concept MM and GS, a measure of biological relevance. GS was defined as the absolute value of the correlation between a gene's expression and BMD status. Of the 13 network modules, significant (P < 0.003 after adjusting for the number of modules) positive correlations were observed between MM and GS in the magenta (r = 0.44, P = 9.9 × 10⁻⁵), greenyellow (r = 0.66, P = 1.6 × 10⁻¹⁰), and brown (r = 0.36, P = 1.9 × 10⁻⁷) modules (Figure 5).
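The MM versus GS comparison above can be illustrated with a few lines of Python. This is a sketch under assumptions: the module eigengene is taken as a precomputed input (in practice it would come from the network analysis), the expression values and gene labels are toy data, and the Bonferroni-style cutoff of 0.05/13 is shown only for orientation because the exact adjustment behind the quoted P < 0.003 is not stated in the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mm_gs_correlation(expression, eigengene, trait):
    """MM = correlation with the module eigengene; GS = |correlation with trait|.
    Returns both per-gene maps and the correlation between MM and GS."""
    genes = sorted(expression)
    mm = {g: pearson(expression[g], eigengene) for g in genes}
    gs = {g: abs(pearson(expression[g], trait)) for g in genes}
    r = pearson([mm[g] for g in genes], [gs[g] for g in genes])
    return mm, gs, r


# Toy data: six samples, trait coded 0 (low BMD) or 1 (high BMD).
trait = [0, 0, 0, 1, 1, 1]
eigengene = [-3, -2, -1, 1, 2, 3]          # assumed precomputed
expression = {
    "hub": [-3, -2, -1, 1, 2, 3],          # tracks the eigengene exactly
    "mid": [-1, -3, 2, -2, 3, 1],
    "peripheral": [1, -1, 0, 0, -1, 1],
}

mm, gs, r_mm_gs = mm_gs_correlation(expression, eigengene, trait)
bonferroni_alpha = 0.05 / 13               # about 0.0038 for 13 modules
print(f"r(MM, GS) = {r_mm_gs:.2f}")
```

A positive correlation between MM and GS, as in this toy module, is the pattern reported for the magenta, greenyellow, and brown modules: genes closest to the module's dominant expression pattern are also the ones most strongly related to the trait.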
On the basis of the correlations between MM and GS, we hypothesized that hub genes from these three modules were the most biologically relevant and thus the most likely to represent true-positive associations with BMD. If true, this suggests that selecting genes based on MM may result in greater replication success rates in subsequent studies compared with selecting genes using the traditional metric, GWAS P-value. To test this, we performed an in silico replication study using data from a second BMD GWAS [FOS (Kiel et al. 2007)]. Of the 1918 total network genes, 1264 were annotated in FOS using ProxyGeneLD. Genes were considered successfully replicated if their gene-wide association P-values were below the significance thresholds defined below for any one of three BMD traits (femoral neck, lumbar spine, and trochanter BMD). From the 1264 network genes annotated in both studies, we compared the FOS replication rates for three groups of genes: (1) hub genes (based on k.in) from the magenta, greenyellow, and brown modules; (2) network genes ranked on GS; and (3) network genes ranked on P-value in the deCODE GWAS. The replication rates were compared for the top 20%, 10%, and 5% of genes within each group at three different significance levels: P ≤ 0.05, P ≤ 0.01, and P ≤ 0.001. As shown in Table 5, selecting genes on k.in resulted in a higher replication rate than selecting on GWAS P-value in every comparison, and a rate at least as high as selecting on GS. The difference in replication rate between k.in and GWAS P-value increased as the definition of a hub gene became more stringent. For example, when comparing the top 5% of hubs vs. the top 5% of genes based on P-value, the replication rate at P ≤ 0.05 was approximately twofold higher for hubs (60.0% vs. 30.1%). Although validation studies will be needed, these data suggest that k.in may be a better metric than GWAS P-value for selecting genes for subsequent replication studies.
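The in silico replication comparison can be sketched as a ranking followed by a threshold count. In the Python example below, all gene labels, scores, and P-values are invented for illustration, and `replication_p` is assumed to hold each gene's best (smallest) gene-wide P-value across the replication traits; none of this is the author's code or the FOS data.

```python
def replication_rate(scores, replication_p, top_fraction, p_threshold,
                     higher_is_better=True):
    """Fraction of the top-ranked genes whose replication P-value is at or
    below p_threshold. Genes are ranked on scores (e.g., k.in, GS, or a
    discovery P-value, for which higher_is_better should be False)."""
    ranked = sorted(scores, key=scores.get, reverse=higher_is_better)
    n_top = max(1, round(len(ranked) * top_fraction))
    top_genes = ranked[:n_top]
    replicated = sum(1 for gene in top_genes if replication_p[gene] <= p_threshold)
    return replicated / n_top


# Hypothetical data for 20 genes.
genes = [f"g{i}" for i in range(20)]
k_in = {g: 20 - i for i, g in enumerate(genes)}                  # g0 is the top hub
discovery_p = {g: (20 - i) / 100 for i, g in enumerate(genes)}   # g19 has the smallest P
replication_p = {g: 0.5 for g in genes}
replication_p.update({"g0": 0.001, "g1": 0.04, "g2": 0.2, "g3": 0.03})

hub_rate = replication_rate(k_in, replication_p, 0.20, 0.05)
pvalue_rate = replication_rate(discovery_p, replication_p, 0.20, 0.05,
                               higher_is_better=False)
print(f"hubs: {hub_rate:.0%}, discovery P-value: {pvalue_rate:.0%}")
```

Repeating the call for each combination of top fraction (20%, 10%, 5%) and threshold (0.05, 0.01, 0.001) would fill a grid with the same layout as Table 5.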
Figure 4 Characterizing the coexpression relationships for a highly connected known BMD gene. This TNF-centered network provides a view of all edges and their corresponding nodes connected to TNF with a TOM ≥ 0.15. Genes are color coded based on their correlation with BMD: white (-0.20 < r < 0.20), blue (r ≥ 0.20), and yellow (r ≤ -0.20). Node sizes are proportional to each gene's -log10 GWAS P (most significant unadjusted GWAS P-value for either HBMD or SBMD).
DISCUSSION
In this study, we have applied network theory to a list of genes with evidence of association with BMD using disease-relevant microarray gene expression data in subjects with known BMD status. We demonstrate that network analysis can group genes into modules that are enriched for specific biological processes. In some cases, the enrichments were unique to modules and were more detailed and specific than those identified in the entire gene set. We also show that module topology can be used to identify groups of interconnected genes strongly associated with a clinical trait. Not only can this approach be used to reveal hidden enrichments, but it can also identify potentially important coexpression relationships for genes that exceed genome-wide significance thresholds or that have been previously associated with the disease. We also demonstrate that for three of the modules there was a significant correlation between MM and GS. We go on to provide evidence suggesting that hub genes replicate at a higher rate relative to genes selected using GWAS P-value or GS. This study provides a framework for combining network analysis and gene expression data to extract additional biological information from GWAS data.
One of the limitations of GWAS is that it does not provide functional information for associated genes. Our systems-level approach does so by grouping genes using expression data from a cell type or tissue that is relevant to the disease in subjects with clinical data. Our discovery of the turquoise submodule of eight genes negatively correlated with BMD is a good example. Importantly, the interconnections between genes in this group could only have been identified by studying their relationships in a disease context. This information, combined with the knowledge that six of the eight genes are expressed in mouse osteoclasts, can be used to guide in vitro and in vivo experiments to validate their role in bone.
The major bottleneck in any analysis using GWAS data is generating gene lists. Because of the nature of GWAS data, many SNPs with nominally significant P-values will be false positives. This, coupled with the difficulties in converting SNP-based to gene-based P-values, leads to gene lists that contain a considerable level of noise. What is clear from this study and others (Hong et al. 2009) is that potential biases have to be taken into consideration. In addition, our data suggest that functional grouping using coexpression similarities is an effective approach to separate noise from real biological signal. This is supported by our observation that the inherent network concept MM is correlated with GS in three of the 13 modules.
The main purpose of any analysis designed to mine GWAS data is the generation of testable hypotheses. We believe a systems-level approach offers many advantages over other strategies for this purpose. For example, we demonstrate that parsing GWAS gene lists into functional groups identified a key role for oxidative phosphorylation, which can now be experimentally validated. Additionally, we identified novel genes based on their connection to known bone genes, membership in an enriched pathway, or connectivity in one of the modules in which MM was correlated with GS. Such genes can be tested in functional genomics and replication studies to validate their associations and to investigate their biological role.
Figure 5 Correlation between MM and GS for each of the 13 distinct GWAS modules. MM (defined as the correlation between each gene's expression and its module eigengene) for each module is plotted against GS (defined as each gene's correlation with BMD status). MM in the blue, magenta, greenyellow and brown modules is significantly (P < 0.003) correlated with GS.
Oxidative stress is known to be increased in age-related diseases such as osteoporosis. It is also known that oxphos plays a direct and key role in bone metabolism (Bratic and Trifunovic 2010; Kousteni 2011). In bone modeling and remodeling, osteoclasts resorb mineral by acidifying the bone matrix (Blair 1998). This process requires significant energetic resources, which are primarily generated through the oxidative phosphorylation of glucose (Williams et al. 1997). Recently, it has been demonstrated that increased oxidative phosphorylation occurs in osteoclast precursors as they differentiate into mature osteoclasts (Kim et al. 2007). Importantly, our data suggest that genetic variation in multiple oxphos genes influences bone mass. Moreover, the expression of these genes in monocytes is inversely correlated with bone mass, suggesting that increased oxphos in monocytes/osteoclasts results in decreased bone mass.
Our analysis focused on osteoporosis; however, it is likely applicable to any disease with GWAS data and the appropriate gene expression profiles. GWASs have been performed for a myriad of diseases. As an example, our search of the Gene Expression Omnibus database at NCBI using the term "cancer" returned 344 datasets, suggesting that for many diseases relevant gene expression data that can be used for network analysis are already available.
In conclusion, this study provides proof of principle that a systems-level analysis of GWAS data is capable of adding significant value to existing datasets and future studies. This analysis provides a straightforward approach to identify pathways, individual genes, gene modules, and network concepts that play an important role in disease.

Table 5 Replication rates of network genes selected using intramodular connectivity (k.in), gene significance (GS), or P-value

           Top 20%ᵃ                         Top 10%                          Top 5%
Metric     P ≤ 0.05   P ≤ 0.01   P ≤ 0.001  P ≤ 0.05   P ≤ 0.01   P ≤ 0.001  P ≤ 0.05   P ≤ 0.01   P ≤ 0.001
k.in       35.0%      15.0%      2.5%       57.9%      26.3%      5.3%       60.0%      10.0%      10.0%
GS         35.0%      2.5%       0.0%       21.4%      5.3%       0.0%       20.0%      0.0%       0.0%
P-value    32.0%      14.6%      0.0%       34.1%      17.5%      0.0%       30.1%      9.5%       0.0%

ᵃ Genes selected for replication were in the top 20%, 10%, and 5% based on k.in or GS in the magenta, greenyellow, and brown modules or P-value using all network genes.

ACKNOWLEDGMENTS
We thank Jake Lusis and Steve Horvath at UCLA for insightful comments. This work was supported in part by National Institutes of Health/National Institute of Arthritis and Musculoskeletal and Skin Diseases R01 AR057759.
LITERATURE CITED
Alter, O., P. O. Brown, and D. Botstein, 2000 Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. USA 97: 10101–10106.
Altshuler, D., M. J. Daly, and E. S. Lander, 2008 Genetic mapping in human disease. Science 322: 881–888.
Askland, K., C. Read, and J. Moore, 2009 Pathways-based analyses of whole-genome association study data in bipolar disorder reveal genes mediating ion channel activity and synaptic neurotransmission. Hum. Genet. 125: 63–79.
Baranzini, S. E., N. W. Galwey, J. Wang, P. Khankhanian, R. Lindberg et al., 2009 Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Hum. Mol. Genet. 18: 2078–2090.
Blair, H. C., 1998 How the osteoclast degrades bone. Bioessays 20: 837–846.
Bratic, I., and A. Trifunovic, 2010 Mitochondrial energy metabolism and ageing. Biochim. Biophys. Acta 1797: 961–967.
Chen, Y., J. Zhu, P. Y. Lum, X. Yang, S. Pinto et al., 2008 Variations in DNA elucidate molecular networks that cause disease. Nature 452: 429–435.
Dennis, G. Jr., B. T. Sherman, D. A. Hosack, J. Yang, W. Gao et al., 2003 DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 4: 3.
Dong, J., and S. Horvath, 2007 Understanding network concepts in modules. BMC Syst. Biol. 1: 24.
Elbers, C. C., K. R. van Eijk, L. Franke, F. Mulder, Y. T. van der Schouw et al., 2009a Using genome-wide pathway analysis to unravel the etiology of complex diseases. Genet. Epidemiol. 33: 419–431.
Elbers, C. C., K. R. van Eijk, L. Franke, F. Mulder, Y. T. van der Schouw et al., 2009b Using genome-wide pathway analysis to unravel the etiology of complex diseases. Genet. Epidemiol. 33: 419–431.
Estrada, K., U. Styrkarsdottir, E. Evangelou, Y. H. Hsu, E. L. Duncan et al., 2012 Genome-wide meta-analysis identifies 56 bone mineral density loci and reveals 14 loci associated with risk of fracture. Nat. Genet. 44: 491–501.
Farber, C. R., 2010 Identification of a gene module associated with BMD through the integration of network analysis and genome-wide association data. J. Bone Miner. Res. 25: 2359–2367.
Fontova, R., C. Gutierrez, J. Vendrell, M. Broch, I. Vendrell et al., 2002 Bone mineral mass is associated with interleukin 1 receptor autoantigen and TNF-alpha gene polymorphisms in post-menopausal Mediterranean women. J. Endocrinol. Invest. 25: 684–690.
Fujikawa, Y., J. M. Quinn, A. Sabokbar, J. O. McGee, and N. A. Athanasou, 1996 The human osteoclast precursor circulates in the monocyte fraction. Endocrinology 137: 4058–4060.
Gargalovic, P. S., M. Imura, B. Zhang, N. M. Gharavi, M. J. Clark et al., 2006 Identification of inflammatory gene modules based on variations of human endothelial cell responses to oxidized lipids. Proc. Natl. Acad. Sci. USA 103: 12741–12746.
Gautier, L., L. Cope, B. M. Bolstad, and R. A. Irizarry, 2004 affy: analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20: 307–315.
Ghazalpour, A., S. Doss, B. Zhang, S. Wang, C. Plaisier et al., 2006 Integrating genetic and network analysis to characterize genes related to mouse weight. PLoS Genet. 2: e130.
Gong, K. W., W. Zhao, N. Li, B. Barajas, M. Kleinman et al., 2007 Air-pollutant chemicals and oxidized lipids exhibit genome-wide synergistic effects on endothelial cells. Genome Biol. 8: R149.
Hong, M. G., Y. Pawitan, P. K. Magnusson, and J. A. Prince, 2009 Strategies and issues in the detection of pathway enrichment in genome-wide association studies. Hum. Genet. 126: 289–301.
Horvath, S., B. Zhang, M. Carlson, K. V. Lu, S. Zhu et al., 2006 Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a molecular target. Proc. Natl. Acad. Sci. USA 103: 17402–17407.
Huang da, W., B. T. Sherman, and R. A. Lempicki, 2009 Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4: 44–57.
Ihaka, R., and R. Gentleman, 1996 R: a language for data analysis and graphics. J. Comput. Graph. Statist. 5: 299–314.
Irizarry, R. A., B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis et al., 2003 Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4: 249–264.
Kiel, D. P., S. Demissie, J. Dupuis, K. L. Lunetta, J. M. Murabito et al., 2007 Genome-wide association with bone mass and geometry in the Framingham Heart Study. BMC Med. Genet. 8(Suppl 1): S14.
Kim, J. M., D. Jeong, H. K. Kang, S. Y. Jung, S. S. Kang et al., 2007 Osteoclast precursors display dynamic metabolic shifts toward accelerated glucose metabolism at an early stage of RANKL-stimulated osteoclast differentiation. Cell. Physiol. Biochem. 20: 935–946.
Kim, H., S. Chun, S. Y. Ku, C. S. Suh, Y. M. Choi et al., 2009 Association between polymorphisms in tumor necrosis factor (TNF) and TNF receptor genes and circulating TNF, soluble TNF receptor levels, and bone mineral density in postmenopausal Korean women. Menopause 16: 1014–1020.
Kousteni, S., 2011 FoxOs: Unifying links between oxidative stress and skeletal homeostasis. Curr. Osteoporos. Rep. 9: 60–66.
Lam, J., S. Takeshita, J. E. Barker, O. Kanagawa, F. P. Ross et al., 2000 TNF-alpha induces osteoclastogenesis by direct stimulation of macrophages exposed to permissive levels of RANK ligand. J. Clin. Invest. 106: 1481–1488.
Langfelder, P., and S. Horvath, 2008 WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9: 559.
Lattin, J. E., K. Schroder, A. I. Su, J. R. Walker, J. Zhang et al., 2008 Expression analysis of G protein-coupled receptors in mouse macrophages. Immunome Res. 4: 5.
Lei, S. F., S. Wu, L. M. Li, F. Y. Deng, S. M. Xiao et al., 2009 An in vivo genome wide gene expression study of circulating monocytes suggested GBP1, STAT1 and CXCL10 as novel risk genes for the differentiation of peak bone mass. Bone 44: 1010–1014.
O'Dushlaine, C., E. Kenny, E. A. Heron, R. Segurado, M. Gill et al., 2009 The SNP ratio test: pathway analysis of genome-wide association datasets. Bioinformatics 25: 2762–2763.
Oldham, M. C., G. Konopka, K. Iwamoto, P. Langfelder, T. Kato et al., 2008 Functional organization of the transcriptome in human brain. Nat. Neurosci. 11: 1271–1282.
Peng, G., L. Luo, H. Siu, Y. Zhu, P. Hu et al., 2010 Gene and pathway-based second-wave analysis of genome-wide association studies. Eur. J. Hum. Genet. 18: 111–117.
Ritchie, M. D., 2009 Using prior knowledge and genome-wide association to identify pathways involved in multiple sclerosis. Genome Med. 1: 65.
Shannon, P., A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang et al., 2003 Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13: 2498–2504.
Styrkarsdottir, U., B. V. Halldorsson, S. Gretarsdottir, D. F. Gudbjartsson, G. B. Walters et al., 2008 Multiple genetic loci for bone mineral density and fractures. N. Engl. J. Med. 358: 2355–2365.
Su, A. I., M. P. Cooke, K. A. Ching, Y. Hakak, J. R. Walker et al., 2002 Large-scale analysis of the human and mouse transcriptomes. Proc. Natl. Acad. Sci. USA 99: 4465–4470.
Su, A. I., T. Wiltshire, S. Batalov, H. Lapp, K. A. Ching et al., 2004 A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl. Acad. Sci. USA 101: 6062–6067.
Torkamani, A., and N. J. Schork, 2009 Pathway and network analysis with high-density allelic association data. Methods Mol. Biol. 563: 289–301.
Torkamani, A., E. J. Topol, and N. J. Schork, 2008 Pathway analysis of seven common diseases assessed by genome-wide association. Genomics 92: 265–272.
van Nas, A., D. Guhathakurta, S. S. Wang, N. Yehya, S. Horvath et al., 2009 Elucidating the role of gonadal hormones in sexually dimorphic gene coexpression networks. Endocrinology 150: 1235–1249.
Wang, K., M. Li, and M. Bucan, 2007 Pathway-based approaches for analysis of genomewide association studies. Am. J. Hum. Genet. 81: 1278–1283.
Williams, J. P., H. C. Blair, J. M. McDonald, M. A. McKenna, S. E. Jordan et al., 1997 Regulation of osteoclastic bone resorption by glucose. Biochem. Biophys. Res. Commun. 235: 646–651.
Winden, K. D., M. C. Oldham, K. Mirnics, P. J. Ebert, C. H. Swan et al., 2009 The organization of the transcriptional network in specific neuronal classes. Mol. Syst. Biol. 5: 291.
Zhang, B., and S. Horvath, 2005 A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol. 4: Article 17.
Zhang, B., S. Kirov, and J. Snoddy, 2005 WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res. 33: W741–W748.
Communicating editor: O. Troyanskaya
Volume 3 January 2013 | GWAS Networks | 129

Supplementary resource (1)

... They function by comparing the number of interesting genes enriched in a special term or pathway with the number of random background genes. Although these methods are user-friendly, their statistical efficiency is limited since expressive information of genes is not taken into account (Farber, 2013). Therefore, the more powerful gene-set enrichment analysis (GSEA), combined with the signal strength of gene expression, was applied for pathway analysis of genes in this study. ...
... It is reported that one network-integrated approach combining GWAS with expression profile data by WGCNA possesses significant advantages in mining hidden disease-associated pathways and functional genes, compared with other existing approaches. Farber (2013) used this analysis method to repeat the previous functional pathways from osteoporosis-associated GWAS and transcriptome data, and identify newly functional genes. Chen et al. (2016) validated the potential ability of this method by exploring the hidden biological functions in a larger data set associated with osteoporosis. ...
... Gene significance (GS) was defined as minus log 10 of the p-value obtained by the DESeq function, measuring differential expression between AF and SR groups, for individual genes among all modules (Farber, 2010). In addition, the association between GS and module membership (MM) was used to verify the relationship between ME and AF (Farber, 2013), and the threshold for significance is set to p < 0.05 and R > 0.3. MM was defined as the correlation of ME in one module and gene expression values. ...
Article
Full-text available
More reliable methods are needed to uncover novel biomarkers associated with atrial fibrillation (AF). Our objective is to identify significant network modules and newly AF-associated genes by integrative genetic analysis approaches. The single nucleotide polymorphisms with nominal relevance significance from the AF-associated genome-wide association study (GWAS) data were converted into the GWAS discovery set using ProxyGeneLD, followed by merging with significant network modules constructed by weighted gene coexpression network analysis (WGCNA) from one expression profile data set, composed of left and right atrial appendages (LAA and RAA). In LAA, two distinct network modules were identified (blue: p = 0.0076; yellow: p = 0.023). Five AF-associated biomarkers were identified (ERBB2, HERC4, MYH7, MYPN, and PBXIP1), combined with the GWAS test set. In RAA, three distinct network modules were identified and only one AF-associated gene LOXL1 was determined. Using human LAA tissues by real-time quantitative polymerase chain reaction, the differentially expressive results of ERBB2, MYH7, and MYPN were observed (p < 0.05). This study first demonstrated the feasibility of fusing GWAS with expression profile data by ProxyGeneLD and WGCNA to explore AF-associated genes. In particular, two newly identified genes ERBB2 and MYPN via this approach contribute to further understanding the occurrence and development of AF, thereby offering preliminary data for subsequent studies.
... The 335 wheat accessions were genotyped by GBS and a total of 3,161,158 SNP loci were detected; these SNP loci were then ltered using the parameters "maf > 0.05 and geno < 0.4" which yielded a total of 226,206 high-quality SNP loci. The greatest number of markers was present in the B genome (125,531), the lowest number of markers was present in the D genome (9,190), Table S2, Supplementary Fig. S1). ...
... Multiple genes were directly or indirectly involved in "plant-pathogen interactions" (ko04626) according to KEGG analysis. . WGCNA was rst used as a complement to GWAS analysis to identify loci with small effects on bone mineral density (Farber, 2013). Previous studies have shown that the use of WGCNA to screen the large number of "notionally" signi cant genes from GWAS can enhance the identi cation of candidate genes. ...
Preprint
Full-text available
Wheat stripe rust, which is caused by the wheat stripe rust fungus (Puccinia striiformis f. sp. tritici, Pst) is one of the world’s most devastating diseases of wheat. Genetic resistance is the most effective strategy for controlling diseases. Although wheat stripe rust-resistance genes have been identified to date, only a few of them confer strong and broad-spectrum resistance. Here, the resistance of 335 wheat germplasm resources (mainly wheat landraces) from Southwestern China to wheat stripe rust was evaluated at the adult stage. Combined genome-wide association study (GWAS) and weighted gene co-expression network analysis (WGCNA) based on RNA sequencing from stripe rust resistant accession Y0337 and susceptible accession Y0402, five candidate resistance genes to wheat stripe rust ( TraesCS1B02G170200 , TraesCS2D02G181000 , TraesCS4B02G117200 , TraesCS6A02G189300 , and TraesCS3A02G122300 ) were identified. The transcription level analyses showed that these five genes were significantly differentially expressed between resistant and susceptible accessions post inoculation with Pst at different times. These candidate genes could be experimentally transformed to validate and manipulate fungal resistance which is beneficial for development of the wheat cultivars resistant to stripe rust.
... One of the biggest problems facing GWAS is the lack of detection ability for quantitative traits or complex traits controlled by multiple genes [14]. This is because complex biological process cannot be resolved through general association analysis [15], and the correlation between a single gene and a trait is relatively weak, thus often failing to reach a significant level after multiple tests and corrections in association analysis. Another reason is that a large number of pseudo-positive genes have been found in the ones that are categorized as "nominal" significant genes, which may affect the subsequent analysis. ...
... To solve these problems, we believe multistep methods should be used for allowing an integrated analysis to study complex attributes controlled by multiple genes [15,16], such as the integration of GWAS and the weighted gene co-expression network analysis (WGCNA) method [17,18]. ...
Article
Full-text available
Previous studies on the growth rate of antlers are inconsistent, and few genes significantly re-lated to growth traits have been obtained, which may be caused by the low-quality genome of sika deer or by the traditional genome-wide association analysis method being singly used. In this study, we conducted an integrated analysis of genome-wide association analysis and weighted gene co-expression network analysis using resequencing data identified in our previ-ous analysis, which used antler weight and transcriptome sequencing data of faster- vs. slower-growing antlers of sika deer. The results show that a total of 49 genes related to antler growth rate were identified, and most of those genes were enriched in the IGF1R (insulin-like growth fac-tor 1 receptor) and LOX (lysyl oxidase) modules. A gene regulation network of antler growth rate through the IGF1R pathway was constructed. We believe that our findings in the present study can provide further insight into revealing the molecular mechanism underlying the regulation of the tissue that can grow quickly without transforming into a tumor. Furthermore, the results of this study may be applied for increasing antler output for the deer industry.
... Several approaches have been used to circumvent these problems, including haplotype-based GWAS, gene-based GWAS, k-mer GWAS, GWAS using copy number variants, and the restricted twostage multi-locus multi-allele GWAS approach, which is similar to haplotype-based GWAS (Huang et al. 2013;Schulthess et al. 2022;Zhao et al. 2022;He et al. 2017). WGCNA was first used as a complement to GWAS analysis to identify loci with small effects on bone mineral density (Farber 2013). Previous studies have shown that the use of WGCNA to screen the large number of "notionally" significant genes from GWAS can enhance the identification of Content courtesy of Springer Nature, terms of use apply. ...
Article
Key message: In this study, genome-wide association studies combined with transcriptome data analysis were utilized to reveal potential candidate genes for stripe rust resistance in wheat, providing a basis for screening wheat varieties for stripe rust resistance. Abstract: Wheat stripe rust, which is caused by the wheat stripe rust fungus (Puccinia striiformis f. sp. tritici, Pst), is one of the world’s most devastating diseases of wheat. Genetic resistance is the most effective strategy for controlling diseases. Although wheat stripe rust resistance genes have been identified to date, only a few of them confer strong and broad-spectrum resistance. Here, the resistance of 335 wheat germplasm resources (mainly wheat landraces) from southwestern China to wheat stripe rust was evaluated at the adult stage. By combining a genome-wide association study (GWAS) with weighted gene co-expression network analysis (WGCNA) based on RNA sequencing from the stripe rust resistant accession Y0337 and the susceptible accession Y0402, five candidate resistance genes to wheat stripe rust (TraesCS1B02G170200, TraesCS2D02G181000, TraesCS4B02G117200, TraesCS6A02G189300, and TraesCS3A02G122300) were identified. The transcription level analyses showed that these five genes were significantly differentially expressed between resistant and susceptible accessions at different times post inoculation with Pst. These candidate genes could be experimentally transformed to validate and manipulate fungal resistance, which is beneficial for the development of wheat cultivars resistant to stripe rust.
... Weiss proposed Gene Module Association Study (GMAS) as a supplement to GWAS analysis results [44]. By integrating the results of GWAS and WGCNA analysis, Farber found that this method can significantly improve the mining efficiency of the micro-effect sites of the yellow seed phenotype, and found a shortcut for GWAS to solve its own limitations [45]. ...
Article
Populus euphratica is mainly distributed in desert environments with a dry and hot climate in summer and cold in winter. Compared with other poplars, P. euphratica is more resistant to salt stress. It is critical to investigate the transcriptome and molecular basis of salt tolerance in order to uncover stress-related genes. In this study, salt-tolerant treatment of P. euphratica resulted in an increase in osmo-regulatory substances and recovery of antioxidant enzymes. To improve the mining efficiency of candidate genes, the analysis combining both the transcriptome WGCNA and the former GWAS results was selected, and a range of key regulatory factors with salt resistance were found. The PeERF1 gene was highly connected in the turquoise modules with significant differences in salt stress traits, and the expression levels were significantly different in each treatment. For further functional verification of PeERF1, we obtained stable overexpression and dominant suppression transgenic lines by transforming into Populus alba × Populus glandulosa. The growth and physiological characteristics of the PeERF1 overexpressed plants were better than those of the wild type under salt stress. Transcriptome analysis of leaves of transgenic lines and WT revealed that highly enriched GO terms in DEGs were associated with stress responses, including abiotic stimuli responses, chemical responses, and oxidative stress responses. The result is helpful for in-depth analysis of the salt tolerance mechanism of poplar. This work provides important genes for poplar breeding with salt tolerance.
... WGCNA results indicated that B9D2, TMEM145, WWC2, CDKN2AIP, TRAPPC11, and PELO were located within the most significant ME. GWAS and WGCNA have been integrated and applied in the previous studies of other species (Farber, 2013; Deng et al., 2019). In this study, WGCNA was used as the supplementation of GWAS to further validate candidate genes identified by GWAS. ...
Article
Semen traits are crucial in commercial pig production since semen from boars is widely used in artificial insemination for both purebred and crossbred pig production. Revealing the genetic architecture of semen traits could make improving semen traits through artificial selection more efficient. This study aimed to identify candidate genes related to the semen traits in Duroc boars. First, we identified the genes that were significantly associated with three semen traits, including sperm motility (MO), sperm concentration (CON), and semen volume (VOL) in a Duroc boar population through a genome-wide association study (GWAS). Second, we performed a weighted gene co-expression network analysis (WGCNA). A total of 2, 3, and 20 SNPs were found to be significantly associated with MO, CON, and VOL, respectively. Based on the haplotype block analysis, we identified one genetic region associated with MO, which explained 6.15% of the genetic trait variance. ENSSSCG00000018823 located within this region was considered as the candidate gene for regulating MO. Another genetic region explaining 1.95% of CON genetic variance was identified, and in this region B9D2, PAFAH1B3, TMEM145, and CIC were detected as the CON-related candidate genes. Two genetic regions that accounted for 2.23% and 2.48% of VOL genetic variance were identified, and in these two regions, WWC2, CDKN2AIP, ING2, TRAPPC11, STOX2, and PELO were identified as VOL-related candidate genes. WGCNA analysis showed that among these candidate genes, B9D2, TMEM145, WWC2, CDKN2AIP, TRAPPC11, and PELO were located within the most significant module eigengenes, confirming these candidate genes’ role in regulating semen traits in Duroc boars. The identification of these candidate genes can help to better understand the genetic architecture of semen traits in boars. Our findings can be applied to improve semen traits in Duroc boars.
... Functionally related genes are often transcriptionally coordinated across various developmental stages, genotypes, and environments (Ruprecht et al., 2017). It follows that functionally equivalent QTGs display a similar co-expression pattern (Farber, 2013; Schaefer et al., 2018; Shen et al., 2018). ...
Article
Quantitative trait loci (QTL) have been discovered in crops, where some of the causal quantitative trait genes (QTGs) may not be functionally characterized even in the model plant Arabidopsis. We propose an approach to delineate QTGs by coordinating expression of genes located within QTL in crops and known orthologs related to the trait from Arabidopsis. Using this method, we established an acyl-lipid metabolism co-expression network in developing siliques 15 days after pollination in 71 lines of rapeseed with 21 modules, which are composed of 270 known acyl-lipid genes and 3,503 new genes. The core module harbored 76 known genes involved in fatty acid and triacylglycerol biosynthesis and 671 new genes involved in sucrose transport, carbon metabolism, amino acid metabolism, seed storage protein process, seed maturation and phytohormone metabolism. Moreover, the core module closely associates with the modules of photosynthesis and carbon metabolism. From the co-expression network, we selected 12 hub genes to identify their putative Arabidopsis orthologs. These putative orthologs were functionally analyzed using Arabidopsis knockout and over-expression lines. Four knockout mutants exhibited lower seed oil content, while the seed oil content in 10 over-expression lines was significantly increased. Therefore, combining gene co-expression network analysis and QTL mapping provides new insights into the detection of QTGs.
... Based on those, '0 kb window' (SNPs are assigned to a gene only if they lie within the gene body) was adopted in our analysis when conducting physical-location-based annotation. Secondly, the SNPs were assigned to genes via SNP-to-RE annotations and RE-gene regulatory pairs derived from Hi-C-based chromatin interaction profiles; thirdly, the SNPs were assigned to genes based on significant eQTL results from GTEx; finally, the TOM was calculated with WGCNA for genes based on their expression profiles from GTEx, where a pair of genes with a TOM ≥ 0.15 were regarded as strongly interconnected as described previously [34]. In this way, the lists of genes inferred from the first three steps can be further extended, where a gene having a TOM ≥ 0.15 with any gene inferred from the first three steps will be included in analysis. ...
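The final step quoted above (adding any gene whose TOM with a previously inferred gene is at least 0.15) can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the function name `extend_gene_list`, the gene labels, and the toy TOM values are all hypothetical, and the TOM matrix is assumed to be precomputed (for example by WGCNA) and supplied as a symmetric nested list.

```python
def extend_gene_list(seed_genes, gene_names, tom, threshold=0.15):
    """Return the seed genes plus every gene whose TOM with any seed gene
    is at least `threshold`.

    `tom` is assumed to be a precomputed symmetric matrix (nested lists)
    whose rows and columns follow the order of `gene_names`.
    """
    index = {gene: i for i, gene in enumerate(gene_names)}
    # Seed genes missing from the TOM matrix are kept but cannot recruit others.
    seed_rows = [index[g] for g in seed_genes if g in index]
    extended = set(seed_genes)
    for j, gene in enumerate(gene_names):
        if any(tom[i][j] >= threshold for i in seed_rows):
            extended.add(gene)
    return extended


# Toy data (hypothetical values, for illustration only).
gene_names = ["G1", "G2", "G3", "G4"]
tom = [
    [1.00, 0.20, 0.05, 0.10],
    [0.20, 1.00, 0.02, 0.01],
    [0.05, 0.02, 1.00, 0.30],
    [0.10, 0.01, 0.30, 1.00],
]
extended = extend_gene_list({"G1"}, gene_names, tom)
print(sorted(extended))
```

In this toy case only G2 exceeds the threshold with the seed gene G1, so the extended list is G1 and G2. The extension is a single pass over the seeds: genes recruited this way do not recruit further genes, which matches the wording of the quoted passage.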
Article
Motivation: Annotating genetic variants from summary statistics of genome-wide association studies (GWAS) is crucial for predicting risk genes of various disorders. The multimarker analysis of genomic annotation (MAGMA) is one of the most popular tools for this purpose, where MAGMA aggregates signals of single nucleotide polymorphisms (SNPs) to their nearby genes. In biology, SNPs may also affect genes that are far away in the genome, thus missed by MAGMA. Although different upgrades of MAGMA have been proposed to extend gene-wise variant annotations with more information (e.g. Hi-C or eQTL), the regulatory relationships among genes and the tissue specificity of signals have not been taken into account. Results: We propose a new approach, namely network-enhanced MAGMA (nMAGMA), for gene-wise annotation of variants from GWAS summary statistics. Compared with MAGMA and H-MAGMA, nMAGMA significantly extends the lists of genes that can be annotated to SNPs by integrating local signals, long-range regulation signals (i.e. interactions between distal DNA elements), and tissue-specific gene networks. When applied to schizophrenia (SCZ), nMAGMA is able to detect more risk genes (217% more than MAGMA and 57% more than H-MAGMA) that are involved in SCZ compared with MAGMA and H-MAGMA, and more of nMAGMA results can be validated with known SCZ risk genes. Some disease-related functions (e.g. the ATPase pathway in Cortex) are also uncovered in nMAGMA but not in MAGMA or H-MAGMA. Moreover, nMAGMA provides tissue-specific risk signals, which are useful for understanding disorders with multitissue origins.
Article
Grouper is an economically important fish in China. However, it exhibits a high frequency of skeletal abnormalities, particularly vertebral deformities. The molecular mechanisms underlying fish vertebral deformities are still poorly understood. In this study, a HiSeq™ 4000 platform (Illumina) was used to analyze the transcriptomic profiles of the brain, pituitary, and vertebrae from normal fish (NF) and fish with lordosis (LF) of Yunlong grouper. A total of 87,888 unigenes were assembled with lengths that varied from 201 to 28,922 bp and an N50 length of 2670 bp. A total of 36,268 unigenes were functionally annotated by BLAST alignments. A total of 2875 significantly differentially expressed genes (DEGs) were identified between the NF group and the LF group, including 706 upregulated unigenes and 2169 downregulated unigenes in LF. GO and KEGG pathway enrichment analyses showed that DNA binding, transmembrane receptor activity, cytokine receptor interaction, neuroactive ligand-receptor interaction, calcium signaling pathway, ECM-receptor interaction, HIF-1 signaling pathway, and mineral absorption may be involved in the formation of vertebral deformities. Furthermore, weighted gene co-expression network analysis identified three modules (turquoise, yellow, and blue) that were significantly positively correlated with vertebral deformities. A network map that included these three modules enabled the identification of a series of hub genes, including claudin-22-like (cldn22), fibronectin type III domain-containing protein 1 isoform X2 (fndc1l2), E3 ubiquitin-protein ligase NRDP1-like (rnf41), and Catenin alpha-2 (ctnna1). We found that the levels of most genes in the blue module were closely related to the expression of parvalbumin, thymic CPV3-like isoform X2 (ocm), platelet glycoprotein Ib alpha chain (gp1ba), and matrix metalloproteinase-9 (mmp9), suggesting that this module is associated with skeletal development. Some uncharacterized genes associated with known bone-related genes, including Unigene0067643, Unigene0056862, and Unigene0059867, were detected by a weighted gene co-expression network analysis. A detailed functional investigation of these networks and genes will further improve our understanding of the molecular mechanisms that underlie the formation of lordosis in fish.
Article
Background: Phenotypes such as height and intelligence are thought to be a product of the collective effects of multiple phenotype-associated genes and interactions among their protein products. High/low degree of interactions is suggestive of coherent/random molecular mechanisms, respectively. Comparing the degree of interactions may help to better understand the coherence of phenotype-specific molecular mechanisms and the potential for therapeutic intervention. However, direct comparison of the degree of interactions is difficult due to different sizes and configurations of phenotype-associated gene networks. Methods: We introduce a metric for measuring coherence of molecular-interaction networks as a slope of internal versus external distributions of the degree of interactions. The internal degree distribution is defined by interaction counts within a phenotype-specific gene network, while the external degree distribution counts interactions with other genes in the whole protein–protein interaction (PPI) network. We present a novel method for normalizing the coherence estimates, making them directly comparable. Results: Using STRING and BioGrid PPI databases, we compared the coherence of 116 phenotype-associated gene sets from GWAScatalog against size-matched KEGG pathways (the reference for high coherence) and random networks (the lower limit of coherence). We observed a range of coherence estimates for each category of phenotypes. Metabolic traits and diseases were the most coherent, while psychiatric disorders and intelligence-related traits were the least coherent. We demonstrate that coherence and modularity measures capture distinct network properties. Conclusions: We present a general-purpose method for estimating and comparing the coherence of molecular-interaction gene networks that accounts for the network size and shape differences. Our results highlight gaps in our current knowledge of genetics and molecular mechanisms of complex phenotypes and suggest priorities for future GWASs.
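The internal and external degree counts described in the abstract above can be illustrated with a short sketch. It covers only the counting step, under the assumption that the PPI network is given as a list of unique undirected edges; the slope-based coherence metric and its normalization are not specified in enough detail here to reproduce, so they are deliberately left out. The function name `internal_external_degrees` and the toy network are hypothetical.

```python
def internal_external_degrees(edges, gene_set):
    """Count, for each gene in `gene_set`, its interactions with other
    members of the set (internal) and with genes outside it (external).

    `edges` is assumed to be an iterable of unique undirected (a, b) pairs
    from a PPI network; self-loops are ignored.
    """
    gene_set = set(gene_set)
    internal = {gene: 0 for gene in gene_set}
    external = {gene: 0 for gene in gene_set}
    for a, b in edges:
        if a == b:
            continue  # skip self-interactions
        # Visit the edge from both endpoints so each member is counted once.
        for u, v in ((a, b), (b, a)):
            if u in gene_set:
                if v in gene_set:
                    internal[u] += 1
                else:
                    external[u] += 1
    return internal, external


# Toy PPI network (hypothetical, for illustration only).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
internal, external = internal_external_degrees(edges, {"A", "B", "C"})
print(internal, external)
```

Here the set {A, B, C} is fully connected internally, and only C has a partner (D) outside the set. Any comparison across gene sets would still need the size normalization that the abstract mentions.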
Article
DAVID bioinformatics resources consists of an integrated biological knowledgebase and analytic tools aimed at systematically extracting biological meaning from large gene/protein lists. This protocol explains how to use DAVID, a high-throughput and integrated data-mining environment, to analyze gene lists derived from high-throughput genomic experiments. The procedure first requires uploading a gene list containing any number of common gene identifiers followed by analysis using one or more text and pathway-mining tools such as gene functional classification, functional annotation chart or clustering and functional annotation table. By following this protocol, investigators are able to gain an in-depth understanding of the biological themes in lists of genes that are enriched in genome-scale studies.
Article
Bone mineral density (BMD) is the most widely used predictor of fracture risk. We performed the largest meta-analysis to date on lumbar spine and femoral neck BMD, including 17 genome-wide association studies and 32,961 individuals of European and east Asian ancestry. We tested the top BMD-associated markers for replication in 50,933 independent subjects and for association with risk of low-trauma fracture in 31,016 individuals with a history of fracture (cases) and 102,444 controls. We identified 56 loci (32 new) associated with BMD at genome-wide significance (P < 5 × 10⁻⁸). Several of these factors cluster within the RANK-RANKL-OPG, mesenchymal stem cell differentiation, endochondral ossification and Wnt signaling pathways. However, we also discovered loci that were localized to genes not known to have a role in bone biology. Fourteen BMD-associated loci were also associated with fracture risk (P < 5 × 10⁻⁴, Bonferroni corrected), of which six reached P < 5 × 10⁻⁸, including at 18p11.21 (FAM210A), 7q21.3 (SLC25A13), 11q13.2 (LRP5), 4q22.1 (MEPE), 2p16.2 (SPTBN1) and 10q21.1 (DKK1). These findings shed light on the genetic architecture and pathophysiological mechanisms underlying BMD variation and fracture susceptibility.
Article
Background: Functional annotation of differentially expressed genes is a necessary and critical step in the analysis of microarray data. The distributed nature of biological knowledge frequently requires researchers to navigate through numerous web-accessible databases gathering information one gene at a time. A more judicious approach is to provide query-based access to an integrated database that disseminates biologically rich information across large datasets and displays graphic summaries of functional information. Results: Database for Annotation, Visualization, and Integrated Discovery (DAVID; http://www.david.niaid.nih.gov) addresses this need via four web-based analysis modules: 1) Annotation Tool - rapidly appends descriptive data from several public databases to lists of genes; 2) GoCharts - assigns genes to Gene Ontology functional categories based on user selected classifications and term specificity level; 3) KeggCharts - assigns genes to KEGG metabolic processes and enables users to view genes in the context of biochemical pathway maps; and 4) DomainCharts - groups genes according to PFAM conserved protein domains. Conclusions: Analysis results and graphical displays remain dynamically linked to primary data and external data repositories, thereby furnishing in-depth as well as broad-based data coverage. The functionality provided by DAVID accelerates the analysis of genome-scale datasets by facilitating the transition from data collection to biological meaning.